That roll of pre-made cookie dough in the refrigerated aisle of the supermarket harbors a dirty secret: There's a good chance it will be eaten before it ever reaches the oven.
The numbers don't lie. Of roughly 400 people who bought refrigerated cookie dough in a recent two-month period and blogged about it, more than 60 percent consumed the product unbaked. The guilty were quick to confess. "I ate a roll of raw cookie dough -- again." "Bought another roll of cookie dough. Couldn't wait to get home to eat it. Spooned it into my mouth as I drove." "I hate myself. I've had 12 rolls of raw cookie dough this month."
"It was just amazing. I would almost venture to guess they had some eating issues," says David Howlett, vice president of client services for Umbria, the Boulder-based market-research company that uncovered the gluttonous trend. "It's like, you are going to keel over and die!"
Finding such obscure and potentially lucrative consumer trends is Umbria's specialty. The company uses the blogosphere and similar Internet phenomena -- the world of MySpace, Wikipedia, YouTube, Flickr and RSS feeds that information pundits label as "user-generated content," "consumer-generated media" or "social media," though many prefer the imprecise yet catchy "blogosphere" -- as a perpetual, globe-spanning focus group. As people blog about the new iPhone, what they thought of Borat and how they take their cookie dough, they provide a wealth of unsolicited opinions that can be mined for valuable information about how a target audience thinks -- and consumes.
There's only one problem: The blogosphere is a mess, full of colloquial, unorganized, factually questionable rants, rambles and rumors, and that mess is growing by the second. Sorting through it all to find reliable proof of, say, an untapped population of dough gobblers is anything but a piece of cake.
Dry-erase boards have a short lifespan in Ted Kremer's corner office. There are just too many complicated thoughts bubbling out of his energy-drink-fueled, spiky-haired and goateed head. The only way Umbria's chief technology officer can explain them intelligently is to continuously illustrate with scribbles of multi-colored flow charts, multigraphs and Venn diagrams. No amount of erasing will suffice; the faint remnants of countless circles, arrows and equations become permanently tattooed in the white surface like a wall-spanning watermark. "Every now and then I just throw it away and buy a new one," Kremer remarks in his intense, rapid voice, between scribbles. Markers don't fare much better; the one in his hand is just about dead. "My black pen is falling apart. We are going to switch colors, but there is no meaning to the change."
For Kremer, however, there's really meaning to everything, fundamental patterns and underlying significance beneath the unruly pandemonium of the world. "I find the chaotic aspects of human nature fascinating," he says. "That doesn't mean I am not going to try to find order in the chaos." The 35-year-old has made a career of doing just that. He spent his middle- and high-school vacations writing code for his father's East Texas accounting-software company before majoring in computer engineering at the University of Houston. He built software that analyzed the habits of cell-phone users to predict when they were likely to switch carriers. He developed computer systems that allow doctors to quickly make sense of digital mammogram X-rays compiled from hospitals across the nation. In the summer of 2003, he helped found Umbria with Howlett and Howard Kaushansky. Today he's in charge of the technology that allows the company to decipher and organize the huge amounts of information constantly uploaded onto the Internet through blogs, message boards, web forums and the like -- one of the most unusual, chaotic and rapidly expanding data sources imaginable.
The scale and complexity of Umbria's task are well beyond the scope of any human, and also beyond the capacity of most computers. What typical search engines do -- scan the blogosphere and find the most relevant mentions of a particular topic -- is hard enough. What Umbria's computers have to do -- find and categorize every mention of a topic in the blogosphere -- is far trickier. "Search engines are looking for one thing only," says Kremer. "We pick up where search leaves off."
If, for example, Budweiser wants to know what bloggers everywhere are saying about its product, Umbria can't just hand over the ten most relevant blog postings that contain the terms "Bud" and "beer." The company has to locate every posting that mentions "Bud" and "beer," remove false positives such as those describing "drinking beer with my bud," and then make sense of it all: who's drinking Bud, what they think of it, and how the beer can be better. To do that, a computer can't just search blogs; it has to process and understand them. In a sense, it has to read them.
And getting computers to read is basically impossible. "It's almost a magical event how we learn language," says James Martin, an Umbria advisor and computer science professor at the Center for Spoken Language and Institute of Cognitive Science at the University of Colorado at Boulder. "When it comes to getting a computer to learn language the way a three-year-old does, with almost no instruction whatsoever, we are stumped." While a properly programmed computer can easily find every mention of the word "red" in a thousand-page document and translate them into every language on Earth, since the computer cannot see or thoroughly understand how the world works, there's no way it can comprehend what "red" actually means.
Even if a computer somehow manages to digest the professionally edited, staid text of the Wall Street Journal, that doesn't mean it will be able to digest the screwball lexicon of bloggers. "It's the wild, wild West," Martin says of the blogosphere. "It is just shocking how different it is from normal written text. It's almost like a new form of communication." To understand it, a computer must not only master traditional spelling and grammar, but also recognize innumerable misspellings; disregard lack of punctuation and capitalization; have a thorough knowledge of colloquialisms, emoticons and newly coined words such as "truthiness"; and be up to date on all of pop culture.
Since Kremer can't actually teach Umbria's computers to read and understand blogs, he settles for teaching them enough rules about the text and make-up of the blogosphere so that the computers can act like they're reading. Called natural language processing, this is the process of turning words into a medium that can be understood by hopelessly illiterate computers. But programming a computer to recognize every rule that a human subconsciously follows to read a single sentence, much less the entire blogosphere, would be a monstrous task. "You can look at an e-mail subject line and know it's spam," says Martin. "If I sat you down and asked you to give me a set of rules you use to make those judgments, that would be hard." It's simpler to rely on what's called machine learning: letting a computer program these rules on its own.
"In the traditional approach to building an intelligent computer, a human programs in rules to tell it how to behave in any circumstance using their own knowledge," says Michael Mozer, a computer science professor at CU's Institute of Cognitive Science who's also an Umbria advisor. "There is so much knowledge inside our heads, people realized it's easier to get the computer to learn using largely the same data we have to learn with. Rather than you programming the system, you let the system learn by giving it various examples of how it should behave."
At Umbria, some of those examples come from Jacob Wagner. In his cubicle, he looks at porn all the time -- and it's part of his job. He's charged with weeding through one of the most meddlesome aspects of the blogosphere: spam blogs, or splogs, advertising-related, automatically generated fake blogs that clog up the Internet much the same way spam floods an inbox. Umbria constantly receives copies of new postings on blogs, message boards and other online social media; to "clean" these records of splogs, Wagner scans the sites listed, looking for telltale signs of junk. Most splogs are easy to spot, with web addresses such as www.movie-blogs/buy-movie-tickets-online.com, or posts titled "Free Windows XP software." Many are filled with randomly generated words: "Reading three sentences is a real headache," says Wagner. And if all links lead to German amateur porn, that's a dead giveaway.
Wagner can glance through tens of thousands of blog postings a day -- but that doesn't come close to covering the new splogs that appear in that time. To keep up with it all, Wagner's findings are fed into a computer so that the machine can learn to do in nanoseconds what takes Wagner hours. The computer will look for tell-tale patterns in the wordings of these spam blogs. Possible patterns could be as simple as no first-person statements in a post, or they could be much more complicated. These patterns are the rules that Wagner uses consciously or unconsciously to figure out that a blog is a splog, and the computer will use these same rules to weed out splogs itself -- playacting as Wagner, but at much, much faster speeds. "We are changing artificial intelligence," says Wagner. "We are taking my answers and helping the computer teach itself."
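The idea of learning from an annotator's answers can be sketched in a few lines of code. What follows is only an illustration, not Umbria's system: it assumes a simple word-counting technique called naive Bayes, and the four labeled posts, along with the function names `tokenize`, `train` and `classify`, are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_posts):
    """Count words per label from (text, label) pairs supplied by a human."""
    word_counts = {}       # label -> Counter of words seen under that label
    doc_counts = Counter() # label -> number of example posts
    for text, label in labeled_posts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Return the label whose examples make the text's words most likely."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        # Start from how common the label is among the examples.
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotations standing in for a human reviewer's judgments.
labeled_posts = [
    ("buy movie tickets online free software download", "splog"),
    ("free windows xp software buy cheap tickets", "splog"),
    ("i ate a roll of raw cookie dough again", "blog"),
    ("i think my dog would love frozen treats", "blog"),
]

model = train(labeled_posts)
print(classify(model, "free software buy online"))
print(classify(model, "i love my cookie dough"))
```

Nobody writes the rules by hand here. The program infers which words point toward junk from the examples alone, so adding more annotated posts is the only way to make it smarter.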
Umbria's 29-member staff includes a team of human blog annotators like Wagner. Some determine the age and gender of blog authors; others assess the sentiment, positive or negative, of posts; still others identify new colloquialisms and other lexicographical errata. All of their work goes into the company's computers so that the machines can learn to do the same thing.
Although these computers are getting better at "reading" the blogosphere all the time, they will always be Wagner's students. "In my job, you will read things that will make you cry," he says. And since the mechanical data-crunchers are very far from learning that, he's got job security.
Kremer can see the entire blogosphere on his computer screen. It looks like a big, brightly colored sphere, like an extra-large bouncy ball.
At least that's how così, Umbria's search-and-discovery tool, depicts the blogosphere. Short for Cicero Opinion Search Interface -- not to mention Italian for "thus" -- così is the user-friendly gateway to the machine-learning-taught, syntax-parsing, splog-discarding, pop-culture-grokking computerized blog "readers" designed by Kremer and his colleagues. When Kremer directs così to read certain parts of the blogosphere, the bouncy ball on his screen divides into smaller bouncy balls, then each of these balls is broken up, and so on. What results is a bunch of very specific -- and, for Kremer, very informative -- bouncy balls. "It's all segmentation at the end of the day," he explains. "Divide and conquer."
To demonstrate, he pulls up on così all of the written comments that one of Umbria's clients, a financial company, collected during a multi-year, online customer survey. (Like most of Umbria's client list, the firm's identity is confidential.) On his screen, the hundreds of thousands of responses are illustrated by a big bouncy ball. There's way too much information here for Kremer to analyze himself, so he has the computer do it for him. He instructs così to find just the responses mentioning "ATM" -- a smaller, different color bouncy ball appears -- and then tells the computer to find, in these 931 results, just the ones also mentioning "convenient." With this, a very manageable ball of five responses appears: "I love to deposit without any writing on the slip at the ATM. It's very convenient," "I would like more convenient locations."
This is così at its most primitive, performing a lot like a basic search engine. But it can also do much more. Kremer can have it list all the verbs in the "ATM" comments so that he can see how customers are using the ATMs -- are they withdrawing funds, depositing money, or printing account statements? He can tabulate the most frequently mentioned terms to see what the majority of people are saying about the ATMs, and then track how these terms change over time as opinions shift. Often, he can even figure out whether the comments are generally flattering or critical, since così is boning up on the tricky subject of sentiment: It knows that "shitty" connotes criticism, while "is the shit" definitely does not.
The computer, however, doesn't necessarily need Kremer's instructions; così can think for itself. To prove it, Kremer lets così decide what it considers interesting topics among all the ATM comments, which it does by locating terms that appear repeatedly in a few comments and nowhere else. In theory, this should provide Kremer with areas of discussion that he might never search for. In this case, così comes up with five topics it considers noteworthy, five balls it labels "Installation," "Add," "Drive," "Locations" and "Deposit." Clicking on "Deposit," Kremer skims through the comments. One reads, "All ATMs should take deposits."
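One plausible reading of "terms that appear repeatedly in a few comments and nowhere else" is a count of how many comments each term shows up in, keeping the terms that recur but are not everywhere. The sketch below assumes that reading. The sample comments, the stopword list, the thresholds and the function name `candidate_topics` are all invented for illustration and say nothing about how così actually works.

```python
import re
from collections import Counter

# Hypothetical list of filler words to ignore.
STOPWORDS = {"a", "the", "i", "is", "are", "my", "not", "up", "more"}


def candidate_topics(comments, min_docs=2, max_fraction=0.5):
    """Return terms found in at least min_docs comments but in no more
    than max_fraction of all comments, most frequent first."""
    doc_freq = Counter()
    for comment in comments:
        # A set counts each term once per comment (document frequency).
        terms = set(re.findall(r"[a-z]+", comment.lower())) - STOPWORDS
        doc_freq.update(terms)
    limit = max_fraction * len(comments)
    picked = [term for term, n in doc_freq.items() if min_docs <= n <= limit]
    return sorted(picked, key=lambda term: (-doc_freq[term], term))


# Invented survey comments, loosely modeled on the ATM example.
sample_comments = [
    "the atm should take a deposit",
    "atm would not accept my deposit",
    "i want a drive up atm",
    "the drive up atm is broken",
    "atm fees are high",
    "need more atm locations",
]

print(candidate_topics(sample_comments))
```

In this toy data, "atm" appears in every comment and so is too common to be interesting, while words that occur only once are too rare. What survives is the small cluster of terms in between.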
"This is critical for me to know if I work at a bank," says Kremer. "This is something I can create an ad campaign around and attract new business."
And così has one other trick up its digital sleeve, something that Umbria's founders believe distinguishes the company from the handful of other tech companies jockeying to become the Nielsen ratings of blog-based market research: The program is becoming increasingly skilled at looking at a blog and, based on the language and make-up of the post, figuring out what type of person wrote it. Often it can automatically determine the gender of a blog's author, and not just those that discuss using the ladies' room or drinking beer at the monster truck rally. It's learned that men and women write blogs differently. "This is a huge generalization, but men use the blogosphere as a podium: 'This is what I think.' Women use it as a dialogue," says Janet Eden-Harris, Umbria's CEO. "The number of words that women use on a blog far exceeds that of men."
Age is another way così can slice and dice the blogosphere. If a blog is peppered with newfangled emoticons such as "o_O" (raised-eyebrow surprise), it's probably penned by a tech-savvy, peer-pressured member of Generation Y. If a blog features conventional emoticons such as ":-)" (happy), chances are it's helmed by a materialistic, debt-mired denizen of Generation X. If a blog is so embellished with embarrassingly worn-out uses of "phat," "bling" and "dawg" that there's no room for emoticons, it's got to be a boring old boomer.
Even if così flubs on a blogger's demographic profile, it's likely on target in terms of psychographics -- and in market research, that's frequently more important. For example, while Kremer is solidly entrenched in Gen X, così may incorrectly label him Gen Y because he digs Xbox games. But potential clients in the video-game biz probably won't mind: If Kremer thinks like a teenager, he may buy like one, too.
The market intelligence that così extracts from the blogosphere isn't perfect. The program is still learning to deal with ambiguous text attributes such as humor and sarcasm. And if it locates blog buzz around a product, that doesn't necessarily guarantee the product will be a hit -- as demonstrated by the sizable Internet chatter around the box-office bomb Snakes on a Plane. But così's findings, organized and polished by Umbria's research analysts, are remarkable enough to entice many major companies to pay up to hundreds of thousands of dollars to hear what it has to say.
Thanks to Umbria, CNN discovered that there were consistently more blog posts about Bush than Kerry in the weeks leading up to the 2004 election. Nike found out that despite being accused of sexual assault, Kobe Bryant was still very popular among online Gen Y males. And because a frustrated dog owner blogged that the only thing his sorry pooch would eat was frozen dog food, a pet-food company came up with a potential new product: doggie popsicles.
"That's one individual blogger, and the impact he could have is pretty incredible," says Umbria's Howlett.
Così has a knack for shattering conventional wisdom. It found that hip-hop slang migrated haphazardly from one age group to the next: "Aight," "boo-yaa" and "fo shizzle" hopped from Gen Y to boomers before they infiltrated Gen X (see page 18). It discovered that of all the twenty-somethings chattering about specific beer brands on the Internet, women were doing most of the talking (and, like their male counterparts, they were talking about Guinness much of the time). It also determined that 39 percent of all online discussions about video games were coming from women, a stat that appeared to be at odds with the idea of an excessively male-dominated game industry until, months later, Nielsen Media Research did a general survey of the gamer population to discover what fraction was female -- and found it was 39 percent.
Some of così's insights are downright eerie. While searching the blogosphere for a sports-drink client a few years ago, così came up with what it considered a promising area of discussion. It labeled this subject "alcohol" and listed mentions of partiers concocting sports-drink cocktails. After così spotted the trend, the craze soon spread across college campuses far and wide.
Sometimes it seems that così can predict the future -- and that's exactly what Umbria is banking on.
Kremer can hardly contain himself when he talks about the next big thing in blogosphere-based market research, the ultimate in bringing order to the chaos. It's not studying blogs, he says. It's studying bloggers.
"Where it really gets interesting, is instead of analyzing around a subject area, we organize around a blog author area," he says. "What do NASCAR dads in South Florida care about in this point in time? For a market-research company, that is the Holy Grail. From a technology aspect, that is even more interesting."
Così can already determine who is saying what in the blogosphere. The next step is to figure out what this information reveals about everyday folks' interests, opinions and activities -- and what these folks will be up to. "On the blogosphere, we have this enormous collection of ideas and conversation and joking and flirting, a full range of ways we interact socially," says David Weinberger, a fellow at the Harvard Berkman Center for the Internet and Society and advisor for Technorati, a blog search engine. "It was always some form of the elite whose works were saved. We haven't had this. What this will do for social scientists is hugely significant. History isn't going to look the same. Sociology isn't going to look the same, psychology, fiction."
Business and marketing aren't going to look the same, either, as computer modeling of everyday existence continues to boom. Netflix is offering $1 million to anyone who can take its millions of customer movie reviews and come up with a more accurate movie recommendation system. Entertainment industry analysts are using computers to break down movies and songs into their individual visual, musical or narrative components, hoping to divine what makes a hit.
The significance of all this will be one of the subjects discussed at the first International Conference on Weblogs and Social Media to be held in Boulder in March -- an event co-sponsored by Umbria. "We turn the world of content into math," Kaushansky, president of Umbria, told BusinessWeek. "And we turn you into math."
Some critics don't like the sound of all those equations. Matthew Hurst, author of the "Data Mining: Text Mining, Visualization and Social Media" blog, doesn't believe that computers will be able to predict all the quirks of his way of life anytime soon -- but he can see a day when companies, using computer modeling, place him in a very narrow consumer category and market very specific products to him accordingly. "I think it's a bit of a sad eventuality that we will exist in an ecology where completely unexpected events are kept away from you," says Hurst, who is also the director of science and innovation at Nielsen BuzzMetrics, one of Umbria's competitors. "There are some things I may like that are completely unpredictable. If I was just exposed to things that I would obviously like, that would lead to a pretty dull existence."
But if bloggers don't like the idea of marketers corralling them into tidy consumer pigeonholes, that isn't stopping them from divulging online -- and in lurid detail -- all the personal information required to make it happen. "Why are these people doing the advertiser's work for them?" asks Michael Silberstein, a philosophy of science professor at Elizabethtown College at Maryland. "I see the blogosphere as really a double-edged sword. On one hand, it gives people a voice, it's a boost to democracy. But on the other hand, there's this weird voyeuristic and exhibitionistic side of it that kind of freaks me out."
Kremer, too, admits that he's sometimes flabbergasted by the intimate details -- from constipation problems to cookie-dough addictions -- he's found people broadcasting across the blogosphere. But Umbria is very discreet, he insists, and when the CIA came calling, the company refused to cooperate. Kremer will also remove any blog from così's search lists if its author asks him to do so; of the millions and millions of blogs Umbria's searched, he's received fewer than a dozen such requests.
"It's frightening what people will share online," he says. "Have they no shame?" Yes, his company is based on such candidness, but that doesn't make it any less disturbing. And no amount of computer analysis or diagrams on the dry-erase board can help him make sense of it.
After all, he admits, "I personally don't have a blog."