More Than Words
The search industry is racing to come up with better tools to dig through the Web’s expanding trove of video and photos.
Type “romantic kiss” into Google and its nimble algorithm will comb billions of Web sites to retrieve 14,500,000 results. But what you get back are mentions of kisses written in words: a howstuffworks.com guide covers the basics; a link on GetRomantic.com gives tips on suaver kissing; a Wikipedia entry traces the peck to the tale “Sleeping Beauty.”
Conspicuously missing is the really good stuff, the videos of sexy smooches. Heaven knows there are hundreds of such files on YouTube, AOL and other sites that store video uploaded by Web exhibitionists. Even on Google’s new video search service there are only 44 clips to be found, and their relevance is iffy. That’s because the results are only as precise as the text, or “tags,” typed in by the creator to identify the who, what and where of the shot. A search for video of celebrities having their “red carpet moments,” for example, in Google Video turns up a very noncelebretastic, talking-head interview with Brady Bunch mom Florence Henderson.
Digital video and photos are proliferating madly online, and the tools to find them have not kept pace. Yahoo indexes 35 million videoclips and 3 billion photos on the Web. Its Flickr site, where people upload personal shots, adds a million shots daily to its 250 million total and doubles every five to six months. Blinkx, which claims to be the world’s biggest video search machine, said in June it can search through 4 million hours of video. YouTube streams 100 million videoclips daily, and its users upload 65,000 new ones every day.
A technology arms race is on to create new and better ways to mine this content intelligently. Search giants Microsoft, Google and Yahoo are working on the problem, as are big Web portals such as AOL and a host of software upstarts and university researchers. They’re pulling together years of work on visual analysis algorithms, facial recognition and speech-to-text translation. Some of the more advanced video-analysis technology was developed in the 1980s for the government to monitor foreign newscasts.
“You can’t count anyone out,” says Timothy Tuttle, AOL’s vice president of video search. “AOL, Microsoft and Yahoo already existed when Google emerged as the Google of Web search.” Blaise Agüera y Arcas, a software architect at Microsoft, says: “Real image search is on the cusp of happening. Ultimately we’re trying to get computers to do what people do, readily.”
At stake are billions of dollars in potential advertising and e-commerce revenue. Online video is sure to steal a big slice of the $74 billion TV ad pie, and the search software that makes it easiest to find a clip or a scene in a show stands to gain a big share of that revenue. A search service that can find the hot pair of shoes you saw in an online photo of Scarlett Johansson could reap a portion of every purchase it generates at an online shoe store.
Today the precision of visual search is largely limited to the ability to parse metadata, the descriptive text attached to a photo or videoclip. Metadata is usually supplied by the creator of the content and might include a title, location, date or description of objects and people in the file. Closed-captioning data would be a great source of metadata, but few video clips ever include it. Users on social-networking sites such as Flickr can edit each other’s metadata tags.
But searching solely by metadata misses a lot of what happens in a videoclip or photo. If someone doesn’t type in a rich description, the clip or photo may never turn up. “Searching the metadata alone is like trying to follow the plotline of a novel by reading the library card,” says Suranga Chandratillake, the founder and chief technology officer of video searcher Blinkx in San Francisco. “It is fairly simplistic,” admits Peter Chane, group product manager of Google Video.
Metadata is also prone to abuse, as when folks use popular search words to plug their own unrelated videos. A search for “Zidane head butt” on video.google.com turns up on the fourth results page an amateur music video called “Drink it Down,” by Justin Nels. The song doesn’t have any lyrics about French soccer star Zinedine Zidane, but “Zidane head butt” appears in a description of the video, along with other crafty hot-button words like “Ronaldinho Ronaldo” and “soccer.” Similarly, a 2.5-minute YouTube clip of Sean Young and his friend Peter Domenici doing in-line skating stunts is listed on the first page of YouTube search results for “sex” because the clip is labeled “sex.” Whoever posted this is either confused or has mastered the loopholes of metadata search.
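To see why metadata-only search is so easy to game, consider a minimal sketch in Python. It is an illustration, not any search engine's actual code: the function name `metadata_search`, the clip records and the second clip's title are all hypothetical. A clip is returned only when every query word appears somewhere in its title, description or tags, so a clip stuffed with popular words wins and a relevant clip with no description is invisible.

```python
def metadata_search(query, clips):
    """Return titles of clips whose metadata contains every query word.

    Hypothetical sketch: matching is on whole lowercase words only.
    """
    words = query.lower().split()
    hits = []
    for clip in clips:
        # Pool all the descriptive text attached to the clip.
        text = " ".join([clip["title"], clip["description"], " ".join(clip["tags"])])
        tokens = set(text.lower().split())
        if all(word in tokens for word in words):
            hits.append(clip["title"])
    return hits


# Hypothetical records: one keyword-stuffed music video, one untagged match clip.
clips = [
    {
        "title": "Drink it Down",
        "description": "music video Zidane head butt Ronaldinho Ronaldo soccer",
        "tags": ["music"],
    },
    {"title": "World Cup final highlights", "description": "", "tags": []},
]

results = metadata_search("Zidane head butt", clips)
```

Here `results` holds only the music video. The highlights clip never surfaces because nobody typed the right words into its description.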
Video search is also less than ideal because, for now, much of the best stuff is not shared online. Blinkx says 75% of its searchable database comes from Web sites like YouTube, CNN and MTV that allow spiders to crawl into its web. The other 25% is through partnerships with the likes of Fox News and HBO, which keep their video files off-limits to the search robots. Chandratillake is encouraged by recent announcements from the big networks that they’re planning to put more of their shows online. In a trial in May ABC streamed such hits as Desperate Housewives and Lost.
Many in the search business think metadata will be enough to satisfy a general search audience, but a handful of players are moving beyond tags to analyze what’s inside moving and still images. Blinkx is developing a video searcher that uses speech-recognition software that “listens” to streaming video. It also captures spoken English phonemes, the building blocks of syllables and words, and guesses at their meaning using probability analysis and context from what it already knows (like details in metadata). Its speech analysis is 60% to 95% accurate, depending on the quality of the audio source. Blinkx can also detect significant scene shifts and deliver results at the most relevant point in a video. Blinkx is working on technology that “reads” text in a video, like a street sign or a name on a sports jersey. “Nobody can do that today,” Chandratillake says.
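The idea of delivering a result at the most relevant point in a video can be sketched in a few lines of Python. This is a toy under stated assumptions, not Blinkx's method: it assumes a speech recognizer has already produced timestamped text, and the function `best_segment` and the sample transcript are hypothetical. It simply picks the segment that contains the most query words.

```python
def best_segment(query, segments):
    """Return the start time (in seconds) of the segment sharing the most
    words with the query, or None if no segment shares any.

    `segments` is a list of (start_seconds, recognized_text) pairs,
    assumed to come from a speech-to-text step that is not shown here.
    """
    words = set(query.lower().split())
    best_start, best_score = None, 0
    for start, text in segments:
        score = len(words & set(text.lower().split()))
        if score > best_score:
            best_start, best_score = start, score
    return best_start


# Hypothetical transcript of a news clip.
transcript = [
    (0, "good evening and welcome"),
    (42, "cbs doubles its ratings in the debut"),
    (90, "in other news"),
]

jump_to = best_segment("CBS doubles ratings", transcript)
```

With this sample transcript, `jump_to` is 42, the second at which the matching words were spoken. A real system would also have to cope with recognition errors, which this sketch ignores.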
Blinkx is an offshoot of U.K. search-software firm Autonomy, from which it licenses its “listening” technology. Autonomy created it for more specific purposes, like battlefield communications for the U.S. military, and to ensure that financial traders aren’t trading stock that a bank controls. Other firms that traditionally sold video and audio analysis technology to corporate and government buyers are moving into consumer search, including BBN Technologies in Cambridge, Mass., ShadowTV in New York and TVEyes in Fairfield, Conn.
Blinkx, backed with $12 million in money from its founders and angel investors, has licensed its software to Lycos and Totalvid.com and plans to generate ad revenue from sponsored links alongside search results. Chandratillake says he is not looking to be subsumed by a larger player like Google. “I hope that we do well enough to be the next Google,” he says.
Yahoo and AOL feel the same way and are adding new search tricks. For now they both largely rely on metadata to search their content and that of media partners (Yahoo has deals with MTV and VH1; AOL’s exclusives include old episodes of Welcome Back, Kotter, donated by its sibling Warner Bros.). Yahoo mines the Web for files with multimedia formats such as “avi,” “quicktime” or “flash.” AOL, which acquired two video-search firms in the last two years, looks at Web front pages the way a person would, spotting square displays, changing images and specific resolutions, any of which indicate a videoclip is present. “We find video no one else can,” claims AOL’s Tuttle.
Sophisticated photo searching may be a lot closer at hand. Munjal Shah, cofounder of image search startup Riya, offered a cool demonstration recently at the company’s office in San Mateo, Calif. Riya’s chief technology officer clicked on a digital photo of a 2-inch-heel gold sandal displayed on Amazon.com’s Web site. “The software is looking at the edges of the shoe for clues,” says Shah.
In a few seconds dozens of similar-looking shoes—some sandals, others sling-backs—begin popping up on screen. No text for “sandals” or “open-toed shoes” tipped off Riya’s software. Instead, it found shoes with like line, shape, color and patterns. A click on the red of a color palette replaces the gold shoes with reddish ones.
Riya’s technology can match rugs, apparel, handbags and watches. A similar technology does people-matching. “Online dating might be our killer opportunity,” says Shah, grinning widely. The idea: If you like Brad Pitt, take a look at these seven single guys on MySpace who look like him. (A site called MyHeritage.com does celebrity lookalike search, too.) “We’ll first tackle shopping and people, then we’ll bring image search to the world,” says Shah.
Shah cofounded Riya in August 2004 with two fellow Stanford graduate students. They’ve raised $19.5 million in venture funding from Bay Partners, First Round Capital, Leapfrog Ventures and Blue Run Ventures. To index a billion images in the next 12 months Riya will spend “less than $10 million,” one-eighth of what it would have cost in 2000. “Power is now our biggest expense, next to labor,” says Shah.
For now Riya’s only product is facial-recognition software for personal photo albums. You upload your photos to Riya and tag a few of the faces that appear frequently to “train” the software. It then automatically finds those people in all of your photos with what Shah says is a 70% success rate. It doesn’t do well when the lighting is dim or a face is at an angle. Key data points: ratios of distances between eyes, nose and lips; hair color; presence of beards and glasses. It takes roughly 20 seconds to create a 6,000-byte descriptor of a photo. An image comes up in the search results if its descriptor is a close match of the descriptor of the original. Riya uses the same technique for watches and rugs.
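Descriptor matching of this kind can be illustrated with a toy Python sketch. It is not Riya's algorithm, whose 6,000-byte descriptor is not described here. The functions `face_descriptor` and `is_close_match`, the three ratios chosen and the 0.05 threshold are all assumptions made for illustration. The point is that ratios of distances, unlike raw distances, stay the same when a face appears larger or smaller in the frame.

```python
import math


def face_descriptor(left_eye, right_eye, nose, lips):
    """Build a tiny descriptor from ratios of distances between facial points.

    Each argument is an (x, y) pixel coordinate. Dividing by the distance
    between the eyes makes the descriptor independent of image scale.
    """
    eye_dist = math.dist(left_eye, right_eye)
    eye_mid = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    return (
        math.dist(eye_mid, nose) / eye_dist,
        math.dist(nose, lips) / eye_dist,
        math.dist(eye_mid, lips) / eye_dist,
    )


def is_close_match(a, b, threshold=0.05):
    """Treat two descriptors as the same face if they lie within `threshold`.

    The threshold value is an arbitrary assumption for this sketch.
    """
    return math.dist(a, b) <= threshold


# The same hypothetical face photographed at two sizes, plus a different face.
tagged = face_descriptor((0, 0), (4, 0), (2, 3), (2, 5))
same_face_larger = face_descriptor((0, 0), (8, 0), (4, 6), (4, 10))
other_face = face_descriptor((0, 0), (4, 0), (2, 2), (2, 5))

same = is_close_match(tagged, same_face_larger)
different = is_close_match(tagged, other_face)
```

In this example `same` is true and `different` is false. A tilted head or dim lighting would shift the measured points, which is consistent with the weaknesses Shah describes.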
In the next two months Riya will launch object searches in five categories (rugs, shoes, handbags, jewelry and apparel), using images from online retailers. Riya will get a cut of the sale when it refers buyers to sites.
Google, which was rumored to have recently made a bid for Riya, ended up buying a face-recognition outfit called Neven Vision for an undisclosed amount in August. Google is mum on its plans to enhance its image search but will likely start by upgrading the power of its photo album software, Picasa. Don’t be surprised if the skills migrate to product-finding or some other commercial angle.
Sidebar: But What’s So Wrong With Metadata? The search industry’s search for better tools to dig through the Web’s trove of video and photos.
The hunt for the right videoclip on the Web today is almost entirely dependent on metadata, the tags and descriptions that identify the who, what and where. Many video site operators believe that metadata is the best tool for searching and may be all we need. But some new firms, like Blinkx, are creating different ways to search inside video files hoping for better results.
SPEECH RECOGNITION—FOX NEWS
How’d Katie do? A search on Blinkx for “CBS doubles ratings” turned up a clip of Fox News’ analysis of the anchor’s debut. These words were not in the metadata description of the clip, but Blinkx’s engine found them using speech recognition to transcribe the Fox show itself.
SPEECH RECOGNITION—BBC NEWS
A Blinkx search for “mauresmo winning slam” found this footage of France’s Amelie Mauresmo winning the 2006 Australian Open. Mauresmo’s name was in the metadata, but the other words were heard in the play-by-play. Blinkx turned up highlights of the match.
A search for “stupid pet tricks” unearthed this clip of Maggie the Counting Dog. The search terms were not in the metadata, and just the word “trick” came up once during the segment. Blinkx figured out its relevance by associating words like dog, pooch and unbelievable.