Video search strives to be picture perfect

Could search engines improve their record on video by analysing sounds and images rather than relying on keywords?

Missed the winning goal in that crucial football match at the 2010 FIFA World Cup? Just get on the net and you’ll find hours of user-generated video content of every moment in every match.

But with 650,000 - and counting - World Cup 2010 videos uploaded to YouTube alone, finding the right replay is a challenge. Existing video search tools struggle to deal with such a volume of content, but the search giants are on the case. Microsoft is sharpening the ability of its search engine, Bing, to find video content. Google, meanwhile, is set to launch an internet TV service later this year, using its video search technology to deliver the right footage.

The core strength of these engines has been in text search, but video search seems likely to move away from this approach. That’s because sorting video content using metadata - the keyword tags manually attached to videos - is like searching via an interpreter. Tags encapsulate one person’s judgement of a video’s content, and a tag-only search system will produce a lot of irrelevant results, says Suranga Chandratillake, chief executive of online video and audio search engine blinkx. “For video search to be really effective, you need better ways to understand what is going on in the actual footage.”

As well as metadata, blinkx uses speech recognition algorithms to interrogate a video directly. The transcripts it generates provide more data for the firm’s text-based search engine. blinkx’s algorithms attempt to parse a chunk of speech into phonemes - the small sound segments that make up individual words. The speech recognition tools then attempt to reconstruct a sentence out of the phonemes. It is by no means a foolproof approach, however. “Two distinct sentences may contain indistinguishable phonemes,” Chandratillake says. “So ‘recognise speech’ could be transcribed as ‘wreck a nice beach’.”

blinkx has been working on improving its speech recognition capabilities by building in feedback mechanisms. For instance, the user-added tags provide context to help decide which of two candidate transcripts is more likely to be correct.
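
The idea of using tags as context can be illustrated with a minimal sketch. It is not blinkx's actual method: the function name `pick_transcript` and the simple word-overlap scoring are assumptions made purely for illustration.

```python
def pick_transcript(candidates, tags):
    """Return the candidate transcript that shares the most words with the tags.

    Hypothetical scoring: count the distinct words a transcript has in
    common with the user-added tags. Ties go to the earliest candidate.
    """
    tag_words = {tag.lower() for tag in tags}

    def overlap(text):
        return len(set(text.lower().split()) & tag_words)

    return max(candidates, key=overlap)


# Two transcripts that sound alike, as in the example quoted above.
candidates = ["recognise speech", "wreck a nice beach"]
best = pick_transcript(candidates, tags=["speech", "software", "demo"])
print(best)
```

With tags about speech software the first transcript wins; with tags such as "beach" and "holiday" the second would be chosen instead.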

The drawback with this type of phonetic transcription analysis is that it is only suited to video with good-quality sound, says David Gibbon at AT&T Labs Research in Middletown, New Jersey. “It encounters real problems with user-generated video, where the audio track may not be great,” he says - and such videos make up a sizeable chunk of online content.

Still, it might be possible to use the images themselves as part of the search. Next year, the US Defense Advanced Research Projects Agency (Darpa) will complete its $20 million Video and Image Retrieval and Analysis Tool (Virat) project, which uses computer vision algorithms to analyse surveillance footage for significant events.

More modest academic projects hint at the approaches Darpa might adopt. It’s relatively easy to capture a series of stills that summarise a video, says Martin Halvey, a computer scientist at the University of Glasgow, UK. Image analysis tools can then search those stills for a target image by identifying objects, faces, textures, letters and numerals. This is difficult on a large scale, however, because the processing power needed to compare one image with another becomes a problem when looking at huge numbers of files, Halvey says.

A different approach - semantic querying - could be the answer. It involves teaching a search engine to recognise semantic concepts, such as “grass”, “football” and “stadium”, using so-called supervised learning techniques, says Marcel Worring, a multimedia analysis researcher at the University of Amsterdam in the Netherlands. During a teaching phase, the system is fed with examples of the concept. Software algorithms define each concept by its colour, texture or shape to create a model of it.

“So with a new video, the model is applied and automatically a measure is given of how likely it is that the concept is present in that video,” says Worring.
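
As a rough illustration of this train-then-score idea, here is a minimal sketch. It does not reproduce the Amsterdam group's models: the three-number colour features, the nearest-centroid classifier and the names `train_concept` and `concept_score` are all assumptions chosen to keep the example small.

```python
import math


def centroid(vectors):
    """Average a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def train_concept(positive, negative):
    """Teaching phase: summarise labelled examples as two centroids."""
    return centroid(positive), centroid(negative)


def concept_score(model, features):
    """Return a value in [0, 1]; higher means the concept is more likely present."""
    pos, neg = model
    d_pos = math.dist(features, pos)
    d_neg = math.dist(features, neg)
    if d_pos + d_neg == 0:
        return 0.5  # the two centroids coincide, so the model cannot decide
    return d_neg / (d_pos + d_neg)


# Hypothetical features: mean (red, green, blue) of a video still, scaled 0-1.
grass_examples = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.2]]
other_examples = [[0.7, 0.2, 0.2], [0.5, 0.5, 0.6]]
grass_model = train_concept(grass_examples, other_examples)

print(concept_score(grass_model, [0.15, 0.75, 0.15]))  # a green-dominated still
print(concept_score(grass_model, [0.6, 0.35, 0.4]))    # a still with little green
```

A real system would use far richer colour, texture and shape descriptors and a stronger classifier, but the pattern is the same: learn from labelled examples, then output a likelihood for each new video.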

The strength of the semantic querying approach is that it can work at multiple levels, so it can narrow the search more effectively. Worring and his colleague, Jun Wu, created a relatively simple two-layered algorithm that first distinguishes videos based on genre - news broadcast or sports footage, for instance. The system then goes on to refine the search results according to the style of the content - distinguishing, for example, a video packed with close-up action from one containing graphics.

Wu and Worring tested their system on over 200 clips ranging from 2 to 31 minutes long, and genres including sport and pornography. It was able to classify the six genres it was trained to recognise, and identify seven semantic concepts with about 83 per cent accuracy. The researchers will present their work at the International Conference on Image and Video Retrieval in Xi’an, China, next week.

To search much larger video libraries, a good strategy might be to use keywords first to whittle down the number of results, then apply semantic querying to improve the quality and relevance of the videos finally presented to the searcher, says Worring.
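
That two-step strategy might look something like the following sketch. The video records, their precomputed concept scores and the `search` function are hypothetical and exist only to show the order of operations: a cheap keyword filter first, then a re-ranking by concept likelihood.

```python
def search(videos, keyword, concept, top_k=3):
    """Filter videos by a tag keyword, then rank the survivors by concept score."""
    keyword = keyword.lower()
    # Step 1: keyword filter to whittle down the candidates.
    shortlist = [
        video for video in videos
        if keyword in (tag.lower() for tag in video["tags"])
    ]
    # Step 2: order the shortlist by how likely the concept is to be present.
    ranked = sorted(
        shortlist,
        key=lambda video: video["concepts"].get(concept, 0.0),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical library; the concept scores are assumed to come from a trained model.
videos = [
    {"title": "Fan reaction", "tags": ["football"], "concepts": {"stadium": 0.2}},
    {"title": "Final highlights", "tags": ["world cup", "football"], "concepts": {"stadium": 0.9}},
    {"title": "Cooking show", "tags": ["food"], "concepts": {"stadium": 0.8}},
]

results = search(videos, keyword="football", concept="stadium")
print([video["title"] for video in results])
```

The expensive concept scoring only needs to run on the shortlist, which is what would make the combination practical for large libraries.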

Gibbon sounds a word of caution, though. The level of semantic detail that video search algorithms can actually recognise is still fairly limited, and a training session is required for each new concept. “I think there’s a long way to go before we can say we’re able to understand all that complexity,” he says.

If Gibbon is right, finding the ultimate video of that crucial goal might still be a problem even when the football party rolls into Brazil for the 2014 World Cup.