The BBC World Service Archive experiment

Yves Raimond, BBC R&D IRFS / @moustaki

The BBC Archive

Publishing our archive

The World Service archive

The missing metadata

Machine listening

Automated speech recognition

Automated transcripts

Automated tagging

So we now need to isolate these useful clues and infer unambiguous topics for the programme from them. As a target vocabulary we use Wikipedia and its data counterpart DBpedia: their identifiers are unambiguous and let us retrieve more information about each topic, for example the geolocation of a particular place. There are quite a lot of tools out there for going from a piece of text to a set of descriptive topics. However, most existing automated tagging tools are designed to work on manually written text and rely on punctuation, capitalisation and so on, none of which are present in our automated transcripts. We therefore developed our own tool. It first locates all possible clues in the automated transcripts. It then uses the structure of the Wikipedia graph to disambiguate and rank the keywords spotted throughout the transcript. For example, if a programme mentions Paris and the Eiffel Tower a lot, we will pick Paris in France, as it is closer in the Wikipedia graph to the Eiffel Tower. If a programme mentions Paris and Texas a lot, we will pick Paris in Texas, as it is closer in the Wikipedia graph to the Texas resource. For each programme, we get a ranked list of Wikipedia tags describing what the programme is about.
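To make the disambiguation step concrete, here is a minimal sketch of the idea, not the actual tool. It assumes a tiny hand-made link graph standing in for the Wikipedia graph, and it uses plain shortest-path distance as the notion of "closer"; the edge list, resource names and the functions `distance` and `disambiguate` are all illustrative assumptions.

```python
from collections import deque

# Toy undirected link graph standing in for Wikipedia/DBpedia.
# These edges are made up for illustration only.
EDGES = [
    ("Paris", "France"),
    ("Paris", "Eiffel_Tower"),
    ("Paris,_Texas", "Texas"),
    ("Texas", "United_States"),
    ("United_States", "France"),
]

GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)


def distance(graph, start, goal):
    """Number of links on the shortest path (breadth-first search)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == goal:
            return hops
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return float("inf")  # no path between the two resources


def disambiguate(candidates, context, graph=GRAPH):
    """Pick the candidate resource closest, in total, to the context resources."""
    return min(
        candidates,
        key=lambda c: sum(distance(graph, c, other) for other in context),
    )


paris_candidates = ["Paris", "Paris,_Texas"]
with_eiffel = disambiguate(paris_candidates, ["Eiffel_Tower"])
with_texas = disambiguate(paris_candidates, ["Texas"])
```

With this toy graph, a transcript that also mentions the Eiffel Tower resolves "Paris" to the French capital, while one that also mentions Texas resolves it to Paris, Texas. A real system would also need to weight candidates by how often they are mentioned in order to produce the ranked list.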

Example results

Processing archives in the cloud

We now have a process which can derive, for each programme within our archive, a ranked list of tags describing what the programme is about. However, processing large archives remains a challenge. Tagging a 60-minute programme can take around 90 minutes on one CPU, meaning it would take more than four years to process the entire World Service archive. We therefore developed a framework to process very large archives using Amazon Web Services. A message queue distributes work between a number of independent "workers" hosted on AWS, which pick up new jobs as soon as they are up and ready to process data. Using AWS gives us an effectively unlimited number of such workers, meaning that the only bottleneck when processing a large archive is the bandwidth between our content servers and Amazon's servers. In our case it took around two weeks to process the whole archive (70,000 programmes), for a total cost of around $3,000.
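The queue-and-workers pattern described above can be sketched locally as follows. This is only an illustration of the pattern, under stated assumptions: Python threads and an in-process `queue.Queue` stand in for the AWS-hosted workers and the message queue, and `tag_programme` is a hypothetical placeholder for the real transcription and tagging step.

```python
import queue
import threading


def tag_programme(programme_id):
    # Hypothetical placeholder for the real transcription and tagging work.
    return f"tags-for-{programme_id}"


def worker(jobs, results):
    """Pull jobs off the queue until a None sentinel says there is no more work."""
    while True:
        programme_id = jobs.get()
        if programme_id is None:
            break
        results[programme_id] = tag_programme(programme_id)


def process_archive(programme_ids, n_workers=4):
    """Distribute programmes between independent workers via a shared queue."""
    jobs = queue.Queue()
    results = {}
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for programme_id in programme_ids:
        jobs.put(programme_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so each one stops
    for thread in threads:
        thread.join()
    return results


results = process_archive(range(10), n_workers=3)
```

Because workers only ever talk to the queue, adding more of them needs no coordination between workers, which is what makes it possible to scale the number of workers up and down freely.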

Noise

Algorithms and people

http://worldservice.prototyping.bbc.co.uk

Data validation

Speaker segmentation

Crowd-sourcing speaker names

Speakerthon

Propagating speaker names

Evaluating speaker identification

Refining our models

User activity

How good is the data?

Tags are a large and sparse space
When is a tag correct?
When is a programme tagged completely?
How do you measure crowdsourced data?

Who does the work?

Emerging shape of the archive

Visualising the archive

We've been experimenting with various ways of presenting the resulting data. In particular, the archive may hold programmes that could provide context for current news events. This visualisation tries to tackle exactly that: surfacing archive programmes that relate to current news events. The big blue dots are topics that were discussed on BBC News in the last five minutes. The small dots are programmes within the archive that relate to those topics. The redder a dot is, the more likely it is to be relevant to a particular news event. The visualisation in this slide was captured during the May 2013 Prime Ministerial election in Pakistan, involving Imran Khan, a politician and former cricketer. The red programmes in this visualisation include a 1990 Benazir Bhutto documentary and a 2003 Imran Khan interview, which could help provide more context around this particular election. We published a paper about this visualisation at ISWC this year.
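One simple way to think about matching archive programmes to current topics is sketched below. It is an assumption made for illustration, not the method used in the visualisation: each programme is taken to have weighted tags, and its relevance is the summed weight of the tags that match the current news topics. The programme identifiers, weights and the functions `relevance` and `related_programmes` are all hypothetical.

```python
def relevance(programme_tags, news_topics):
    """Sum the weights of a programme's tags that match current news topics."""
    return sum(
        weight for tag, weight in programme_tags.items() if tag in news_topics
    )


def related_programmes(archive, news_topics):
    """Return (programme, score) pairs with a non-zero score, best first."""
    scored = [
        (relevance(tags, news_topics), programme)
        for programme, tags in archive.items()
    ]
    return [
        (programme, score)
        for score, programme in sorted(scored, reverse=True)
        if score > 0
    ]


# Hypothetical archive: programme identifier -> {tag: weight}.
archive = {
    "p1": {"Imran_Khan": 0.9, "Cricket": 0.4},
    "p2": {"Benazir_Bhutto": 0.7, "Pakistan": 0.6},
    "p3": {"Jazz": 0.8},
}
current_topics = {"Imran_Khan", "Pakistan"}
matches = related_programmes(archive, current_topics)
```

In a visualisation like the one described, a score of this kind could drive how red each small dot is drawn.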

Semantic Web Challenge 2013 - First prize!

Code

ClOud Marketplace for Multimedia Analysis

Conclusion

Thank you!

Photo credits:

http://www.flickr.com/photos/andyarmstrong/4402416306/
http://www.flickr.com/photos/nicecupoftea/8579975238/
http://www.flickr.com/photos/11561957@N06/5202870020/
http://www.flickr.com/photos/hubmedia/2141860216/
http://www.flickr.com/photos/allison_mcdonald/7604871594
http://www.flickr.com/photos/aayars/4072755936/