The BBC World Service archive prototype

Towards an alternative approach to publishing large archives?

Yves Raimond, BBC R&D IRFS / @moustaki

The BBC Archive

Cataloguing the archive

In Our Time archive

BBC Four Army collection

Tagging programmes

Linked Data

An alternative approach

The World Service archive

Unlocking the archive through machine listening

Automated speech recognition

Automated transcripts

Automated tagging

So we now need to isolate those useful clues and infer the topics of the programme from them. Most existing concept tagging tools are designed to work on manually written text, and rely on punctuation, capitalisation, etc. We therefore developed our own tool, which locates those useful clues in the automated transcripts. We use the structure of the DBpedia graph to disambiguate and rank keywords spotted throughout the transcript. For example, if a programme mentions Paris and the Eiffel Tower a lot, we will pick Paris in France, as it is closer in the DBpedia graph to the Eiffel Tower. If a programme mentions Paris and Texas a lot, we will pick Paris in Texas, as it is closer in the DBpedia graph to the Texas resource. For each programme, we get a ranked list of DBpedia tags describing what the programme is about.
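The Paris example can be illustrated with a minimal sketch. It is not the tool itself: the toy graph, its links, and the names `hops` and `disambiguate` are all assumptions made for illustration, and the real DBpedia graph is far larger. The idea shown is simply to pick, among the candidate resources for an ambiguous keyword, the one that sits closest to the other resources mentioned in the same programme.

```python
from collections import deque

# Toy undirected graph standing in for a tiny slice of DBpedia.
# The nodes and links are hypothetical, chosen only to mirror the Paris example.
TOY_GRAPH = {
    "Paris": {"France", "Eiffel_Tower"},
    "Eiffel_Tower": {"Paris"},
    "France": {"Paris"},
    "Paris,_Texas": {"Texas"},
    "Texas": {"Paris,_Texas", "United_States"},
    "United_States": {"Texas"},
}


def hops(graph, start, goal, max_hops=4):
    """Breadth-first search: number of links between two resources.

    Returns max_hops + 1 when the goal is not reachable within max_hops.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the search horizon
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return max_hops + 1


def disambiguate(graph, candidates, context):
    """Pick the candidate with the smallest total distance to the context resources."""
    return min(candidates, key=lambda c: sum(hops(graph, c, r) for r in context))


candidates = ["Paris", "Paris,_Texas"]
print(disambiguate(TOY_GRAPH, candidates, ["Eiffel_Tower"]))
print(disambiguate(TOY_GRAPH, candidates, ["Texas"]))
```

The same total distance could also serve as a crude ranking signal: resources that are close to many other resources spotted in the transcript would rank higher.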

Example results

Automated tagging evaluation

Dataset of 132 programmes manually tagged
TopN measure
Random baseline: 0.0002
Our algorithm: 0.209
Next best: 0.195
Dataset and evaluation script available on our GitHub
Core algorithm available on GitHub

We evaluated our algorithm on a dataset of 132 manually tagged programmes. We use TopN, a measure used for evaluating automated tagging systems in the Music Information Retrieval world. This measure will be 1 if the top-N tags returned by our algorithm exactly match the N manually added tags. The measure will be lower the further down a manually applied tag is in the list of returned tags. We compared our algorithm with a baseline random tagger and with a couple of off-the-shelf concept tagging tools, and our algorithm performed best. This is understandable, as those concept tagging tools are meant to work on manually written text, not noisy automated transcripts. We also published our evaluation dataset and scripts, including the results of our automated transcription, on our GitHub page. Other labs are currently working with this dataset to evaluate their automated tagging algorithms.
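To make the two properties of the measure concrete, here is a small rank-sensitive score that behaves the same way. It is not the TopN formula used in the evaluation (the published evaluation script is the reference for that); the function name `rank_sensitive_score` and the decay rule are assumptions chosen only to reproduce the behaviour described: a score of 1 when the top N returned tags are exactly the N manual tags, and a lower score the further down a manual tag appears.

```python
def rank_sensitive_score(manual_tags, returned_tags):
    """Illustrative score in [0, 1]; an assumption, not the published TopN definition.

    Each manual tag earns full credit if it appears within the first N returned
    tags (N = number of manual tags), decaying credit N / (position + 1) if it
    appears further down, and no credit if it is missing altogether.
    """
    n = len(manual_tags)
    if n == 0:
        return 0.0
    # Keep the first (best) position of each returned tag, 0-based.
    position = {}
    for index, tag in enumerate(returned_tags):
        position.setdefault(tag, index)
    total = 0.0
    for tag in manual_tags:
        if tag not in position:
            continue  # tag never returned: no credit
        p = position[tag]
        total += 1.0 if p < n else n / (p + 1)
    return total / n


# The top-2 returned tags are exactly the 2 manual tags.
print(rank_sensitive_score(["a", "b"], ["b", "a", "c"]))
# One manual tag is pushed down to the fourth position.
print(rank_sensitive_score(["a", "b"], ["a", "c", "d", "b"]))
```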

Processing archives in the cloud

Bootstrapping search and discovery

Noise

Data validation

Speaker segmentation

Crowd-sourcing speaker names

Propagating speaker names

Evaluating speaker identification

User activity

Emerging shape of the archive

Visualising the archive

Such large archives of content hold a significant amount of 'institutional memory'. A large number of topics will be covered in some form by the programmes held in this archive. In particular, the archive may hold programmes that could provide context for current news events. For example, a 1983 'Medical Programme' episode on techniques for measles immunisation could help put a recent epidemic in context. This visualisation tries to tackle exactly that: surfacing archive programmes that relate to current news events. The big blue dots are topics that were discussed on BBC News in the last five minutes. The small dots are programmes within the archive that relate to those topics. The redder a dot is, the more connected it is, and so the more likely it is to be relevant to a particular news event.

http://worldservice.prototyping.bbc.co.uk

ClOud Marketplace for Multimedia Analysis

(COMMA)

Thank you!

Photo credits:

http://www.flickr.com/photos/andyarmstrong/4402416306/
http://www.flickr.com/photos/nicecupoftea/8579975238/
http://www.flickr.com/photos/11561957@N06/5202870020/
http://www.flickr.com/photos/hubmedia/2141860216/
http://www.flickr.com/photos/allison_mcdonald/7604871594
http://www.flickr.com/photos/aayars/4072755936/