Automated Semantic Tagging of Speech Audio
Yves Raimond, Chris Lowis, Roderick Hodgson, BBC R&D
Jonathan Tweed, MetaBroadcast
Demo Track, WWW, Lyon
18 April 2012
The BBC archive

Increasing the value of archives by interlinking
Automatically extracting topics from speech audio and identifying them with Linked Data URIs
Workflow
- Automated speech recognition
- Term identification
- Term disambiguation
- Ranking
More details about each step in our LDOW paper
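
As a rough illustration of how the four steps chain together, here is a minimal Python sketch. It is not the method from the LDOW paper: the speech recogniser is a stub returning a fixed transcript, the `CANDIDATES` and `RELATED` tables and the `example.org` URIs are made up for the example, disambiguation is a naive "most links to the surrounding terms" heuristic, and ranking is a plain mention count.

```python
from collections import Counter

BASE = "http://example.org/resource/"  # hypothetical URI namespace

# Hypothetical lookup: spoken term -> candidate Linked Data URIs.
CANDIDATES = {
    "paris": [BASE + "Paris", BASE + "Paris_(mythology)"],
    "france": [BASE + "France"],
    "seine": [BASE + "Seine"],
}

# Hypothetical links between resources, used only for disambiguation.
RELATED = {
    BASE + "Paris": {BASE + "France", BASE + "Seine"},
    BASE + "Paris_(mythology)": set(),
    BASE + "France": {BASE + "Paris"},
    BASE + "Seine": {BASE + "Paris"},
}


def recognise_speech(audio_path):
    """Step 1 (stub): a real system would run speech recognition here."""
    return "paris is the capital of france and the seine runs through paris"


def identify_terms(transcript):
    """Step 2: keep the words that match a known label."""
    return [word for word in transcript.lower().split() if word in CANDIDATES]


def disambiguate(terms):
    """Step 3: for each term, pick the candidate most linked to the other terms."""
    chosen = []
    for term in terms:
        context = {uri for other in terms if other != term for uri in CANDIDATES[other]}
        best = max(CANDIDATES[term], key=lambda uri: len(RELATED.get(uri, set()) & context))
        chosen.append(best)
    return chosen


def rank(uris):
    """Step 4: order URIs by how often they were mentioned."""
    return Counter(uris).most_common()


def tag_programme(audio_path):
    return rank(disambiguate(identify_terms(recognise_speech(audio_path))))


tags = tag_programme("programme.wav")  # hypothetical file name; the stub ignores it
```

Here the ambiguous term "paris" resolves to the city rather than the mythological figure because the other terms in the transcript link to it, and it ranks first because it is mentioned twice.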
The World Service archive
- Around 70,000 programmes
- Covering 6 decades
- ~ 3 years of continuous audio
- Sparse metadata
- ~ 500 TB of content
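
To put these figures in perspective, the back-of-the-envelope calculation below derives an average programme length and a transfer time from the approximate numbers on this slide. The 1 Gbit/s link rate is an assumption chosen purely for illustration, as is the idea that all 500 TB would be moved over a single link.

```python
PROGRAMMES = 70_000           # "around 70,000 programmes"
AUDIO_YEARS = 3               # "~ 3 years of continuous audio"
ARCHIVE_BYTES = 500 * 10**12  # "~ 500 TB", taking 1 TB = 10^12 bytes

# Average programme length in minutes.
audio_minutes = AUDIO_YEARS * 365 * 24 * 60
avg_minutes = audio_minutes / PROGRAMMES

# Time to move the whole archive over a hypothetical link.
ASSUMED_LINK_BITS_PER_SECOND = 10**9  # assumption: 1 Gbit/s, fully utilised
transfer_days = ARCHIVE_BYTES * 8 / ASSUMED_LINK_BITS_PER_SECOND / 86_400

print(f"average programme length: {avg_minutes:.1f} minutes")
print(f"transfer time at 1 Gbit/s: {transfer_days:.0f} days")
```

Under these assumptions the average programme runs a little over 22 minutes, and moving the full archive would take roughly a month and a half of sustained transfer.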
Processing the World Service archive
- Workflow steps isolated in individual workers
- Computation-intensive workers on Amazon Web Services
- All coordinated through message queues, with an API centralising the data
- Only bottleneck: bandwidth to Amazon's servers
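
The worker-and-queue arrangement described above can be sketched with Python's standard library. This is a single-process toy under stated assumptions, not the deployed system: threads stand in for separate worker machines, `queue.Queue` stands in for the message queues, a dictionary stands in for the central API, and the `transcribe` and `tag` functions are placeholders for the real workflow steps.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down


def transcribe(audio_ref):
    # Placeholder for the speech recognition worker.
    return f"transcript({audio_ref})"


def tag(transcript):
    # Placeholder for term identification, disambiguation and ranking.
    return f"tags({transcript})"


def worker(step, inbox, outbox, store, lock):
    """Consume messages from inbox, run one step, record and forward the result."""
    while True:
        message = inbox.get()
        if message is STOP:
            if outbox is not None:
                outbox.put(STOP)  # propagate shutdown to the next worker
            return
        programme_id, payload = message
        result = step(payload)
        with lock:  # the dictionary stands in for the central API
            store.setdefault(programme_id, {})[step.__name__] = result
        if outbox is not None:
            outbox.put((programme_id, result))


def run_pipeline(programmes):
    store, lock = {}, threading.Lock()
    audio_queue, transcript_queue = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker,
                         args=(transcribe, audio_queue, transcript_queue, store, lock)),
        threading.Thread(target=worker,
                         args=(tag, transcript_queue, None, store, lock)),
    ]
    for thread in threads:
        thread.start()
    for programme_id, audio_ref in programmes.items():
        audio_queue.put((programme_id, audio_ref))
    audio_queue.put(STOP)
    for thread in threads:
        thread.join()
    return store


results = run_pipeline({"p1": "a.wav", "p2": "b.wav"})  # hypothetical identifiers
```

Because each step only talks to its inbox, its outbox and the central store, a slow step can be scaled by starting more workers on the same queue without touching the others.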