Talk in two parts
Use of Linked Data at the BBC
Challenges around Linked Data consumption
Use of Linked Data at the BBC
Radio since 1922
The BBC has been broadcasting radio since 1922...
TV since 1930
... and TV since 1930.
Since then it has grown to become one of the largest broadcasters in the world.
On the Web since 1994
The BBC had a web presence from quite early on as well. This
is a screenshot of the BBC web site in 1994.
Programme support
Since quite early on, programmes broadcast on the BBC would
have a section on the BBC web site. However these different
'micro-sites' were commissioned individually, for each programme,
causing a big disparity in terms of coverage, consistency and persistence.
A few programmes would get a big web presence, while a very long
tail of programmes would get no web presence at all.
1000 to 1500 programmes per day across 70 channels
As we broadcast between 1,000 and 1,500 programmes per day across around 70 channels,
this approach does not scale well if we want all of these programmes to have an online presence.
www.bbc.co.uk/programmes
We launched our BBC Programmes web site in 2007 to tackle this issue, by aggregating data from multiple
sources, such as commissioning data, archive data and data from playout systems,
and creating a persistent web presence for each of our programmes. Each individual
programme has its own (persistent) URI within the BBC Programmes site.
All programmes get some web presence, which can
be enriched by creating themes around the programme and adding more data around our core data.
Our core data effectively acts as a backbone for all that ancillary content.
The web site is the API
Another important aspect of BBC Programmes is that the data behind each page can be
accessed through content negotiation. So if I request an RDF representation of that same programme
I was looking at earlier, I get the following. There is no separate API - the web site is its own API, and
provides the data in JSON, XML, RDF and RDFa. I think BBC Programmes was one of the first large 'corporate' Linked
Data sites to be published.
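To give a rough idea, requesting the data behind a programme page looks something like this (a jQuery sketch; the programme id below is a placeholder rather than a real pid):
$.ajax({
  url: 'http://www.bbc.co.uk/programmes/b0000000',
  headers: { Accept: 'application/rdf+xml' },
  success: function (rdfxml) {
    // Same URI as the HTML page, but this time we get an RDF representation of its data
    console.log(rdfxml);
  }
});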
Exposing all this data has many advantages. People experiment with our data and give us
ideas of new sources of data to integrate or new types of user experiences.
schema.org
We are also working with schema.org to make it compatible with TV and radio-specific concepts.
Schema.org is an effort led by the major search engines to define bits of semantic markup
that can be added to pages and used by these search engines to enrich their search results.
Working with schema.org means that major search engines can extract
the RDFa we embed in our pages and surface more
information about our programmes. For example, here we see what Google's "Knowledge Graph"
has to say about the EastEnders BBC series, which includes information from Freebase and information
aggregated from schema.org markup from all over the web.
Using Linked Data (external)
We also link to external sources of Linked Data like MusicBrainz or DBpedia. This
enables us to use extra information held within the Linked Data cloud to enrich our pages. For example, this page on Tom Waits
has a biography coming from Wikipedia and artist metadata coming from MusicBrainz. The only bits
of BBC data on this page are the playcount data (how often this artist was broadcast in our programmes)
and the album review data.
Using Linked Data (internal)
We also use the Linked Data we publish internally. For example on this page the
aggregation of programmes at the bottom is generated from Linked Data published on the BBC
Programmes web site. However, in this particular example the integration was ad hoc, directly
using the RDF data published by BBC Programmes in this web application.
Towards a Linked Data Platform
We're therefore building towards a "Linked Data Platform", centralising all this RDF data across the BBC
and making it queryable through SPARQL.
This platform would enable us to easily perform cross-domain queries, for example to
generate cross-domain aggregations, including
content from BBC sport, news, music, radio, TV, food, etc.
Such cross-domain queries and aggregations were traditionally very difficult to build at the BBC,
due to the way our data was siloed and split across many different applications - TV and radio, news and sport,
knowledge and learning...
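To give a flavour of what this enables, a cross-domain query against such a platform could look something like the following sketch (the endpoint URL and the tagging predicate are illustrative placeholders, not our actual schema):
var endpoint = 'http://example.bbc.internal/sparql';
var query = [
  'PREFIX bbc: <http://example.org/bbc/ontology/>',
  'SELECT ?item WHERE {',
  '  ?item bbc:about <http://dbpedia.org/resource/London> .',
  '}'
].join('\n');
$.post(endpoint, { output: 'json', query: query }, function (data) {
  // One query, and the items come back from news, sport, music, TV, radio, food...
  console.log(data.results.bindings);
}, 'json');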
World Cup 2010
This approach was first tested on the World Cup 2010 website as a means
of automating aggregation pages, for example around specific teams
or footballers, which were previously manually put together and maintained.
Tagging articles
These automated aggregations for various World Cup 2010-related concepts
are driven by journalists tagging articles
with web identifiers available in a centralised triple store, denoting people, places,
events etc. and sourced from multiple data providers (including
Linked Open Data sources).
The resulting relationships between news articles and these concepts
are then pushed to the central triple store.
A benefit of using web identifiers as tags
is that they are unambiguous, and that we can retrieve more information about these tags when needed.
For example, when news articles are tagged with place URIs, we can easily retrieve the geolocation
of these places, enabling us to plot our articles on a map. Tagging articles with URIs
enables us to tackle a wide range of unforeseen use-cases.
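As an illustration, plotting tagged articles on a map boils down to a query along these lines (the tagging predicate is a made-up example; the geo terms are the standard W3C wgs84_pos vocabulary):
var query = [
  'PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>',
  'PREFIX ex:  <http://example.org/bbc/tagging/>',
  'SELECT ?article ?lat ?long WHERE {',
  '  ?article ex:taggedWith ?place .',
  '  ?place geo:lat ?lat ; geo:long ?long .',
  '}'
].join('\n');
// Each binding gives an article plus coordinates, ready to plot on a map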
Dynamic aggregations
Aggregation pages are then built by issuing SPARQL queries to that central
triple store.
For example the England page automatically
includes links to news articles that were tagged with
the England team URI using this tagging tool.
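The query behind such a page looks roughly like this sketch (the tagging predicate and the England team URI are placeholders for the identifiers actually held in the store):
var englandTeam = 'http://example.org/world-cup-2010/teams/england';
var query = [
  'PREFIX ex: <http://example.org/bbc/tagging/>',
  'SELECT ?article ?published WHERE {',
  '  ?article ex:taggedWith <' + englandTeam + '> ;',
  '           ex:published ?published .',
  '}',
  'ORDER BY DESC(?published)',
  'LIMIT 10'
].join('\n');
// The most recently published tagged articles come back first and are rendered on the England page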
BBC London 2012
The same approach was scaled up and used to drive the London 2012 BBC web site, covering
around 250 countries, 300 events, 36 sports, 10,000 athletes and 30 venues. This data
is sourced from multiple places, both from Linked Open Data sources and from commercial data providers.
Once all this data is sourced and we have web identifiers for all these things in our centralised triple
store, we can start annotating our content with them, in a similar way to what was done for the World Cup
2010 web site. We used this mechanism to build automated aggregation pages
for each of these things - for each country, sport, athlete and so on - by querying the
centralised triple store for all BBC items related to them.
The BBC London 2012 web site has been hugely successful, and has proved the feasibility
of this approach, heavily based on Semantic Web technologies, using a triple store, queried
through SPARQL, to store relationships between our content and domain knowledge.
We're now extending that work beyond sport, to aggregate all sorts
of BBC and non-BBC data. This will enable us to easily create feeds such as 'all news articles about
Barack Obama' or 'all videos about the place I am at', and to easily build cross-domain
aggregations.
BBC Ontologies
In order to support this approach we created a bunch of ontologies, modelling the domain knowledge held within the LDP
and often piggybacking on existing ones
(the Event Ontology, FOAF, the Music Ontology, GeoNames, etc.). They are all available on our website at bbc.co.uk/ontologies. They cover
programmes, wildlife, sports, learning and news, and we use them as a backbone for the data in the LDP.
The BBC Archive
Most of the data available within the Linked Data Platform is created manually.
That is a suitable approach going forward, but it would be difficult to scale going
backwards.
The BBC has been broadcasting since 1922
and has accumulated a very large archive -- radio and TV programmes,
news articles, pictures, sheet music, production notes... Some of it has been
catalogued, some of it hasn't.
I won't talk about it in a lot of detail here (I'll talk about it on Friday, and
will demo it at the Semantic Web Challenge session), but we're using Linked Data
a lot for annotating archive content, as a target vocabulary
for describing our programmes and as a seed for building automated interlinking
algorithms.
http://worldservice.prototyping.bbc.co.uk
In particular we're investigating a combination of automated interlinking and crowdsourcing to
interlink large archives with the Linked Data cloud.
If you're interested you can take a look at our BBC World Service archive prototype,
experimenting with these technologies on a very large radio archive, and driven almost entirely
by Semantic Web technologies.
Challenges
We've just described various places, throughout the BBC web site, where Linked Data
is being consumed. I will now try to describe various challenges we encountered
while consuming Linked Data. Some of them are little annoyances, some of them are a little
more fundamental. I should start by saying that this is a non-exhaustive list.
Challenge 1: public endpoints
I am sure lots of you came face to face with the DBpedia 'screen of death', with cryptic error messages, especially
just before semantic web conference deadlines. DBpedia isn't the only one, of course, and it would be unfair to blame it in particular.
The main challenge is that public endpoints are shared resources, and are sometimes overloaded. So it's difficult
to build reliable applications on top of most of them.
There is actually a paper
about this problem at ISWC, as well as a new service from the OKFN monitoring the
status of SPARQL end-points.
This is a huge challenge for Linked Data. The main idea behind Linked Data is to publish and interlink data on the Web, and make
it available for other people to reuse in unexpected ways; but if you actually do that, chances are that your application
will break in unexpected ways too. And the more data providers you end up hitting, the more likely it is that your app will go down.
Mitigation
Caching
Local aggregations
Replication and syncing
Testing and monitoring
Thankfully there are ways to work around that. For example you could cache data coming from these endpoints very aggressively, or
build local aggregations, storing all the data you need for your application. That's typically what we do for BBC Music, for example,
where we just care about music artists that have been played on the BBC, which is a very small subset of all the
information DBpedia has to offer.
Another option is to replicate entire datasets, using for example RDF dumps. The challenge then is to keep track of updates, and keep
your dataset in sync.
There have been a couple of attempts at standardising sync feeds for RDF datasets, but I am not sure
of their implementation status in current triple stores. When we have to do something like that, we often end up building custom solutions,
which is not ideal.
Once you have a local store under your control, before you can use it to build highly available applications, you need to test it to death.
As the performance of SPARQL queries varies a lot depending on the query and the dataset, all distinct queries need to be load-tested independently.
Quite a significant amount of work - can it be made generic?
Building local caches, maintaining them in sync and being confident of the performance of each query issued to the store requires
a lot of work. Could such a middleware be made generic?
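As a rough sketch of the caching side of such a middleware (in-memory only here; a real deployment would more likely sit behind a shared HTTP cache or a local aggregation store):
var cache = {};
var TTL = 24 * 60 * 60 * 1000; // keep results for a day
function cachedSelect(endpoint, query, callback) {
  var key = endpoint + '|' + query;
  var hit = cache[key];
  if (hit && Date.now() - hit.time < TTL) {
    return callback(hit.data); // serve from the local cache, without touching the public endpoint
  }
  $.get(endpoint, { output: 'json', query: query }, function (data) {
    cache[key] = { time: Date.now(), data: data };
    callback(data);
  }, 'json');
}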
Challenge 2: searching and indexing
Building search features on top of Linked Data is relatively challenging as well; however, it remains
a feature most people expect to see. Another challenge is to index particular views on top of your
RDF data, in order to make your application fast enough.
Mitigation
Lots of regexp and FILTERs...
Full text search extensions in SPARQL end-points
Document store indexing the results of SPARQL queries
There are multiple ways you can deal with such issues. For search, you can use lots of regexps and FILTERs, although
that's probably going to push a lot of complexity into your SPARQL query, which might hinder performance.
Some triple stores also provide full text search extensions, which in some cases are pretty fast, but non-standard.
Another approach is to create 'canned' SPARQL queries (e.g. give me all BBC programmes featuring a particular contributor)
and to index their results in a document store sitting on top of a generic SPARQL endpoint. That way you keep the flexibility
of a SPARQL endpoint (it's easy to generate new views on your data) and you make specific sections of the data
available from a very fast index.
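For comparison, the 'regexps and FILTERs' version of search is a query along these lines, with the matching pushed into the store itself:
var searchQuery = [
  'PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>',
  'SELECT ?thing ?label WHERE {',
  '  ?thing rdfs:label ?label .',
  '  FILTER(regex(?label, "tom waits", "i"))',
  '}'
].join('\n');
// It works, but the regular expression has to be evaluated over every label in the store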
4store + ElasticSearch
That's exactly how we handle search within the World Service archive prototype. We use a 4store SPARQL endpoint, with
an ElasticSearch instance sitting on top of it. For a set of pre-defined SPARQL queries, we maintain an index in ElasticSearch,
which we can use for very fast faceted search. We ended up using the same approach for indexing most of the queries we'd otherwise issue
directly to 4store.
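The indexing step itself is conceptually simple: run a canned query and push each result into ElasticSearch over its REST API. This is only a sketch - the endpoint URLs, index names and the contributor predicate below are all illustrative:
var sparqlEndpoint = 'http://localhost:8000/sparql/';
var cannedQuery =
  'SELECT ?programme ?contributor WHERE { ?programme <http://example.org/contributor> ?contributor }';
$.get(sparqlEndpoint, { output: 'json', query: cannedQuery }, function (data) {
  data.results.bindings.forEach(function (row, i) {
    // Each result row becomes a document in the index, keyed by its position
    $.ajax({
      type: 'PUT',
      url: 'http://localhost:9200/archive/programme-contributor/' + i,
      contentType: 'application/json',
      data: JSON.stringify({
        programme: row.programme.value,
        contributor: row.contributor.value
      })
    });
  });
}, 'json');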
Challenge 3: consuming Linked Data (hmm...)
The next challenge is simply, well, to consume Linked Data.
Knowledge about Linked Data, RDF and SPARQL has increased dramatically in the last couple
of years, with more and more companies asking for it, and more and more universities
teaching it. However, a side-effect of that may be that more and more libraries assume
quite an extensive amount of RDF knowledge, and aren't easy to pick up for newcomers.
This means that, in most cases, consuming Linked Data remains something that's only accessible
to a minority.
Consuming Linked Data is (still) hard
Let me illustrate that with an example. I was recently writing an RDF tutorial.
Quite a lot of my team (and more and more people across the BBC) work with JavaScript
extensively, both server and client-side. So I thought I would work my way through a couple
of JavaScript examples. So far so good.
rdflib.js
var kb = $rdf.graph();
var fetch = $rdf.fetcher(kb);
var FOAF = $rdf.Namespace("http://xmlns.com/foaf/0.1/");
var uri = 'http://bblfish.net/people/henry/card#me';
var person = $rdf.sym(uri);
// The person URI and the document URI differ (httpRange-14 in action)
var docURI = uri.slice(0, uri.indexOf('#'));
fetch.nowOrWhenFetched(docURI, undefined, function(ok, body) {
  var friends = kb.each(person, FOAF('knows'));
  console.log(friends[0].uri);
});
The first library I stumbled on was rdflib.js, used in the Tabulator. I am a big fan of the Tabulator, so I figured
that would be my first choice. The documentation is very sparse (actually, I couldn't find any), but using the examples
in the GitHub repository I came up with the snippet above, which illustrates how to load a URI and access a simple bit of
data (one of Henry's friends). Now forget everything you know about RDF - how much of that makes sense? There's
even a bit of httpRange-14 hidden in there!
Now try the same thing with e.g. a DBpedia URI (or most other datasets in the Linked Data cloud).
It will silently fail, as CORS headers are very sparsely deployed.
rdfstore.js
rdfstore.create(function(store) {
  store.execute('LOAD <http://musicontology.com/specification/index.ttl>', function() {
    store.execute('SELECT * WHERE { ?s ?p ?o }',
      function(success, results) {
        console.log(results);
      });
  });
});
Another library I tried is rdfstore.js. It is a remarkably complete library, suitable both for client- and server-side
use. It even includes a full (optionally persistent) triple store. However, after quite a few attempts at using it client-side I had to give up.
I couldn't find any URI for which the above example worked - sometimes the CORS headers are missing (fine), but most other times
it just fails (e.g. for dbpedialite), or returns no results (e.g. for musicontology). In terms of API it relies
on SPARQL knowledge, which is interesting (having to learn one standard query language rather than a library-specific API is probably
better). But it does require SPARQL knowledge, so it is only accessible to a minority of developers.
Another example of a JavaScript API that mostly only requires some SPARQL knowledge and is very complete is rdfQuery.
What I ended up doing...
var sparql = "http://my-rww-triple-store.org/sparql";
// A follow-up SELECT over the loaded data (any SPARQL query would do here)
var query = encodeURIComponent("SELECT * WHERE { ?s ?p ?o } LIMIT 10");
$.post(sparql, "output=json&query=LOAD+<http://dbpedialite.org/titles/Irun>", function (load_data) {
  $.post(sparql, "output=json&query=" + query, function (data) {
    console.log(data['results']['bindings']);
  });
});
In the end, I just used jQuery and did everything through SPARQL: issue a SPARQL Update
query to a read-write endpoint to load a resource, and then query the endpoint.
The most depressing thing is that I do know about RDF and SPARQL, which is probably why I
can do something like that. I wonder how we can expect anyone not familiar with RDF and SPARQL to be
able to consume Linked Data.
EasyRDF
$foaf = new EasyRdf_Graph("http://njh.me/foaf.rdf");
$foaf->load();
$me = $foaf->primaryTopic();
echo "My name is: ".$me->get('foaf:name')."\n";
What's frustrating is that, as a community, we realised ages ago that this was
an issue, and there have been numerous attempts to solve it, mostly through libraries abstracting
the triple-based nature of RDF and mapping it to objects or to other paradigms - for example ActiveRDF in Ruby, SuRF in Python,
and EasyRdf in PHP. However, EasyRdf is the only one of these three that is still actively maintained. I am not entirely
sure why that is, but my feeling is that we grew more confident as a community for a variety of
reasons and put our focus on tools that require RDF and SPARQL knowledge, rather than on making
them accessible outside of our community. Sadly this has the effect of alienating a large part of
the web community.
JSON-LD
{
  "@context": "http://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
At this point I should also mention JSON-LD, which enables RDF data to 'look like'
proper JSON. Most of the translation layer (how to resolve all these keywords to URIs) can be held
in a separate 'context', which adds minimal overhead compared to how the data would be represented in plain JSON.
It means that JSON-LD feeds can be parsed as normal JSON and still make sense, and can be mapped to RDF if need be (i.e.
to follow links to other sources of data or to resolve vocabularies).
I truly hope that with JSON-LD now almost being a Recommendation, we'll see more of it on the web, and that it will
make consuming Linked Data easier for people familiar with web development but not necessarily with RDF and
related technologies. There is a small risk, though, of something similar to what happened with RDF/XML: applications parsing it as
plain JSON, the underlying feed changing with no actual changes in the RDF, and applications breaking. We've seen that
happen a lot at the BBC with RDF/XML. We will
need to manage this risk by building simple tools that are aware of the mapping to RDF.
JSON-LD is actually at the core of the approach we're planning for the LDP: JSON-LD feeds generated through SPARQL, which are
used 'as JSON' to build the user-facing pages.
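To show how little RDF knowledge this demands, consuming a feed like the one above as plain JSON is just the following (the feed URL is a placeholder):
$.getJSON('http://example.org/people/john-lennon.jsonld', function (person) {
  console.log(person.name);    // "John Lennon"
  console.log(person.spouse);  // a URI you can dereference for more Linked Data
});
A JSON-LD-aware tool can expand the same document back to full URIs via the '@context', but nothing forces the consumer to do so.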
RDF API Editor's Draft
var people = data.getProjections("rdf:type", "foaf:Person");
A couple of years ago some work was done on defining a higher-level API for interacting with RDF, for
example in JavaScript. I am not sure what the status of it is these days, or whether any implementations are
currently available.
Challenge 4: RDF data mining
A wide range of data is available in RDF
Few OSS tools available to do data mining over RDF data...
... or machine learning
Or bridges to existing libraries (Weka, Mahout, scikit-learn)
rdfspace
from rdfspace.space import Space
space = Space('influencedby.nt', rank=50)
space.similarity('http://dbpedia.org/resource/JavaScript', 'http://dbpedia.org/resource/ECMAScript')
space.similarity('http://dbpedia.org/resource/Albert_Camus', 'http://dbpedia.org/resource/JavaScript')
We do quite a lot of data mining over RDF data at the BBC, for example to derive similarities
between various topics or to do concept tagging of our programmes. The rdfspace package generates a vector space
from large RDF dumps, which you can then use to compute very fast similarity measures. It does that by performing
an SVD on the adjacency matrix of the input RDF graph.
Conclusion
We use lots of Linked Data at the BBC
But there are still lots of challenges to tackle to make its consumption easier, namely...
Generic caching/replication layers
Better support for search and indexing
Accessible libraries for dealing with RDF data
Better tools to learn from or mine RDF data
Thank you!
Photo credits:
http://www.flickr.com/photos/andyarmstrong/4402416306/
http://www.flickr.com/photos/nicecupoftea/8579975238/
http://www.flickr.com/photos/11561957@N06/5202870020/
http://www.flickr.com/photos/hubmedia/2141860216/
http://www.flickr.com/photos/allison_mcdonald/7604871594
http://www.flickr.com/photos/aayars/4072755936/