Talk in two parts
Use of Linked Data at the BBC
Challenges around Linked Data consumption
Use of Linked Data at the BBC
Radio since 1922
The BBC has been broadcasting radio since 1922...
TV since 1930
... and TV since 1930.
Since then it has grown to become one of the largest broadcasters in the world.
On the Web since 1994
The BBC had a web presence from quite early on as well. This
is a screenshot of the BBC web site in 1994.
Programme support
Since quite early on, programmes broadcast on the BBC would
have a section on the BBC web site. However these different
'micro-sites' were commissioned individually, for each programme,
causing a big disparity in terms of coverage, consistency and persistence.
A few programmes would get a big web presence, while a very long
tail of programmes would get no web presence at all.
1000 to 1500 programmes per day across 70 channels
As we broadcast between 1,000 and 1,500 programmes per day across around 70 channels,
this approach does not scale well if we want all of these programmes to have an online presence.
www.bbc.co.uk/programmes
We launched our BBC Programmes web site in 2007 to tackle this issue, by aggregating data from multiple
sources, such as commissioning data, archive data and data from playout systems,
and creating a persistent web presence for each of our programmes. Each individual
programme has its own (persistent) URI within the BBC Programmes site.
All programmes get some web presence, which can
be enriched by creating themes around the programme and adding more data around our core data.
Our core data effectively acts as a backbone for all that ancillary content.
The web site is the API
Another important aspect of BBC Programmes is that the data behind each page can be
accessed through content negotiation. So if I request an RDF representation of that same programme
I was looking at earlier, I get the following. There is no separate API - the web site is its own API, and
provides the data in JSON, XML, RDF and RDFa. I think BBC Programmes was one of the first large 'corporate' Linked
Data sites to be published.
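To give a rough idea, requesting the data behind a programme page looks something like this (a jQuery sketch; the programme id below is a placeholder rather than a real pid):
$.ajax({
  url: 'http://www.bbc.co.uk/programmes/b0000000',
  headers: { Accept: 'application/rdf+xml' },
  success: function (rdfxml) {
    // Same URI as the HTML page, but this time we get an RDF representation of its data
    console.log(rdfxml);
  }
});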
Exposing all this data has many advantages. People experiment with our data and give us
ideas of new sources of data to integrate or new types of user experiences.
schema.org
We are also working with schema.org to make it compatible with TV and radio-specific concepts.
Schema.org is an effort led by the major search engines to define bits of semantic markup
that can be added to pages and used by these search engines to enrich their search results.
Working with schema.org means that major search engines can extract
the RDFa we embed in our pages and surface more
information about our programmes. For example, here we see what Google's "Knowledge Graph"
has to say about the EastEnders BBC series, which includes information from Freebase and information
aggregated from schema.org markup from all over the web.
Using Linked Data (external)
We also link to external sources of Linked Data like MusicBrainz or DBpedia. This
enables us to use extra information held within the Linked Data cloud to enrich our pages. For example, this page on Tom Waits
has a biography coming from Wikipedia and artist metadata coming from MusicBrainz. The only bits
of BBC data on this page are the playcount data (how often this artist was broadcast in our programmes)
and the album review data.
Using Linked Data (internal)
We also use the Linked Data we publish internally. For example on this page the
aggregation of programmes at the bottom is generated from Linked Data published on the BBC
Programmes web site. However, in this particular example the integration was ad hoc, directly
using the RDF data published by BBC Programmes in this web application.
Towards a Linked Data Platform
We're therefore building towards a "Linked Data Platform", centralising all this RDF data across the BBC
and making it queryable through SPARQL.
This platform would enable us to easily perform cross-domain queries, for example to
generate cross-domain aggregations, including
content from BBC sport, news, music, radio, TV, food, etc.
Such cross-domain queries and aggregations were traditionally very difficult to build at the BBC,
due to the way our data was siloed and split across many different applications - TV and radio, news and sport,
knowledge and learning...
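To give a flavour of what this enables, a cross-domain query against such a platform could look something like the following sketch (the endpoint URL and the tagging predicate are illustrative placeholders, not our actual schema):
var endpoint = 'http://example.bbc.internal/sparql';
var query = [
  'PREFIX bbc: <http://example.org/bbc/ontology/>',
  'SELECT ?item WHERE {',
  '  ?item bbc:about <http://dbpedia.org/resource/London> .',
  '}'
].join('\n');
$.post(endpoint, { output: 'json', query: query }, function (data) {
  // One query, and the items come back from news, sport, music, TV, radio, food...
  console.log(data.results.bindings);
}, 'json');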
World Cup 2010
This approach was first tested on the World Cup 2010 website as a means
of automating aggregation pages, for example around specific teams
or footballers, which were previously manually put together and maintained.
Tagging articles
These automated aggregations for various World Cup 2010-related concepts
are driven by journalists tagging articles
with web identifiers available in a centralised triple store, denoting people, places,
events etc. and sourced from multiple data providers (including
Linked Open Data sources).
The resulting relationships between news articles and these concepts
are then pushed to the central triple store.
A benefit of using web identifiers as tags
is that they are unambiguous, and that we can retrieve more information about these tags when needed.
For example, when news articles are tagged with place URIs, we can easily retrieve the geolocation
of these places, enabling us to plot our articles on a map. Tagging articles with URIs
enables us to tackle a wide range of unforeseen use-cases.
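As an illustration, plotting tagged articles on a map boils down to a query along these lines (the tagging predicate is a made-up example; the geo terms are the standard W3C wgs84_pos vocabulary):
var query = [
  'PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>',
  'PREFIX ex:  <http://example.org/bbc/tagging/>',
  'SELECT ?article ?lat ?long WHERE {',
  '  ?article ex:taggedWith ?place .',
  '  ?place geo:lat ?lat ; geo:long ?long .',
  '}'
].join('\n');
// Each binding gives an article plus coordinates, ready to plot on a map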
Dynamic aggregations
Aggregation pages are then built by issuing SPARQL queries to that central
triple store.
For example the England page automatically
includes links to news articles that were tagged with
the England team URI using this tagging tool.
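The query behind such a page looks roughly like this sketch (the tagging predicate and the England team URI are placeholders for the identifiers actually held in the store):
var englandTeam = 'http://example.org/world-cup-2010/teams/england';
var query = [
  'PREFIX ex: <http://example.org/bbc/tagging/>',
  'SELECT ?article ?published WHERE {',
  '  ?article ex:taggedWith <' + englandTeam + '> ;',
  '           ex:published ?published .',
  '}',
  'ORDER BY DESC(?published)',
  'LIMIT 10'
].join('\n');
// The most recently published tagged articles come back first and are rendered on the England page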
BBC London 2012
The same approach was scaled up and used to drive the London 2012 BBC web site, covering
around 250 countries, 300 events, 36 sports, 10,000 athletes and 30 venues. This data
is sourced from multiple places, both from Linked Open Data sources and from commercial data providers.
Once all this data is sourced and we have web identifiers for all these things in our centralised triple
store, we can start annotating our content with them, in a similar way to what was done for the World Cup
2010 web site. We used this mechanism to build automated aggregation pages
for each of these things - for each country, sport, athlete and so on - by querying the
centralised triple store for all BBC items related to them.
The BBC London 2012 web site has been hugely successful, and has proved the feasibility
of this approach, heavily based on Semantic Web technologies, using a triple store, queried
through SPARQL, to store relationships between our content and domain knowledge.
We're now extending that work beyond sport, to aggregate all sorts
of BBC and non-BBC data. This will enable us to easily create feeds such as 'all news articles about
Barack Obama' or 'all videos about the place I am at', and to easily build cross-domain
aggregations.
BBC Ontologies
In order to support this approach we created a bunch of ontologies, modelling the domain knowledge held within the LDP
and often piggybacking on existing ones
(the Event Ontology, FOAF, the Music Ontology, GeoNames, etc.). They are all available on our website at bbc.co.uk/ontologies. They cover
programmes, wildlife, sports, learning and news, and we use them as a backbone for the data in the LDP.
The BBC Archive
Most of the data available within the Linked Data Platform is created manually.
That is a suitable approach going forward, but it would be difficult to scale going
backwards.
The BBC has been broadcasting since 1922
and has accumulated a very large archive -- radio and TV programmes,
news articles, pictures, sheet music, production notes... Some of it has been
catalogued, some of it hasn't.
I won't talk about it in a lot of detail here (I'll talk about it on Friday, and
will demo it at the Semantic Web Challenge session), but we're using Linked Data
a lot for annotating archive content, as a target vocabulary
for describing our programmes and as a seed for building automated interlinking
algorithms.
http://worldservice.prototyping.bbc.co.uk
In particular we're investigating a combination of automated interlinking and crowdsourcing to
interlink large archives with the Linked Data cloud.
If you're interested you can take a look at our BBC World Service archive prototype,
experimenting with these technologies on a very large radio archive, and driven almost entirely
by Semantic Web technologies.
Challenges
We've just described various places, throughout the BBC web site, where Linked Data
is being consumed. I will now try to describe various challenges we encountered
while consuming Linked Data. Some of them are little annoyances, some of them are a little
more fundamental. I should start by saying that this is a non-exhaustive list.
Challenge 1: public endpoints
I am sure lots of you came face to face with the DBpedia 'screen of death', with cryptic error messages, especially
just before semantic web conference deadlines. DBpedia isn't the only one, of course, and it would be unfair to blame it in particular.
The main challenge is that public endpoints are shared resources, and are sometimes overloaded. So it's difficult
to build reliable applications on top of most of them.
There is actually a paper
about this problem at ISWC, as well as a new service from the OKFN monitoring the
status of SPARQL end-points.
This is a huge challenge for Linked Data. The main idea behind Linked Data is to publish and interlink data on the Web, and make
it available for other people to reuse in unexpected ways; but if you actually do that, chances are that your application
will break in unexpected ways too. And the more data providers you end up hitting, the more likely it is that your app will go down.
Mitigation
Caching
Local aggregations
Replication and syncing
Testing and monitoring
Thankfully there are ways to work around that. For example you could cache data coming from these endpoints very aggressively, or
build local aggregations, storing all the data you need for your application. That's typically what we do for BBC Music, for example,
where we just care about music artists that have been played on the BBC, which is a very small subset of all the
information DBpedia has to offer.
Another option is to replicate entire datasets, using for example RDF dumps. The challenge then is to keep track of updates, and keep
your dataset in sync.
There have been a couple of attempts at standardising sync feeds for RDF datasets, but I am not sure
of their implementation status in current triple stores. When we have to do something like that, we often end up building custom solutions,
which is not ideal.
Once you have a local store under your control, before you can use it to build highly available applications, you need to test it to death.
As the performance of SPARQL queries varies a lot depending on the query and the dataset, all distinct queries need to be load-tested independently.
Quite a significant amount of work - can it be made generic?
Building local caches, maintaining them in sync and being confident of the performance of each query issued to the store requires
a lot of work. Could such a middleware be made generic?
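As a rough sketch of the caching side of such a middleware (in-memory only here; a real deployment would more likely sit behind a shared HTTP cache or a local aggregation store):
var cache = {};
var TTL = 24 * 60 * 60 * 1000; // keep results for a day
function cachedSelect(endpoint, query, callback) {
  var key = endpoint + '|' + query;
  var hit = cache[key];
  if (hit && Date.now() - hit.time < TTL) {
    return callback(hit.data); // serve from the local cache, without touching the public endpoint
  }
  $.get(endpoint, { output: 'json', query: query }, function (data) {
    cache[key] = { time: Date.now(), data: data };
    callback(data);
  }, 'json');
}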
Challenge 2: searching and indexing
Building search features on top of Linked Data is relatively challenging as well; however, it remains
a feature most people expect to see. Another challenge is to index particular views on top of your
RDF data, in order to make your application fast enough.
Mitigation
Lots of regexp and FILTERs...
Full text search extensions in SPARQL end-points
Document store indexing the results of SPARQL queries
There are multiple ways you can deal with such issues. For search, you can use lots of regexps and FILTERs, although
that's probably going to push a lot of complexity into your SPARQL query, which might hinder performance.
Some triple stores also provide full text search extensions, which in some cases are pretty fast, but non-standard.
Another approach is to create 'canned' SPARQL queries (e.g. give me all BBC programmes featuring a particular contributor)
and to index their results in a document store sitting on top of a generic SPARQL endpoint. That way you keep the flexibility
of a SPARQL endpoint (it's easy to generate new views on your data) and you make specific sections of the data
available from a very fast index.
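For comparison, the 'regexps and FILTERs' version of search is a query along these lines, with the matching pushed into the store itself:
var searchQuery = [
  'PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>',
  'SELECT ?thing ?label WHERE {',
  '  ?thing rdfs:label ?label .',
  '  FILTER(regex(?label, "tom waits", "i"))',
  '}'
].join('\n');
// It works, but the regular expression has to be evaluated over every label in the store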
4store + ElasticSearch
That's exactly how we handle search within the World Service archive prototype. We use a 4store SPARQL endpoint, with
an ElasticSearch instance sitting on top of it. For a set of pre-defined SPARQL queries, we maintain an index in ElasticSearch,
which we can use for very fast faceted search. We ended up using the same approach for indexing most of the queries we'd otherwise issue
directly to 4store.
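The indexing step itself is conceptually simple: run a canned query and push each result into ElasticSearch over its REST API. This is only a sketch - the endpoint URLs, index names and the contributor predicate below are all illustrative:
var sparqlEndpoint = 'http://localhost:8000/sparql/';
var cannedQuery =
  'SELECT ?programme ?contributor WHERE { ?programme <http://example.org/contributor> ?contributor }';
$.get(sparqlEndpoint, { output: 'json', query: cannedQuery }, function (data) {
  data.results.bindings.forEach(function (row, i) {
    // Each result row becomes a document in the index, keyed by its position
    $.ajax({
      type: 'PUT',
      url: 'http://localhost:9200/archive/programme-contributor/' + i,
      contentType: 'application/json',
      data: JSON.stringify({
        programme: row.programme.value,
        contributor: row.contributor.value
      })
    });
  });
}, 'json');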
Challenge 3: consuming Linked Data (hmm...)
The next challenge is simply, well, to consume Linked Data.
Knowledge about Linked Data, RDF and SPARQL has increased dramatically in the last couple
of years, with more and more companies asking for it, and more and more universities
teaching it. However, a side-effect of that may be that more and more libraries assume
quite an extensive amount of RDF knowledge, and aren't easy to pick up for newcomers.
This means that, in most cases, consuming Linked Data remains something that's only accessible
to a minority.
Consuming Linked Data is (still) hard
Let me illustrate that with an example. I was recently writing an RDF tutorial.
Quite a lot of my team (and more and more people across the BBC) work with JavaScript
extensively, both server and client-side. So I thought I would work my way through a couple
of JavaScript examples. So far so good.
rdflib.js
var kb = $rdf.graph();
var fetch = $rdf.fetcher(kb);
var FOAF = $rdf.Namespace("http://xmlns.com/foaf/0.1/");
var uri = 'http://bblfish.net/people/henry/card#me';
var person = $rdf.sym(uri);
// The person URI and the document URI differ (httpRange-14 in action)
var docURI = uri.slice(0, uri.indexOf('#'));
fetch.nowOrWhenFetched(docURI, undefined, function(ok, body) {
  var friends = kb.each(person, FOAF('knows'));
  console.log(friends[0].uri);
});
The first library I stumbled on was rdflib.js, used in the Tabulator. I am a big fan of the Tabulator, so I figured
that would be my first choice. The documentation is very sparse (actually, I couldn't find any), but using the examples
in the GitHub repository I came up with the snippet above, which illustrates how to load a URI and access a simple bit of
data (one of Henry's friends). Now forget everything you know about RDF - how much of that makes sense? There's
even a bit of httpRange-14 hidden in there!
Now try the same thing with e.g. a DBpedia URI (or most other datasets in the Linked Data cloud).
It will silently fail, as CORS headers are very sparsely deployed.
rdfstore.js
rdfstore.create(function(store) {
  store.execute('LOAD <http://musicontology.com/specification/index.ttl>', function() {
    store.execute('SELECT * WHERE { ?s ?p ?o }',
      function(success, results) {
        console.log(results);
      });
  });
});
Another library I tried is rdfstore.js. It is a remarkably complete library, suitable both for client- and server-side
use. It even includes a full (optionally persistent) triple store. However, after quite a few attempts at using it client-side I had to give up.
I couldn't find any URI for which the above example worked - sometimes the CORS headers are missing (fine), but most other times
it just fails (e.g. for dbpedialite), or returns no results (e.g. for musicontology). In terms of API it relies
on SPARQL knowledge, which is interesting (having to learn one standard query language rather than a library-specific API is probably
better). But it does require SPARQL knowledge, so it is only accessible to a minority of developers.
Another example of a JavaScript API that mostly only requires some SPARQL knowledge and is very complete is rdfQuery.
What I ended up doing...
var sparql = "http://my-rww-triple-store.org/sparql";
// A follow-up SELECT over the loaded data (any SPARQL query would do here)
var query = encodeURIComponent("SELECT * WHERE { ?s ?p ?o } LIMIT 10");
$.post(sparql, "output=json&query=LOAD+<http://dbpedialite.org/titles/Irun>", function (load_data) {
  $.post(sparql, "output=json&query=" + query, function (data) {
    console.log(data['results']['bindings']);
  });
});
In the end, I just used jQuery and did everything through SPARQL: issue a SPARQL Update
query to a read-write endpoint to load a resource, and then query the endpoint.
The most depressing thing is that I do know about RDF and SPARQL, which is probably why I
can do something like that. I wonder how we can expect anyone not familiar with RDF and SPARQL to be
able to consume Linked Data.
EasyRDF
$foaf = new EasyRdf_Graph("http://njh.me/foaf.rdf");
$foaf->load();
$me = $foaf->primaryTopic();
echo "My name is: ".$me->get('foaf:name')."\n";
What's frustrating is that, as a community, we realised ages ago that this was
an issue, and there have been numerous attempts to solve it, mostly through libraries abstracting
the triple-based nature of RDF and mapping it to objects or to other paradigms - for example ActiveRDF in Ruby, SuRF in Python,
and EasyRdf in PHP. However, EasyRdf is the only one of these three that is still actively maintained. I am not entirely
sure why that is, but my feeling is that we grew more confident as a community for a variety of
reasons and put our focus on tools that require RDF and SPARQL knowledge, rather than on making
them accessible outside of our community. Sadly this has the effect of alienating a large part of
the web community.
JSON-LD
{
  "@context": "http://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
At this point I should also mention JSON-LD, which enables RDF data to 'look like'
proper JSON. Most of the translation layer (how to resolve all these keywords to URIs) can be held
in a separate 'context', which adds minimal overhead compared to how the data would be represented in plain JSON.
It means that JSON-LD feeds can be parsed as normal JSON and still make sense, and can be mapped to RDF if need be (i.e.
to follow links to other sources of data or to resolve vocabularies).
I truly hope that with JSON-LD now almost being a Recommendation, we'll see more of it on the web, and that it will
make consuming Linked Data easier for people familiar with web development but not necessarily with RDF and
related technologies. There is a small risk, though, of something similar to what happened with RDF/XML: applications parsing it as
plain JSON, the underlying feed changing with no actual changes in the RDF, and applications breaking. We've seen that
happen a lot at the BBC with RDF/XML. We will
need to manage this risk by building simple tools that are aware of the mapping to RDF.
JSON-LD is actually at the core of the approach we're planning for the LDP: JSON-LD feeds generated through SPARQL, which are
used 'as JSON' to build the user-facing pages.
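To show how little RDF knowledge this demands, consuming a feed like the one above as plain JSON is just the following (the feed URL is a placeholder):
$.getJSON('http://example.org/people/john-lennon.jsonld', function (person) {
  console.log(person.name);    // "John Lennon"
  console.log(person.spouse);  // a URI you can dereference for more Linked Data
});
A JSON-LD-aware tool can expand the same document back to full URIs via the '@context', but nothing forces the consumer to do so.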
RDF API Editor's Draft
var people = data.getProjections("rdf:type", "foaf:Person");
A couple of years ago some work was done on defining a higher-level API for interacting with RDF, for
example in JavaScript. I am not sure what the status of it is these days, or whether any implementations are
currently available.
Challenge 4: RDF data mining
A wide range of data is available in RDF
Few OSS tools available to do data mining over RDF data...
... or machine learning
Or bridges to existing libraries (Weka, Mahout, scikit-learn)
rdfspace
from rdfspace.space import Space
space = Space('influencedby.nt', rank=50)
space.similarity('http://dbpedia.org/resource/JavaScript', 'http://dbpedia.org/resource/ECMAScript')
space.similarity('http://dbpedia.org/resource/Albert_Camus', 'http://dbpedia.org/resource/JavaScript')
We do quite a lot of data mining over RDF data at the BBC, for example to derive similarities
between various topics or to do concept tagging of our programmes. The rdfspace package generates a vector space
from large RDF dumps, which you can then use to compute very fast similarity measures. It does that by performing
an SVD on the adjacency matrix of the input RDF graph.
Conclusion
We use lots of Linked Data at the BBC
But there are still lots of challenges to tackle to make its consumption easier, namely...
Generic caching/replication layers
Better support for search and indexing
Accessible libraries for dealing with RDF data
Better tools to learn from or mine RDF data
Thank you!
Photo credits:
http://www.flickr.com/photos/andyarmstrong/4402416306/
http://www.flickr.com/photos/nicecupoftea/8579975238/
http://www.flickr.com/photos/11561957@N06/5202870020/
http://www.flickr.com/photos/hubmedia/2141860216/
http://www.flickr.com/photos/allison_mcdonald/7604871594
http://www.flickr.com/photos/aayars/4072755936/