Thoughts from our speakers: The Guardian’s Martin Belam on linked data

In the final two weeks before news:rewired we’ll be publishing some thoughts from our speakers on the subject of their session. We asked Guardian information architect Martin Belam about linked data and how it relates to his own work.

Martin’s session:

Linked data and the semantic web

  • An introduction to linked data and the semantic web – what should you know and what benefits can linked data offer journalists? A session looking at where media on the web is headed and what skills future journalists and communicators will need.

With: Simon Rogers, datablog/datastore editor, the Guardian; Martin Moore, director, Media Standards Trust; Martin Belam, information architect, the Guardian; Silver Oliver, senior information architect, BBC.

Earlier this year I helped organise a summit featuring people from a lot of media companies, to discuss how linked open data collaboration between us might be of mutual benefit. At the time I blogged about it for the Guardian – ‘What is the value of linked data to the news industry?’, saying:

“With the news industry facing structural change and a global advertising downturn, there is naturally an emphasis on whether any new tools and techniques can “make more money”. One way of making more money is in fact to “spend less money”. There may well be an economy of scale in agreeing to some linked data principles.”

There are certain areas, like politics and sports events, where we all rely on the same basic set of facts – who is standing in which constituency, or which teams feature in which league. I think we potentially waste a lot of money by all carefully building the same databases.

The Guardian has also worked towards integrating our content with the wider web using linked data principles. MusicBrainz is a wiki-style service that provides a unique ID for every music artist, so 2cd475bb-1abd-40c4-9904-6d4b691c752c represents Franz Liszt and 2aaf7396-6ab8-40f3-9776-a41c42c8e26 represents LCD Soundsystem. We have added these codes to the metadata of our articles, so you can now query the API using them and be sure that you are getting disambiguated content about that specific artist.
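As a rough illustration of querying by artist ID, the Python sketch below builds a search URL filtered on a MusicBrainz reference. The endpoint address, the `reference`, `format` and `api-key` parameter names, and the `musicbrainz/<id>` value format are assumptions made for this example, not details confirmed in this post, so check the Open Platform documentation before relying on them. The helper name `build_artist_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Open Platform documentation.
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def build_artist_query(musicbrainz_id: str, api_key: str,
                       response_format: str = "json") -> str:
    """Build a search URL restricted to content tagged with one artist ID.

    The parameter names used here are assumptions for illustration.
    """
    params = {
        # Reference type and ID joined by a slash (assumed convention).
        "reference": f"musicbrainz/{musicbrainz_id}",
        # "json" or "xml", the two formats mentioned in this post.
        "format": response_format,
        "api-key": api_key,
    }
    # urlencode percent-encodes the slash in the reference value.
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# The ID quoted above for Franz Liszt.
FRANZ_LISZT = "2cd475bb-1abd-40c4-9904-6d4b691c752c"
url = build_artist_query(FRANZ_LISZT, api_key="YOUR-API-KEY")
print(url)
```

Because the query is keyed on an ID rather than a name, it does not depend on how an artist's name happens to be spelled in any given article, which is the disambiguation benefit described above.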

MusicBrainz IDs and ISBNs are just two types of ‘reference’ we maintain within our content database, and we hope that they are the first of many that we will be able to make accessible to the public through the Open Platform.

Our API returns data in JSON and XML formats, and there are strong feelings in the linked open data community about which formats are appropriate. Some argue that unless you are using formats and technologies like RDFa, OWL and SPARQL, you are not really part of the community. I recently addressed this point in an article for the Nodalities blog, run by semantic technology company Talis:

“We try to work in a lightweight and agile way, and providing the data in this format was the simplest way to meet our immediate requirements. We are trying to concentrate on making more metadata available. If we were to decide to invest in triple-stores and implement a SPARQL endpoint first, then I’d wager that we would still be waiting to dip our toe into the water. I’m entirely agnostic about formats myself. What I think is most important is that we provide consistent, RESTful, predictable, persistent hooks into content, in as many ways as possible, with the right licence for re-use.”
