Journalism Labs - Jonathan Austin

Muddy Boots

Jonathan Austin - Wed, 10 Dec 2008 12:00:00 +0000

We've been experimenting with the Semantic Web using a prototype called Muddy Boots. The question that we're trying to answer is: Can a computer reliably identify the people and organisations in news stories? This is still work in progress, but we have a prototype and an API that you're welcome to explore.

When a journalist refers to someone in a news story they usually give the person's full name and enough information so the reader can understand who they are talking about. If the full name is ambiguous they may have to add a title or give an explanation about who the person is. But sometimes, especially for household names, the reader is expected to infer the identity of the person from the context of the story and by applying a reasonable level of background knowledge.

Whilst a human reader takes for granted their abilities to pick up journalists' cues and understand context, a computer has to be programmed explicitly. It is difficult to design a system that can identify people from text and disambiguate them. It is even harder to build a system that meets editorial standards of accuracy. However, in theory, it should be possible. So we've been experimenting to develop an approach that could lead to a system that reliably identifies people (and organisations) in stories and marks up their textual names with semantic information. There are four key challenges:

Build working prototypes
Write tests for the prototypes that express editorial standards
Refine the prototypes to reach defined levels of reliability
Express the information usefully through semantic mark-up

Prototypes

The prototypes are ready to share with you. They have been built for us by a company called Rattle Research based in Sheffield. They were a successful participant in the BBC Innovation Labs.

There are two systems available. Both are based on DBpedia (the structured version of Wikipedia), which provides the controlled vocabulary of people and organisations. In these prototypes, therefore, each person in a news story is described by their Wikipedia entry. Wikipedia is potentially a good controlled vocabulary source for news because it has wide scope and is open and dynamic. It is certainly useful for prototyping.

The first method is called "Muddy". It works by extracting proper names from the story text and then matching them to entries in DBpedia. If a term is ambiguous, the system uses various strategies based on Wikipedia's disambiguation pages and the structure of DBpedia to resolve the conflict. More information can be found on Rattle's website.
The second method is called "conText". It was initially proposed by Chris Sizemore and is described in detail in his blog post. This method uses search technology (Google and Lucene) to enhance the results further.
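
To give a feel for the first method, here is a minimal sketch of the general idea: pull out likely proper names, look them up in a vocabulary, and use the words of the story to choose between ambiguous candidates. This is not Rattle's code. The tiny CANDIDATES table stands in for DBpedia, and its identifiers, context words and the function names are all invented for illustration.

```python
import re

# Hypothetical stand-in for DBpedia: each surface name maps to candidate
# entries, each with a few context words. All identifiers here are invented.
CANDIDATES = {
    "Gordon Brown": [
        {"id": "example:Gordon_Brown_(politician)",
         "context": {"prime", "minister", "labour", "downing"}},
        {"id": "example:Gordon_Brown_(rugby_player)",
         "context": {"rugby", "scotland", "lions"}},
    ],
}

# Two or more capitalised words in a row: a crude proxy for a proper name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_names(text):
    """Return the distinct candidate proper names found in the text."""
    return sorted(set(NAME_PATTERN.findall(text)))


def disambiguate(name, text):
    """Pick the candidate whose context words overlap most with the story."""
    candidates = CANDIDATES.get(name, [])
    if not candidates:
        return None  # name is not in the vocabulary
    words = set(re.findall(r"[a-z]+", text.lower()))
    best = max(candidates, key=lambda c: len(c["context"] & words))
    return best["id"]


def identify(text):
    """Map each extracted name to a vocabulary identifier, or None."""
    return {name: disambiguate(name, text) for name in extract_names(text)}
```

Calling identify("Gordon Brown played rugby for Scotland.") picks the rugby player, while a story that mentions Downing Street picks the politician. A real system needs far better name extraction and much richer context than this word-overlap count.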

The good news for anyone who is not an expert in term or knowledge extraction is that Rattle implemented both methods behind a common abstract API. In effect we can treat both methods like black boxes. We don't need to know how they work to use them and evaluate their ability to identify people.
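
The black-box idea can be sketched as follows. This is only an assumption about the shape such an interface might take, not Rattle's actual API: the class names, the identify method and the Entity fields are all hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str        # the name as it appears in the story
    identifier: str  # controlled-vocabulary id, e.g. a Wikipedia page title


class EntityIdentifier(ABC):
    """Hypothetical common interface that every method implements."""

    @abstractmethod
    def identify(self, story_text: str) -> list[Entity]:
        ...


class FixedListIdentifier(EntityIdentifier):
    """Toy method: reports names from a fixed list that occur in the text."""

    def __init__(self, known: dict[str, str]):
        self.known = known

    def identify(self, story_text: str) -> list[Entity]:
        return [Entity(n, i) for n, i in self.known.items() if n in story_text]


class NullIdentifier(EntityIdentifier):
    """Toy method that never finds anyone; a useful baseline."""

    def identify(self, story_text: str) -> list[Entity]:
        return []


def count_identified(system: EntityIdentifier, stories: list[str]) -> int:
    """Evaluation code depends only on the interface, not on the method."""
    return sum(len(system.identify(story)) for story in stories)
```

Because count_identified only calls identify, any new method that implements the same interface can be dropped in and evaluated without changing the evaluation code.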

In addition, Rattle implemented some visualisations so that we can get a feel for how the systems work. Below are some sample stories that have had people and organisations identified. You can also submit additional stories by following the final link.

Testing

It doesn't take long to see that neither prototype is perfect. Sometimes they miss people and sometimes they get them wrong. But that is the point of the research. How good are they really and can they be improved? Our next step is to measure them against our editorial standards.

So we are currently working with another Innovation Labs entrant, ThinkTankMaths, to develop some tests. We're going to compare the performance of both systems (and any system that implements Rattle's API) to the performance of human beings. We will also be proposing measures that evaluate the systems from an editorial point of view. For example, is it editorially more acceptable for the system to fail to spot the name of a cat owner whose cat gets stuck in a tree than the name of the Prime Minister? And what should the system do when the name of that cat owner is Gordon Brown?
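
One way to picture such a test, purely as an illustration: treat a human's mark-up of a story as the reference, score a system by precision and recall against it, and then weight the misses so that overlooking an editorially important person costs more. The measures we end up proposing may look quite different. The function names, identifiers and weights below are invented.

```python
def precision_recall(system, human):
    """Compare two sets of (name, identifier) pairs for one story.

    precision: share of the system's answers that the human agrees with.
    recall: share of the human's answers that the system found.
    """
    correct = system & human
    precision = len(correct) / len(system) if system else 1.0
    recall = len(correct) / len(human) if human else 1.0
    return precision, recall


def weighted_recall(system, human, weights, default=1.0):
    """Recall in which each entity counts in proportion to its weight."""
    total = sum(weights.get(ident, default) for _, ident in human)
    found = sum(weights.get(ident, default) for _, ident in system & human)
    return found / total if total else 1.0


# Invented example: the human marks up two people in a story.
human = {("Gordon Brown", "prime_minister"), ("Jane Smith", "cat_owner")}
# Illustrative weight only: missing the Prime Minister counts nine times
# as much as missing anyone else.
weights = {"prime_minister": 9.0}

finds_pm = {("Gordon Brown", "prime_minister")}
finds_owner = {("Jane Smith", "cat_owner")}
```

Both toy systems have the same plain recall, since each finds one of the two people, but weighted_recall separates them sharply: the system that misses the Prime Minister scores far lower. Choosing the weights is an editorial judgement rather than a technical one.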

We will post more about this and our initial findings in the New Year. In the meantime, we'd like to hear your thoughts, so feel free to have a look at the API and the prototypes.

Results of the BBC News Story Links Trial

Jonathan Austin - Wed, 03 Dec 2008 12:00:00 +0000

Last August, the BBC trialled a new service called Apture on a hand-picked selection of news stories (e.g. this story on the Cassini mission). Apture allows journalists to add links to the text of published stories and displays the link content in popover windows within the story page. This caused a bit of a buzz at the time on the BBC Internet blog. So we thought we'd start our new blog by sharing our findings and setting out what we're going to do next in this area.

Throughout the trial, we asked for feedback from the audience using a link at the bottom of each Apture window. This opened a pop-up window where people could state whether they found the service useful and then leave a text comment.

Your Feedback

Over 1200 people used this to give us feedback during the two weeks, which we think is quite a high response for such a limited piece of work. Over 90% of those that responded said that they found Apture useful. This was an unexpected result, even considering that the opt-in nature of the trial favoured early adopters.

854 people left a positive comment. Whilst over half of those gave generic praise such as "I liked it", there were three key themes that dominated the comments:

177 people stated that they appreciated being able to explore background material without leaving the page or searching.
119 people said that they appreciated the editorial content in the links and made a specific reference to the fact the content was added manually.
98 people approved of the design of the system. This included that links were inline and could be turned off and on.

We also received 306 negative comments, covering a diverse range of concerns. The blog discussion focussed on the issues of simple hyperlinks, accessibility and web citizenship, and we received 32 comments related to these issues. The top three concerns raised were:

65 people objected to the use of icons in the inline links. In particular, the use of a "W" to indicate a link to Wikipedia caused problems. Apture have now changed this, but there are still questions about how to signal the content of links.
55 people were unsure whether the BBC should link to Wikipedia, raising issues of accuracy, perhaps echoing the controversy that followed the 2005 evaluation by Nature.
50 people found that security restrictions caused a problem. This was often down to a local firewall that allowed access to the BBC but blocked third-party content. Some people also did not have Flash installed on their machine. This is a timely reminder that there can be problems when people move to rich AV and mashups.

Overall, we thought there was a lot that was interesting about these results. Clearly many people appreciate high quality, crafted background links in news stories. And, for the right kind of background information, some people appreciate being able to see it without having to navigate away from the page. The key challenges for us are two-fold:

Can we deliver in-page navigation using baked-in hyperlinks, maintain high standards of accessibility and serve those who prefer traditional linking behaviour?
Can we provide journalists with the right tools so they can be at their most creative editorially and use their skills to provide the most relevant, interesting links?

So what are we doing next?

We've decided to go back to basics and look again at the fundamentals of linking in news stories. When the BBC News website started in 1997 we placed background links to the side of the article instead of inline, for technical and user experience reasons. We haven't revisited that decision in any significant way until now.

In 2009, we're going to be refreshing how we mark up our stories. We will discuss this project in more detail here later, but it will give us a rare opportunity to improve the way we express links. We want to get that right. In the meantime, we're talking to Apture to explore whether it's possible to extend their product to deliver the functionality you liked and to answer your concerns.

There is plenty more about this trial that we could share, and we'd really like to hear your thoughts and to know what interests you the most. So leave a comment and we'll try to answer your questions.

Sample Stories