<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/">
  <channel>
    <language>en</language>
    <title>Technology + Creativity at the BBC Feed</title>
    <description>Technology, innovation, engineering, design, development.
The home of the BBC's digital services.</description>
    <pubDate>Mon, 04 Oct 2021 15:51:46 +0000</pubDate>
    <generator>Zend_Feed_Writer 2 (http://framework.zend.com)</generator>
    <link>https://www.bbc.co.uk/blogs/internet</link>
    <atom:link rel="self" type="application/rss+xml" href="https://www.bbc.co.uk/blogs/internet/rss"/>
    <item>
      <title>Fighting misinformation: An embedded media provenance specification</title>
      <description><![CDATA[We are looking at technology solutions to the problem of news disinformation. Here is an overview of a draft version of an embedded media provenance specification, released by the Coalition for Content Provenance and Authenticity (C2PA).]]></description>
      <pubDate>Mon, 04 Oct 2021 15:51:46 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/5d371e1b-54be-491f-b8ee-9e354bafb168</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/5d371e1b-54be-491f-b8ee-9e354bafb168</guid>
      <author>Charlie Halford</author>
      <dc:creator>Charlie Halford</dc:creator>
      <content:encoded><![CDATA[<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p09xqbl3.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p09xqbl3.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p09xqbl3.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p09xqbl3.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p09xqbl3.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p09xqbl3.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p09xqbl3.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p09xqbl3.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p09xqbl3.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p>For the last few years, the BBC has had a project running in its technology division looking at technology solutions to various problems in the domain of news disinformation. Part of that effort, called <a href="https://www.originproject.info/">Project Origin</a>, is working to make it easier to understand where the news you consume online really comes from so that you can decide how credible it is. You can find some history on this in <a href="https://www.bbc.co.uk/mediacentre/articles/2021/project-origin-one-year-on">Laura&nbsp;Ellis' excellent&nbsp;"Project&nbsp;Origin: one year on"&nbsp;blog</a>.</p>
<p>Part of Project Origin has been working in collaboration with major media and tech companies, most recently with the&nbsp;<a href="https://c2pa.org">Coalition&nbsp;for Content Provenance and Authenticity&nbsp;(C2PA)</a>,&nbsp;which it helped form. This group recently released <a href="https://c2pa.org/public-draft/">a draft&nbsp;version of an embedded media provenance specification</a>. This spec tackles the problem of missing trusted provenance information in images, video and audio consumed on the internet - for example, where a video of elections in one country from 10 years ago is presented as video of recent elections in another.&nbsp;This is&nbsp;an overview of&nbsp;how that specification is intended to work.</p>
<h2 data-usually-unique-id="942583191341384741354902">Embedding</h2>
<p>The C2PA specification works primarily by defining mechanisms for embedding additional data into media assets to indicate their authentic origin. An essential aspect of this data is&nbsp;"assertions"&nbsp;- statements about when and where media was produced. The embedded information is then digitally signed so that a consumer knows who is making the statements.</p>
<p>While the C2PA specification also includes mechanisms for locating this provenance data remotely&nbsp;(e.g.&nbsp;hosted somewhere on the internet), I'll focus on the use case where all data is embedded directly in the asset itself.</p>
<h2 data-usually-unique-id="555473361838427318559989">Data&nbsp;model</h2>
<p>The C2PA specification uses a few different mechanisms for embedding and storing data. Embedding is done with <a href="https://www.iso.org/standard/73604.html">JUMBF</a>,&nbsp;a container format,&nbsp;and structured data storage is done with a combination of <a href="https://www.w3.org/TR/json-ld11/">JSON-LD</a>&nbsp;and&nbsp;<a href="https://www.rfc-editor.org/rfc/rfc8949">CBOR</a> (a&nbsp;binary serialisation format based on the JSON data model).</p>
<p><strong>Container - the&nbsp;"Manifest&nbsp;Store"</strong></p>
<p>Similar to <a href="https://wwwimages2.adobe.com/content/dam/acom/en/devnet/xmp/pdfs/XMP%20SDK%20Release%20cc-2016-08/XMPSpecificationPart3.pdf">XMP</a>,&nbsp;the C2PA specification defines&nbsp;several&nbsp;embedding points in a selection of media formats to place a&nbsp;"Manifest&nbsp;Store" in JUMBF format, which is the container for the various pieces of provenance data. Once you've identified where and how a manifest store is embedded in your favourite media format, most of the specification is format-agnostic.</p>
<blockquote>
<p><strong>What is JUMBF?</strong></p>
<p>JUMBF&nbsp;(JPEG&nbsp;universal metadata box format) is a binary container format initially designed for adding metadata to JPEG files, and it&rsquo;s now used in other file formats too. It is structurally similar to the <a href="https://en.wikipedia.org/wiki/ISO/IEC_base_media_file_format">ISO&nbsp;Base Media File Format</a>, an extensible container format that is used for many different types of media files. JUMBF&nbsp;"superboxes"&nbsp;are boxes that only contain other boxes. JUMBF&nbsp;"content&nbsp;type" boxes contain actual payload data, the serialisation of which should match the advertised content type of the box. All boxes have labels, which allow boxes to be addressed and understood when parsing. C2PA uses JUMBF in all the media formats it supports to provide the container format for the Manifest, Claims, Assertions, Verifiable Credentials and Signatures.</p>
</blockquote>
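<p>To make the box structure concrete, here is a minimal Python sketch (illustrative only, not part of the JUMBF or C2PA specs) of parsing the shared length/type box header convention that JUMBF inherits from the ISO base media family:</p>

```python
import struct

def parse_boxes(data: bytes):
    """Parse top-level boxes from a JUMBF/ISO BMFF-style byte stream.

    Each box starts with a 4-byte big-endian length (which includes the
    8-byte header) and a 4-byte ASCII type. A length of 1 means an
    8-byte extended length follows; 0 means the box runs to the end of
    the stream. A minimal sketch, not a full parser.
    """
    boxes = []
    offset = 0
    while offset + 8 <= len(data):
        length, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if length == 1:  # extended 64-bit length follows the type
            (length,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif length == 0:  # box extends to end of stream
            length = len(data) - offset
        payload = data[offset + header : offset + length]
        boxes.append((box_type.decode("ascii"), payload))
        offset += length
    return boxes

# Build a toy "superbox" wrapping one content box, then parse it back.
inner = struct.pack(">I4s", 8 + 5, b"json") + b"hello"
outer = struct.pack(">I4s", 8 + len(inner), b"jumb") + inner
assert parse_boxes(outer) == [("jumb", inner)]
assert parse_boxes(inner) == [("json", b"hello")]
```

<p>A real parser would also recurse into superboxes and read each box's label; this sketch only walks a single level.</p>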
<p>Each piece of embedded provenance data is called a&nbsp;&ldquo;Manifest&rdquo;.&nbsp;A manifest contains a part of the provenance data about the current asset, or the assets it was made from. Because an asset might have been created from multiple original sources or have been processed multiple times, we will often need to store several manifests to understand the complete history of the current asset.</p>
<p>Manifests are located in the&nbsp;"Manifest&nbsp;Store", which is a JUMBF superbox. The last manifest in the store is the&nbsp;"Active&nbsp;Manifest", which is the provenance data about the current asset, and it's the logical place for validation to start. The other manifests are the data for the&nbsp;"ingredients"&nbsp;of the active manifest - i.e. the assets that were part of the creation of the active manifest. This is one of the key features of C2PA: each asset provides a graph of the history of editing and composition actions that went into the active asset, exposing as little or as much as the asset publisher wants.</p>
<p>Each manifest within the store is again its own JUMBF superbox. A manifest then consists of: a&nbsp;"Claim",&nbsp;an&nbsp;"Assertion&nbsp;Store", a set of&nbsp;"<a href="https://www.w3.org/TR/vc-data-model/">W3C&nbsp;Verifiable Credentials</a>" and a&nbsp;"Signature".&nbsp;Manifests are signed by an actor&nbsp;(the&nbsp;&ldquo;Signer&rdquo;)&nbsp;whose credential identifies them to the user validating or consuming them.</p>
</div>
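<p>The manifest store walk described above can be sketched in Python. The <code>label</code> and <code>ingredients</code> fields here are hypothetical simplifications of the spec's data model, used only to show the shape of the traversal from the active manifest back to the original sources:</p>

```python
def provenance_chain(manifest_store):
    """Walk a (hypothetical) manifest store: the last manifest is the
    active one; its 'ingredients' name the labels of earlier manifests
    whose assets went into the current one. Returns labels ordered
    from original sources to the active asset."""
    by_label = {m["label"]: m for m in manifest_store}
    active = manifest_store[-1]          # last manifest = active manifest
    chain, stack = [], [active["label"]]
    while stack:
        label = stack.pop()
        chain.append(label)
        stack.extend(by_label[label].get("ingredients", []))
    return list(reversed(chain))

# Illustrative store: a camera original, an edit, then the published asset.
store = [
    {"label": "urn:c2pa:camera-original", "ingredients": []},
    {"label": "urn:c2pa:edit-1", "ingredients": ["urn:c2pa:camera-original"]},
    {"label": "urn:c2pa:published", "ingredients": ["urn:c2pa:edit-1"]},
]
print(provenance_chain(store))
# ['urn:c2pa:camera-original', 'urn:c2pa:edit-1', 'urn:c2pa:published']
```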
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p09x3btj.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p09x3btj.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p09x3btj.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p09x3btj.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p09x3btj.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p09x3btj.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p09x3btj.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p09x3btj.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p09x3btj.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Diagram of a Manifest box, without any VCs</em></p></div>
<div class="component prose">
    <p><strong>Assertions</strong></p>
<p>Assertions are the statements being made by the signer&nbsp;of a manifest. They are the bits of provenance data that consumers of that data are being asked to trust, for example,&nbsp;the date of image capture, the geographical location, or the publisher of a video.</p>
<p>In the spec, each assertion has its own data model. Some are published as&nbsp;"Standard&nbsp;Assertions" in the spec, some are adoptions of existing metadata specifications such as&nbsp;<a href="https://www.cipa.jp/std/documents/e/DC-008-2012_E.pdf">EXIF</a>,&nbsp;<a href="https://iptc.org/standards/photo-metadata/">IPTC</a>&nbsp;and&nbsp;<a href="https://schema.org/">schema.org</a>,&nbsp;and it is expected that implementers will extend the spec by defining their own as well.</p>
<blockquote>
<p><strong>Media metadata isn't new</strong></p>
<p>For example, the EXIF standard is nearly universal in digital photographs, used to record location and camera settings. The fundamentally new thing that C2PA does&nbsp;is allow you to cryptographically bind that metadata&nbsp;(with&nbsp;hashes) to a particular media asset and then sign it&nbsp;with the identity credential&nbsp;of the origin of that data, ensuring that the result is tamper-evident and provable.</p>
</blockquote>
<p>Assertions are contained in their own JUMBF Content Type Box in the assertion store superbox&nbsp;and are serialised in the format defined in the spec for that assertion. The C2PA-defined assertions are stored as CBOR, while most adopted assertions from other standards are JSON-LD.</p>
<p>Here's an example of an&nbsp;"Action"&nbsp;assertion&nbsp;(in&nbsp;<a href="https://www.rfc-editor.org/rfc/rfc8949#name-diagnostic-notation">CBOR&nbsp;Diag</a>) which tells you what the signer&nbsp;thinks was done in creating the active asset:</p>
</div>
<div class="component code">
    <pre class="code__pre br-box-subtle"><code class="code__code">{
  &quot;actions&quot;: [
    {
      &quot;action&quot;: &quot;c2pa.filtered&quot;,
      &quot;when&quot;: 0(&quot;2020-02-11T09:00:00Z&quot;),
      &quot;softwareAgent&quot;: &quot;Joe&#039;s Photo Editor&quot;,
      &quot;changed&quot;: &quot;change1,change2&quot;,
      &quot;instanceID&quot;: 37(h&#039;ed610ae51f604002be3dbf0c589a2f1f&#039;)
    }
  ]
}</code></pre>
</div>
<div class="component prose">
    <p>And here's an EXIF one (in JSON-LD) that contains location data:</p>
</div>
<div class="component code">
    <pre class="code__pre br-box-subtle"><code class="code__code">{
  &quot;@context&quot; : {
    &quot;exif&quot;: &quot;http://ns.adobe.com/exif/1.0/&quot;
  },
  &quot;exif:GPSLatitude&quot;: &quot;39,21.102N&quot;,
  &quot;exif:GPSLongitude&quot;: &quot;74,26.5737W&quot;,
  ...
}</code></pre>
</div>
<div class="component prose">
    <p>One assertion is critical: the binding, which ties the claim to a specific asset; the spec requires it. The binding ensures that claims cannot be applied to any asset other than the one they were signed against, which helps the consumer trust that the C2PA data wasn't tampered with between publisher and consumer. There are currently two types of "hard bindings" available: a simple hash binding over a range of bytes in a file, and a more complex one intended for ISO BMFF-based assets, which can use their box format to reference the specific boxes that should be hashed.</p>
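<p>The idea behind a simple hash binding can be sketched as follows. The exclusion-range model and field names here are illustrative stand-ins for the spec's exact structures; the key point is that the region holding the manifest store itself must be excluded from the hash, since it cannot contain its own digest:</p>

```python
import hashlib

def hash_binding(asset: bytes, exclusions):
    """Sketch of a simple hard binding: hash the asset's bytes while
    skipping excluded (start, length) ranges - e.g. where the manifest
    store is embedded. Field names are illustrative, not the spec's."""
    h = hashlib.sha256()
    pos = 0
    for start, length in sorted(exclusions):
        h.update(asset[pos:start])
        pos = start + length
    h.update(asset[pos:])
    return {"alg": "sha256", "exclusions": exclusions, "hash": h.hexdigest()}

# Toy asset: header, an embedded manifest store region, then pixel data.
asset = b"HEADER" + b"<manifest-store>" + b"PIXELDATA"
binding = hash_binding(asset, [(6, len(b"<manifest-store>"))])

# The digest covers only HEADER + PIXELDATA; editing pixel bytes breaks it.
tampered = asset.replace(b"PIXELDATA", b"PIXELDATB")
assert hash_binding(tampered, [(6, 16)])["hash"] != binding["hash"]
```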
<p><strong>Claim</strong></p>
<p>The claim in a manifest exists to pull together the assertions being made, any "redactions" (removals of previous provenance data for privacy reasons), and some extra metadata about the asset, the software that created the claim, and the hashing algorithm used. Assertions are linked by their reference in the assertion store and a hash. The claim itself is another JUMBF box, serialised as a CBOR structure. This is the thing that is signed, and it provides a location to find the signature itself.</p>
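<p>As an illustration, here is a hypothetical claim builder. The field names and JSON serialisation are stand-ins for the spec's CBOR layout; the point is that each assertion is referenced by its label plus a hash of its bytes, so the signature over the claim transitively covers the whole assertion store:</p>

```python
import hashlib
import json

def make_claim(assertions, generator="Example Tool 1.0"):
    """Hypothetical claim builder: reference each assertion in the
    assertion store by label plus a hash of its serialised bytes.
    Structure is illustrative, not the spec's exact CBOR model."""
    refs = []
    for label, payload in assertions.items():
        data = json.dumps(payload, sort_keys=True).encode()
        refs.append({"url": f"self#jumbf={label}",
                     "hash": hashlib.sha256(data).hexdigest()})
    return {"claim_generator": generator, "alg": "sha256", "assertions": refs}

assertions = {
    "c2pa.actions": {"actions": [{"action": "c2pa.filtered"}]},
    "stds.exif": {"exif:GPSLatitude": "39,21.102N"},
}
claim = make_claim(assertions)
print(len(claim["assertions"]))  # 2
```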
<p><strong>Signature</strong></p>
<p>The signature in a manifest is a <a href="https://datatracker.ietf.org/doc/html/rfc8152">COSE</a> CBOR structure that signs the contents of the claim box. COSE is the CBOR version of the JOSE framework of specs, which includes JWT/JWS. The signature is produced using the credentials of the signer. The signer is the primary point of trust in the C2PA Trust Model, and consumers are expected to use the signer's identity to help them make a trust decision on the claim's assertions.</p>
<p>The only currently supported credentials for producing the signature are X.509 certificates. The specification provides a profile that certificates are expected to adhere to (including key usages such as &ldquo;id-kp-emailProtection&rdquo;, which is a placeholder). The specification does not include any requirements on how validators and consumers assemble lists of trusted issuers, as it is expected that an ecosystem of issuers will develop around this specification. Instead, it simply requires that validators maintain or reference such a list of trust anchors. Alternatively, they can put together a trusted list of individual entity certificates provided out-of-band of the trust anchor list.</p>
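<p>The sign-and-verify flow can be sketched as below. Note that this uses a symmetric HMAC from the Python standard library purely as a stand-in for COSE's asymmetric X.509-based signatures, so the example runs without a cryptography dependency; the shape of the flow (sign the claim bytes, verify them against the signer's credential) is the point:</p>

```python
import hashlib
import hmac

def sign_claim(claim_bytes: bytes, key: bytes) -> bytes:
    """Stand-in signer: real C2PA uses a COSE signature made with the
    signer's X.509 credential, not an HMAC."""
    return hmac.new(key, claim_bytes, hashlib.sha256).digest()

def verify_claim(claim_bytes: bytes, signature: bytes, key: bytes) -> bool:
    """Validator side: recompute and compare in constant time."""
    return hmac.compare_digest(sign_claim(claim_bytes, key), signature)

key = b"signer-credential-stand-in"
claim_bytes = b'{"claim_generator": "Example Tool 1.0"}'
sig = sign_claim(claim_bytes, key)

assert verify_claim(claim_bytes, sig, key)          # untouched claim passes
assert not verify_claim(claim_bytes + b"x", sig, key)  # any change fails
```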
<h2>What now?</h2>
<p>This is an overview and omits both the detail required to produce C2PA manifests and the breadth of some of the other components of the specification (e.g. ingredients, the use of Verifiable Credentials, the concept of assertion metadata, timestamping etc). I'd love to produce a worked example of how to extract and validate a C2PA manifest from an asset; watch out for that in the future. I will highlight an <a href="https://github.com/numbersprotocol/pyc2pa">open-source implementation of C2PA available in Python</a>, and I know of other implementations in the works, too.</p>
<p>At the BBC, we can't wait for this specification to develop and gain adoption. We'd love to see it supported in production and distribution tools, web browsers, and on social media and messaging platforms. We really think it can make a difference to some of the <a href="https://www.bbc.co.uk/news/world-asia-india-47878178">harms done by mis- and disinformation</a>.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Introducing machine-based video recommendations in BBC Sport</title>
      <description><![CDATA[Rob Heap explains how algorithm-based recommendations are saving time in production]]></description>
      <pubDate>Tue, 14 Sep 2021 12:07:03 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/f74ea410-5ec4-4add-9daa-a29d25176ccd</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/f74ea410-5ec4-4add-9daa-a29d25176ccd</guid>
      <author>Robert Heap</author>
      <dc:creator>Robert Heap</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
    <p>From this week we are adding a new feature to our short form video pages on the BBC Sport website.</p>
<h4>Related clips</h4>
<p>On every video page in BBC Sport you&rsquo;ll see a related links section. This is usually put together by our editorial colleagues, a routine task which can be time-consuming. They have good knowledge about related content, but cannot know about everything, which means that the audience does not see some content that might be relevant.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p09vxrj6.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p09vxrj6.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p09vxrj6.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p09vxrj6.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p09vxrj6.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p09vxrj6.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p09vxrj6.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p09vxrj6.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p09vxrj6.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <h4>Short Form Video &amp; Datalab</h4>
<p>With this in mind, we have worked with Datalab (our in-house BBC machine learning specialists) to create an algorithm-based video recommendations engine, which we hope will help our audience see more of the content they love whilst reducing the editorial overhead of creating a set of relevant links.</p>
<h4>Algorithm-based recommendations</h4>
<p>The engine works by combining content information about the clip with more information about user journeys from across the BBC. This combination of multiple sources should provide a more relevant list of videos for our audience to watch next. This is the first cross-product engine supporting user journeys across News and Sport, which means that you may see news, sport or a combination of both in the module. This is the first version of the short-form video recommender; there will be more improvements to come as we continue to develop it.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p09vxz05.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p09vxz05.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p09vxz05.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p09vxz05.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p09vxz05.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p09vxz05.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p09vxz05.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p09vxz05.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p09vxz05.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <h4>Launching the new recommendations</h4>
<p>The plan is to launch this new functionality on all American Football clips the week commencing 13 September to give the engine a trial with freshly published content and to give us the opportunity to measure its impact. Provided all is well, we will gradually release this feature across all BBC Sport videos. After that we will begin to roll out the same engine for BBC News. Beyond that we will continue to work with editorial colleagues to improve it over the coming months.</p>
<p>If you have any feedback on this new video experience, please leave your comments below.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>How metadata will drive content discovery for the BBC online</title>
      <description><![CDATA[How metadata will be at the heart of a new content discovery strategy.]]></description>
      <pubDate>Wed, 15 Apr 2020 13:53:12 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/eacbb071-d471-4d85-ba9d-938c0c800d0b</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/eacbb071-d471-4d85-ba9d-938c0c800d0b</guid>
      <author>Jonathan Murphy and Jeremy Tarling</author>
      <dc:creator>Jonathan Murphy and Jeremy Tarling</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
    <p><em>Jonathan Murphy, editorial lead for metadata and Jeremy Tarling, lead data governance specialist in Digital Publishing, explain what's being done to create a common metadata structure.&nbsp;</em></p>
<p>The BBC&rsquo;s online portfolio has been built up over more than 20 years into a rich and varied collection of websites and services - but with all this content, it&rsquo;s sometimes difficult to find some of it, let alone manage it all. As a result, we have lots of hidden gems that aren&rsquo;t being surfaced, and that&rsquo;s something that we&rsquo;re trying to fix with a new content discovery strategy.</p>
<p>One of the challenges we&rsquo;re facing up to as we rebuild our digital portfolio is how to make more of our content discoverable and personalised to more of our audiences. That&rsquo;s particularly true of the under-35s age group, who now have an array of competing platforms which do a great job of building algorithms to attract their attention.</p>
<p>Underpinning this strategy of discovery and personalisation will be more detailed metadata that describes all of our content using the same terminology, the same tools and the same data model.</p>
<p>Metadata is the background info that describes the things we make. It can come in all forms, from technical metadata such as which camera was used in a film shoot to promotional metadata used to describe the plot of a programme. For the remit of this project, we&rsquo;re focusing on what we call descriptive content metadata - tags that describe what an asset (e.g. an article, programme or TV/Audio clip) is about or who/what it mentions. That&rsquo;s already used in areas like the BBC News and BBC Sport websites using data architecture called <a href="https://www.bbc.co.uk/blogs/bbcinternet/2012/04/sports_dynamic_semantic.html">Dynamic Semantic Publishing that was created for the London 2012 Olympics,</a> and now drives many thousands of subject-based aggregations, or Topic pages.</p>
<p>There are also <a href="https://www.bbc.co.uk/sounds/categories/mixes">categories of programmes on BBC Sounds</a> and BBC iPlayer, which use a mixture of genres and formats contained in <a href="https://www.bbc.co.uk/blogs/bbcinternet/2009/02/what_is_pips.html">the PIPs database</a> that supports our vast online library of programme information. As a result of these two data silos, and their limitations, it&rsquo;s difficult to offer audiences any pan-BBC experiences or anything that requires an in-depth understanding of the content.</p>
<h4>Common Metadata</h4>
<p>So we need to go further than that, offering material that covers the breadth of our online content and suits everyone&rsquo;s tastes and needs. Here in the BBC&rsquo;s Digital Publishing team, we&rsquo;re developing tools that make content description possible at all stages of production across our portfolio, and new vocabularies that allow for richer descriptions of our content.</p>
<p>We&rsquo;ve already worked with the Sounds team, who have used our new tags and curations to create some of their <a href="https://www.bbc.co.uk/sounds/category/mixes">Music Playlists</a>, which give you a soundtrack to suit your mood, whether that&rsquo;s &lsquo;chilled out&rsquo;, &lsquo;feel good&rsquo;, music to dance to, or music to focus your mind.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p089knwq.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p089knwq.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p089knwq.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p089knwq.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p089knwq.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p089knwq.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p089knwq.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p089knwq.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p089knwq.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p>To make this possible we&rsquo;ve developed a concept that every piece of content has a basic set of common metadata associated with it, that it carries around wherever it&rsquo;s surfaced across the BBC&rsquo;s portfolio - whether that&rsquo;s in Sounds, iPlayer or on the BBC News homepage. We&rsquo;re storing this set of basic common metadata in what we call a &lsquo;Passport&rsquo;, and to create and manage this metadata we&rsquo;ve developed a tool called Passport Control.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p089kphl.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p089kphl.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p089kphl.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p089kphl.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p089kphl.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p089kphl.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p089kphl.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p089kphl.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p089kphl.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p>Using the simple graph model first developed for the BBC&rsquo;s 2012 Olympics coverage, we create subject-predicate-object triples to describe the nature of the relationship between an asset (the subject) and a tag (the object).</p>
<p>In the above example this asset has been described as being &ldquo;about&rdquo; some things - this kind of subject based tagging is well established at the BBC, especially in our journalism output. But we have added three new predicates: &ldquo;editorial tone&rdquo;, &ldquo;intended audience&rdquo;, and &ldquo;genre&rdquo;.</p>
<p>Each predicate can be used with an associated controlled vocabulary of terms. In some cases these controlled vocabularies are taxonomic hierarchies (like BBC genres) while in others they are simple lists of terms that we have developed to describe our output in ways that make sense to us and our audience.</p>
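<p>As a sketch, the triple model with controlled vocabularies might look like this in Python; the vocabulary terms below are illustrative stand-ins, not the BBC&rsquo;s actual lists:</p>

```python
# Hypothetical controlled vocabularies per predicate; None means the
# predicate accepts open tagging (any concept or term).
VOCABULARIES = {
    "about": None,
    "editorialTone": {"informative", "analytical", "light-hearted"},
    "intendedAudience": {"general", "under-35s", "children"},
    "genre": {"news", "sport", "music", "drama"},
}

def tag(triple_store, subject, predicate, obj):
    """Add a subject-predicate-object triple, rejecting terms outside
    the predicate's controlled vocabulary."""
    allowed = VOCABULARIES[predicate]
    if allowed is not None and obj not in allowed:
        raise ValueError(f"{obj!r} is not in the {predicate} vocabulary")
    triple_store.append((subject, predicate, obj))

triples = []
tag(triples, "asset:clip-123", "about", "Glastonbury Festival")
tag(triples, "asset:clip-123", "genre", "music")
print(triples)
```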
<p>These new types of metadata can be used to make much richer collections of content, either as manual editorial curations or algorithmically generated recommendations.</p>
<p>Our colleague Anna McGovern explains some of the challenges we face at the BBC <a href="https://www.bbc.co.uk/blogs/internet/entries/887fd87e-1da7-45f3-9dc7-ce5956b790d2">here</a> in building our curations and recommendations while upholding our public service values. With the amount and variety of material that we produce, from news articles to music mixes, live events to boxsets, we think we&rsquo;re in a good position to provide content for all kinds of different tastes.</p>
<p>We&rsquo;ll update you more about metadata developments, curations and recommendations as these features begin to roll out on BBC Online over the coming months.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Scaling responsible machine learning at the BBC</title>
      <description><![CDATA[How the BBC's public service principles are being applied to machine learning.]]></description>
      <pubDate>Fri, 04 Oct 2019 09:32:44 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/4a31d36d-fd0c-4401-b464-d249376aafd1</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/4a31d36d-fd0c-4401-b464-d249376aafd1</guid>
      <author>Gabriel Straub</author>
      <dc:creator>Gabriel Straub</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
    <p>Machine learning is a set of techniques by which computers can &lsquo;solve&rsquo; problems without being explicitly programmed with all the steps to solve them, working within parameters set and controlled by data scientists in partnership with editorial colleagues.</p>
<p>The BBC currently uses machine learning in a range of ways &ndash; for example to provide users with personalised content recommendations, to help it understand what is in its vast archive, and to help transcribe the many hours of content we produce. And in the future, we expect that machine learning will become an ever more important tool to help the BBC create great audience experiences.</p>
<p>The BBC was founded in 1922 in order to inform, educate and entertain the public. And we take that purpose very seriously. We are governed by our <a href="https://www.bbc.com/aboutthebbc/governance/charter">Royal Charter</a>&nbsp;and public service is at the heart of everything we do. This means that we act on behalf of our audience by giving them agency and that our organisation exists in order to serve individuals and society as a whole rather than a small set of stakeholders.</p>
<p>With Machine Learning becoming a more prevalent aspect of everyday life, our commitment to audience agency is reflected in this area as well. And so in 2017, we submitted <a href="http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/artificial-intelligence-committee/artificial-intelligence/written/70493.pdf">a written commitment to the House of Lords Select Committee on Artificial Intelligence</a>&nbsp;in which we promised to lead the way in the responsible use of all AI technologies, including machine learning.</p>
<p>But what does this mean in practice?</p>
<p>For the last couple of months, we have been bringing together colleagues from editorial, operational privacy, policy, research and development, legal and data science teams in order to discuss what guidance and governance is necessary to ensure our machine learning work is in line with that commitment.</p>
<p>Together, we agreed that the BBC&rsquo;s machine learning engines will support public service outcomes (i.e. to inform, educate and entertain) and empower our audiences.</p>
<p>This statement then led to a set of <strong>BBC Machine Learning Principles</strong>:</p>
<h4>The BBC&rsquo;s Values</h4>
<p>1. The BBC&rsquo;s ML engines will reflect the values of our organisation; upholding trust, putting audiences at the heart of everything we do, celebrating diversity, delivering quality and value for money and boosting creativity.</p>
<h4>Our Audiences</h4>
<p>2. Our audiences create the data which fuels some of the BBC&rsquo;s ML engines, alongside BBC data. We hold audience-created data on their behalf, and use it to improve their experiences with the BBC.</p>
<p>3. Audiences have a right to know what we are doing with their data. We will explain, in plain English, what data we collect and how this is being used, for example in personalisation and recommendations.</p>
<h4>Responsible Development of Technology</h4>
<p>4. The BBC takes full responsibility for the functioning of our ML engines (in house and third party). Through regular documentation, monitoring and review, we will ensure that data is handled securely and that our algorithms serve our audiences equally and fairly, so that the full breadth of the BBC is available to everyone.</p>
<p>5. Where ML engines surface content, outcomes are compliant with the BBC&rsquo;s editorial values (and where relevant as set out in our editorial guidelines). We will also seek to broaden, rather than narrow, our audience&rsquo;s horizons.</p>
<p>6. ML is an evolving set of technologies, where the BBC continues to innovate and experiment. Algorithms form only part of the content discovery process for our audiences, and sit alongside (human) editorial curation.</p>
<p>These principles are supported by a checklist that gives practitioners concrete questions to ask themselves throughout a machine learning project. These questions are not formulated as a governance framework that needs to be ticked off; instead they aim to help teams building machine learning engines really think about the consequences of their work. Teams can reflect on the purpose of their algorithms; the sources of their data; our editorial values; how they trained and tested the model; how the models will be monitored throughout their lifecycle; and their approaches to security, privacy and other legal questions.</p>
<p>While we expect our six principles to remain pretty consistent, the checklist will have to evolve as the BBC develops its machine learning capabilities over time.</p>
<p>The <a href="https://findouthow.datalab.rocks/">Datalab team</a>&nbsp;is currently testing this approach as they build the BBC&rsquo;s first in-house recommender systems, which will offer a more personalised experience for BBC Sport and BBC Sounds. We also hope to improve the recommendations for other products and content areas in the future. We know that this framework will only be impactful if it is easy to use and can fit into the workflows of the teams building machine learning products.</p>
<p>The BBC believes there are huge benefits to being transparent about how we&rsquo;re using Machine Learning technologies. We want to communicate to our audiences how we&rsquo;re using their data and why. We want to demystify machine learning. And we want to lead the way on a responsible approach. These factors are not only essential in building quality ML systems, but also in retaining the trust of our audiences.</p>
<p>This is only the beginning. As a public service, we are ultimately accountable to the public and so are keen to hear what you think of the above.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Navigating the data ecosystem technology landscape</title>
      <description><![CDATA[The challenges of creating an open and transparent data ecosystem for the BBC.]]></description>
      <pubDate>Tue, 03 Sep 2019 12:46:36 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/67fee994-3d20-45d5-be2a-acfc47d572f1</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/67fee994-3d20-45d5-be2a-acfc47d572f1</guid>
      <author>Hannes Ricklefs, Max Leonard</author>
      <dc:creator>Hannes Ricklefs, Max Leonard</dc:creator>
      <content:encoded><![CDATA[<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p07mb9b5.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p07mb9b5.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p07mb9b5.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p07mb9b5.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p07mb9b5.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p07mb9b5.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p07mb9b5.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p07mb9b5.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p07mb9b5.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Credit: Jasmine Cox</em></p></div>
<div class="component prose">
    <p>Want to message your Facebook friends on Twitter? Move your purchased music from iTunes to Amazon? Get Netflix recommendations based on your iPlayer history? Well, currently you can&rsquo;t.</p>
<p>Many organisations are built on data, but the vast majority of the leading players in this market are structured as vertically integrated walled gardens, with few (if any) meaningful interfaces to any outside services. There are a great number of reasons for this, but regardless of whether they are intentional or technological happenstance (or a mixture of both), there is a rapidly growing movement of GDPR supercharged technologists who are putting forward <a href="https://www.bbc.co.uk/news/technology-45706429">decentralised and open alternatives</a> to the <a href="https://www.intricity.com/data-science/what-is-a-data-moat/">data-moated</a> household names of today. For the BBC in particular, these new ways of approaching data are well aligned with our public service ethos and commitment to treating data in the most ethical way possible.</p>
<p>Refining how the BBC uses data, both personal and public, is critical if we are to create <a href="https://www.bbc.co.uk/mediacentre/speeches/2017/tony-hall-annual-plan#heading-a-personalised-uniquely-tailored-bbc">a truly personalised BBC</a> in the near term and essential if we want to remain relevant in the coming decades. Our Chief Technology and Product Officer Matthew Postgate <a href="https://www.bbc.co.uk/blogs/internet/entries/78948980-e1e6-48fe-918a-c9bb5f2a0719">recently spoke about the BBC&rsquo;s role within data-led services</a>, in which he outlined some of the work we have been doing in this respect to ensure the BBC and other public service organisations are not absent from new and emerging data economies.</p>
<p>Alongside focused technical research projects like the <a href="https://www.bbc.co.uk/rd/blog/2019-06-bbc-box-personal-data-privacy">BBC Box</a>, we have been mapping the emerging players, technologies and data ecosystems to further inform the BBC&rsquo;s potential role in this emerging landscape. Our view is that such an ecosystem is made up of the following core capabilities: Identity, data management (storage, access, and processing), data semantics and the developer experience, which are currently handled wholesale in traditional vertical services. A first step for us is hence to ascertain which of these core capabilities can realistically be deployed in a federated, decentralised future, and which implementations currently exist to practically facilitate this.</p>
<p><strong>Identity</strong>, a crucial component of the data ecosystem, proves that users are who they say they are, providing a true digital identity. Furthermore, we expect standard account features such as authentication, and sharing options via unique access tokens that let users gain insights or share their data, to be part of any offering. We found that identity, in the sense of proving a user&rsquo;s identity, was not provided by any of the solutions we investigated. Standard account features were present, ranging from platform-specific implementations, to decentralised identifier approaches via WebID, to blockchain-based distributed ledger approaches. As we strongly believe it is important to prove a user is who they say they are, at this point we would look to integrate solutions that specialise in this domain.</p>
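<p>The scoped access tokens mentioned above can be sketched in a few lines; this is a minimal illustration with invented names and scopes, not any platform&rsquo;s actual API:</p>

```python
# Minimal sketch of scoped access tokens: a user grants a service a token
# limited to particular data categories. Names and scopes are invented.
import secrets

TOKENS = {}  # token -> (user_id, set of granted scopes)

def issue_token(user_id, scopes):
    """Mint an unguessable token tied to a user and a set of scopes."""
    token = secrets.token_urlsafe(16)
    TOKENS[token] = (user_id, set(scopes))
    return token

def authorise(token, scope):
    """Permit access only if the token exists and was granted the scope."""
    entry = TOKENS.get(token)
    return entry is not None and scope in entry[1]
```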
<p><strong>Data management</strong> can be further broken down into three areas:</p>
<ol>
<li><strong>Data usage and access</strong> involves integrating data sources with an associated permission and authorisation model. Users should have complete governance of their data and its usage by data services. Strong data security controls and progressive disclosure of data are key here. Given that our investigation is based around personal data stores (PDS) and time-series sensor/IoT device data platforms that capture personal, public and open data, providing access and controls around sharing of data was a fundamental capability of all offerings. All of them gave users significant granularity and transparency about what data is being stored, its source and its usage by external services.</li>
<li><strong>Data storage</strong> must provide strong protection guarantees for users&rsquo; data, encrypted in transit and at rest, giving users complete control and transparency over data lifecycle management. Again, this is a fundamental requirement: storage is either a core offering of the platform or is outsourced to external services that store data in strongly encrypted formats.</li>
<li><strong>Data processing</strong> covers mechanisms that allow users to bring &ldquo;algorithms&rdquo; to their data, combined with a strong contract-based exchange of data. Users are in control and understand what insights algorithms and services derive from their data. These might include the creation of reports, the creation and execution of machine learning models, and other capabilities that reinforce the user&rsquo;s control over how their personal data is used to generate insights. Through contract- and authorisation-based approaches, users have complete audit trails of any processing performed, which provides transparency over how data is utilised by services, whilst continuously being able to detect suspicious or unauthorised data access. Our investigations found that processing of data is either provided through SDKs that heavily prescribe the workflow for data processing, or not provisioned at all, leaving developers to create their own solution.</li>
</ol>
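<p>The contract-and-audit-trail pattern described above can be sketched as follows; the class and field names are hypothetical:</p>

```python
# Sketch of permissioned processing over a personal data store (PDS):
# every processing request is checked against the user's grants and
# appended to an audit log the user can inspect. All names are invented.
from datetime import datetime, timezone

class PersonalDataStore:
    def __init__(self, permissions):
        self.permissions = permissions  # service -> set of permitted data types
        self.audit_log = []             # full record of access attempts

    def process(self, service, data_type, algorithm):
        allowed = data_type in self.permissions.get(service, set())
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "service": service,
            "data_type": data_type,
            "algorithm": algorithm,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{service} may not process {data_type}")
        return f"ran {algorithm} over {data_type}"
```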
<p><strong>Data model and semantics&nbsp;</strong>refers to mechanisms that describe (schemas, ontologies) and maintain the data domains inside the ecosystem, which is essential for extensibility and interoperability. Our investigations found this being approached across a wide spectrum:</p>
<ol>
<li>no provision at all, leaving developers to come to their own conclusions about the best way to proceed</li>
<li>using open standards such as schema.org and modelling data around linked data and RDF</li>
<li>completely proprietary definitions around schemas within the system.</li>
</ol>
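<p>To make the open-standards option concrete: describing an item with shared schema.org terms lets other services interpret it without proprietary knowledge. A small illustration, with invented values:</p>

```python
# A radio episode described with schema.org vocabulary (values invented),
# plus a trivial check that an item carries the minimum shared terms.
episode = {
    "@context": "https://schema.org",
    "@type": "RadioEpisode",
    "name": "An example episode",
    "partOfSeries": {"@type": "RadioSeries", "name": "An example series"},
    "datePublished": "2019-09-03",
}

def has_required_terms(item, required=("@context", "@type", "name")):
    """True if the item declares the minimum shared vocabulary."""
    return all(term in item for term in required)
```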
<p>Finally, the <strong>developer experience</strong> is key. It requires a set of software development tools that enable engineers to develop features and experiences, as well as to implement the unique value propositions required by services. This is the strongest and most consistent area across all our findings.</p>
<p>In summary, our investigations have shown that no one solution provides all of our identified and required capabilities. Crucially, the majority of the explored end-user solutions are still commercially orientated, such that they either make money from subscribers or through associated services.</p>
<p>So with the number of start-ups, software projects and standards that meet these capabilities snowballing, where might the BBC fit into this increasingly crowded new world?</p>
<p>We believe that the BBC has a role to play in all of these capabilities and that it would enhance our existing public service offering: to inform, educate and entertain. A healthy ecosystem requires multiple tenants and solutions providers, all adhering to core values such as transparency, interoperability and extensibility. Only then will users be able to freely and independently move or share their data between providers which would enable purposeful collaboration and fair competition toward delivering value to audiences, society and industry.</p>
<p>The BBC was incorporated at the dawn of the radio era to counteract the unbridled free-for-all that often comes with any disruptive technology, and <a href="https://www.bbc.co.uk/rd/about/our-purpose">its remit to shape standards and practices </a>for the good of the UK and its population stands today as <a href="http://downloads.bbc.co.uk/historyofthebbc/1920s.pdf">it did in 1927</a>. With a scale, reach and purpose that is unique to the BBC, it is strongly congruent with our public service duty to help drive policy, standards and access rights to ensure that the riches on offer in these new ecosystems are not co-opted solely for the pursuit of profit, and remain accessible for the benefit of all.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Machine learning and editorial collaboration within the BBC</title>
      <description><![CDATA[Datalab team members explain how the best machine learning results will come from a multi-disciplined approach.]]></description>
      <pubDate>Thu, 29 Aug 2019 13:55:00 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/a38207dd-e4ed-40fa-8bdf-aebe1dc74c28</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/a38207dd-e4ed-40fa-8bdf-aebe1dc74c28</guid>
      <author>Anna McGovern, Ewan Nicolson, Svetlana Videnova</author>
      <dc:creator>Anna McGovern, Ewan Nicolson, Svetlana Videnova</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
<p>The BBC is nearly 100 years old. Inevitably, as an organisation we are having to adapt to meet some of the technological requirements of the future, such as incorporating Machine Learning (ML) technologies. ML-driven recommendations, for example, are now a standard way for audiences to discover content, and the BBC is committed to making this discovery more personal. Developing these services has brought an interesting opportunity for collaboration between the ML and Editorial teams within Datalab, the BBC team focused on building recommendation engines.</p>
<p>About a year ago we started experimenting with the <a href="https://www.bbc.co.uk/blogs/internet/entries/82cd8d1e-2f23-4eff-8f34-0ef38ca8854c">BBC+ app</a>. This was the first time the BBC provided the audience with a fully automated ML service. With this wealth of knowledge, and with more data science initiatives taking shape, we want to use all the expertise the BBC can provide.</p>
<p>Our aim is to create responsible recommendation engines that are true to the BBC&rsquo;s values. In industry, it is commonplace for data science teams to make use of specialist knowledge to inform how models are developed. For example, data scientists working for a travel site would consult experts with knowledge about everything from business flights to how and when families go on holiday. Datalab likewise consulted editorial teams and representatives who specialise in curation as it began to develop recommendations for content discovery.</p>
<p>Datalab&rsquo;s editorial lead, Anna McGovern, advises us on editorial judgement and content curation expertise within the BBC. Ewan Nicolson is lead data scientist and represents the technological aspect of Datalab&rsquo;s work here. Svetlana Videnova, Business Analyst, poses some of the common teamwork problems within the public media industry and the technological challenges we face today. We will focus on a given challenge around the curation of content, and leave the creation phase for another post. Anna and Ewan each explain how they would tackle that work in their own fields. The last column of the table below shows an example of how the collaboration works in our team.</p>
<p>As you&rsquo;ll see, the two fields of editorial and data science complement each other. Working across disciplines gives better results for the audience, and helps us learn from each other. It means that machine learning is actually solving the correct problems, because we&rsquo;re making use of the rich expertise from editorial. It also means that editorial are able to take advantage of techniques like machine learning to multiply their efforts and deliver more value to the audience.</p>
</div>
<div class="component prose">
    <table width="554" border="1" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="98">
<p><strong>Challenge</strong></p>
</td>
<td valign="top" width="126">
<p><strong>Machine Learning solution</strong></p>
</td>
<td valign="top" width="147">
<p><strong>Editorial solution</strong></p>
</td>
<td valign="top" width="175">
<p><strong>When we collaborate</strong></p>
</td>
</tr>
<tr>
<td valign="top" width="98">
<p>How&nbsp;do&nbsp;we ensure curation is a good experience for users?</p>
</td>
<td valign="top" width="126">
<p>We consider many different measures of success: accuracy, diversity, recency, impartiality, editorial priority.</p>
</td>
<td valign="top" width="147">
<p>Traditionally on an editorial team, a journalist would research a story, discuss how it might be covered and compose the story itself to make it compelling.&nbsp;</p>
</td>
<td valign="top" width="175">
<p>The data scientists get a rich understanding from editorial of the trade-offs between these different measures of success: deep domain knowledge.</p>
</td>
</tr>
<tr>
<td valign="top" width="98">
<p>How does recency impact curation of content?</p>
</td>
<td valign="top" width="126">
<p>We include publication date as a feature in our models. We sometimes try and optimise for recency, showing people more current content in some situations.</p>
</td>
<td valign="top" width="147">
<p>One of the challenges is that once that work is done it is fairly hard to bring the editorial creation back to life, especially for evergreen content. This is one of many examples that ML recommendations could help&nbsp;with, by surfacing this content in the most relevant time according to the user&rsquo;s experience or history.&nbsp;</p>
</td>
<td valign="top" width="175">
<p>By working together we&rsquo;re able to identify how to make decisions about which pieces of content are evergreen and suitable for recommendation, and which pieces have a limited shelf-life and shouldn&rsquo;t be presented to users beyond a certain point.</p>
</td>
</tr>
<tr>
<td valign="top" width="98">
<p>How does the BBC ensure impartiality?&nbsp;</p>
</td>
<td valign="top" width="126">
<p>We use&nbsp;<a href="https://5harad.com/papers/fair-ml.pdf">measures of statistical fairness</a>&nbsp;to understand if our model is giving unbiased results.</p>
<p>&nbsp;</p>
<p>Good practice in machine learning makes sure that we&rsquo;re using unbiased training data.</p>
</td>
<td valign="top" width="147">
<p>Editors,&nbsp;journalists and content creators make a concerted effort&nbsp;to ensure that a range of views and perspectives are shown within a piece of content or across several pieces of content (within&nbsp;a series, for example).</p>
</td>
<td valign="top" width="175">
<p>We combine our good practices with domain knowledge from editorial. We use techniques like&nbsp;human-in-the-loop machine learning, or semi-supervised learning&nbsp;to make editorial&rsquo;s lives easier, and apply their knowledge at massive scale.</p>
<p>&nbsp;</p>
<p>ML helps editorial&nbsp;identify those pieces of content that show a breadth of views.&nbsp;</p>
</td>
</tr>
<tr>
<td valign="top" width="98">
<p>How do we ensure&nbsp;variety&nbsp;in the content we serve?</p>
</td>
<td valign="top" width="126">
<p>We construct mathematical measures for&nbsp;<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.221.6277&amp;rep=rep1&amp;type=pdf">novelty and diversity.</a>&nbsp;We include these in our machine learning optimisations.</p>
<p>&nbsp;</p>
</td>
<td valign="top" width="147">
<p>Editorial staff responsible for curation&nbsp;ensure&nbsp;a breadth and depth of content on indexes, within collections, etc.</p>
</td>
<td valign="top" width="175">
<p>We learn about the differences between our different pieces of content. Working together we&rsquo;re able to determine if our recommendations offer an interesting, relevant, and useful journey for the user.&nbsp;</p>
<p>&nbsp;</p>
<p>The BBC&rsquo;s audio networks feature different output and tones of voice, e.g.&nbsp;Radio 4 has a very different &lsquo;flavour&rsquo;&nbsp;from 6 Music. Consequently, the network can be used to ensure variety in results.</p>
</td>
</tr>
<tr>
<td valign="top" width="98">
<p>How do we avoid&nbsp;legal issues?&nbsp;</p>
</td>
<td valign="top" width="126">
<p>We are given a checklist, and we check the items off. We get told that there are things&nbsp;&ldquo;we can&rsquo;t do for opaque legal reasons&rdquo; but never really understand why, and limit the functionality of our solution.</p>
<p>&nbsp;</p>
</td>
<td valign="top" width="147">
<p>Editors, journalists and content creators have to attend a mandatory course relating to media law, so that they have full knowledge about issues such as contempt of court, defamation and privacy. An editor will sign off content to ensure that content is compliant with legal requirements.&nbsp;</p>
</td>
<td valign="top" width="175">
<p>By talking to legal advisers we can build business rules to minimise the risk of legal infractions.&nbsp;</p>
<p>&nbsp;</p>
<p>Close collaboration with editorial means we gain a deep understanding of the potential problems ahead at an early stage. We build with awareness of these concerns, and with that awareness build a solution that is high quality from both a technical and editorial point of view.</p>
</td>
</tr>
<tr>
<td valign="top" width="98">
<p>How do we&nbsp;handle editorial quality?</p>
</td>
<td valign="top" width="126">
<p>We build and refine a model using data science good practices, and then turn it over to our editorial colleagues. They then decide if the results are good or not.</p>
</td>
<td valign="top" width="147">
<p>When editors curate they can choose content that is relevant, interesting and of good quality.&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>Recommendations present a specific editorial challenge,&nbsp;in that recommenders can surface content that is not the best of our output.&nbsp;</p>
</td>
<td valign="top" width="175">
<p>In&nbsp;BBC+&nbsp;we prioritised content that we knew would suit the environment in which it appeared: standalone, short-form videos, appearing in a feed, from digital-first areas such as Radio 1, The Social and BBC Ideas.</p>
<p>&nbsp;</p>
<p>Including editorial throughout the process means that they teach us what is important in the results, so that data science understands the real problems we&rsquo;re trying to solve.</p>
<p>&nbsp;</p>
<p>We fail quickly, and learn quickly, getting to a better quality result.</p>
</td>
</tr>
<tr>
<td valign="top" width="98">
<p>How do we learn from our audiences? Accuracy/user-generated content?</p>
</td>
<td valign="top" width="126">
<p>We measure user activity with the products, and construct measures of engagement.</p>
<p>&nbsp;</p>
<p>We build implicit and explicit feedback loops. An explicit feedback loop is having a&nbsp;&ldquo;like&rdquo;&nbsp;button; an implicit feedback loop is determining a way to measure when something has gone wrong, like bounce rate or user churn.</p>
<p>&nbsp;</p>
</td>
<td valign="top" width="147">
<p>We monitor feedback and analyse stats to build a picture about how our audiences engage with our content.&nbsp;</p>
</td>
<td valign="top" width="175">
<p>We work with editorial to understand the insights we get from data. They help rationalise the behaviours that we see in the data. They also teach us things that we should look for in the data.</p>
</td>
</tr>
<tr>
<td valign="top" width="98">
<p>How do we test recommendations?</p>
</td>
<td valign="top" width="126">
<p>A mixture of offline evaluation metrics&nbsp;(e.g. testing against a known test set of data) and online evaluation metrics&nbsp;(e.g. A/B testing).</p>
<p>&nbsp;</p>
</td>
<td valign="top" width="147">
<p>Traditionally: We monitor feedback and analyse stats to build a picture about how our audiences engage with our content.&nbsp;</p>
</td>
<td valign="top" width="175">
<p>The editorial lead works with data scientists on the composition of the recommender. The editorial lead then reviews the results and, to obtain a variety of opinions, further editorial colleagues review them as well.&nbsp;</p>
<p>&nbsp;</p>
<p>More on quantitative testing&nbsp;<a href="https://www.bbc.co.uk/blogs/internet/entries/3e4342d4-6f81-47c0-8ba2-8dc7b419eb72">here</a>.</p>
<p>&nbsp;</p>
<p>The rich editorial feedback lets us understand where our model could be better and make improvements.</p>
</td>
</tr>
</tbody>
</table>
</div>
<div class="component prose">
    <p>We&rsquo;re big believers in cross-disciplinary collaboration. As we&rsquo;ve touched on in this article the BBC has a lot of uniquely complex problems to solve in this space. This collaboration is essential if we&rsquo;re going to continue to deliver value to the BBC&rsquo;s audience using data.</p>
<p>If you are curious about this collaboration and would like to know in more depth how we work, leave us a message and we will be happy to get back to you.</p>
<p>Also, we are hiring: <a href="https://findouthow.datalab.rocks/">https://findouthow.datalab.rocks/</a>.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Developing personalised recommender systems at the BBC</title>
      <description><![CDATA[Some of the approaches that are being tested to improve recommendations.]]></description>
      <pubDate>Thu, 22 Aug 2019 14:20:43 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/3e4342d4-6f81-47c0-8ba2-8dc7b419eb72</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/3e4342d4-6f81-47c0-8ba2-8dc7b419eb72</guid>
      <author>Jana Eggink</author>
      <dc:creator>Jana Eggink</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
    <p>The BBC is on a journey to become more personalised, and recommendations are an important part of that goal. To date, recommendations in the BBC have been provided primarily by external providers. We feel that offering &mdash; and understanding &mdash; good recommendations is a crucial area for us in reaching our target audience of young listeners and so we have started exploring this area in-house. The <a href="https://findouthow.datalab.rocks">Datalab team</a> is a relatively new team specialising in machine learning, and looking after recommender systems in the BBC. We work with product groups to develop new ways to personalise their offerings, and also collaborate with <a href="https://www.bbc.co.uk/rd">BBC R&amp;D</a>.</p>
<p>We want to be able to explain the composition of our recommendations, and so we need to understand how they are generated. Our recommendations should reflect the breadth and diversity of our content and meet our editorial guidelines, as well as informing, educating and entertaining! All of these were good reasons for us to build the capability to continually create challengers to our existing recommendation models.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p07l9x41.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p07l9x41.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p07l9x41.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p07l9x41.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p07l9x41.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p07l9x41.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p07l9x41.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p07l9x41.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p07l9x41.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Current recommendations on Sounds website</em></p></div>
<div class="component prose">
    <p>Datalab was assigned this brilliant and fun challenge and began collaborating with the Sounds team, using a multidisciplinary group made up of data scientists, engineers, editorial specialists and product managers.</p>
<p>The team had some prior experience building personalised recommendations for our video clip <a href="https://www.bbc.co.uk/blogs/internet/entries/82cd8d1e-2f23-4eff-8f34-0ef38ca8854c">app BBC+</a>. For BBC+, the recommender was purely content based, using existing metadata information such as genres (e.g. Drama/Medical) or brands (e.g. Glastonbury Festival). This would probably have been a good approach if our content had been labelled for the express purpose of personalisation. However, the BBC&rsquo;s production workflows were designed to meet the needs of broadcast systems, and we didn&rsquo;t always have all the labels we would have wanted for recommendations.</p>
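<p>A purely content-based recommender of the kind described above can be sketched as metadata-label overlap; the labels here are invented, and real systems use richer similarity measures:</p>

```python
# Toy content-based recommender: rank candidates by Jaccard overlap of
# their metadata labels (genres, brands) with the user's history.
def jaccard(a, b):
    """Set similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(history_labels, candidates, top_n=2):
    """Return the top_n candidate ids most similar to the user's history."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: jaccard(history_labels, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in ranked[:top_n]]
```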
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p07l9xqd.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p07l9xqd.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p07l9xqd.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p07l9xqd.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p07l9xqd.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p07l9xqd.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p07l9xqd.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p07l9xqd.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p07l9xqd.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>BBC+ app</em></p></div>
<div class="component prose">
    <p><a href="https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf">Factorisation Machines</a> come with the enticing promise of combining content-based recommendations with collaborative filtering.</p>
<p>Using a standard content-based approach, if a user had listened to podcasts from the genre &lsquo;Health &amp; Wellbeing&rsquo; the system would recommend a new episode from Radio 1&rsquo;s Life Hacks but it could also recommend Radio 4&rsquo;s Inside Health, which has a very different tone of voice. By contrast, collaborative filtering matches programmes based on what similar users have enjoyed &mdash; so if they listen to Radio 1&rsquo;s Life Hacks, they might be recommended Radio 1 comedy. This model relies on &lsquo;adjacent&rsquo; content similar to the recommendations found in shopping websites where &lsquo;customers who bought this also bought that&rsquo;. This approach often leads to better recommendations for established content, but is less effective for fresh content that hasn&rsquo;t been consumed by significant numbers of people. Since the BBC continuously produces new content throughout the day this recommendation strategy by itself would be limiting.</p>
<p>Factorisation machines are a smart way to combine both. They have been around for a few years, and open-source toolboxes exist to support them. Our team programs primarily in Python, so we wanted a toolbox that integrates with that. Obviously, we also wanted it to be fast, give superior results and be easy to use (more on that later&hellip;).</p>
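<p>At its core, the model in Rendle&rsquo;s paper scores a feature vector (user, item, genre and brand indicators all become features) with a global bias, linear weights, and pairwise interactions expressed as dot products of per-feature latent vectors. A toy scoring function, independent of any particular toolbox, with made-up weights:</p>

```python
# Toy second-order factorisation machine scorer (following Rendle 2010):
# score = w0 + sum_i w[i]*x[i] + sum_{i<j} <v[i], v[j]> * x[i]*x[j]
# In practice w0, w and v would be learned from interaction data.
def fm_score(x, w0, w, v):
    """x: feature values; w0: global bias; w: linear weights;
    v: one latent vector per feature (all the same length)."""
    linear = w0 + sum(w[i] * x[i] for i in range(len(x)))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(vi * vj for vi, vj in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return linear + pairwise
```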
<p>We stored user-item interactions (i.e. the programmes a specific user has listened to) in a BigQuery table. The programme items with the corresponding genre and brand metadata were in a different table, and both needed to be assembled in the correct format for the factorisation machines. Our first choice of toolbox was xlearn. The code seemed relatively mature, running a first test example was easy, and the toolbox offers a variety of different flavours in terms of learning algorithm. But it was hard to get the data into the correct format and, even now that we have a version up and running, we&rsquo;re still not sure we got everything right &mdash; mainly because the initial results are nowhere near as good as we had wanted (and expected) them to be!</p>
<p>The quality of recommendations can be subjective and we needed a way to test them before making them live anywhere on the BBC&rsquo;s websites or apps. Predicting past behaviour is one way of doing this, but also comes with all sorts of problems: users only click on what they see, a piece of content might be brilliant, but if it does not appear in the results, the user will not see it and cannot click on it. Recommending the most popular items generally gives good numbers (as by definition these items get the most clicks), but rarely leads to recommendations of fresh content. In practical terms, it&rsquo;s also a lot of work to set up if your data is stored in ways that were not devised with the ease of access for data scientists in mind&hellip;</p>
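<p>&ldquo;Predicting past behaviour&rdquo; typically boils down to metrics such as precision-at-k over a held-out set of items the user actually played; a minimal sketch with invented data:</p>

```python
# Offline evaluation sketch: hold out items a user actually played and
# measure how many appear among the top-k recommendations.
def precision_at_k(recommended, held_out, k):
    """Fraction of the top-k recommendations the user actually played."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in held_out) / k
```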
<p>So we decided to test the results using qualitative evaluation, asking about 20 editorial and other non-technical people to judge our new recommendations against those from an existing provider. We didn&rsquo;t tell them which set came from which recommender! We used the individual history of the internal test participants to generate the recommendations by both providers and asked for their preference and general feedback.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p07lbl35.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p07lbl35.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p07lbl35.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p07lbl35.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p07lbl35.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p07lbl35.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p07lbl35.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p07lbl35.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p07lbl35.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Qualitative experiment</em></p></div>
<div class="component prose">
<p>Most of our test users preferred the recommendations we currently have live to our first set of test recommendations, and we weren&rsquo;t keen on them either, so we knew we had more work to do.</p>
<p>With the overall infrastructure set up, it was quite easy to swap out the toolbox we used for the factorisation machines. We had previously looked at <a href="https://lyst.github.io/lightfm/docs/home.html">lightFM</a>, and it had a much simpler data format, so we decided to give it a go. We were able to compute new recommendations and run another qualitative experiment in less than two weeks. Our recommendations looked much better, and our test users agreed! However, these are still first results. We don&rsquo;t feel we&rsquo;ve fully solved the problem of recommending popular items versus programmes that are strongly tailored towards a specific user&rsquo;s interests, and we are looking into ways to improve this.</p>
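<p>Much of the difference came down to input format: lightFM&rsquo;s helper classes accept plain (user, item) interaction tuples and handle the index bookkeeping internally. A minimal, library-free sketch of that bookkeeping, with made-up listener and programme ids:</p>

```python
def build_interactions(pairs):
    """Turn raw (user, item) pairs into a sparse COO-style interaction matrix:
    two id-to-index mappings plus (row, col, weight) triplets."""
    users, items, coo = {}, {}, []
    for user, item in pairs:
        u = users.setdefault(user, len(users))   # assign next free row index
        i = items.setdefault(item, len(items))   # assign next free column index
        coo.append((u, i, 1.0))                  # implicit feedback: weight 1
    return users, items, coo

listens = [("alice", "kermode_on_film"), ("bob", "in_our_time"),
           ("alice", "in_our_time")]
users, items, coo = build_interactions(listens)
print(coo)  # [(0, 0, 1.0), (1, 1, 1.0), (0, 1, 1.0)]
```

<p>Compared with assembling indexed text files by hand, handing the raw tuples to the library removes a whole class of off-by-one mistakes.</p>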
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p07lbmwq.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p07lbmwq.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p07lbmwq.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p07lbmwq.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p07lbmwq.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p07lbmwq.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p07lbmwq.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p07lbmwq.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p07lbmwq.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
<p>We are happy with the results so far, but there is still a lot of work to do to bring the recommender into production. The infrastructure needs considerably more work to make it robust and able to scale, and we&rsquo;d like to do more testing. Having a variety of offline metrics should help us optimise parameters and test new algorithms without having to go back to our testing panels every few days. We&rsquo;re also still looking at a simple content-based recommender to give us another baseline, so we hope to share more results soon.</p>
<p>We also still have some more fundamental questions that we hope our practical work will help us to answer. For example, can we use the same approach for recommending entertainment as for news, or do we need specialised systems for each domain? And what if we change the medium and move from audio and video to text, or new interfaces like voice controlled devices? Even if the overall editorial guidelines do not change, we might need different technical approaches to be able to achieve them. But we also want to avoid starting from scratch for every new recommender we build, and we&rsquo;re still trying to figure out how best to do that. In summary, there is lots to do, but it&rsquo;s exciting and we&rsquo;re enjoying the challenge!</p>
<p>Want to work with us? <a href="https://findouthow.datalab.rocks/">https://findouthow.datalab.rocks/</a></p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Looking at the BBC's role in data-led services</title>
      <description><![CDATA[What the BBC is doing with data services and how it's ensuring this maintains trust with audiences.]]></description>
      <pubDate>Wed, 19 Jun 2019 08:20:47 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/78948980-e1e6-48fe-918a-c9bb5f2a0719</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/78948980-e1e6-48fe-918a-c9bb5f2a0719</guid>
      <author>Matthew Postgate</author>
      <dc:creator>Matthew Postgate</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
    <p>It&rsquo;s been a busy time in my team over the last few months &ndash; with updates to BBC Sounds and iPlayer, 5G trials in Orkney (and in London), UHD trials for the FA Cup, Doctor Who launching in Virtual Reality and those teams behind the scenes that keep our broadcast and online services going day-after-day.</p>
<p>But one area that keeps on coming up when I&rsquo;m out and about speaking at conferences, or at meetings with our partners &ndash; is the question of data: how we use it, how we share it and its potential to help us understand the world around us.</p>
<p>Not a week goes by without stories about data. There are negative stories, about data being used to target you with specific messages or to sell you more, or about leaks of personal data to third parties. But there are also positive stories, like <a href="https://www.bbc.co.uk/news/uk-scotland-48556413">using big data to help reduce carbon emissions</a> or <a href="https://www.bbc.co.uk/news/technology-48072164">helping the justice system work better</a>.</p>
<p>This has made me think about the BBC&rsquo;s role in this new &lsquo;data economy&rsquo; &ndash; and what that should be.</p>
<h4>How we use your data</h4>
<p>At the BBC, we use data to make what we provide you, our viewers, listeners or readers, even better. It helps us tailor our products and services to be more about you &ndash; recommending programmes or content we think you might like, or alerting you to the fact your favourite sports team has just scored (or lost a match). We also use it to ensure we&rsquo;re making something for all audiences &ndash; and it helps us find gaps when we commission programmes and services.</p>
<p>But is there more that we could be doing to ensure data is used for good &ndash; that the data you give organisations is not just used for commercial gain but is used in a way that helps you and potentially your wider community? We think, potentially, yes.</p>
<p>That&rsquo;s why we&rsquo;ve started to work with teams here at the BBC and other partners on specific projects to help us identify what public service value we can bring to these new markets driven by data.</p>
<p>To be clear &ndash; we&rsquo;re experimenting at this stage, and we will learn what works, what people might like &ndash; and what areas we think the BBC can help with &ndash; as we go along. We&rsquo;re particularly interested in learning about how organisations can share data to get new insights and how people can safely move their data around. And we know that when it comes to data, people are rightly concerned about privacy, safety and security. That&rsquo;s why these trials will start small and controlled, so participants will have signed up knowing clearly what data of theirs is being used, why and how.</p>
<h4>So what have we been up to?</h4>
<p>Late last year, the DCMS <a href="https://www.gov.uk/government/publications/research-on-data-portability">published a report</a> which looked at the potential of personal data portability to stimulate innovation and competition in the UK. It found that the ability to safely and securely move personal data around could unlock huge economic and societal gains, but that there are big practical issues (both in the way organisations share data and how consumers use it) to resolve first.</p>
<p>Following this (and with DCMS, ICO and CDEI as observers), we&rsquo;re involved in two controlled trials of data sharing by 25 individuals. These trials test how it could be practically possible to put a person in control of the data they share about themselves with other companies, and what concerns this brings up.</p>
<p>The first trial is cross-sector, with the participants signing up to share data from a range of commercial companies &ndash; as well as the BBC. You can find out more about that <a href="https://www.ctrl-shift.co.uk/news/2019/06/17/release-of-data-mobility-infrastructure-sandbox-report">here</a>.&nbsp;</p>
<p>The second looks at bringing together data from media providers into a BBC data store (or what we&rsquo;re calling internally a BBC Box) to improve people&rsquo;s experiences when watching or listening to programmes. Bill has blogged about this <a href="https://www.bbc.co.uk/rd/blog/2019-06-bbc-box-personal-data-privacy">here</a>.</p>
<h4>What&rsquo;s next?</h4>
<p>Over the coming months, we&rsquo;ll continue looking at this area &ndash; with more experiments and closed trials.</p>
<p>We&rsquo;ll be sharing more about what we learn &ndash; and look at what value the BBC can bring you &ndash; ensuring this market develops in a way that maximises the huge potential benefits of data and shares them as widely as possible.</p>
<p>I'll be in touch.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>BBC+ relaunch - An experimental app for a more personal BBC</title>
      <description><![CDATA[Here's what's happening with the relaunch of the BBC+ app.]]></description>
      <pubDate>Wed, 12 Dec 2018 11:34:28 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/82cd8d1e-2f23-4eff-8f34-0ef38ca8854c</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/82cd8d1e-2f23-4eff-8f34-0ef38ca8854c</guid>
      <author>James Metcalfe</author>
      <dc:creator>James Metcalfe</dc:creator>
      <content:encoded><![CDATA[<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06vc28y.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06vc28y.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06vc28y.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06vc28y.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06vc28y.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06vc28y.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06vc28y.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06vc28y.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06vc28y.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p>Open just about any mobile app and your experience is no doubt being tailored for you in some way by algorithms, personalisation, AI, Big Data, machine learning... any more buzzwords you can think of?</p>
<p>This can be great for us as users... we get a better feed, faster.</p>
<p>With the refreshed BBC+ app, we will be making recommendations using data to take steps towards re-inventing the BBC for a new generation.</p>
<p>It also brings a great opportunity to be more experimental in our approach. We will be testing ideas quickly to realise the possibilities of a data-driven world, both in the way we work and the way we make our products. We have some of the most popular apps in the UK, and we know we could do a lot more to improve them and make them more welcoming, relevant and personal. We also need to understand how, as a public service organisation, we maintain our values in a data-driven world.</p>
<p>One of the biggest challenges with making our data more connected is that the information we have on our content is based on our organisational structure. This means you can indicate a topic that is relevant to you in the BBC News app (such as health), but that preference is not reflected in iPlayer, and so we currently don&rsquo;t recommend all the great health shows you might like. This can be frustrating, as you have to set up and manage each product separately.</p>
<p>To change this we are aiming for a more consistent and personalised experience. What if we could easily bring the best content we make, to you, from anywhere in the BBC? Would people discover and enjoy the BBC more? We need to take some risks, try some new and challenging ideas, and ask some tough questions as we work towards a more relevant and personal BBC for everyone.</p>
<p>With this in mind, we are today making an experimental update to our BBC+ app.</p>
<h4>Why change BBC+?</h4>
<p>BBC+ was set up as a personal BBC app with a similar goal, but without the support of the data capabilities we are trying now. It didn&rsquo;t take off as we&rsquo;d hoped. Much of what it offered is provided by BBC News and replicates the BBC Homepage.</p>
<p>In order to move forward we have decided to start again, while beginning to use machine learning and data. Our plan is to keep it small, to try new ideas as quickly as possible, learn as we go and improve based on regular feedback.</p>
<p>The new BBC+ app is a combined effort from the <a href="https://findouthow.datalab.rocks">Datalab</a> and BBC+ app development teams. We have also been working collaboratively with colleagues from across the BBC to understand the challenges we collectively face. My colleague Svetlana Videnova describes some of the <a href="http://www.bbc.co.uk/blogs/internet/entries/a26a25af-4012-4f00-9fe9-2cc639a76340">technical approach in her blog&nbsp;</a>and the Datalab team <a href="http://www.bbc.co.uk/blogs/internet/entries/50f047ef-d06b-487b-af3f-acfcb705c8e6">have released a free online course to help people understand machine learning, data science and how we work</a>.</p>
<p>For now the new BBC+ app contains short videos, however there is far more breadth and depth to explore than the old app. Our overall mission is to help you discover the gems that you didn&rsquo;t even know you were missing, at a time that suits you.</p>
<p>This is clearly an early starting point and, as we learn more, we expect to make changes often.</p>
<h4>Why now?</h4>
<p>We are releasing this update at an early experimental phase as we believe using data and machine learning in the BBC is a shared challenge and needs to be done in a transparent way.</p>
<p>Working in a closed group limits how much we can learn about machine learning and so we want to invite your feedback and discussion throughout the process. We know there are improvements we need to make and we want people to be directly involved in shaping the next stage. Our team are keen to gather thoughts and ideas on improvements, and so please use the feedback buttons in the app to send in your thoughts and comments. The team will be reporting back regularly on progress!</p>
<p>BBC+ is available on Android: <a href="https://play.google.com/store/apps/details?id=uk.co.bbc.bbc_plus">https://play.google.com/store/apps/details?id=uk.co.bbc.bbc_plus </a></p>
<p>and iOS: <a href="https://itunes.apple.com/gb/app/bbc-the-bbc-just-for-you/id1110317391?mt=8">https://itunes.apple.com/gb/app/bbc-the-bbc-just-for-you/id1110317391?mt=8</a></p>
<p>&nbsp;</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06vc2f9.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06vc2f9.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06vc2f9.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06vc2f9.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06vc2f9.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06vc2f9.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06vc2f9.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06vc2f9.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06vc2f9.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Datalab representing machine learning in the BBC - the experiment</title>
      <description><![CDATA[How the BBC's Datalab is using machine learning to enhance our digital products and services.]]></description>
      <pubDate>Wed, 12 Dec 2018 10:06:17 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/a26a25af-4012-4f00-9fe9-2cc639a76340</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/a26a25af-4012-4f00-9fe9-2cc639a76340</guid>
      <author>Svetlana  Videnova</author>
      <dc:creator>Svetlana  Videnova</dc:creator>
      <content:encoded><![CDATA[<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06vbxnd.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06vbxnd.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06vbxnd.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06vbxnd.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06vbxnd.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06vbxnd.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06vbxnd.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06vbxnd.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06vbxnd.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p>One of the key objectives for the BBC for the coming years is to focus on younger audiences.&nbsp;Machine Learning recommendation capabilities can help us achieve that.</p>
<p>In a recent speech to BBC staff, Tony Hall described our shared ambition &ldquo;to grow our weekly online reach with younger audiences from 55% to 90% within four years&rdquo;. We expect that providing personalised content via the Datalab platform to the new BBC+ app, including the learning throughout the experiment, will begin to contribute to this growth. Product Manager James Metcalfe describes the approach and objectives of BBC+ in <a href="http://www.bbc.co.uk/blogs/internet/entries/82cd8d1e-2f23-4eff-8f34-0ef38ca8854c">his blog post</a>.</p>
<p>The aim of <a href="https://findouthow.datalab.rocks/">Datalab</a>, part of BBC Design + Engineering, is to help achieve this vision and reach younger audiences, via a new way of working and by experimenting with new <a href="https://www.youtube.com/watch?v=fz3YEX8NgtI">technologies</a>.</p>
<p>Datalab is the first BBC platform working with Google Cloud Platform <a href="https://medium.com/bbc-design-engineering/how-we-deliver-with-gcp-at-the-bbc-1c9812acf3a1">(GCP)</a> in production. Pioneering this integration means refining our approach to information security and the data privacy approval process, as well as establishing a new infrastructure.</p>
<p>As if this wasn&rsquo;t challenging enough, we brought more excitement to our engineering team by including <a href="https://www.youtube.com/watch?v=-weU0Zy4Yd8">gRPC</a> integration, Elasticsearch, Kubernetes and Spinnaker for container-based deployments, and Drone as part of our stack. As a team, we decided to implement these technologies, even though not everyone was experienced with them. As a result, we have adopted some more than others but, again, one of the key points is that we learn along the way, and gain skills that will be valuable later in the programme.</p>
<p>As a platform, we had to connect to the existing BBC data stores, including our User Activity Store (UAS) database, as well as serving the media content from a different AWS database.</p>
<p>This infrastructure allowed us to provide the groundwork for data scientists to start exploring various methods to satisfy the needs of BBC products and master the personalised experience for BBC users, starting with the BBC+ app.</p>
<p>For example, new users, who have no previous history with the BBC, are given a cold start recommendation, so that they can begin their BBC+ journey. The first focus for the Datalab team was to create these recommendations to serve relevant content. It progressively incorporates a user&rsquo;s history, refines their content and helps them discover interesting BBC suggestions.</p>
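<p>A common way to implement the cold-start step &ndash; a generic sketch rather than the exact Datalab logic &ndash; is to fall back to globally popular items until a user has enough history to personalise:</p>

```python
from collections import Counter

def recommend(user_history, all_plays, model=None, k=3, min_history=5):
    """Serve model output once a user has enough history; otherwise fall back
    to globally popular clips they haven't already seen (the cold start)."""
    if model is not None and len(user_history) >= min_history:
        return model(user_history)[:k]
    popular = [clip for clip, _ in Counter(all_plays).most_common()]
    return [c for c in popular if c not in user_history][:k]

plays = ["news", "news", "drama", "news", "drama", "comedy"]
print(recommend([], plays, k=2))  # ['news', 'drama']
```

<p>As the user&rsquo;s history grows past the threshold, the same entry point switches over to the trained recommender, so the client never has to know which regime it is in.</p>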
<p>Finding out what data we had to work with was crucial. We quickly discovered that the metadata for some BBC content is inconsistent. This led us to conversations with our editorial colleagues on ways of tagging and creating content metadata in a more uniform manner, so that we can surface their output to the audience in a more personalised way.</p>
<p>The first content type to be ingested is video clips. In the future the aim is to include audio and articles. Currently in our Elasticsearch DB we have 1,137,598 clips. A set of filters was applied to provide only relevant clips, with complete metadata and editorial risks mitigated:</p>
<ul>
<li>Unique clips</li>
<li>Only English content (for now)</li>
<li>Filter out audio and weather (for now)</li>
<li>Filter out clips older than 2013 (for now)</li>
<li>Editorial risk filtering</li>
<li>128 BBC brands are not surfaced in the BBC+ app</li>
<li>8 master brands were filtered out (mainly to help serve only English content for now)</li>
</ul>
<p>That leaves us with 131,626 clips available to BBC+ users.</p>
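<p>The filters above are essentially a cascade of predicates applied in sequence. This sketch shows the shape of such a pipeline; the clip fields and rules are simplified stand-ins, not the real editorial criteria:</p>

```python
# Each filter is a predicate; a clip must pass all of them to be surfaced.
FILTERS = [
    lambda c: c["language"] == "en",                  # English content only (for now)
    lambda c: c["kind"] not in ("audio", "weather"),  # filter out audio and weather
    lambda c: c["year"] >= 2013,                      # drop clips older than 2013
    lambda c: not c["editorial_risk"],                # editorial risk filtering
]

def surface(clips):
    return [c for c in clips if all(f(c) for f in FILTERS)]

clips = [
    {"id": "a", "language": "en", "kind": "video", "year": 2018, "editorial_risk": False},
    {"id": "b", "language": "cy", "kind": "video", "year": 2018, "editorial_risk": False},
    {"id": "c", "language": "en", "kind": "audio", "year": 2018, "editorial_risk": False},
]
print([c["id"] for c in surface(clips)])  # ['a']
```

<p>Expressing each rule as a separate predicate makes it easy to relax the &ldquo;for now&rdquo; restrictions independently later on.</p>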
<p>The following techniques are being used by our data science team, to experiment, score and create &ldquo;better&rdquo; recommendations:</p>
<ul>
<li>Model-based collaborative filtering: we&rsquo;re using embeddings for our content using word2vec models, treating content items as &ldquo;words&rdquo; and our playlists as &ldquo;sentences&rdquo;</li>
<li>Offline scoring: to measure how a particular recommender system is performing, we&rsquo;re using different metrics, like recency, popularity, Normalized Discounted Cumulative Gain (nDCG) and hit-rate. This helps us to select which versions of our models should go for online scoring</li>
<li>Online scoring: we have a system in place to score the recommenders with live user data, using A/B or hit-rate testing, whenever necessary</li>
<li>Combining ML with editorial guidelines: we can prioritise one genre or brand against the others, to fit the editorial needs and better match our audience expectation</li>
<li>To help share what we know <a href="http://www.bbc.co.uk/blogs/internet/entries/50f047ef-d06b-487b-af3f-acfcb705c8e6">we created a course</a>, to establish a good understanding of what we are building</li>
</ul>
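<p>Of the offline metrics listed, nDCG is the least self-explanatory. Here is a minimal implementation for binary relevance (an item scores 1 if the user actually went on to play it, 0 otherwise); the rankings and played sets are invented examples:</p>

```python
import math

def ndcg(recommended, played, k=None):
    """Normalised Discounted Cumulative Gain with binary relevance:
    hits near the top of the ranking count more than hits further down."""
    recs = recommended[:k] if k else recommended
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recs) if item in played)
    # Ideal DCG: all the hits the user made, packed at the top of the list.
    n_hits = min(len(played), len(recs))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(n_hits))
    return dcg / ideal if ideal else 0.0

# The same single hit scores better in first place than in third place.
print(ndcg(["a", "b", "c"], {"a"}))           # 1.0
print(round(ndcg(["b", "c", "a"], {"a"}), 3))  # 0.5
```

<p>Because the score is normalised to [0, 1], it can be averaged across users with histories of very different lengths.</p>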
<p>We came out of this iterative process full of valuable insights, and building a new ML platform was just one of the things we discovered. Datalab established a positive team culture, with a great deal of multi-disciplinary learning. Despite the bumpy road, we now have a clear vision of our engineering and data science responsibilities, including A/B testing our process, trying various agile methodologies, and bringing teams in Salford and London together for closer collaboration.</p>
<p>We are excited about the next steps, and there is a lot to do on all aspects of our platform: infrastructure, engineering, devops, data science and ML. We will be inviting a new BBC team into the Datalab world: beginning with an exploratory session with the Voice Team over Christmas. We will also be proposing collaborations with other product groups, including R&amp;D and News, early in 2019 to continue to experiment, push the boundaries of our exploration further and innovate ML in the BBC.</p>
<p>If you would like to know more about Datalab, our journey, or you would like to share your own experience, feel free to contact us at datalab@bbc.co.uk.</p>
<p>And in other important news: we are still recruiting! <a href="https://findouthow.datalab.rocks/">https://findouthow.datalab.rocks/</a></p>
<p>BBC+ is available on Android: <a href="https://play.google.com/store/apps/details?id=uk.co.bbc.bbc_plus">https://play.google.com/store/apps/details?id=uk.co.bbc.bbc_plus</a></p>
<p>and iOS: <a href="https://itunes.apple.com/gb/app/bbc-the-bbc-just-for-you/id1110317391?mt=8">https://itunes.apple.com/gb/app/bbc-the-bbc-just-for-you/id1110317391?mt=8</a></p>
<p>&nbsp;</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>How we built BBC Sounds on the web</title>
      <description><![CDATA[We designed BBC Sounds as a completely new web application, re-evaluating architecture and taking advantage of new web technologies. It’s built upon our experience of developing and maintaining iPlayer Radio and BBC Music.]]></description>
      <pubDate>Mon, 26 Nov 2018 16:26:00 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/8baa1b0c-1f3f-405a-849d-7d19eb125d50</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/8baa1b0c-1f3f-405a-849d-7d19eb125d50</guid>
      <author>Julia Miroshina</author>
      <dc:creator>Julia Miroshina</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
<p>In late 2017, engineering teams that had previously worked on iPlayer Radio and BBC Music began work on a new product. Launched on 16 October 2018, <a href="https://www.bbc.co.uk/sounds">BBC Sounds</a> is an audio product from the BBC that brings together live and on-demand radio, music and podcasts into a single personalised experience.</p>
<p>In this blog post we&rsquo;ll talk about the technologies and engineering practices that helped us to build the BBC Sounds web application.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06swvc0.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06swvc0.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06swvc0.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06swvc0.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06swvc0.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06swvc0.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06swvc0.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06swvc0.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06swvc0.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
<p>We designed BBC Sounds as a completely new web application, built upon our experience of developing and maintaining iPlayer Radio and BBC Music. Starting fresh allowed us to re-evaluate our architecture and take advantage of new web technologies.</p>
</div>
<div class="component prose">
    <h4>Technology</h4>
<p><strong>Cloud infrastructure and performance</strong></p>
<p>The web version of BBC Sounds was released to the existing iPlayer Radio audience as well as absorbing traffic from other parts of the BBC web estate. It is a cloud hosted application built to scale to handle a large number of requests, with the business logic encapsulated into the API layer and shared amongst clients.</p>
<p>As our applications migrate into the cloud from on-premises infrastructure, our engineering teams assume DevOps responsibilities. We often use <a href="https://gatling.io">Gatling</a> for load testing, along with mock servers that provide upstream responses for specific scenarios. For BBC Sounds we developed new metrics and defined actionable alerting. This improved application availability, making it more resilient and scalable than ever before.</p>
<p>To find out how BBC serves its web pages to the public, check out the post &ldquo;<a href="http://www.bbc.co.uk/blogs/internet/entries/328e1b75-26f9-49e9-9ed1-5abd481f03f3">How we deliver BBC pages to the internet</a>&rdquo;.</p>
<h4>New stack</h4>
<p>Building a new product provides an opportunity to start using new technology. BBC Sounds has been built using <a href="https://nodejs.org/en/">Node.js</a> and <a href="https://reactjs.org">React</a> with server-side rendering.</p>
<p><em>&ldquo;BBC Sounds was built from the ground up, using Node.js, React, Redux and Express. The migration to Node.js wasn&rsquo;t completely alien to us, as by 2017 we were already using it for client-side pipeline, such as Gulp, Sass, minifying and linting tasks. What was new was transitioning to server JavaScript using Express as web framework for Node.js. We chose it for its simplicity and minimal footprint, allowing us to get application setup quickly, but also providing engineers with freedom to define development style.&rdquo;</em><br />&mdash; Jason Williams, Principal Software Engineer, BBC Sounds</p>
<p>We began by thinking about the UI components, and how they would be laid out. Our existing Radio network home pages (Radio 1, Radio 2 etc) were also built from UI components, but they were completely independent packages separately fetched to build the page. The problem with this approach was that modifying the components in the browser remained difficult, as they were composed server-side (using <a href="https://twig.symfony.com">Twig</a> templates). We needed to change our approach.</p>
<p><strong>Building Components with Atomic Design</strong></p>
<p>By using React to build components we can render sections of the page server-side and then continue to modify them in the browser without duplication of code. Some of these components need to be reusable, so we needed a versatile, but consistent approach.</p>
<p>We used <a href="http://atomicdesign.bradfrost.com/chapter-2/">Atomic Design methodology</a> to break components up into distinct levels: atoms, molecules and organisms.</p>
<p>Here is an example using Radio 1&rsquo;s Listen Live roundel on the homepage:</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06swvk7.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06swvk7.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06swvk7.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06swvk7.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06swvk7.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06swvk7.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06swvk7.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06swvk7.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06swvk7.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Example of a component with atomic design</em></p></div>
<div class="component prose">
    <p>The process starts with the design of a component from our UX team. As you can see, they want to show the user what&rsquo;s playing live on our radio networks.</p>
<p>We start to break this down by identifying the smallest discrete units of the design, e.g. the ProgressBar, the live Label, a ResponsiveImage etc. These are atoms, which we can compose to represent a single network &mdash; let&rsquo;s call it a NetworkItem molecule. Finally, we can represent our networks in a scrollable list &mdash; a Carousel organism.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06swvtk.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06swvtk.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06swvtk.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06swvtk.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06swvtk.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06swvtk.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06swvtk.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06swvtk.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06swvtk.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Example of NetworkItem being populated, data comes from API layer</em></p></div>
<div class="component prose">
<p>We then build and test each component separately, with API simplicity and reusability in mind. All that&rsquo;s left is to hydrate the components with the data from our API and voila, it renders! The more reusable components we build, the faster we can test different configurations with users, and deliver new sections of Sounds.</p>
<p><strong>Rendering on the server and isomorphism</strong></p>
<p>We run server-side JavaScript alongside a JavaScript front-end framework. It helps to avoid having disparate languages for rendering and interacting with the UI. The application renders the full markup of its initial state in the server context. Then, we deliver both the state and a copy of the transpiled application, to pick up where we left off client-side.</p>
<p>When a user visits a page in Sounds, the application responds with the static HTML for the page requested. Users don&rsquo;t have to wait for JavaScript to load. Then, as the React app mounts atop the existing HTML, components become more interactive: buttons appear on items to fetch more episodes from a podcast, or to allow users to subscribe.</p>
<p><em>"The benefits of this approach are multiple: ready-to-render initial responses are great for SEO and accessibility where good network connectivity and JavaScript is not always a given. But for engineers it&rsquo;s also powerful to remove the context switch between client/server languages and tools, to code one language everywhere."</em><br />&mdash; Stephen Roberts, Senior Software Engineer, BBC Sounds</p>
<p><strong>React</strong></p>
<p>Choosing a front-end framework from the plethora offered up by the JavaScript ecosystem is no small feat for a new product. Our decision to use React was ultimately guided by some core features we knew Sounds would need to be future-proof and flexible to changing requirements, while also scaling horizontally with our development capability.</p>
<p>React is among the frameworks which implement a virtual DOM (VDOM). The VDOM is a layer of abstraction which represents the UI in memory and employs a &ldquo;<a href="https://reactjs.org/docs/reconciliation.html">diffing algorithm</a>&rdquo; to apply updates to the DOM efficiently given a change in application state. This frees us from writing boilerplate data-binding and DOM-manipulation code, so we can focus on building the audience-facing components which make up BBC Sounds.</p>
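<p>As a toy illustration of the diffing idea (not React&rsquo;s actual reconciliation algorithm), consider comparing two lightweight descriptions of a list and emitting only the updates that need to touch the real DOM:</p>

```javascript
// A minimal "virtual node" and a naive diff over sibling lists.
function h(tag, text) {
  return { tag, text };
}

function diff(oldNodes, newNodes) {
  const patches = [];
  const len = Math.max(oldNodes.length, newNodes.length);
  for (let i = 0; i < len; i++) {
    const a = oldNodes[i];
    const b = newNodes[i];
    if (!a) patches.push({ type: 'insert', index: i, node: b });
    else if (!b) patches.push({ type: 'remove', index: i });
    else if (a.tag !== b.tag) patches.push({ type: 'replace', index: i, node: b });
    else if (a.text !== b.text) patches.push({ type: 'text', index: i, text: b.text });
  }
  return patches;
}

const before = [h('li', 'Episode 1'), h('li', 'Episode 2')];
const after = [h('li', 'Episode 1'), h('li', 'Episode 2 (updated)'), h('li', 'Episode 3')];
// Only the changed item and the new item produce patches; the
// unchanged first item costs nothing to apply.
console.log(diff(before, after).length); // 2
```

<p>React&rsquo;s real algorithm also handles keys, component types and subtree reordering, but the payoff is the same: a state change translates into a minimal set of DOM updates.</p>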
<p>React is a mature, open-source framework from Facebook, with a large community of developers and learning resources. Many large-scale applications run on React in production outside the BBC, but also internally &mdash; product teams have been reaping the benefits of using React and React-like frameworks. Choosing React, therefore, leaves the door open for inter-product code and component sharing.</p>
<p><strong>Redux</strong></p>
<p>Delivering richer experiences around our content on the web translates to added complexity in frontend logic &mdash; so it becomes increasingly important for engineers to be able to: reason about the internal state of our JS, communicate state changes to UI components efficiently, and debug when things go wrong.</p>
<p>In Sounds, we use <a href="https://redux.js.org">Redux</a> as a predictable state container to provide our React app with a central, read-only store for its state. The state store can only be updated via a set of pure functions called reducers. These functions take the current app state plus a plain JavaScript object describing a state change (an action), and update the store accordingly. React-Redux then connects the data from the store to React components, creating an easy-to-understand pattern for unidirectional data flow.</p>
<p>In this way, application state is always predictable; any sequence of actions can be replayed on state with the same result. And, since the reducer functions that handle the changes are pure, they&rsquo;re very straightforward to test.</p>
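<p>A reducer for a hypothetical subscriptions slice of state shows the pattern (illustrative action names, not the actual Sounds store):</p>

```javascript
// A pure, Redux-style reducer: same state and actions in, same
// state out, with no side effects.
function subscriptions(state = [], action) {
  switch (action.type) {
    case 'SUBSCRIBE':
      return [...state, action.programmeId];
    case 'UNSUBSCRIBE':
      return state.filter((id) => id !== action.programmeId);
    default:
      return state;
  }
}

// Because the reducer is pure, replaying the same action sequence
// always yields the same state, which is what makes it easy to test.
const actions = [
  { type: 'SUBSCRIBE', programmeId: 'brand-1' },
  { type: 'SUBSCRIBE', programmeId: 'brand-2' },
  { type: 'UNSUBSCRIBE', programmeId: 'brand-1' },
];
const state = actions.reduce(subscriptions, undefined);
console.log(state); // ['brand-2']
```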
<p>Just as we changed the way we build applications, we also changed the way the engineering team works and adopted new practices.</p>
<h4>Engineering team practices</h4>
<p>Each software engineering team comprises a project manager, product team (e.g. product owner, business analyst), embedded UX designers and QA testers, a team lead and software engineers, with streamlined processes and agile methodology in place, similar to other BBC teams.</p>
<p>We use a variety of tools to speed up our development process and facilitate communication, e.g. GitHub and custom Slack integrations. There are also various practices that the engineering team has been following &mdash; some of them are listed below.</p>
<p><strong>Cross-discipline collaboration</strong></p>
<p>Developers often pair with UX designers, especially for branding related tasks and fine-tuning interactive experiences. Developing the component library has also been a joint effort from both engineering and UX teams to ensure consistency for users.</p>
<p>Running regular sessions with engineers working on the service layer and collaborating with other client teams ensures consistency in requirements and reduces complexity for the clients. You can read more about <a href="http://www.bbc.co.uk/blogs/internet/entries/7e65dff6-9bf1-4f98-b185-3b424939c8ce">developing Sounds in the API</a>.</p>
<p><strong>Open source and inner source</strong></p>
<p>It has been an ongoing aspiration within the BBC to open source more of our work, as well as to use inner source tools and libraries.</p>
<p>Here are examples of open-source BBC libraries that have been used for Sounds development on the web:</p>
<ul>
<li><a href="https://github.com/bbc/bbc-a11y">bbc-a11y</a> &mdash; BBC Accessibility Guidelines Checker</li>
<li><a href="https://github.com/bbc/express-simple-timing">express-simple-timing</a> &mdash; <a href="http://expressjs.com">Express</a> middleware that sets Server Timing API headers and optionally sends timers to stats systems.</li>
<li><a href="https://github.com/bbc/grandstand">BBC Grandstand</a> &mdash; collection of common CSS abstractions and utility helper classes</li>
<li><a href="https://github.com/bbc/speculate">speculate</a> &mdash; automatically generates an RPM Spec file for a Node.js project</li>
</ul>
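<p>As a flavour of what the timing middleware produces, a Server-Timing header value is just a comma-separated list of named metrics. The sketch below assembles one by hand; it is a hypothetical helper, not express-simple-timing&rsquo;s actual API.</p>

```javascript
// Build a Server-Timing header value from a list of timed metrics.
function serverTimingHeader(timings) {
  return timings
    .map(({ name, duration, description }) => {
      let entry = `${name};dur=${duration}`;
      if (description) entry += `;desc="${description}"`;
      return entry;
    })
    .join(', ');
}

const header = serverTimingHeader([
  { name: 'db', duration: 53 },
  { name: 'render', duration: 120, description: 'Server-side render' },
]);
console.log(header);
// db;dur=53, render;dur=120;desc="Server-side render"
```

<p>Browsers surface these values in their developer tools, so a response carrying this header exposes its server-side timing breakdown to anyone debugging it.</p>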
<p><strong>Architecture decision records (ADRs)</strong></p>
<p>Following API-led product development, clients of the API contribute to architecture decisions. Depending on context and complexity, these decisions could be discussed in the form of pull requests, white-boarding sessions, and stubs.</p>
<p>Architecture decision records are short documents which give our teams a current understanding of the context and decisions the team is making. As a decision is made, it gets documented as an ADR (<a href="http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions">Architecture Decision Record</a>) and added to version control with the code.</p>
<p><em>&ldquo;The hope is these can act as a unit of knowledge being passed across team boundaries to allow us, much like a distributed system made of people, to achieve an eventually consistent base of knowledge of our technical estate. They also form an historical record for a team, to help them bring people up to speed and remind themselves of their own technical foundations.&rdquo;</em><br />&mdash; Paul Caporn, Lead Technical Architect BBC TV&amp;Radio</p>
<p><strong>OKR methodology</strong></p>
<p>In Sounds each team is responsible for setting its own Objectives and Key Results &mdash; a methodology which is described in detail in the book &ldquo;Measure What Matters: How Google, Bono, and the Gates Foundation Rock the World with OKRs&rdquo; by J. Doerr. OKRs are meant to set strategy and goals over a specified amount of time for teams and individuals. At the end of that time period, OKRs provide a reference to evaluate whether these objectives have been met.</p>
<p><strong>A/B testing</strong></p>
<p>While developing Sounds we started using A/B testing tools and success measures extensively to ensure that the changes we&rsquo;re making are the right ones for our audience. Every visit of an audience member is an opportunity to improve the product. Our experimentation toolkit allows teams to quickly test variations against an existing experience to determine the optimal approach in terms of a set of metrics. You can find out more about experimentation in the BBC in this blog post on optimising the iPlayer experience.</p>
<p><strong>10% time</strong></p>
<p>Sounds teams have been making use of the 10% time initiative <a href="https://iplayer.engineering/autonomy-innovation-using-10-time-e04dc7f184e9">adopted by BBC software engineering teams</a>, which enables developers to research and implement innovative solutions. A typical sprint in Sounds teams lasts two weeks, which means that one day per sprint is dedicated to 10% initiatives. This ensures 10% doesn&rsquo;t skew delivery estimates for project managers, allowing a flat distribution of allocated time per sprint. 10% time has been highly beneficial for engineers, giving them space to grow, learn and think outside the box.</p>
<h4>What&rsquo;s next?</h4>
<p>In 2019 we&rsquo;re looking to personalise BBC Sounds even more, and follow our data-driven approach to implement meaningful improvements. By knowing our audience we aim to deliver a better experience of Sounds, innovate on content, challenge our users, and provide an equal opportunity to access the best content for everyone.</p>
<p>If you are interested in working on projects like BBC Sounds, solving exciting, complex and large scale problems, why not come and join us? <a href="https://careerssearch.bbc.co.uk/jobs/custom/?fields%5B32%5D=571">See latest jobs in D&amp;E TV &amp; Radio.</a></p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Building BBC Sounds in the API</title>
      <description><![CDATA[Patrick Cunningham explains some of the background architecture behind the new service BBC Sounds.]]></description>
      <pubDate>Thu, 08 Nov 2018 09:44:22 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/7e65dff6-9bf1-4f98-b185-3b424939c8ce</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/7e65dff6-9bf1-4f98-b185-3b424939c8ce</guid>
      <author>Patrick Cunningham</author>
      <dc:creator>Patrick Cunningham</dc:creator>
      <content:encoded><![CDATA[<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06r1r1g.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06r1r1g.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06r1r1g.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06r1r1g.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06r1r1g.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06r1r1g.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06r1r1g.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06r1r1g.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06r1r1g.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Radio and Music Services team (minus Dan and Ray)</em></p></div>
<div class="component prose">
    <p>For the past 3 years, the Radio and Music Services (RMS) team has been making APIs for iPlayer Radio, BBC Music and podcasts in the Radio and Music metadata space. We are an Agile Software Development team made up of 8 Software Engineers, 3 testers, a Product Owner, Project Manager and Business Analyst. Our vision is:</p>
<p><em>&ldquo;We encapsulate business logic for Radio &amp; Music products on all platforms. We add value by providing the right blend of metadata, reliably and fast&rdquo;</em></p>
<p>In 2017 the Radio and Music department were tasked with building a new personalised audio product built using an API first approach. I&rsquo;m going to talk about what that means to us and how we went about building Sounds in the API.</p>
<p>In the past we typically made API endpoints designed for specific features our clients requested from us; they would write new client code or modify existing code to integrate with these, and then deliver the feature. Clients would still integrate with various BBC APIs to build products, but the benefits of having a platform-independent API to provide a single integration point to these services were obvious. We decided for Sounds to go one step further and define the layout and content of the product in an API.</p>
<h4>The stack</h4>
<p>We love Scala (after we overcame the learning curve) and our APIs are built using predominantly <a href="https://doc.akka.io/docs/akka-http/current/introduction.html">Akka HTTP</a> and <a href="https://doc.akka.io/docs/akka/current/stream/stream-introduction.html">Akka Streams</a>. We&rsquo;ve found these tools provide excellent performance on modern cloud server architecture, and are scalable, resilient and a fantastic fit for the kind of concurrent retrieval of data we need from various BBC systems and databases.</p>
<p>Programme metadata is our bread and butter, and customising that to each user&rsquo;s needs is at the core of Sounds. We have, at the time of writing, roughly 350,000 pieces of available audio we want users to be able to access. We think we should be able to serve up some of the oldest content, <a href="https://www.bbc.co.uk/sounds/play/p03yjdfq">Britain declares war on Germany</a>, just the same as the latest <a href="https://www.bbc.co.uk/sounds/play/brand:b0080x5m/p00g3r1s">Breakfast show with Greg James</a>.</p>
<p>We follow the <a href="https://info.lightbend.com/COLL-20XX-Reactive-Microservices-Architecture-RES-LP.html">Reactive Microservices Architecture</a> made by the super smart people at <a href="https://www.lightbend.com">Lightbend</a> as much as we can.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06r1r9m.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06r1r9m.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06r1r9m.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06r1r9m.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06r1r9m.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06r1r9m.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06r1r9m.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06r1r9m.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06r1r9m.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Overview of Sounds API architecture</em></p></div>
<div class="component prose">
    <p>Do one thing and do it well.</p>
<p>We&rsquo;ve tried to stick to this principle to give us the flexibility to deploy and scale services individually.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06r1rj2.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06r1rj2.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06r1rj2.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06r1rj2.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06r1rj2.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06r1rj2.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06r1rj2.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06r1rj2.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06r1rj2.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>APIs — Do one thing and do it well, Database per microservice</em></p></div>
<div class="component prose">
    <p>The internal architecture at the BBC means that we typically get the personalised information used to hydrate programmes from various sources:</p>
<ul>
<li>User Activity Service for play history, Bookmarks and Subscribing to Programmes</li>
<li>Recommendations API for programmes recommended to a user from their listening history</li>
<li>BBC Account for authentication to the BBC cross product Identification system</li>
</ul>
<p>We use the <a href="https://microservices.io/patterns/data/api-composition.html">API composition design pattern</a> for returning most of this content. The Personalised Programmes and Recommendations services are examples of this. We pass a user&rsquo;s authentication token to an external service and get a list of programme identifiers back, then verify that those programmes are available and return a list of programmes back upstream.</p>
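<p>The flow can be sketched with stubbed services; the function names and data below are hypothetical stand-ins for the real BBC services:</p>

```javascript
// Stand-in for the external personalisation service: token in,
// programme identifiers out.
async function fetchBookmarkedIds(authToken) {
  return ['p001', 'p002', 'p003'];
}

// Stand-in for the availability check against programme metadata.
async function fetchAvailableProgrammes(ids) {
  const catalogue = {
    p001: { id: 'p001', title: 'Episode A' },
    p003: { id: 'p003', title: 'Episode C' },
  };
  return ids.map((id) => catalogue[id]).filter(Boolean);
}

// API composition: resolve identifiers, verify availability, and
// return full programme objects upstream. p002 is dropped because
// it is no longer available.
async function personalisedProgrammes(authToken) {
  const ids = await fetchBookmarkedIds(authToken);
  return fetchAvailableProgrammes(ids);
}

personalisedProgrammes('user-token').then((programmes) => {
  console.log(programmes.map((p) => p.title)); // ['Episode A', 'Episode C']
});
```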
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06r27m1.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06r27m1.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06r27m1.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06r27m1.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06r27m1.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06r27m1.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06r27m1.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06r27m1.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06r27m1.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>API Composition in our Personalised APIs</em></p></div>
<div class="component prose">
    <p>A piece of work undertaken by a combination of Engineering, UX, Product and Project teams across RMS and client teams was to simplify the domain objects returned by the API. We hedged our bets that we would mostly be returning a combination of lists of programme content or single entities. We whittled it down to 4 main types:</p>
<p>1) Playable<br />The core of the product and an object you will see everywhere. In a nutshell, something you can play. It can be an episode, clip or live stream but clients don&rsquo;t need to know that</p>
<p>2) Display<br />A piece of content, image, link, text or placeholder</p>
<p>3) Container<br />A container of other programme objects. Typically this can be a programme brand or series, a category, radio network or editorially curated collection of programmes</p>
<p>4) Broadcast<br />Programmes with specific dates and times that are linked to Playable items (on-demand or live streams) or Display items (typically things that aren&rsquo;t available yet)</p>
<p>Each one of these object types was refined to return all necessary information but ONLY what was agreed with all teams. Making them as small as possible meant we optimised the size of content, efficiency and consistency of our responses.</p>
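<p>A client consuming these objects can branch on a single type field. The sketch below is illustrative; the field names are not the actual Sounds API schema.</p>

```javascript
// Render a one-line description for each of the four domain types.
function describe(item) {
  switch (item.type) {
    case 'playable':
      // Episode, clip or live stream: the client doesn't care which.
      return `Play: ${item.title}`;
    case 'display':
      return item.text;
    case 'container':
      return `${item.title} (${item.items.length} items)`;
    case 'broadcast':
      return `${item.title} at ${item.start}`;
    default:
      throw new Error(`Unknown type: ${item.type}`);
  }
}

console.log(describe({ type: 'playable', title: 'Greg James' })); // Play: Greg James
console.log(describe({ type: 'container', title: 'Collections', items: [1, 2] })); // Collections (2 items)
```

<p>Keeping every response down to these four shapes is what lets each platform write one small renderer per type, rather than special-casing every kind of content.</p>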
<p>We thought another good bet would be to define the experience of the product in the API and maintain consistency across platforms. This also means there is one place to change layout, messaging and internationalisation. Clients are then free to apply their expertise in rendering that view in the best possible way for their platform. We began investigating returning all the content required to load a metadata content page in one request, benefitting mobile clients in particular. The continuing improvement of our services gave us more confidence in their performance and response times: we could manage all of this inside the Experience API and deliver all content apart from the platform-specific layout. As we developed this, we found that we could improve not only page load times and latency, but also client device and server performance, simplifying client code. This works particularly well for Search results, the Sounds homepage (Discover), Containers (Brand, Series and Collections) and My Sounds in the app.</p>
<p>It&rsquo;s been an enjoyable (but busy) year, with lots of interesting challenges and fun problems to solve. The team are really proud of their work and hope our users will get the benefit of it, whether they&rsquo;re enjoying Sounds through the mobile apps (<a href="https://play.google.com/store/apps/details?id=com.bbc.sounds&amp;hl=en_GB">Android</a>, <a href="https://itunes.apple.com/gb/app/bbc-sounds/id1380676511?mt=8">iOS</a>) or the <a href="https://www.bbc.co.uk/sounds">web</a>.</p>
<p>In 2019 we&rsquo;re looking to personalise BBC Sounds even more through audience segmentation and releasing new features through multivariate testing, providing true cross-platform multivariate tests and measurement capabilities to our clients. We&rsquo;re also looking at content discovery feeds, such as popularity, allowing the audience to find more of the audio content they love, and at improving the &lsquo;continuous play&rsquo; experience. We&rsquo;ll let you know how we get on.</p>
<p>If you are interested in working on projects like BBC Sounds or a wider variety of exciting, complex and large scale problems, come and join us: Latest jobs in <a href="https://careerssearch.bbc.co.uk/jobs/custom/?fields%5B32%5D=571">D&amp;E TV &amp; Radio</a>.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Opening up our data science course materials</title>
      <description><![CDATA[Felix Mercer Moss explains how the Datalab course materials are being opened up to the public.]]></description>
      <pubDate>Tue, 23 Oct 2018 11:30:00 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/50f047ef-d06b-487b-af3f-acfcb705c8e6</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/50f047ef-d06b-487b-af3f-acfcb705c8e6</guid>
      <author>Felix Mercer Moss</author>
      <dc:creator>Felix Mercer Moss</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
    <p>Today, the <a href="https://findouthow.datalab.rocks/">BBC Datalab team</a> is releasing onto the open web the <a href="https://github.com/bbc/datalab-ml-training">course materials</a> it developed for newcomers to data science at the BBC.</p>
<p>Data science, and particularly machine learning, is trending like never before. At just 12 minutes (!), <a href="https://en.wikipedia.org/wiki/Conference_on_Neural_Information_Processing_Systems">NIPS tickets</a> <a href="https://www.forbes.com/sites/williamfalcon/2018/09/05/the-new-burning-man-the-ai-conference-that-sold-out-in-12-minutes/#359ec6087a96">sold out quicker</a> than <a href="https://en.wikipedia.org/wiki/Burning_Man">Burning Man</a> this year and if anyone still needs convincing, check out <a href="https://trends.google.com/trends/explore?date=2004-01-01%202018-10-15&amp;geo=GB&amp;q=data%20science,machine%20learning,hipster">this plot</a>:</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06pmhyj.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06pmhyj.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06pmhyj.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06pmhyj.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06pmhyj.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06pmhyj.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06pmhyj.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06pmhyj.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06pmhyj.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p>It shows us that as recently as this year, both terms overtook the word &ldquo;hipster&rdquo; in web searches. While this may not be great news for <a href="https://www.bloomberg.com/news/articles/2017-05-03/from-unicorns-to-avocado-toast-hipster-fads-jack-up-food-prices">avocado sales</a>, it does look like it&rsquo;s a good time to be someone working with data! But as ever, with exposure can often come <a href="https://www.kdnuggets.com/2018/07/cartoon-data-science-religion.html">confusion</a>. So first, let&rsquo;s try and get some definitions straight.</p>
<p>The BBC handles terabytes (read - &ldquo;a lot&rdquo;) of new data every day, and data science is the field of study that helps us use that data to make better decisions and deliver greater value to our audiences. Machine learning, meanwhile, is just one class of statistical techniques widely used in data science that, as it turns out, is particularly powerful. However, it is not magic. Instead it is a collection of tools and techniques that use mathematics to draw conclusions from data.</p>
<p>It could be argued that nothing indicates the need to publish a training course more than the moment a topic surpasses in web traffic a cool word millennials (<a href="https://www.telegraph.co.uk/travel/advice/the-era-of-the-millennial-hipster-is-almost-over/">used to</a>?) use. But our team did have other reasons for opening this course up. Here are just three:</p>
<p>1. To get people excited about learning more data science and machine learning; particularly when <a href="https://www.linkedin.com/pulse/skills-companies-need-most-2018-courses-get-them-paul-petrone/?trk=li_corpblog_jobs_skills_2018">data literacy is so high on many employers&rsquo; wish lists. </a></p>
<p>2. To share some of the interesting problems faced by the BBC in the data space.</p>
<p>3. To demonstrate how large organisations such as the BBC can use their audience&rsquo;s data to generate a positive impact.</p>
<p>Continuing to create engaging content, while the expectations of our (especially younger) audience members are <a href="https://qz.com/1128973/people-watch-netflix-at-work-and-in-public-bathrooms/">constantly shifting</a>, is one of our greatest challenges. To ensure we meet these new expectations, it is important for us to analyse and understand the conditions and behaviour patterns that lead to more, or less, engagement.</p>
<p>In this course, we use data to explore the question: &ldquo;What makes BBC audiences engaged?&rdquo;. If you have at least a basic understanding of Python programming (statistics would be a bonus!) and a healthy interest in taking your first steps in data science, we think you will have a lot of fun exploring our course.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06pmk35.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06pmk35.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06pmk35.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06pmk35.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06pmk35.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06pmk35.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06pmk35.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06pmk35.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06pmk35.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>And if you found this training easy and had fun doing it, why not join us?</em></p></div>
<div class="component prose">
    <p>We take readers on a four-part journey that follows the typical pattern of many data-led projects. Each part is expected to take readers around one hour to complete and focuses upon: <em>data exploration, data transformation, classification models and regression models. </em></p>
<p>In <em>data exploration</em>, we first look into how to formulate our data science problem and perform some preliminary analyses to get a better feel for our data. In <em>data transformation</em> we then introduce readers to some basic machine-learning theory and walk through the process of how we prepare our data for ingestion into our statistical models. In the final two parts (<em>classification</em> and <em>regression</em>), the actual machine-learning starts, where we explain how to train, evaluate and choose the most appropriate model for our purpose.</p>
<p>The dataset that you get to work on contains the logs from 10,000 BBC iPlayer users. As you might expect, a public service broadcaster like the Beeb doesn&rsquo;t take data privacy lightly. So while the dataset we use is &lsquo;real&rsquo;, you can be sure that we have enforced particularly strong anonymisation so that the identity of the users is impossible to recover.</p>
<p>With an introductory course like this, we don&rsquo;t expect to make anyone an expert in data science overnight. However, we do hope that those of you who do take it will have a lot of fun, while also gaining some valuable insight into the decisions we make when working with data at the BBC. The topics covered in the online course only scratch the surface of the data science problems we encounter daily at the BBC.</p>
<p>If you would like to find out more about the challenges we face and how we are using data science and machine learning to find innovative solutions to connect with audiences, please get in touch!</p>
<p>We are always looking to improve the content of the course, so if you have any feedback or ideas for further instalments we would really like to hear from you.</p>
<p>Link to course: <a href="https://github.com/bbc/datalab-ml-training">https://github.com/bbc/datalab-ml-training</a></p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>My Life, My Data, #MyTomorrow</title>
      <description><![CDATA[Highlights from the Tomorrow's World campaign around data privacy.]]></description>
      <pubDate>Thu, 24 May 2018 14:17:03 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/969b8e44-cec7-40ef-ab39-c63d710f890b</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/969b8e44-cec7-40ef-ab39-c63d710f890b</guid>
      <author>Chris Sizemore</author>
      <dc:creator>Chris Sizemore</dc:creator>
      <content:encoded><![CDATA[<div class="component prose">
    <p>Tomorrow(&rsquo;s World) starts today.</p>
<p>Personal data is one of the most important issues we&rsquo;re all grappling with these days, but it can all feel so confusing and abstract that we tend to dismiss it with a mixture of 😕, 🤯,&nbsp;&macr;\_(ツ)_/&macr;.</p>
<p>What is personal data anyhow? It&rsquo;s data about you, but is it your friend, or your enemy? Often when it comes to personal data, a sense of dystopia can understandably prevail - we&rsquo;re being spied on, manipulated, and our civil liberties are under threat. Meanwhile, services that collect data about us have become so useful and seemingly indispensable (and those Buzzfeed quizzes ain&rsquo;t gonna do themselves).</p>
<p>In a world where it can often feel that things are constantly happening <strong>to</strong> us, how can we influence our own futures in a positive, active way?</p>
<p>It&rsquo;s in this context that the General Data Protection Regulation (GDPR) kicks into effect this Friday, 25 May 2018 - and it&rsquo;s a game-changer for people living in the UK and Europe when it comes to their rights to privacy in the digital age and what they can actively do with data about themselves.</p>
<p>To help audiences explore the implications of this monumental intervention, Tomorrow&rsquo;s World is launching &ldquo;My Life, My Data, #MyTomorrow&rdquo;, a campaign about people and communities shaping the future by taking control of the data they create. The campaign starts today and runs for the following week.</p>
<p>Using short films, interactive experiences, and conversations across social media, Tomorrow's World will help explain just how exciting and important personal data is.</p>
<p>Highlights of the &ldquo;<strong>My Life, My Data, #MyTomorrow</strong>&rdquo; campaign include:</p>
<p><strong>&ldquo;Future Values, or A Short Ride in an Intelligent Machine&rdquo;&nbsp;&nbsp;</strong>Data about you is helping drive the development of artificial intelligence. &ldquo;Future Values, or A Short Ride in an Intelligent Machine&rdquo;, which launches today on <a href="https://www.bbc.co.uk/taster/">BBC Taster</a>, will send you into a &ldquo;what if?&rdquo; future built atop your own data. You&rsquo;ll exchange banter with the artificial intelligence behind a driverless taxi, and discover the guiding principles that make up <a href="https://www.bbc.co.uk/taster/pilots/global-values-where-do-you-fit">your own deepest values</a>. Experience <a href="https://www.bbc.co.uk/taster/pilots/tw-driverless-car">&ldquo;Future Values, or A Short Ride in an Intelligent Machine&rdquo;</a>. <a href="https://charisma.ai/introduction">Read more about Charisma.ai</a>, the cutting-edge conversational storytelling technology behind this pilot.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p0687sgt.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p0687sgt.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p0687sgt.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p0687sgt.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p0687sgt.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p0687sgt.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p0687sgt.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p0687sgt.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p0687sgt.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p><strong>&ldquo;My Life, My Data&rdquo;&nbsp;</strong>A short film (5 minutes) hosted by Leila Johnston and Alex Lathbridge takes us on a fast-paced journey to explore what personal data is and how it is currently used. Brought to life with animations by Jamie Squire, the short film reminds us of the quid pro quo that comes with access to digital services such as Facebook, Google and Instagram: nothing comes for free.&nbsp;<a href="http://www.bbc.co.uk/guides/zjkxh39">View here</a>.</p>
<p><strong>&ldquo;Instagram Chatbot&rdquo;&nbsp;</strong>To illustrate the power of data about you and how it is driving the development of machine learning, our &ldquo;Instagram Chatbot&rdquo; will analyse your next Instagram post and compare it with thousands of others to estimate how people will react to it. The chatbot will then create a unique, personalised short video explaining why, illustrating the power of personal data. <a href="http://www.bbc.co.uk/guides/z7x9dxs">Try it here</a>.</p>
<p><strong>&ldquo;Donate Your Data Day&rdquo;&nbsp;</strong>A short film (10 minutes) that imagines a not-too-distant future where we can donate data about ourselves en masse, using a click-to-donate button on our mobile phones. Galvanised around a gently satirical global 'Donate Your Data Day', the audience can decide what data to donate and who to donate it to, showing their support by becoming a data donor with their own Data Donor Card. <a href="http://www.bbc.co.uk/guides/zrh347h">View here</a>.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p0683lyb.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p0683lyb.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p0683lyb.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p0683lyb.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p0683lyb.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p0683lyb.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p0683lyb.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p0683lyb.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p0683lyb.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""><p><em>Tomorrow&rsquo;s World presenters Leila Johnston and Alex Lathbridge</em></p></div>
<div class="component prose">
    <p><strong>&ldquo;Meet the Personal Data Superheroes&rdquo;&nbsp;</strong>A special episode of the Tomorrow&rsquo;s World podcast explores the work of people making a difference when it comes to our data rights. Meanwhile, on social media, BBC presenters, journalists, and influencers - including Radio 1 Newsbeat&rsquo;s Tina Daheley, CBBC&rsquo;s Katie Thistleton and BBC Click&rsquo;s Spencer Kelly -&nbsp;discuss the importance of the data we're creating every day. <a href="https://www.bbc.co.uk/programmes/p06830vl">Listen here.</a>&nbsp;</p>
<p><strong>&ldquo;How Do You Feel About Your Data? A Survey&rdquo;&nbsp;</strong>A timely research project from Tomorrow&rsquo;s World partners The Open University, on their new nQuire citizen science platform. How much personal information are you happy to share? What should companies be allowed to do with the data you create? Audiences can contribute to this short survey and give their views on data protection. <a href="https://nquire.org.uk/mission/my-tomorrow">View here</a>.</p>
<p>The &ldquo;<strong>My Life, My Data, #MyTomorrow</strong>&rdquo; campaign will make you smile and think, and ask you to reconsider your preconceptions about the relationship between us, the data we create, and the companies, governments, and other organisations that use that data.</p>
<p>So what comes next?</p>
<p>This is just the beginning, and there's much more to do. People are recognising their rights and starting to take greater control of the data they create. They are actively helping create a future their grandchildren will want to live in. We can all join in.</p>
<p>Together, we're creating <a href="https://twitter.com/hashtag/MyTomorrow?src=hash%22%3E">#MyTomorrow</a>.</p>
<p><a href="http://www.bbc.co.uk/guides/zcxb7p3">Read more</a>&nbsp;about the Tomorrow&rsquo;s World partnership between Science Museum Group, Wellcome, The Open University, the Royal Society, and the BBC.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
    <item>
      <title>Your data matters</title>
      <description><![CDATA[How the BBC is meeting its GDPR requirements.]]></description>
      <pubDate>Tue, 22 May 2018 14:04:00 +0000</pubDate>
      <link>https://www.bbc.co.uk/blogs/internet/entries/7c605523-8df3-4dcb-bf58-7c64aa0b59a5</link>
      <guid>https://www.bbc.co.uk/blogs/internet/entries/7c605523-8df3-4dcb-bf58-7c64aa0b59a5</guid>
      <author>Julie Foster</author>
      <dc:creator>Julie Foster</dc:creator>
      <content:encoded><![CDATA[<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p06831gj.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p06831gj.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p06831gj.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p06831gj.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p06831gj.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p06831gj.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p06831gj.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p06831gj.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p06831gj.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p>Last <a href="http://www.bbc.co.uk/blogs/aboutthebbc/entries/77bdafd0-20b3-414d-aa53-48786b194543">May</a>, we updated everyone on our plans to make the BBC more personalised and relevant to you. We can give you more of what you love when we understand you better, and also make sure that as a public service, we make something for everyone.</p>
<p>Over 15 million people with BBC accounts have used the BBC&rsquo;s websites and apps in the last month. What&rsquo;s more, they use BBC websites and apps more than people who are not signed in: 64% of BBC account users visit BBC online more than two days per week, compared to 46% of all users. And when they are on BBC websites and apps, people with a BBC account spend an hour more per week than people who are not signed in.</p>
<p>Your personal data is helping power this transformation. We can&rsquo;t provide you with a meaningful personal or tailored experience without this information, but it is ultimately your data. And your data matters.</p>
<p>The General Data Protection Regulation, or GDPR for short, comes into force next week. It makes sure that businesses clearly explain to you why they collect your personal data, and how they use it. It is an evolution of the Data Protection Act, and gives you new and important rights.</p>
<p>As we&rsquo;ve said before, we&rsquo;ve built our new BBC account system with GDPR in mind, but we&rsquo;re always reviewing our processes, technology and governance.</p>
<p>We use your personal data for different reasons, and it&rsquo;s important that we are transparent with you about why we collect and use this data. Our site <a href="http://www.bbc.co.uk/usingthebbc">&lsquo;Using The BBC&rsquo;</a> spells out, in plain English, what we will (and importantly won&rsquo;t) do with your data. It can also help you exercise your GDPR rights, such as changing some of your details in Settings.</p>
<h4>How have we prepared for GDPR?</h4>
<p>For starters, you should not need to be a rocket scientist to know your rights. We&rsquo;ve updated our policies to make them even more transparent and clear.</p>
<h4>What are my rights?</h4>
<p>We&rsquo;ve created a new section in Using The BBC all about GDPR to help you understand what your rights are, how you can exercise them with the BBC, and how to get help.</p>
<p>We&rsquo;ve also developed technology with data privacy at the heart of what we create. Below are a couple of examples of the kind of work we&rsquo;ve done to prepare for GDPR.</p>
<h4>Privacy for children</h4>
<p>We want to help you make sure that your child can only watch programmes, read comments and upload their creations in a space that is age appropriate and suitable to them. For this reason, we&rsquo;ve developed a way for parents or guardians to register their child which is simple and easy to do, but more importantly is safe and secure for the entire family. We want your children to get great experiences on BBC websites and apps, and play our part to help protect them.</p>
<h4>Privacy by design</h4>
<p>This little button means a lot to us.</p>
</div>
<div class="component">
    <img class="image" src="https://ichef.bbci.co.uk/images/ic/320xn/p067w1hg.jpg" srcset="https://ichef.bbci.co.uk/images/ic/80xn/p067w1hg.jpg 80w, https://ichef.bbci.co.uk/images/ic/160xn/p067w1hg.jpg 160w, https://ichef.bbci.co.uk/images/ic/320xn/p067w1hg.jpg 320w, https://ichef.bbci.co.uk/images/ic/480xn/p067w1hg.jpg 480w, https://ichef.bbci.co.uk/images/ic/640xn/p067w1hg.jpg 640w, https://ichef.bbci.co.uk/images/ic/768xn/p067w1hg.jpg 768w, https://ichef.bbci.co.uk/images/ic/896xn/p067w1hg.jpg 896w, https://ichef.bbci.co.uk/images/ic/1008xn/p067w1hg.jpg 1008w" sizes="(min-width: 63em) 613px, (min-width: 48.125em) 66.666666666667vw, 100vw" alt=""></div>
<div class="component prose">
    <p>We really want you to have a personalised experience, like picking up where you left off watching a show, getting recommendations on programmes you might like or getting notifications about your favourite football team. But you have the right to <a href="https://account.bbc.com/account/settings/privacy">turn off</a> these features if you don&rsquo;t want them. Our analytics services team has worked hard to develop a technical solution that does this easily, while ensuring your privacy.</p>
<h4>What&rsquo;s next?</h4>
<p>We have some fantastic events coming this summer, from the <a href="http://www.bbc.co.uk/mediacentre/latestnews/2018/biggest-weekend-broadcast">Biggest Weekend</a> to the <a href="https://www.bbc.co.uk/sport/football/44103384">FIFA World Cup</a> and Wimbledon. Your data is helping us learn what you like, so we can make sure you get the best out of this summer, and improve our services for you in the future.</p>
</div>
]]></content:encoded>
      <slash:comments>0</slash:comments>
    </item>
  </channel>
</rss>
