Data sushi or why everyone just wants the raw data

Tom Weiss, Tue 15 August 2017

When I first started doing business in Japan in the mid-1990s, I fell in love with sushi. Sitting at the counter eating sushi and drinking sake with my friends felt like the most glamorous thing in the world.

Now, when I'm talking to clients and hear them asking about access to the "raw" data, I wonder if they have the same infatuation I had with raw fish. It took me a good ten years to re-acclimatise myself to the pleasures of cooked fish, and I think the same thing may be going on here.

When people talk about raw data, it usually comes from a desire not to be constrained in what they can do with the data. With raw data, the assumption goes, anything is possible and it can be processed exactly as required. That is certainly true, but if you're asking for the "raw" data you also need to be very clear about what you are really asking for.

It's really easy to make sushi, right?

In video, raw data is frequently little more than click-stream information from the remote control (set-top box data) or a list of frames matched against an ingested stream (smart TV data).

This data is messy and complex, and, like a visitor to a restaurant who wants more than a raw fish dumped on their plate, most agencies and networks don’t want to ingest that kind of raw data.

Most people, when they request “raw” data, mean that they want a granular, sanitised dataset that they can drill down to the level of an individual device. They want any inconsistencies in the data smoothed out, they want the population to have been made representative of the country, and they want data they can trust, data that won’t make them look silly in front of clients.

In other words, they want the opposite of raw data. They want finely prepared fish. There are three key steps to making your data sushi appetising and palatable.

Step 1: Smoothing out the inconsistencies

Raw data is full of common inconsistencies, notably:

- Set-top boxes that occasionally report viewing from 1st January 1970 (the Unix epoch)
- Missing fast-forward or rewind session data
- Smart TV data where, if the same content is on two networks at once, the ACR algorithm continues to flip-flop between the two
- OTT data with inconsistencies between iOS and Android

Any dataset needs to go through a rigorous cleansing process to remove these erroneous records, as sketched below.
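To make that concrete, here is a minimal sketch of such a cleansing pass in Python. The schema (device_id, network, start_time, duration_s), the ten-second flip-flop window and the other thresholds are illustrative assumptions, not any provider's actual delivery format:

```python
import pandas as pd

UNIX_EPOCH = pd.Timestamp("1970-01-01")

def cleanse(views: pd.DataFrame) -> pd.DataFrame:
    """Drop obviously erroneous records from a raw viewing log.

    Assumed columns: device_id, network, start_time (datetime),
    duration_s (seconds) -- an illustrative schema only.
    """
    # Sessions stamped at (or within a day of) the Unix epoch come from
    # set-top boxes with a failed clock, not from 1970s viewers.
    views = views[views["start_time"] > UNIX_EPOCH + pd.Timedelta(days=1)]

    # Sessions with missing or impossible durations are unusable.
    views = views.dropna(subset=["duration_s"])
    views = views[views["duration_s"].between(1, 86_400)]

    # ACR flip-flop: a device that "returns" to the network it was on one
    # row earlier, after a very short detour to another network, was most
    # likely misidentified during the detour, so drop the middle row.
    views = views.sort_values(["device_id", "start_time"]).reset_index(drop=True)
    prev_net = views.groupby("device_id")["network"].shift(1)
    next_net = views.groupby("device_id")["network"].shift(-1)
    flip = (prev_net == next_net) & (views["network"] != prev_net) & (views["duration_s"] < 10)
    return views[~flip]
```

A real pipeline layers dozens of rules like these, each tuned to the particular failure modes of one provider's feed.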

Step 2: Make it representative

Set-top-box data is typically sourced only from an MVPD’s footprint. It needs to be modeled before it can be applied to the rest of the country, and any skews in that footprint need to be removed. With more and more people using antennas to get terrestrial broadcast, STB data is already going to be missing large chunks of the population.

Smart TV data is more representative, as it covers cable, satellite and broadcast, but it only covers homes where people have bought a new TV in the last few years and connected it to the internet over wi-fi. That still skews the data towards more prosperous and younger people.

Even OTT data, which by its nature covers every device using that service, needs to be ‘de-skewed’ by eliminating test data and bot views.
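One common way to correct these panel skews is post-stratification weighting: weight each demographic cell so its share of the panel matches a census target. A minimal sketch, in which the poststratify function, the 'cell' definitions and the column names are all hypothetical:

```python
import pandas as pd

def poststratify(panel: pd.DataFrame, census: pd.DataFrame) -> pd.DataFrame:
    """Weight each device so the panel's mix matches census targets.

    panel:  one row per device, with a 'cell' column (e.g. age band x
            region x reception type).
    census: one row per cell, with 'cell' and 'population_share' columns.
    Both schemas are illustrative assumptions.
    """
    # Observed share of the panel falling in each cell.
    panel_share = panel["cell"].value_counts(normalize=True)

    # Weight = target share / observed share, so over-represented cells
    # (younger, more prosperous, new-TV homes) are down-weighted.
    weights = census.set_index("cell")["population_share"] / panel_share

    return panel.assign(weight=panel["cell"].map(weights))
```

The harder part, which no snippet captures, is choosing cells fine enough to remove the skew but coarse enough that each still contains a usable number of devices.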

Step 3: No food poisoning, please

Finally, you need to be sure that the data is reliable. Nielsen is often rapped for being late in delivering its data, but late is not the worst thing data can be: data providers frequently have outages, and you need resilience in your data strategy to cope with them.

No one wants food poisoning. When Nielsen's Florida data was late in March due to a power outage, people complained, but it was eventually published. If you're taking data to your client claiming that nobody watched their competitor’s commercial, you had better be sure there wasn’t an outage at your data provider during that period. It’s better to wait for your sushi and get it in perfect condition than to get it on time but far from fresh.
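A cheap guard against this kind of food poisoning is to compare each day's record volume against a trailing baseline and withhold any day that looks more like a provider outage than genuinely low viewing. A minimal sketch, assuming you keep a per-day record count; the function name and thresholds are illustrative:

```python
import pandas as pd

def flag_outages(daily_counts: pd.Series, window: int = 28, floor: float = 0.5) -> pd.Series:
    """Mark days whose record volume is suspiciously low.

    daily_counts: records received per day, indexed by date.
    A day below `floor` x the trailing `window`-day median is more likely
    a provider outage than a day when nobody watched. Both thresholds
    are illustrative assumptions, not industry standards.
    """
    # Trailing median excludes the current day so an outage can't
    # drag down its own baseline.
    baseline = daily_counts.shift(1).rolling(window, min_periods=7).median()
    return daily_counts < floor * baseline
```

Any day this flags should be held back from client reporting until the provider confirms the feed was complete.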

Leave raw data to the sushi masters

That sashimi on your plate may look delightfully simple and elegant, but a lot has gone into getting it there. It takes five years of hard work to train to be a sushi chef, and creating data sushi is no less complicated.

Unless you have an army of fully trained sushi chefs on staff, I'd recommend asking your data providers for the processed data and leaving the raw data to the data masters.
