Sample sizes, universes, margins of error - going small with big data

Tom Weiss, Tue 26 November 2019

If you’re used to analyzing advertising campaigns based on Nielsen data, you’re used to having around 25,000 homes at your disposal. If you step up to connected device data, you're more likely to have 5 million or more homes in front of you. That’s a two-hundred-fold increase in the amount of data you’ve got your hands on. You can measure everything at a much more granular level, and analysis that was previously only possible on national networks can now run on pay channels or local networks. Where we might previously have been able to optimize on dayparts, we can now optimize by the hour. 15-minute dayparts can now be broken down to 5-second granularity.

How low can we go?

But what's the cut-off point? Customers often ask us how small we can go when measuring campaigns that may, for example, cover only a few small DMAs, or a small addressable campaign with hundreds of thousands of impressions. To answer this, we use sampling error to help us figure out whether, even with 5 million devices, we can accurately measure an audience.

When we’re doing any analysis based on a sample of a population - such as 5 million TVs in a country of 119 million TV homes - we calculate the sampling error to establish how meaningful our results are.

Using a 95% confidence interval, with a sample of 5 million TVs across the country, a measurement based on 100 TVs will be accurate to within 19 TVs, and one based on 1,000 TVs to within 61 TVs.

In other words, any results that rely on 100 TVs are 95% likely to be accurate to within 19% of the reported values, and any results that depend on 1,000 TVs are 95% likely to be accurate to within 6.1% of the reported values. We need to get up to 50,000 TVs to be 95% confident that we are within 1% of the actual result.
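The article does not spell out the formula behind these figures, so the following is only a sketch under stated assumptions: a normal approximation to the binomial, z = 1.96 for 95% confidence, and a finite population correction for sampling 5 million of 119 million TV homes. The function name `margin_of_error` and the constants are illustrative, not taken from any particular tool.

```python
import math

SAMPLE_TVS = 5_000_000        # connected TVs in the dataset (from the article)
POPULATION_TVS = 119_000_000  # TV homes in the country (from the article)
Z_95 = 1.96                   # z-score for a 95% confidence interval


def margin_of_error(devices, sample=SAMPLE_TVS, population=POPULATION_TVS, z=Z_95):
    """Approximate 95% margin of error, in TVs, for a count of `devices`."""
    p = devices / sample
    # Binomial standard error of the count within the sample.
    standard_error = math.sqrt(sample * p * (1 - p))
    # Finite population correction: the sample is a sizeable slice of all TV homes.
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * fpc


for devices in (100, 1_000, 50_000):
    moe = margin_of_error(devices)
    print(f"{devices:>6} TVs: +/- {moe:,.0f} TVs ({moe / devices:.1%})")
```

Under these assumptions, 100 TVs and 1,000 TVs come out at roughly 19 and 61 TVs respectively, and 50,000 TVs comes out a little under 1%.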

In practice, this means that to get minute-by-minute ratings, you need over a hundred devices watching each minute. If we are going to break this down by DMA, we need 100 per DMA.

For ratings, this means that we can run minute-by-minute ratings for any show with an average rating of 0.01%. For shows with a 1% rating, we can run minute-by-minute ratings for each of the top 50 DMAs.
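As a rough check on these thresholds, the sketch below converts a rating into an expected device count. It assumes devices are spread in proportion to the rating and to each DMA's share of TV homes. The 0.6% DMA share is a hypothetical placeholder for a smaller top-50 market, not a published figure, and the function names are illustrative.

```python
SAMPLE_TVS = 5_000_000  # connected TVs in the dataset (from the article)
MIN_DEVICES = 100       # rule-of-thumb floor per measurement (from the article)


def expected_devices(rating, market_share=1.0, sample=SAMPLE_TVS):
    """Expected devices tuned in, assuming viewing is proportional to rating and market share."""
    return sample * rating * market_share


def is_measurable(rating, market_share=1.0, min_devices=MIN_DEVICES):
    """True if the expected device count clears the minimum threshold."""
    return expected_devices(rating, market_share) >= min_devices


# A 0.01% national rating.
print(f"{expected_devices(0.0001):,.0f} devices")

# A 1% rating in a hypothetical DMA holding 0.6% of TV homes.
print(f"{expected_devices(0.01, market_share=0.006):,.0f} devices")
print(is_measurable(0.01, market_share=0.006))
```

Because the rating in the article is an average, individual minutes will fall below it, which is one reason to leave headroom above the 100-device floor.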

Where did the audience come from?

We can also get a better look at how the audience moves between networks. If we have around 100 TVs coming from each station, we can do great analysis on audience flow. In practice, we need several thousand devices watching a show for this to be practical.

With our five million homes, we can run audience flow analysis on any programming that has a rating of more than 0.1%. That’s a lot of programming where we can measure exactly why some shows are gaining audience and others are losing it. It is game-changing for schedulers. For larger shows, we can even drill down to exactly when people are switching and why.
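To make the 100-TVs-per-station rule concrete, here is a minimal sketch that counts where a show's audience tuned in from and keeps only the source stations with enough devices to report. The station names and counts are hypothetical toy data, and `reportable_sources` is an illustrative name.

```python
from collections import Counter

MIN_DEVICES_PER_SOURCE = 100  # rule-of-thumb floor per source station (from the article)


def reportable_sources(previous_stations, min_devices=MIN_DEVICES_PER_SOURCE):
    """Count devices by the station they came from, dropping sources below the floor."""
    counts = Counter(previous_stations)
    return {station: n for station, n in counts.items() if n >= min_devices}


# Hypothetical toy data: the station each device was watching before switching in.
tune_ins = ["Station A"] * 250 + ["Station B"] * 120 + ["Station C"] * 40

print(reportable_sources(tune_ins))
```

In this toy example Station C falls below the floor, so any flow figure for it would carry too large a sampling error to report on its own.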

How effective is my advertising?

Running closed-loop effectiveness on advertising requires a massive dataset. A national campaign that hits 30% of the population and has a conversion rate of 30% gives us an analysis sample of 30% of the 30%, or 9%.

With our 5 million homes, this gives us 450,000 devices. We can split this down into several hundred different segments and still have a small sampling error in each of these segments. We can compare networks and dayparts across a whole load of DMAs. We can also do frequency optimization, measuring differences in conversion rate at different numbers of exposures.

We can even undertake frequency optimization for smaller campaigns by aggregating data across multiple campaigns to get the audience up to the level of statistical significance. With smaller budgets, it’s even more important to know how to spend, and when we’ve looked at the data, we’ve often found that smaller advertisers are over-buying on frequency at the expense of reach.
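A minimal sketch of the frequency analysis might look like the following. It assumes one record per exposed device, holding the number of times the ad was seen and whether the device converted. The data is a hypothetical toy set and `conversion_by_frequency` is an illustrative name.

```python
from collections import defaultdict


def conversion_by_frequency(devices):
    """Conversion rate per exposure count, from (exposures, converted) pairs."""
    exposed = defaultdict(int)
    converted = defaultdict(int)
    for exposures, did_convert in devices:
        exposed[exposures] += 1
        converted[exposures] += int(did_convert)
    return {freq: converted[freq] / exposed[freq] for freq in sorted(exposed)}


# Hypothetical toy data: 100 devices at each of three exposure levels.
toy_devices = (
    [(1, True)] * 10 + [(1, False)] * 90
    + [(2, True)] * 20 + [(2, False)] * 80
    + [(3, True)] * 21 + [(3, False)] * 79
)

print(conversion_by_frequency(toy_devices))
```

In this made-up data the third exposure adds almost nothing over the second, which is the kind of pattern that points to over-buying on frequency. Each frequency bucket still needs enough devices for the sampling error to be acceptable, which is why aggregating across campaigns helps.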

Multi-touch attribution on smaller campaigns

Beyond frequency optimization, we can split the audience down to analyze scenarios where we didn’t run on some networks, dayparts or programming. For example, consider a smaller campaign running on three networks with a total reach of 3% and 3% conversion. We only end up with 4,500 TVs in our analysis, but we can still split them down to look at those that saw the ad only on network A, those that saw it on B, and those that saw it on B or C but not A.

If you want to know the uplift you got from a particular network, you need to be able to break out what the campaign would have looked like without this network, and with five million homes this becomes an everyday analysis you can include on every campaign.
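The arithmetic for the smaller campaign above can be sketched as follows. This is a simplification: it uses the approximation z / sqrt(n) for the relative margin of error, ignores the finite population correction, and assumes equal-sized subsets, which real network splits will not be. The function names are illustrative.

```python
import math

SAMPLE_TVS = 5_000_000  # connected TVs in the dataset (from the article)
Z_95 = 1.96             # z-score for a 95% confidence interval


def analysis_sample(reach, conversion, sample=SAMPLE_TVS):
    """Devices that were both reached by the campaign and converted."""
    return sample * reach * conversion


def relative_margin(devices, z=Z_95):
    """Approximate 95% margin of error as a fraction of a device count."""
    return z / math.sqrt(devices)


converters = analysis_sample(reach=0.03, conversion=0.03)
print(f"analysis sample: {converters:,.0f} TVs")

for subsets in (1, 3, 10):
    size = converters / subsets
    print(f"{subsets:>2} equal subsets of {size:,.0f} TVs: +/- {relative_margin(size):.1%}")
```

Even split three ways, each subset stays well above the 1,000-TV level discussed earlier, which is what makes the with-and-without-network comparison workable on a campaign of this size.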

What comes next?

People are constantly upgrading both their TV sets and cable boxes, and more and more of these newer devices are connected. By increasing the number of devices available for analysis, we increase the granularity that goes with this. However, even if we increase from 5 million homes to the entire country, that’s only about a twenty-fold increase in granularity, not another two-hundred-fold improvement.

The big increases in numbers are now behind us, and as an industry, we should focus less on going from 5 million to 25 million devices, and more on ensuring that the millions of devices reflect the make-up of the country. The sampling error is only a valid calculation if you are sure your sample is not skewed. The value of Nielsen’s panel is not its size but the fact that it is representative. We need to get the same focus in device data: once you’ve reached 5 million homes, the value is in how well the data represents the country, not simply in how many homes it represents.
