Sample Size in Massive Mobile Data Analytics: Not a Simple Concept!

Big Data  |  Transportation  |  Mobile

In the transportation industry, we love to debate the optimal sample size for data collection. But this deceptively simple concept can be tricky.

What is the best or most useful definition of sample size? There is no single obvious answer, and we’ve learned that one size does not fit all when it comes to determining the right sample size for a project.

Here are four “pickles” that we at StreetLight Data consider when evaluating sample size:

  1. Sample Rate: In intercept surveys, researchers typically look for the “sample rate” (say, 2%), but that’s typically 2% of a given population over just a few hours, or perhaps an entire day. Is a 2% sample rate of traffic in a single day better or worse than a 1% sample each day for 30 days?
  2. Sample Size: Sample size can also be evaluated in terms of number of devices. But what if each device only creates a few data points per week, which means that it actually can’t be used to analyze transportation behavior? How do we “count” that device?
  3. Sample Geography: What if the sample size for a given region is very high (say, 10%) but that 10% is spatially concentrated in one corner of that geographic region? In this case, is that sample truly representative?
  4. Sample Size Unit: What is the best (and most feasible) “unit” for sample size for your project? A trip? A person? A vehicle? A device? A travel day? An activity?
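
To make the first three pickles concrete, here is a minimal Python sketch. It is an illustration only, not a description of how StreetLight Data measures sample size: the function names, the constant daily traffic volume, the minimum-ping threshold, and the zone labels are all hypothetical assumptions.

```python
from collections import Counter


def observed_trips(daily_trips: int, sample_rate: float, days: int) -> float:
    """Expected sampled trips, ASSUMING the same traffic volume every day."""
    return daily_trips * sample_rate * days


def usable_devices(pings_per_device: dict[str, int], min_pings: int) -> list[str]:
    """Devices with at least `min_pings` data points (hypothetical cutoff)."""
    return sorted(d for d, n in pings_per_device.items() if n >= min_pings)


def max_zone_share(device_zones: list[str]) -> float:
    """Share of sampled devices in the single most-represented zone."""
    counts = Counter(device_zones)
    return max(counts.values()) / len(device_zones)


# Pickle 1: 2% for one day vs. 1% per day for 30 days (made-up volume).
one_day = observed_trips(daily_trips=10_000, sample_rate=0.02, days=1)
thirty_days = observed_trips(daily_trips=10_000, sample_rate=0.01, days=30)

# Pickle 2: devices with only a few data points drop out of the count.
usable = usable_devices({"a": 3, "b": 400, "c": 50}, min_pings=50)

# Pickle 3: a high sample can still sit mostly in one corner of the region.
concentration = max_zone_share(["NE", "NE", "NE", "SW"])
```

With these made-up inputs, the one-day 2% sample yields about 200 trips while the 30-day 1% sample yields about 3,000, only two of the three devices count as usable, and three quarters of the sample sits in a single zone. None of these numbers says which sample is "better"; they only show that the answer depends on which definition you pick.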

Acknowledging that the best answer might differ depending on the situation, we’re starting a series of posts exploring the concept of “useful” sample size, using our own data resources, and we would love to hear your thoughts!

What does sample size mean to you, and what factors do you consider in your evaluation? Let us know in the comments, and check back here in the coming weeks for our next installment on sample size.