Evaluating Location-Based Services Data for Transportation: Is Our Sample Representative?
Two of the questions we’re often asked here at StreetLight Data are: “What percentage of the population is creating the location records in your sample?” and “Does the location data in your sample fairly represent all income groups, or is it biased?” In this blog post, we’re pulling back the curtain on our internal evaluation process with a deep-dive analysis of our newest data source: Location-Based Services data. We started using this data source chiefly because of its large sample size and representativeness, so I will show you our process for determining these characteristics. (Click here to read more about Location-Based Services data in general.)
When we evaluate Location-Based Services data, our goal is to answer two key questions:
- What is the data set’s sample share? In other words, what percentage of a given region’s population uses the devices that create the location records in the sample?
- In terms of income, are the device users in the data set representative of the people who live in the region? In other words, do the incomes of the device users in our sample match the distribution of income levels of people living in that region?
In this blog post, I’ll demonstrate how we answered both of these questions for the state of Florida. Florida is an excellent case study because it’s a populous state with a diverse range of incomes, industries, and land uses. While there are regional variations in our sample, our findings in Florida are similar to our findings for the rest of the United States. (Note: This post does simplify the actual algorithmic processing and data science performed, so my apologies to our Data Science team!)
Device Penetration Rate
To determine the device sample share (also called the device penetration rate) for Location-Based Services data, our first step is to estimate the number of devices in our sample that “live” in a particular area. Next, we compare that number to the region’s total population per the most recent US census.
First, we look at devices’ locations during nighttime hours, when people tend to be near their residences. We assign a probability that each device is affiliated with a particular census block based on how many nighttime hours it spends there. A single device can therefore be split across several blocks: 30% of it might belong to one block, 30% to another, and 40% to a third. For clarity’s sake: we do not have any personally identifiable data about the devices’ owners, just points in space and time.
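As a rough illustration of this step (and not StreetLight's production algorithm), the sketch below splits one device across census blocks in proportion to the nighttime hours observed in each block. The function name `home_block_weights`, the 10 p.m. to 6 a.m. window, and the block IDs are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed nighttime window (10 p.m. to 6 a.m.); the real window is not stated.
NIGHT_HOURS = {22, 23, 0, 1, 2, 3, 4, 5}


def home_block_weights(pings):
    """Split ONE device across census blocks by share of nighttime hours.

    pings: iterable of (hour_of_day, block_id) pairs, one per observed hour.
    Returns {block_id: fraction}; the fractions sum to 1.0 when any
    nighttime observations exist, otherwise an empty dict is returned.
    """
    hours_per_block = defaultdict(int)
    for hour, block_id in pings:
        if hour in NIGHT_HOURS:  # daytime observations are ignored
            hours_per_block[block_id] += 1
    total = sum(hours_per_block.values())
    if total == 0:
        return {}
    return {block: n / total for block, n in hours_per_block.items()}


# Made-up observations for one device over several nights.
example_pings = [
    (23, "A"), (0, "A"), (1, "A"),
    (2, "B"), (3, "B"), (4, "B"),
    (5, "C"), (22, "C"), (23, "C"), (0, "C"),
    (14, "D"),  # daytime, so it does not count toward any home block
]
weights = home_block_weights(example_pings)
print(weights)  # {'A': 0.3, 'B': 0.3, 'C': 0.4}
```

With these made-up observations, the device is counted as 30% in block A, 30% in block B, and 40% in block C, matching the split described above.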
Our next step is to add up all the devices in our sample that are probably affiliated with each census block. If we assign 15 devices to a given census block that 100 people live in, that means our device sample share for that census block is 15%. Since StreetLight only publishes Metrics about groups of people – even in blog posts – we aggregated all of the census blocks in this study into tracts. Keep in mind that about 30 people live in the average census block, and about 6,000 people live in the average census tract.
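The arithmetic here can be sketched in a few lines. This is a simplified illustration under assumed inputs: the function name `sample_share_by_tract`, the block and tract IDs, and the population figures are hypothetical, and the real processing is more involved.

```python
from collections import defaultdict


def sample_share_by_tract(device_weights, block_population, block_to_tract):
    """Sum fractional devices per block, then roll blocks up into tracts.

    device_weights: list of {block_id: fraction} dicts, one per device.
    block_population: {block_id: census population}.
    block_to_tract: {block_id: tract_id}.
    Returns {tract_id: devices / population}.
    """
    devices_per_tract = defaultdict(float)
    population_per_tract = defaultdict(int)

    for weights in device_weights:
        for block_id, fraction in weights.items():
            devices_per_tract[block_to_tract[block_id]] += fraction

    for block_id, population in block_population.items():
        population_per_tract[block_to_tract[block_id]] += population

    return {
        tract: devices_per_tract[tract] / population
        for tract, population in population_per_tract.items()
        if population > 0
    }


# Made-up example: 15 devices assigned to a 100-person block (15% share),
# plus 1 device in a neighboring 60-person block of the same tract.
device_weights = [{"b1": 1.0}] * 15 + [{"b2": 1.0}]
block_population = {"b1": 100, "b2": 60}
block_to_tract = {"b1": "T1", "b2": "T1"}

shares = sample_share_by_tract(device_weights, block_population, block_to_tract)
print(shares)  # {'T1': 0.1}, i.e. 16 devices / 160 people
```

Rolling up to tracts smooths out the noise in very small blocks: the 15% block and its sparser neighbor combine into a single 10% figure for the tract.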
Results for Florida: 10% Device Sample Share
Our average device sample share across all of Florida’s census tracts is 10.1%. This number holds very steady across the tracts. Figure 1 (below) shows a histogram for all 4,000+ tracts in the state of Florida. As you can see, the vast majority of tracts fall within a narrow band: 90% of tracts have a sample share between 6.5% and 13.9%. This is a very consistent sample.
Figure 1: This histogram shows the number of census tracts in Florida for which StreetLight’s Location-Based Services device sample share falls in a given bin. (Note: this is based on data created in September 2016, and device sample share is increasing over time.) StreetLight Data’s algorithms adjust the way each individual device is counted according to its blocks’ device share. This means they automatically adjust for remaining geographic and demographic sample bias.
This device sample share rate impacts how we calculate our travel pattern analytics, too. If a device in one of our clients’ study areas is affiliated with a block with a 5% sample share, we’ll treat that device differently from a device that lives in a block with a 10% sample share. Essentially, our algorithms automatically scale our Metrics to account for the sample share rates at the devices’ census blocks.
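One common way to do this kind of scaling is inverse-share weighting, where each device stands in for 1 / (sample share) people. The post does not say that StreetLight uses exactly this formula, so treat the sketch below as an assumption; the function names `expansion_factor` and `scaled_count` are hypothetical.

```python
def expansion_factor(sample_share):
    """Number of people one device represents, assuming inverse-share weighting."""
    if sample_share <= 0:
        raise ValueError("sample_share must be positive")
    return 1.0 / sample_share


def scaled_count(home_block_shares):
    """Estimate people from devices, given each device's home-block sample share."""
    return sum(expansion_factor(share) for share in home_block_shares)


# Three devices seen in a study area: one from a 5% block, two from 10% blocks.
estimate = scaled_count([0.05, 0.10, 0.10])
print(round(estimate))  # 40
```

Under this assumption, the device from the 5% block counts for about 20 people, while each device from a 10% block counts for about 10, so under-sampled neighborhoods are not under-represented in the final estimate.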
Digging into Income Bias
When it comes to bias, our clients usually care more about income than any other demographic factor. Many of us in the transportation industry have spent years controlling for income bias in samples, and having a representative sample is a critical component of ensuring that transportation plans, infrastructure, and policies are equitable. So we take it very seriously! However, keep in mind that this same type of bias analysis can be extended to other census demographics.
To explore income bias in our sample, we use the same aggregated “device home block locations” (or “nighttime locations”) that we used for our sample share analysis. We focus on the block group level. In terms of population, block groups fall between census blocks and tracts. The American Community Survey organizes its income statistics by block group, and we use those statistics for this analysis.
Once we have our sample share by block group, we look up the average income of each block group. Our goal is to find out whether our sample share differs between higher-income and lower-income block groups. For this case study, our answer is “No”: the penetration rate is similar across all income levels, as shown in Figure 2 below.
Figure 2: This shows our penetration rate for the block groups in each income level. To make this graph easier to read, we have aggregated block groups into bins according to their average income.
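The income comparison can be sketched as a simple group-by. The example below is illustrative only: the function name `share_by_income_bin`, the $25,000 bin width, and the block group figures are assumptions, not values from our analysis.

```python
from collections import defaultdict


def share_by_income_bin(block_groups, bin_width=25_000):
    """Pool block groups into income bins and compute each bin's sample share.

    block_groups: iterable of dicts with keys 'avg_income', 'devices',
    and 'population'. Returns {bin_floor: devices / population}, ordered
    from the lowest income bin to the highest.
    """
    devices = defaultdict(float)
    population = defaultdict(int)
    for bg in block_groups:
        bin_floor = int(bg["avg_income"] // bin_width) * bin_width
        devices[bin_floor] += bg["devices"]
        population[bin_floor] += bg["population"]
    return {
        b: devices[b] / population[b]
        for b in sorted(population)
        if population[b] > 0
    }


# Made-up block groups for illustration.
example_block_groups = [
    {"avg_income": 30_000, "devices": 100.0, "population": 1000},
    {"avg_income": 40_000, "devices": 110.0, "population": 1000},
    {"avg_income": 90_000, "devices": 150.0, "population": 1500},
]
for bin_floor, share in share_by_income_bin(example_block_groups).items():
    print(f"${bin_floor:,}+: {share:.1%}")
```

If the shares printed for each bin are close to one another, the sample is not skewed toward higher-income or lower-income areas. Note that this compares areas by their average income, not the incomes of individual device owners.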
Going Further: An Invitation to Explore Our Data
While we’ve provided the “bird’s eye” view of our sample’s representativeness in Florida, we also know that many of our clients are interested in the details. To help answer any questions you may have, we made an interactive map of each census tract for you to explore (see below).
Scroll over or click on a tract to see that tract’s population, average income, and StreetLight device penetration for September 2016. (Keep in mind: it gets better every month!) The map is color-coded to show higher-than-average and lower-than-average tracts.
Note that the most densely populated tracts are usually physically small, because the census tries to keep the number of people per tract consistent. Thus, a tract in Miami may be tiny on the map but have more people than a rural tract which “looks” much bigger on the map.
Find an interesting trend as you explore? Please let us know!