A team of computer scientists has derived accurate, neighborhood-level estimates of the racial, economic and political characteristics of 200 U.S. cities using an unlikely data source — Google Street View images of people's cars.
Published this week in the Proceedings of the National Academy of Sciences, the report details how the scientists extracted 50 million photographs of street scenes captured by Google's Street View cars in 2013 and 2014. They then trained a computer algorithm to identify the make, model and year of the 22 million automobiles appearing in those images, whether parked outside homes or driving down the street.
The vehicles seen in Street View images are often small or blurry, making precise identification a challenge. So the researchers had human experts identify a small subsample of the vehicles and compared those identifications with the results churned out by their algorithm. They found that the algorithm correctly identified whether a vehicle was U.S.- or foreign-made roughly 88 percent of the time, got the manufacturer right 66 percent of the time and nailed the exact model 52 percent of the time.
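To make the three accuracy figures concrete, here is a minimal Python sketch of how agreement between expert labels and algorithm output could be scored at each level of detail. It is an illustration only: the `Label` class, the `accuracy` helper and the four example vehicles are all made up for this sketch and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical record of one vehicle identification."""
    foreign: bool  # True if the vehicle is foreign-made
    make: str
    model: str


def accuracy(pairs, key):
    """Fraction of (human, predicted) pairs that agree on the attribute chosen by key."""
    hits = sum(1 for human, predicted in pairs if key(human) == key(predicted))
    return hits / len(pairs)


# Made-up examples: (expert label, algorithm label). Not data from the study.
pairs = [
    (Label(True, "Subaru", "Outback"), Label(True, "Subaru", "Outback")),
    (Label(True, "Subaru", "Forester"), Label(True, "Subaru", "Outback")),
    (Label(False, "Buick", "LeSabre"), Label(False, "Cadillac", "DeVille")),
    (Label(True, "Audi", "A3"), Label(False, "Chrysler", "300")),
]

origin_acc = accuracy(pairs, lambda v: v.foreign)         # U.S.- vs. foreign-made
make_acc = accuracy(pairs, lambda v: v.make)              # manufacturer
model_acc = accuracy(pairs, lambda v: (v.make, v.model))  # exact model

print(origin_acc, make_acc, model_acc)
```

As in the study's results, the score drops as the question gets more specific: an algorithm can get the country of origin right while still missing the manufacturer or the exact model.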
While those numbers are far from perfect, the sheer size of the vehicle database means they are still useful for real-world statistical applications, like drawing connections between vehicle preferences and demographic data. The 22 million vehicles in the database comprise roughly 8 percent of all vehicles in the United States. By comparison, the U.S. Census Bureau's massive American Community Survey reaches only about 1.6 percent of American households each year, while the typical 1,000-person opinion poll includes just 0.0004 percent of American adults.
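The coverage comparison is simple arithmetic, sketched below in Python. The adult population of 250 million is a round-number assumption made for this sketch, not a figure from the report; the 22 million vehicles and the 8 percent share come from the paragraph above.

```python
POLL_SIZE = 1_000
US_ADULTS = 250_000_000  # assumption: rough U.S. adult population, not from the report

# Share of American adults reached by a typical 1,000-person poll, in percent.
poll_share_percent = POLL_SIZE / US_ADULTS * 100
print(f"{poll_share_percent:.4f} percent")

# Total U.S. vehicle count implied by "22 million is roughly 8 percent".
VEHICLES_IN_DATABASE = 22_000_000
DATABASE_SHARE = 0.08
implied_total_vehicles = VEHICLES_IN_DATABASE / DATABASE_SHARE
print(f"{implied_total_vehicles:,.0f} vehicles")
```

Under that assumption the poll reaches about 0.0004 percent of adults, and the database figures imply a national total of roughly 275 million vehicles.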
To test what this data set could be capable of, the researchers first paired the Zip code-level vehicle data with numbers on race, income and education from the American Community Survey. They did this for a random 15 percent of the Zip codes in their data set to create a “training set.” They then created another algorithm to go through the training set to see how vehicle characteristics correlated with neighborhood characteristics: What kinds of vehicles are disproportionately likely to appear in white neighborhoods, or black ones? Low-income vs. high-income? Highly educated areas vs. less-educated ones?
That yielded a number of reliable correlations. The five vehicle types most closely associated with white neighborhoods, for instance, were SUVs, cars made by Jeep and Subaru, expensive cars, and cars classified as “wagons.” In black neighborhoods, on the other hand, Cadillacs, Buicks, Mercurys, Chryslers and sedan-type vehicles were more prevalent.
You can do similar exercises for other demographic characteristics, like educational attainment. People with graduate degrees were more likely to drive Audi hatchbacks with high city MPG. Those with less than a high school education, on the other hand, were more likely to drive cars made by U.S. manufacturers in the 1990s.
One important thing to note is that these are just correlations. Saying that white people are more likely to drive Subaru wagons isn't the same as saying all white people drive Subaru wagons, or that all Subaru wagons are driven by white people. But the data set showed that white people were more likely than black or Asian people to drive those cars.
Armed with all these correlations, it was time to put the algorithm to its true test: Could it accurately infer the demographics of the remaining 85 percent of Zip codes, given only the car data?
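The train-then-test procedure can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the researchers' method: it uses synthetic data, a single hypothetical vehicle feature per Zip code and an ordinary least-squares line, whereas the study drew on many vehicle characteristics. Every name and number below is an assumption made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in data: one vehicle feature and one survey value per Zip code.
zips = []
for _ in range(1000):
    feature = rng.random()  # hypothetical vehicle feature in [0, 1)
    target = 40_000 + 60_000 * feature + rng.gauss(0, 5_000)  # hypothetical survey value
    zips.append((feature, target))

# A random 15 percent becomes the training set; the other 85 percent is held out.
rng.shuffle(zips)
cut = len(zips) * 15 // 100
train, held_out = zips[:cut], zips[cut:]

# Learn the relationship from the training set only.
slope, intercept = fit_line([f for f, _ in train], [t for _, t in train])

# Predict the held-out Zip codes from the vehicle feature alone.
predictions = [slope * f + intercept for f, _ in held_out]
```

The point of holding out 85 percent is that the model never sees the survey values for those Zip codes, so its predictions there are a fair test of what the car data alone can reveal.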
Short answer: yep. “We found a strong correlation between our results and ACS [American Community Survey] values for every demographic statistic we examined,” the researchers wrote. They plotted the algorithm's demographic estimates against the actual numbers from the ACS and measured their correlation coefficient: a number that, for quantities that rise and fall together, runs from 0 (no correlation) to 1 (perfect correlation) and measures how closely one set of numbers tracks the variation in a separate set of numbers.
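For readers who want to see the calculation, here is a minimal Python sketch of a correlation coefficient. It assumes the standard Pearson coefficient is meant, which the passage above does not spell out, and the `estimated` and `actual` values are invented for illustration rather than drawn from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented numbers: algorithm estimates vs. survey values for four areas.
estimated = [10.0, 20.0, 30.0, 40.0]
actual = [12.0, 18.0, 33.0, 41.0]

r = pearson_r(estimated, actual)
print(round(r, 2))
```

Estimates that track the survey values closely give a result near 1. The Pearson coefficient can also go negative, reaching -1 when one set of numbers falls exactly as the other rises.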
At the city level, the algorithm did a particularly good job of predicting the percent of Asians (correlation coefficient of 0.87), blacks (0.82) and whites (0.77). It also predicted median household income (0.82) quite well. On measures of educational attainment, the correlation coefficients ran from about 0.54 to 0.70 — again, not perfect, but fairly impressive accuracy considering the predictions derived solely from auto information and nothing else.
“Taken together, these results show our ability to estimate demographic parameters, as assessed by the ACS, using the automated identification of vehicles in Google Street View data,” the researchers wrote. They pointed out that if they broadened the scope of their inquiry — extracting not just cars, for instance, but also features of homes, landscaping, or sidewalks and roads — they would probably be able to achieve even greater accuracy.
It's a little unsettling that a computer can figure out so much about us simply by noting what types of vehicles we drive, and indeed the authors note that the research raises “important ethical concerns” related to expectations of privacy and fairness. What happens if, say, insurers start charging higher rates based on Street View photos of neighborhoods? Or if a bank denies a mortgage to someone based on the type of car they drive?
The researchers say those perils need to be carefully weighed against the prospect of more accurate and more immediate data on the communities we live in, which could be used to improve our understanding of ourselves — “the potential to measure demographics with fine spatial resolution, in close to real time.”