Mixing the Old with the New: Integrating New Data into Traditional Data Systems for Sustainable Development

 

Written by Hayden Dahmm

New methods of data collection have the potential to create a timelier, more detailed understanding of sustainable development challenges. For example, earth observation (EO) data has been identified as a tool for monitoring a wide range of issues, including agriculture, health, cities, and biodiversity, often at more frequent and granular levels, and has also proven essential during the current pandemic. Additionally, last month, SDSN in collaboration with Esri launched SDGs Today, a platform of timely data sources related to the Sustainable Development Goals to provide users with a snapshot of the state of sustainable development. Although new methods can provide valuable insights, they need to be treated with caution, as they are not replacements for existing methodologies.

For example, UN Global Pulse has developed an innovative technique for detecting impoverished communities by using remote sensing and machine learning to identify the materials that household roofs are made from. This information is already being used by the Uganda Bureau of Statistics, but unlike a household survey, it can only serve as a proxy indicator for poverty; it is not an actual count of people living in poverty. Additionally, new methods might not achieve the same level of accuracy possible with traditional data sources. Rather famously, Google Flu Trends was developed to provide real-time monitoring of influenza cases around the world based on search data, but it ended up massively overestimating the prevalence of flu and offered a worse prediction than extrapolating from less timely traditional data.

A Kinsa smart thermometer

As we explore and expand new methods, we should apply them in a thoughtful way, or else we risk compromising our understanding of the very issues we are attempting to measure. When possible, we should work to make sure that new methods complement rather than supplant traditional data sources. Fortunately, there are already examples of this being achieved.

Timely data about rainfall is critical to evaluating climatological shifts and the risk of floods or droughts. Our traditional understanding of rainfall comes from rain gauge stations located around the world. Unfortunately, the number of rain gauge stations has been in decline, and these systems can only provide information about precipitation at a scattered set of points. Estimates of precipitation derived from satellite imagery can provide continuous, global coverage, but these estimates tend to underestimate extreme events. The Climate Hazards Group Infrared Precipitation with Station Data (CHIRPS) is a joint project between the US Geological Survey and UC Santa Barbara that brings together both rain gauge and satellite data to estimate rainfall for nearly the entire world dating back to 1981. Researchers combine historical monthly averages from rain gauges with five different satellite products, and local rainfall is calculated using regression techniques. CHIRPS then adjusts for any biases in the estimates by blending in available daily rain gauge data, assuming the gauges represent the actual rainfall at their specific locations. By integrating new data with the traditional data, we can now access a more globally complete estimate for analyzing seasonal precipitation trends and monitoring drought.
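
To make the idea of blending gauge and satellite data more concrete, here is a minimal Python sketch. It is not the CHIRPS algorithm: it swaps in a much simpler technique, an inverse-distance-weighted ratio adjustment, and the function name `blend_with_gauges`, its parameters, and the sample numbers are all hypothetical. It only illustrates the general principle that gauges are treated as the true rainfall at their own locations and are used to correct nearby satellite estimates.

```python
import math


def blend_with_gauges(grid_points, satellite_mm, gauges, power=2.0):
    """Adjust satellite rainfall estimates using gauge/satellite ratios.

    Illustrative assumption only, not the CHIRPS procedure.

    grid_points:  list of (x, y) locations that have a satellite estimate
    satellite_mm: satellite rainfall estimate (mm) at each grid point
    gauges:       list of ((x, y), gauge_mm, satellite_at_gauge_mm) tuples
    power:        exponent controlling how quickly gauge influence decays
    """
    adjusted = []
    for (x, y), sat in zip(grid_points, satellite_mm):
        weight_sum = 0.0
        ratio_sum = 0.0
        exact_ratio = None
        for (gx, gy), gauge_mm, sat_at_gauge in gauges:
            if sat_at_gauge <= 0:
                continue  # ratio is undefined where the satellite saw no rain
            ratio = gauge_mm / sat_at_gauge
            dist = math.hypot(x - gx, y - gy)
            if dist == 0:
                # A gauge sits exactly here: trust it completely.
                exact_ratio = ratio
                break
            weight = 1.0 / dist ** power
            weight_sum += weight
            ratio_sum += weight * ratio
        if exact_ratio is not None:
            factor = exact_ratio
        elif weight_sum > 0:
            factor = ratio_sum / weight_sum
        else:
            factor = 1.0  # no usable gauges: keep the satellite estimate
        adjusted.append(sat * factor)
    return adjusted


# Made-up example: three grid cells on a line, with gauges at both ends.
points = [(0, 0), (1, 0), (2, 0)]
satellite = [10.0, 10.0, 10.0]
gauge_data = [((0, 0), 15.0, 10.0), ((2, 0), 5.0, 10.0)]
blended = blend_with_gauges(points, satellite, gauge_data)
```

In this toy example, the cells that contain a gauge take on the gauge value, while the cell midway between them receives the average of the two correction ratios. A real product has to handle far more, such as sparse gauge networks, terrain, and differences in timing between sources.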

There are also examples of joining new and traditional methods to measure social issues. Detailed and reliable population data are critical to enabling effective responses to the COVID-19 pandemic, but the census data for a number of low-income countries may be decades out of date, and the pandemic has significantly disrupted census activities in at least 32 countries and territories. Satellite imagery and other data sources can provide valuable estimates of population change, but this is less feasible when the historic census data is lacking. To address this challenge, the World Bank has piloted a bottom-up technique that combines satellite imagery with traditional survey data to strengthen population estimations. By applying this approach in Sri Lanka, they were able to more accurately predict village populations not captured by the household survey, as validated by contemporary census data. If conducting a complete census becomes logistically or economically infeasible for more countries, expanding such an approach could help provide estimates that are at least informed by survey data. The WorldPop group, for example, has combined satellite imagery, available surveys, and other data to estimate the population of Afghanistan, where a census has not been performed since 1979.

The United States, in particular, has struggled to scale up testing and traditional data collection to track the spread of COVID-19, and the actual case count could be anywhere from 6 to 24 times the reported figure. Data from Kinsa smart thermometers have demonstrated the ability to identify potential COVID-19 outbreaks days before these are reflected in official data, and Harvard researchers have created a model for predicting COVID-19 spread based on social media and search data. These new methods could provide powerful insights alongside testing, but they have not yet been peer-reviewed and vetted. While the reliance on search data by Google Flu Trends was ultimately unsuccessful at accurately predicting infections, a 2015 study offered promising results by combining search data and social media data with hospital visit records and data from a participatory surveillance system to generate a predictive model of flu cases in the United States. By leveraging the relative strengths of each of these data types, the study's machine learning algorithm was able to predict cases with a high degree of accuracy up to four weeks ahead of the official Centers for Disease Control and Prevention reports. A similar combination of data sources might support efforts during the current crisis.

While exploring the potential of new methods to meet the growing demand for data, we must not overlook the continued importance of traditional data sources. Rather than shifting exclusively to remote measurements of climate events, we should still be investing in more rain gauges so that major areas do not go without adequate ground-sourced data; instead of forgoing population enumeration and relying on EO-based estimates, we should still be collecting household-level data; and even as innovative data sources provide insights into the spread of the pandemic, we should continue to ramp up traditional data collection. Not only do these traditional data sources continue to be valuable in their own right, but their production is essential for us to capitalize on the value of new methods.