Big Data for 21st Century Economic Statistics

An NBER conference on Big Data for 21st Century Economic Statistics met in Washington, DC, on March 15-16, 2019. Research Associates Katharine G. Abraham of the University of Maryland and Matthew D. Shapiro of the University of Michigan, Ron S. Jarmin of the U.S. Census Bureau, and Brian Moyer of the Bureau of Economic Analysis organized the meeting, which was sponsored by the Alfred P. Sloan Foundation. The following papers were presented and discussed:


Carol Robbins, National Science Foundation; Jose Bayoan Santiago Calderon, Claremont Graduate University; Gizem Korkmaz, Daniel Chen, Sallie Keller, Aaron Schroeder, and Stephanie S. Shipp, University of Virginia; Claire Kelling, Pennsylvania State University

The Scope and Impact of Open Source Software: A Framework for Analysis and Preliminary Cost Estimates

Open source software is everywhere, both as specialized applications nurtured by devoted user communities and as digital infrastructure underlying platforms used by millions daily. This type of software is developed, maintained, and extended both within the private sector and outside of it, through contributions from people in businesses, universities, government research institutions, and nonprofits, as well as from individuals. Robbins, Korkmaz, Calderon, Kelling, Keller, and Shipp propose and prototype a method to document the scope and impact of open source software created by these sectors, thereby extending existing measures of publicly funded research output. The researchers estimate the cost of developing packages for the open source software languages R, Python, Julia, and JavaScript, as well as reuse statistics for R packages. These reuse statistics are measures of relative value. The researchers estimate that the resource cost of developing the packages for R, Python, Julia, and JavaScript exceeds $3 billion, based on 2017 costs.
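The cost-estimation step can be illustrated with a minimal sketch of a COCOMO-style calculation of the kind commonly used to convert lines of code into development effort and then into dollars. The constants, wage, and example package below are illustrative assumptions, not the paper's calibration.

```python
# Minimal, hypothetical sketch of a COCOMO-style cost estimate for an open
# source package; the constants (basic COCOMO "organic" mode) and the assumed
# monthly wage are illustrative, not the calibration used in the paper.

def person_months(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Basic COCOMO effort equation: effort = a * KLOC^b (person-months)."""
    return a * kloc ** b

def package_cost(kloc: float, monthly_wage: float = 12_000.0) -> float:
    """Convert estimated effort into a resource cost in dollars."""
    return person_months(kloc) * monthly_wage

if __name__ == "__main__":
    # Example: a hypothetical package with 40,000 lines of code (40 KLOC).
    print(f"Estimated development cost: ${package_cost(40):,.0f}")
```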


Katharine G. Abraham, University of Maryland and NBER; Margaret Levenstein, University of Michigan; and Matthew D. Shapiro, University of Michigan and NBER

Securing Commercial Data for Economic Statistics


W. Erwin Diewert, University of British Columbia and NBER, and Robert C. Feenstra, University of California, Davis and NBER

Estimating the Benefits of New Products

A major challenge facing statistical agencies is the problem of adjusting price and quantity indexes for changes in the availability of commodities. This problem arises in the scanner data context as products in a commodity stratum appear and disappear in retail outlets. Hicks suggested a reservation price methodology for dealing with this problem in the context of the economic approach to index number theory. Feenstra and Hausman suggested specific methods for implementing the Hicksian approach. Diewert and Feenstra evaluate these approaches and recommend taking one-half of the constant-elasticity-of-substitution (CES) gains computed as in Feenstra, which under weak conditions will be above but reasonably close to the gains obtained from a linear approximation to the demand curve as proposed by Hausman. The researchers compare the CES gains to those obtained using a quadratic utility function. The various approaches are implemented using scanner data on frozen juice products that are available online.
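For readers who want the algebra, the block below restates the standard CES variety adjustment of Feenstra (1994) on which the constant-elasticity gains are based; it is a sketch of the general approach, not the paper's exact derivation. Here sigma is the elasticity of substitution and lambda_t is the period-t expenditure share of products available in both periods; the authors' recommendation amounts to taking one-half of the gains implied by the second factor.

```latex
% CES exact price index with changing product variety (Feenstra 1994 sketch):
% P^{SV} is the Sato-Vartia index over continuing products, and \lambda_t is
% the expenditure share of continuing products in period t.
\frac{P_t}{P_{t-1}}
  = P^{SV}_{t-1,t}\,
    \left(\frac{\lambda_t}{\lambda_{t-1}}\right)^{\frac{1}{\sigma-1}}
```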


David Copple, Bradley J. Speigner, and Arthur Turrell, Bank of England

Transforming Naturally Occurring Text Data into Economic Statistics: The Case of Online Job Vacancy Postings

Copple, Speigner, and Turrell combine official and naturally occurring data to obtain a more detailed view of the UK labor market, one that captures heterogeneity by both region and occupation. The novel, naturally occurring data are 15 million job vacancy adverts posted by firms on one of the UK's leading recruitment websites. The researchers map these messy online data into official classifications of sector, region, and occupation. The recruitment firm's own unofficial job sector field is mapped manually into the official sectoral classification and, making use of official vacancy statistics by sector, is used to reweight the data to reduce bias. The researchers map the latitude and longitude of each vacancy directly into regions. To match up to official statistics organized by standard occupational classification (SOC) codes, the researchers develop an unsupervised machine learning algorithm that takes the text data associated with each job vacancy and maps it into SOC codes. The algorithm makes use of all text associated with a job, including the job description, and could be used in a range of other situations in which text must be mapped to official classifications. The researchers plan to make the algorithm available as a Python package via GitHub. Used in combination with official statistics, these data allow the researchers to examine the weak UK productivity and output growth that have been enduring features of the post-crisis period. Labor market mismatch between the unemployed and job vacancies has previously been implicated as one driver of the UK's productivity 'puzzle' (Patterson et al., "Working Hard in the Wrong Place: A Mismatch-Based Explanation to the UK Productivity Puzzle," European Economic Review 84 (2016): 42-56). Using the fully labelled dataset, the researchers examine the extent to which unwinding occupational and regional mismatch would have boosted productivity and output growth in the post-crisis period. The effects of mismatch on output are driven by dispersion in productivity, tightness, and matching efficiency (for which the researchers provide new estimates). The researchers show evidence of significant dispersion of these across sub-markets, with the aggregate data hiding important heterogeneity. Contrary to previous work, the researchers find that unwinding occupational mismatch would have had a weak effect on growth in the post-crisis period. However, unwinding regional mismatch would have substantially boosted output and productivity growth relative to the actual path, bringing it in line with the pre-crisis trend. The researchers demonstrate how naturally occurring data can be a powerful complement to official statistics.
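The general idea of mapping free text to SOC codes can be illustrated with a rough baseline that is not the authors' algorithm: compare each advert's text to the official SOC title and description text using TF-IDF vectors and cosine similarity. The SOC descriptions and adverts below are placeholders.

```python
# Illustrative baseline (not the authors' algorithm): assign each job advert
# to the SOC code whose description text it most resembles, using TF-IDF
# vectors and cosine similarity. The descriptions and adverts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

soc_descriptions = {
    "2136": "programmers and software development professionals",
    "6145": "care workers and home carers",
}

adverts = [
    "Python developer wanted to build data pipelines and web services",
    "Home care assistant to support elderly clients with daily living",
]

vectorizer = TfidfVectorizer(stop_words="english")
corpus = list(soc_descriptions.values()) + adverts
X = vectorizer.fit_transform(corpus)

soc_vecs = X[: len(soc_descriptions)]   # rows for the SOC descriptions
ad_vecs = X[len(soc_descriptions):]     # rows for the job adverts

similarity = cosine_similarity(ad_vecs, soc_vecs)
codes = list(soc_descriptions)
for advert, row in zip(adverts, similarity):
    print(advert[:40], "->", codes[row.argmax()])
```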


Edward L. Glaeser, Harvard University and NBER, and Hyunjin Kim and Michael Luca, Harvard University

Nowcasting the Local Economy: Using Yelp Data to Measure Economic Activity (NBER Working Paper No. 24010)

Can new data sources from online platforms help to measure local economic activity? Government datasets from agencies such as the U.S. Census Bureau provide the standard measures of economic activity at the local level. However, these statistics typically appear only after multi-year lags, and the public-facing versions are aggregated to the county or ZIP code level. In contrast, crowdsourced data from online platforms such as Yelp are often contemporaneous and geographically finer than official government statistics. Glaeser, Kim, and Luca present evidence that Yelp data can complement government surveys by measuring economic activity in close to real time, at a granular level, and at almost any geographic scale. Changes in the number of businesses and restaurants reviewed on Yelp can predict changes in the number of overall establishments and restaurants in County Business Patterns (CBP). An algorithm using contemporaneous and lagged Yelp data can explain 29.2 percent of the residual variance after accounting for lagged CBP data, in a testing sample not used to generate the algorithm. The algorithm is more accurate for denser, wealthier, and more educated ZIP codes.
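A minimal sketch of this type of prediction exercise (illustrative only, not the authors' algorithm) is shown below: account for lagged CBP growth first, then ask how much of the remaining variation a model using Yelp measures can explain in a held-out sample. The data frame `df` and its column names are assumptions.

```python
# Illustrative sketch (not the authors' exact algorithm): how much residual
# variance in CBP growth, after controlling for lagged CBP, can Yelp-based
# measures explain out of sample? `df` is an assumed panel with one row per
# ZIP code and year.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def residual_r2(df: pd.DataFrame) -> float:
    train, test = train_test_split(df, test_size=0.3, random_state=0)

    # Step 1: baseline using lagged CBP growth only.
    base = LinearRegression().fit(train[["cbp_growth_lag"]], train["cbp_growth"])
    train_resid = train["cbp_growth"] - base.predict(train[["cbp_growth_lag"]])
    test_resid = test["cbp_growth"] - base.predict(test[["cbp_growth_lag"]])

    # Step 2: explain the residual variation with Yelp measures.
    yelp_cols = ["yelp_growth", "yelp_growth_lag"]
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(train[yelp_cols], train_resid)
    return r2_score(test_resid, model.predict(test[yelp_cols]))
```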


Rishab Guha, Harvard University, and Serena Ng, Columbia University and NBER

A Machine Learning Analysis of Seasonal and Cyclical Sales in Weekly Scanner Data

Guha and Ng analyze weekly scanner data collected for 108 product groups at the county level between 2006 and 2014. The data display multi-dimensional weekly seasonal effects that are not exactly periodic but are cross-sectionally dependent. Existing univariate procedures are imperfect and yield adjusted series that continue to display strong seasonality upon aggregation. The researchers suggest augmenting the univariate adjustments with a panel data step that pools information across counties. Machine learning tools are then used to remove the within-year seasonal variations. A demand analysis of the adjusted budget shares finds three factors: one that is trending, and two cyclical ones that are well aligned with the level and change in consumer confidence. The effects of the Great Recession vary across locations and product groups, with consumers substituting toward home cooking and away from non-essential goods. The data are thus informative about local and aggregate economic conditions once the seasonal effects are removed. The two-step methodology can be adapted to remove other types of nuisance variations provided that those variations are cross-sectionally dependent.
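The panel step can be outlined as follows (an illustrative sketch under assumed column names, not the authors' exact procedure): after the univariate adjustment, pool counties within a product group and let a flexible learner absorb the remaining week-of-year variation; what is left over is the adjusted series.

```python
# Illustrative sketch (not the authors' exact procedure) of the panel step:
# pool counties within a product group and strip out remaining week-of-year
# variation with a flexible learner. `df` is an assumed panel with columns
# 'county', 'week_of_year', and 'adj_sales' (the univariate-adjusted series).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def remove_within_year_seasonality(df: pd.DataFrame) -> pd.Series:
    X = pd.get_dummies(df[["county", "week_of_year"]].astype(str))
    y = df["adj_sales"]
    seasonal_fit = GradientBoostingRegressor(random_state=0).fit(X, y).predict(X)
    # What remains after removing the pooled seasonal fit is the adjusted series.
    return y - seasonal_fit
```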


Gabriel Ehrlich and David Johnson, University of Michigan; John C. Haltiwanger, University of Maryland and NBER; Ron S. Jarmin, U.S. Census Bureau; and Matthew D. Shapiro, University of Michigan and NBER

Re-Engineering Key National Economic Indicators

Traditional methods of collecting data from businesses and households face increasing challenges. These include declining response rates to surveys, increasing costs of traditional modes of data collection, and the difficulty of keeping pace with rapid changes in the economy. The digitization of virtually all market transactions offers the potential for re-engineering key national economic indicators. The challenge for the statistical system is how to operate in this data-rich environment. Ehrlich, Haltiwanger, Jarmin, Johnson, and Shapiro focus on the opportunities for collecting item-level data at the source and constructing key indicators using measurement methods consistent with such a data infrastructure. Ubiquitous digitization of transactions allows price and quantity to be collected or aggregated simultaneously at the source. This new architecture for economic statistics creates challenges arising from the rapid change in the items sold. The researchers explore some recently proposed techniques for estimating price and quantity indexes in large-scale item-level data. Although those methods display tremendous promise, substantially more research is necessary before they will be ready to serve as the basis for official economic statistics. Finally, the researchers address the implications of building national statistics from transactions for data collection and for the capabilities and organization of the statistical agencies in the 21st century.
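As a small illustration of what item-level index construction looks like in practice (a sketch under assumed column names, not the specific estimators studied in the paper), the block below computes a matched-item Tornqvist price index directly from transaction records containing item identifiers, prices, and quantities for two periods.

```python
# Illustrative sketch: a matched-item Tornqvist price index computed directly
# from item-level transactions. `df` is an assumed data set with columns
# 'item', 'period', 'price', and 'quantity'; only items present in both
# periods enter the index (the methods discussed in the paper also handle
# product entry and exit).
import numpy as np
import pandas as pd

def tornqvist(df: pd.DataFrame, t0: str, t1: str) -> float:
    p = df.pivot_table(index="item", columns="period", values="price")
    q = df.pivot_table(index="item", columns="period", values="quantity")
    common = p[[t0, t1]].dropna().index.intersection(q[[t0, t1]].dropna().index)

    exp = p.loc[common] * q.loc[common]        # item-level expenditures
    share = exp / exp.sum()                    # expenditure shares by period
    weights = 0.5 * (share[t0] + share[t1])    # average shares across periods
    log_rel = np.log(p.loc[common, t1] / p.loc[common, t0])
    return float(np.exp((weights * log_rel).sum()))
```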


Andrea Batch, Jeffrey C. Chen, Alexander Driessen, Abe Dunn, and Kyle K. Hood, Bureau of Economic Analysis

Off to the Races: A Comparison of Machine Learning and Alternative Data for Predicting Economic Indicators


Tomaz Cajner, Leland D. Crane, Ryan Decker, Adrian Hamins-Puertolas, and Christopher Kurz, Federal Reserve Board

Improving the Accuracy of Economic Measurement with Multiple Data Sources: The Case of Payroll Employment Data

Cajner, Crane, Decker, Hamins-Puertolas, and Kurz combine the information from two sources of U.S. payroll employment data to increase the accuracy of real-time measurement of the labor market. The two data sources are the Current Employment Statistics (CES) payroll employment series and an employment series based on microdata from the payroll processing firm ADP. The two time series are derived from roughly equal-sized and mostly nonoverlapping samples. The researchers argue that combining the CES and ADP data series reduces the measurement error inherent in both data sources. In particular, they infer “true” unobserved payroll employment growth using a state-space model and find that the optimal predictor of the unobserved state puts approximately equal weight on the CES and ADP series. The researchers show that the estimated state helps forecast future values of CES, even controlling for lagged values of CES and a state estimate using CES information only. In addition, the researchers present the results of an exercise that benchmarks the data series to an employment census, the Quarterly Census of Employment and Wages (QCEW).
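The essence of the state-space approach can be conveyed with a stylized sketch (not the authors' specification): a single latent employment-growth state is observed each period by two independent noisy series, and a Kalman filter weights each observation by its precision, so that equal measurement variances imply roughly equal weights.

```python
# Stylized sketch (not the authors' specification): a local-level state-space
# model in which true employment growth is observed twice each period -- by a
# CES-like series and an ADP-like series -- with independent measurement error.
import numpy as np

def kalman_two_measurements(y1, y2, var1, var2, var_state, mu0=0.0, p0=1.0):
    """Filtered estimates of the latent state given two noisy observation series."""
    mu, p = mu0, p0
    filtered = []
    for obs1, obs2 in zip(y1, y2):
        # Predict: the latent state follows a random walk.
        p = p + var_state
        # Update sequentially with each (independent) measurement.
        for obs, var in ((obs1, var1), (obs2, var2)):
            gain = p / (p + var)
            mu = mu + gain * (obs - mu)
            p = (1 - gain) * p
        filtered.append(mu)
    return np.array(filtered)

# With var1 == var2, the filter puts approximately equal weight on the two
# series, mirroring the paper's finding for the CES and ADP data.
```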


J. Bradford Jensen, Georgetown University and NBER; Shawn D. Klimek, Andrew L. Baer, and Joseph Staudt, U.S. Census Bureau; and Lisa Singh and Yifang Wei, Georgetown University

Automating Response Evaluation for Franchising Questions on the 2017 Economic Census

Between the 2007 and 2012 Economic Censuses (EC), the count of franchise-affiliated establishments declined by 9.8%. One reason for this decline was a reduction in the resources that the Census Bureau was able to dedicate to the manual evaluation of survey responses in the franchise section of the EC. Extensive manual evaluation in 2007 resulted in many establishments whose survey forms indicated they were not franchise-affiliated being recoded as franchise-affiliated. No such evaluation could be undertaken in 2012. Baer, Jensen, Klimek, Singh, Staudt, and Wei examine the potential of using external data harvested from the web in combination with machine learning methods to automate the process of evaluating responses to the franchise section of the 2017 EC. The method allows the researchers to quickly and accurately identify and recode establishments that have been mistakenly classified as not being franchise-affiliated, increasing the unweighted number of franchise-affiliated establishments in the 2017 EC by 22%-42%.


Sudip Bhattacharjee and Ugochukwu Etudo, University of Connecticut, and John Cuffe, Justin Smith, and Nevada Basdeo, U.S. Census Bureau

Using Public Data to Generate Industrial Classification Codes

The North American Industry Classification System (NAICS) is the system by which multiple federal and international statistical agencies assign business establishments to industries. Generating these codes can be a costly enterprise, and the variety of data sources used across federal agencies leads to disagreement over the "true" classification of establishments. Bhattacharjee, Cuffe, Etudo, Smith, and Basdeo propose an approach that could improve both the quality of these codes and the efficiency of the process for generating them. The NAICS codes serve as a basis for survey frames and published economic statistics. Currently, multiple statistical agencies and bureaus, including the Census Bureau, the Bureau of Labor Statistics (BLS), and the Social Security Administration, generate their own codes, which can introduce inconsistencies across datasets housed at different agencies. For example, the business list comparison project undertaken by BLS and the Census Bureau found differences in classification even for single-unit establishments (Fairman et al., 2008; Foster et al., 2006). The researchers propose that combining publicly available data with modern machine learning techniques can improve the accuracy and timeliness of Census data products while also reducing costs. Using an initial sample of approximately 1.3 million businesses gathered from public APIs, the researchers draw on user reviews and website information to accurately predict two-digit NAICS codes in approximately 59% of cases. The approach shows promise; however, substantial methodological issues and possible privacy concerns must be resolved before statistical agencies can implement such a system.
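A rough sketch of the classification task is shown below; it is illustrative only, and the features, labels, and pipeline are assumptions rather than the authors' production system. The idea is simply to predict a two-digit NAICS sector from review and website text with a standard supervised text classifier and report held-out accuracy.

```python
# Illustrative sketch (not the authors' production system): predict two-digit
# NAICS sectors from business review and website text. `texts` and `naics2`
# are assumed lists holding one text blob and one labeled two-digit code per
# business.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def fit_naics_classifier(texts, naics2):
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, naics2, test_size=0.2, random_state=0
    )
    clf = make_pipeline(
        TfidfVectorizer(stop_words="english", min_df=5),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(X_tr, y_tr)
    print(f"Held-out accuracy: {clf.score(X_te, y_te):.2f}")
    return clf
```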


Jeremy Moulton, University of North Carolina, Chapel Hill, and Marina Gindelsky and Scott A. Wentland, Bureau of Economic Analysis

Valuing Housing Services in the Era of Big Data: A User Cost Approach Leveraging Zillow Microdata

Historically, residential housing services, or "space rent," for owner-occupied housing have made up a substantial portion (approximately 10%) of U.S. GDP final expenditures. The current methods and imputations employed by the Bureau of Economic Analysis (BEA) for this estimate rely primarily on designed survey data from the Census Bureau. Gindelsky, Moulton, and Wentland develop new, proof-of-concept estimates valuing housing services based on a user cost approach, utilizing detailed microdata from Zillow (ZTRAX), a "big data" set that contains detailed information on hundreds of millions of market transactions. Methodologically, this kind of data allows the researchers to incorporate actual market prices into the estimates more directly for property-level hedonic imputations, providing an example for statistical agencies to consider as they improve the national accounts by incorporating additional big data sources. Further, the researchers are able to incorporate other property-level information into the estimates, reducing potential measurement error associated with aggregation across markets that vary extensively by region and locality. Finally, they compare their estimates to the corresponding series of BEA statistics, which are based on a rental-equivalence method. Because the user cost approach depends more directly on the market prices of homes, the researchers find that the initial results since 2001 track aggregate home price indices more closely than the current estimates do.
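The user cost approach values a year of owner-occupied housing services as the annual cost of holding the home. A standard textbook version of the formula, shown below, is a sketch of the general approach rather than BEA's or the authors' exact specification.

```latex
% Standard user cost of owner-occupied housing (sketch, not the paper's exact
% specification): P_H is the market value of the home, i the nominal interest
% rate, tau the property tax rate, delta depreciation and maintenance, and
% E[pi_H] expected house price appreciation.
u_t = P_{H,t}\left(i_t + \tau_t + \delta - \mathbb{E}[\pi_{H,t}]\right)
```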


Shifrah Aron-Dine, Stanford University, and Aditya Aladangady, Wendy Dunn, Laura Feiveson, Paul Lengermann, and Claudia R. Sahm, Federal Reserve Board

From Transactions Data to Economic Statistics: Constructing Real-Time, High-Frequency, Geographic Measures of Consumer Spending

Data on consumer spending are important for tracking economic activity and informing economic policymakers in real time. Aladangady, Aron-Dine, Dunn, Feiveson, Lengermann, and Sahm describe the construction of a new data set on consumer spending. They transform anonymized card transactions from a large payment technology company into daily, geographic estimates of spending that are available only a few days after the spending occurred. The Census Bureau's monthly survey of retail sales is a primary source for monitoring the cyclical position of the economy, but it is a national statistic that is not well suited to studying localized or short-lived shocks. Moreover, lags in the release of the survey and subsequent -- sometimes large -- revisions can diminish its usefulness for policymakers. Expanding the official survey to include more detail and faster publication would be expensive and would add substantially to respondent burden. The researchers' approach helps fill these information gaps by using data on consumer spending with credit and debit cards and other electronic payments from a private company. The researchers' daily series are available from 2010 to the present and can be aggregated to generate national, monthly growth rates similar to official Census statistics. As an application of the new, higher-frequency, geographic information in their data set, the researchers quantify in real time the effects of Hurricanes Harvey and Irma on spending.
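The basic transformation from raw card transactions to a published-style series can be sketched in a few lines (an illustration under assumed column names, omitting the authors' filtering and scaling steps): aggregate transaction amounts to daily county totals, then roll them up to national monthly growth rates.

```python
# Illustrative sketch (not the authors' filtering or scaling procedure):
# aggregate anonymized card transactions to daily county-level spending, then
# to national monthly growth rates. `tx` is an assumed transaction-level data
# frame with columns 'date', 'county', and 'amount'.
import pandas as pd

def daily_county_spending(tx: pd.DataFrame) -> pd.DataFrame:
    tx = tx.assign(date=pd.to_datetime(tx["date"]))
    return tx.groupby(["date", "county"], as_index=False)["amount"].sum()

def national_monthly_growth(daily: pd.DataFrame) -> pd.Series:
    monthly = (daily.set_index("date")["amount"]
               .resample("MS").sum())   # national monthly totals
    return monthly.pct_change()          # month-over-month growth rates
```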


David Friedman, Crystal G. Konny, and Brendan K. Williams, Bureau of Labor Statistics

Big Data in the U.S. Consumer Price Index: Experiences & Plans

The Bureau of Labor Statistics (BLS) has generally relied on its own sample surveys to collect the price and expenditure information necessary to produce the Consumer Price Index (CPI). The burgeoning availability of big data has created a proliferation of information that could lead to methodological improvements and cost savings in the CPI. The BLS has undertaken several pilot projects in an attempt to supplement and/or replace its traditional field collection of price data with alternative sources. In addition to cost reductions, these projects have demonstrated the potential to expand sample size, reduce respondent burden, obtain transaction prices more consistently, and improve price index estimation by incorporating real-time expenditure information -- a foundational component of price index theory that has not been practical until now. In the CPI context, Friedman, Konny, and Williams use the term "alternative data" to refer to any data not collected through traditional field collection procedures by CPI staff, including third-party datasets, corporate data, and data collected through web scraping or retailer APIs. The researchers review how the CPI program is adapting to work with alternative data, followed by a discussion of the three main sources of alternative data under consideration by the CPI program, with a description of the research and other steps taken to date for each source. They conclude with a discussion of future plans.


Don Fast and Susan Fleck, Bureau of Labor Statistics

Measuring Export Price Movements With Administrative Trade Data

The International Price Program (IPP) surveys establishments to collect price data on merchandise trade and calculates import and export price indexes (MXPI). In an effort to expand the quantity and quality of the MXPI, Fast and Fleck research the potential to augment the number of price indexes by calculating thousands, and potentially millions, of prices directly from export administrative trade transaction data maintained by the Census Bureau. This pilot research requires reconsideration of the long-held view that unit value price indexes are biased because product-mix changes account for a large share of their price movement. In this research, Fast and Fleck address that methodological concern and identify others by analyzing two semi-homogeneous product categories among the 129 5-digit BEA End Use export categories. The results provide a road map for a consistent and testable approach that aligns with the concepts used in existing MXPI measures, maximizes the use of high-volume data, and mitigates the risk of unit value bias. Preliminary analysis of all 129 5-digit BEA End Use categories for exports shows potential for calculating export price indexes for 50 of the 5-digit classification categories, of which 21 are currently not published.
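The unit value concern can be stated compactly. For a product category with administrative records of value and quantity, the standard unit value index is defined as below (a sketch of the textbook definition, not the authors' final estimator); the worry is that shifts in the product mix within the category move the index even when no individual price changes.

```latex
% Unit value and unit value index for a product category, with v_{i,t} the
% recorded value and q_{i,t} the recorded quantity of transaction i in
% period t (standard definition; not the authors' final estimator).
UV_t = \frac{\sum_i v_{i,t}}{\sum_i q_{i,t}}, \qquad
I_{t/t-1} = \frac{UV_t}{UV_{t-1}}
```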


Rebecca J. Hutchinson, U.S. Census Bureau

Investigating Alternative Data Sources to Reduce Respondent Burden in United States Census Bureau Retail Economic Data Products


Abe Dunn, Bureau of Economic Analysis; Dana Goldman and Neeraj Sood, University of Southern California and NBER; and John Romley, University of Southern California

Quantifying Productivity Growth in Health Care Using Insurance Claims and Administrative Data

Dunn, Goldman, Romley, and Sood assess changes in multifactor productivity (MFP) in delivering episodes of care (including that received after initial discharge from a hospital) for elderly Medicare beneficiaries with three important conditions over 2002-2014. Across the conditions, the researchers find that MFP declined during the 2000s and then stabilized. For heart attack, for example, MFP decreased by 15.9% over the study period. While heart-attack patients experienced better health outcomes over time, growth in the cost of care for these episodes dominated. The cost of hospital readmissions among heart-attack patients appears to have increased substantially.