Capps programoninformationsciencebrownbag
 

Our guest speaker, Cavan Capps, Big Data Lead at the U.S. Census Bureau, presented this talk as part of the Program on Information Science Brown Bag Series.


Big Data provides both challenges and opportunities for the official statistical community. The difficult issues of privacy, statistical reliability, and methodological transparency will need to be addressed in order to make full use of Big Data in the official statistical community. The potential opportunities are improved statistical coverage at small geographies, new statistical measures, and more timely data at perhaps lower cost. This talk provides an overview of some of the research being done by the Census Bureau as it explores the use of “Big Data” for statistical agency purposes.

Speaker Bio: Cavan Capps is the U.S. Census Bureau’s Lead on Big Data processing. In that role he is focusing on new Big Data sources for use in official statistics, best-practice private sector processing techniques, and software/hardware configurations that may be used to improve statistical processes and products. Previously, Mr. Capps initiated, designed and managed a multi-enterprise, fully distributed statistical network called the DataWeb. The DataWeb is a data library of networked statistical databases from all federal statistical data domains, with sophisticated visualization, descriptive analytics, data integration and dashboard construction tools. The DataWeb is the source of the official API to Census data products.

Usage Rights: CC Attribution-ShareAlike License

  • This work by Cavan Capps is licensed under the Creative Commons Attribution-Share Alike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
    The use of Big Data isn’t new for the Census Bureau. We’ve been using administrative records, such as tax data, for decades to improve our collections. However, there is a new generation of Big Data – as the electronic environment flourishes – that we must keep up with. We must research ways to utilize these new data sources in our collections in order to increase efficiencies and to reduce costs and the time it takes to disseminate statistics. At the same time, we must also continue to maintain the quality of the official statistics. I’ll be addressing these aspects throughout my talk today.
    I was asked today to talk about two specific questions. I’ll address these questions broadly and then share some case studies of how Big Data is being used in our programs at the Census Bureau. I’ll also briefly touch on a Big Data source coming from the Census Bureau and ways the private sector could use our data in concert with Big Data.
  • Bob Groves called traditional survey data, specifically created to measure something, “Designed data”.
    The private sector maintains vast troves of transactional data, much of which is “data exhaust”, or data created as a by-product of other transactions. With the use of mobile phones, much of this data can be associated with individuals and their locations. The public sector in most countries also maintains enormous datasets in the form of census data, health indicators, and tax and expenditure information. … The global internet is currently offering near real-time data on durable and nondurable goods prices, housing sales, and other relevant events. This data exhaust can also be termed “Organic data”, which has its own strengths and weaknesses.
    The Census Bureau is the largest statistical agency in the U.S. Many of the Nation’s economic indicators and other critical socio-economic measures come from the Census Bureau. Similarly, much of the data we collect and process are critical inputs to major economic indicators and measures produced by other statistical organizations. We cannot afford to ignore the opportunities offered by these new data sources and techniques.
  • The first question I was asked to consider is whether the Census Bureau is working on Big Data projects and how these differ from other projects. My initial response is a resounding yes: the Census Bureau is incorporating Big Data solutions to improve the efficiency of its operations throughout the information lifecycle. We are exploring new sources of data and processing techniques to improve our products and increase the efficiency of our operations. The intent is for these projects to result in enterprise-wide solutions that support all surveys and census operations across the Census Bureau. This is different from our current processes and technology, which are developed to support individual surveys and census operations. Examples of these efforts include…
  • For Data Collection, we are moving to a Multi-Mode Data Collection model (for survey and census data collection) that chooses among collection modes based on survey response rate, quality, cost and several other factors to collect data effectively. We are architecting a Big Data environment that makes it easier to collect large volumes of data from various sources, integrate with internal and external sources of data, and make real-time decisions about effective collection modes.
    In terms of Data Analysis, we are researching Big Data methodological techniques, such as modeling or mashing (or integrating) together a variety of data sources, that allow us to work effectively with the Big Data. We’re also exploring technology solutions, such as High Performance and Distributed Computing Environments, to improve the effectiveness and speed of data analytics, aided by better visualization techniques that incorporate geographic information.
    And for Data Release, we are exploring the use of correlated “Big Data” sources to improve and speed data review and to test that the released data maintain privacy and confidentiality.
  • Currently most of the Census statistical processing is based on designed surveys or designed measures from administrative data. Most of the processing is batch processing in SAS. Depending on the size of the data, processing times can be lengthy. Most speed improvements have been achieved by increasing the size of the machine.
    In the future, as more data may be combined from various sources of organic as well as designed data, data sizes may grow rapidly. User expectations are also growing: users expect data to be released in a more timely manner, with more geographic, historical and industrial detail. The stress to deliver this information while maintaining strict confidentiality will explode. As a result, new estimation and data processing paradigms are being explored.
  • The use of alternative data sources such as administrative records or Big Data poses a number of opportunities for improving the current construction statistics produced by the Census Bureau and reducing data collection costs for these programs. For example:
    Foreclosures: Data on residential properties in various stages of foreclosure could aid in our analysis of data on new residential construction and sales. These data are currently collected by a couple of private data vendors (for Bill’s info: CoreLogic and Realty Trac). The Census Bureau has purchased address-level files from a data vendor for analysis related to household surveys, but it did not easily allow for calculation of totals needed for analysis of national data. We also purchased annual totals by state from another vendor for use in data analysis; however, the vendor does not allow purchasers to disseminate data to the public.
    Manufactured Homes: Census conducts the Manufactured Housing Survey (MHS) for the U.S. Department of Housing and Urban Development, or HUD, to provide data that they are required to collect on manufactured home placements. By law, manufactured homes must be inspected at the factory. These inspections are conducted by the Institute for Building Technology and Safety (IBTS), which provides information on the inspections that becomes the universe and sampling frame for the Manufactured Housing Survey. If we could partner with HUD and IBTS as well as manufactured home manufacturers and dealers to follow through on the inspection forms to collect information on the placement of the home, we could use this information to tabulate data on placements. The data would have no sampling error and data collection costs would be drastically reduced.
    Public Construction: Our estimates of construction spending include spending on construction funded by federal, state, and local governments, collected using voluntary surveys.
    Much of the information on government spending can be gleaned from publicly available budget documents. We do this to supplement and benchmark the data that we collect, but we could partner with government agencies that conduct construction (especially at the federal level) to obtain data files that would reduce our data collection costs and improve data quality. We have contacts at most agencies, but we have not yet undertaken a concerted effort to obtain the detailed electronic files that we need.
    Property Owners: Census Bureau surveys collect information from homeowners on owner-occupied properties. Data on non-owner-occupied properties are more difficult to obtain because the owner of the property must be located. Various administrative sources such as the Business Register, tax data, and local deed records could provide information on property owners and their individual properties. Data on improvements to non-owner-occupied properties are no longer included in the construction spending estimates because the cost of finding the owners was prohibitive. The Residential Housing Finance Survey (RHFS, a HUD-sponsored survey) had the same problem. Reducing the cost could make it feasible to improve the construction spending estimates and would allow Census to conduct other surveys more cost effectively. However, startup costs to create an up-to-date list of property owners could be significant.
    Building Permits: The largest opportunity for using administrative records for the construction area is data on building permits issued by local governments. Issuance of building permits in the U.S. is mostly at the local level, where approximately 20,000 unique jurisdictions issue permits. Some states are capturing data on all permits issued in their states, but this is not as prevalent.
    Building Permits Survey: Census conducts a monthly and annual Building Permits Survey (BPS) to obtain data on the numbers of new housing units authorized from local jurisdictions.
    Because of cost and respondent burden concerns, data on nonresidential permits and permits for alterations and repairs are not collected. As more and more jurisdictions computerize their operations and more states begin compiling permit data from their jurisdictions, we have the opportunity to capture individual permits (which are public records) for use in our estimates. Information on individual new residential permits could replace the current Building Permits Survey data collection, and it also has tremendous potential for updating the Master Address File used for many household surveys and for the decennial Census. Staff working on this survey are partnering with colleagues in the Census Bureau’s Geography Division to encourage local governments to work toward providing files of permits. Lists of individual permits would also greatly improve the annual population estimates, which currently rely on the use of statistical algorithms to allocate the Building Permits Survey jurisdiction totals to more local areas.
    Survey of Construction: The Survey of Construction (SOC), which collects data on housing starts and new home sales, requires field representatives to list individual permits in a sample of jurisdictions to create the sampling frame. Use of individual permits received from jurisdictions could eliminate this expensive operation. Use of Certificate of Occupancy permits would also eliminate the need to follow up cases in sample until the building is completed. This would reduce the cost of interviewing by about one-third and save up to $1 million per year.
    To collect data on spending on nonresidential construction, we currently purchase a list of new projects from a third party vendor each month. This list is incomplete and expensive. If we could acquire data on nonresidential permits from local jurisdictions, it would be much less expensive and more complete.
    There are many opportunities when looking at Big Data for use in official construction statistics. There are also many challenges. We have had discussions with our government counterparts about how we could assist governments with automation and with standardizing the format of their data files, but jurisdictions have local regulations and custom computer systems that make standardization challenging. Likewise, these surveys are voluntary, so obtaining all permits in the U.S. would not be feasible without legislative changes. An iterative approach would be needed, starting with obtaining information from large jurisdictions with automated systems that are willing to participate.

Capps programoninformationsciencebrownbag Presentation Transcript

  • Big Data activities at the U.S. Census Bureau
    Cavan Capps, Big Data Lead, U.S. Census Bureau
    February 13, 2014
    Prepared for MIT Libraries Program on Information Science Brown Bag Talk, Feb 2014
  • Big Data Challenge at the Census Bureau: “Designed Data” vs. “Organic Data”
    “The world is now producing large amounts of data.. data from Internet searches, credit card transactions, retail scanners, and social media”.
    “There also are more and more digital administrative data (e.g., tax records, social security records, Medicare/Medicaid records, food stamp records, HUD records). Some of these data are not directly linked to the populations we study; some have item missing data problems; none offer a real replacement for our surveys, but many will be useful as auxiliary data sources.”
  • Big Data Challenge at the Census Bureau
    Big Data is about creating information to make Big Decisions from novel, and often massive, data sources.
  • Big Data creates new Statistical Agency Challenges
    A recent meeting of International Statistical Agencies observed:
    1. The volume of data generated outside the government statistical systems is increasing much faster than the volume of data collected by the statistical systems; almost all of these data are digitized in electronic files.
    2. As this occurs, the leaders expect that relative cost, timeliness, and effectiveness of traditional survey and census approaches of the agencies may become less attractive.
  • Big Data creates new Statistical Agency Challenges
    A recent meeting of International Statistical Agencies observed:
    3. Blending together multiple available data sources (administrative, commercial electronic transactions and internet webpage data, search frequency data, Twitter, Facebook etc.) with traditional surveys and censuses (using paper, telephone, face-to-face interviewing) to create high quality, timely statistics that tell a coherent story of economic, social and environmental progress must become a major focus of central government statistical agencies.
    4. This requires efficient record linkage capabilities, the building of master universe frames that act as core infrastructure to the blending of data sources, and the use of modern statistical modeling to combine data sources with highest accuracy.
  • Big Data creates new Statistical Agency Challenges
    A recent meeting of International Statistical Agencies observed:
    5. The Agencies will need to develop the analytical capabilities to distill insights from more integrated views of the world and impart a stronger systems view across different government and private sector information systems to provide more geographical and industry detail.
    6. There are growing demands from researchers and policy-related organizations to analyze the micro-data collected by the agencies, to extract more timely and detailed information from the data.
  • Big Data Development Challenges for Statistical Agencies
    The Meeting Recommended that Statistical Agencies develop:
    1. High-speed, “big data” software/hardware systems for record linkage and extraction of key information from massive files.
    2. Efficient and sophisticated imputation procedures needed to make the combined data sources jointly useful.
    3. More use of statistical modeling for statistical estimation, to provide more:
       1. Timely estimates
       2. Small area estimates
       3. New measures
    4. New ways to give secure access to micro-data for legitimate policy and research purposes, to increase the impact of their work.
  • In Summary, massive challenges for the Statistical Agencies:
    1. The Internet and Private E-Transactions are generating data faster and more cheaply than Statistical agencies can afford to do.
    2. To be reliable sources of information on the Demographics, Economy and Social change in the U.S., this information needs to be mashed together with traditional surveys and adjusted for bias.
    3. The sizes of the files and the number of computations to mash up the data will be larger.
    4. Spoiled by the Internet, users expect more timely, and detailed data provided at lower costs.
    5. Privacy/Confidentiality must be maintained.
  • Big Data Projects at the Census Bureau
    The Census Bureau “Big Data” Information Life Cycle
    Data Collection
      - Multi-Mode Data Survey Collection model
      - New Data sources (Web, E-Transactions, Admin Recs)
    Data Integration & Analysis
      - Record Linkage
      - Small Area Estimation modeling & “Now Casting”
    Data Release
      - Data Review for Release
      - Confidentialize data for public release
  • Big Data
    Current Process:
      • Designed Data
      • Proprietary Software
      • Batch Processing
      • Long processing times
    Future Process (exploring):
      • Designed & Organic Data
      • Next Generation Open-Source & Proprietary Software
      • More Parallel Processing
      • Faster processing times
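The move from batch processing to more parallel processing can be illustrated with a small split-and-combine sketch. This is not the Census Bureau's implementation: the function names, the (state, housing units) record layout, and the sample figures are all hypothetical. Threads are used only to keep the example portable; CPU-bound work in practice would more likely run on separate processes or on a cluster framework such as the Hadoop stack the slides mention.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tabulate_partition(records):
    """Sum housing units per state for one partition of the file."""
    totals = Counter()
    for state, units in records:
        totals[state] += units
    return totals


def tabulate_parallel(partitions, max_workers=4):
    """Tabulate each partition independently, then merge the partial totals."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(tabulate_partition, partitions))
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds counts together
    return combined


# Hypothetical records: (state, housing units authorized), split into 3 partitions.
partitions = [
    [("MA", 3), ("VA", 1)],
    [("MA", 2), ("MD", 4)],
    [("VA", 5)],
]
totals = tabulate_parallel(partitions)
```

Because each partition is tabulated without reference to the others, adding machines or workers shortens the run instead of requiring a single larger machine.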
  • Big Data Collection: Improving Survey Logistics & Cost
    Improving Survey Collection and Imputation Operations (Adaptive Design)
    1. Multi-modal data collection to reduce operational costs of data collection
       – More effective use of existing data such as administrative records
       – Incorporating new data into decennial operations
         • Paradata from Internet Data Capture
         • Information from Social Media Feeds
    2. Edits and Imputations
    3. Data Review
  • Big Data Collection: Evaluating Web Data as Inputs
    Potential Internet Data Collection
    1. Examine Google & Bing search frequency trend data
    2. Examine Twitter, and other social media trend data
    3. Examine “Web Scraping” of housing data, price data, local tax data, crime data, corporate profits etc.
  • Big Data Collection: Evaluating Commercial E-Transaction Input Data
    1. Housing:
       – Foreclosures: Use vendor data on new residential properties in foreclosure to aid analysis of data on new construction and sales.
       – Building Permits: Web scrape opportunity to access local jurisdictions and state agencies posting public records online.
    2. Construction:
       – Difficulty obtaining electronic data from numerous state and local agencies
       – Data are needed immediately to tabulate the monthly economic indicators.
    3. Retail Sales: Evaluating electronic payment processing to fill data gaps such as geographical detail and revenue measures by firm size
       – New data products
       – Improvements to current data quality
  • Big Data Integration & Analysis (Current processes)
    Data Integration Expertise:
    • Record linkage
      – Gov’t Admin Records to other Gov’t Admin Records
      – Gov’t Admin Records to Gov’t Surveys
      – Commercial records to Gov’t Admin Records
    • Model based integration
      – Small Area Poverty & Income Estimates
      – Small Area Health & Income Estimates
      – Longitudinal Economic & Housing Dynamics
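Record linkage, listed above, can be sketched in a few lines. The example below is illustrative only and assumes a simple approach that the talk does not describe: records are grouped ("blocked") by ZIP code, and name and address strings are compared with the standard library's `difflib.SequenceMatcher`. The field names, the 0.85 threshold, and the sample records are hypothetical; production linkage systems typically use probabilistic models rather than a single similarity score.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(admin_records, survey_records, threshold=0.85):
    """Block on ZIP code, then keep each admin record's best match above the threshold."""
    by_zip = {}
    for rec in survey_records:
        by_zip.setdefault(rec["zip"], []).append(rec)

    links = []
    for admin in admin_records:
        best, best_score = None, 0.0
        for cand in by_zip.get(admin["zip"], []):
            score = similarity(
                admin["name"] + " " + admin["address"],
                cand["name"] + " " + cand["address"],
            )
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            links.append((admin["id"], best["id"], round(best_score, 3)))
    return links


# Hypothetical sample records, invented for illustration.
admin = [
    {"id": "A1", "name": "Acme Builders, Inc.", "address": "12 Main St.", "zip": "02139"},
    {"id": "A2", "name": "River Homes LLC", "address": "400 Oak Avenue", "zip": "20233"},
]
survey = [
    {"id": "S1", "name": "ACME BUILDERS INC", "address": "12 Main St", "zip": "02139"},
    {"id": "S2", "name": "Lakeview Dental", "address": "9 Elm Rd", "zip": "20233"},
]
links = link_records(admin, survey)
```

Blocking keeps the number of pairwise comparisons manageable on large files, which is one reason the slides pair record linkage with high-speed processing systems.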
  • Big Data Integration & Analysis: Exploring “Now Casting”
    Exploring “Now Casting” to improve Statistical Timeliness:
    1. Some “real time” Internet data correlates with Official Statistics:
       – Google search data modeled to match BLS unemployment & CDC Flu spread
       – Univ. of Michigan Twitter unemployment
       – MIT Billion Prices Project match to BLS CPI
    2. Census experiments with Gov’t Pension data
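The idea behind “now casting” can be shown with a deliberately simple model: fit a line relating a timely indicator to past official figures, then apply it to the latest indicator value before the official figure is available. The sketch below assumes ordinary least squares with one predictor; the variable names and all numbers are synthetic and hypothetical, and none of the projects named on the slide are claimed to work this way.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nowcast(slope, intercept, indicator_value):
    """Estimate the not-yet-published official figure from the timely indicator."""
    return intercept + slope * indicator_value


# Synthetic history: a search-frequency index and the official rate it tracks.
search_index = [40.0, 45.0, 50.0, 55.0, 60.0]
official_rate = [5.0, 5.5, 6.0, 6.5, 7.0]

slope, intercept = fit_line(search_index, official_rate)

# The latest index value is available now; the official rate is not yet published.
estimate = nowcast(slope, intercept, 65.0)
```

A real nowcast would also need to check that the relationship is stable over time and to adjust for bias in the organic source, which is the concern the summary slide raises.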
  • Big Data Lab
    1. Setting up an experimental Cluster
    2. Testing performance of Hardware
    3. Testing value of Software
       – Open Source Big Data Software: Hadoop, Mahout, Distributed R, HBase, Pig, Hive, Cassandra, Mongo, Flume, Neo4J, I-Graph, AllegroGraph
       – Internally Developed software: TEA, DataWeb, Matching software
  • On the Horizon, Development of Big Data Center
    Research, capacity building and economic Big Data Processing:
    1. Proposal to create a new center that will include members from academia and Census staff to:
       1. Help lead Census Bureau work on practices to make sense of Big Data, developing principles to apply Big Data to federal statistics.
       2. Facilitate CB as unbiased provider for information collected as Big Data
       3. Validate new techniques and data sources at a low cost (field staff allow us to do ground checks, survey questions)
       4. Lead on methods to integrate Big Data and develop standards
       5. The Center should provide a way to bring both faculty and graduate students to Census to facilitate Big Data capacity building at the Census Bureau
    2. We will explore partnerships with others doing research in this area: universities and Silicon Valley