Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based Enterprise Data Lake

The Census Bureau is the U.S. government's largest statistical agency, with a mission to provide current facts and figures about America's people, places, and economy. The Bureau operates a large number of surveys to collect this data, the best known being the decennial population census. Data is being collected in increasing volumes, and the analytics solutions must scale to meet these ever-increasing needs while maintaining the confidentiality of the data. Past data analytics have occurred in processing silos, which inhibits the sharing of information, and common reference data is replicated across multiple systems. The use of the Hortonworks Data Platform, Hortonworks DataFlow, and other open-source technologies is enabling the creation of a cloud-based enterprise data lake and analytics platform. Cloud object stores provide scalable data storage, and cloud compute supports both permanent and transient clusters. Data governance tools track data lineage and provide access controls for sensitive data.
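
The split between object storage and cluster-local storage can be illustrated with a minimal sketch. It assumes, as the Data Sharing slide states, that permanent data lives in S3 and working data lives in local HDFS. The bucket name, the HDFS root, and the `Dataset`, `Lifetime`, and `storage_uri` names are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class Lifetime(Enum):
    PERMANENT = "permanent"  # must outlive any single cluster
    WORKING = "working"      # scratch data that can be lost with the cluster


@dataclass(frozen=True)
class Dataset:
    project: str
    name: str
    lifetime: Lifetime


# Hypothetical locations: neither root comes from the talk.
OBJECT_STORE_ROOT = "s3a://example-edl-bucket"
HDFS_WORKING_ROOT = "hdfs:///tmp/edl-working"


def storage_uri(dataset: Dataset) -> str:
    """Return the URI a job would read from or write to for this dataset."""
    if dataset.lifetime is Lifetime.PERMANENT:
        root = OBJECT_STORE_ROOT
    else:
        root = HDFS_WORKING_ROOT
    return f"{root}/{dataset.project}/{dataset.name}"


def survives_cluster_teardown(dataset: Dataset) -> bool:
    """Only data kept in the object store remains after a transient cluster ends."""
    return storage_uri(dataset).startswith("s3a://")
```

Under these assumptions, a transient cluster can be terminated as soon as its job finishes, because nothing that needs to be kept is stored on the cluster itself.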

  1. © Cloudera, Inc. All rights reserved. Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based Enterprise Data Lake • Terry Padgett - Senior Solutions Architect, Cloudera • Nitin Naik - Chief Technology Officer, U.S. Census Bureau
  2. Census Background • The United States' leading provider of quality data about its people and economy • Decennial, Economic, Demographic, and a multitude of other surveys • Serves other federal agencies • Processes large volumes of data • Preparing for future analytic needs of the enterprise
  3. Enterprise Data Lake: Make changes to culture, processes, and technologies to practice and accelerate the efforts required to remain a leader in data and technology innovation. • Optimize Survey Operations • Reduce Respondent Burden • Improve Data Products • Consolidate Data and Code • Manage Large Datasets • Centralize Security
  4. EDL Guiding Principles • Scalability • Availability • Automation • Security & Privacy • Data Diversity • Data Stewardship • Identifiable, Locatable and Linkable Data • Reproducibility • Governance
  5. Cloud First • Establish the EDL in GovCloud • On-demand Server Instances • Cloud Object Stores • Leverage Serverless Computing
  6. Data Availability • Short-term and shared data available through cloud object stores • Long-term data available through archival stores • Built-in resiliency of storage to prevent data loss • EDL applications deployed as highly available
  7. Deployment Automation • Repeatable deployment across tiers • Platform Infrastructure - Terraform • AMI Creation - Ansible, Puppet, scripting • HDP Clusters - Cloudbreak
  8. Census Business Process and EDL • Mapping the Survey Lifecycle to the Data Lifecycle and identifying the data flow allows the EDL to build incrementally on the key areas highlighted in green. • The EDL will focus on the Enterprise Data Lifecycle stages (Process, Derive, etc.) and leverage technology advances (e.g. data mashups, machine learning, distributed computing). • [Diagram] Programs: CEDCaP (Consolidation of Data Collection Systems), Enterprise Data Lake (EDL) (Consolidation of Data Management / Store Systems), CEDSCI (Consolidation of Data Dissemination Systems) • [Diagram] Survey Lifecycle: Survey Design, Frame Development, Sample Design, Instrument Development, Response Data Collection, Data Editing & Imputation, Estimation, Data Review, & Analysis, Disclosure Avoidance, Data Product Dissemination, Research/Analytics • [Diagram] Data Lifecycle: Define, Collect, Capture (3rd Party Data Capture), Process, Derive, Publish, Research, Disseminate • [Legend] Areas currently in the scope of CEDCaP, CEDSCI or other programs; areas currently in the scope of EDL; EDL Supported Areas (not in scope)
  9. Enterprise Data Lake Features • Data Control • Data Lineage • Authorization Model • Storage Management • Data Sharing • Dynamic Platform Provisioning • Cloud-based • Cost Control
  10. Data Control • Data Registration • Datasets onboarded with mandatory metadata • Registered in the existing Data Management System • Project access controls generated • Code Repositories • Data Lineage: Atlas • Authorization: Ranger • Controls for projects • Column protection • Row filtering
  11. Data Platform: Common Shared Services
  12. Data Platform: Compute On Demand
  13. Data Platform: Transient Clusters
  14. Data Movement: DataPlane Service
  15. Data Sharing • S3 as first-class storage - permanent data • Local HDFS - working data • Hive: Data Warehouse
  16. Data Science • Spark • R • SAS
  17. Data Lineage Traceability
  18. Data Protection • EBS volume encryption • S3 server side encryption • SSL/TLS • Hadoop TDE
  19. Thank You
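
The column protection and row filtering controls listed on slide 10 can be sketched in plain Python. This is only an illustration of the two ideas and does not use Apache Ranger or mirror its policy model. `ProjectPolicy`, `apply_policy`, the `"***"` mask value, and the sample columns are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Row = Dict[str, object]


@dataclass(frozen=True)
class ProjectPolicy:
    """Hypothetical per-project policy; not Apache Ranger's actual policy model."""
    masked_columns: FrozenSet[str]     # column protection: values are hidden
    row_filter: Callable[[Row], bool]  # row filtering: True means the row is visible


def apply_policy(rows: List[Row], policy: ProjectPolicy) -> List[Row]:
    """Drop rows the project may not see, then mask the protected columns."""
    visible = [row for row in rows if policy.row_filter(row)]
    return [
        {col: ("***" if col in policy.masked_columns else val) for col, val in row.items()}
        for row in visible
    ]
```

For example, a policy with `masked_columns=frozenset({"name"})` and `row_filter=lambda r: r["state"] == "MD"` would return only the Maryland rows, with each `name` value replaced by `"***"`. The input rows are left unchanged, so the same data can be served to several projects under different policies.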
