Big Data@Scale

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Gargi Singh Chhatwal, Associate Solutions Architect, AWS
Dr. Nitin Naik, Chief Technology Officer, Census
Session Code : 194326
Nandakumar Sreenivasan, Senior Solutions Architect, AWS

Key Takeaways
1. Why big data?
2. How to do big data processing on AWS?
3. Architectural patterns
4. US Census data lake overview

Ever-Increasing Data
International Data Corporation (IDC), Digital Universe
2016: 16.1 zettabytes (ZB); 2025: 163 zettabytes (ZB)
Volume
Velocity
Variety
1 zettabyte = 1,000 exabytes = 1 million PB = 1 billion TB
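
The IDC figures above imply roughly a tenfold increase in nine years. A minimal sketch of that arithmetic follows, assuming decimal (SI) units and steady compound growth. The compound-growth model and the function names are assumptions made for illustration and are not part of the IDC report.

```python
# Decimal (SI) unit sizes in bytes.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21


def growth_factor(start_zb: float, end_zb: float) -> float:
    """How many times larger the end volume is than the start volume."""
    return end_zb / start_zb


def annual_growth_rate(start_zb: float, end_zb: float, years: int) -> float:
    """Compound annual growth rate, assuming steady growth every year."""
    return (end_zb / start_zb) ** (1 / years) - 1


factor = growth_factor(16.1, 163)
rate = annual_growth_rate(16.1, 163, 2025 - 2016)

print(f"1 ZB = {ZB // EB:,} EB = {ZB // PB:,} PB = {ZB // TB:,} TB")
print(f"2016 to 2025: {factor:.1f}x overall, about {rate:.0%} per year")
```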

Big Data Processing @ Scale
Data -> COLLECT -> STORE -> PROCESS/ANALYZE -> CONSUME -> Answers

COLLECT
Applications: Mobile apps, Web apps, Enterprise apps
Logging: Amazon CloudWatch, AWS CloudTrail
IoT: Devices, Sensors & IoT solutions, AWS IoT Analytics

Getting data into AWS
AWS Direct Connect
AWS Snowball
Amazon Kinesis Firehose
AWS Storage Gateway

STORE
Database
Search: Amazon Elasticsearch Service
SQL: Amazon Redshift, Amazon RDS
NoSQL: Amazon DynamoDB
Storage
File/Object: Amazon S3
Streaming data
Amazon Kinesis Firehose, Amazon Kinesis Streams, Apache Kafka, Amazon DynamoDB Streams
IoT / Applications / Devices streams

PROCESS / ANALYZE
Data Enrichment
Analyze: Batch, Interactive, Streaming
Extract, Transform, Load (ETL)
Data Lake
Amazon EMR, Amazon Kinesis, AWS Glue
Amazon EMR, Amazon Kinesis, Amazon QuickSight, Amazon Redshift*
Amazon ES, Amazon EMR, Amazon S3, Amazon Athena
Amazon EMR, AWS Glue, Amazon Redshift*

PROCESS / ANALYZE
Amazon EMR (Elastic MapReduce)
Fully managed Hadoop cluster framework
Supports big data frameworks such as Hive, Impala, Presto, Spark, and more
EMR File System (EMRFS) lets Amazon EMR clusters use Amazon S3 efficiently and securely for storage at any scale
Integrated with Amazon S3, Amazon RDS, Amazon Redshift, and any JDBC-compliant data store
On-Demand and Spot pricing; pay as you go
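
To make the On-Demand and Spot mix concrete, here is a minimal sketch that builds a request dictionary in the shape accepted by boto3's EMR `run_job_flow` call. The function name, cluster name, release label, instance types, and the default role names are assumptions chosen for illustration. Nothing is sent to AWS.

```python
def build_emr_cluster_request(name, core_count, spot_task_count):
    """Build keyword arguments shaped for boto3's EMR run_job_flow call.

    Master and core nodes use On-Demand capacity; task nodes use Spot,
    since losing a task node does not lose HDFS data.
    """
    def group(role, market, count):
        return {
            "Name": role.lower(),
            "InstanceRole": role,
            "Market": market,
            "InstanceType": "m4.xlarge",  # assumed instance type
            "InstanceCount": count,
        }

    groups = [
        group("MASTER", "ON_DEMAND", 1),
        group("CORE", "ON_DEMAND", core_count),
    ]
    if spot_task_count > 0:
        groups.append(group("TASK", "SPOT", spot_task_count))

    return {
        "Name": name,
        "ReleaseLabel": "emr-5.13.0",  # assumed release label
        "Applications": [{"Name": "Hive"}, {"Name": "Spark"}, {"Name": "Presto"}],
        "Instances": {
            "InstanceGroups": groups,
            # Terminate when the steps finish, so billing stops (pay as you go).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_cluster_request("nightly-etl", core_count=2, spot_task_count=4)
print([g["Market"] for g in request["Instances"]["InstanceGroups"]])
```

With credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`; that call is deliberately left out of the sketch.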

PROCESS / ANALYZE
Amazon Redshift
Fully managed relational data warehouse
Massively parallel; petabyte scale
Data compression greatly reduces I/O
Columnar data storage designed for scale
$1,000/TB/year; starts at $0.25/hour
A lot faster
A lot simpler
A lot cheaper
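
The link between columnar storage, compression, and reduced I/O can be sketched in a few lines. This is an illustration only, not how Redshift is implemented: the sample rows and helper names are made up, and run-length encoding stands in for whichever column encoding a real table would use.

```python
from itertools import groupby

# Hypothetical fact rows: (day, country, amount), sorted by country.
rows = [
    ("2018-01-01", "IN", 5),
    ("2018-01-01", "IN", 7),
    ("2018-01-02", "US", 10),
    ("2018-01-02", "US", 3),
    ("2018-01-03", "US", 8),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, ["day", "country", "amount"])

# A query such as SUM(amount) touches only one column, not whole rows.
total_amount = sum(columns["amount"])

# Values within a column are similar, so they compress well.
encoded_country = run_length_encode(columns["country"])

print(total_amount, encoded_country)
```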

PROCESS / ANALYZE
Amazon Kinesis
Managed service for real-time big data processing
Kinesis Data Streams
• Create streams to produce and consume data
• Elastically add and remove shards for performance and scale
Kinesis Data Firehose
• Easily load massive amounts of streaming data into Amazon S3 and Amazon Redshift
Kinesis Data Analytics
• Easily analyze data streams using standard SQL queries
• Elastically scales to match data throughput
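
Shards are the unit of scale in Kinesis Data Streams: each record's partition key is hashed with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The sketch below models this locally, assuming the hash space is split evenly across shards. The helper names are hypothetical, and a real stream's ranges depend on how it was split or merged.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 produces a 128-bit value


def even_shard_ranges(shard_count):
    """Split the hash space into contiguous, evenly sized (start, end) ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value is not covered by any shard range")


two_shards = even_shard_ranges(2)
four_shards = even_shard_ranges(4)  # after scaling out
for key in ["device-1", "device-2", "device-3"]:
    print(key, shard_for_key(key, two_shards), shard_for_key(key, four_shards))
```

Records that share a partition key always land on the same shard, which preserves per-key ordering while more shards add throughput.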

PROCESS / ANALYZE
Amazon Athena
An interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL
Serverless: no infrastructure or resources to manage at any scale
Schema on read: same data, many views
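
"Schema on read" means the raw data is stored untouched and a schema is applied only when a query runs, so the same bytes can back several views. A minimal local sketch of the idea follows. It does not call Athena; the sample data and function name are invented for illustration.

```python
import csv
import io

# Raw text as it might sit in an object store: no types, no header.
RAW = "2018-05-01,web,1200\n2018-05-01,mobile,800\n2018-05-02,web,1500\n"


def read_with_schema(raw_text, schema):
    """Apply (column name, converter) pairs to each raw row at read time."""
    reader = csv.reader(io.StringIO(raw_text))
    return [
        {name: convert(value) for (name, convert), value in zip(schema, row)}
        for row in reader
    ]


# View 1: visits typed as an integer, suitable for aggregation.
typed_view = read_with_schema(RAW, [("day", str), ("channel", str), ("visits", int)])

# View 2: the same data with every column kept as text.
text_view = read_with_schema(RAW, [("day", str), ("channel", str), ("visits", str)])

total_visits = sum(row["visits"] for row in typed_view)
print(total_visits, text_view[0])
```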

PROCESS / ANALYZE
AWS Glue
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
Managed Transform Engine
• Auto-generates ETL code
• Built on open frameworks: Python and Spark
Job Scheduler
• Runs jobs on a serverless Spark platform; massively scalable
Integrated with Amazon S3, Amazon RDS, Amazon EMR, Amazon Redshift, Amazon Athena, and any JDBC-compliant data store
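
The crawler idea, sampling data and deriving a table schema from it, can be sketched as below. This is a simplified illustration and not Glue's actual classifier logic; the function names, sample records, and the three-type ladder (bigint, double, string) are assumptions.

```python
def infer_type(values):
    """Pick the narrowest of bigint, double, string that fits every value."""
    def fits(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "bigint"
    if fits(float):
        return "double"
    return "string"


def infer_schema(records):
    """Infer one type per column from a list of string-valued records."""
    columns = records[0].keys()
    return {column: infer_type([r[column] for r in records]) for column in columns}


sample = [
    {"id": "1", "price": "9.99", "city": "Pune"},
    {"id": "2", "price": "12", "city": "Delhi"},
]
schema = infer_schema(sample)
print(schema)
```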

CONSUME
Apps & Services
API
Analysis and Visualization: Amazon QuickSight
Notebooks

Putting It All Together

COLLECT
Categories: Applications, Logging, IoT
Mobile apps, Web apps, Enterprise apps
Amazon CloudWatch, AWS CloudTrail
Devices, Sensors & IoT solutions, AWS IoT Analytics
STORE
Categories: Search, SQL, NoSQL, File, Stream
Amazon Elasticsearch Service, Amazon RDS, Amazon Redshift, Amazon DynamoDB, Amazon S3
Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
PROCESS/ANALYZE
Categories: Batch, Interactive, Stream, ML
Amazon EMR, Presto, Amazon Redshift, Amazon Machine Learning, Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda, Amazon EC2
ETL
Streaming
CONSUME
Categories: Analysis & visualization, Notebooks, API
Amazon QuickSight, Apps & Services

Architectural Patterns

Building Event-Driven Batch Analytics on AWS
[Architecture diagram]
Your data: On-premises data, Web app data, Amazon RDS, Other databases, Streaming data
Staging data
Input validation/conversion layer (AWS Lambda)
Pre-processed data
Input tracking layer (AWS Lambda)
State management store
Aggregation job submission and monitoring layer (AWS Lambda)
Aggregation and load layer (Amazon EMR, Amazon Redshift)
Identity and Access Management (IAM)
Monitoring and logging (CloudWatch)
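
The validation and input-tracking layers in this pattern are typically small event handlers. Below is a minimal sketch of a Lambda-style handler that reacts to an S3 object-created event, applies a validation rule, and records the outcome. The in-memory `STATE_STORE` dictionary is a hypothetical stand-in for the state management store, and the suffix-based validation rule is an assumption made for illustration.

```python
import urllib.parse

# Hypothetical in-memory stand-in for the state management store.
STATE_STORE = {}

# Assumed validation rule: only these file types are accepted.
ALLOWED_SUFFIXES = (".csv", ".json")


def handler(event, context):
    """Validate each newly staged object and track its status."""
    tracked = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        status = "VALIDATED" if key.endswith(ALLOWED_SUFFIXES) else "REJECTED"
        STATE_STORE[f"s3://{bucket}/{key}"] = status
        tracked.append((key, status))
    return tracked


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "staging"}, "object": {"key": "incoming/sales+2018.csv"}}},
        {"s3": {"bucket": {"name": "staging"}, "object": {"key": "incoming/readme.txt"}}},
    ]
}
result = handler(sample_event, None)
print(result)
```

A downstream job-submission function could then poll the state store and start an aggregation job once every expected input is marked as validated.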

Real-Time and Batch Analytics Using the Big Data Architecture
[Architecture diagram]
Your data: On-premises data, Web app data, Amazon RDS, Other databases, Streaming data
Speed layer: Kinesis Data Analytics; filtered data (S3 bucket)
Batch layer: raw data (S3 bucket); Amazon EMR; pre-processed views (S3 bucket)
Serving layer: Athena, Amazon QuickSight
Other labels: User device settings
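
The split between batch, speed, and serving layers can be sketched as follows: the batch layer periodically recomputes a view over all raw data, the speed layer covers events that arrived since the last batch run, and the serving layer combines both at query time. This is a toy in-memory model; the event shape and function names are assumptions.

```python
from collections import Counter


def batch_view(raw_events):
    """Batch layer: recompute counts per device over all stored raw data."""
    return Counter(event["device"] for event in raw_events)


def speed_view(recent_events):
    """Speed layer: count only events not yet covered by the batch run."""
    return Counter(event["device"] for event in recent_events)


def serving_query(batch, speed, device):
    """Serving layer: merge the pre-processed view with the real-time delta."""
    return batch.get(device, 0) + speed.get(device, 0)


raw = [{"device": "a"}, {"device": "a"}, {"device": "b"}]
recent = [{"device": "a"}, {"device": "c"}]

batch = batch_view(raw)
speed = speed_view(recent)
print({d: serving_query(batch, speed, d) for d in ["a", "b", "c"]})
```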

Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes: No Loading Required
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
Amazon Redshift
JDBC/ODBC
1 2 3 4 ... N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore

Data lake on Amazon S3 with AWS Glue
Your data: On-premises data, Web app data, Amazon RDS, Other databases, Streaming data
Amazon QuickSight
AWS Glue ETL

U.S. Census - Enterprise Data Lake

Official US Statistics
• Collection and dissemination: mostly the same since World War II
• Multi-agency effort
• Surveys are the dominant data source
• Administrative records support surveys

Users want more, faster, current…
Users want more:
• Timely and detailed estimates
• Statistics that link with other data
• Microdata
• Relevant data

Big Data Benefits for Census
Enhance current surveys
Reduce respondent burden
Improve timeliness of release
Better information for unique situations
Granularity enhanced
Optimize Data Quality Process

Problem Statement
Today, the process surrounding data access for the Census’s MathStats and Data Scientists is manual,
cumbersome, and slow. Whether the goal is to gain access to data or to link data across datasets (e.g., AdRecs,
multi-survey data, and multi-period data) for longitudinal or other studies, the Census’s data stewardship policies
must be respected. The resulting data may inherit controls from the source data (e.g., Title 13, Title 26, and more),
and manual efforts are currently required to track the data lineage from source to resulting data. Additionally, a
separate IT environment is installed to handle each project’s survey instance.
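
The inheritance rule in the problem statement, where linked data carries the controls of every source, lends itself to automation. The sketch below is a hypothetical model and not the Census implementation: a derived dataset receives the union of its sources' controls and keeps references to them so lineage can be traced. The class, function, and dataset names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    controls: frozenset
    sources: tuple = ()


def link(name, *sources):
    """Create a derived dataset that inherits every control of its sources."""
    inherited = frozenset().union(*(source.controls for source in sources))
    return Dataset(name, inherited, tuple(sources))


def lineage(dataset):
    """Return the names of every upstream dataset, nearest first."""
    names = []
    for source in dataset.sources:
        names.append(source.name)
        names.extend(lineage(source))
    return names


survey = Dataset("survey_2017", frozenset({"Title 13"}))
adrec = Dataset("adrec_tax", frozenset({"Title 26"}))
linked = link("linked_study", survey, adrec)

print(sorted(linked.controls), lineage(linked))
```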
Decentralized Data Management Limitations
• Linking data across surveys is difficult
• Sharing data is a manual exercise
• Data is copied multiple times
• Honoring data stewardship policies requires distributed manual efforts
Security Control Limitations
• Controls must be duplicated for every survey system
• Governance and security measures are cumbersome
• Auditing and monitoring capabilities are inconsistent
Processing Approach Limitations
• Data processing code is inconsistently managed from one group to the next
• Reproducing results from base data is not feasible since data lineage is not consistently tracked
Technology Limitations
• Current approach requires constant acquisition of new servers
• Technology is inconsistent from one group or survey to the next
• Handling large datasets with complex calculations is challenging
[Figure: Census Data Limitation. Labels: DEMO, ECON; surveys S1 ... Sn+1; Survey Portfolio; Time Period]

Census Security Control and Usage of Data

Enterprise Data Lake (EDL) Solution Supports the Mission
The proposed EDL solution will support the business process by storing and analyzing any data with associated code
at any time throughout the lifecycle.
[Diagram]
Enterprise Data Lake: Security as a Service; Analytics as a Service; Data as a Service; Infrastructure & Operations as a Service
Components: Content Repositories; Data/Code Repository; Computational Environment; Data Ingestion Services; Transactional Systems / Data Sources
Legend: Cloud; Standardized Cloud Services; Standardized EDL Services; Component of EDL Ecosystem Specific to the EDL
Other labels: data, encryption key, permissions, monitoring

Proposed Enterprise Data Lake in the Cloud
The data lake will streamline time-consuming tasks and simplify complex
processes to make the Business and IT users’ lives easier. MathStats and
Data Scientists will be able to focus on their data, models, and products rather
than on administrative tasks.
[Diagram]
Per-survey stacks (Survey N, Survey N + 1, …): Security, Governance, Infrastructure Management, Data Management, Analytics
Programs: DEMOGRAPHICS, DECENNIAL, ECON, OTHER PROGRAMS
Enterprise; Directorate Analytics
EDL Standard Services: Standardized Cloud Services, Standardized Census Data Services, Governance, Security

Please complete the session survey in
the summit mobile app.

Thank you!
Questions?
