From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Workshop - Web Summit 2018
Columnar data formats such as Parquet and ORC are designed to optimize both query performance and costs for analytics scenarios. On the other hand, serverless computing platforms such as AWS Lambda allow you to run highly scalable applications without provisioning or managing servers. The combination of columnar storage and serverless computing can drastically simplify many of the pain points related to big data analytics, data collection, data exploration, and ETL orchestration, while at the same time reducing the total cost of ownership.
Speaker: Alex Casalboni - Technical Evangelist, AWS
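To make the abstract's point about columnar formats concrete, here is a minimal sketch in plain Python. It is not Parquet or ORC, only an illustration of the layout idea. The sample records, field names, and helper functions are hypothetical, and the length of the JSON encoding stands in as a rough proxy for bytes read from storage.

```python
import json

# Hypothetical sample records; field names are illustrative only.
ROWS = [
    {"user_id": 1, "country": "PT", "amount": 12.5},
    {"user_id": 2, "country": "IT", "amount": 7.0},
    {"user_id": 3, "country": "PT", "amount": 3.25},
]

def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

def encoded_size(value):
    """Approximate stored size as the length of the JSON encoding."""
    return len(json.dumps(value).encode("utf-8"))

def bytes_scanned_row_layout(rows):
    """A row layout reads every full record, whichever column is needed."""
    return sum(encoded_size(row) for row in rows)

def bytes_scanned_column_layout(columns, column):
    """A column layout reads only the requested column."""
    return encoded_size(columns[column])

COLUMNS = to_columns(ROWS)
total_amount = sum(COLUMNS["amount"])
```

Summing `amount` touches a single list in the column layout, while the row layout has to read every record in full. That difference is what reduces both query time and the amount of data scanned.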
1. Alex Casalboni
Technical Evangelist, AWS
@alex_casalboni
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
From Data Collection to
Actionable Insights in 60 Seconds
2. About me
• Software Engineer & Web Developer
• Startupper for 4.5 years
• Serverless Lover & AI Enthusiast
• AWS Customer since 2013
3. Agenda
1. Data Challenges
2. Columnar Formats
3. Data Lakes vs. Data Warehouses
4. Serverless Analytics
5. Demo time
4. Data Challenges
Data variety and data volumes are increasing rapidly
Multiple Consumers and Applications
Ingest
Discover
Catalog
Understand
Curate
Find insights
11. Traditional Data Warehouse
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Relational data
Terabytes to Petabytes scale
Schema defined prior to data load
Operational reporting and ad-hoc analysis
12. Data Lakes extend traditional warehouses
Relational and non-relational data
Terabytes to Exabytes scale
Schema defined during analysis (Schema on Read)
Diverse analytical engines to gain insights
Designed for low cost storage and analytics
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Data Lake
Devices Web Sensors Social
Data Catalog
Machine Learning
DW Queries
Big data processing
Interactive Real-time
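The "Schema on Read" idea from this slide can be sketched as follows: raw records are stored untouched, and each reader applies its own schema at analysis time. The event lines, field names, and the `read_with_schema` helper are hypothetical, and this illustrates the concept rather than how any specific AWS engine implements it.

```python
import json

# Raw events land in the lake as-is; no schema is enforced at write time.
RAW_LINES = [
    '{"device": "sensor-1", "temp": "21.5", "ts": 1541500000}',
    '{"device": "sensor-2", "temp": "19.0", "ts": 1541500060, "battery": 87}',
    '{"device": "web", "page": "/home", "ts": 1541500120}',
]

# The schema is declared by the reader, at analysis time.
TEMPERATURE_SCHEMA = {"device": str, "temp": float, "ts": int}

def read_with_schema(lines, schema):
    """Yield records projected and cast to `schema`.

    Lines that lack one of the schema's fields are skipped.
    """
    for line in lines:
        record = json.loads(line)
        if not all(field in record for field in schema):
            continue
        yield {field: cast(record[field]) for field, cast in schema.items()}

readings = list(read_with_schema(RAW_LINES, TEMPERATURE_SCHEMA))
```

A different consumer could read the same raw lines with a different schema (for example, page views) without any reload of the data.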
13. Snowball
Snowmobile
Kinesis Data Firehose
Kinesis Data Streams
Amazon S3
AWS Glue
Wide variety of ways to bring data in
Durability and availability at Exabyte scale
Security, compliance, and audit capabilities
Run any analytics on the same data without movement
Scale storage and compute independently
Store at $0.023 / GB-month
Query for $0.05 / GB scanned
Redshift
EMR
Athena
Kinesis
Elasticsearch Service
Data Lakes on AWS
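The two per-GB figures on this slide combine into a simple cost estimate. The sketch below uses them as illustrative defaults only: actual pricing depends on the service and region and should be checked. The `monthly_cost` function, the workload numbers, and the 10% scanned fraction (standing in for a columnar format that reads only the needed columns) are assumptions.

```python
# Per-GB figures as quoted on the slide; illustrative defaults,
# not current pricing for any specific service or region.
STORAGE_USD_PER_GB_MONTH = 0.023
QUERY_USD_PER_GB_SCANNED = 0.05

def monthly_cost(stored_gb, queries_per_month, scanned_fraction):
    """Estimate monthly cost when each query scans `scanned_fraction` of the data."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    scanned_gb = stored_gb * scanned_fraction * queries_per_month
    return storage + scanned_gb * QUERY_USD_PER_GB_SCANNED

# Hypothetical workload: 1,000 GB stored, 100 queries per month.
full_scan = monthly_cost(1000, 100, scanned_fraction=1.0)
columnar_scan = monthly_cost(1000, 100, scanned_fraction=0.1)
```

Under these assumptions the query term dominates the bill, so reducing the data scanned per query matters far more than the storage price.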
15. Serverless Analytics
Deliver cost-effective analytic solutions faster
S3 Data Lake
Glue (Data Catalog and ETL)
Redshift Spectrum
QuickSight
Serverless
Zero infrastructure
Zero administration
Pay only for what you use, not for idle resources
Availability and fault tolerance built in
Automatically scales resources with usage
Snowball
Snowmobile
Kinesis Data Firehose
many other sources
Other BI Tools
Amazon Athena
Amazon EMR
17. AWS Glue—Serverless Data Catalog & ETL
Data Catalog: Discover data and extract schema
ETL: Auto-generate customizable code in Python and Spark
Automatically discovers data and stores schema
Data is immediately searchable and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless Model
18. Crawlers: Automatic Schema Inference
Enumerate S3 objects: file 1, file 2, …, file N
Identify file type and parse files
Custom classifiers: Grok based parser
Built-in classifiers: JSON parser, CSV parser, Parquet parser, …
Semi-structured per-file schema
Semi-structured unified schema
(Figure: per-file schema trees of int, char, bool, array, and struct fields, merged into one unified schema)
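The flow on this slide (enumerate objects, parse each file, derive a per-file schema, merge into a unified schema) can be sketched in plain Python. This is an illustration under stated assumptions, not Glue's actual inference algorithm: the file contents are hypothetical, only top-level JSON fields are typed, and the rule that conflicting types fall back to `char` is an invented simplification.

```python
import json

# Hypothetical file contents standing in for enumerated S3 objects.
FILES = {
    "file1.json": '{"id": 1, "name": "a", "tags": ["x"]}',
    "file2.json": '{"id": 2, "name": "b", "active": true}',
    "fileN.json": '{"id": 3, "addr": {"city": "Lisbon"}}',
}

def type_name(value):
    """Map a parsed JSON value to a coarse type label."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "char"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "null"

def per_file_schema(text):
    """Infer a schema for the top-level fields of one JSON document."""
    record = json.loads(text)
    return {key: type_name(value) for key, value in record.items()}

def unify(schemas):
    """Union of all fields; conflicting types fall back to 'char' (assumed rule)."""
    unified = {}
    for schema in schemas:
        for field, kind in schema.items():
            if field in unified and unified[field] != kind:
                unified[field] = "char"
            else:
                unified[field] = kind
    return unified

SCHEMAS = {name: per_file_schema(text) for name, text in FILES.items()}
UNIFIED = unify(SCHEMAS.values())
```

Each file contributes only the fields it contains, and the unified schema is the superset that a single catalog table would expose.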
19. IAM Role
Glue Crawler
Data Lakes
Data Warehouse
Databases
Amazon RDS
Amazon Redshift
Amazon S3
JDBC Connection
Object Connection
Built-In Classifiers
MySQL
MariaDB
PostgreSQL
Aurora
Redshift
Avro
Parquet
ORC
XML
JSON & BSON
Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
Delimited (comma, pipe, tab, semicolon)
Compressions (ZIP, BZIP, GZIP, LZ4, Snappy)
What can Crawlers Classify?
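A crawler like the one on this slide needs an IAM role, a target data store, and a catalog database to write tables into. The sketch below builds the parameters for boto3's Glue `create_crawler` call for an S3 target. The crawler name, role ARN, database name, and bucket path are placeholders, and the function that actually calls AWS is defined but not invoked, since it needs boto3 and valid credentials.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler call (S3 target)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_start_crawler(request):
    """Create the crawler and start it; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])

# Placeholder values; replace with real resources before calling AWS.
REQUEST = crawler_request(
    name="events-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="example_datalake",
    s3_path="s3://example-bucket/events/",
)
```

For JDBC sources such as Amazon RDS or Redshift, the `Targets` entry would use a connection-based target instead of an S3 path.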
20. Detecting Schema Similarity
Schema A
root
  name: str
  id: num
  addr
    street: str
    city: str
    zip: num
Schema B
root
  name: str
  id: num
  addr: str
Schema similarity heuristic
• 1 point for matching name
• 1 point for matching data type
• Match when similarity index > 0.7
sim = intersection / min(A, B) = 7 / 8 = 0.875
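The heuristic on this slide can be sketched as below. The slide does not spell out how nodes are counted, so this is one interpretation: every node, including `root`, is worth up to 2 points (1 for its name, 1 for its type), and nested fields are identified by their dotted path. With that assumption, Schema A totals 14 points, Schema B totals 8, and the two share 7, which matches the 7/8 on the slide. The function names and the flat-dictionary representation are hypothetical.

```python
# Schemas from the slide, flattened to {path: type}; nested fields use dotted paths.
SCHEMA_A = {
    "root": "struct",
    "root.name": "str",
    "root.id": "num",
    "root.addr": "struct",
    "root.addr.street": "str",
    "root.addr.city": "str",
    "root.addr.zip": "num",
}
SCHEMA_B = {
    "root": "struct",
    "root.name": "str",
    "root.id": "num",
    "root.addr": "str",
}

MATCH_THRESHOLD = 0.7  # "Match when similarity index > 0.7"

def total_points(schema):
    """Each node can score 1 point for its name and 1 for its data type."""
    return 2 * len(schema)

def similarity(a, b):
    """Shared points divided by the point total of the smaller schema."""
    shared = 0
    for path, kind in a.items():
        if path in b:
            shared += 1  # matching name
            if b[path] == kind:
                shared += 1  # matching data type
    return shared / min(total_points(a), total_points(b))

def is_match(a, b):
    return similarity(a, b) > MATCH_THRESHOLD

SIM = similarity(SCHEMA_A, SCHEMA_B)
```

Here `addr` scores only 1 point because its name matches but its type differs (struct in A, str in B). The result is still above the threshold, so the two schemas would be treated as the same table.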
23. Other Ways of Creating Tables
Call Glue’s CreateTable API
Create table manually
Run Hive DDL statement
Import from Apache Hive Metastore
(Figure labels: Apache Hive Metastore, AWS Glue ETL, AWS Glue Data Catalog)
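As an illustration of the "Run Hive DDL statement" option, the sketch below renders a `CREATE EXTERNAL TABLE` statement for Parquet data already sitting in S3. The database, table, column names, and bucket path are placeholders, and the `hive_ddl` helper is hypothetical. It does no identifier escaping or validation, so it is a starting point rather than production code.

```python
def hive_ddl(database, table, columns, location, stored_as="PARQUET"):
    """Render a Hive CREATE EXTERNAL TABLE statement.

    `columns` is a list of (name, hive_type) pairs.
    """
    column_sql = ",\n  ".join(f"`{name}` {kind}" for name, kind in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        f")\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{location}'"
    )

# Placeholder names and path, for illustration only.
DDL = hive_ddl(
    database="example_datalake",
    table="events",
    columns=[("device", "string"), ("temp", "double"), ("ts", "bigint")],
    location="s3://example-bucket/events/",
)
print(DDL)
```

The resulting statement could be run from an engine that uses the Glue Data Catalog as its metastore. The table it creates is metadata only and points at the existing files.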
24. Amazon Redshift - Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at any scale: Columnar storage technology to improve I/O efficiency and scale query performance
Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance
25. Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in an S3 data lake
Exabyte Redshift SQL queries against S3
Join data across Redshift and S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned
S3 data lake
Redshift data
Redshift Spectrum query engine
26. Redshift Spectrum
Query your data lake
Amazon Redshift
JDBC/ODBC
1 2 3 4 ... N
Amazon S3
Exabyte-scale object storage
AWS Glue
Data Catalog
Redshift Spectrum
Scale-out serverless compute
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …
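To connect the pieces on this slide, the sketch below renders the two SQL statements typically involved: one that maps a Glue Data Catalog database to an external schema in Redshift, and one that joins an external (S3) table with a local Redshift table. The schema, database, table, column names, and role ARN are all placeholders, and the helper functions are hypothetical. The statements are built as strings only; running them needs a Redshift connection over JDBC/ODBC.

```python
def external_schema_sql(schema, catalog_database, iam_role_arn):
    """Render a Redshift statement exposing a Glue Data Catalog database."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}'"
    )

def join_query(schema):
    """Render a query joining S3-backed events with a local Redshift table."""
    return (
        "SELECT c.segment, COUNT(*) AS events\n"
        f"FROM {schema}.events AS e\n"
        "JOIN public.customers AS c ON c.customer_id = e.customer_id\n"
        "GROUP BY c.segment"
    )

# Placeholder identifiers, for illustration only.
SCHEMA_SQL = external_schema_sql(
    schema="spectrum",
    catalog_database="example_datalake",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
QUERY_SQL = join_query("spectrum")
```

Because the external table lives in S3, the Spectrum layer scans it and returns only the rows needed for the join. The `customers` table is read from local Redshift storage.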
27. Data Lake on Amazon S3 with AWS Glue
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
Amazon QuickSight
AWS Glue ETL
33. Additional Resources
Kinesis Data Generator (KDG): github.com/awslabs/amazon-kinesis-data-generator
Serverless Data Pipeline powered by AWS SAM: github.com/alexcasalboni/serverless-data-pipeline-sam
AWS Big Data Blog: aws.amazon.com/blogs/big-data