From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Workshop - Web Summit 2018

Alex Casalboni
Technical Evangelist, AWS
@alex_casalboni
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

About me
• Software Engineer & Web Developer
• Startupper for 4.5 years
• Serverless Lover & AI Enthusiast
• AWS Customer since 2013

Agenda
1. Data Challenges
2. Columnar Formats
3. Data Lakes vs. Data Warehouses
4. Serverless Analytics
5. Demo time

Data Challenges
Data variety and data volumes are increasing rapidly
Multiple Consumers and Applications
Ingest
Discover
Catalog
Understand
Curate
Find insights

Right tool for the job
Customer Needs Come First

Purpose-Built Analytics on AWS
Collect, Store, Analyze
Amazon Kinesis Firehose
AWS Direct Connect
AWS Snowball
Amazon Kinesis Analytics
Amazon Kinesis Streams
Amazon S3
Amazon Glacier
Amazon CloudSearch
Amazon RDS, Amazon Aurora
Amazon DynamoDB
Amazon Elasticsearch
Amazon EMR
Amazon Redshift
Amazon QuickSight
AWS Database Migration Service
AWS Glue
Amazon Athena
Amazon AI

Open-source standards (Apache)
Parquet, ORC, etc.
Optimize Performance
Optimize Costs
Analytical queries
Columnar data
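
To illustrate why a columnar layout suits analytical queries, here is a minimal pure-Python sketch. It is not Parquet or ORC: the sample records and both helper functions are made up for illustration. It counts how many stored values have to be touched to sum a single field in a row-oriented layout versus a column-oriented one.

```python
# Illustrative only: plain Python structures standing in for file layouts.
rows = [
    {"user_id": 1, "country": "PT", "amount": 10.0},
    {"user_id": 2, "country": "IT", "amount": 25.5},
    {"user_id": 3, "country": "PT", "amount": 4.5},
]

# Column-oriented layout: one contiguous list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_row_oriented(records, field):
    """Sum one field; every value of every record is read along the way."""
    touched = 0
    total = 0.0
    for record in records:
        touched += len(record)  # a row store reads the whole record
        total += record[field]
    return total, touched


def sum_column_oriented(cols, field):
    """Sum one field; only that column's values are read."""
    values = cols[field]
    return sum(values), len(values)


row_total, row_touched = sum_row_oriented(rows, "amount")
col_total, col_touched = sum_column_oriented(columns, "amount")
print(row_total, row_touched)
print(col_total, col_touched)
```

Both layouts give the same total, but the column-oriented read touches one third of the values here. Real columnar formats add per-column compression and encoding on top of this effect.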

Why it matters
Big Data Analytics
Real-time Analytics
Data exploration

Traditional Data Warehouse
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Relational data
Terabytes to Petabytes scale
Schema defined prior to data load
Operational reporting
and ad-hoc analysis

Data Lakes extend traditional warehouses
Relational and non-relational data
Terabytes to Exabytes scale
Schema defined during analysis
(Schema on Read)
Diverse analytical engines to gain insights
Designed for low cost storage and analytics
OLTP ERP CRM LOB
Data Warehouse
Business
Intelligence
Data Lake
Devices Web Sensors Social
Data Catalog
Machine
Learning
DW
Queries
Big data
processing
Interactive Real-time
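
The "Schema on Read" idea mentioned above can be sketched in a few lines of Python. This is a hedged illustration, not how any AWS service implements it: the raw records, the field names, and the `apply_schema` helper are all hypothetical. Data is stored as-is, and the reader decides which fields and types it needs at analysis time.

```python
import json

# Raw events stored as-is (JSON lines); no schema was enforced at write time.
raw_lines = [
    '{"device": "sensor-1", "temp": "21.5", "ts": "2018-11-06T10:00:00"}',
    '{"device": "sensor-2", "temp": "19.0"}',
    '{"device": "sensor-3", "temp": "n/a", "extra": true}',
]

# The schema is chosen by the reader, at analysis time (hypothetical fields).
read_schema = {"device": str, "temp": float}


def apply_schema(line, schema):
    """Project one raw record onto the schema; unparseable values become None."""
    record = json.loads(line)
    projected = {}
    for field, cast in schema.items():
        try:
            projected[field] = cast(record[field])
        except (KeyError, ValueError, TypeError):
            projected[field] = None
    return projected


table = [apply_schema(line, read_schema) for line in raw_lines]
valid_temps = [r["temp"] for r in table if r["temp"] is not None]
average_temp = sum(valid_temps) / len(valid_temps)
print(table)
print(average_temp)
```

A different consumer could read the same raw lines with a different schema, for example one that keeps `ts`, without rewriting the stored data.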

Data Lakes on AWS
Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams
Amazon S3, AWS Glue
Wide variety of ways to bring data in
Durability and availability at Exabyte scale
Security, compliance, and audit capabilities
Run any analytics on the same data without movement
Scale storage and compute independently
Store at $0.023 / GB-month
Query for $0.05 / GB scanned
Redshift, EMR, Athena, Kinesis, Elasticsearch Service

Data Lake Components
Catalog & Search
Access and search metadata
Access & User Interface
Give your users easy and secure access
DynamoDB, Elasticsearch, API Gateway, Identity & Access Management, Cognito
QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS
Central Storage
Secure, cost-effective storage in Amazon S3
S3
Snowball, Database Migration Service, Kinesis Firehose, Direct Connect
Data Ingestion
Get your data into S3 quickly and securely
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding
Security Token Service, CloudWatch, CloudTrail, Key Management Service
Glue ETL

Serverless Analytics
Deliver cost-effective analytic solutions faster
S3 Data Lake
Glue (Data Catalog and ETL)
Redshift Spectrum
QuickSight
Serverless
Zero infrastructure
Zero administration
Pay only for what you use, not for idle resources
Availability and fault tolerance built in
Automatically scales resources with usage
Snowball
Snowmobile
Kinesis Data Firehose
Many other sources
Other BI Tools
Amazon Athena
Amazon EMR

“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner
CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data
Semi-structured
Unstructured

AWS Glue—Serverless Data Catalog & ETL
Data Catalog ETL
Discover data and
extract schema
Auto-generate
customizable code
in Python and Spark
Automatically discovers data and stores schema
Data is immediately searchable
and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless Model

Crawlers: Automatic Schema Inference
enumerate S3 objects (file 1, file 2, …, file N)
identify file type and parse files
custom classifiers: Grok based parser
built-in classifiers: JSON parser, CSV parser, Parquet parser, …
semi-structured per-file schema
semi-structured unified schema
[Figure: per-file schema trees (int, char, bool, array, struct fields) merged into one unified schema]
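
The crawler flow above (parse each file, derive a per-file schema, then merge into a unified schema) can be approximated with a small sketch. This is an assumption-laden illustration, not Glue's actual inference logic: the type names, the `per_file_schema` and `unify` helpers, and the sample files are hypothetical, and only flat JSON records are handled.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a coarse type name."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "null"


def per_file_schema(json_lines):
    """Infer {field: set of types} from the records of one file."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(infer_type(value))
    return schema


def unify(schemas):
    """Union per-file schemas field by field into one schema."""
    unified = {}
    for schema in schemas:
        for field, types in schema.items():
            unified.setdefault(field, set()).update(types)
    return unified


file_1 = ['{"id": 1, "tags": ["a", "b"]}']
file_2 = ['{"id": 2, "active": true}', '{"id": 3, "name": "x"}']
unified = unify([per_file_schema(file_1), per_file_schema(file_2)])
print(unified)
```

Keeping a set of types per field makes conflicts visible (a field seen as both `int` and `string`, for instance) instead of silently picking one.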

IAM Role
Glue Crawler
Data Lakes
Data Warehouse
Databases
Amazon
RDS
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
Built-In Classifiers
MySQL
MariaDB
PostgreSQL
Aurora
Redshift
Avro
Parquet
ORC
XML
JSON & BSON
Logs
(Apache (Grok), Linux(Grok), MS(Grok), Ruby, Redis,
and many others)
Delimited
(comma, pipe, tab, semicolon)
Compressions
(ZIP, BZIP, GZIP, LZ4, Snappy)
What can Crawlers Classify?

Detecting Schema Similarity
Schema A
root
  name: str
  id: num
  addr
    street: str
    city: str
    zip: num
Schema B
root
  name: str
  id: num
  addr: str
Schema similarity heuristic
• 1 point for matching name
• 1 point for matching data type
• Match when similarity index > 0.7
sim = intersection / min(A, B) = 7 / 8 = 0.875
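
The 7 / 8 figure for Schema A and Schema B can be reproduced with a short sketch. The slide does not spell out the counting rules, so the details here are assumptions: every node including `root` is scored, a node is worth at most 2 points (name plus type), and a schema's size is 2 points per node. The `flatten` and `similarity` helpers are hypothetical names.

```python
def flatten(schema, path="root"):
    """Return {path: type} for every node, treating nested dicts as structs."""
    if isinstance(schema, dict):
        nodes = {path: "struct"}
        for name, child in schema.items():
            nodes.update(flatten(child, path + "." + name))
        return nodes
    return {path: schema}


def similarity(schema_a, schema_b):
    """Score shared names and types, normalized by the smaller schema."""
    a, b = flatten(schema_a), flatten(schema_b)
    points = 0
    for path, type_a in a.items():
        if path in b:
            points += 1  # matching name
            if b[path] == type_a:
                points += 1  # matching data type
    # Each node can score at most 2 points (name + type).
    return points / min(2 * len(a), 2 * len(b))


schema_a = {"name": "str", "id": "num",
            "addr": {"street": "str", "city": "str", "zip": "num"}}
schema_b = {"name": "str", "id": "num", "addr": "str"}

sim = similarity(schema_a, schema_b)
print(sim)
```

Under these assumptions `root`, `name`, and `id` score 2 points each and `addr` scores 1 (same name, different type), giving 7 out of the 8 points available in the smaller schema, which is above the 0.7 threshold.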

Available partitions
Automatically Detect Partitions
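
As a rough illustration of partition detection, the sketch below derives partitions from object keys. It assumes a Hive-style `key=value` folder layout, which the slide does not state. The keys and the `parse_partitions` helper are hypothetical, and this is not Glue's own detection logic.

```python
def parse_partitions(key):
    """Extract Hive-style key=value segments from an object key's folders."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


keys = [
    "logs/year=2018/month=11/day=06/part-0000.parquet",
    "logs/year=2018/month=11/day=07/part-0000.parquet",
    "logs/year=2018/month=11/day=07/part-0001.parquet",
]

# Distinct partitions across all keys, as sorted tuples of (name, value) pairs.
available = sorted({tuple(parse_partitions(k).items()) for k in keys})
print(available)
```

Three objects collapse into two available partitions here, since two of the files share the same day folder.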

Automatically update table version as data evolves
Automatic Schema Versioning

Other Ways of Creating Tables
Call Glue’s CreateTable API
Create table manually Run Hive DDL statement
Apache Hive
Metastore
AWS GLUE ETL AWS GLUE
DATA CATALOG
Import from Apache Hive Metastore
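
For the CreateTable route, a minimal sketch with boto3 might look like the following. The table name, database name, bucket, and columns are hypothetical, and `build_table_input` is an illustrative helper rather than part of any SDK. The actual API call is left commented out because it needs boto3 and AWS credentials.

```python
def build_table_input(name, location, columns, partition_keys=()):
    """Build a TableInput dict for an external Parquet table on S3."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def create_table(database, table_input):
    """Call Glue's CreateTable; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    return glue.create_table(DatabaseName=database, TableInput=table_input)


table_input = build_table_input(
    name="events",  # hypothetical table name
    location="s3://example-bucket/events/",  # hypothetical bucket
    columns=[("device", "string"), ("temp", "double")],
    partition_keys=[("year", "string"), ("month", "string")],
)
print(table_input["StorageDescriptor"]["Columns"])
# create_table("example_db", table_input)  # uncomment to run against AWS
```

Separating the request payload from the call keeps the table definition easy to review and test before anything is written to the Data Catalog.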

Amazon Redshift - Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at any scale
Columnar storage technology to improve I/O efficiency and scale query performance
Inexpensive
As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
Open file formats
Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Secure
Audit everything; encrypt data end-to-end; extensive certification and compliance

Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in an S3 data lake
Exabyte Redshift SQL queries against S3
Join data across Redshift and S3
Scale compute and storage separately
Stable query performance and unlimited
concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned
S3 data lake
Redshift data
Redshift Spectrum
query engine

Redshift Spectrum
Query your data lake
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
AWS Glue
Data Catalog
Redshift Spectrum
Scale-out serverless compute
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …

Data Lake on Amazon S3 with AWS Glue
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
AMAZON
QUICKSIGHT
AWS GLUE ETL

Uncompressed
Compressed (-94%)
Parquet (-70%)
Partitioned (-70%)
Overall 99.5% improvement!
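
The overall figure is consistent with the three steps compounding, which a quick calculation can check. This assumes each percentage is measured relative to the previous step; the slide does not say what quantity is being measured, and the `overall_reduction` helper is illustrative.

```python
def overall_reduction(step_reductions):
    """Compound successive fractional reductions into one overall reduction."""
    remaining = 1.0
    for reduction in step_reductions:
        remaining *= 1.0 - reduction
    return 1.0 - remaining


# Reductions quoted above: compressed -94%, Parquet -70%, partitioned -70%.
overall = overall_reduction([0.94, 0.70, 0.70])
print(f"{overall:.1%}")
```

What remains after the three steps is 0.06 × 0.30 × 0.30 = 0.0054 of the original, so the overall reduction is about 99.5%.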

Uncompressed
Compressed (-94%)
Parquet (-100%)
Partitioned (-100%)
Overall 100% improvement!

Uncompressed
Compressed (-94%)
Parquet (-72%)
Partitioned (-100%)
Overall 100% improvement!

Additional Resources
Kinesis Data Generator (KDG)
github.com/awslabs/amazon-kinesis-data-generator
Serverless Data Pipeline powered by AWS SAM
github.com/alexcasalboni/serverless-data-pipeline-sam
AWS Big Data Blog
aws.amazon.com/blogs/big-data

Did We Scan Your Badge?
Remember to opt-in to AWS
communications and you will receive a
post-event email with a link to:
• AWS Developer Workshop Slides
• $200 in AWS Credits
