Get Value From Your Data
Danilo Poccia
AWS Technical Evangelist
@danilop
danilop
3 HOURS

FOR $4828.85/hr
Instead of 

$20+ MILLIONS

in infrastructure
ON A SINGLE INSTANCE
COMPUTE TIME: 4h

COST: 4h x $2.1 = $8.4
ON MULTIPLE INSTANCES
COMPUTE TIME: 1h

COST: 1h x 4 x $2.1 = $8.4
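The arithmetic on the two slides above can be checked in a few lines. This is a minimal sketch: the function name `compute_cost` is hypothetical, and it assumes, as the slides do, that the job parallelizes perfectly and that billing is linear in instance-hours.

```python
def compute_cost(hours: float, instances: int, hourly_rate: float) -> float:
    """Total cost for `instances` machines that each run for `hours`."""
    return round(hours * instances * hourly_rate, 2)


HOURLY_RATE = 2.1  # USD per instance-hour, as shown on the slide

single = compute_cost(hours=4, instances=1, hourly_rate=HOURLY_RATE)
parallel = compute_cost(hours=1, instances=4, hourly_rate=HOURLY_RATE)

# Same bill, but the parallel run finishes in a quarter of the time.
print(single, parallel)
```

Under these assumptions the cost is identical, so the only thing that changes is how long you wait for the answer.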
Data Analytics
Value > Costs
Storage and
Analysis Costs

are Going Down
Making

New Use Cases
Possible
+ Elastic and Highly Scalable
+ No Upfront Capital Expense
+ Only Pay for What You Use

+ Available On-Demand
= Remove Constraints
Structured
Vs
Unstructured
Data
High Degree

of Organization
Data Model
Free Text
Multimedia
Social Media
Structured
Semi-structured
Unstructured
Data
XML
JSON
Batch
Vs
Real-time
Data
Fixed
Dataset
Updated in

Discrete Moments
Continuous

Stream of Data
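A small sketch of the difference, assuming a simple average as the analysis: the batch version needs the whole fixed dataset before it answers, while the streaming version updates its answer as each record arrives. Function names and sample values are illustrative only.

```python
from typing import Iterable, Iterator, List


def batch_average(dataset: List[float]) -> float:
    """Batch: the whole dataset is available before processing starts."""
    return sum(dataset) / len(dataset)


def streaming_average(stream: Iterable[float]) -> Iterator[float]:
    """Real-time: emit an updated result after every incoming record."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]  # made-up sample values
batch_result = batch_average(readings)
stream_results = list(streaming_average(readings))

print(batch_result)     # one answer for the fixed dataset
print(stream_results)   # one answer per record seen so far
```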
Batch
Report
Real-time
Alerts
Prediction
Forecast
Past Present Future
Unstructured

Data
?
Unstructured

Data
Structured

Data
Resilient Distributed Datasets (RDDs)
Memory
Fast Processing
Large Quantity of Data
Disk
Hadoop
MapReduce
Spark
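To make the MapReduce model concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It only illustrates the programming model; it is not Hadoop or Spark code, and all function names are hypothetical. In a real cluster the intermediate pairs would be spread across machines, written to disk by Hadoop MapReduce or kept in memory where possible by Spark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: List[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data on AWS", "data in memory", "data on disk"])
print(counts)
```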
?
Amazon

Elastic MapReduce

(Amazon EMR)
Unstructured

Data
Structured

Data
Amazon

Elastic MapReduce

(Amazon EMR)
Structured

Data
Unstructured

Data
Structured

Data
Amazon

Elastic MapReduce

(Amazon EMR)
Managed clusters
For Hadoop, Spark, Presto

or any other applications

in the Apache / Hadoop stack
What is

Amazon EMR?
Amazon

Elastic MapReduce

(Amazon EMR)
Overview of

Amazon EMR

Architecture
Storage
HDFS EMRFS
Local

File System
Data Processing Frameworks
Hadoop Spark …
Applications and Programs
Hive Pig …
Cluster Resource Management
YARN Agent …
Amazon

Elastic MapReduce

(Amazon EMR)
Overview of

Amazon EMR

Architecture
Master

Instance

Group
Core

Instance

Group
Task

Instance

Group
EC2 Spot
Instances
Separate Compute and Storage
Resize and shut down

Amazon EMR clusters with no data loss
Point multiple Amazon EMR clusters

at the same data in Amazon S3
Easily evolve your analytic infrastructure

as technology evolves
Leverage

Amazon S3 with 

EMR File System
(EMRFS)
S3 Bucket
Cluster
EMR Cluster
Cluster
EMR Cluster
Amazon

Elastic MapReduce

(Amazon EMR)
Read-after-write consistency

Very fast list operations

(thanks to Amazon DynamoDB)
Transparent to applications as s3://…
S3 Bucket
Cluster
EMR Cluster
DynamoDB Table
Amazon

Elastic MapReduce

(Amazon EMR)
EMRFS

makes it easier

to use Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'some/path/input/'
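The idea behind a regex-based SerDe is that each capture group of a regular expression fills one table column. The sketch below imitates that in plain Python. The pattern and the sample log line are hypothetical (the slide does not show the regex the table uses), and returning `None` for a non-matching line is a choice of this sketch, not a statement about Hive's behavior.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: one capture group per column, in column order.
LOG_PATTERN = re.compile(r'^(\S+) "([^"]*)" "([^"]*)"$')
COLUMNS = ("host", "referer", "agent")


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Turn one raw log line into a row keyed by column name."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # this sketch simply skips lines that do not match
    return dict(zip(COLUMNS, match.groups()))


row = parse_line('192.0.2.10 "http://example.com/start" "Mozilla/5.0"')
print(row)
```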
S3 Bucket
Cluster
EMR Cluster
DynamoDB Table
Amazon

Elastic MapReduce

(Amazon EMR)
Going

from HDFS

…
S3 Bucket
Cluster
EMR Cluster
DynamoDB Table
Amazon

Elastic MapReduce

(Amazon EMR)
Going

from HDFS

to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://bucket/path/input/'
Amazon

Elastic MapReduce

(Amazon EMR)
EMRFS client-side
encryption
S3 Bucket
Cluster
EMR Cluster
Cluster
EMR Cluster
AWS KMS or your custom key vendor
Amazon S3 encryption clients
EMRFS enabled for

Amazon S3 client-side encryption
Amazon S3 is your Data Lake
S3 Bucket
Cluster
Hive, Pig
Cluster
Presto
Cluster
Spark
Cluster
Ad Hoc
Cluster
Cascading
Logical Separation of Jobs
CASE STUDY:
SPOTIFY ADDS 20,000 TRACKS/DAY TO ITS CATALOGUE
Amazon

Elastic MapReduce

(Amazon EMR)
Structured

Data
Unstructured

Data
Structured

Data
<demo>
...
</demo>
A managed service that makes it easy

to deploy, operate, and scale Elasticsearch

in the AWS Cloud
High availability, patch management, failure detection

and node replacement, backups, and monitoring
Integrated with Logstash and Kibana
Scale up and scale down your cluster to deliver optimum
performance as data and usage patterns change, paying
only for the resources you actually consume
Control access to the Elasticsearch APIs using

AWS Identity and Access Management (IAM) policies
What is

Amazon ES?
Amazon

Elasticsearch Service

(Amazon ES)
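As an illustration of controlling access to the Elasticsearch APIs with IAM, the sketch below builds an identity-based policy document that allows only HTTP GET and HEAD requests against one domain. The account ID, Region, and domain name are hypothetical placeholders; treat the policy as a starting point to adapt, not a recommended configuration.

```python
import json

# Hypothetical account ID, Region, and domain name, for illustration only.
DOMAIN_ARN = "arn:aws:es:us-east-1:123456789012:domain/example-domain"

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # IAM actions that map to HTTP GET and HEAD calls on the domain.
            "Action": ["es:ESHttpGet", "es:ESHttpHead"],
            "Resource": f"{DOMAIN_ARN}/*",
        }
    ],
}

policy_json = json.dumps(read_only_policy, indent=2)
print(policy_json)
```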
Amazon ES

Architecture
Amazon

Elasticsearch Service

(Amazon ES)
Elasticsearch
Kibana
Amazon

CloudWatch
AWS

CloudTrail
Elastic

Load Balancing
Amazon

Route 53
Elasticsearch

APIs
AWS Credentials

(AWS IAM)
Structured

Data
?
Structured

Data
Information
Amazon

Redshift
Structured

Data
Information
Relational Data Warehouse
a lot faster
a lot simpler
a lot cheaper


Massively parallel + Petabyte scale

Fully managed

HDD and SSD Platforms

$1,000/TB/Year; starts at $0.25/hour
What is

Amazon Redshift?
Amazon

Redshift
Amazon Redshift
Architecture
Amazon

Redshift
Compute

Node
Compute

Node
Compute

Node
Leader

Node
SQL Clients / BI Tools
Amazon S3 / Amazon DynamoDB / SSH
10GbE
Ingestion/Backup
JDBC / ODBC
Dramatically less I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
Amazon Redshift
Performance
Amazon

Redshift
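Why does column storage mean less I/O? A toy comparison, assuming a query that sums a single column: with a row layout every field of every row is read, with a column layout only the one column is. The table contents and function names are made up for illustration, and "values read" is only a stand-in for real disk I/O.

```python
from typing import Dict, List, Tuple

# Made-up sample rows: (id, category, price).
ROWS: List[Tuple[int, str, float]] = [
    (1, "concert", 30.0),
    (2, "theatre", 45.0),
    (3, "concert", 25.0),
    (4, "sports", 60.0),
]
COLUMN_NAMES = ("id", "category", "price")


def to_columns(rows: List[Tuple[int, str, float]]) -> Dict[str, list]:
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}


def sum_price_row_store(rows: List[Tuple[int, str, float]]) -> Tuple[float, int]:
    """Row layout: every field of every row is read to reach `price`."""
    values_read = sum(len(row) for row in rows)
    return sum(row[2] for row in rows), values_read


def sum_price_column_store(columns: Dict[str, list]) -> Tuple[float, int]:
    """Column layout: only the `price` column is read."""
    prices = columns["price"]
    return sum(prices), len(prices)


columns = to_columns(ROWS)
row_total, row_reads = sum_price_row_store(ROWS)
col_total, col_reads = sum_price_column_store(columns)
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts return the same total; the column layout touches a third of the values in this three-column example, and the gap widens as tables get wider.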
analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
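Two of the encodings named in the output above can be sketched in simplified form. Delta encoding keeps the first value and then only the differences between neighbors, which pays off for nearly sequential IDs. Dictionary encoding replaces repeated values with small indexes into a list of distinct values. This is an illustration of the ideas only, with made-up sample data; it does not reproduce Redshift's on-disk formats.

```python
from typing import Dict, List, Tuple


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded: List[int]) -> List[int]:
    """Rebuild the original values by accumulating the differences."""
    values: List[int] = []
    for item in encoded:
        values.append(item if not values else values[-1] + item)
    return values


def dict_encode(values: List[int]) -> Tuple[List[int], List[int]]:
    """Replace each value with an index into a list of distinct values."""
    dictionary: List[int] = []
    index: Dict[int, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


list_ids = [1000, 1001, 1002, 1005, 1006]  # made-up, mostly sequential IDs
num_tickets = [2, 4, 2, 2, 8]              # made-up, few distinct values

deltas = delta_encode(list_ids)
dictionary, codes = dict_encode(num_tickets)
print(deltas)
print(dictionary, codes)
```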
Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (MIN: 10, MAX: 324)
Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (MIN: 375, MAX: 623)
Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (MIN: 637, MAX: 959)
Sort Keys

and

Zone Maps
Amazon

Redshift
SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013'
Unsorted
MIN: 01-JUNE-2013, MAX: 20-JUNE-2013
MIN: 08-JUNE-2013, MAX: 30-JUNE-2013
MIN: 12-JUNE-2013, MAX: 20-JUNE-2013
MIN: 02-JUNE-2013, MAX: 25-JUNE-2013
Sorted by Date
MIN: 01-JUNE-2013, MAX: 06-JUNE-2013
MIN: 07-JUNE-2013, MAX: 12-JUNE-2013
MIN: 13-JUNE-2013, MAX: 18-JUNE-2013
MIN: 19-JUNE-2013, MAX: 24-JUNE-2013
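The MIN/MAX values above are enough to work out which blocks the query for 09-JUNE-2013 has to read. A minimal sketch, assuming each MIN/MAX pair describes one block and that a block can be skipped whenever the target date falls outside its range; the names `blocks_to_scan`, `UNSORTED`, and `SORTED` are illustrative.

```python
from datetime import date
from typing import List, Tuple

Zone = Tuple[date, date]  # (min, max) of the date column in one block


def june(day: int) -> date:
    return date(2013, 6, day)


# MIN/MAX pairs copied from the slide.
UNSORTED: List[Zone] = [(june(1), june(20)), (june(8), june(30)),
                        (june(12), june(20)), (june(2), june(25))]
SORTED: List[Zone] = [(june(1), june(6)), (june(7), june(12)),
                      (june(13), june(18)), (june(19), june(24))]


def blocks_to_scan(zones: List[Zone], target: date) -> List[int]:
    """Indexes of the blocks whose [min, max] range can contain `target`."""
    return [i for i, (low, high) in enumerate(zones) if low <= target <= high]


target = june(9)
unsorted_hits = blocks_to_scan(UNSORTED, target)
sorted_hits = blocks_to_scan(SORTED, target)
print(unsorted_hits)  # three of four blocks must be read
print(sorted_hits)    # a single block must be read
```

With the data sorted by date, the ranges do not overlap, so the zone map rules out every block except one.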
Parallel
and
Distributed
Amazon

Redshift
Compute

Node
Compute

Node
Compute

Node
Leader

Node
SQL Clients / BI Tools
Amazon S3 / Amazon DynamoDB / SSH
Query
Load / Export / Backup / Restore
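A rough sketch of the leader/compute split: rows are spread across compute nodes by hashing a key, each node counts its own share, and the leader adds up the partial results. The CRC32 hash, the key format, and the function names are assumptions made for this illustration; they are not a description of how Amazon Redshift distributes data internally.

```python
import zlib
from typing import Dict, List


def node_for(key: str, node_count: int) -> int:
    """Pick a compute node from a stable hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(keys: List[str], node_count: int) -> Dict[int, List[str]]:
    """Assign every row (represented by its key) to one compute node."""
    nodes: Dict[int, List[str]] = {n: [] for n in range(node_count)}
    for key in keys:
        nodes[node_for(key, node_count)].append(key)
    return nodes


def leader_count(nodes: Dict[int, List[str]]) -> int:
    """Each node counts its own rows; the leader sums the partial counts."""
    partial_counts = [len(rows) for rows in nodes.values()]
    return sum(partial_counts)


keys = [f"user-{i}" for i in range(1000)]      # made-up distribution keys
three_nodes = distribute(keys, node_count=3)
four_nodes = distribute(keys, node_count=4)    # same data on a larger cluster

print(leader_count(three_nodes), leader_count(four_nodes))
```

The query result is the same on both cluster sizes; what changes is how much work each node has to do.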
Parallel
and
Distributed
Amazon

Redshift
Compute

Node
Compute

Node
Compute

Node
Leader

Node
SQL Clients / BI Tools
Amazon S3 / Amazon DynamoDB / SSH
Compute

Node
Query
Load / Export / Backup / Restore
Resize
Amazon Redshift
Innovation
Amazon

Redshift
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress
(8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed
Tables, Audit Logging/CloudTrail,
Concurrency, Resize Perf., Approximate
Count Distinct, SNS Alerts, Cross Region
Backup (11/13)
Distributed Tables, Single Node Cursor
Support, Maximum Connections to 500
(12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and
diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch
size support for single node clusters, new
system tables with commit stats,
row_number(), strtol() and query
termination (2/13)
Resize progress indicator & Cluster
Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE
ciphers (4/22)
3 new regex features, Unload to single
file, FedRAMP (5/6)
Rename Cluster (6/2)
Copy from multiple regions,
percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)
Well over 100 new features added since launch
Release every two weeks
Automatic patching
Amazon Redshift
Features
Amazon

Redshift
Approximate functions
User defined functions
Machine Learning
Data Science
Amazon ML
Amazon Redshift
Ecosystem
Amazon

Redshift
Data Integration
Systems Integrators
Business Intelligence
30 MINUTES 

DOWN TO

12 SECONDS
Amazon

Redshift
Structured

Data
Information
Real-time

Data
?
Data Stream
Real-time

Information
Amazon

Kinesis
Data Stream
Real-time

Information
A Platform for Streaming Data on AWS
What is

Amazon Kinesis?
Amazon

Kinesis
Amazon

Kinesis

Streams
Amazon

Kinesis

Firehose
Amazon

Kinesis

Analytics
Amazon

Kinesis

Streams
Amazon

Kinesis
Build your own custom applications

that process or analyze streaming data
Amazon

Kinesis

Streams
Amazon

Kinesis
Use the Kinesis Client Library (KCL)

to consume data from Kinesis Streams
Amazon

Kinesis

Streams
Amazon

Kinesis
AWS Lambda

Functions
Use AWS Lambda for a serverless architecture
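A minimal sketch of a Lambda function that consumes a batch of Kinesis records. Record payloads arrive base64-encoded under `Records[].kinesis.data` in the event; the assumption that producers send JSON, the sample payload, and the return value are illustrative choices of this sketch.

```python
import base64
import json
from typing import Any, Dict, List


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Decode each Kinesis record in the batch and count those processed."""
    processed: List[dict] = []
    for record in event["Records"]:
        # The record payload is base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        processed.append(json.loads(payload))  # assumes producers send JSON
    return {"processed": len(processed)}


# Local usage example with a hand-built event (no AWS access needed).
sample = {"sensor": "s1", "value": 42}
encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
fake_event = {"Records": [{"kinesis": {"partitionKey": "s1", "data": encoded}}]}
result = handler(fake_event, None)
print(result)
```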
Amazon

Kinesis

Streams
Amazon

Kinesis
Low latency I/O
Configurable retention period from 1 to 7 days
The maximum size of a data blob is 1 MB
Each shard can support:
up to 1,000 records / second and
up to 1 MB / second for writes
up to 5 transactions / second and
up to 2 MB / second for reads
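The per-shard limits above translate directly into a sizing rule: take whichever limit needs the most shards. A minimal sketch, with a hypothetical function name and a made-up workload; it leaves the 5 read transactions / second limit out for simplicity.

```python
import math

# Per-shard limits as stated on the slide.
WRITE_RECORDS_PER_SEC = 1000
WRITE_MB_PER_SEC = 1.0
READ_MB_PER_SEC = 2.0


def shards_needed(records_per_sec: float, write_mb_per_sec: float,
                  read_mb_per_sec: float) -> int:
    """Smallest shard count that satisfies every per-shard limit."""
    return max(
        1,
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
    )


# Made-up workload: 3,500 records/s, 2.5 MB/s written, 9 MB/s read in total.
shards = shards_needed(records_per_sec=3500, write_mb_per_sec=2.5,
                       read_mb_per_sec=9.0)
print(shards)
```

In this example the read throughput is the bottleneck, which is typical when several consumers read the same stream.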
Amazon

Kinesis

Firehose
Amazon

Kinesis
Easily load massive volumes

of streaming data into AWS
Amazon

Kinesis

Analytics
Amazon

Kinesis
Easily analyze streaming data with standard SQL
(Coming Soon)
Amazon

Kinesis
Data Stream
Real-time

Information
Learning

from Data
?
Data Model
Amazon

Machine Learning

(Amazon ML)
Data Model
Machine learning is the technology that automatically
finds patterns in your data and uses them to make
predictions for new data points as they become
available
Your Data + Machine Learning

= Smart Applications
What is

Machine Learning?
Amazon

Machine Learning

(Amazon ML)
Designed for Developers
No Machine Learning skills are required
Batch prediction
Real-time predictions
Can be used by other applications via APIs
What can you do?
Amazon

Machine Learning

(Amazon ML)
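To show what "find patterns in your data and use them to make predictions" means in the simplest case, here is a toy model: a straight line fitted with ordinary least squares, then used for a single (real-time style) prediction and a batch of predictions. This is plain Python for illustration, with made-up training data; it is not the Amazon ML API, and the function names are hypothetical.

```python
from typing import List, Tuple

Model = Tuple[float, float]  # (slope, intercept)


def fit_line(xs: List[float], ys: List[float]) -> Model:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict(model: Model, x: float) -> float:
    """Single prediction for one new data point."""
    slope, intercept = model
    return slope * x + intercept


def batch_predict(model: Model, xs: List[float]) -> List[float]:
    """Predictions for a whole list of data points at once."""
    return [predict(model, x) for x in xs]


# Made-up training data that follows y = 2x + 1.
model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
single = predict(model, 10.0)
batch = batch_predict(model, [5.0, 6.0])
print(model, single, batch)
```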
Amazon.com
1994
BEST PRACTICES
&
LESSONS LEARNED
BEST PRACTICES
USE ALL 

AVAILABLE DATA

Your company has more data on
your users than you think…
Quiz
What percentage of data 

do firms use for analytics?
A: 12% C: 52%
B: 34% D: 68%
BEST PRACTICES
ENRICH DATA BASED

ON SOCIAL NETWORKS

Users’ friends are valuable

sources of information
75% of users select movies based
on recommendations
Amazon

Machine Learning

(Amazon ML)
Data Model
Data

Orchestration & Visualization
Data Orchestration can be a Task by Itself
S3 Bucket
Cluster
EMR Cluster
DynamoDB Table
Redshift DB
RDS Instance
S3 Bucket
On
Premises
Helps you reliably process and move data between
different AWS compute and storage services, as well
as on-premises data sources, at specified intervals
What is AWS

Data Pipeline?
AWS

Data Pipeline
Access your data where it’s stored, transform and
process it at scale, and efficiently transfer the results
to other AWS services
What is AWS

Data Pipeline?
AWS

Data Pipeline
Helps you migrate databases to AWS easily and
securely: the source database remains fully
operational during the migration, minimizing
downtime to applications that rely on the database
What is

AWS Database

Migration Service?
AWS Database

Migration Service
Customer
Premises
Application Users
AWS
Internet
VPN
AWS
Database Migration
Service
Migrate off Oracle and SQL Server
Move your tables, views, stored procedures and DML
to MySQL, MariaDB, and Amazon Aurora
AWS Schema
Conversion Tool
AWS Database

Migration Service
Know exactly where manual edits are needed
AWS Schema
Conversion Tool
AWS Database

Migration Service
?
Structured

Data
Visual
AWS Marketplace
Structured

Data
Visual
https://aws.amazon.com/marketplace
Amazon

QuickSight
Structured

Data
Visual
A very fast, cloud-powered business intelligence (BI)
service that makes it easy to build visualizations,
perform ad-hoc analysis, and quickly get business
insights from your data
What is Amazon
QuickSight?
Amazon

QuickSight
First analysis
in about
60 seconds
Amazon

QuickSight
Business user
Sign-in
Amazon

QuickSight
Architecture
Amazon

QuickSight
Business User
QuickSight API
Data Prep Metadata Suggestions Connectors SPICE
Business User
QuickSight UI
Mobile Devices Web Browsers
Partner BI products
Amazon
S3
Amazon
Kinesis
Amazon
DynamoDB
Amazon
EMR
Amazon
Redshift
Amazon RDS
Files
Third-party
Point to a Data Source
Visualize in Minutes
Smart Visualizations
Dynamically Optimized Graphics
Get Answers

Fast
Amazon

QuickSight
Amazon QuickSight uses SPICE – a Super-fast,
Parallel, In-memory optimized Calculation Engine
built from the ground up to generate answers on
large datasets
Use AWS Partner

BI Solutions
with Amazon
QuickSight
Amazon

QuickSight
Amazon QuickSight provides partners

a simple SQL-like interface to query the data stored
in SPICE, so that customers can continue using their
existing BI tools while benefiting from the faster
performance delivered by SPICE
Amazon

QuickSight
Structured

Data
Visual
Collect Store Analyze
AWS Direct

Connect
AWS

Import/Export

Disk
AWS

Import/Export

Snowball
Amazon

Kinesis

Streams
Amazon VPC

VPN Connection
AWS Database

Migration Service
AWS

Data Pipeline
Amazon

Kinesis

Firehose
Amazon

Kinesis

Analytics
AWS Storage

Gateway
Amazon S3
Amazon

Glacier
Amazon RDS
Amazon

Redshift
Amazon

Elasticsearch

Service
Amazon

DynamoDB
Amazon EMR
Amazon EC2
Amazon EC2
Container Service
Amazon ML
Amazon

QuickSight
Start Simple
Amazon S3 + Amazon EMR
or
Amazon S3 + Amazon Redshift
Grow As You Need
Pay Only For What You Use
RAW DATA
BUSINESS
INTELLIGENCE
RAW
INFORMATION
DATA
PREDICTIONS