
AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Data processing and analysis are where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this webinar, gain a thorough understanding of AWS solutions for BI and analytics, and learn architectural best practices for applying those solutions to your projects. The session also includes a discussion of popular use cases and reference architectures.

In this webinar, you will learn:
• Options for processing, analyzing, and visualizing data
• Optimal approaches for stream processing, batch processing, and interactive analytics
• Leveraging Amazon EMR, Amazon Redshift, and Amazon Machine Learning for data processing and analysis
• Overview of BI-enabling partner solutions

Who Should Attend
• Developers and architects who want to gain insight into big data solutions and learn best practices



  1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Matt Yanchyshyn, Sr. Manager, Solutions Architecture. June 17th, 2015. AWS Deep Dive: Big Data Analytics and Business Intelligence
  2. Analytics and BI on AWS: Collect → Store → Process → Analyze. Data collection and storage: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora). Event processing: AWS Lambda, KCL apps. Data processing and analysis: Amazon EMR, Amazon Redshift, Amazon Machine Learning.
  3. Batch processing: GBs of logs are pushed to Amazon S3 hourly; a daily Amazon EMR cluster uses Hive to process the data; input and output are stored in Amazon S3; a subset is loaded into Amazon Redshift.
  4. Reporting: Amazon S3 log bucket → Amazon EMR (structured log data) → Amazon Redshift → operational reports.
  5. Streaming data processing: TBs of logs are sent daily and stored in Amazon Kinesis; they are processed by Amazon Kinesis Client Library apps, AWS Lambda, Amazon EMR, or Amazon EC2.
  6. Interactive query: TBs of logs are sent daily and stored in Amazon S3; multiple Amazon EMR clusters query the data through a shared Hive metastore on Amazon EMR.
  7. Batch predictions: starting from structured data in Amazon Redshift, your application queries for predictions with the Amazon ML batch API; the predictions land in S3, and you either load them into Amazon Redshift or read the prediction results directly from S3.
  8. Real-time predictions: your application stores data in Amazon DynamoDB, triggers an event with AWS Lambda, and queries for predictions with the Amazon ML real-time API.
  9. Amazon Machine Learning
  10. Amazon Machine Learning: an easy-to-use, managed machine learning service built for developers. Create models using data stored in AWS; deploy models to production in seconds.
  11. Powerful machine learning technology. Based on Amazon's battle-hardened internal systems. Not just the algorithms: smart data transformations, input data and model quality alerts, built-in industry best practices. Grows with your needs: train on up to 100 GB of data, generate billions of predictions, obtain predictions in batches or in real time.
  12. Pay-as-you-go and inexpensive. Data analysis, model training, and evaluation: $0.42 per instance-hour. Batch predictions: $0.10 per 1,000. Real-time predictions: $0.10 per 1,000, plus an hourly capacity reservation charge.
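As a quick sanity check on the rates above, the cost of a hypothetical workload can be worked out directly; the training hours and prediction volume below are made-up figures for illustration only.

```python
# Cost sketch using the Amazon ML rates quoted on this slide.
# The workload figures (hours, prediction count) are hypothetical.
TRAIN_RATE = 0.42              # $ per instance-hour: analysis, training, evaluation
BATCH_RATE = 0.10 / 1000       # $ per batch prediction (billed per 1,000)

training_hours = 5             # hypothetical training/evaluation time
batch_predictions = 2_000_000  # hypothetical monthly prediction volume

total_cost = training_hours * TRAIN_RATE + batch_predictions * BATCH_RATE
print(f"${total_cost:.2f}")    # $202.10 = $2.10 training + $200.00 predictions
```

Note how the per-prediction cost dominates at volume, which is why the batch API bills in blocks of 1,000.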
  13. Building smart applications with Amazon ML: 1. Build and train a model; 2. Evaluate and optimize; 3. Retrieve predictions.
  14. Create a Datasource object
  15. Create a Datasource object
  >>> import boto
  >>> ml = boto.connect_machinelearning()
  >>> ds = ml.create_data_source_from_s3(
  ...     data_source_id='my_datasource',
  ...     data_spec={
  ...         'DataLocationS3': 's3://bucket/input/',
  ...         'DataSchemaLocationS3': 's3://bucket/input/.schema'},
  ...     compute_statistics=True)
  16. Explore and understand your data
  17. Train your model
  18. Train your model
  >>> import boto
  >>> ml = boto.connect_machinelearning()
  >>> model = ml.create_ml_model(
  ...     ml_model_id='my_model',
  ...     ml_model_type='REGRESSION',
  ...     training_data_source_id='my_datasource')
  19. Building smart applications with Amazon ML: 1. Build and train a model; 2. Evaluate and optimize; 3. Retrieve predictions.
  20. Explore model quality
  21. Fine-tune model interpretation
  22. Building smart applications with Amazon ML: 1. Build and train a model; 2. Evaluate and optimize; 3. Retrieve predictions.
  23. Batch predictions: asynchronous, large-volume prediction generation. Request through the service console or API. Best for applications that deal with batches of data records.
  >>> import boto
  >>> ml = boto.connect_machinelearning()
  >>> bp = ml.create_batch_prediction(
  ...     batch_prediction_id='my_batch_prediction',
  ...     batch_prediction_data_source_id='my_datasource',
  ...     ml_model_id='my_model',
  ...     output_uri='s3://examplebucket/output/')
  24. Real-time predictions: synchronous, low-latency, high-throughput prediction generation. Request through the service API or the server or mobile SDKs. Best for interactive applications that deal with individual data records.
  >>> import boto
  >>> ml = boto.connect_machinelearning()
  >>> ml.predict(
  ...     ml_model_id='my_model',
  ...     predict_endpoint='example_endpoint',
  ...     record={'key1': 'value1', 'key2': 'value2'})
  {'Prediction': {'predictedValue': 13.284348,
                  'details': {'Algorithm': 'SGD',
                              'PredictiveModelType': 'REGRESSION'}}}
  25. Amazon Elastic MapReduce (EMR)
  26. Why Amazon EMR? Easy to use: launch a cluster in minutes. Low cost: pay an hourly rate. Elastic: easily add or remove capacity. Reliable: spend less time monitoring. Secure: manage firewalls. Flexible: control the cluster.
  27. The Hadoop ecosystem can run in Amazon EMR
  28. Choose your instance types, and try different configurations to find your optimal architecture. CPU: c3 family, cc1.4xlarge, cc2.8xlarge. Memory: m2 family, r3 family. Disk/IO: d2 family, i2 family. General: m1 family, m3 family. Match the family to the workload: batch processing, machine learning, Spark and interactive analysis, large HDFS.
  29. Resizable clusters: easily add or remove compute capacity, matching compute demand with cluster sizing.
  30. Easy-to-use Spot Instances. Spot Instances for task nodes: up to 90% off Amazon EC2 On-Demand pricing; exceed your SLA at lower cost. On-Demand for core nodes: standard Amazon EC2 On-Demand pricing; meet your SLA at predictable cost.
  31. Amazon S3 as your persistent data store: separate compute and storage; resize and shut down Amazon EMR clusters with no data loss; point multiple Amazon EMR clusters at the same data in Amazon S3.
  32. EMRFS makes it easier to use Amazon S3: read-after-write consistency; very fast list operations; error-handling options; support for Amazon S3 encryption; transparent to applications (s3:// paths).
  33. EMRFS client-side encryption: EMRFS enabled for Amazon S3 client-side encryption reads and writes client-side-encrypted objects in Amazon S3, with keys supplied by a key vendor (AWS KMS or your custom key vendor).
  34. HDFS is still there if you need it: iterative workloads (if you're processing the same dataset more than once) and disk-I/O-intensive workloads. Persist data on Amazon S3 and use S3DistCp to copy it to and from HDFS for processing.
  35. Amazon Redshift
  36. Amazon Redshift architecture. Leader node: SQL endpoint (JDBC/ODBC); stores metadata; coordinates query execution. Compute nodes: execute queries in parallel; node types to match your workload (Dense Storage DS2 or Dense Compute DC1); divided into multiple slices; local, columnar storage; 10 GigE (HPC) interconnect. Ingestion and backup/restore flow between the customer VPC and an internal VPC.
  37. Amazon Redshift: column storage, data compression, zone maps, direct-attached storage. With column storage, you only read the data you need. Example table:

      ID   Age  State  Amount
      123  20   CA     500
      345  25   WA     250
      678  40   FL     125
      957  37   WA     375
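The column-storage claim can be sketched with a toy model in Python (this illustrates the idea only, not Redshift internals): summing the Amount column touches just that column's 4 values in a columnar layout, rather than all 16 fields of the row layout.

```python
# Toy row store vs. column store, using the slide's sample table.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

# Column store: each column is stored contiguously, so a query over
# Amount reads only that column's 4 values, not all 16 fields.
columns = {
    "id":     [r[0] for r in rows],
    "age":    [r[1] for r in rows],
    "state":  [r[2] for r in rows],
    "amount": [r[3] for r in rows],
}

total_amount = sum(columns["amount"])
print(total_amount)  # 1250
```

The same layout is also what makes the per-column compression encodings on the next slide possible: values in one column are similar, so they compress far better together than mixed row data would.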
  38. Amazon Redshift data compression: COPY compresses automatically; you can analyze and override; more performance, less cost.

      analyze compression listing;

      Table   | Column         | Encoding
      --------+----------------+---------
      listing | listid         | delta
      listing | sellerid       | delta32k
      listing | eventid        | delta32k
      listing | dateid         | bytedict
      listing | numtickets     | bytedict
      listing | priceperticket | delta32k
      listing | totalprice     | mostly32
      listing | listtime       | raw
  39. Amazon Redshift zone maps: track the minimum and maximum value for each block, and skip over blocks that don't contain data relevant to the query (the slide illustrates three blocks with min/max ranges 10–324, 375–623, and 637–959).
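The block-skipping idea can be sketched in a few lines of Python (a toy model of the mechanism, not Redshift's implementation), reusing the slide's three example blocks:

```python
# Toy zone-map sketch: a per-block min/max lets a scan skip blocks
# that cannot contain the value being searched for.
blocks = [
    {"min": 10,  "max": 324, "values": [10, 13, 14, 26, 100, 245, 324]},
    {"min": 375, "max": 623, "values": [375, 393, 417, 512, 549, 623]},
    {"min": 637, "max": 959, "values": [637, 712, 809, 834, 921, 959]},
]

def scan(blocks, target):
    """Return (matches, blocks_read), skipping blocks via the zone map."""
    hits, blocks_read = [], 0
    for b in blocks:
        if target < b["min"] or target > b["max"]:
            continue  # zone map proves the value cannot be here: skip block
        blocks_read += 1
        hits += [v for v in b["values"] if v == target]
    return hits, blocks_read

print(scan(blocks, 417))  # ([417], 1): only the middle block is read
```

A predicate like `WHERE col = 417` therefore reads one block instead of three; the win grows with table size, and it is why the sort-key advice later in the deck matters so much.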
  40. Amazon Redshift direct-attached storage: local storage for performance; high scan rates; automatic replication; continuous backup and streaming restores to/from Amazon S3; user snapshots on demand; cross-region backups for disaster recovery.
  41. Amazon Redshift online resize: continue querying during the resize; the new cluster is deployed in the background at no extra cost; data is copied in parallel from node to node; automatic SQL endpoint switchover via DNS.
  42. Amazon Redshift works with existing data models, including star and snowflake schemas.
  43. Amazon Redshift data distribution across nodes and slices (e.g., Node 1 with slices 1 and 2, Node 2 with slices 3 and 4). Even: round-robin distribution. Key: the same key always maps to the same location. All: all data on every node.
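The three distribution styles can be sketched as plain Python over four slices (a conceptual toy, not Redshift's actual hash function or slice assignment):

```python
# Toy sketch of Redshift's EVEN, KEY, and ALL distribution styles.
import hashlib

SLICES = 4
rows = [{"user": u, "amount": a} for u, a in
        [("alice", 10), ("bob", 20), ("alice", 30), ("carol", 40)]]

def dist_even(rows):
    """EVEN: round-robin rows across slices."""
    slices = [[] for _ in range(SLICES)]
    for i, row in enumerate(rows):
        slices[i % SLICES].append(row)
    return slices

def dist_key(rows, key):
    """KEY: hash the distribution key, so equal keys share a slice."""
    slices = [[] for _ in range(SLICES)]
    for row in rows:
        h = int(hashlib.md5(str(row[key]).encode()).hexdigest(), 16)
        slices[h % SLICES].append(row)
    return slices

def dist_all(rows):
    """ALL: a full copy of the table on every node."""
    return [list(rows) for _ in range(SLICES)]

# With KEY distribution, both "alice" rows land on the same slice,
# so a join or aggregation on "user" needs no data shuffling.
by_key = dist_key(rows, "user")
alice_slices = {i for i, s in enumerate(by_key)
                for row in s if row["user"] == "alice"}
print(len(alice_slices))  # 1
```

In practice: KEY suits large fact tables joined on that key, ALL suits small dimension tables, and EVEN is the safe default when neither applies.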
  44. Sorting data in Amazon Redshift: within the slices (on disk), data is sorted by a sort key; choose a sort key that is frequently used in your queries. Because data in columns is marked with min/max values, Redshift can skip blocks that aren't relevant to the query, so a good sort key also prevents reading entire blocks.
  45. User-defined functions: Python 2.7; PostgreSQL UDF syntax; network calls within UDFs are prohibited; pandas, NumPy, and SciPy are pre-installed; you can import your own modules.
  46. Interleaved multi-column sort. Redshift currently supports compound sort keys, optimized for applications that filter data by one leading column. Support is being added for interleaved sort keys: optimized for filtering data by up to eight columns; no storage overhead, unlike an index; lower maintenance penalty compared to indexes.
  47. Amazon Redshift works with your existing analysis tools via JDBC/ODBC.
  48. Questions?
  49. AWS Summit – Chicago: an exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices, and new cloud services. Details: July 1, 2015, at McCormick Place, Chicago, Illinois. Featuring: new product launches; 36+ sessions, labs, and bootcamps; executive and partner networking. Registration is now open; come and see what AWS and the cloud can do for you. Register here: http://amzn.to/1RooPPL
