Join us for this general session where AWS big data experts present an in-depth look at the current state of big data. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data announcements, as we kick off the Big Data re:Source Mini Con.
2. What is Big Data?
When your data sets become so large and complex
you have to start innovating around how to
collect, store, process, analyze, and share it.
3. Big Data services on AWS
Collect: Amazon Kinesis, AWS Import/Export, AWS Direct Connect, AWS Database Migration Service
Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Aurora
Process & Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon Kinesis Analytics, Amazon Elasticsearch Service, Amazon QuickSight, AWS Data Pipeline
5. AWS Import/Export Snowball
Collection and storage
Petabyte-scale data transfer service that uses Amazon-provided storage devices for transport.
Copy up to 80TB of data from an on-premises file system to the Snowball through a 10Gbps network interface.
All data is encrypted with 256-bit GCM encryption.
E-ink shipping label; ruggedized case ("8.5G impact"); 50TB & 80TB capacities; 10G network
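To see why shipping a storage device beats the wire for bulk transfer, here is a back-of-the-envelope sketch (not from the slides; the link speeds and 80% utilization factor are illustrative assumptions):

```python
# Illustrative arithmetic: time to move 80TB over a network link, assuming
# ~80% sustained utilization. Loading a Snowball happens over its local
# 10Gbps interface; the comparison line is a typical WAN uplink.

def transfer_hours(terabytes: float, gbps: float, utilization: float = 0.8) -> float:
    """Hours to push `terabytes` over a `gbps` link at the given utilization."""
    bits = terabytes * 1e12 * 8                    # decimal TB -> bits
    seconds = bits / (gbps * 1e9 * utilization)
    return seconds / 3600

# Loading the Snowball locally over its 10Gbps interface:
print(f"10 Gbps (local load): {transfer_hours(80, 10):.1f} hours")
# The same 80TB pushed over a 100Mbps internet uplink instead:
print(f"100 Mbps (WAN):       {transfer_hours(80, 0.1) / 24:.0f} days")
```

At 100Mbps the transfer takes roughly three months, which is why the device ships through a courier instead.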
6. Amazon Redshift
Structured data processing
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
7. Amazon EMR
Semi-structured / unstructured data processing
Hadoop as a service
Spark, Presto, Flink, HBase, Hive, etc.
Easy to use; fully managed
On-demand and Spot pricing
HDFS & S3 file systems
8. Amazon Elasticsearch Service
Semi-structured / unstructured data processing
Distributed search and analytics engine
Managed service using Elasticsearch and Kibana
Fully managed; zero admin
Highly available and reliable
Tightly integrated with other AWS services
9. AWS Lambda
Serverless event processing
Serverless compute service that runs your code in response to events.
Extend AWS services with user-defined custom logic.
Pay only for the requests served and the compute time required; billing in increments of 100 milliseconds.
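The pay-per-use model above can be made concrete with a small sketch. The rates used below ($0.20 per million requests, $0.00001667 per GB-second) are the published on-demand Lambda prices of the time; treat them, the free tier (ignored here), and the workload numbers as illustrative:

```python
import math

# Hedged sketch of Lambda's billing model: per-request charge plus
# compute charged per GB-second, with duration rounded up to 100ms.

REQUEST_PRICE = 0.20 / 1_000_000      # $ per invocation
GB_SECOND_PRICE = 0.00001667          # $ per GB-second of compute

def lambda_cost(requests: int, duration_ms: float, memory_mb: int) -> float:
    """Monthly cost; duration is rounded up to the next 100ms increment."""
    billed_ms = math.ceil(duration_ms / 100) * 100
    gb_seconds = requests * (billed_ms / 1000) * (memory_mb / 1024)
    return requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

# Example: 3M requests/month, 120ms average runtime (billed as 200ms), 512MB:
print(f"${lambda_cost(3_000_000, 120, 512):.2f}/month")
```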
10. Amazon Kinesis
Streaming data processing
Streams: Build your own custom application to process streaming data using the Amazon Kinesis Client Library. Connectors to S3, DynamoDB, Lambda, Amazon Redshift, Elasticsearch, Storm spout, …
Firehose: Load massive volumes of streaming data into S3, Amazon Redshift, and Elasticsearch. Inline processing using Lambda and a library of ready-to-use templates.
Analytics: Analyze streaming data using standard SQL; no servers to manage; elastically scales; pay as you go.
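A Streams application is provisioned in shards, each of which ingests up to 1MB/s or 1,000 records/s, whichever limit is hit first. A minimal sizing sketch (the workload figures are the Twitter example used later in the deck):

```python
import math

# Kinesis Streams shard sizing: a shard accepts up to 1MB/s of data OR
# 1,000 records/s on ingest, so the stream needs enough shards to cover
# whichever dimension is binding.

def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    by_throughput = (records_per_sec * avg_record_kb) / 1024   # MB/s vs 1MB/s/shard
    by_records = records_per_sec / 1000                        # vs 1,000 rec/s/shard
    return max(1, math.ceil(max(by_throughput, by_records)))

# ~5,800 records/s at 2KB each (~11.3MB/s) is throughput-bound:
print(shards_needed(5800, 2))   # -> 12 shards
```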
12. Amazon QuickSight
Visualize and explore
Fast, powered by SPICE; automatically scales.
Explore, analyze, and share insights with anyone.
1/10th the cost of traditional BI solutions.
Broad connectivity with AWS data services (Amazon RDS, Amazon S3, Amazon Redshift), on-premises data, files, and business applications.
14. Scale as your data and business grows
The volume, variety, and velocity at which data is being generated leave organizations with new questions to answer.
15. Scale: S3 data lake
Store and analyze all your data, structured and unstructured, from all of your sources, in one centralized location at low cost.
Quickly ingest data without needing to force it into a pre-defined schema, enabling ad hoc analysis by applying schemas on read, not on write.
Separating your storage and compute lets you scale each component as required and attach multiple data processing and analytics services to the same data set.
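The schema-on-read idea can be sketched in a few lines: raw, heterogeneous records land in the lake untouched, and a schema is only projected when a consumer reads them. The field names below are invented for the illustration:

```python
import json

# Schema on read: store records as-is, project a schema at query time.
raw_events = [                      # what lands in the data lake, unmodified
    '{"user": "a1", "os": "iOS", "ts": 1480000000}',
    '{"user": "b2", "os": "Android", "extra": {"beta": true}}',
    '{"user": "c3"}',               # ragged record: no "os" field
]

def read_with_schema(lines, fields):
    """Apply a projection at read time; missing fields become None."""
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

# Two consumers, two schemas, one copy of the data:
print(list(read_with_schema(raw_events, ["user", "os"])))
print(list(read_with_schema(raw_events, ["user", "ts"])))
```

No write-time schema means the third, ragged record is still queryable; each consumer decides what shape it needs.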
17. Starting small is powerful, when you can scale up fast

Scaling up your analytics systems                With AWS     Traditional IT*
Get a new BI server                              20 minutes   3 months
Upgrade your analytics server to the newest
  Intel processors and add 16GB memory           10 minutes   2 months
Add 500TB of storage                             instant      2 months
Grow a DWH cluster from 8GB to 1PB               1 hour       8 months
Build a 1024-node Hadoop cluster                 30 minutes   unlikely
Roll out a multi-region production environment   hours        months

* actual provisioning times in a well-organized IT division
18. Netflix: Using Amazon S3 as the fabric of our big data ecosystem
Tuesday, Nov. 29
5:30pm – 6:30pm
Mirage, St. Croix B
20. Putting it together: cost
How much would it cost to process the Twitter fire hose?
21. Putting it together: cost
How much would it cost to process the Twitter fire hose?
S3: $0.025/GB-Mo
Redshift: Starts at $0.25/hour
EC2: Starts at $0.02/hour
Glacier: $0.007/GB-Mo
Kinesis: $0.015/shard-hour (1MB/s in, 2MB/s out); $0.014/million PUTs
22. Cost
500MM tweets/day ≈ 5,800 tweets/sec
2KB/tweet ≈ 12MB/sec (~1TB/day)
$0.015/hour per shard; $0.014/million PUTs
Amazon Kinesis cost: $0.47/hour
Amazon Redshift cost: $0.85/hour (for a 2TB node)
S3 cost: $1.02/hour (no compression)
Total: $2.34/hour, on demand
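The slide's Kinesis arithmetic can be reproduced directly; the Redshift and S3 hourly figures are taken from the slide rather than derived here:

```python
import math

# Back-of-the-envelope cost of ingesting the Twitter firehose via Kinesis.
TWEETS_PER_DAY = 500_000_000
TWEET_KB = 2

tweets_per_sec = TWEETS_PER_DAY / 86_400            # ~5,800/s
mb_per_sec = tweets_per_sec * TWEET_KB / 1024       # ~11.3MB/s (~1TB/day)

shards = math.ceil(mb_per_sec)                      # 1MB/s ingest per shard
shard_cost = shards * 0.015                         # $0.015 per shard-hour
put_cost = tweets_per_sec * 3600 / 1e6 * 0.014      # $0.014 per million PUTs
kinesis = shard_cost + put_cost                     # ~$0.47/hour

total = kinesis + 0.850 + 1.02                      # + Redshift + S3 (slide figures)
print(f"Kinesis: ${kinesis:.2f}/hr, total: ${total:.2f}/hr")
```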
23. Cost
Use only the services you need
Scale only the services you need
Pay for only what you use
Discounts through Reserved Instances, Spot pricing, and upfront commitments
25. Putting it together: scale and security
FINRA: Monitor and enforce trading regulations
FINRA handles approximately 75 billion market events every day to build a holistic picture of trading in the U.S., running hundreds of surveillance algorithms against massive amounts of data.
FINRA mission
Deter misconduct by enforcing the rules.
Detect and prevent wrongdoing in US markets
Discipline those who break the rules
Scale brings unique challenges
Market volumes are volatile and increasing
Exchanges are dynamically evolving
Regulatory rules are created and enhanced
New securities products are introduced
Market manipulators innovate
26. Petabytes of data generated on premises, brought to AWS, and stored in an S3 data lake.
Thousands of analytical queries performed on EMR and Redshift, with over 400 analytics packages.
Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, AWS CloudTrail, and database auditing.
[Architecture diagram: web applications for analysts and regulators drive flexible interactive queries (Amazon EMR), predefined queries (Amazon Redshift), and surveillance analytics (Amazon EMR), on top of data management, data movement, data registration, and version management over Amazon S3 — a platform that adapts to market dynamics.]
27. Scale and security
Store an exabyte of data or more in S3
Analyze GB to PB using standard tools
Encryption of all data at each step
Auditability of all API calls and retrievals
Control egress and ingress points using VPCs
FINRA: Building a Secure Data Science Platform on AWS
Tuesday, Nov. 29
4:00pm – 5:00pm
Mirage, St. Croix B
30. What item most interests you this week?
What item will be the most difficult to explain to
your significant other when you return home?
What will give you the biggest headache this week?
New Amazon Web Services Blackjack
Networking with Peers re:Play Party
31. What item most interests you this week?
What are your colleagues most interested in hearing
about when you return next week?
What will give you the biggest headache this week?
New Amazon Web Services Blackjack
Networking with Peers re:Play Party
34. The demo application
CREATE OR REPLACE STREAM DESTINATION_SQL_STREAM (
    UNIQUE_USER_COUNT INT,
    ANDROID_COUNT INT,
    IOS_COUNT INT,
    WINDOWS_PHONE_COUNT INT,
    OTHER_OS_COUNT INT,
    QUADRANT_A_COUNT INT,
    QUADRANT_B_COUNT INT,
    QUADRANT_C_COUNT INT,
    QUADRANT_D_COUNT INT,
    WINDOW_TIME TIMESTAMP);
CREATE OR REPLACE STREAM DISTINCT_USER_STREAM (
    COGNITO_ID VARCHAR(64),
    DEVICE VARCHAR(32),
    OS VARCHAR(32),
    QUADRANT CHAR(1),
    DT TIMESTAMP);
CREATE OR REPLACE PUMP "DISTINCT_USER_PUMP" AS
INSERT INTO "DISTINCT_USER_STREAM"
SELECT STREAM DISTINCT
"cognitoId",
"device",
"os",
"quadrant",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO SECOND)
FROM "SOURCE_SQL_STREAM_001";
CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
COUNT("DISTINCT_USER_STREAM".COGNITO_ID) AS UNIQUE_USER_COUNT,
COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'Android' THEN COGNITO_ID ELSE null END)) AS ANDROID_COUNT,
COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'iOS' THEN COGNITO_ID ELSE null END)) AS IOS_COUNT,
COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'Windows Phone' THEN COGNITO_ID ELSE null END)) AS WINDOWS_PHONE_COUNT,
COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'other' THEN COGNITO_ID ELSE null END)) AS OTHER_OS_COUNT,
COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'A' THEN COGNITO_ID ELSE null END)) AS QUADRANT_A_COUNT,
COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'B' THEN COGNITO_ID ELSE null END)) AS QUADRANT_B_COUNT,
COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'C' THEN COGNITO_ID ELSE null END)) AS QUADRANT_C_COUNT,
COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'D' THEN COGNITO_ID ELSE null END)) AS QUADRANT_D_COUNT,
ROWTIME
FROM "DISTINCT_USER_STREAM"
GROUP BY
FLOOR("DISTINCT_USER_STREAM".ROWTIME TO SECOND);
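The per-second tumbling-window aggregation the SQL above performs can be sketched in ordinary Python; this is an illustrative re-implementation with made-up events, not part of the demo application:

```python
from collections import defaultdict

# What the Kinesis Analytics SQL computes: distinct users per one-second
# window, broken out by attribute (here just OS, for brevity).
events = [
    # (epoch_seconds_with_fraction, cognito_id, os)
    (100.1, "u1", "iOS"), (100.2, "u1", "iOS"),    # same user twice in one window
    (100.7, "u2", "Android"), (101.3, "u3", "iOS"),
]

windows = defaultdict(set)                 # window start -> {(user, os)}
for ts, user, os_name in events:
    # FLOOR(ROWTIME TO SECOND) + SELECT DISTINCT, combined:
    windows[int(ts)].add((user, os_name))

for start in sorted(windows):
    users = windows[start]
    print(start, {
        "unique": len(users),
        "ios": sum(1 for _, o in users if o == "iOS"),
        "android": sum(1 for _, o in users if o == "Android"),
    })
```

The set handles the DISTINCT pump; the per-OS sums mirror the `COUNT(CASE WHEN … THEN COGNITO_ID END)` columns.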
35. Agility & actionable insights
Big data does not mean just batch:
It can be streamed in and processed in real time.
It can be used to respond quickly to requests and actionable events, generating business value.
You can mix and match:
On-premises and cloud
Custom development and managed services
37. Putting it together: choice and selection
AWS Marketplace: software store with simplified procurement
1-click deployment to launch, in multiple regions around the world
Pay-as-you-go pricing with no long-term contracts required
2,000+ product listings to browse, test, and buy software; 290 specific to big data
Categories include Advanced Analytics, Database and Data Enablement, and Business Intelligence
38. Largest ecosystem of ISVs & integrators
Tens of thousands of consulting and technology partners
39. We have a retail mindset
Use our managed big data services
Build or bring your own
Or access thousands in our marketplace
Each customer decides for themselves
Choice &
selection
41. We are
A tech-for-good platform for events-based fundraising, charities, and crowdfunding: "Ensure no good cause goes unfunded"
• The #1 platform for online social giving in the world
• Peaks in traffic: ice bucket challenge, natural disasters
• Raised $4.2bn in donations
• 28.5m users
• 196 countries
• 27,000 good causes
• GiveGraph: 91 million nodes, 0.53 billion relationships
43. Our requirements
• Limitations in our existing SQL Server data warehouse
• Long-running and complex queries for data scientists
• New data sources: API, clickstream, unstructured, log, behavioral data, etc.
• Easy to add data sources and pipelines
• Reduce time spent on data preparation and experiments
[Diagram: data ingestion and data preparation feed automated pipelines for stream processing, machine learning, graph processing, and natural language processing, producing data-driven insight, predictions, measures, and recommendations.]
44. Event-driven data platform at JustGiving [1 of 2]
• JustGiving developed an in-house analytics and data science platform in AWS called RAVEN
• Reporting, Analytics, Visualization, Experimental, Networks
• Uses event-driven and serverless pipelines rather than workflows or DAGs
• Messaging, queues, pub/sub patterns
• Separates storage from compute
• Supports scalable, event-driven ETL / ELT, machine learning, natural language processing, and graph processing
• Allows users to consume raw tables, data blocks, metrics, KPIs, insight, reports, etc.
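The pub/sub pattern behind such an event-driven pipeline can be sketched minimally: a new data event fans out to whatever processing stages subscribed, with no central workflow scheduler. The bus, topic, and stage names below are invented for illustration and are not RAVEN's actual design:

```python
from collections import defaultdict

# Minimal in-memory pub/sub sketch: stages subscribe to topics, and each
# published event is dispatched to every subscriber independently.
class Bus:
    def __init__(self):
        self.subs = defaultdict(list)      # topic -> [handler, ...]
    def subscribe(self, topic, handler):
        self.subs[topic].append(handler)
    def publish(self, topic, event):
        for handler in self.subs[topic]:
            handler(event)

bus = Bus()
processed = []
# Two independent stages react to the same event; adding a third stage
# would not require touching the producer or the other consumers.
bus.subscribe("raw-data", lambda e: processed.append(("etl", e)))
bus.subscribe("raw-data", lambda e: processed.append(("ml-features", e)))

bus.publish("raw-data", {"table": "donations", "rows": 42})
print(processed)    # both stages ran, triggered by the single event
```

In a serverless deployment the bus would be a managed service (e.g. queues or notifications) and each handler a function, but the decoupling is the same.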
50. Big Data Mini Con sessions

1:00 PM
• Mirage, Bermuda A: Beeswax: Building a Real-Time Streaming Data Platform on AWS
• Mirage, St. Croix B: Big Data Architectural Patterns and Best Practices on AWS
• Mirage, Event Center B: Deep Dive: Amazon EMR Best Practices & Design Patterns
• Mirage, Barbados A: Workshop: Building Your First Big Data Application with AWS

2:30 PM
• Mirage, Bermuda A: JustGiving: Serverless Data Pipelines, Event-Driven ETL, and Stream Processing
• Mirage, St. Croix B: Best Practices for Apache Spark on Amazon EMR
• Mirage, Event Center B: Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT Analytics Platform on AWS

4:00 PM
• Mirage, Bermuda A: Analyzing Streaming Data in Real-time with Amazon Kinesis Analytics
• Mirage, St. Croix B: FINRA: Building a Secure Data Science Platform on AWS
• Mirage, Event Center B: Best Practices for Data Warehousing with Amazon Redshift
• Mirage, Barbados A: Workshop: Building Your First Big Data Application with AWS

5:30 PM
• Mirage, Bermuda A: Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana
• Mirage, St. Croix B: Netflix: Using Amazon S3 as the fabric of our big data ecosystem
• Mirage, Event Center B: Visualizing Big Data Insights with Amazon QuickSight

Plus, repeats for many sessions throughout the week!
51. Get started with Big Data on AWS
aws.amazon.com/big-data

Self-paced Online Labs
Big Data Quest: Learn at your own pace and practice working with AWS services for big data on QwikLABS. (3 Hours | Online)
qwiklabs.com/quests/1

AWS Courses
Big Data on AWS: How to use AWS services to process data with Hadoop and create big data environments. (3 Days | Classroom)
aws.amazon.com/training/course-descriptions/bigdata/
Big Data Technology Fundamentals: FREE! Overview of AWS big data solutions for architects or data scientists new to big data. (3 Hours | Online)