What we’ll cover
• Big Data and why organizations care
• Common Challenges: Which, What, How…
• AWS Big Data Solutions
• Big Data Driving Machine Learning
• Final Design Tenets
Data is a strategic asset for every organization
“The world’s most valuable resource is no longer oil, but data.”*
*Copyright: The Economist, 2017, David Parkins
Organizations that successfully generate business
value from their data will outperform their peers.
An Aberdeen survey saw organizations who
implemented a data lake outperforming similar
companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
Most Important: Driving Value from Data
Customers want more value from their data
• Growing exponentially
• From new sources
• Increasingly diverse
• Used by many people
• Analyzed by many applications
Data Lakes Extend Traditional Approaches
Data warehouse: business intelligence over OLTP, ERP, CRM, and LOB sources
Data lake: big data processing, real-time, and machine learning over devices, web, sensor, and social sources
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Data lakes on AWS
Durable and available; EB scale
Secure, compliant, auditable
Object-level controls for fine-grain access
Fast performance by retrieving subsets of data
The most ways to bring data in
2x as many integrations with partners
Broad set of analytics and ML services
S3
Lake Formation & Glue
Snowball
Snowmobile
Kinesis Data Streams
Kinesis Data Firehose
Redshift
EMR
Athena
Kinesis
Elasticsearch Service
SageMaker
Comprehend
Rekognition
WHAT Data Do I Have?
Gartner: “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.”
Data Lake on AWS: Storage | Archival Storage | Data Catalog
Set up a catalog, ETL, and data prep with AWS Glue
• Serverless provisioning, configuration, and scaling to run your ETL jobs
• Pay only for the resources used for jobs
• Crawl your data sources, identify data formats, and suggest schemas and transformations
• Automates the effort in building, maintaining, and running ETL jobs
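To make the crawling point concrete, here is a minimal Python sketch of what "identify data formats and suggest schemas" can mean for newline-delimited JSON. It is a toy illustration only, not AWS Glue's crawler logic: the `infer_schema` function, the type names, and the sample records are all hypothetical.

```python
import json

# Hypothetical mapping from Python types to catalog-style column types.
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def infer_schema(json_lines):
    """Suggest a column -> type mapping from newline-delimited JSON records."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for column, value in record.items():
            if value is None:
                schema.setdefault(column, None)  # column seen, type still unknown
                continue
            observed = TYPE_NAMES.get(type(value), "string")
            current = schema.get(column)
            if current is None or current == observed:
                schema[column] = observed
            elif {current, observed} == {"bigint", "double"}:
                schema[column] = "double"  # widen integers to floating point
            else:
                schema[column] = "string"  # conflicting types fall back to string
    # Columns that were only ever null default to string.
    return {col: (typ or "string") for col, typ in schema.items()}


sample = [
    '{"user_id": 1, "amount": 10, "country": "US"}',
    '{"user_id": 2, "amount": 12.5, "country": null}',
    '{"user_id": 3, "amount": 7, "premium": true}',
]
suggested = infer_schema(sample)
print(suggested)
```

The suggested schema is only a starting point; in a real catalog it would be reviewed and stored as table metadata so that query engines can share it.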
“Beeswax uses Amazon S3 and
AWS Glue Data Catalog to build a
highly reliable data lake that is
fully managed by AWS. Our
platform leverages the AWS Glue
Data Catalog integration with
Amazon EMR in Hive and Spark
SQL applications to deliver
reporting and optimization
features to our customers.”
—Ram Kumar Rengaswamy, CTO, Beeswax
MOST Important: Selecting an Agile Framework
Start with a tool that will serve the purpose
Experiment, Test, Iterate, Adopt.
HOW can I get started?
Let’s look at an example:
Evolution of the Netflix data pipeline
Aggregate and upload events to
Hadoop/Hive for batch processing
EXPERIMENT new things
Batch → Batch + Real-time
“Amazon Kinesis Streams processes multiple terabytes of log data each day, yet
events show up in our analytics in seconds. We can discover and respond to
issues in real time, ensuring high availability and a great customer experience.”
FOCUS on business value
Big data processing with Apache Spark & Hadoop
with Amazon EMR
• Easy to use notebooks
• Low cost vs on-premises
• Elastic autoscaling
• Reliable 99.9% SLA
• Secure with encryption and keys
• Flexible, open source choice
Enterprise-grade | Easy | Lowest cost
FINRA’s legacy system did not scale to handle 135 billion events per day. They needed to run complex surveillance queries over 20+ PB of data.
FINRA migrated their big data appliance to an S3 data lake and uses EMR for ingestion and processing.
The Forrester Wave: Cloud Hadoop/Spark Platforms, Q1 2019
The 11 Providers That Matter Most and How They Stack Up
by Noel Yuhanna and Mike Gualtieri, February 13, 2019
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and
Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™
is a graphical representation of Forrester's call on a market and is plotted using a
detailed spreadsheet with exposed scores, weightings, and comments. Forrester
does not endorse any vendor, product, or service depicted in the Forrester Wave™.
Information is based on best available resources. Opinions reflect judgment at the
time and are subject to change.
Data warehouse for business reporting
with Amazon Redshift
• Fast—up to 10x faster than
traditional data warehouses
• Easy to set up, deploy, and manage
• Cost-effective
• Scale on-demand for large data
volume and high query concurrency
• Query data in open formats directly
from the data lake
“20 percent of our queries now
complete in less than one
second. Best of all, we didn’t
have to change anything to
get this speed-up with
Redshift, which supports our
mission-critical workloads.”
—Greg Rokita, Executive Director
of Technology, Edmunds
Real-time analytics for timely insights
with Amazon Kinesis
• Make streaming data available to
multiple real-time analytics
applications
• Run streaming applications without
managing any infrastructure
• Durable to reduce the probability
of data loss
• Scalable to process data from hundreds of thousands of sources with low latencies
“Amazon Kinesis makes it simple
to scale our solution end to end,
including the capture, processing,
and delivery of actionable
insights. This empowers our
customers to better understand
their user base.”
— Indu Narayan, Director of Data, Yieldmo
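One mechanism behind this kind of scaling is spreading records across shards. Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below mimics that idea in plain Python. The shard count, the even split of the key space, the `sensor-N` keys, and the function names are assumptions for illustration, and no AWS API is called.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_ranges(shard_count):
    """Split the hash key space into contiguous, evenly sized ranges (an assumption)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the range that contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash value outside all ranges")


ranges = shard_ranges(4)
# Hypothetical producers: 1,000 sensors, each using its own ID as the partition key.
assignments = {f"sensor-{n}": shard_for(f"sensor-{n}", ranges) for n in range(1000)}
counts = [list(assignments.values()).count(i) for i in range(4)]
print(counts)
```

Because the same key always hashes to the same range, records from one source stay in order on one shard, while many distinct keys spread the load across all shards.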
Operational analytics for logs and search
with Amazon Elasticsearch Service
• Fully managed; deploy a production-ready cluster in minutes
• Direct access to Elasticsearch open-source APIs, Logstash, and Kibana
• VPC support; at-rest and in-transit encryption
• Scale up and down easily
“Ultimately, we are improving our
software products and offering
better service to our customers
because of the real-time visibility
we’re getting into log data.”
“Amazon Elasticsearch Service
enables data forensic activities
to take place and help find and
fix application problems faster.”
—Tommy Li, Senior Software Architect,
Autodesk
Interactive analysis
with Amazon Athena
• Interactive query service to
analyze data in Amazon S3 using
standard SQL
• No infrastructure to set up or
manage and no data to load
• Ability to run SQL queries on data
archived in Amazon Glacier
(coming soon)
“We only pay when we’re actually
querying the data, and we don’t
have to keep a cluster running all
the time. Using Amazon Athena,
we’re able to query seven years’
worth of data—adding up to
hundreds of terabytes—get
results at least 50 percent faster,
and save nearly $15,000 per
month.”
—Matt Chesler, director of DevOps
at Movable Ink
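The pay-per-query model described in the quote above can be reasoned about with simple arithmetic. This sketch assumes a price per terabyte scanned and two scan sizes; the $5.00 figure is a placeholder assumption rather than an authoritative price, so check current Athena pricing for your Region. It shows why scanning less data, for example through partitioning or columnar formats, lowers the bill.

```python
from decimal import Decimal

BYTES_PER_TB = Decimal(1024) ** 4
ASSUMED_PRICE_PER_TB = Decimal("5.00")  # placeholder assumption, not an official price


def query_cost(bytes_scanned, price_per_tb=ASSUMED_PRICE_PER_TB):
    """Estimate the cost of one query from the number of bytes it scans."""
    cost = Decimal(bytes_scanned) / BYTES_PER_TB * price_per_tb
    return cost.quantize(Decimal("0.0001"))


full_scan = 2 * 1024 ** 4      # hypothetical: the query reads 2 TB of raw data
pruned_scan = full_scan // 20  # hypothetical: partition pruning reads about 5% of it

print(query_cost(full_scan))
print(query_cost(pruned_scan))
```

Under these assumed numbers the pruned query costs one twentieth of the full scan, and nothing is charged while no queries run.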
Serverless analytics
Deliver on-demand analytics on the data lake
S3 data lake | Glue (ETL & Data Catalog) | Athena | QuickSight | AWS IoT | AI/ML
Sources: Devices | Web | Sensors | Social
• Serverless. Zero infrastructure. Zero administration
• Never pay for idle resources
• Availability and fault tolerance built in
• Automatically scales resources with usage
Big Data driving Machine Learning
Better Decisions | Better Products | More Users | More Data
Data: Click stream, User activity, Generated content, Purchases, Clicks, Likes, Sensor data
Tools: Object Storage, Databases, Data warehouse, Streaming analytics, BI, Hadoop, Spark/Presto, Elasticsearch, Machine Learning, Deep Learning/AI
AI SERVICES
Vision | Speech | Language | Chatbots | Forecasting | Recommendations
Rekognition Image, Rekognition Video, Polly, Transcribe, Translate, Comprehend, Lex, Forecast, Textract, Personalize
ML SERVICES
Amazon SageMaker: Build | Train | Deploy
• Pre-built algorithms & notebooks
• Data labeling (Ground Truth)
• One-click model training & tuning
• Optimization (Neo)
• One-click deployment & hosting
• Reinforcement learning
• Algorithms & models (AWS Marketplace for Machine Learning)
ML FRAMEWORKS & INFRASTRUCTURE
Frameworks | Interfaces | Infrastructure
EC2 P3 & P3N, EC2 C5, FPGAs, Greengrass, Elastic Inference
Agility in Machine Learning – for all users
Visual insights for everyone
with Amazon QuickSight
• Pay only for what you use
• Scale to tens of thousands of users
• Embedded analytics
• Build end-to-end BI solutions
• ML Insights
Supporting people with sight loss using Amazon Polly
RNIB creates and distributes accessible information in the form of synthesized content
• Largest library of audiobooks in the UK for nearly 2 million people with sight loss
• Naturalness of generated speech is critical to captivate and engage readers
• No restrictions on speech redistributions
“Amazon Polly delivers incredibly lifelike voices which captivate and engage our readers.”
John Worsfold
Solutions Implementation Manager, RNIB
Saving lives with Amazon SageMaker
“The scalability of Amazon SageMaker, and its ability
to integrate with native AWS services, adds
enormous value for us. We are excited about how our
continued collaboration between the GE Health Cloud
and Amazon SageMaker will drive better outcomes
for our healthcare provider partners and deliver
improved patient care.”
- Sharath Pasupunuti, AI Engineering Leader
Core Tenets
• Build decoupled systems
• Use the right tool for the job
• Leverage managed and serverless services
• Use event-journal design patterns
• Be cost-conscious
• Enable your application with machine learning (ML)
• Replace capacity planning with a consumption model
• Don’t forget metadata management
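Two of the tenets above, decoupled systems and event-journal design patterns, can be sketched together: producers append immutable events to a log, and each consumer reads at its own pace by tracking its own offset. The following is a minimal in-memory illustration. The `EventJournal` and `Consumer` classes and the event fields are hypothetical, and a production system would use a durable stream or object store rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class EventJournal:
    """Append-only log: events are added at the end and never updated or deleted."""
    _events: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        # Store a copy so later changes to the caller's dict do not alter the journal.
        self._events.append(dict(event))
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset: int) -> list:
        return list(self._events[offset:])


@dataclass
class Consumer:
    """Each consumer keeps its own offset, so consumers do not affect each other."""
    name: str
    offset: int = 0

    def poll(self, journal: EventJournal) -> list:
        events = journal.read_from(self.offset)
        self.offset += len(events)
        return events


journal = EventJournal()
journal.append({"type": "click", "user": "a"})
journal.append({"type": "purchase", "user": "a", "amount": 30})

analytics = Consumer("analytics")
billing = Consumer("billing")

first_batch = analytics.poll(journal)   # analytics reads both events
journal.append({"type": "click", "user": "b"})
second_batch = analytics.poll(journal)  # analytics sees only the new event
replay = billing.poll(journal)          # billing starts late and replays all three

print(len(first_batch), len(second_batch), len(replay))
```

Because the journal is the only shared component, a new consumer can be added later and replay history without any change to producers or to existing consumers.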