It has never been easier to use AWS to design and build a data architecture that helps you gain insights and uncover new opportunities to scale and grow your business. Join this workshop to learn how you can gain insights at scale with the right big data applications.
The AWS Workshop Series Online is a series of live webinars designed for IT professionals who want to leverage the AWS Cloud to build and transform their business, whether they are new to the AWS Cloud or looking to further expand their skills and expertise. In this series, we will cover 'Modern Data Architectures for Business Insights at Scale'.
Structured, Unstructured and Streaming Big Data on AWS - Amazon Web Services
It has never been easier or more affordable to use AWS to solve business problems and uncover new opportunities with data. Now, businesses of all sizes and across all industries can take advantage of big data technologies and easily collect, store, process, analyze, and share their data. Gain a thorough understanding of what AWS offers across the big data lifecycle and learn architectural best practices for applying these technologies to your projects. We will also deep dive into how to use AWS services such as Kinesis, DynamoDB, Redshift, and QuickSight to optimize logging, build real-time applications, and analyze and visualize data at any scale.
ENT316 Keeping Pace With The Cloud: Managing and Optimizing as You Scale - Amazon Web Services
With cloud maturity come operational efficiencies and endless potential for innovation and business growth. However, without the right strategy, the complexities of governing cloud infrastructure can impede progress. Visibility, accountability, and actionable insights are among the most valuable considerations. The AWS cloud clearly enables convenience and cost savings for organizations that know how to leverage its full potential. Amazon EC2 Reserved Instances (RIs), in particular, present a tremendous opportunity to save significantly on capacity when scaling, but there are many considerations to fully reaping the benefits of RIs. In this session, CloudCheckr CTO Patrick Gartlan will present issues that every organization runs into when scaling, provide best practices for combating them, and help you show your boss how RIs help you save money and move faster.
This session is brought to you by AWS Summit New York City sponsor, CloudCheckr.
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
Speaker: Ivan Cheng, Solution Architect, AWS
Amazon Web Services gives you fast access to flexible, low-cost IT resources, so you can rapidly scale and build virtually any big data application, including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and Internet of Things processing, regardless of the volume, velocity, and variety of your data.
https://aws.amazon.com/webinars/anz-webinar-series/
In this session, storage experts will walk you through the object storage offering, Amazon S3, a bulk data repository that can deliver 99.999999999% durability and scale past trillions of objects worldwide. Learn about the different ways you can accelerate data transfer to S3 and get a close look at some of the new tools available for you to secure and manage your data more efficiently. See how you can use Amazon Athena, announced at re:Invent 2016, with S3 to run serverless analytics on your data, and as a bonus, walk away with some code snippets to use with S3. Hear AWS customers talk about the solutions they have built with S3 to turn their data into a strategic asset instead of just a cost center. Bring your toughest questions for our experts on hand, and walk away that much smarter about how to use object storage from AWS.
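As a flavor of the kind of snippet the session promises, here is a minimal hedged sketch of running a serverless Athena query over data in S3 with boto3; the database, query, and bucket names are placeholder assumptions, not anything from the session itself.

```python
# Hypothetical sketch: run a serverless SQL query against data in S3 with
# Amazon Athena. Bucket, database, and table names are placeholders.

def build_athena_request(database, query, output_s3_path):
    """Build the parameter dict for Athena's StartQueryExecution API."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3_path},
    }

def run_query(database, query, output_s3_path, region="us-east-1"):
    """Submit the query via boto3 (requires AWS credentials)."""
    import boto3  # deferred so the builder above runs without AWS access
    athena = boto3.client("athena", region_name=region)
    return athena.start_query_execution(
        **build_athena_request(database, query, output_s3_path))

if __name__ == "__main__":
    params = build_athena_request(
        "weblogs_db",
        "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
        "s3://my-athena-results/queries/",
    )
    print(params["ResultConfiguration"]["OutputLocation"])
```

Athena writes query results back to the S3 output location, so the only moving parts are the query string and two names; there is no cluster to provision.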
AWS provides a broad platform of managed services to help you build, secure, and seamlessly scale end-to-end big data applications quickly and with ease. Want to get ramped up on how to use Amazon's big data web services? Wondering when to use which service? Want to write your first big data application on AWS? Join us in this session as we discuss reference architectures, design patterns, and best practices for pulling together various AWS services to meet your big data challenges.
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS - Amazon Web Services
Unni Pillai, Specialist Solution Architect, ASEAN, AWS.
Daniel Muller, Head of Cloud Infrastructure, Spuul.
As the volume and types of data continue to grow, customers often have valuable data that is not easily discoverable and available for analytics. A common challenge for data engineering teams is architecting a data lake that can cater to the needs of diverse users, from developers to business analysts to data scientists.
In this session, we will dive deep into building a data lake using Amazon S3, Amazon Kinesis, Amazon Athena and AWS Glue. We will also see how AWS Glue crawlers can automatically discover your data, extracting and cataloguing relevant metadata to reduce the manual work of preparing your data for downstream consumers.
Furthermore, learn from our customer Spuul how they moved from data-warehouse-based analytics to a serverless data lake. Why and how did Spuul undertake this journey? Hear about the benefits and challenges they encountered.
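The crawler-based discovery described above might be set up roughly as follows with boto3; the crawler name, IAM role, database, S3 path, and schedule are all placeholder assumptions for illustration.

```python
# Hypothetical sketch: register an AWS Glue crawler that discovers data in S3
# and catalogs its schema. All names and ARNs below are placeholders.

def build_crawler_request(name, role_arn, database, s3_path):
    """Build the parameter dict for Glue's CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Run nightly so newly landed data is cataloged automatically.
        "Schedule": "cron(0 2 * * ? *)",
    }

def create_and_start_crawler(params):
    """Create and kick off the crawler via boto3 (requires AWS credentials)."""
    import boto3  # deferred so the pure builder above runs without AWS access
    glue = boto3.client("glue")
    glue.create_crawler(**params)
    glue.start_crawler(Name=params["Name"])

if __name__ == "__main__":
    req = build_crawler_request(
        "raw-events-crawler",
        "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "datalake_raw",
        "s3://my-datalake/raw/events/",
    )
    print(req["Targets"]["S3Targets"][0]["Path"])
```

Once the crawler has populated the Glue Data Catalog, the same tables become queryable from Athena without further schema work.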
Cloud computing gives you a number of advantages, such as the ability to scale your web application or website on demand. If you have a new web application and want to use cloud computing, you might be asking yourself, "Where do I start?" Join us in this session to understand best practices for scaling your resources from zero to millions of users. We show you how to best combine different AWS services, how to make smarter decisions for architecting your application, and how to scale your infrastructure in the cloud.
Join us for an in-depth look at the current state of big data at AWS. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data developments.
NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computing - Amazon Web Services
AWS Batch is a fully managed service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads of any scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there is no need to install or manage batch computing software, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2, Spot Instances, and AWS Lambda. AWS Batch reduces operational complexity, saving time and reducing costs. In this session, Principal Product Managers Jamie Kinney and Dougal Ballantyne describe the core concepts behind AWS Batch and the details of how the service functions. The presentation concludes with relevant use cases and sample code.
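The submission flow the abstract describes can be sketched in a few lines of boto3; the job, queue, and job-definition names here are placeholders, not sample code from the talk.

```python
# Hypothetical sketch: submit a job to an existing AWS Batch job queue.
# Queue and job-definition names are placeholders.

def build_submit_job_request(name, queue, job_definition, command=None):
    """Build the parameter dict for Batch's SubmitJob API."""
    req = {"jobName": name, "jobQueue": queue, "jobDefinition": job_definition}
    if command:
        # Override the container command baked into the job definition.
        req["containerOverrides"] = {"command": command}
    return req

def submit(params):
    """Submit the job via boto3 (requires AWS credentials)."""
    import boto3  # deferred so the builder above runs without AWS access
    return boto3.client("batch").submit_job(**params)

if __name__ == "__main__":
    req = build_submit_job_request(
        "nightly-aggregation",
        "analytics-queue",
        "aggregate-logs:3",
        command=["python", "aggregate.py", "--date", "2017-01-01"],
    )
    print(req["jobName"])
```

The service then takes over provisioning and scheduling; the caller never names an instance type or cluster.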
AWS Webcast - Managing Big Data in the AWS Cloud (2014-09-24) - Amazon Web Services
This presentation deck covers specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It also covers architectural designs for the optimal use of these services based on the dimensions of your data source (structured or unstructured data, volume, item size, and transfer rates) and application considerations for latency, cost, and durability. Finally, it shares customer success stories and resources to help you get started.
Real-time Analytics using Data from IoT Devices - AWS Online Tech Talks - Amazon Web Services
Learning Objectives:
- Learn the different options available to stream data from IoT sensors to AWS
- Understand how to architect an analytics solution using AWS services to ingest and process IoT data
- Take away best practices for building IoT applications with scalability, cost-effectiveness, and security
An overview of Amazon Kinesis Firehose, Amazon Kinesis Analytics, and Amazon Kinesis Streams so you can quickly get started with real-time streaming data.
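Getting started with Kinesis Streams can be as small as putting one JSON record; a hedged sketch follows, with the stream name and event shape as placeholder assumptions.

```python
import json

# Hypothetical sketch: write a single event to an existing Kinesis stream.
# The stream name and event fields are placeholders.

def build_put_record_request(stream, event, partition_key):
    """Build the parameter dict for Kinesis' PutRecord API."""
    return {
        "StreamName": stream,
        # Kinesis carries opaque bytes; JSON-encode the event ourselves.
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key land on the same shard,
        # preserving per-key ordering.
        "PartitionKey": partition_key,
    }

def put_record(params):
    """Send the record via boto3 (requires AWS credentials)."""
    import boto3  # deferred so the builder above runs without AWS access
    return boto3.client("kinesis").put_record(**params)

if __name__ == "__main__":
    req = build_put_record_request(
        "clickstream", {"user": "alice", "page": "/home"}, partition_key="alice")
    print(req["StreamName"])
```

Firehose and Kinesis Analytics then consume from such a stream without any change to the producer side.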
Today, organizations find themselves in a data-rich world with a growing need for agility and accessibility of all this data, for analysis and for deriving keen insights that drive strategic decisions. Creating a data lake helps you manage all the disparate sources of data you are collecting in their original format and extract value from them. In this session, learn how to architect and implement an analytics data lake. Hear customer examples of best practices and learn from their architectural blueprints.
Data comes in a variety of forms, and to gain insight from it you need the right platform in place. AWS has services to cover all types of data, whether you need databases for structured data, Hadoop for unstructured data, or a streaming engine for high-velocity data. In this session, we will cover the various data analytics services on AWS and when to use them.
Big Data and Analytics – End to End on AWS – Russell Nash - Amazon Web Services
In this session we will look at the common patterns for the ingest, storage, processing and analysis of different types of data on the AWS platform and illustrate how you can harness the power and scale of the cloud to drive innovation in your own business.
BDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases - Amazon Web Services
In this session, you will learn best practices for implementing simple to advanced real-time streaming data use cases on AWS. First, we will review the decision points between near-real-time and real-time scenarios. Next, we will take a look at streaming data architecture patterns that include Amazon Kinesis Analytics, Amazon Kinesis Firehose, Amazon Kinesis Streams, Spark Streaming on Amazon EMR, and other open-source libraries. Finally, we will dive deep into the most common of these patterns and cover design and implementation considerations.
Big Data Architectural Patterns and Best Practices on AWS - Amazon Web Services
The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But which services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
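The data-bus framing above (ingest, store, process, visualize) can be illustrated with a toy, pure-Python pipeline; this is only the shape of the bus, not AWS code, and the clickstream format is invented for the example.

```python
from collections import Counter

def ingest(raw_lines):
    """Ingest: parse raw 'user,page' clickstream lines into records."""
    records = []
    for line in raw_lines:
        user, page = line.strip().split(",")
        records.append({"user": user, "page": page})
    return records

def store(records, datastore):
    """Store: append records to a durable store (a list stands in here)."""
    datastore.extend(records)
    return datastore

def process(datastore):
    """Process: aggregate page views per page."""
    return Counter(r["page"] for r in datastore)

def visualize(counts):
    """Visualize: render a crude text bar chart of the aggregates."""
    return "\n".join(f"{page:10s} {'#' * n}" for page, n in counts.most_common())

if __name__ == "__main__":
    raw = ["alice,/home", "bob,/home", "alice,/cart"]
    print(visualize(process(store(ingest(raw), []))))
```

In the session's terms, each stage maps to a service choice: Kinesis for ingest, S3 or DynamoDB for store, EMR or Redshift for process, and QuickSight for visualize.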
"The only real mistake is the one from which we learn nothing." So how do we learn from system failures? This session will move beyond "blameless" postmortems and show how to use data to avoid and mitigate future failures. We will share best practices for gathering systems-related data and people-related data. You will then learn how to apply the data to formulate actionable response plans and avoid repeating failures.
This session is brought to you by AWS Summit New York City sponsor, Datadog.
Taking the Performance of your Data Warehouse to the Next Level with Amazon R... - Amazon Web Services
Amazon Redshift gives you fast SQL query performance on large data sets. We will discuss optimisation from end to end, from loading through to querying, to ensure your end users get the data they need when they need it.
Speaker: Russell Nash, Solutions Architect, Amazon Web Services
Featured Customer - Domain
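On the loading end of the optimisation story above, a common pattern is bulk-loading compressed files from S3 with a COPY statement. A minimal sketch of assembling one follows; the table, bucket, and IAM role are placeholder assumptions.

```python
# Hypothetical sketch: build a Redshift COPY statement for bulk-loading
# gzipped CSV files from S3. All names and ARNs are placeholders.

def build_copy_statement(table, s3_prefix, iam_role_arn, gzip=True):
    """Assemble a COPY statement; execute it through any SQL client."""
    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",            # loads every file under the prefix
        f"IAM_ROLE '{iam_role_arn}'",     # role Redshift assumes to read S3
        "CSV",
    ]
    if gzip:
        parts.append("GZIP")
    return " ".join(parts) + ";"

if __name__ == "__main__":
    print(build_copy_statement(
        "analytics.page_views",
        "s3://my-bucket/page_views/2017/",
        "arn:aws:iam::123456789012:role/RedshiftCopyRole",
    ))
```

Loading many smaller compressed files under one prefix lets Redshift parallelise the load across slices, which is usually far faster than one large file.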
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS - Amazon Web Services
The AWS platform enables large enterprises to use data to solve business problems and uncover opportunities more easily and affordably than ever before. However, to truly take advantage of AWS, enterprises need a way to collect, store, process, analyze, and continually execute on their data.
Datapipe has been an AWS partner for more than five years. In that time, it has developed a proprietary process for deploying AWS environments, as well as for processing and evaluating big data analytics to optimize those environments over time. This flexible solution includes automation tools, continuous monitoring, and cloud analytics. It protects against architectural sprawl and continually redesigns for scalability. This kind of continuous-build environment allows Datapipe to examine the AWS environment as a complete picture and ensure the cloud environment is running as efficiently and effectively as possible, ultimately reducing overhead costs for the enterprise.
In this session, Jason Woodlee, Senior Director of Cloud Products at Datapipe, will discuss the technical details of designing and deploying a modern big data architecture on AWS, including application purpose and design, a development environment and language overview, DevOps automation best practices, and continuous build and test frameworks. Session sponsored by Datapipe.
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30... - Amazon Web Services
In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes, and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
Running Lean Architectures: How to Optimize for Cost Efficiency - Amazon Web Services
Whether you’re a cash-strapped startup or an enterprise optimizing spend, it pays to run cost-efficient architectures on AWS. This session reviews a wide range of cost planning, monitoring, and optimization strategies, featuring real-world experience from AWS customers. We’ll cover how you can effectively combine EC2 On-Demand, Reserved, and Spot Instances to handle different use cases; leverage Auto Scaling to match capacity to workload; choose the optimal instance type through load testing; take advantage of Multi-AZ support; and use CloudWatch to monitor usage and automatically shut off resources when not in use. We'll discuss taking advantage of tiered storage and caching, offloading content to Amazon CloudFront to reduce back-end load, and getting rid of your back end entirely by leveraging AWS high-level services. We will also showcase simple tools to help track and manage costs, including AWS Cost Explorer, billing alerts, and Trusted Advisor. This session will be your pocket guide for running cost-effectively in the Amazon cloud.
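One concrete form of "automatically shut off resources when not in use" is a CloudWatch alarm with the built-in EC2 stop action. A hedged sketch follows; the instance ID, threshold, and window are placeholder assumptions to tune for your workload.

```python
# Hypothetical sketch: stop an EC2 instance whose average CPU stays below a
# threshold for a sustained period. The instance ID is a placeholder.

def build_idle_alarm_request(instance_id, region="us-east-1",
                             cpu_threshold=5.0, idle_hours=1):
    """Build the parameter dict for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"stop-when-idle-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 3600,                      # one-hour evaluation windows
        "EvaluationPeriods": idle_hours,
        "Threshold": cpu_threshold,
        "ComparisonOperator": "LessThanThreshold",
        # Built-in EC2 stop action; no Lambda function required.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:stop"],
    }

def create_alarm(params):
    """Register the alarm via boto3 (requires AWS credentials)."""
    import boto3  # deferred so the builder above runs without AWS access
    boto3.client("cloudwatch").put_metric_alarm(**params)

if __name__ == "__main__":
    print(build_idle_alarm_request("i-0123456789abcdef0")["AlarmName"])
```

CPU alone is a crude idleness signal; pairing it with network-in/out metrics avoids stopping instances that are busy on I/O rather than compute.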
In this session, storage experts will walk you through the object storage offering, Amazon S3, a bulk data repository that can deliver 99.999999999% durability and scale past trillions of objects worldwide. Learn about the different ways you can accelerate data transfer to S3 and get a close look at some of the new tools available for you to secure and manage your data more efficiently. Announced at re:Invent 2016, see how you can use Amazon Athena with S3 to run serverless analytics on your data and as a bonus, walk away with some code snippets to use with S3. Hear AWS customers talk about the solutions they have built with S3 to turn their data into a strategic asset, instead of just a cost center. And bring your toughest questions to our experts on hand and walk away that much smarter on how to use object storage from AWS.
AWS provides a broad platform of managed services to help you build, secure, and seamlessly scale end-to-end Big Data applications quickly and with ease. Want to get ramped up on how to use Amazon's big data web services? Learn when to use which service? Want to write your first big data application on AWS? Join us in this session as we discuss reference architecture, design patterns, and best practices for pulling together various AWS services to meet your big data challenges.
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAmazon Web Services
Unni Pillai, Specialist Solution Architect, ASEAN, AWS.
Daniel Muller, Head of Cloud Infrastructure, Spuul.
As the volume and types of data continues to grow, customers often have valuable data that is not easily discoverable and available for analytics. A common challenge for data engineering teams is architecting a data lake that can cater to the needs of diverse users - from developers to business analysts to data scientists.
In this session, we will dive deep into building a data lake using Amazon S3, Amazon Kinesis, Amazon Athena and AWS Glue. We will also see how AWS Glue crawlers can automatically discover your data, extracting and cataloguing relevant metadata to reduce operations in preparing your data for downstream consumers.
Furthermore, learn from our customer Spuul, on how they moved from a Data Warehouse based analytics to a serverless data lake. Why and how did Spuul undertake this journey? Hear about the benefits and challenges they encountered.
Cloud computing gives you a number of advantages, such as the ability to scale your web application or website on demand. If you have a new web application and want to use cloud computing, you might be asking yourself, "Where do I start?" Join us in this session to understand best practices for scaling your resources from zero to millions of users. We show you how to best combine different AWS services, how to make smarter decisions for architecting your application, and how to scale your infrastructure in the cloud.
Join us for an in-depth look at the current state of big data at AWS. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data developments.
NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computingAmazon Web Services
AWS Batch is a fully-managed service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads of any scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there is no need to install or manage batch computing software, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2, Spot Instances, and AWS Lambda. AWS Batch reduces operational complexities, saving time and reducing costs. In this session, Principal Product Managers Jamie Kinney and Dougal Ballantyne describe the core concepts behind AWS Batch and details of how the service functions. The presentation concludes with relevant use cases and sample code.
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
Real-time Analytics using Data from IoT Devices - AWS Online Tech TalksAmazon Web Services
Learning Objectives:
- Learn the different options available to stream data from IoT sensors to AWS
- Understand how to architect an analytics solution using AWS services to ingest and process IoT data
- Take away best practices for building IoT applications with scalability, cost-effectiveness, and security
An overview of Amazon Kinesis Firehose, Amazon Kinesis Analytics, and Amazon Kinesis Streams so you can quickly get started with real-time, streaming data.
Today organizations find themselves in a data rich world with a growing need for increased agility and accessibility of all this data for analysis and deriving keen insights to drive strategic decisions. Creating a data lake helps you to manage all the disparate sources of data you are collecting, in its original format and extract value. In this session learn how to architect and implement an Analytics Data Lake. Hear customer examples of best practices and learn from their architectural blueprints.
Data comes in a variety of forms and in order to gain insight from this data you need to have the right platform in place. AWS has the services to cover all types of data, whether you need databases for structured data, Hadoop for unstructured data or a streaming engine for high-velocity data. In this session we will cover the various data analytics services on AWS and when to use them.
Big Data and Analytics – End to End on AWS – Russell NashAmazon Web Services
In this session we will look at the common patterns for the ingest, storage, processing and analysis of different types of data on the AWS platform and illustrate how you can harness the power and scale of the cloud to drive innovation in your own business.
BDA307 Real-time Streaming Applications on AWS, Patterns and Use CasesAmazon Web Services
In this session, you will learn best practices for implementing simple to advanced real-time streaming data use cases on AWS. First, we will review decision points on near real-time versus real time scenarios. Next, we will take a look at streaming data architecture patterns that include Amazon Kinesis Analytics, Amazon Kinesis Firehose, Amazon Kinesis Streams, Spark Streaming on Amazon EMR, and other open source libraries. Finally, we will dive deep into the most common of these patterns and cover design and implementation considerations.
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
"The only real mistake is the one from which we learn nothing.” So how do we learn from system failures? This session will move beyond “blameless” postmortems and show how to use data to avoid and mitigate future failures. We will share the best practices for gathering systems-related data and people-related data. You will then learn how to apply the data to formulate actionable response plans and avoid repeating failures.
This session is brought to you by AWS Summit New York City sponsor, Datadog."
Taking the Performance of your Data Warehouse to the Next Level with Amazon R...Amazon Web Services
Amazon Redshift gives you fast SQL query performance on large data sets. We will discuss optimisation from end to end, all the way from loading through to querying to ensure your end users get the data they need, when they need it.
Speaker: Russell Nash, Solutions Architect, Amazon Web Services
Featured Customer - Domain
(ISM213) Building and Deploying a Modern Big Data Architecture on AWSAmazon Web Services
"The AWS platform enables large enterprises to use data to solve business problems and uncover opportunities more easily and affordably than ever before. However, to truly take advantage of AWS, enterprises need a way to collect, store, process, analyze, and continually execute on their data.
Datapipe has been an AWS partner for more than five years. In that time, it has developed a proprietary process for the deployment of AWS environments, as well as the processing and evaluation of big data analytics to optimize these environments over time. This flexible solution includes automation tools, continuous monitoring, and cloud analytics. It protects against architectural sprawl and continually redesigns for scalability. This kind of continuous build environment allows Datapipe to examine the AWS environment as a complete picture and ensure the cloud environment is running as efficiently and effectively as possible, ultimately reducing overhead costs for the enterprise.
In this session, Jason Woodlee, Senior Director of Cloud Products at Datapipe, will discuss the technical details of designing and deploying a modern big data architecture on AWS, including application purpose and design, development environment and language overview, DevOps automation best practices, and continuous build and test frameworks. Session sponsored by Datapipe."
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes, and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
Running Lean Architectures: How to Optimize for Cost Efficiency Amazon Web Services
Whether you’re a cash-strapped startup or an enterprise optimizing spend, it pays to run cost-efficient architectures on AWS. This session reviews a wide range of cost planning, monitoring, and optimization strategies, featuring real-world experience from AWS customers. We’ll cover how you can effectively combine EC2 On-Demand, Reserved, and Spot instances to handle different use cases, leveraging auto scaling to match capacity to workload, choosing the most optimal instance type through load testing, taking advantage of multi-AZ support, and using CloudWatch to monitor usage and automatically shut off resources when not in use. We'll discuss taking advantage of tiered storage and caching, offloading content to Amazon CloudFront to reduce back-end load, and getting rid of your back end entirely, by leveraging AWS high-level services. We will also showcase simple tools to help track and manage costs, including the AWS Cost Explorer, Billing Alerts, and Trusted Advisor. This session will be your pocket guide for running cost effectively in the Amazon cloud.
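As a back-of-the-envelope illustration of the purchasing options discussed above, the sketch below compares hypothetical monthly costs for one instance; the hourly rates are placeholders for illustration, not actual AWS prices:

```python
# Illustrative comparison of EC2 purchasing options for a steady workload.
# The hourly rates below are hypothetical placeholders, not real AWS prices.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate: float, utilization: float = 1.0) -> float:
    """Cost of one instance for a month at the given utilization (0..1)."""
    return hourly_rate * HOURS_PER_MONTH * utilization

on_demand = monthly_cost(0.10)       # always on, On-Demand
reserved  = monthly_cost(0.06)       # discounted rate for a 1-year commitment
spot      = monthly_cost(0.03, 0.5)  # interruptible, used for flexible batch work

print(f"On-Demand: ${on_demand:.2f}/month")
print(f"Reserved:  ${reserved:.2f}/month")
print(f"Spot (50% duty cycle): ${spot:.2f}/month")
```

The same arithmetic generalizes to a fleet: steady baseline load favors Reserved capacity, while interruptible batch work can ride Spot pricing.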
Best Practices for Integrating Active Directory with AWS WorkloadsAmazon Web Services
Active Directory (AD) is essential for Windows workloads in the cloud. AWS offers customers multiple ways to integrate AD with cloud workloads like EC2, RDS, and AWS Enterprise Applications: AWS Directory Service for Microsoft Active Directory (Enterprise Edition) as a managed service and Active Directory running on AWS EC2 Windows instances. Which option is right for you? This session will discuss the key deployment considerations for each option to help you identify which best meets your project goals, and the effort involved. The session will cover options for integrating with your on-premises directory, port and security considerations, application considerations, and best practices.
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Lucas Jellema
This presentation gives a brief overview of the history of relational databases, ACID, and SQL, and presents some of their key strengths and potential weaknesses. It introduces the rise of NoSQL: why it arose, what it entails, and when to use it. The presentation focuses on MongoDB as a prime example of a NoSQL document store and shows how to interact with MongoDB from JavaScript (NodeJS) and Java.
Tracxn Research - Insurance Tech Landscape, February 2017Tracxn
Round count hit an all-time sector peak in seed (81), early (72), and late stage (17), with late stage deal activity registering the most growth (46%).
Big data today is a challenge to be managed, not a barrier to business growth. Data storage is relatively inexpensive, and with more transactions generated by social media, machines, and sensors, data volumes have grown piece by piece into petabytes.
These slides explain the challenges of Big Data (Volume, Velocity, and Variety) and offer solutions for managing them.
There are many tools that could help solve these problems, but the main tool covered in these slides is Apache Hadoop.
Business model navigator - 55 business model patterns
This presentation is adapted and based on working Paper “The St.Gallen Business Model Navigator” by Oliver Gassmann, Karolin Frankenberger, Michaela Csik
Tracxn Research - Industrial Robotics Landscape, February 2017Tracxn
A number of investments in 2016 were made by CVCs such as GE Ventures, Caterpillar, Medtronic, and Mitsubishi UFJ Capital, who envision robotic technology to be implemented in their area of expertise.
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION Elvis Muyanja
Today, data science is enabling companies, governments, research centres and other organisations to turn their volumes of big data into valuable and actionable insights. It is important to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. According to the McKinsey Global Institute, the U.S. alone could face a shortage of about 190,000 data scientists and 1.5 million managers and analysts who can understand and make decisions using big data by 2018. In coming years, data scientists will be vital to all sectors —from law and medicine to media and nonprofits. Has the African continent planned to train the next generation of data scientists required on the continent?
Amazon Web Services gives you fast access to flexible and low cost IT resources, so you can rapidly scale and build virtually any big data and analytics application including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity, and variety of data.
In this one-hour webinar, we will look at the portfolio of AWS Big Data services and how they can be used to build a modern data architecture.
We will cover:
Using different SQL engines to analyze large amounts of structured data
Analysing streaming data in near-real time
Architectures for batch processing
Best practices for Data Lake architectures
This session is suited for:
Solution and enterprise architects
Data architects/ Data warehouse owners
IT & Innovation team members
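One of the topics above, analysing streaming data in near-real time, typically rests on windowed aggregation. A minimal, self-contained sketch of a tumbling-window count (the clickstream events are made up; services such as Amazon Kinesis handle this pattern at scale):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping time windows
    and count occurrences of each key per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Simulated clickstream: (epoch seconds, page)
events = [(0, "home"), (3, "home"), (7, "cart"), (12, "home"), (14, "cart")]
print(tumbling_window_counts(events, window_seconds=10))
# {0: {'home': 2, 'cart': 1}, 10: {'home': 1, 'cart': 1}}
```

A streaming engine applies the same logic incrementally, emitting each window's counts as soon as the window closes.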
Webinar: Fighting Fraud with Graph DatabasesDataStax
Modern fraud detection poses significant engineering challenges, from managing ingestion at scale to analyzing fraud patterns in real time. We'll first take a look at how DataStax Enterprise Graph, powered by the industry’s best version of Apache Cassandra™, can meet those requirements and help you save the day.
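The "fraud ring" patterns a graph database surfaces can be illustrated with a tiny in-memory graph: accounts become linked when they share an attribute such as a phone number or device. This is a conceptual sketch in plain Python, not DataStax Enterprise Graph code, and all accounts and attributes are invented:

```python
from collections import defaultdict

def fraud_rings(accounts):
    """Find groups of accounts linked by a shared attribute (phone, device,
    address) -- a classic graph pattern behind fraud-ring detection."""
    by_attr = defaultdict(set)
    for account, attrs in accounts.items():
        for attr in attrs:
            by_attr[attr].add(account)

    # Union accounts that share any attribute (simple union-find).
    parent = {a: a for a in accounts}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for linked in by_attr.values():
        linked = list(linked)
        for other in linked[1:]:
            union(linked[0], other)

    rings = defaultdict(set)
    for a in accounts:
        rings[find(a)].add(a)
    return [sorted(r) for r in rings.values() if len(r) > 1]

accounts = {
    "acct1": {"phone:555-0100", "device:abc"},
    "acct2": {"phone:555-0100"},             # shares a phone with acct1
    "acct3": {"device:abc", "addr:1 Main"},  # shares a device with acct1
    "acct4": {"addr:9 Elm"},                 # unconnected
}
print(fraud_rings(accounts))  # [['acct1', 'acct2', 'acct3']]
```

A graph database runs this kind of traversal continuously over billions of edges; the toy version only shows the shape of the query.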
The AWS Workshop Series Online is a series of live webinars designed for IT professionals who are looking to leverage the AWS Cloud to build and transform their business, are new to the AWS Cloud or looking to further expand their skills and expertise. In this series, we will cover : "Build a Website on AWS for Your First 10 Million Users".
Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. With Cloud Spanner you enjoy all the traditional benefits of a relational database: ACID transactions, relational schemas (and schema changes without downtime), SQL queries, high performance, and high availability. But unlike any other relational database service, Cloud Spanner scales horizontally, to hundreds or thousands of servers, so it can handle the highest of transactional workloads.
Gain New Insights by Analyzing Machine Logs using Machine Data Analytics and BigInsights.
Half of Fortune 500 companies experience more than 80 hours of system downtime annually. Spread evenly over a year, that amounts to approximately 13 minutes every day. As a consumer, the thought of online bank operations being inaccessible so frequently is disturbing. As a business owner, when systems go down, all processes come to a stop. Work in progress is destroyed, and failure to meet SLAs and contractual obligations can result in expensive fees, adverse publicity, and the loss of current and potential future customers. Ultimately, the inability to provide a reliable and stable system results in lost revenue. While the failure of these systems is inevitable, the ability to predict failures in a timely manner and intercept them before they occur is now a requirement.
A possible solution to the problem can be found in the huge volumes of diagnostic big data generated at the hardware, firmware, middleware, application, storage, and management layers indicating failures or errors. Machine analysis and understanding of this data is becoming an important part of debugging, performance analysis, root cause analysis, and business analysis. In addition to preventing outages, machine data analysis can also provide insights for fraud detection, customer retention, and other important use cases.
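A minimal sketch of the kind of machine-log analysis described above: parse log lines, bucket ERROR entries into time windows, and flag windows where the error count spikes. The log format and threshold are assumptions for illustration:

```python
import re
from collections import Counter

LOG_LINE = re.compile(r"^(?P<ts>\d+) (?P<level>[A-Z]+) (?P<msg>.*)$")

def error_spikes(lines, window=60, threshold=2):
    """Count ERROR entries per fixed time window and return the windows whose
    error count meets the threshold -- a minimal failure-warning signal."""
    errors = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("level") == "ERROR":
            errors[int(m.group("ts")) // window * window] += 1
    return sorted(w for w, n in errors.items() if n >= threshold)

logs = [
    "10 INFO service started",
    "15 ERROR disk read failed",
    "20 ERROR disk read failed",
    "70 INFO heartbeat",
    "75 ERROR timeout",
]
print(error_spikes(logs))  # [0]  -- two errors in the first 60s window
```

Production systems replace the fixed threshold with learned baselines per component, but the windowed-counting core is the same.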
Modern Data Architectures for Business Insights at Scale Amazon Web Services
Using AWS to design and build your data architecture has never been easier to gain insights and uncover new opportunities to scale and grow your business. Join this workshop to learn how you can gain insights at scale with the right big data applications.
Building a real-time analytics solution has never been faster or more cost-efficient. Most organizations want to improve customer experience and respond to business events in real time, and to do so quickly, at a fraction of the price of traditional approaches. In this session we will look at how to use AWS services to best meet your real-time analytics needs.
Driving Business Insights with a Modern Data Architecture AWS Summit SG 2017Amazon Web Services
Your customers probably want a better experience with your brand. Your different business teams want and need better insights in their decision making. Almost certainly, your finance and operations teams require this to happen at a fraction of the cost of traditional on-premises options. Modern data architectures on AWS help many of our best customers realize all of those goals. Your business data contains critical information about customer behaviors, operational decisions, and many factors that have financial impact on your organization. Increasingly, this data sits beyond your transactional systems, and is too big, too fast, and too complex for existing systems to handle. AWS Data and Analytics services are designed from our customers' requirements to ingest, store, analyze, and consume information at record-breaking scale. In this session you will learn how these services work together to deliver business automation, enhance customer engagement and intelligence.
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Denodo
Watch full webinar here: [https://buff.ly/2FHWnMD]
Headquartered in New York City, Guardian Life is one of the largest mutual life insurance companies in the United States. Guardian offerings range from life insurance, disability income insurance, annuities, and investments to dental and vision insurance and employee benefits. The Enterprise Data Program was initiated to modernize Guardian’s technology capabilities and transform how Guardian leverages data – the Enterprise Data Lake was implemented to democratize data and drive self-service analytics throughout the organization. Data virtualization has played a key role for delivering data services through Guardian’s Enterprise Data Marketplace, a centralized portal for analytics and reporting.
Attend this session to learn:
Who is Guardian and what were the key drivers for building a data lake?
What are the data architectural patterns on the cloud?
How is data virtualization powering analytics and reporting?
Accelerate Self-Service Analytics with Data Virtualization and VisualizationDenodo
Watch full webinar here: https://bit.ly/3fpitC3
Enterprise organizations are shifting to self-service analytics as business users need real-time access to holistic and consistent views of data regardless of its location, source or type for arriving at critical decisions.
Data Virtualization and Data Visualization work together through a universal semantic layer. Learn how they enable self-service data discovery and improve performance of your reports and dashboards.
In this session, you will learn:
- Challenges faced by business users
- How data virtualization enables self-service analytics
- Use case and lessons from customer success
- Overview of the highlight features in Tableau
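The idea of a virtual semantic layer can be sketched in a few lines: rather than copying data into a warehouse, a view joins live sources on demand. The sources, records, and field names below are invented for illustration:

```python
# Two "sources" standing in for systems a data-virtualization layer federates.
# All names and records here are made up for illustration.
policies = [  # e.g. a policy-administration system
    {"policy_id": "P1", "customer_id": "C1", "product": "life"},
    {"policy_id": "P2", "customer_id": "C2", "product": "dental"},
]
customers = {  # e.g. a CRM
    "C1": {"name": "Ada", "state": "NY"},
    "C2": {"name": "Grace", "state": "CA"},
}

def virtual_policy_view():
    """Join the sources on demand, without copying data into a warehouse --
    the core idea behind a virtual (logical) data layer."""
    for p in policies:
        c = customers[p["customer_id"]]
        yield {"policy_id": p["policy_id"], "customer": c["name"],
               "state": c["state"], "product": p["product"]}

for row in virtual_policy_view():
    print(row)
```

A real virtualization engine adds query pushdown, caching, and security on top, but consumers see exactly this: one unified view over live sources.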
Driving Business Outcomes with a Modern Data Architecture - Level 100Amazon Web Services
Your business data contains critical information about customer behaviors, operational decisions, and many factors that have financial impact on your organisation. Increasingly though, this data is too big, too fast, and too complex for existing systems to handle. AWS Data and Analytics services are designed to ingest, store, analyse, and consume information at record-breaking scale. In this session you will learn how these services work together to deliver business automation, enhance customer engagement and intelligence.
Speaker: Craig Stires, APAC Business Development - Big Data & Analytics, Amazon Web Services
A Data Lake allows an organisation to store all of its data, structured and unstructured, in one centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you want to ask of your data. In this session we will explore the architecture of a Data Lake on AWS and cover topics such as storage, processing and security.
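One concrete data-lake convention worth knowing is Hive-style partitioned object keys on Amazon S3, which let query engines prune data by date instead of scanning everything. A small sketch (table and file names are illustrative):

```python
from datetime import date

def s3_key(table: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned object key, the common layout for
    data lakes on Amazon S3 (table and file names are illustrative)."""
    return (f"{table}/year={event_date.year}"
            f"/month={event_date.month:02d}"
            f"/day={event_date.day:02d}/{filename}")

print(s3_key("clickstream", date(2017, 3, 9), "part-0000.parquet"))
# clickstream/year=2017/month=03/day=09/part-0000.parquet
```

With this layout, a query filtered to one day touches only that day's prefix, which is what keeps scans cheap at petabyte scale.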
Hadoop 2.0: YARN to Further Optimize Data ProcessingHortonworks
Data is exponentially increasing in both types and volumes, creating opportunities for businesses. Watch this video and learn from three Big Data experts: John Kreisa, VP Strategic Marketing at Hortonworks, Imad Birouty, Director of Technical Product Marketing at Teradata and John Haddad, Senior Director of Product Marketing at Informatica.
Multiple systems are needed to exploit the variety and volume of data sources, including a flexible data repository. Learn more about:
- Apache Hadoop 2 and YARN
- Data Lakes
- Intelligent data management layers needed to manage metadata and usage patterns as well as track consumption across these data platforms.
Take Action: The New Reality of Data-Driven BusinessInside Analysis
The Briefing Room with Dr. Robin Bloor and WebAction
Live Webcast on July 23, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=360d371d3a49ad256942f55350aa0a8b
The waiting used to be the hardest part, but not anymore. Today’s cutting-edge enterprises can seize opportunities faster than ever, thanks to an array of technologies that enable real-time responsiveness across the spectrum of business processes. Early adopters are solving critical business challenges by enabling the rapid-fire design, development and production of very specific applications. Functionality can range from improved customer engagement to dynamic machine-to-machine interactions.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor, who will tout a new era in data-driven organizations, and why a data flow architecture will soon be critical for industry leaders. He’ll be briefed by Sami Akbay of WebAction, who will showcase his company’s real-time data management platform, which combines all the component parts needed to access, process and leverage data big and small. He’ll explain how this new approach can provide game-changing power to organizations of all types and sizes.
Visit InsideAnalysis.com for more information.
Sponsored by Data Transformed, the KNIME Meetup was a big success. Please find the slides for Dan's, Tom's, Anand's and Chhitesh's presentations.
Agenda:
Registration & Networking
Keynote – Dan Cox, CEO of Data Transformed
KNIME & Harvest Analytics – Tom Park
Office of State Revenue Case Study – Anand Antony
Using Spark with KNIME – Chhitesh Shrestha
Networking & Drinks
Over 90% of today’s data has been generated in the last two years, and growth rates continue to climb. In this session, we’ll step through challenges and best practices with data capturing, how to derive meaningful insights to help predict the future, and common pitfalls in data analysis.
Come discover how integrated solutions involving Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon Machine Learning/Deep Learning result in effective data systems for data scientists and business users, alike.
Accelerate Self-Service Analytics with Data Virtualization and VisualizationDenodo
Watch full webinar here: https://bit.ly/39AhUB7
Enterprise organizations are shifting to self-service analytics as business users need real-time access to holistic and consistent views of data regardless of its location, source or type for arriving at critical decisions.
Data Virtualization and Data Visualization work together through a universal semantic layer. Learn how they enable self-service data discovery and improve performance of your reports and dashboards.
In this session, you will learn:
- Challenges faced by business users
- How data virtualization enables self-service analytics
- Use case and lessons from customer success
- Overview of the highlight features in Tableau
A Winning Strategy for the Digital EconomyEric Kavanagh
The speed of innovation today creates tremendous opportunities for some, existential threats for others. Companies that win create their own success by leveraging modern data platforms. While architectures vary, the foundation is often in-memory, and the latency is real-time. Register for this Special Edition of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how today's data platforms enable the modern enterprise in groundbreaking ways. He'll be briefed by Chris Hallenbeck of SAP who will demonstrate how forward-looking companies are leveraging real-time data platforms to achieve operational excellence, make decisions faster, and find new ways to innovate.
Organizations often struggle to select and implement big data projects that produce meaningful results.
Learning from the successes and failures of other organizations will help you identify common pitfalls and get more value from your big data initiatives. A new study from 451 Research takes an in-depth look at six organizations and their cloud-based big data adoption efforts.
In this webinar, we will share some of the key findings from this research and see how organizations across a variety of industries use the Cloud to drive measurable value from big data. You will learn the challenges they faced, the tools they use to address these challenges, and the benefits of using AWS Cloud to develop and deploy big data solutions.
Learning Objectives:
Hear the experiences of organizations in a variety of industries, including a mobile technology analytics platform provider; a mobile application platform provider; a financial services regulator; a technology consultancy; a marketing strategy firm; and a mainstream financial services firm
Identify some of the challenges of deploying big data solutions
Learn 5 ways the Cloud delivers value for big data users
Understand the benefits of using the AWS Cloud to develop and deploy big data solutions
Who Should Attend:
Business & technical decision makers, architects and director-level or above of development for Big Data solutions, business analysts, data scientists, VP/Directors of engineering, CIOs, CTOs
How to build Forecasting services using ML algorithms and deep learn...Amazon Web Services
Forecasting is an important process for a great many companies and is used in various areas to try to accurately predict the growth and distribution of a product, the resources needed on production lines, financial projections, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session we will show how to pre-process data containing a temporal component and then use an algorithm that, based on the type of data analyzed, produces an accurate forecast.
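As a toy illustration of time-series forecasting (a deliberately simpler method than the ML and deep learning algorithms the session covers), simple exponential smoothing produces a one-step-ahead forecast from hypothetical demand data:

```python
def exponential_smoothing(series, alpha=0.5):
    """One-step-ahead forecast via simple exponential smoothing:
    forecast = alpha * actual + (1 - alpha) * previous_forecast."""
    forecast = series[0]
    for actual in series:
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast

# Hypothetical weekly demand for a product
demand = [100, 102, 101, 105, 107]
print(exponential_smoothing(demand, alpha=0.5))  # 105.0
```

The smoothing factor `alpha` trades responsiveness against noise: closer to 1 tracks recent values, closer to 0 averages over the full history.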
Big Data for Startups: how to create Big Data applications in Server...Amazon Web Services
The variety and volume of data created every day is accelerating ever faster and represents a unique opportunity to innovate and create new startups.
However, managing large amounts of data can seem complex: building large-scale Big Data clusters looks like an investment accessible only to established companies. But the elasticity of the Cloud and, in particular, Serverless services allow us to break through these limits.
Let's see, then, how it is possible to develop Big Data applications quickly, without worrying about infrastructure, dedicating all our resources to developing our ideas and creating innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session we will present the main features of the service and how to deploy your application in just a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing the pace of innovation. Over that period we learned how changing our approach to application development allowed us to greatly increase agility and release velocity and, ultimately, enabled us to build more reliable and scalable applications. In this session we will explain how we define modern applications and how building modern apps affects not only application architecture but also organizational structure, development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to spend up to 90% less with containers and Spot instances Amazon Web Services
The use of containers keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
The AWS services ECS, EKS, and Kubernetes on EC2 can take advantage of Spot instances, leading to average savings of 70% compared to On-Demand instances. In this session we will explore the characteristics of Spot instances and how they can easily be used on AWS. We will also learn how Spreaker uses Spot instances to run applications of various kinds, in production, at a fraction of the on-demand cost!
In recent months, many customers have been asking us how to monetise Open APIs, simplify Fintech integrations, and accelerate adoption of various Open Banking business models. Therefore, AWS and FinConecta would like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda :
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Make your startup's market offering unique with Machine Lea...Amazon Web Services
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative components built ad hoc.
AWS provides services that are ready to use and, at the same time, allows you to customize and create the differentiating elements of your offering.
Focusing on Machine Learning technologies, we will see how to select the artificial intelligence services offered by AWS and, including through a demo, how to build custom Machine Learning models using SageMaker Studio.
OpsWorks Configuration Management: automate the management and deployments of...Amazon Web Services
With the traditional approach to IT, implementing DevOps techniques was difficult for many years; until now they have often involved manual activities, occasionally leading to application downtime that interrupted users' work. With the advent of the cloud, DevOps techniques are now within everyone's reach, at low cost, for any kind of workload, guaranteeing greater system reliability and resulting in significant improvements to business continuity.
AWS provides AWS OpsWorks as a Configuration Management tool that aims to automate and simplify the management and deployment of EC2 instances by means of Chef and Puppet workloads.
Learn how to leverage AWS OpsWorks to guarantee the reliability of your applications installed on EC2 instances.
Microsoft Active Directory on AWS to support your Windows WorkloadsAmazon Web Services
Want to know your options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support group policy management, authentication, and authorization. In this session we will discuss options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and deploying Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment into the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis that leverages artificial intelligence techniques is evolving and being refined at a rapid pace. In this webinar we will explore the possibilities offered by AWS services for applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are hosting a free virtual event next Wednesday, October 14, from 12:00 to 13:00, dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a wide range of AWS services, taking full advantage of the AWS cloud while protecting existing VMware investments.
Build your first serverless ledger-based app with QLDB and NodeJSAmazon Web Services
Many companies today build applications with ledger-style functionality, for example to verify the history of credits and debits in banking transactions, or to track the flow of their products through the supply chain.
At the heart of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but they are complex and costly tools to manage.
Amazon QLDB eliminates the need to build complex custom systems by providing a fully managed serverless ledger database.
In this session we will discover how to build a complete serverless application that uses QLDB's capabilities.
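The cryptographic verifiability a ledger database provides can be illustrated with a hash chain: each record's digest covers both its entry and the previous digest, so any later tampering breaks verification. This is a conceptual sketch in plain Python, not the Amazon QLDB API:

```python
import hashlib
import json

def append(ledger, entry):
    """Append an entry whose digest covers the entry and the previous digest,
    making any later tampering detectable."""
    prev = ledger[-1]["digest"] if ledger else ""
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    ledger.append({"entry": entry, "digest": digest})

def verify(ledger):
    """Recompute the chain; returns True only if no entry was altered."""
    prev = ""
    for record in ledger:
        payload = json.dumps(record["entry"], sort_keys=True)
        if hashlib.sha256((prev + payload).encode()).hexdigest() != record["digest"]:
            return False
        prev = record["digest"]
    return True

ledger = []
append(ledger, {"account": "A-1", "debit": 25})
append(ledger, {"account": "A-1", "credit": 10})
print(verify(ledger))               # True
ledger[0]["entry"]["debit"] = 9999  # tamper with history
print(verify(ledger))               # False
```

A managed ledger database adds durable storage, concurrency control, and queryability on top, but this chained-digest property is the core guarantee.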
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for delivering an exceptional user experience to end users. In this session we will learn how to tackle modern API design challenges with GraphQL, an open source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dive into several scenarios, understanding how AppSync can help address these use cases by building modern APIs with real-time and offline data-update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Oracle databases and VMware Cloud™ on AWS: myths to debunkAmazon Web Services
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can create complexity during application modernization and refactoring, compounded by performance risks that can be introduced when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to ease and simplify the migration of Oracle workloads while accelerating cloud transformation; they dive into the architecture and demonstrate how to take full advantage of VMware Cloud™ on AWS.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies managing Docker containers through an orchestration layer controlling deployment and lifecycle. In this session we will present the service's main features, reference architectures for different workloads, and the simple steps needed to quickly migrate one or more of your containers.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
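The deployment bill of materials (DBOM) idea mentioned above can be sketched as a manifest of content digests, one per deployed artifact, plus a digest over the manifest itself. This is a conceptual sketch, not OpsMx's implementation; artifact names and contents are illustrative:

```python
import hashlib
import json

def deployment_bom(artifacts: dict) -> dict:
    """Capture a minimal deployment bill of materials: a content digest per
    artifact plus a digest over the whole manifest, giving the deployment a
    stable, tamper-evident fingerprint."""
    entries = {name: hashlib.sha256(content).hexdigest()
               for name, content in sorted(artifacts.items())}
    manifest = json.dumps(entries, sort_keys=True).encode()
    return {"artifacts": entries,
            "manifest_digest": hashlib.sha256(manifest).hexdigest()}

bom = deployment_bom({
    "app.jar": b"...compiled application bytes...",
    "config.yaml": b"replicas: 3\n",
})
print(bom["manifest_digest"][:16])  # short fingerprint of this exact deployment
```

Because any change to any artifact changes the manifest digest, comparing fingerprints across environments shows exactly when what is running drifts from what was approved.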
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
2. Data analysis for a better customer experience
• Your business creates and stores data and logs all the time
• Data points and logs allow you to understand the individual customer experience and improve it
• Analysis of logs and trails helps gain insights
7. Big Data: Unconstrained data growth
• 95% of the 1.2 zettabytes of data in the digital universe is unstructured
• 70% of this is user-generated content
• Unstructured data growth is explosive, with compound annual growth rate (CAGR) estimated at 62% from 2008 to 2012
Source: IDC
[Chart: data volume growth from gigabytes through terabytes, petabytes, and exabytes to zettabytes]
8. [Chart: the gap between generated data and data available for analysis, widening from 1990 to 2020]
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
10. Big Data:
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Data is a combination of structured and unstructured data in many formats
AWS Cloud:
• Virtually unlimited capacity
• Iterative, experimental usage is cost-effective through on-demand infrastructure
• Fully scalable infrastructure for highly variable workloads
• Tools and services for managing structured, unstructured, and streaming data
12. Driving Business Outcomes via Data Analytics
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure
13. Use an optimal combination of interoperable services
• Amazon Redshift: data warehouse
• Amazon Elastic MapReduce: semi-structured data
• Amazon Simple Storage Service: data storage
• Amazon Glacier: archive
• Amazon DynamoDB: NoSQL
• Amazon Machine Learning: predictive models
• Amazon Kinesis: streaming
• Other apps
14. Modern Data Architecture on AWS
1. Ingestion: Database Migration Service
2. Source data: S3 upload, Kinesis Firehose, DynamoDB Streams, Snowball, Snowball Edge, Snowmobile
Data store targets: S3, EFS, DynamoDB, RDS, EBS
3. Lifecycle management and cold storage: Glacier
4. Metadata capture: AWS Glue
5. Data governance, security, privacy
6. Self-service discovery, search, access
7. Managing data quality
8. Preparing for analytics: EMR
9. Orchestration and job scheduling
10. Capturing data changes
Analytics: Athena, EMR, Elasticsearch, Redshift, AI/machine learning, QuickSight
15. Outcome 1: Modernize and Consolidate
Insights to enhance business applications, new digital services
Technology: Backend system integration, on-prem data center extension, business application integration, BI provisioning, data lakes, external APIs, access control and logging
Common initiatives
Insights: 360 view of the business
• Legacy data systems migration to enable self-service for business analysts
• Integration of all customer data, from orders, payments, interactions
• Supplier performance for inventory and vendor management
Digitization: Web service that gives on-demand insights
• Delivery of digital content, with behavior tracking and upsell (or ads)
• Ordering system for enterprise customers or consumers
Data monetization: Enrich, aggregate, and sell business data
• External data enrichment API, including digital marketing platforms
• Purchasable data sets of anonymized, domain-enriched insights
16. Modernize and consolidate
Insights to enhance business applications, new digital services
[Diagram: data sources → ingest → speed (real-time) and scale (batch) processing → serving]
Enhancing business applications and creating new digital services takes a few steps. Business goals often include being an agile, well-run organization and no longer missing opportunities because people make decisions without accurate insights. These initiatives focus on giving important personas fast, secure access to business-relevant insights.
17. Modernize and consolidate
1. Define personas and use case requirements (including UI)
Personas: data analysts, business users, external buyers
18. Modernize and consolidate
2. Locate the data sources that have the information to extract
Data sources: transactions, web logs / cookies, ERP
19. Modernize and consolidate
3. Ingest data through incremental or full loads, across secure connections
Ingest: AWS Database Migration Service (changed data), AWS Direct Connect, AWS Storage Gateway, internet interfaces
20. Fluentd: Open Source Log Collection
https://github.com/fluent/fluentd/
• Fluentd is an open source data collector that unifies data collection and consumption
• Integrates with many data sources (app logs, syslogs, Twitter, etc.)
• Direct integration with AWS

<source>
  type tail
  format apache2
  path /var/log/apache2/access_log
  tag s3.apache.access
</source>

<match s3.*.*>
  type s3
  s3_bucket myweblogs
  path logs/
</match>
21. Modernize and consolidate
4. Use Hadoop for large-scale ETL, data quality, and preparation [*EMRFS]
Flow: raw data in Amazon S3 → Amazon EMR (ETL, with AWS Glue) → clean data in Amazon S3
22. Amazon S3
• Highly available object storage
• Designed for 99.999999999% annual data durability
• Replicated across 3 facilities
• Virtually unlimited scale
• Pay only for what you use; no need to pre-provision
• Event notifications can trigger further action
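The event-notification point above is worth making concrete. As a sketch (not the deck's own material): S3 bucket notifications are configured with a JSON document mapping object events to targets such as a Lambda function. The function ARN, prefix, and suffix below are illustrative placeholders; with boto3 the dict would be passed to `put_bucket_notification_configuration`.

```python
import json

def lambda_notification_config(function_arn, prefix="logs/", suffix=".gz"):
    """Build an S3 event-notification configuration that invokes a Lambda
    function whenever a matching object is created in the bucket."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": function_arn,  # hypothetical ARN for illustration
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix},
                {"Name": "suffix", "Value": suffix},
            ]}},
        }]
    }

# With boto3 this dict would be passed to:
# s3.put_bucket_notification_configuration(Bucket=..., NotificationConfiguration=cfg)
cfg = lambda_notification_config(
    "arn:aws:lambda:us-east-1:123456789012:function:process-log")
print(json.dumps(cfg, indent=2))
```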
23. Amazon EMR
• Amazon EMR is a fully managed Hadoop cluster
• Transient and long-running clusters
• Direct integration with Amazon S3
• Easy to scale, with burstable capacity
• Integration with the AWS Spot Market
24. 1 instance x 100 hours = 100 instances x 1 hour
(and with Spot Pricing not only faster but also cheaper)
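The arithmetic behind that equivalence is simple enough to sketch: per-instance-hour billing means total spend depends only on the instance-hour product, so parallelizing changes wall-clock time, not cost. The rates below are made-up figures for illustration, not actual EC2 prices.

```python
ON_DEMAND_RATE = 0.10  # hypothetical $/instance-hour, for illustration only
SPOT_RATE = 0.03       # hypothetical Spot price for the same instance type

def job_cost(instances, hours, rate):
    """Total job cost: identical whether you run 1 instance for 100 hours
    or 100 instances for 1 hour, since billing is per instance-hour."""
    return instances * hours * rate

serial = job_cost(1, 100, ON_DEMAND_RATE)
parallel = job_cost(100, 1, ON_DEMAND_RATE)
assert serial == parallel  # same spend, 100x faster wall-clock time

spot_parallel = job_cost(100, 1, SPOT_RATE)
print(f"on-demand: ${parallel:.2f}, spot: ${spot_parallel:.2f}")
```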
25. Amazon EMR
• Amazon EMR supports all common Hadoop frameworks, such as:
  • Spark, Pig, Hive, Hue, Oozie …
  • HBase, Presto, Impala …
• Decouples storage from compute, allowing independent scaling
• Direct integration with DynamoDB and S3
26. Modernize and consolidate
5. Stage all data into centralized, highly available, durable storage for further access
Flow: raw data in Amazon S3 → Amazon EMR (ETL) → clean data in Amazon S3 → staged data in Amazon S3 (data lake)
27. Modernize and consolidate
6. Load semi-structured data into Hadoop, structured data into the DWH, and application data into managed legacy application databases
Serving: Amazon EMR (semi-structured), Amazon Redshift (data warehouse), Amazon RDS (legacy apps)
28. Amazon Redshift
• Fully managed, petabyte-scale data warehouse
• Scalable number of cluster nodes
• ODBC/JDBC connectors for BI tools using SQL
• Loads data from Amazon DynamoDB and Amazon S3
• Less than a tenth of the cost of traditional solutions
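Loading from S3 into Redshift uses the COPY command. As a sketch of what such a statement looks like (table, bucket, prefix, and role ARN below are illustrative placeholders, not values from this deck):

```python
def redshift_copy_statement(table, bucket, prefix, iam_role_arn):
    """Build the SQL that bulk-loads S3 objects into a Redshift table.
    All names passed in here are hypothetical, for illustration."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "DELIMITER '\\t' GZIP;"  # tab-delimited, gzip-compressed input files
    )

sql = redshift_copy_statement(
    "weblogs", "myweblogs", "logs/2017/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole")
print(sql)
```

The statement would then be executed over the cluster's ODBC/JDBC connection like any other SQL.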
29. Modernize and consolidate
7. Data is protected through identity and access management and logging
Security: AWS IAM, AWS CloudTrail
30. Modernize and consolidate
8. Data analysts use the BI tools of their choice to access all serving services
BI: Amazon QuickSight
31. Amazon QuickSight
• Fast, cloud-powered BI service that makes it easy to build visualizations, perform ad hoc analysis, and get insights from data
• Connectors for files, third-party platforms, AWS services, and partner BI tools
• In-memory calculation engine (SPICE) accelerates analysis and visualization
• $9 per user per month
32.
33. AWS Marketplace
• Pre-configured machine images ready to launch into virtual server instances
• Launch applications with 1-click
• Pay software licenses by the hour or bring your own license (BYOL)
34. Modernize and consolidate
9. Business users have enterprise applications enhanced by analytics
35. Modernize and consolidate
10. External parties can buy services or data in a governed, secure way
Serving: Amazon API Gateway
36. Modernize and consolidate: the complete picture
Data sources (transactions, web logs / cookies, ERP) are ingested (AWS Database Migration Service, AWS Direct Connect, AWS Storage Gateway, internet interfaces) into Amazon S3 (raw, clean, and staged data / data lake), processed with Amazon EMR and AWS Glue, served through Amazon EMR, Amazon Redshift, Amazon RDS, Amazon Athena, Amazon QuickSight, and Amazon API Gateway, and secured with AWS IAM and AWS CloudTrail.
37. Decouple Storage and Compute
• Traditionally, analytical workloads required large databases or data warehouses, with storage and compute close to each other
• Big data often benefits from decoupling storage and compute
• Amazon S3 offers virtually unlimited storage at a per GB/month rate
38. Amazon Athena
Interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
• No need to move data: query S3 directly and right away
• No infrastructure to set up and manage
• Fast results, within seconds
• Pay for just the queries you run
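Querying S3 in place works by declaring an external table over the objects. As a sketch (table, bucket, and column names below are made up for illustration): the DDL tells Athena the schema, format, and S3 location, after which plain SELECTs run against the data.

```python
def athena_external_table(table, bucket, prefix, columns):
    """Build the CREATE EXTERNAL TABLE DDL that exposes CSV data already
    sitting in S3 to Athena. All names here are illustrative placeholders."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"LOCATION 's3://{bucket}/{prefix}';"
    )

ddl = athena_external_table(
    "flights", "my-flight-data", "2016/",
    [("origin", "STRING"), ("dest", "STRING"), ("dep_delay", "INT")])
print(ddl)
# The DDL and subsequent SELECTs would be submitted with boto3, e.g.:
# athena.start_query_execution(QueryString=ddl, ResultConfiguration={...})
```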
39. Athena & QuickSight Demo
Analyze past flight performance data stored in S3 (Bureau of Transportation Statistics flight data, www.transtats.bts.gov) and create visualizations from S3 with Athena & QuickSight.
Flow: Amazon S3 → Amazon Athena → Amazon QuickSight
40. Outcome 2: Innovate for new revenues
Personalization, demand forecasting, risk analysis
Technology: Advanced analytics, customer segmentation, high-volume transactional data, un/semi-structured data, design of experiments, A/B and hypothesis testing, machine learning
Common initiatives
Personalization: Refine market approaches based on optimal segments
• Offer products to new customers based on clusters of similar individuals
• Launch share-of-wallet initiatives, understanding likely total spend
• Targeted marketing to capture interests and increase conversion rates
Predict demand: Guide business owners to select the best scenarios
• Launch items or promotions at the optimal time to maximize response
• Modeling for store assortment, product selection, and merchandising
• New product design, based on known market propensities
Risk measurement: Create freedom to act by quantifying exposures
• Scenario simulation to encourage investments and new offerings
• Supply chain analytics allows for faster confirmation of goods to customers
41.
42. Innovate for new revenues
Personalization, demand forecasting, risk analysis
Driving net-new revenues is realized by business teams that have access to skilled analysts, using platforms that can scale up and out without IT bottlenecks. Organizations start operating based on what they know about their customers, and can approach new ventures in terms of confidence levels. Product launches, campaigns, supply chain management, packaged services, and customized offerings are designed and executed based on predictive models.
43. Innovate for new revenues
1. Personas involved in generating new revenues are data scientists, data analysts (often embedded), business users, and customers/suppliers (engagement platforms)
Monitoring and governance: Amazon CloudWatch, AWS IAM, AWS CloudTrail
44. Innovate for new revenues
2. Advanced analytics are built from a base of traditional data processing
Base: AWS Direct Connect, AWS Glue, Amazon S3 (raw, clean, and staged data / data lake), Amazon EMR (ETL), Amazon Redshift, Amazon RDS
45. Innovate for new revenues
3. On-premises storage and databases are connected and converted
Ingest: AWS Database Migration Service, AWS Storage Gateway
46. Innovate for new revenues
4. Internet-native data sources, like web and mobile, are captured
Sources: web logs / cookies, via internet interfaces
47. Innovate for new revenues
5. Streaming un/semi-structured data feeds, like social media and connected devices, are captured
Ingest: Amazon Kinesis
48. Stream in Real Time: Amazon Kinesis
• Real-time data processing over large distributed streams
• Elastic capacity that scales to millions of events per second
• React in real time to incoming stream events
• Reliable stream storage, replicated across 3 facilities
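A producer writes to a stream one record at a time. As a sketch (the stream name and event fields below are illustrative, not from this deck): each record carries a bytes payload and a partition key, and records sharing a partition key land on the same shard in order.

```python
import json

def kinesis_record(stream, event, partition_key):
    """Build the keyword arguments for a Kinesis put_record call.
    Stream name and event contents are hypothetical, for illustration."""
    return {
        "StreamName": stream,
        "Data": json.dumps(event).encode("utf-8"),  # Kinesis payloads are bytes
        "PartitionKey": partition_key,  # same key -> same shard -> ordered
    }

record = kinesis_record(
    "clickstream",
    {"user": "u-42", "action": "add_to_cart", "sku": "B0001"},
    partition_key="u-42")
# With boto3: boto3.client("kinesis").put_record(**record)
print(record["PartitionKey"])
```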
49. Innovate for new revenues
6. Log files and other schemaless data are converted to Parquet and staged
Storage: Amazon S3 (schemaless data)
50. Innovate for new revenues
7. Data scientists test hypotheses against un/semi-structured data
Tools: Amazon Athena, Amazon EMR, Amazon Elasticsearch
51. Innovate for new revenues
8. Simple analytical models are built with Amazon Machine Learning
52. Amazon Machine Learning
• Easy-to-use, managed machine learning service built for developers
• Machine learning technology based on Amazon's internal systems
• Create models using data stored in Amazon S3, Amazon RDS, or Amazon Redshift
• Request predictions in batch or real time
53. Innovate for new revenues
9. Complex analytical models are built against Amazon EMR (Spark) clusters, using MLlib
54. Apache Spark
• In-memory analytics cluster using RDDs (Resilient Distributed Datasets) for fast processing
• Spark MLlib offers machine learning out of the box
• Apache Spark can read directly from Amazon S3

from numpy import array
from pyspark.mllib.clustering import KMeans, KMeansModel

# Load space-separated feature vectors straight from S3
data = sc.textFile("s3://...")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Train a 2-cluster k-means model, then persist and reload it
model = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")
model.save(sc, "MyModel")
sameModel = KMeansModel.load(sc, "MyModel")
55. Machine Learning Algorithms
• Classification
  • Sentiment analysis: Do people like my new product?
• Linear regression
  • Trend prediction: How much revenue next month?
• Clustering
  • Recommendation: Other people bought this!
• Association
  • Market basket analysis: Bundled products
• Neural networks
  • Pattern recognition: Speech recognition
Run with: Amazon Machine Learning, Amazon EMR + Spark MLlib, GPU-optimized EC2 instances
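The "trend prediction" case above reduces to fitting a line through past observations. A minimal sketch, with made-up revenue figures purely for illustration: ordinary least squares for a single feature, then extrapolation one period ahead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Monthly revenue (hypothetical numbers), months 1..5
months = [1, 2, 3, 4, 5]
revenue = [100.0, 110.0, 120.0, 130.0, 140.0]
slope, intercept = fit_line(months, revenue)

# Predict month 6 by extrapolating the fitted trend
next_month = slope * 6 + intercept
print(next_month)  # → 150.0
```

At scale, the same idea runs as a distributed job, e.g. with Spark MLlib's regression routines over data in S3.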
56. Intel® Processor Technologies
• Intel® AVX: Dramatically increases performance for highly parallel HPC workloads such as life-science engineering, data mining, financial analysis, and media processing
• Intel® AES-NI: Enhances security with encryption instructions that reduce the performance penalty of encrypting/decrypting data
• Intel® Turbo Boost Technology: Increases computing power with performance that adapts to spikes in workloads
• Intel® Transactional Synchronization Extensions (TSX): Enables execution of independent transactions to accelerate throughput
• P-state & C-state control: Provides granular performance tuning for cores and sleep states to improve overall application performance
57. New X1 Instance: Tons of Memory
• Designed for large-scale, in-memory applications in the cloud
• Ideal for in-memory databases like SAP HANA and big data processing apps like Spark and Presto
• Powered by Intel® Xeon® E7 8880 v3 (Haswell) processors
• Up to 2 TB of memory and up to 128 vCPUs per instance
• 8x the memory offered by any other Amazon EC2 instance
58. Innovate for new revenues
10. Predictive models are published to data staging
59. Innovate for new revenues
11. Analysts use the DWH, EMR, and Elasticsearch to find patterns and measure performance
60. Innovate for new revenues
12. Risk models are evaluated to create new products and assess customers
61. Innovate for new revenues
13. Demand forecasts are loaded into supply chain management systems
62. Innovate for new revenues: personalization, demand forecasting, risk analysis
[Architecture diagram: data sources → ingest → S3 data lake (raw, staged, clean) → processing → serving]
14. Personalized offers are broadcast out over notification channels (Amazon SNS, Amazon Pinpoint)
63. Amazon SNS & Amazon Pinpoint
• Amazon SNS is a fully managed, cross-platform mobile push intermediary service, fully scalable to millions of devices
• Amazon Pinpoint lets you create targeted campaigns and measure engagement and results
• SNS delivers through the platform push services: Google GCM (Android phones and tablets), Apple APNS (iPhones and iPads, iOS), Amazon ADM (Kindle Fire devices), Baidu Cloud Push (Android phones and tablets in China), and Windows WNS and MPNS (Windows Phone devices)
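A minimal sketch of the payload SNS expects for cross-platform push: with `MessageStructure="json"`, the message is a JSON object keyed by platform, and each platform payload is itself a JSON-encoded string. The helper name and payload fields here are our own illustration:

```python
import json

def build_sns_push_message(text):
    """Build a cross-platform SNS push payload.

    With MessageStructure='json', SNS expects a JSON object whose keys
    name each platform; every platform payload is itself a JSON string.
    """
    apns = {"aps": {"alert": text}}          # Apple devices via APNS
    gcm = {"notification": {"body": text}}   # Android devices via GCM
    return json.dumps({
        "default": text,                     # fallback for unlisted platforms
        "APNS": json.dumps(apns),
        "GCM": json.dumps(gcm),
    })

# The resulting string would be passed to
# sns.publish(TargetArn=..., MessageStructure="json", Message=payload)
payload = build_sns_push_message("50% off today only")
```

Pinpoint adds campaign targeting and engagement metrics on top of this kind of delivery.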
64. Innovate for new revenues: personalization, demand forecasting, risk analysis
[Architecture diagram with Amazon SNS and Amazon Pinpoint highlighted in the serving layer]
65. Elastic GPUs for EC2: Use Graphics GPUs as if They Were EBS Volumes
66. Elastic GPUs: GPU acceleration on demand
[Diagram: an Elastic GPU attached to a current-generation EC2 instance]
70. Interactive customer experience, event-driven automation, fraud detection
Technology: Clickstream/mobile apps/sensor/video (computer vision)/audio (intent comprehension), event detection and pipelining, in-line scoring, serverless compute, computer vision, deep learning
Common initiatives
Interactive CX: Natural customer journeys with adaptive interfaces
• Behavior-based recommendations, improving personalization along the journey
• Seamless session transfer across UIs, from browser to mobile to physical location
• Voice-driven commands, and use of gestures and other natural interfaces
Event-driven automation: Full execution of business processes driven by an action
• Order fulfillment, with real-time update notifications to the customer
• Fast response to customer complaints/comments over direct or social channels
Fraud detection: Protect customer and business with real-time anomaly detection
• Purchase and payment verification, using behavioral models and location assessment
• Application and account-opening validation
Outcome 3: Real-time Engagement
73. The Power of Speech: Alexa
Alexa, the voice service that powers
Echo, provides capabilities, or skills,
that enable customers to interact with
devices using voice
Alexa Skills Kit (ASK) allows everyone
to build and publish their own skills
Skills can be powered by AWS
Lambda
74. Build your own Alexa Skill!
[Diagram: Amazon Echo → Alexa Skills Kit → AWS Lambda → Facebook Page]
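A skill's Lambda backend is just a function that receives an Alexa request and returns a speech response. A minimal sketch, assuming a hypothetical `PostToPage` intent for the Facebook Page demo above (the response envelope follows the Alexa Skills Kit JSON format):

```python
def lambda_handler(event, context):
    """Minimal Alexa skill handler for a hypothetical PostToPage intent."""
    req = event.get("request", {})
    if req.get("type") == "IntentRequest" and req["intent"]["name"] == "PostToPage":
        speech = "Okay, I posted that to your page."
    else:
        speech = "Welcome! Ask me to post something to your page."
    # Alexa expects this response envelope back from the skill endpoint
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```

The actual posting logic (e.g., calling the Facebook Page API) would go where the speech string is built.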
76. Amazon Polly
Turn text into lifelike speech, using deep learning technologies to synthesize speech that sounds like a human voice.
• Returns an MP3 or audio stream
• Unlimited replays
• Lightning-fast response
• Fully managed and low cost
77. Amazon Polly: Text In, Lifelike Speech Out
Input text: "The temperature in WA is 75°F" → spoken output: "The temperature in Washington is 75 degrees Fahrenheit"
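The expansion shown above is text normalization, which Polly performs internally before synthesis. A toy function of our own to illustrate the idea (with boto3 you would simply call `polly.synthesize_speech` and let the service handle it):

```python
import re

# Our own tiny illustration of text normalization, not Polly's implementation
ABBREVIATIONS = {"WA": "Washington"}

def normalize(text):
    """Expand units and abbreviations the way a TTS front end would."""
    text = re.sub(r"(\d+)\s*°F", r"\1 degrees Fahrenheit", text)
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(r"\b%s\b" % abbr, full, text)
    return text

normalize("The temperature in WA is 75°F")
# With the real service:
# polly.synthesize_speech(Text=..., OutputFormat="mp3", VoiceId="Joanna")
```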
78. Amazon Lex
Conversational interfaces for your applications, powered by the same Natural Language Understanding (NLU) and Automatic Speech Recognition (ASR) models as Alexa.
• Fully managed
• Integrated development in the AWS console
• Trigger AWS Lambda functions
• Multi-step conversations
• Continually improving ASR & NLU models
• Enterprise connectors
79. Intents: a particular goal that the user wants to achieve (e.g., BookHotel)
Utterances: spoken or typed phrases that invoke your intent
Slots: data the user must provide to fulfill the intent
Prompts: questions that ask the user to input data
Fulfillment: the business logic required to fulfill the user's intent
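These pieces fit together as a prompt-and-fill loop. A toy simulation of our own (not the Lex API) for the BookHotel example, with made-up slot names:

```python
def elicit_slots(required_slots, prompts, answers):
    """Ask a prompt for each unfilled slot until the intent can be fulfilled."""
    slots = {name: None for name in required_slots}
    transcript = []
    for name in required_slots:
        if slots[name] is None:
            transcript.append(prompts[name])   # the Prompt for this Slot
            slots[name] = answers[name]        # the user's reply fills it
    return slots, transcript

# An utterance like "Book me a hotel" invokes the BookHotel intent,
# then the bot elicits each missing slot:
slots, transcript = elicit_slots(
    required_slots=["City", "CheckInDate", "Nights"],
    prompts={
        "City": "What city will you be staying in?",
        "CheckInDate": "What day do you want to check in?",
        "Nights": "How many nights?",
    },
    answers={"City": "Chicago", "CheckInDate": "2017-03-17", "Nights": "2"},
)
# Fulfillment: with all slots filled, business logic (e.g., a Lambda
# function) can actually book the hotel.
```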
80. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
Provide superior customer service by responding to opportunities in real time. Fulfilling requests for products or services in an automated fashion creates a strong competitive advantage over those that are unable to. Assurance becomes a different challenge as speeds increase, and fraud prevention must be adaptive and fast. Adding another layer of opportunity and complexity is the use of vast streams of data from devices that measure location, video, behavior, environmental conditions, and more.
81. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Diagram: personas (data analysts, data scientists, business users) with engagement platforms and automation/event channels]
1. Real-time engagement requires personas that develop the analytics, and platforms for engaging and automating processes
82. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram: data sources → ingest → S3 data lake (raw, staged, clean) → processing → serving]
2. Real-time systems are built from a base of advanced data processing
83. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram with Amazon Kinesis highlighted in the ingest layer]
3. Events are pipelined through Kinesis, into multiple streams, at scale
84. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram with Amazon EMR highlighted between the Kinesis stream and the S3 stream-data store]
4. Event data is given context and structure in EMR and pushed for batch
85. Also possible with Spark Streaming!
Counting tweets on a sliding window (Amazon Kinesis → EMR with Spark Streaming):
KinesisUtils.createStream("twitter-stream")
  .filter(_.getText.contains("Big Data"))
  .countByWindow(Seconds(5))
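The same sliding-window count can be sketched in plain Python (a toy version of our own, not Spark's API) to show what `countByWindow` is doing:

```python
from collections import deque

def count_by_window(events, window_seconds=5):
    """Count matching events in a sliding time window, like countByWindow(Seconds(5)).

    `events` is an iterable of (timestamp, text) pairs in arrival order;
    yields the number of matching events seen in the last `window_seconds`.
    """
    window = deque()
    for ts, text in events:
        if "Big Data" in text:                     # same filter as the Spark example
            window.append(ts)
        while window and window[0] <= ts - window_seconds:
            window.popleft()                       # expire events outside the window
        yield len(window)

counts = list(count_by_window([
    (0, "Big Data rocks"),
    (2, "unrelated tweet"),
    (3, "more Big Data"),
    (9, "Big Data again"),   # by now the tweet at t=0 has left the 5s window
]))
```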
86. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram with Amazon Kinesis Firehose added between the stream and the data warehouse]
5. Kinesis Firehose pumps events into a DWH for near real-time analysis
87. Amazon Kinesis Firehose
• Fully managed data streaming service to ingest and capture data into your storage or data warehouse
• Ability to batch, compress, or encrypt streaming data before loading
• Elastic, scaling to any throughput (no more sharding)
• Charged only per GB processed ($0.035 per GB)
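The batching and compression Firehose performs between ingest and delivery can be sketched as a local simulation (our own code; with the real service you push individual records via `firehose.put_record` and it handles this for you):

```python
import gzip
import json

def batch_and_compress(records, max_batch=500):
    """Group records into batches and gzip each batch before delivery,
    the way Firehose can before writing to S3 or Redshift (sketch)."""
    batches = []
    for i in range(0, len(records), max_batch):
        # newline-delimited JSON is a common delivery format
        blob = "\n".join(json.dumps(r) for r in records[i:i + max_batch])
        batches.append(gzip.compress(blob.encode("utf-8")))
    return batches

# With the real service:
# firehose.put_record(DeliveryStreamName=..., Record={"Data": ...})
events = [{"user": n, "action": "click"} for n in range(1200)]
batches = batch_and_compress(events)   # three batches of at most 500 records
```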
88. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram with an event-scoring AWS Lambda function added to the stream path]
6. The event is streamed to a scoring server for processing
89. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram with Amazon AI services added alongside the event-scoring Lambda function]
7. Language, intent, and image processing are run and sent for scoring
90. Amazon Rekognition
Image recognition and analysis powered by deep learning, which allows you to search, verify, and organize millions of images.
• Easy to use
• Batch analysis
• Real-time analysis
• Continually improving
• Low cost
93. Serverless Rekognition Demo
A serverless website that uses Rekognition to identify faces and classify pictures.
[Diagram: Mobile → Amazon API Gateway → AWS Lambda → Amazon Rekognition, with Amazon S3 and Amazon DynamoDB for storage]
CodeFor.Cloud/image
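The classification step in such a demo can be sketched with a helper of our own that inspects a response in the shape `rekognition.detect_faces` returns (the category names are illustrative):

```python
def summarize_faces(detect_faces_response):
    """Classify a picture from a Rekognition detect_faces-style response.

    The real call is rekognition.detect_faces(Image=..., Attributes=["ALL"]);
    this helper only inspects the returned FaceDetails list.
    """
    faces = detect_faces_response.get("FaceDetails", [])
    smiling = sum(1 for f in faces if f.get("Smile", {}).get("Value"))
    return {
        "faces": len(faces),
        "smiling": smiling,
        "category": "portrait" if faces else "no-people",
    }

# A synthetic response in the service's shape:
sample = {"FaceDetails": [
    {"Smile": {"Value": True, "Confidence": 99.1}},
    {"Smile": {"Value": False, "Confidence": 87.4}},
]}
summary = summarize_faces(sample)
```

In the demo architecture, a Lambda function behind API Gateway would run this kind of logic and persist the summary to DynamoDB.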
94. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram linking the event-scoring Lambda function to Amazon Machine Learning]
8. Simple analytical models are checked on-demand against Amazon ML
95. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram linking the event-scoring path to coded models]
9. Complex analytical models are scored against coded models (PMML)
96. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram with a second AWS Lambda function handling the scored response]
10. Scored response to the event is processed to be pushed for action
97. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram with Amazon DynamoDB added to the serving layer]
11. Recommendations are pushed to DynamoDB for low latency serving
98. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Architecture diagram with Amazon RDS and Amazon SQS added to the serving layer]
12. Actions are pushed to RDS and SQS for business process automation
99. Real-time engagement: interactive customer experience, event-driven automation, fraud detection
[Complete real-time engagement architecture diagram: sources → Kinesis ingest → S3 data lake → EMR, Amazon AI, and Lambda event scoring → DynamoDB, RDS, and SQS serving]
100. Demo: Live Twitter Feed Analysis
Twitter Stream → Amazon Kinesis → AWS Lambda → Amazon Elasticsearch Service
Twitter Blog* reported that on a typical day (in 2013):
• More than 500 million Tweets sent
• Average 5,700 TPS
* https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
104. Robinhood Launches Popular No-fee Brokerage Trading Platform on AWS
Robinhood is an investment platform that offers free trades for everyone. It is based in Palo Alto, CA.
• Robinhood's lean staff used AWS to create a massively scalable securities trading app with strong built-in security and compliance features that supported hundreds of thousands of users at launch
• Saved customers $22 million in commissions since launch, and transacted over $1 billion. All of this scaled up with 2 DevOps resources
• Amazon Redshift has allowed the data science team to identify fraud and fight money laundering, without needing to hire a data science infrastructure team
"We can look at real-time analytics and behaviors on our platform, that wouldn't be available at our scale if we weren't using AWS."
- Miles Wellesley, Head of Business Development
105. Automation of self-service, deployment, policy, and quality assurance
Technology: Self-service, on-demand provisioning, DevOps, spot pricing, CloudFormation, security automation, performance monitoring (CloudWatch & X-Ray), global rollouts
Common initiatives
Self-service:
• Application catalog or portal for all employees, availability determined by role
• Service provisioning backed by automation of policy and governance
Agile development: Use of DevOps to allow very few resources to deploy globally
• CI/CD for software release, build/test, and deployment automation
• Templated infrastructure provisioning, and configuration management
• Business rules and policies are "gold coded" to be used for all deployments
• Use of Security by Design (SbD) to codify network, OS, and encryption
Comprehensive monitoring: Assurance of SLAs and issue remediation
• Logging and monitoring of all API calls and executions to ensure SLAs are met
• Analysis of performance variance for faster root-cause analysis
Outcome 4: Automate for expansive reach
106. Automate for expansive reach: automation of self-service, deployment, policy, and quality assurance
[Architecture diagram: the full data platform, provisioned and operated with AWS DevOps tooling]
107. AWS Glue
Easily understand your data sources, prepare the data, and load it reliably into data stores and your analytics pipeline.
Integrated with: S3, RDS, Redshift, and any JDBC-compliant data store
111. AWS Lambda
• Use AWS Lambda to clean and massage incoming data
• Write code to load data sources (S3, DynamoDB) automatically into your data warehouse (e.g., Amazon Redshift)
• React in real time to incoming events in Amazon Kinesis
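A minimal handler of this kind can be sketched as follows. Lambda receives Kinesis records base64-encoded, so the cleaning step starts by decoding them; the field names in the synthetic event below are illustrative:

```python
import base64
import json

def lambda_handler(event, context):
    """Clean and massage incoming Kinesis records.

    Kinesis hands Lambda base64-encoded record data; after decoding,
    the cleaned rows could be loaded into Amazon Redshift or S3.
    """
    cleaned = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        payload["user"] = payload.get("user", "").strip().lower()  # massage
        cleaned.append(payload)
    return cleaned

# A synthetic Kinesis event, in the shape Lambda receives:
fake_event = {"Records": [{"kinesis": {
    "data": base64.b64encode(json.dumps({"user": "  Alice "}).encode()).decode()
}}]}
rows = lambda_handler(fake_event, None)
```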
112. AdRoll: AWS Lambda for log files
• Cross-platform, cross-device advertising platform
• Offers retargeting based on clickstream data
• 300 TB of new data per month
"Polling is not a scalable strategy to figure out when new files are added to S3, especially when you add 17M of them per month. So we moved Lambda in front of S3."
- Valentino Volonghi, CTO, AdRoll
113. Remember, everything is an API: SDKs are available for Java, Python (boto), PHP, .NET, Ruby, Node.js, JavaScript, iOS, Android, Go, and C++.
114. Affordable Petabyte-scale Analytics
AWS helps customers maximize the value of Big Data investments while reducing overall IT costs:
• Amazon S3 (secure, highly durable storage): $28.16 / TB / month
• Amazon Glacier (data archiving): $7.16 / TB / month
• Amazon Kinesis (real-time streaming data load): $0.035 / GB
• Amazon EMR (10-node Spark cluster): $0.15 / hr
• Amazon Redshift (petabyte-scale data warehouse): $0.25 / hr
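Using the list prices above, a back-of-envelope monthly estimate for a hypothetical deployment (the workload sizes are our own illustrative numbers, not from the slide):

```python
HOURS_PER_MONTH = 730  # average hours in a month

s3 = 100 * 28.16                  # 100 TB of active data in S3
glacier = 400 * 7.16              # 400 TB archived in Glacier
kinesis = 5_000 * 0.035           # 5,000 GB streamed in per month
emr = 0.15 * HOURS_PER_MONTH      # 10-node Spark cluster, always on
redshift = 0.25 * HOURS_PER_MONTH # data warehouse node, always on

total = s3 + glacier + kinesis + emr + redshift  # monthly total in USD
```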
115. Call To Action
• Attend the official AWS Training courses organized by AWS authorized local training partner Iverson Associates Sdn Bhd (www.iverson.com.my).
• Join the AWS Jumpstart (2 hr) session and hear from our customers and partners how they enabled their teams and successfully deployed on AWS. Also stand a chance to win a free seat in the above courses.
• Point of contact: Cheryl Wong - cheryl.wong@iverson.com.my
Courses and dates:
• Architecting on AWS: 28 Feb - 2 March
• Systems Operations on AWS: 8-10 March
• Developing on AWS: 15-17 March
• Big Data on AWS: 19-21 April
Date and venue: 17 Mar 2017, Iverson Associates Sdn Bhd (303330-M), Suites T113-T114, 3rd Floor, Centrepoint, Lebuh Bandar Utama, Bandar Utama, 47800 Petaling Jaya, Selangor