Advanced Webinar Series | Session 8
Thursday, July 9, 2015 | 2:00 PM
http://aws.amazon.com/ko
Getting Started with Your First Big Data Project on AWS

Ilho Kim, Solutions Architect

What you'll hear in this webinar
This session introduces how to build simpler, faster big data analytics services using the data analysis tools AWS provides, including Amazon Elastic MapReduce, Amazon Redshift, and Amazon Kinesis.
Agenda
•  AWS Big data building blocks
•  AWS Big data platform
•  Log data collection & storage
•  Introducing Amazon Kinesis
•  Data Analytics & Computation
•  Collaboration & sharing
•  Netflix Use-case
AWS Big data building blocks (brief)
Use the right tools
Amazon
S3
Amazon
Kinesis
Amazon
DynamoDB
Amazon
Redshift
Amazon
Elastic
MapReduce
Store anything
Object storage
Scalable
99.999999999%
durability
Amazon
S3
Real-time processing
High throughput; elastic
Easy to use
EMR, S3, Redshift,
DynamoDB Integrations
Amazon
Kinesis
NoSQL Database
Seamless scalability
Zero admin
Single digit millisecond
latency
Amazon
DynamoDB
Relational data warehouse
Massively parallel
Petabyte scale
Fully managed
$1,000/TB/Year
Amazon
Redshift
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3,
DynamoDB, and Kinesis
Amazon
Elastic
MapReduce
HDFS
Amazon
Redshift
Amazon
RDS
Amazon S3 Amazon
DynamoDB
Amazon EMR
Amazon
Kinesis
AWS Data Pipeline
Data management: Hadoop ecosystem analytical tools
Data sources
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon
DynamoDB
Amazon
RDS
Amazon
Redshift
AWS
Direct Connect
AWS
Storage Gateway
AWS
Import/ Export
Amazon
Glacier
S3
Amazon
Kinesis
Amazon EMR
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon EC2
Amazon EMR
Amazon
Kinesis
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon Redshift
Amazon DynamoDB
Amazon RDS
S3
Amazon EC2
Amazon EMR
Amazon CloudFront
AWS CloudFormation
AWS Data Pipeline
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
The right tools.
At the right scale.
At the right time.
AWS Big data platform
v	
  
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
v	
  
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
v	
  
Collection of Data
Sources: web servers, application servers, connected devices, mobile phones, etc.
Aggregation tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, or a queue)
Data sink: a reliable and durable destination (or destinations)
Types of Data Ingest
•  Transactional
–  Database reads/writes
•  File
–  Click-stream logs
•  Stream
–  Click-stream logs
Database · Cloud storage · Stream storage
Run your own log collector
Your application on Amazon EC2 → Amazon S3 → DynamoDB / any other data store
Use a Queue
Amazon Simple Queue Service (SQS) → Amazon S3 → DynamoDB / any other data store
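As a rough sketch of the queue-decoupled pattern above (producers enqueue log records, a worker drains the queue and persists batches to a durable sink), using only the standard library and hypothetical names, with a plain list standing in for S3/DynamoDB:

```python
import queue

def collect_logs(q, sink, batch_size=3):
    """Drain the queue, flushing records to the sink in batches.

    Mirrors the slide's pattern: producers enqueue, a worker
    persists batches to a durable store (here just a list).
    """
    batch = []
    while True:
        try:
            record = q.get_nowait()
        except queue.Empty:
            break
        batch.append(record)
        if len(batch) >= batch_size:
            sink.append(list(batch))
            batch.clear()
    if batch:  # flush the final partial batch
        sink.append(list(batch))
    return sink

q = queue.Queue()
for i in range(7):            # producers: 7 log lines
    q.put(f"log-{i}")
sink = collect_logs(q, [])    # worker: batches of 3
print([len(b) for b in sink])  # → [3, 3, 1]
```

The queue absorbs bursts, so producers never block on the sink; with SQS the same shape holds, but the queue is durable and shared across worker fleets.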
Agency Customer: Video Analytics on AWS
Elastic Load
Balancer
Edge Servers
on EC2
Workers on
EC2
Logs Reports
HDFS Cluster
Amazon Simple Queue Service
(SQS)
Amazon Simple Storage Service (S3)
Amazon Elastic MapReduce
Use a tool like Flume, Kafka, Honu, etc.
Flume running
on EC2
Amazon S3
Any other data store
HDFS
Choice of tools
•  (+) Pros / (−) Cons
•  (+) Flexibility: customers select the most appropriate software and underlying infrastructure
•  (+) Control: software and hardware can be tuned to meet specific business and scenario needs
•  (−) Ongoing operational complexity: deploying and managing an end-to-end system
•  (−) Infrastructure planning and maintenance: managing a reliable, scalable infrastructure
•  (−) Developer/IT staff overhead: developer, DevOps, and IT staff time and energy expended
•  (−) Unsupported software: deprecated and/or pre-version-1 open source software
•  Future: a need to stream data in real time
Stream storage · Database · Cloud storage
Why Stream Storage?
•  Convert multiple streams into fewer persistent sequential streams
•  Sequential streams are easier to process
Amazon Kinesis or Kafka
(Diagram: producers 1…N feed records into shard/partition 1 and shard/partition 2)
Why Stream Storage?
•  Decouple producers and consumers
•  Buffer
•  Preserve client ordering
•  Streaming MapReduce
•  Consumer replay / reprocess
Amazon Kinesis or Kafka
(Diagram: producers 1…N write to shards/partitions 1 and 2; consumer 1 counts red = 4 and violet = 4; consumer 2 counts blue = 4 and green = 4)
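The ordering guarantee above is per shard: records on the same shard arrive in sequence, so one consumer per shard can aggregate without coordination. A toy sketch in plain Python (hypothetical record layout) of the per-shard counting shown on the slide:

```python
from collections import Counter

# Two shards, as in the slide: each holds an ordered sequence of records.
shards = {
    "shard-1": ["red", "violet", "red", "violet", "red", "violet", "red", "violet"],
    "shard-2": ["blue", "green", "blue", "green", "blue", "green", "blue", "green"],
}

def consume(shard):
    """One consumer per shard: read records in order and aggregate."""
    counts = Counter()
    for record in shard:   # order within a shard is preserved
        counts[record] += 1
    return counts

print(consume(shards["shard-1"]))  # Counter({'red': 4, 'violet': 4})
print(consume(shards["shard-2"]))  # Counter({'blue': 4, 'green': 4})
```

Replay/reprocess falls out of the same model: because the shard retains its ordered records, a consumer can simply iterate again from an earlier position.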
Introducing Amazon Kinesis
Managed service for real-time processing of big data
(Diagram: data sources send PUTs to the AWS endpoint; the stream's shards 1…N span three Availability Zones; App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis], and App.4 [Machine Learning] consume the stream and deliver to S3, DynamoDB, Redshift, and EMR)
Kinesis Architecture
•  Durable, highly consistent storage replicates data across three data centers (Availability Zones)
•  Millions of sources producing 100s of terabytes per hour
•  Front end handles authentication and authorization
•  An ordered stream of events supports multiple readers: real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate and archive to S3; aggregate analysis in Hadoop or a data warehouse
•  Inexpensive: $0.028 per million puts
Putting data into Kinesis
Managed Service for Ingesting Fast Moving Data
•  Streams are made of shards
⁻  A Kinesis stream is composed of multiple shards
⁻  Each shard ingests up to 1 MB/sec of data, and up to 1,000 TPS
⁻  Each shard emits up to 2 MB/sec of data
⁻  All data is stored for 24 hours
⁻  You scale Kinesis streams by adding or removing shards
•  Simple PUT interface to store data in Kinesis
⁻  Producers use a PUT call to store data in a stream
⁻  A partition key is used to distribute the PUTs across shards
⁻  A unique sequence number is returned to the producer upon a successful PUT call
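The partition key's role can be sketched in a few lines. Kinesis hashes the key with MD5 into a 128-bit number and routes the record to the shard owning that hash range; the even range split below is an illustrative assumption, not the service's exact shard layout:

```python
import hashlib

def shard_for(partition_key, num_shards):
    """Map a partition key onto one of num_shards shards.

    MD5 of the key gives a 128-bit integer; scaling it down to
    [0, num_shards) models an even split of the hash-key range.
    """
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return (h * num_shards) >> 128  # h * num_shards // 2**128

# The same key always lands on the same shard, preserving its ordering.
assert shard_for("user-42", 4) == shard_for("user-42", 4)
print({k: shard_for(k, 4) for k in ["user-1", "user-2", "user-3"]})
```

This is why choosing a high-cardinality partition key matters: a handful of hot keys concentrates traffic on a few shards regardless of how many shards the stream has.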
(Diagram: many producers PUT records into the Kinesis stream's shards 1…n)
POST / HTTP/1.1
Host: kinesis.<region>.<domain>
x-amz-Date: <Date>
Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user-
agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature>
User-Agent: <UserAgentString>
Content-Type: application/x-amz-json-1.1
Content-Length: <PayloadSizeBytes>
Connection: Keep-Alive
X-Amz-Target: Kinesis_20131202.PutRecord
{
"StreamName": "exampleStreamName",
"Data": "XzxkYXRhPl8x",
"PartitionKey": "partitionKey"
}
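The request body shown above can be assembled in a few lines; the `Data` field carries the raw record bytes, base64-encoded, while signing and transport are left to an SDK. A minimal sketch:

```python
import base64
import json

def put_record_payload(stream_name, data, partition_key):
    """Build the JSON body for a Kinesis PutRecord call.

    data is raw bytes; the API requires it base64-encoded in the
    Data field. StreamName and PartitionKey are passed through.
    """
    return json.dumps({
        "StreamName": stream_name,
        "Data": base64.b64encode(data).decode("ascii"),
        "PartitionKey": partition_key,
    })

body = put_record_payload("exampleStreamName", b"_<data>_1", "partitionKey")
print(body)  # Data encodes to "XzxkYXRhPl8x", as in the request above
```

In practice an SDK call wraps exactly this payload plus the Signature Version 4 headers shown in the raw request.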
(Diagram: KCL workers on EC2 instances, one worker per Kinesis shard 1…n)
Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing
•  Key streaming application attributes:
•  Be distributed, to handle multiple shards
•  Be fault tolerant, to handle failures in hardware or software
•  Scale up and down as the number of shards increases or decreases
•  The Kinesis Client Library (KCL) helps with distributed processing:
•  Automatically starts a Kinesis worker for each shard
•  Simplifies reading from the stream by abstracting away individual shards
•  Increases/decreases Kinesis workers as the number of shards changes
•  Checkpoints to keep track of a worker's location in the stream
•  Restarts workers if they fail
•  Use the KCL with Auto Scaling groups:
•  Auto Scaling policies will restart EC2 instances if they fail
•  Automatically add EC2 instances when load increases
•  The KCL redistributes workers to use the new EC2 instances
OR
•  Use the Get APIs for raw reads of Kinesis data streams
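A minimal sketch, in plain Python with hypothetical types, of the checkpointing behavior the KCL provides: one worker per shard records the last processed sequence number, so a restarted worker resumes past it (the KCL persists this state in a DynamoDB table):

```python
def run_worker(records, checkpoints, shard_id):
    """Process a shard's records, checkpointing after each one.

    records: list of (sequence_number, payload) in shard order.
    checkpoints: dict persisting the last-done sequence number
    per shard, standing in for the KCL's checkpoint table.
    """
    done = checkpoints.get(shard_id)
    processed = []
    for seq, payload in records:
        if done is not None and seq <= done:
            continue                  # already handled before a restart
        processed.append(payload)     # "process" the record
        checkpoints[shard_id] = seq   # checkpoint progress
    return processed

shard = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
ckpt = {}
run_worker(shard[:2], ckpt, "shard-1")        # worker dies after record 2
resumed = run_worker(shard, ckpt, "shard-1")  # restart replays the shard
print(resumed)  # → ['c', 'd']
```

Note the at-least-once semantics: if the worker crashes after processing a record but before checkpointing it, that record is processed again on restart, so processing should be idempotent.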
Amazon Kinesis: Key Developer Benefits
Easy administration: Managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
Real-time performance: Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
High throughput, elastic: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
S3, EMR, Storm, Redshift, & DynamoDB integration: Reliably collect, process, and transform all of your data in real time and deliver it to the AWS data stores of your choice, with connectors for S3, Redshift, and DynamoDB.
Build real-time applications: Client libraries enable developers to design and operate real-time streaming data processing applications.
Low cost: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream and pay low hourly rates only for what you use.
Customers using Amazon Kinesis
Mobile/social gaming:
•  Deliver continuous, real-time game insight data from 100s of game servers
•  Before: custom-built solutions, operationally complex to manage and not scalable
•  Delays in critical business data delivery
•  Developer burden in building a reliable, scalable platform for real-time data ingestion/processing
•  Slow-down of real-time customer insights
•  With Kinesis: accelerate time to market of elastic, real-time applications while minimizing operational overhead
Digital advertising tech:
•  Generate real-time metrics and KPIs on online ad performance for advertisers/publishers
•  Before: store-and-forward fleet of log servers and a Hadoop-based processing pipeline
•  Lost data in the store/forward layer
•  Operational burden in managing a reliable, scalable platform for real-time data ingestion/processing
•  Batch-driven "real-time" customer insights
•  With Kinesis: generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients
Digital Ad. Tech Metering with Kinesis
Continuous ad metrics extraction → incremental ad statistics computation → metering record archive → ad analytics dashboard
Collection of Data
Sources: web servers, application servers, connected devices, mobile phones, etc.
Aggregation tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, or a queue)
Data sink: a reliable and durable destination (or destinations)
Cloud database & storage
Cloud Database and Storage Tier Anti-pattern
App/Web Tier
Client Tier
RDBMS
Database & Storage Tier = All-in-one?
Cloud Database and Storage Tier — Use the Right
Tool for the Job!
App/Web tier
Client tier
Data tier (database & storage tier): search, Hadoop/HDFS, cache, blob store, SQL, NoSQL
App/Web tier
Client tier
Database & storage tier: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, HDFS on Amazon EMR
Cloud Database and Storage Tier — Use the Right
Tool for the Job!
What Database and Storage Should I Use?
•  Data structure
•  Query complexity
•  Data characteristics: hot, warm, cold
Data Structure and Query Types vs. Storage Technology
•  Structured, simple query: NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache)
•  Structured, complex query: SQL (Amazon RDS), search (Amazon CloudSearch)
•  Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
•  Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
(Axes: data structure complexity vs. query structure complexity)
What is the Temperature of Your Data?
Hot to cold: Amazon ElastiCache → Amazon DynamoDB → Amazon RDS → Amazon CloudSearch → HDFS → Amazon S3 → Amazon Glacier
•  Request rate: high → low
•  Cost/GB: high → low
•  Latency: low → high
•  Data volume: low → high
•  Structure: low → high
What Data Store Should I Use?

|                            | Amazon ElastiCache | Amazon DynamoDB    | Amazon RDS        | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3                 | Amazon Glacier      |
| Average latency            | ms                 | ms                 | ms, sec           | ms, sec            | sec, min, hrs     | ms, sec, min (~size)      | hrs                 |
| Data volume                | GB                 | GB–TBs (no limit)  | GB–TB (3 TB max)  | GB–TB              | GB–PB (~nodes)    | GB–PB (no limit)          | GB–PB (no limit)    |
| Item size                  | B–KB               | KB (64 KB max)     | KB (~row size)    | KB (1 MB max)      | MB–GB             | KB–GB (5 TB max)          | GB (40 TB max)      |
| Request rate               | Very high          | Very high          | High              | High               | Low – very high   | Low – very high (no limit) | Very low (no limit) |
| Storage cost ($/GB/month)  | $$                 | ¢¢                 | ¢¢                | $                  | ¢                 | ¢                         | ¢                   |
| Durability                 | Low – moderate     | Very high          | High              | High               | High              | Very high                 | Very high           |

Hot data ← → Warm data ← → Cold data
Decouple your storage and analysis engine
1.  Single version of truth
2.  Choice of multiple analytics tools
3.  Parallel execution from different teams
4.  Lower cost
Learning from Netflix
S3 as a "single source of truth"
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
Amazon SQS · Amazon S3 · DynamoDB · any SQL or NoSQL store · Kinesis
Choose depending upon design
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Process
•  Answering questions about data
•  Questions:
–  Analytics: think SQL/data warehouse
–  Classification: think sentiment analysis
–  Prediction: think page-view prediction
–  Etc.
Processing Frameworks
Generally come in a few major types:
•  Batch processing
•  Stream processing
•  Interactive query
Batch Processing
•  Take a large amount of cold data and ask questions
•  Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
Stream Processing (AKA Real Time)
•  Take a small amount of hot data and ask questions
•  Takes a short amount of time to get your answer back
Example: 1-minute metrics
Processing Tools
•  Batch processing/analytic
–  Amazon Redshift
–  Amazon EMR
•  Hive/Tez, Pig, Spark, Impala, Presto, …
•  Stream processing
–  Apache Spark streaming
–  Apache Storm (+ Trident)
–  Amazon Kinesis client and
connector library
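As a sketch of the "1-minute metrics" kind of computation the stream-processing tools above produce, assuming simple (timestamp_seconds, value) events, a tumbling one-minute window aggregation in plain Python:

```python
from collections import defaultdict

def one_minute_counts(events):
    """Aggregate (timestamp_seconds, value) events into 1-minute buckets.

    A stream processor maintains such windows incrementally as
    records arrive; here the whole stream is folded at once for
    clarity.
    """
    windows = defaultdict(int)
    for ts, value in events:
        windows[ts // 60] += value   # bucket = minute since epoch
    return dict(windows)

events = [(3, 1), (45, 2), (61, 5), (119, 1), (130, 4)]
print(one_minute_counts(events))  # → {0: 3, 1: 6, 2: 4}
```

Frameworks like Spark Streaming or Storm keep the current window in memory and emit each bucket as its minute closes, which is what makes second-scale latencies possible.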
Amplab Big Data Benchmark
Scan query · Aggregate query · Join query
https://amplab.cs.berkeley.edu/benchmark/
What Batch Processing Technology Should I Use?

|                | Redshift   | Impala        | Presto        | Spark         | Hive          |
| Query latency  | Low        | Low           | Low           | Low – Medium  | Medium – High |
| Durability     | High       | High          | High          | High          | High          |
| Data volume    | 1.6 PB max | ~nodes        | ~nodes        | ~nodes        | ~nodes        |
| Managed        | Yes        | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)     |
| Storage        | Native     | HDFS          | HDFS/S3       | HDFS/S3       | HDFS/S3       |
| # of BI tools  | High       | Medium        | High          | Low           | High          |

(Query latency: low is better)
What Stream Processing Technology Should I Use?

|                       | Spark Streaming     | Apache Storm + Trident | Kinesis Client Library |
| Scale/throughput      | ~nodes              | ~nodes                 | ~nodes                 |
| Data volume           | ~nodes              | ~nodes                 | ~nodes                 |
| Manageability         | Yes (EMR bootstrap) | Do it yourself         | EC2 + Auto Scaling     |
| Fault tolerance       | Built-in            | Built-in               | KCL checkpointing      |
| Programming languages | Java, Python, Scala | Java, Scala, Clojure   | Java, Python           |
Hadoop based Analysis
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → DynamoDB / any SQL or NoSQL store
Your choice of tools on Hadoop/EMR
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → DynamoDB / any SQL or NoSQL store
Hadoop based Analysis
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → DynamoDB / any SQL or NoSQL store
Hadoop based Analysis
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → DynamoDB / any SQL or NoSQL store
Spark and Shark
Cloudera Impala
Hadoop is good for
1.  Ad Hoc Query analysis
2.  Large Unstructured Data Sets
3.  Machine Learning and Advanced Analytics
4.  Schema less
SQL based Low Latency Analytics on
structured data
SQL based processing
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon Redshift (petabyte-scale columnar data warehouse) → DynamoDB / any SQL or NoSQL store
SQL based processing for unstructured data
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR (pre-processing framework) → Amazon Redshift (petabyte-scale columnar data warehouse) → DynamoDB / any SQL or NoSQL store
Your choice of BI Tools on the cloud
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR (pre-processing framework) → Amazon Redshift → DynamoDB / any SQL or NoSQL store
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing insights
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → DynamoDB / any SQL or NoSQL store
Sharing results and visualizations
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → web app server → visualization tools
Sharing results and visualizations and scale
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → web app server → visualization tools
Sharing results and visualizations
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → business intelligence tools
Geospatial Visualizations
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR (GIS tools on Hadoop) → Amazon Redshift → business intelligence tools / GIS tools / visualization tools
Rinse and Repeat
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → visualization tools / business intelligence tools / GIS tools, orchestrated by AWS Data Pipeline
The complete architecture
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → visualization tools / business intelligence tools / GIS tools, orchestrated by AWS Data Pipeline
Reference: BDT403, Next Generation Big Data Platform @ Netflix
Architecture
Big Data
• 10+ PB DW on S3
• 1.2 PB read daily
• 100 TB written daily
• ~ 200 billion events daily
(Netflix data pipeline: cloud apps → Suro → Ursula for event data, every 15 min; Cassandra → Aegisthus for dimension data via SSTables, daily; both land in Amazon S3)
Data Pipelines
@2013: Amazon S3 as storage; compute, service, and tools layered on top
@2014 (v2.0): Amazon S3 as storage; compute, service, and tools layered on top
Self-paced online labs and training
Learn the basics of using AWS, and how to apply it, through a variety of online courses and hands-on labs.
Instructor-led training
Learn how to build highly available, cost-efficient, secure applications on the AWS cloud in classes led by AWS expert instructors. A range of in-person courses on architecture design and implementation is available.
AWS Certification
Validate your cloud expertise and experience with a certification exam and present it as part of your professional credentials.
http://aws.amazon.com/ko/training
A variety of training programs
Thank you for joining the AWS webinar series!
We hope this webinar helped answer your questions.
Please share your feedback on today's webinar in the survey that follows.
aws-korea-marketing@amazon.com
http://twitter.com/AWSKorea
http://facebook.com/AmazonWebServices.ko
http://youtube.com/user/AWSKorea
http://slideshare.net/AWSKorea

AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015

  • 1.
    심화 웨비나 시리즈| 8 번째 강연 2015년 7월 9일 목요일 | 오후 2시 http://aws.amazon.com/ko AWS를 활용한 첫 빅데이터 프로젝트 시작하기  
  • 2.
  • 3.
    이번 웨비나 에서들으실 내용.. 이 강연에서는 AWS Elastic MapReduce, Amazon Redshift, Amazon Kinesis 등 AWS가 제공하는 다양한 데이터 분석 도 구를 활용해 보다 간편하고 빠른 빅데이터 분석 서비스를 구 축하는 방법에 대해 소개합니다.
  • 4.
    v   Agenda •  AWSBig data building blocks •  AWS Big data platform •  Log data collection & storage •  Introducing Amazon Kinesis •  Data Analytics & Computation •  Collaboration & sharing •  Netflix Use-case
  • 5.
    AWS Big databuilding blocks (brief)
  • 6.
    Use the righttools Amazon S3 Amazon Kinesis Amazon DynamoDB Amazon Redshift Amazon Elastic MapReduce
  • 7.
  • 8.
    Real-time processing High throughput;elastic Easy to use EMR, S3, Redshift, DynamoDB Integrations Amazon Kinesis
  • 9.
    NoSQL Database Seamless scalability Zeroadmin Single digit millisecond latency Amazon DynamoDB
  • 10.
    Relational data warehouse Massivelyparallel Petabyte scale Fully managed $1,000/TB/Year Amazon Redshift
  • 11.
    Hadoop/HDFS clusters Hive, Pig,Impala, Hbase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
  • 12.
    HDFS Amazon RedShift Amazon RDS Amazon S3 Amazon DynamoDB AmazonEMR Amazon Kinesis AWS  Data  Pipeline   Data  management   Hadoop  Ecosystem  analy8cal  tools   Data   Sources   AWS Data Pipeline
  • 13.
    v   Generation Collection &storage Analytics & computation Collaboration & sharing
  • 14.
    v   a   Amazon DynamoDB Amazon RDS Amazon Redshift AWS DirectConnect AWS Storage Gateway AWS Import/ Export Amazon Glacier S3 Amazon Kinesis Amazon EMR Generation Collection & storage Analytics & computation Collaboration & sharing
  • 15.
    v   Amazon  EC2   Amazon  EMR   Amazon Kinesis Generation Collection & storage Analytics & computation Collaboration & sharing
  • 16.
    v   Amazon Redshift Amazon DynamoDB Amazon     RDS   S3 Amazon  EC2   Amazon  EMR   Amazon   CloudFront   AWS   CloudForma8on   AWS    Data  Pipeline   Generation Collection & storage Analytics & computation Collaboration & sharing
  • 17.
    The right tools. Atthe right scale. At the right time.
  • 18.
    AWS Big dataplatform
  • 19.
    v   Generation Collection &storage Analytics & computation Collaboration & sharing
  • 20.
    v   Generation Collection &storage Analytics & computation Collaboration & sharing
  • 21.
    v   Collection ofData Sources   Aggrega8on   Tool   Data  Sink   Web  Servers   Applica8on  servers   Connected  Devices   Mobile  Phones   Etc   Scalable  method  to   collect  and  aggregate   Flume,  KaGa,  Kinesis,   Queue   Reliable  and  durable   des8na8on  OR   Des8na8ons    
  • 22.
    Types of DataIngest •  Transactional –  Database reads/ writes •  File –  Click-stream logs •  Stream –  Click-stream logs Database   Cloud   Storage   Stream   Storage  
  • 23.
    Run your ownlog collector Your  applica0on   Amazon S3 DynamoDB   Any  other  data   store   Amazon S3 Amazon  EC2    
  • 24.
    Use a Queue Amazon  Simple   Queue  Service   (SQS)   Amazon S3 DynamoDB   Any  other  data   store  
  • 25.
    Agency Customer: VideoAnalytics on AWS Elastic Load Balancer Edge Servers on EC2 Workers on EC2 Logs Reports HDFS Cluster Amazon Simple Queue Service (SQS) Amazon Simple Storage Service (S3) Amazon Elastic MapReduce
  • 26.
    Use a Toollike FLUME, KAFKA, HONU etc Flume running on EC2 Amazon S3 Any  other  data   store   HDFS
  • 27.
    v   Choice oftools •  (+) Pros / (-) Cons •  (+) Flexibility: Customers select the most appropriate software and underlying infrastructure •  (+) Control: Software and hardware can be tuned to meet specific business and scenario needs. •  (-) Ongoing Operational Complexity: Deploy, and manage an end-to-end system •  (-) Infrastructure planning and maintenance: Managing a reliable, scalable infrastructure •  (-) Developer/ IT staff overhead: Developers, Devops and IT staff time and energy expended •  (-) Unsupported Software: deprecated and/ pre-version 1 open source software •  Future – Need for to stream data for real time
  • 28.
    Stream   Storage   Database   Cloud   Storage  
  • 29.
    29 Why Stream Storage? • Convert multiple streams into fewer persistent sequential streams •  Sequential streams are easier to process Amazon  Kinesis  or  KaGa   4 4 3 3 2 2 1 14 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 Shard  or  Par88on  1   Shard  or  Par88on  2   Producer  1   Producer  2   Producer  3   Producer  N  
  • 30.
    30 Amazon  Kinesis  or  KaGa   Why Stream Storage? • Decouple producers and consumers • Buffer • Preserve client ordering • Streaming MapReduce • Consumer replay / reprocess 4 4 3 3 2 2 1 14 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 Producer  1   Shard  or  Par88on  1   Shard  or  Par88on  2   Consumer  1   Count  of   Red  =  4   Count  of   Violet  =  4   Consumer  2   Count  of   Blue  =  4   Count  of   Green  =  4   Producer  2   Producer  3   Producer  N  
  • 31.
  • 32.
     Data   Sources   App.4     [Machine   Learning]                                       AWS  Endpoint   App.1     [Aggregate  &   De-­‐Duplicate]    Data   Sources   Data   Sources    Data   Sources   App.2     [Metric   Extrac0on]   S3 DynamoDB   Redshift App.3   [Sliding   Window   Analysis]    Data   Sources   Availability Zone Shard  1   Shard  2   Shard  N   Availability Zone Availability Zone Introducing Amazon Kinesis Managed Service for Real-Time Processing of Big Data EMR
  • 33.
    Kinesis Architecture Amazon WebServices AZ AZ AZ Durable, highly consistent storage replicates data across three data centers (availability zones) Aggregate and archive to S3 Millions of sources producing 100s of terabytes per hour Front End Authentication Authorization Ordered stream of events supports multiple readers Real-time dashboards and alarms Machine learning algorithms or sliding window analytics Aggregate analysis in Hadoop or a data warehouse Inexpensive: $0.028 per million puts
  • 34.
    Putting data intoKinesis Managed Service for Ingesting Fast Moving Data •  Streams  are  made  of  Shards   ⁻  A  Kinesis  Stream  is  composed  of  mul8ple  Shards     ⁻  Each  Shard  ingests  up  to  1MB/sec  of  data,  and  up  to  1000  TPS   ⁻  Each  Shard  emits  up  to  2  MB/sec  of  data   ⁻  All  data  is  stored  for  24  hours   ⁻  You  scale  Kinesis  streams  by  adding  or  removing  Shards   •  Simple  PUT  interface  to  store  data  in  Kinesis   ⁻  Producers  use  a  PUT  call  to  store  data  in  a  Stream   ⁻  A  Par00on  Key  is  used  to  distribute  the  PUTs  across  Shards   ⁻  A  unique  Sequence  #  is  returned  to  the  Producer  upon  a  successful   PUT  call   Producer Shard 1 Shard 2 Shard 3 Shard n Shard 4 Producer Producer Producer Producer Producer Producer Producer Producer Kinesis
  • 35.
    POST / HTTP/1.1 Host:kinesis.<region>.<domain> x-amz-Date: <Date> Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user- agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature> User-Agent: <UserAgentString> Content-Type: application/x-amz-json-1.1 Content-Length: <PayloadSizeBytes> Connection: Keep-Alive X-Amz-Target: Kinesis_20131202.PutRecord { "StreamName": "exampleStreamName", "Data": "XzxkYXRhPl8x", "PartitionKey": "partitionKey" }
  • 36.
    v   Shard 1 Shard2 Shard 3 Shard n Shard 4 KCL Worker 1 KCL Worker 2 EC2 Instance KCL Worker 3 KCL Worker 4 EC2 Instance KCL Worker n EC2 Instance Kinesis Building Kinesis Apps Client library for fault-tolerant, at least-once, real-time processing •  Key streaming application attributes: •  Be distributed, to handle multiple shards •  Be fault tolerant, to handle failures in hardware or software •  Scale up and down as the number of shards increase or decrease •  Kinesis Client Library (KCL) helps with distributed processing: •  Automatically starts a Kinesis Worker for each shard •  Simplifies reading from the stream by abstracting individual shards •  Increases / Decreases Kinesis Workers as # of shards changes •  Checkpoints to keep track of a Worker’s location in the stream •  Restarts Workers if they fail •  Use the KCL with Auto Scaling Groups •  Auto Scaling policies will restart EC2 instances if they fail •  Automatically add EC2 instances when load increases •  KCL will redistributes Workers to use the new EC2 instances OR •  Use the Get APIs for raw reads of Kinesis data streams
  • 37.
    37 Easy  Administra0on       Managed  service  for  real-­‐8me  streaming  data   collec8on,  processing  and  analysis.  Simply   create  a  new  stream,  set  the  desired  level  of   capacity,  and  let  the  service  handle  the  rest.         Real-­‐0me  Performance         Perform  con8nual  processing  on  streaming   big  data.  Processing  latencies  fall  to  a  few   seconds,  compared  with  the  minutes  or  hours   associated  with  batch  processing.             High  Throughput.  Elas0c         Seamlessly  scale  to  match  your  data   throughput  rate  and  volume.  You  can  easily   scale  up  to  gigabytes  per  second.  The  service   will  scale  up  or  down  based  on  your   opera8onal  or  business  needs.       S3,  EMR,  Storm,  RedshiY,  &  DynamoDB   Integra0on       Reliably  collect,  process,  and  transform  all  of   your  data  in  real-­‐8me  &  deliver  to  AWS  data   stores  of  choice,  with  Connectors  for  S3,   Redshi],  and  DynamoDB.           Build  Real-­‐0me  Applica0ons       Client  libraries  that  enable  developers  to   design  and  operate  real-­‐8me  streaming  data   processing  applica8ons.                   Low  Cost       Cost-­‐efficient  for  workloads  of  any  scale.  You   can  get  started  by  provisioning  a  small   stream,  and  pay  low  hourly  rates  only  for   what  you  use.               Amazon Kinesis: Key Developer Benefits
Customers using Amazon Kinesis
•  Mobile/Social Gaming: Deliver continuous, real-time game insight data from 100s of game servers. Previously, custom-built solutions were operationally complex to manage and not scalable:
   •  Delays in critical business data delivery
   •  Developer burden in building a reliable, scalable platform for real-time data ingestion/processing
   •  Slow-down of real-time customer insights
   With Kinesis: accelerate time to market of elastic, real-time applications while minimizing operational overhead.
•  Digital Advertising Tech.: Generate real-time metrics and KPIs for online ad performance for advertisers/publishers. Previously, a store-and-forward fleet of log servers and a Hadoop-based processing pipeline:
   •  Lost data in the store/forward layer
   •  Operational burden in managing a reliable, scalable platform for real-time data ingestion/processing
   •  Batch-driven "real-time" customer insights
   With Kinesis: generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients.
Digital Ad. Tech Metering with Kinesis
[Diagram: Continuous Ad Metrics Extraction → Incremental Ad Statistics Computation → Metering Record Archive → Ad Analytics Dashboard]
Collection of Data
•  Data Sources: web servers, application servers, connected devices, mobile phones, etc.
•  Aggregation Tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, queues)
•  Data Sink: a reliable and durable destination, or destinations
Cloud Database and Storage Tier Anti-pattern
[Diagram: Client Tier → App/Web Tier → a single RDBMS]
Database & Storage Tier = All-in-one?
Cloud Database and Storage Tier: Use the Right Tool for the Job!
[Diagram: Client Tier → App/Web Tier → Data Tier (Database & Storage Tier): Search, Hadoop/HDFS, Cache, Blob Store, SQL, NoSQL]
Cloud Database and Storage Tier: Use the Right Tool for the Job!
[Diagram: Client Tier → App/Web Tier → Database & Storage Tier: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, HDFS on Amazon EMR]
What Database and Storage Should I Use?
•  Data structure
•  Query complexity
•  Data characteristics: hot, warm, cold
Data Structure and Query Types vs. Storage Technology
(Axes: data structure complexity vs. query structure complexity)
•  Structured, simple query: NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache)
•  Structured, complex query: SQL (Amazon RDS), search (Amazon CloudSearch)
•  Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
•  Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
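The matrix above can be read as a lookup from (data structure, query type) to a storage technology. A toy sketch of that lookup (the function and its encoding are our illustration of the slide, not AWS guidance):

```python
def suggest_store(structured: bool, query: str) -> str:
    """Mirror the structure/query matrix: pick a storage technology
    from whether the data is structured and what queries it needs.
    Real choices also weigh cost, latency, and data volume."""
    table = {
        (True,  "simple"):  "Amazon DynamoDB (NoSQL) / Amazon ElastiCache (cache)",
        (True,  "complex"): "Amazon RDS (SQL) / Amazon CloudSearch (search)",
        (False, "none"):    "Amazon S3 / Amazon Glacier (cloud storage)",
        (False, "custom"):  "Amazon Elastic MapReduce (Hadoop/HDFS)",
    }
    return table[(structured, query)]

# Unstructured logs that need custom analysis land on EMR:
print(suggest_store(False, "custom"))  # → Amazon Elastic MapReduce (Hadoop/HDFS)
```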
What is the Temperature of Your Data?
[Spectrum, hot → cold: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, HDFS → Amazon S3 → Amazon Glacier]
•  Request rate: high → low
•  Cost/GB: high → low
•  Latency: low → high
•  Data volume: low → high
•  Structure: low → high
What Data Store Should I Use?
•  Amazon ElastiCache (hot data): average latency ms; data volume GB; item size B–KB; request rate very high; storage cost $$/GB/month; durability low–moderate
•  Amazon DynamoDB (hot data): latency ms; volume GB–TB (no limit); item size KB (64 KB max); request rate very high; cost ¢¢; durability very high
•  Amazon RDS (warm data): latency ms–sec; volume GB–TB (3 TB max); item size KB (~row size); request rate high; cost ¢¢; durability high
•  Amazon CloudSearch (warm data): latency ms–sec; volume GB–TB; item size KB (1 MB max); request rate high; cost $; durability high
•  Amazon EMR (HDFS) (warm data): latency sec–min–hrs; volume GB–PB (~nodes); item size MB–GB; request rate low–very high; cost ¢; durability high
•  Amazon S3 (warm/cold data): latency ms–sec–min (~size); volume GB–PB (no limit); item size KB–GB (5 TB max); request rate low–very high (no limit); cost ¢; durability very high
•  Amazon Glacier (cold data): latency hrs; volume GB–PB (no limit); item size GB (40 TB max); request rate very low (no limit); cost ¢; durability very high
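The hot/warm/cold framing above suggests routing data to a tier by its age and access rate. A toy router in that spirit (the thresholds are illustrative assumptions of ours, not AWS guidance):

```python
def storage_tier(age_days: float, reads_per_day: float) -> str:
    """Route data to a temperature tier: hot data goes to low-latency
    stores, cold data to cheap archival storage. Thresholds here are
    arbitrary examples chosen for illustration."""
    if age_days < 1 and reads_per_day > 1000:
        return "hot: Amazon ElastiCache / DynamoDB"
    if age_days < 30:
        return "warm: Amazon RDS / EMR (HDFS)"
    return "cold: Amazon S3 / Glacier"

# Fresh, heavily read data is hot; year-old data is archived:
print(storage_tier(0.5, 5000))  # → hot: Amazon ElastiCache / DynamoDB
print(storage_tier(365, 0))     # → cold: Amazon S3 / Glacier
```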
Decouple Your Storage and Analysis Engine (learning from Netflix)
1.  Single version of truth
2.  Choice of multiple analytics tools
3.  Parallel execution from different teams
4.  Lower cost
S3 as a "single source of truth"
Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
Choose Depending Upon Design
[Diagram: data sources feeding Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, or Kinesis]
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Process: Answering Questions About Data
•  Questions:
   •  Analytics: think SQL/data warehouse
   •  Classification: think sentiment analysis
   •  Prediction: think page-view prediction
   •  Etc.
Processing Frameworks
Generally come in three major types:
•  Batch processing
•  Stream processing
•  Interactive query
Batch Processing
•  Take a large amount of cold data and ask questions
•  Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
Stream Processing (AKA Real Time)
•  Take a small amount of hot data and ask questions
•  Takes a short amount of time to get your answer back
Example: 1-minute metrics
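The "1-minute metrics" example above amounts to a tumbling-window aggregation: bucket incoming events into fixed 60-second windows and count per window. A minimal sketch of just the windowing logic, independent of any streaming framework (function and event names are illustrative):

```python
from collections import defaultdict

def one_minute_counts(events):
    """Tumbling-window count: group (timestamp_sec, metric) events into
    60-second windows, the shape of output a stream processor such as
    Spark Streaming or a KCL application would emit continuously."""
    windows = defaultdict(int)
    for ts, metric in events:
        windows[(ts // 60, metric)] += 1  # window index = ts // 60
    return dict(windows)

events = [(3, "page_view"), (59, "page_view"), (61, "page_view")]
print(one_minute_counts(events))
# → {(0, 'page_view'): 2, (1, 'page_view'): 1}
```

In a real stream the same logic runs incrementally, emitting each window's counts seconds after the window closes, rather than over a finished list.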
Processing Tools
•  Batch processing/analytics
   –  Amazon Redshift
   –  Amazon EMR: Hive/Tez, Pig, Spark, Impala, Presto, …
•  Stream processing
   –  Apache Spark Streaming
   –  Apache Storm (+ Trident)
   –  Amazon Kinesis client and connector libraries
AMPLab Big Data Benchmark
[Charts: scan query, aggregate query, join query]
https://amplab.cs.berkeley.edu/benchmark/
What Batch Processing Technology Should I Use?
•  Redshift: query latency low; durability high; data volume 1.6 PB max; managed: yes; storage: native; # of BI tools: high
•  Impala: query latency low; durability high; data volume ~nodes; managed: EMR bootstrap; storage: HDFS; # of BI tools: medium
•  Presto: query latency low; durability high; data volume ~nodes; managed: EMR bootstrap; storage: HDFS/S3; # of BI tools: high
•  Spark: query latency low–medium; durability high; data volume ~nodes; managed: EMR bootstrap; storage: HDFS/S3; # of BI tools: low
•  Hive: query latency medium–high; durability high; data volume ~nodes; managed: yes (EMR); storage: HDFS/S3; # of BI tools: high
What Stream Processing Technology Should I Use?
•  Spark Streaming: scale/throughput ~nodes; data volume ~nodes; manageability: yes (EMR bootstrap); fault tolerance: built-in; languages: Java, Python, Scala
•  Apache Storm + Trident: scale/throughput ~nodes; data volume ~nodes; manageability: do it yourself; fault tolerance: built-in; languages: Java, Scala, Clojure
•  Kinesis Client Library: scale/throughput ~nodes; data volume ~nodes; manageability: EC2 + Auto Scaling; fault tolerance: KCL checkpointing; languages: Java, Python
Hadoop-based Analysis
[Diagram: log aggregation tools feeding Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store, analyzed with Amazon EMR]
Your Choice of Tools on Hadoop/EMR
[Diagram: the same pipeline, with Amazon EMR running your choice of Hadoop ecosystem tools]
Hadoop-based Analysis
[Diagram: the same pipeline, with Spark and Shark or Cloudera Impala running on Amazon EMR]
Hadoop is Good For
1.  Ad hoc query analysis
2.  Large unstructured data sets
3.  Machine learning and advanced analytics
4.  Schema-less data
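The points above rest on the MapReduce model: a map phase emits key-value pairs from each input record, a shuffle groups the pairs by key, and a reduce phase aggregates each group. A toy single-process word count illustrating the model (real jobs run the same three steps distributed across an EMR cluster):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit (word, 1) for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle_and_reduce(pairs):
    """Shuffle: sort/group pairs by key. Reduce: sum each group's counts."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data on aws", "big data tools"]
mapped = [pair for line in lines for pair in map_phase(line)]
print(shuffle_and_reduce(mapped))
# → {'aws': 1, 'big': 2, 'data': 2, 'on': 1, 'tools': 1}
```

Because the map phase imposes structure at read time, the input needs no schema up front, which is exactly why Hadoop suits large unstructured data sets.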
SQL-based Low-Latency Analytics on Structured Data
SQL-based Processing
[Diagram: the pipeline as before, now analyzed with Amazon Redshift, a petabyte-scale columnar data warehouse]
SQL-based Processing for Unstructured Data
[Diagram: Amazon EMR as a pre-processing framework feeding Amazon Redshift, the petabyte-scale columnar data warehouse]
Your Choice of BI Tools on the Cloud
[Diagram: BI tools on top of Amazon EMR (pre-processing framework) and Amazon Redshift]
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing Insights
[Diagram: the pipeline as before, with Amazon EMR and Amazon Redshift outputs ready for sharing]
Sharing Results and Visualizations
[Diagram: a web app server and visualization tools on top of Amazon EMR and Amazon Redshift]
Sharing Results and Visualizations at Scale
[Diagram: the same architecture, with the web app server and visualization tools scaled out]
Sharing Results and Visualizations
[Diagram: business intelligence tools querying Amazon EMR and Amazon Redshift]
Geospatial Visualizations
[Diagram: GIS tools on Hadoop, GIS tools, and visualization tools added alongside the BI tools]
Rinse and Repeat
[Diagram: AWS Data Pipeline orchestrating the full flow across Amazon EMR, Amazon Redshift, and the visualization, BI, and GIS tools]
The Complete Architecture
[Diagram: log aggregation tools → Amazon SQS / S3 / DynamoDB / any SQL or NoSQL store → Amazon EMR and Amazon Redshift → visualization, BI, and GIS tools, orchestrated by AWS Data Pipeline]
Reference: BDT403, Next Generation Big Data Platform @ Netflix
Big Data
•  10+ PB DW on S3
•  1.2 PB read daily
•  100 TB written daily
•  ~200 billion events daily
Data Pipelines
[Diagram: cloud apps → Suro → Ursula (event data, every 15 min) → Amazon S3; Cassandra → Aegisthus (dimension data from SSTables, daily) → Amazon S3]
@2013
[Diagram: the platform as of 2013 - Amazon S3 as the storage layer, with compute, service, and tools layers above]
@2014 (v2.0)
[Diagram: the platform as of 2014 - Amazon S3 still the storage layer, with updated compute, service, and tools layers]
A Variety of Training Programs
•  Online self-study and labs: learn the basics of using AWS through a wide range of online course materials and hands-on labs.
•  Instructor-led training: learn how to build highly available, cost-efficient, and secure applications on the AWS cloud in classes led by AWS expert instructors. A variety of offline courses on architecture design and implementation are available.
•  AWS Certification: validate your cloud expertise and experience with a certification exam, and present it as part of your professional credentials.
http://aws.amazon.com/ko/training
Thank you for joining the AWS webinar series!
We hope this webinar helped answer your questions. Please let us know what you thought of today's session in the survey that follows.
aws-korea-marketing@amazon.com
http://twitter.com/AWSKorea
http://facebook.com/AmazonWebServices.ko
http://youtube.com/user/AWSKorea
http://slideshare.net/AWSKorea