Advanced Webinar Series | Session 8
Thursday, July 9, 2015 | 2:00 PM
http://aws.amazon.com/ko
Getting Started with Your First Big Data Project on AWS

Ilho Kim, Solutions Architect

What you'll hear in this webinar
This session introduces how to build simpler, faster big data analytics services using the data analysis tools AWS provides, including Amazon Elastic MapReduce, Amazon Redshift, and Amazon Kinesis.
Agenda
•  AWS Big data building blocks
•  AWS Big data platform
•  Log data collection & storage
•  Introducing Amazon Kinesis
•  Data Analytics & Computation
•  Collaboration & sharing
•  Netflix Use-case
AWS Big data building blocks (brief)
Use the right tools
Amazon
S3
Amazon
Kinesis
Amazon
DynamoDB
Amazon
Redshift
Amazon
Elastic
MapReduce
Store anything
Object storage
Scalable
99.999999999%
durability
Amazon
S3
Real-time processing
High throughput; elastic
Easy to use
EMR, S3, Redshift,
DynamoDB Integrations
Amazon
Kinesis
NoSQL Database
Seamless scalability
Zero admin
Single digit millisecond
latency
Amazon
DynamoDB
Relational data warehouse
Massively parallel
Petabyte scale
Fully managed
$1,000/TB/Year
Amazon
Redshift
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3,
DynamoDB, and Kinesis
Amazon
Elastic
MapReduce
HDFS
Amazon
Redshift
Amazon
RDS
Amazon S3 Amazon
DynamoDB
Amazon EMR
Amazon
Kinesis
AWS Data Pipeline
Data management: Hadoop ecosystem analytical tools
Data sources
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon
DynamoDB
Amazon
RDS
Amazon
Redshift
AWS
Direct Connect
AWS
Storage Gateway
AWS
Import/ Export
Amazon
Glacier
S3
Amazon
Kinesis
Amazon EMR
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon EC2
Amazon EMR
Amazon
Kinesis
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon Redshift
Amazon DynamoDB
Amazon RDS
S3
Amazon EC2
Amazon EMR
Amazon CloudFront
AWS CloudFormation
AWS Data Pipeline
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
The right tools.
At the right scale.
At the right time.
AWS Big data platform
v	
  
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
v	
  
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
v	
  
Collection of Data
Sources: web servers, application servers, connected devices, mobile phones, etc.
Aggregation tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, or a queue)
Data sink: a reliable and durable destination (or destinations)
Types of Data Ingest
•  Transactional
–  Database reads/writes
•  File
–  Click-stream logs
•  Stream
–  Click-stream logs
Database · Cloud storage · Stream storage
Run your own log collector
Your application on Amazon EC2 → Amazon S3 → DynamoDB / any other data store
Use a Queue
Amazon Simple Queue Service (SQS) → Amazon S3 → DynamoDB / any other data store
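As a rough sketch of the queue-decoupled pattern above (producers enqueue log records, a worker drains the queue and persists batches to a durable sink), using only the standard library and hypothetical names, with a plain list standing in for S3/DynamoDB:

```python
import queue

def collect_logs(q, sink, batch_size=3):
    """Drain the queue, flushing records to the sink in batches.

    Mirrors the slide's pattern: producers enqueue, a worker
    persists batches to a durable store (here just a list).
    """
    batch = []
    while True:
        try:
            record = q.get_nowait()
        except queue.Empty:
            break
        batch.append(record)
        if len(batch) >= batch_size:
            sink.append(list(batch))
            batch.clear()
    if batch:  # flush the final partial batch
        sink.append(list(batch))
    return sink

q = queue.Queue()
for i in range(7):            # producers: 7 log lines
    q.put(f"log-{i}")
sink = collect_logs(q, [])    # worker: batches of 3
print([len(b) for b in sink])  # → [3, 3, 1]
```

The queue absorbs bursts, so producers never block on the sink; with SQS the same shape holds, but the queue is durable and shared across worker fleets.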
Agency Customer: Video Analytics on AWS
Elastic Load
Balancer
Edge Servers
on EC2
Workers on
EC2
Logs Reports
HDFS Cluster
Amazon Simple Queue Service
(SQS)
Amazon Simple Storage Service (S3)
Amazon Elastic MapReduce
Use a tool like Flume, Kafka, Honu, etc.
Flume running
on EC2
Amazon S3
Any other data store
HDFS
Choice of tools
•  (+) Pros / (−) Cons
•  (+) Flexibility: customers select the most appropriate software and underlying infrastructure
•  (+) Control: software and hardware can be tuned to meet specific business and scenario needs
•  (−) Ongoing operational complexity: deploying and managing an end-to-end system
•  (−) Infrastructure planning and maintenance: managing a reliable, scalable infrastructure
•  (−) Developer/IT staff overhead: developer, DevOps, and IT staff time and energy expended
•  (−) Unsupported software: deprecated and/or pre-version-1 open source software
•  Future: a need to stream data in real time
Stream storage · Database · Cloud storage
Why Stream Storage?
•  Convert multiple streams into fewer persistent sequential streams
•  Sequential streams are easier to process
Amazon Kinesis or Kafka
(Diagram: producers 1…N feed records into shard/partition 1 and shard/partition 2)
Why Stream Storage?
•  Decouple producers and consumers
•  Buffer
•  Preserve client ordering
•  Streaming MapReduce
•  Consumer replay / reprocess
Amazon Kinesis or Kafka
(Diagram: producers 1…N write to shards/partitions 1 and 2; consumer 1 counts red = 4 and violet = 4; consumer 2 counts blue = 4 and green = 4)
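The ordering guarantee above is per shard: records on the same shard arrive in sequence, so one consumer per shard can aggregate without coordination. A toy sketch in plain Python (hypothetical record layout) of the per-shard counting shown on the slide:

```python
from collections import Counter

# Two shards, as in the slide: each holds an ordered sequence of records.
shards = {
    "shard-1": ["red", "violet", "red", "violet", "red", "violet", "red", "violet"],
    "shard-2": ["blue", "green", "blue", "green", "blue", "green", "blue", "green"],
}

def consume(shard):
    """One consumer per shard: read records in order and aggregate."""
    counts = Counter()
    for record in shard:   # order within a shard is preserved
        counts[record] += 1
    return counts

print(consume(shards["shard-1"]))  # Counter({'red': 4, 'violet': 4})
print(consume(shards["shard-2"]))  # Counter({'blue': 4, 'green': 4})
```

Replay/reprocess falls out of the same model: because the shard retains its ordered records, a consumer can simply iterate again from an earlier position.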
Introducing Amazon Kinesis
Managed service for real-time processing of big data
(Diagram: data sources send PUTs to the AWS endpoint; the stream's shards 1…N span three Availability Zones; App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis], and App.4 [Machine Learning] consume the stream and deliver to S3, DynamoDB, Redshift, and EMR)
Kinesis Architecture
•  Durable, highly consistent storage replicates data across three data centers (Availability Zones)
•  Millions of sources producing 100s of terabytes per hour
•  Front end handles authentication and authorization
•  An ordered stream of events supports multiple readers: real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate and archive to S3; aggregate analysis in Hadoop or a data warehouse
•  Inexpensive: $0.028 per million puts
Putting data into Kinesis
Managed Service for Ingesting Fast Moving Data
•  Streams are made of shards
⁻  A Kinesis stream is composed of multiple shards
⁻  Each shard ingests up to 1 MB/sec of data, and up to 1,000 TPS
⁻  Each shard emits up to 2 MB/sec of data
⁻  All data is stored for 24 hours
⁻  You scale Kinesis streams by adding or removing shards
•  Simple PUT interface to store data in Kinesis
⁻  Producers use a PUT call to store data in a stream
⁻  A partition key is used to distribute the PUTs across shards
⁻  A unique sequence number is returned to the producer upon a successful PUT call
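The partition key's role can be sketched in a few lines. Kinesis hashes the key with MD5 into a 128-bit number and routes the record to the shard owning that hash range; the even range split below is an illustrative assumption, not the service's exact shard layout:

```python
import hashlib

def shard_for(partition_key, num_shards):
    """Map a partition key onto one of num_shards shards.

    MD5 of the key gives a 128-bit integer; scaling it down to
    [0, num_shards) models an even split of the hash-key range.
    """
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return (h * num_shards) >> 128  # h * num_shards // 2**128

# The same key always lands on the same shard, preserving its ordering.
assert shard_for("user-42", 4) == shard_for("user-42", 4)
print({k: shard_for(k, 4) for k in ["user-1", "user-2", "user-3"]})
```

This is why choosing a high-cardinality partition key matters: a handful of hot keys concentrates traffic on a few shards regardless of how many shards the stream has.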
(Diagram: many producers PUT records into the Kinesis stream's shards 1…n)
POST / HTTP/1.1
Host: kinesis.<region>.<domain>
x-amz-Date: <Date>
Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user-
agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature>
User-Agent: <UserAgentString>
Content-Type: application/x-amz-json-1.1
Content-Length: <PayloadSizeBytes>
Connection: Keep-Alive
X-Amz-Target: Kinesis_20131202.PutRecord
{
"StreamName": "exampleStreamName",
"Data": "XzxkYXRhPl8x",
"PartitionKey": "partitionKey"
}
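The request body shown above can be assembled in a few lines; the `Data` field carries the raw record bytes, base64-encoded, while signing and transport are left to an SDK. A minimal sketch:

```python
import base64
import json

def put_record_payload(stream_name, data, partition_key):
    """Build the JSON body for a Kinesis PutRecord call.

    data is raw bytes; the API requires it base64-encoded in the
    Data field. StreamName and PartitionKey are passed through.
    """
    return json.dumps({
        "StreamName": stream_name,
        "Data": base64.b64encode(data).decode("ascii"),
        "PartitionKey": partition_key,
    })

body = put_record_payload("exampleStreamName", b"_<data>_1", "partitionKey")
print(body)  # Data encodes to "XzxkYXRhPl8x", as in the request above
```

In practice an SDK call wraps exactly this payload plus the Signature Version 4 headers shown in the raw request.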
(Diagram: KCL workers on EC2 instances, one worker per Kinesis shard 1…n)
Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing
•  Key streaming application attributes:
•  Be distributed, to handle multiple shards
•  Be fault tolerant, to handle failures in hardware or software
•  Scale up and down as the number of shards increases or decreases
•  The Kinesis Client Library (KCL) helps with distributed processing:
•  Automatically starts a Kinesis worker for each shard
•  Simplifies reading from the stream by abstracting away individual shards
•  Increases/decreases Kinesis workers as the number of shards changes
•  Checkpoints to keep track of a worker's location in the stream
•  Restarts workers if they fail
•  Use the KCL with Auto Scaling groups:
•  Auto Scaling policies will restart EC2 instances if they fail
•  Automatically add EC2 instances when load increases
•  The KCL redistributes workers to use the new EC2 instances
OR
•  Use the Get APIs for raw reads of Kinesis data streams
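A minimal sketch, in plain Python with hypothetical types, of the checkpointing behavior the KCL provides: one worker per shard records the last processed sequence number, so a restarted worker resumes past it (the KCL persists this state in a DynamoDB table):

```python
def run_worker(records, checkpoints, shard_id):
    """Process a shard's records, checkpointing after each one.

    records: list of (sequence_number, payload) in shard order.
    checkpoints: dict persisting the last-done sequence number
    per shard, standing in for the KCL's checkpoint table.
    """
    done = checkpoints.get(shard_id)
    processed = []
    for seq, payload in records:
        if done is not None and seq <= done:
            continue                  # already handled before a restart
        processed.append(payload)     # "process" the record
        checkpoints[shard_id] = seq   # checkpoint progress
    return processed

shard = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
ckpt = {}
run_worker(shard[:2], ckpt, "shard-1")        # worker dies after record 2
resumed = run_worker(shard, ckpt, "shard-1")  # restart replays the shard
print(resumed)  # → ['c', 'd']
```

Note the at-least-once semantics: if the worker crashes after processing a record but before checkpointing it, that record is processed again on restart, so processing should be idempotent.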
Amazon Kinesis: Key Developer Benefits
Easy administration: Managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
Real-time performance: Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
High throughput, elastic: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
S3, EMR, Storm, Redshift, & DynamoDB integration: Reliably collect, process, and transform all of your data in real time and deliver it to the AWS data stores of your choice, with connectors for S3, Redshift, and DynamoDB.
Build real-time applications: Client libraries enable developers to design and operate real-time streaming data processing applications.
Low cost: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream and pay low hourly rates only for what you use.
Customers using Amazon Kinesis
Mobile/social gaming:
•  Deliver continuous, real-time game insight data from 100s of game servers
•  Before: custom-built solutions, operationally complex to manage and not scalable
•  Delays in critical business data delivery
•  Developer burden in building a reliable, scalable platform for real-time data ingestion/processing
•  Slow-down of real-time customer insights
•  With Kinesis: accelerate time to market of elastic, real-time applications while minimizing operational overhead
Digital advertising tech:
•  Generate real-time metrics and KPIs on online ad performance for advertisers/publishers
•  Before: store-and-forward fleet of log servers and a Hadoop-based processing pipeline
•  Lost data in the store/forward layer
•  Operational burden in managing a reliable, scalable platform for real-time data ingestion/processing
•  Batch-driven "real-time" customer insights
•  With Kinesis: generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients
Digital Ad. Tech Metering with Kinesis
Continuous ad metrics extraction → incremental ad statistics computation → metering record archive → ad analytics dashboard
Collection of Data
Sources: web servers, application servers, connected devices, mobile phones, etc.
Aggregation tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, or a queue)
Data sink: a reliable and durable destination (or destinations)
Cloud database & storage
Cloud Database and Storage Tier Anti-pattern
App/Web Tier
Client Tier
RDBMS
Database & Storage Tier = All-in-one?
Cloud Database and Storage Tier — Use the Right
Tool for the Job!
App/Web tier
Client tier
Data tier (database & storage tier): search, Hadoop/HDFS, cache, blob store, SQL, NoSQL
App/Web tier
Client tier
Database & storage tier: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, HDFS on Amazon EMR
Cloud Database and Storage Tier — Use the Right
Tool for the Job!
What Database and Storage Should I Use?
•  Data structure
•  Query complexity
•  Data characteristics: hot, warm, cold
Data Structure and Query Types vs. Storage Technology
•  Structured, simple query: NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache)
•  Structured, complex query: SQL (Amazon RDS), search (Amazon CloudSearch)
•  Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
•  Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
(Axes: data structure complexity vs. query structure complexity)
What is the Temperature of Your Data?
Hot to cold: Amazon ElastiCache → Amazon DynamoDB → Amazon RDS → Amazon CloudSearch → HDFS → Amazon S3 → Amazon Glacier
•  Request rate: high → low
•  Cost/GB: high → low
•  Latency: low → high
•  Data volume: low → high
•  Structure: low → high
What Data Store Should I Use?

|                            | Amazon ElastiCache | Amazon DynamoDB    | Amazon RDS        | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3                 | Amazon Glacier      |
| Average latency            | ms                 | ms                 | ms, sec           | ms, sec            | sec, min, hrs     | ms, sec, min (~size)      | hrs                 |
| Data volume                | GB                 | GB–TBs (no limit)  | GB–TB (3 TB max)  | GB–TB              | GB–PB (~nodes)    | GB–PB (no limit)          | GB–PB (no limit)    |
| Item size                  | B–KB               | KB (64 KB max)     | KB (~row size)    | KB (1 MB max)      | MB–GB             | KB–GB (5 TB max)          | GB (40 TB max)      |
| Request rate               | Very high          | Very high          | High              | High               | Low – very high   | Low – very high (no limit) | Very low (no limit) |
| Storage cost ($/GB/month)  | $$                 | ¢¢                 | ¢¢                | $                  | ¢                 | ¢                         | ¢                   |
| Durability                 | Low – moderate     | Very high          | High              | High               | High              | Very high                 | Very high           |

Hot data ← → Warm data ← → Cold data
Decouple your storage and analysis engine
1.  Single version of truth
2.  Choice of multiple analytics tools
3.  Parallel execution from different teams
4.  Lower cost
Learning from Netflix
S3 as a "single source of truth"
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
Amazon SQS · Amazon S3 · DynamoDB · any SQL or NoSQL store · Kinesis
Choose depending upon design
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Process
•  Answering questions about data
•  Questions:
–  Analytics: think SQL/data warehouse
–  Classification: think sentiment analysis
–  Prediction: think page-view prediction
–  Etc.
Processing Frameworks
Generally come in a few major types:
•  Batch processing
•  Stream processing
•  Interactive query
Batch Processing
•  Take a large amount of cold data and ask questions
•  Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
Stream Processing (AKA Real Time)
•  Take a small amount of hot data and ask questions
•  Takes a short amount of time to get your answer back
Example: 1-minute metrics
Processing Tools
•  Batch processing/analytic
–  Amazon Redshift
–  Amazon EMR
•  Hive/Tez, Pig, Spark, Impala, Presto, …
•  Stream processing
–  Apache Spark streaming
–  Apache Storm (+ Trident)
–  Amazon Kinesis client and
connector library
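As a sketch of the "1-minute metrics" kind of computation the stream-processing tools above produce, assuming simple (timestamp_seconds, value) events, a tumbling one-minute window aggregation in plain Python:

```python
from collections import defaultdict

def one_minute_counts(events):
    """Aggregate (timestamp_seconds, value) events into 1-minute buckets.

    A stream processor maintains such windows incrementally as
    records arrive; here the whole stream is folded at once for
    clarity.
    """
    windows = defaultdict(int)
    for ts, value in events:
        windows[ts // 60] += value   # bucket = minute since epoch
    return dict(windows)

events = [(3, 1), (45, 2), (61, 5), (119, 1), (130, 4)]
print(one_minute_counts(events))  # → {0: 3, 1: 6, 2: 4}
```

Frameworks like Spark Streaming or Storm keep the current window in memory and emit each bucket as its minute closes, which is what makes second-scale latencies possible.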
Amplab Big Data Benchmark
Scan query · Aggregate query · Join query
https://amplab.cs.berkeley.edu/benchmark/
What Batch Processing Technology Should I Use?

|                | Redshift   | Impala        | Presto        | Spark         | Hive          |
| Query latency  | Low        | Low           | Low           | Low – Medium  | Medium – High |
| Durability     | High       | High          | High          | High          | High          |
| Data volume    | 1.6 PB max | ~nodes        | ~nodes        | ~nodes        | ~nodes        |
| Managed        | Yes        | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)     |
| Storage        | Native     | HDFS          | HDFS/S3       | HDFS/S3       | HDFS/S3       |
| # of BI tools  | High       | Medium        | High          | Low           | High          |

(Query latency: low is better)
What Stream Processing Technology Should I Use?

|                       | Spark Streaming     | Apache Storm + Trident | Kinesis Client Library |
| Scale/throughput      | ~nodes              | ~nodes                 | ~nodes                 |
| Data volume           | ~nodes              | ~nodes                 | ~nodes                 |
| Manageability         | Yes (EMR bootstrap) | Do it yourself         | EC2 + Auto Scaling     |
| Fault tolerance       | Built-in            | Built-in               | KCL checkpointing      |
| Programming languages | Java, Python, Scala | Java, Scala, Clojure   | Java, Python           |
Hadoop based Analysis
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → DynamoDB / any SQL or NoSQL store
Your choice of tools on Hadoop/EMR
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → DynamoDB / any SQL or NoSQL store
Hadoop based Analysis
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → DynamoDB / any SQL or NoSQL store
Hadoop based Analysis
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → DynamoDB / any SQL or NoSQL store
Spark and Shark
Cloudera Impala
Hadoop is good for
1.  Ad Hoc Query analysis
2.  Large Unstructured Data Sets
3.  Machine Learning and Advanced Analytics
4.  Schema less
SQL based Low Latency Analytics on
structured data
SQL based processing
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon Redshift (petabyte-scale columnar data warehouse) → DynamoDB / any SQL or NoSQL store
SQL based processing for unstructured data
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR (pre-processing framework) → Amazon Redshift (petabyte-scale columnar data warehouse) → DynamoDB / any SQL or NoSQL store
Your choice of BI Tools on the cloud
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR (pre-processing framework) → Amazon Redshift → DynamoDB / any SQL or NoSQL store
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing insights
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → DynamoDB / any SQL or NoSQL store
Sharing results and visualizations
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → web app server → visualization tools
Sharing results and visualizations and scale
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → web app server → visualization tools
Sharing results and visualizations
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → business intelligence tools
Geospatial Visualizations
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR (GIS tools on Hadoop) → Amazon Redshift → business intelligence tools / GIS tools / visualization tools
Rinse and Repeat
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → visualization tools / business intelligence tools / GIS tools, orchestrated by AWS Data Pipeline
The complete architecture
Log aggregation tools → Amazon SQS → Amazon S3 → Amazon EMR → Amazon Redshift → visualization tools / business intelligence tools / GIS tools, orchestrated by AWS Data Pipeline
Reference: BDT403, Next Generation Big Data Platform @ Netflix
Architecture
Big Data
• 10+ PB DW on S3
• 1.2 PB read daily
• 100 TB written daily
• ~ 200 billion events daily
(Netflix data pipeline: cloud apps → Suro → Ursula for event data, every 15 min; Cassandra → Aegisthus for dimension data via SSTables, daily; both land in Amazon S3)
Data Pipelines
@2013: Amazon S3 as storage; compute, service, and tools layered on top
@2014 (v2.0): Amazon S3 as storage; compute, service, and tools layered on top
Self-paced online labs and training
Learn the basics of using AWS, and how to apply it, through a variety of online courses and hands-on labs.
Instructor-led training
Learn how to build highly available, cost-efficient, secure applications on the AWS cloud in classes led by AWS expert instructors. A range of in-person courses on architecture design and implementation is available.
AWS Certification
Validate your cloud expertise and experience with a certification exam and present it as part of your professional credentials.
http://aws.amazon.com/ko/training
A variety of training programs
Thank you for joining the AWS webinar series!
We hope this webinar helped answer your questions.
Please share your feedback on today's webinar in the survey that follows.
aws-korea-marketing@amazon.com
http://twitter.com/AWSKorea
http://facebook.com/AmazonWebServices.ko
http://youtube.com/user/AWSKorea
http://slideshare.net/AWSKorea

AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015

  • 1.
    심화 웨비나 시리즈| 8 번째 강연 2015년 7월 9일 목요일 | 오후 2시 http://aws.amazon.com/ko AWS를 활용한 첫 빅데이터 프로젝트 시작하기  
  • 2.
  • 3.
    이번 웨비나 에서들으실 내용.. 이 강연에서는 AWS Elastic MapReduce, Amazon Redshift, Amazon Kinesis 등 AWS가 제공하는 다양한 데이터 분석 도 구를 활용해 보다 간편하고 빠른 빅데이터 분석 서비스를 구 축하는 방법에 대해 소개합니다.
  • 4.
    v   Agenda •  AWSBig data building blocks •  AWS Big data platform •  Log data collection & storage •  Introducing Amazon Kinesis •  Data Analytics & Computation •  Collaboration & sharing •  Netflix Use-case
  • 5.
    AWS Big databuilding blocks (brief)
  • 6.
    Use the righttools Amazon S3 Amazon Kinesis Amazon DynamoDB Amazon Redshift Amazon Elastic MapReduce
  • 7.
  • 8.
    Real-time processing High throughput;elastic Easy to use EMR, S3, Redshift, DynamoDB Integrations Amazon Kinesis
  • 9.
    NoSQL Database Seamless scalability Zeroadmin Single digit millisecond latency Amazon DynamoDB
  • 10.
    Relational data warehouse Massivelyparallel Petabyte scale Fully managed $1,000/TB/Year Amazon Redshift
  • 11.
    Hadoop/HDFS clusters Hive, Pig,Impala, Hbase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
  • 12.
    HDFS Amazon RedShift Amazon RDS Amazon S3 Amazon DynamoDB AmazonEMR Amazon Kinesis AWS  Data  Pipeline   Data  management   Hadoop  Ecosystem  analy8cal  tools   Data   Sources   AWS Data Pipeline
  • 13.
    v   Generation Collection &storage Analytics & computation Collaboration & sharing
  • 14.
    v   a   Amazon DynamoDB Amazon RDS Amazon Redshift AWS DirectConnect AWS Storage Gateway AWS Import/ Export Amazon Glacier S3 Amazon Kinesis Amazon EMR Generation Collection & storage Analytics & computation Collaboration & sharing
  • 15.
    v   Amazon  EC2   Amazon  EMR   Amazon Kinesis Generation Collection & storage Analytics & computation Collaboration & sharing
  • 16.
    v   Amazon Redshift Amazon DynamoDB Amazon     RDS   S3 Amazon  EC2   Amazon  EMR   Amazon   CloudFront   AWS   CloudForma8on   AWS    Data  Pipeline   Generation Collection & storage Analytics & computation Collaboration & sharing
  • 17.
    The right tools. Atthe right scale. At the right time.
  • 18.
    AWS Big dataplatform
  • 19.
    v   Generation Collection &storage Analytics & computation Collaboration & sharing
  • 20.
    v   Generation Collection &storage Analytics & computation Collaboration & sharing
  • 21.
    v   Collection ofData Sources   Aggrega8on   Tool   Data  Sink   Web  Servers   Applica8on  servers   Connected  Devices   Mobile  Phones   Etc   Scalable  method  to   collect  and  aggregate   Flume,  KaGa,  Kinesis,   Queue   Reliable  and  durable   des8na8on  OR   Des8na8ons    
  • 22.
    Types of DataIngest •  Transactional –  Database reads/ writes •  File –  Click-stream logs •  Stream –  Click-stream logs Database   Cloud   Storage   Stream   Storage  
  • 23.
    Run your ownlog collector Your  applica0on   Amazon S3 DynamoDB   Any  other  data   store   Amazon S3 Amazon  EC2    
  • 24.
    Use a Queue Amazon  Simple   Queue  Service   (SQS)   Amazon S3 DynamoDB   Any  other  data   store  
  • 25.
    Agency Customer: VideoAnalytics on AWS Elastic Load Balancer Edge Servers on EC2 Workers on EC2 Logs Reports HDFS Cluster Amazon Simple Queue Service (SQS) Amazon Simple Storage Service (S3) Amazon Elastic MapReduce
  • 26.
    Use a Toollike FLUME, KAFKA, HONU etc Flume running on EC2 Amazon S3 Any  other  data   store   HDFS
  • 27.
    v   Choice oftools •  (+) Pros / (-) Cons •  (+) Flexibility: Customers select the most appropriate software and underlying infrastructure •  (+) Control: Software and hardware can be tuned to meet specific business and scenario needs. •  (-) Ongoing Operational Complexity: Deploy, and manage an end-to-end system •  (-) Infrastructure planning and maintenance: Managing a reliable, scalable infrastructure •  (-) Developer/ IT staff overhead: Developers, Devops and IT staff time and energy expended •  (-) Unsupported Software: deprecated and/ pre-version 1 open source software •  Future – Need for to stream data for real time
  • 28.
    Stream   Storage   Database   Cloud   Storage  
  • 29.
    29 Why Stream Storage? • Convert multiple streams into fewer persistent sequential streams •  Sequential streams are easier to process Amazon  Kinesis  or  KaGa   4 4 3 3 2 2 1 14 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 Shard  or  Par88on  1   Shard  or  Par88on  2   Producer  1   Producer  2   Producer  3   Producer  N  
  • 30.
    30 Amazon  Kinesis  or  KaGa   Why Stream Storage? • Decouple producers and consumers • Buffer • Preserve client ordering • Streaming MapReduce • Consumer replay / reprocess 4 4 3 3 2 2 1 14 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 Producer  1   Shard  or  Par88on  1   Shard  or  Par88on  2   Consumer  1   Count  of   Red  =  4   Count  of   Violet  =  4   Consumer  2   Count  of   Blue  =  4   Count  of   Green  =  4   Producer  2   Producer  3   Producer  N  
  • 31.
  • 32.
     Data   Sources   App.4     [Machine   Learning]                                       AWS  Endpoint   App.1     [Aggregate  &   De-­‐Duplicate]    Data   Sources   Data   Sources    Data   Sources   App.2     [Metric   Extrac0on]   S3 DynamoDB   Redshift App.3   [Sliding   Window   Analysis]    Data   Sources   Availability Zone Shard  1   Shard  2   Shard  N   Availability Zone Availability Zone Introducing Amazon Kinesis Managed Service for Real-Time Processing of Big Data EMR
  • 33.
    Kinesis Architecture Amazon WebServices AZ AZ AZ Durable, highly consistent storage replicates data across three data centers (availability zones) Aggregate and archive to S3 Millions of sources producing 100s of terabytes per hour Front End Authentication Authorization Ordered stream of events supports multiple readers Real-time dashboards and alarms Machine learning algorithms or sliding window analytics Aggregate analysis in Hadoop or a data warehouse Inexpensive: $0.028 per million puts
  • 34.
    Putting data intoKinesis Managed Service for Ingesting Fast Moving Data •  Streams  are  made  of  Shards   ⁻  A  Kinesis  Stream  is  composed  of  mul8ple  Shards     ⁻  Each  Shard  ingests  up  to  1MB/sec  of  data,  and  up  to  1000  TPS   ⁻  Each  Shard  emits  up  to  2  MB/sec  of  data   ⁻  All  data  is  stored  for  24  hours   ⁻  You  scale  Kinesis  streams  by  adding  or  removing  Shards   •  Simple  PUT  interface  to  store  data  in  Kinesis   ⁻  Producers  use  a  PUT  call  to  store  data  in  a  Stream   ⁻  A  Par00on  Key  is  used  to  distribute  the  PUTs  across  Shards   ⁻  A  unique  Sequence  #  is  returned  to  the  Producer  upon  a  successful   PUT  call   Producer Shard 1 Shard 2 Shard 3 Shard n Shard 4 Producer Producer Producer Producer Producer Producer Producer Producer Kinesis
  • 35.
    POST / HTTP/1.1 Host:kinesis.<region>.<domain> x-amz-Date: <Date> Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user- agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature> User-Agent: <UserAgentString> Content-Type: application/x-amz-json-1.1 Content-Length: <PayloadSizeBytes> Connection: Keep-Alive X-Amz-Target: Kinesis_20131202.PutRecord { "StreamName": "exampleStreamName", "Data": "XzxkYXRhPl8x", "PartitionKey": "partitionKey" }
  • 36.
    v   Shard 1 Shard2 Shard 3 Shard n Shard 4 KCL Worker 1 KCL Worker 2 EC2 Instance KCL Worker 3 KCL Worker 4 EC2 Instance KCL Worker n EC2 Instance Kinesis Building Kinesis Apps Client library for fault-tolerant, at least-once, real-time processing •  Key streaming application attributes: •  Be distributed, to handle multiple shards •  Be fault tolerant, to handle failures in hardware or software •  Scale up and down as the number of shards increase or decrease •  Kinesis Client Library (KCL) helps with distributed processing: •  Automatically starts a Kinesis Worker for each shard •  Simplifies reading from the stream by abstracting individual shards •  Increases / Decreases Kinesis Workers as # of shards changes •  Checkpoints to keep track of a Worker’s location in the stream •  Restarts Workers if they fail •  Use the KCL with Auto Scaling Groups •  Auto Scaling policies will restart EC2 instances if they fail •  Automatically add EC2 instances when load increases •  KCL will redistributes Workers to use the new EC2 instances OR •  Use the Get APIs for raw reads of Kinesis data streams
  • 37.
    37 Easy  Administra0on       Managed  service  for  real-­‐8me  streaming  data   collec8on,  processing  and  analysis.  Simply   create  a  new  stream,  set  the  desired  level  of   capacity,  and  let  the  service  handle  the  rest.         Real-­‐0me  Performance         Perform  con8nual  processing  on  streaming   big  data.  Processing  latencies  fall  to  a  few   seconds,  compared  with  the  minutes  or  hours   associated  with  batch  processing.             High  Throughput.  Elas0c         Seamlessly  scale  to  match  your  data   throughput  rate  and  volume.  You  can  easily   scale  up  to  gigabytes  per  second.  The  service   will  scale  up  or  down  based  on  your   opera8onal  or  business  needs.       S3,  EMR,  Storm,  RedshiY,  &  DynamoDB   Integra0on       Reliably  collect,  process,  and  transform  all  of   your  data  in  real-­‐8me  &  deliver  to  AWS  data   stores  of  choice,  with  Connectors  for  S3,   Redshi],  and  DynamoDB.           Build  Real-­‐0me  Applica0ons       Client  libraries  that  enable  developers  to   design  and  operate  real-­‐8me  streaming  data   processing  applica8ons.                   Low  Cost       Cost-­‐efficient  for  workloads  of  any  scale.  You   can  get  started  by  provisioning  a  small   stream,  and  pay  low  hourly  rates  only  for   what  you  use.               Amazon Kinesis: Key Developer Benefits
Customers using Amazon Kinesis
•  Mobile/Social Gaming: Deliver continuous, real-time game insight data from 100s of game servers. Previously, custom-built solutions were operationally complex to manage and not scalable:
   •  Delays in critical business data delivery
   •  Developer burden in building a reliable, scalable platform for real-time data ingestion/processing
   •  Slow-down of real-time customer insights
   With Kinesis: accelerate time to market of elastic, real-time applications while minimizing operational overhead.
•  Digital Advertising Tech.: Generate real-time metrics and KPIs for online ad performance for advertisers/publishers. Previously, a store-and-forward fleet of log servers and a Hadoop-based processing pipeline:
   •  Lost data in the store/forward layer
   •  Operational burden in managing a reliable, scalable platform for real-time data ingestion/processing
   •  Batch-driven "real-time" customer insights
   With Kinesis: generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients.
Digital Ad. Tech Metering with Kinesis
[Diagram: Continuous Ad Metrics Extraction → Incremental Ad Statistics Computation → Metering Record Archive → Ad Analytics Dashboard]
Collection of Data
•  Data Sources: web servers, application servers, connected devices, mobile phones, etc.
•  Aggregation Tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, queues)
•  Data Sink: a reliable and durable destination, or destinations
Cloud Database and Storage Tier Anti-pattern
[Diagram: Client Tier → App/Web Tier → a single RDBMS]
Database & Storage Tier = All-in-one?
Cloud Database and Storage Tier: Use the Right Tool for the Job!
[Diagram: Client Tier → App/Web Tier → Data Tier (Database & Storage Tier): Search, Hadoop/HDFS, Cache, Blob Store, SQL, NoSQL]
Cloud Database and Storage Tier: Use the Right Tool for the Job!
[Diagram: Client Tier → App/Web Tier → Database & Storage Tier: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, HDFS on Amazon EMR]
What Database and Storage Should I Use?
•  Data structure
•  Query complexity
•  Data characteristics: hot, warm, cold
Data Structure and Query Types vs. Storage Technology
(Axes: data structure complexity vs. query structure complexity)
•  Structured, simple query: NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache)
•  Structured, complex query: SQL (Amazon RDS), search (Amazon CloudSearch)
•  Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
•  Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
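The matrix above can be read as a lookup from (data structure, query type) to a storage technology. A toy sketch of that lookup (the function and its encoding are our illustration of the slide, not AWS guidance):

```python
def suggest_store(structured: bool, query: str) -> str:
    """Mirror the structure/query matrix: pick a storage technology
    from whether the data is structured and what queries it needs.
    Real choices also weigh cost, latency, and data volume."""
    table = {
        (True,  "simple"):  "Amazon DynamoDB (NoSQL) / Amazon ElastiCache (cache)",
        (True,  "complex"): "Amazon RDS (SQL) / Amazon CloudSearch (search)",
        (False, "none"):    "Amazon S3 / Amazon Glacier (cloud storage)",
        (False, "custom"):  "Amazon Elastic MapReduce (Hadoop/HDFS)",
    }
    return table[(structured, query)]

# Unstructured logs that need custom analysis land on EMR:
print(suggest_store(False, "custom"))  # → Amazon Elastic MapReduce (Hadoop/HDFS)
```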
What is the Temperature of Your Data?
[Spectrum, hot → cold: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, HDFS → Amazon S3 → Amazon Glacier]
•  Request rate: high → low
•  Cost/GB: high → low
•  Latency: low → high
•  Data volume: low → high
•  Structure: low → high
What Data Store Should I Use?
•  Amazon ElastiCache (hot data): average latency ms; data volume GB; item size B–KB; request rate very high; storage cost $$/GB/month; durability low–moderate
•  Amazon DynamoDB (hot data): latency ms; volume GB–TB (no limit); item size KB (64 KB max); request rate very high; cost ¢¢; durability very high
•  Amazon RDS (warm data): latency ms–sec; volume GB–TB (3 TB max); item size KB (~row size); request rate high; cost ¢¢; durability high
•  Amazon CloudSearch (warm data): latency ms–sec; volume GB–TB; item size KB (1 MB max); request rate high; cost $; durability high
•  Amazon EMR (HDFS) (warm data): latency sec–min–hrs; volume GB–PB (~nodes); item size MB–GB; request rate low–very high; cost ¢; durability high
•  Amazon S3 (warm/cold data): latency ms–sec–min (~size); volume GB–PB (no limit); item size KB–GB (5 TB max); request rate low–very high (no limit); cost ¢; durability very high
•  Amazon Glacier (cold data): latency hrs; volume GB–PB (no limit); item size GB (40 TB max); request rate very low (no limit); cost ¢; durability very high
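The hot/warm/cold framing above suggests routing data to a tier by its age and access rate. A toy router in that spirit (the thresholds are illustrative assumptions of ours, not AWS guidance):

```python
def storage_tier(age_days: float, reads_per_day: float) -> str:
    """Route data to a temperature tier: hot data goes to low-latency
    stores, cold data to cheap archival storage. Thresholds here are
    arbitrary examples chosen for illustration."""
    if age_days < 1 and reads_per_day > 1000:
        return "hot: Amazon ElastiCache / DynamoDB"
    if age_days < 30:
        return "warm: Amazon RDS / EMR (HDFS)"
    return "cold: Amazon S3 / Glacier"

# Fresh, heavily read data is hot; year-old data is archived:
print(storage_tier(0.5, 5000))  # → hot: Amazon ElastiCache / DynamoDB
print(storage_tier(365, 0))     # → cold: Amazon S3 / Glacier
```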
Decouple Your Storage and Analysis Engine (learning from Netflix)
1.  Single version of truth
2.  Choice of multiple analytics tools
3.  Parallel execution from different teams
4.  Lower cost
S3 as a "single source of truth"
Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
Choose Depending Upon Design
[Diagram: data sources feeding Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, or Kinesis]
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Process: Answering Questions About Data
•  Questions:
   •  Analytics: think SQL/data warehouse
   •  Classification: think sentiment analysis
   •  Prediction: think page-view prediction
   •  Etc.
Processing Frameworks
Generally come in three major types:
•  Batch processing
•  Stream processing
•  Interactive query
Batch Processing
•  Take a large amount of cold data and ask questions
•  Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
Stream Processing (AKA Real Time)
•  Take a small amount of hot data and ask questions
•  Takes a short amount of time to get your answer back
Example: 1-minute metrics
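The "1-minute metrics" example above amounts to a tumbling-window aggregation: bucket incoming events into fixed 60-second windows and count per window. A minimal sketch of just the windowing logic, independent of any streaming framework (function and event names are illustrative):

```python
from collections import defaultdict

def one_minute_counts(events):
    """Tumbling-window count: group (timestamp_sec, metric) events into
    60-second windows, the shape of output a stream processor such as
    Spark Streaming or a KCL application would emit continuously."""
    windows = defaultdict(int)
    for ts, metric in events:
        windows[(ts // 60, metric)] += 1  # window index = ts // 60
    return dict(windows)

events = [(3, "page_view"), (59, "page_view"), (61, "page_view")]
print(one_minute_counts(events))
# → {(0, 'page_view'): 2, (1, 'page_view'): 1}
```

In a real stream the same logic runs incrementally, emitting each window's counts seconds after the window closes, rather than over a finished list.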
Processing Tools
•  Batch processing/analytics
   –  Amazon Redshift
   –  Amazon EMR: Hive/Tez, Pig, Spark, Impala, Presto, …
•  Stream processing
   –  Apache Spark Streaming
   –  Apache Storm (+ Trident)
   –  Amazon Kinesis client and connector libraries
AMPLab Big Data Benchmark
[Charts: scan query, aggregate query, join query]
https://amplab.cs.berkeley.edu/benchmark/
What Batch Processing Technology Should I Use?
•  Redshift: query latency low; durability high; data volume 1.6 PB max; managed: yes; storage: native; # of BI tools: high
•  Impala: query latency low; durability high; data volume ~nodes; managed: EMR bootstrap; storage: HDFS; # of BI tools: medium
•  Presto: query latency low; durability high; data volume ~nodes; managed: EMR bootstrap; storage: HDFS/S3; # of BI tools: high
•  Spark: query latency low–medium; durability high; data volume ~nodes; managed: EMR bootstrap; storage: HDFS/S3; # of BI tools: low
•  Hive: query latency medium–high; durability high; data volume ~nodes; managed: yes (EMR); storage: HDFS/S3; # of BI tools: high
What Stream Processing Technology Should I Use?
•  Spark Streaming: scale/throughput ~nodes; data volume ~nodes; manageability: yes (EMR bootstrap); fault tolerance: built-in; languages: Java, Python, Scala
•  Apache Storm + Trident: scale/throughput ~nodes; data volume ~nodes; manageability: do it yourself; fault tolerance: built-in; languages: Java, Scala, Clojure
•  Kinesis Client Library: scale/throughput ~nodes; data volume ~nodes; manageability: EC2 + Auto Scaling; fault tolerance: KCL checkpointing; languages: Java, Python
Hadoop-based Analysis
[Diagram: log aggregation tools feeding Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store, analyzed with Amazon EMR]
Your Choice of Tools on Hadoop/EMR
[Diagram: the same pipeline, with Amazon EMR running your choice of Hadoop ecosystem tools]
Hadoop-based Analysis
[Diagram: the same pipeline, with Spark and Shark or Cloudera Impala running on Amazon EMR]
Hadoop is Good For
1.  Ad hoc query analysis
2.  Large unstructured data sets
3.  Machine learning and advanced analytics
4.  Schema-less data
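The points above rest on the MapReduce model: a map phase emits key-value pairs from each input record, a shuffle groups the pairs by key, and a reduce phase aggregates each group. A toy single-process word count illustrating the model (real jobs run the same three steps distributed across an EMR cluster):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit (word, 1) for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle_and_reduce(pairs):
    """Shuffle: sort/group pairs by key. Reduce: sum each group's counts."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data on aws", "big data tools"]
mapped = [pair for line in lines for pair in map_phase(line)]
print(shuffle_and_reduce(mapped))
# → {'aws': 1, 'big': 2, 'data': 2, 'on': 1, 'tools': 1}
```

Because the map phase imposes structure at read time, the input needs no schema up front, which is exactly why Hadoop suits large unstructured data sets.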
SQL-based Low-Latency Analytics on Structured Data
SQL-based Processing
[Diagram: the pipeline as before, now analyzed with Amazon Redshift, a petabyte-scale columnar data warehouse]
SQL-based Processing for Unstructured Data
[Diagram: Amazon EMR as a pre-processing framework feeding Amazon Redshift, the petabyte-scale columnar data warehouse]
Your Choice of BI Tools on the Cloud
[Diagram: BI tools on top of Amazon EMR (pre-processing framework) and Amazon Redshift]
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing Insights
[Diagram: the pipeline as before, with Amazon EMR and Amazon Redshift outputs ready for sharing]
Sharing Results and Visualizations
[Diagram: a web app server and visualization tools on top of Amazon EMR and Amazon Redshift]
Sharing Results and Visualizations at Scale
[Diagram: the same architecture, with the web app server and visualization tools scaled out]
Sharing Results and Visualizations
[Diagram: business intelligence tools querying Amazon EMR and Amazon Redshift]
Geospatial Visualizations
[Diagram: GIS tools on Hadoop, GIS tools, and visualization tools added alongside the BI tools]
Rinse and Repeat
[Diagram: AWS Data Pipeline orchestrating the full flow across Amazon EMR, Amazon Redshift, and the visualization, BI, and GIS tools]
The Complete Architecture
[Diagram: log aggregation tools → Amazon SQS / S3 / DynamoDB / any SQL or NoSQL store → Amazon EMR and Amazon Redshift → visualization, BI, and GIS tools, orchestrated by AWS Data Pipeline]
Reference: BDT403, Next Generation Big Data Platform @ Netflix
Big Data
•  10+ PB DW on S3
•  1.2 PB read daily
•  100 TB written daily
•  ~200 billion events daily
Data Pipelines
[Diagram: cloud apps → Suro → Ursula (event data, every 15 min) → Amazon S3; Cassandra → Aegisthus (dimension data from SSTables, daily) → Amazon S3]
@2013
[Diagram: the platform as of 2013 - Amazon S3 as the storage layer, with compute, service, and tools layers above]
@2014 (v2.0)
[Diagram: the platform as of 2014 - Amazon S3 still the storage layer, with updated compute, service, and tools layers]
A Variety of Training Programs
•  Online self-study and labs: learn the basics of using AWS through a wide range of online course materials and hands-on labs.
•  Instructor-led training: learn how to build highly available, cost-efficient, and secure applications on the AWS cloud in classes led by AWS expert instructors. A variety of offline courses on architecture design and implementation are available.
•  AWS Certification: validate your cloud expertise and experience with a certification exam, and present it as part of your professional credentials.
http://aws.amazon.com/ko/training
Thank you for joining the AWS webinar series!
We hope this webinar helped answer your questions. Please let us know what you thought of today's session in the survey that follows.
aws-korea-marketing@amazon.com
http://twitter.com/AWSKorea
http://facebook.com/AmazonWebServices.ko
http://youtube.com/user/AWSKorea
http://slideshare.net/AWSKorea