AWS re:Invent re:Cap - Data Analytics: Amazon EC2 C4 Instance + Amazon EBS - Kim Ilho (김일호)

December 10, 2014 | Korea
Kim Ilho (김일호), Solutions Architect

BDT201 - Big Data and HPC State of the Union
BDT202 - HPC Now Means 'High Personal Computing'
BDT203 - From Zero to NoSQL Hero: Amazon DynamoDB Tutorial
BDT204 - Rendering a Seamless Satellite Map of the World with AWS and NASA Data
BDT205 - Your First Big Data Application on AWS
BDT206 - See How Amazon Redshift is Powering Business Intelligence in the Enterprise
BDT207 - Use Streaming Analytics to Exploit Perishable Insights
BDT208 - Finding High Performance in the Cloud for HPC
BDT209 - Intel’s Healthcare Cloud Solution Using Wearables for Parkinson’s Disease Research
BDT302 - Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift
BDT303 - Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift
BDT305 - Lessons Learned and Best Practices for Running Hadoop on AWS
BDT306 - Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesis
BDT307 - Running NoSQL on Amazon EC2
BDT308 - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse
BDT308-JT - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse - Japanese Track
BDT309 - Delivering Results with Amazon Redshift, One Petabyte at a Time
BDT309-JT - Delivering Results with Amazon Redshift, One Petabyte at a Time - Japanese Track
BDT310 - Big Data Architectural Patterns and Best Practices on AWS
BDT311 - MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads
BDT312 - Using the Cloud to Scale from a Database to a Data Platform
BDT401 - Big Data Orchestra - Harmony within Data Analysis Tools
BDT402 - Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm
BDT403 - Netflix's Next Generation Big Data Platform

Collect: AWS Direct Connect, AWS Import/Export, Amazon Kinesis
Store: S3, Glacier, DynamoDB
Process & Analyze: EMR, Redshift, EC2
Automate: AWS Data Pipeline

[Diagram components: log aggregation tools; Amazon SQS; Store (Amazon S3, DynamoDB, any SQL or NoSQL); Amazon EMR; Amazon Redshift; visualization tools; business intelligence tools; GIS tools; AWS Data Pipeline]

[Diagram components: Log4J; Amazon Kinesis; EMR-Kinesis Connector; Hive with Amazon S3; Amazon Redshift parallel COPY from Amazon S3; Amazon Kinesis processing state]

Launch a 3-instance Hadoop 2.4 cluster with Hive installed (m3.xlarge instances; substitute YOUR-AWS-REGION and YOUR-AWS-SSH-KEY):

Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 2
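
For readers who would rather script this step, the same request can be sketched in Python. This is a minimal sketch that assumes the boto3 SDK is installed and AWS credentials are already configured; the helper names `create_stream_params` and `create_stream` are illustrative and do not come from the slides.

```python
def create_stream_params(stream_name, shard_count):
    """Build the keyword arguments for the Kinesis CreateStream call."""
    if shard_count < 1:
        raise ValueError("a stream needs at least one shard")
    return {"StreamName": stream_name, "ShardCount": shard_count}


def create_stream(region, stream_name="AccessLogStream", shard_count=2):
    """Create the stream and block until it exists (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without the SDK

    client = boto3.client("kinesis", region_name=region)
    client.create_stream(**create_stream_params(stream_name, shard_count))
    # The stream starts in CREATING state; wait before writing to it.
    client.get_waiter("stream_exists").wait(StreamName=stream_name)


if __name__ == "__main__":
    print(create_stream_params("AccessLogStream", 2))
```

Calling `create_stream("YOUR-AWS-REGION")` would issue the same request as the CLI command above.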

YOUR-IAM-ACCESS-KEY YOUR-IAM-SECRET-KEY

Placeholders: YOUR-AWS-SSH-KEY, YOUR-EMR-MASTER-PRIVATE-DNS, YOUR-EMR-HOSTNAME
Start Hive:
hive

Placeholders: YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY, YOUR-AWS-REGION

hive>
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");

-- return the first row in the stream
hive>
-- return a count of all items in the stream
hive>
-- return a count of all rows with a given host
hive>

http://127.0.0.1:19026/cluster
http://127.0.0.1:19101

hive> YOUR-S3-BUCKET/emroutput

-- set up Hive's "dynamic partitioning"
-- splits output files when writing to Amazon S3
hive>

-- compress output files on Amazon S3 using Gzip
hive>

-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
hive>
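
The Hive statement for the timestamp conversion is not preserved here, so the following is only a Python sketch of the same idea, not the query from the slides. It parses an Apache access-log timestamp (the sample value is taken from the confirmation record later in this deck), converts it to a UNIX timestamp, and derives an hourly partition label. The function names and the `YYYY-MM-DD-HH` label format are assumptions made for illustration.

```python
from datetime import datetime

# Apache common/combined log timestamp, e.g. 08/Nov/2014:21:59:54 -0500
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def apache_to_unix(timestamp):
    """Convert an Apache log timestamp to seconds since the UNIX epoch."""
    return int(datetime.strptime(timestamp, APACHE_TIME_FORMAT).timestamp())


def hour_partition(timestamp):
    """Return an hourly partition label using the hour written in the log line."""
    return datetime.strptime(timestamp, APACHE_TIME_FORMAT).strftime("%Y-%m-%d-%H")


if __name__ == "__main__":
    sample = "08/Nov/2014:21:59:54 -0500"
    print(apache_to_unix(sample))   # 1415501994
    print(hour_partition(sample))   # 2014-11-08-21
```

The label keeps the hour as written in the log line (local to the logged UTC offset); partitioning by UTC hour instead would be an equally reasonable choice.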

[Diagram components: Log4J; Hive with Amazon S3]

# using the PostgreSQL CLI
YOUR-REDSHIFT-ENDPOINT
Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support:
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J

Placeholders: YOUR-S3-BUCKET, YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY

-- show all requests from a given IP address
-- count all requests on a given day
-- show all requests referred from other sites

[Diagram components: Log4J; Hive with Amazon S3; Amazon Redshift parallel COPY from Amazon S3]

hive>

-- Create an external table on Amazon S3
-- to hold query results.
-- Partition (split files on Amazon S3) by iteration
hive> YOUR-S3-BUCKET

-- set up a first iteration
-- create OS-ERROR_COUNT result (404 error codes) under dynamic partition 0

-- set up a second iteration over the data in the Kinesis Stream
-- create OS-ERROR_COUNT result under dynamic partition 1.
-- if file is empty, the previous iteration read all remaining stream data

Placeholders: YOUR-S3-BUCKET, YOUR-PREFIX.gz

DataXu Records
tx_id: "AFTfN0uAWZ"
exchange: "APPNEXUS"
request_id: "bb656107-3bf7-47a7-8548-8229563e9dc9"
….
adslot: {slot_id: "2686449714718898993", uuid: "9d2403f1-fc6c-4d38-b6b1-839fe4b42455", price_micro_cpm: 661385, currency: "USD", seat_id: "12-914", campaign_id: "C0513n7", creative_id: "R53a537"}
…
time_stamp: 1415393474434
serviced_by_host: "cr02.us-east-01"
Confirmation Record
[- 69.120.26.172 - - [08/Nov/2014:21:59:54 -0500] "GET /rs?id=fc6f2106175a43df8ae4f3b7e6fa8c37&t=marketing&cbust=1415502000191662 HTTP/1.1" 302 - "http://ads-by.madadsmedia.com/tags/25628/10217/iframe/728x90.html" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)" "wfivefivec=c876d00e-1831-4eba-b78d-cd99188e951a" "OWW=-"
Fraud Record

[Architecture diagram, by stage]
Producers: CDN, Real-time Bidding, Retargeting Platform
Aggregator: Amazon Kinesis
Continuous Processing: KCL Apps, Archiver, Event Replay
Storage: Amazon S3
Analytics + Reporting: Qubole, Redshift, Real Time Apps

Pipeline stages: Client/Sensor, Aggregator, Continuous Processing, Storage, Analytics + Reporting

https://github.com/awslabs/kinesis-log4j-appender

Amazon Kinesis storage is replicated across Availability Zones
[Diagram labels: Amazon Web Services; AZ, AZ, AZ]
Millions of sources producing 100s of terabytes per hour
Front End: Authentication, Authorization
Durable, highly consistent storage replicates data across three data centers (availability zones)
Ordered stream of events supports multiple readers:
• Real-time dashboards and alarms
• Machine learning algorithms or sliding window analytics
• Aggregate analysis in Hadoop or a data warehouse
• Aggregate and archive to S3
Inexpensive: $0.028 per million puts

[Chart: 1KB messages/sec (0 to 1,200,000) plotted against number of shards (0 to 1,100)]
TCO for an average of 1M events/second:
with 50:1 packing and 10:1 compression: $6,351/month
raw: $28,610/month
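
To see why packing and compression cut the cost so sharply, it helps to estimate how many shards each case needs. The sketch below assumes the per-shard write limits of 1 MB/sec and 1,000 records/sec, treats 1 KB as 1,024 bytes, and uses an illustrative function name; it does not attempt to reproduce the dollar figures above, which depend on pricing details not shown on the slide.

```python
import math

# Assumed per-shard write limits (not stated on the slide).
SHARD_BYTES_PER_SEC = 1024 * 1024
SHARD_RECORDS_PER_SEC = 1000


def shards_needed(events_per_sec, event_bytes, packing=1, compression=1.0):
    """Estimate shards for a write load.

    packing: events bundled into one Kinesis record.
    compression: size reduction ratio applied to each bundled record.
    """
    records_per_sec = events_per_sec / packing
    record_bytes = event_bytes * packing / compression
    by_records = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    by_bytes = math.ceil(records_per_sec * record_bytes / SHARD_BYTES_PER_SEC)
    return max(by_records, by_bytes, 1)


if __name__ == "__main__":
    print(shards_needed(1_000_000, 1024))                              # 1000
    print(shards_needed(1_000_000, 1024, packing=50, compression=10))  # 98
```

Under these assumptions the raw stream is limited by the record rate, while the packed and compressed stream is limited by bandwidth and needs roughly a tenth as many shards.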

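[Diagram: an Amazon Kinesis shard (Shard-i) holding events with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. One table with columns Shard ID, Lock, and Seq num records which host (Host A or Host B) holds the lock on Shard-i and the sequence number reached. A second table with columns Shard ID and Last Archived records archiver progress.]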
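
The Lock and Seq num columns in this diagram point at a checkpointing scheme: one host holds the lock on a shard, records the last sequence number it handled, and another host can resume from that point. The following is a hypothetical in-memory sketch of that idea only. It is not the Kinesis Client Library API, and the `Lease` class and `process` function are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """One row of a hypothetical lock table: shard, owner, checkpoint."""
    shard_id: str
    owner: Optional[str] = None
    checkpoint: int = 0  # last sequence number fully processed


def process(lease, worker, events, limit=None):
    """Take the lock, handle events after the checkpoint, checkpoint each one.

    limit simulates a worker that stops (or fails) after that many events.
    """
    lease.owner = worker
    handled = []
    for seq in events:
        if seq <= lease.checkpoint:
            continue  # already processed by an earlier owner
        if limit is not None and len(handled) == limit:
            break
        handled.append(seq)
        lease.checkpoint = seq
    return handled


if __name__ == "__main__":
    shard_events = [2, 3, 5, 8, 10, 14, 17, 18, 21, 23]
    lease = Lease("Shard-i")
    print(process(lease, "Host A", shard_events, limit=5))  # [2, 3, 5, 8, 10]
    print(process(lease, "Host B", shard_events))           # [14, 17, 18, 21, 23]
```

Because the checkpoint lives outside the worker, Host B picks up after sequence number 10 without reprocessing earlier events.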

• Unordered processing
  – Randomize the partition key to distribute events over many shards and use multiple workers.
• Exact order processing
  – Control the partition key to ensure events are grouped onto the same shard and read by the same worker.
• Need both? Get a global sequence number.
[Diagram labels: Producer; Get Global Sequence; Get Event Metadata; Unordered Stream; Campaign Centric Stream; Fraud Inspection Stream]
Id   Event          Stream – partition key
1    confirmation   Campaign-centric stream – UUID
2    fraud          Unordered Stream, Fraud-inspection stream – sessionid
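
The effect of the two partition-key strategies can be sketched in a few lines of Python. Kinesis maps a partition key to a shard through the MD5 hash of the key; the sketch assumes the 128-bit hash space is split evenly across shards, which holds for a freshly created stream but not necessarily after resharding. The function names are illustrative.

```python
import hashlib
import uuid

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for(partition_key, num_shards):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * num_shards // HASH_SPACE


def random_key():
    """Unordered processing: a fresh random key spreads events over shards."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    # Exact order: a fixed key (for example a session id) always lands on one shard.
    print(shard_for("session-123", 4), shard_for("session-123", 4))
    # Unordered: random keys scatter across the available shards.
    print(sorted({shard_for(random_key(), 4) for _ in range(1000)}))
```

A fixed key keeps related events in order on one shard at the price of a possible hot spot; random keys balance load but give up ordering across events.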

Sending: HTTP Post, AWS SDK, LOG4J, Flume, Fluentd
Reading: Get* APIs, Apache Storm, Amazon Elastic MapReduce
Amazon EMR Playback
Amazon S3 Archiver
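
Of the sending options listed, the AWS SDK route is the easiest to show briefly. The sketch below assumes boto3 is installed and credentials are configured; `put_record_params` and `send` are illustrative helper names, and the sample log line is made up.

```python
def put_record_params(stream_name, log_line, partition_key):
    """Build the keyword arguments for the Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        "Data": log_line.encode("utf-8"),  # payload is sent as bytes
        "PartitionKey": partition_key,     # decides which shard receives it
    }


def send(region, stream_name, log_line, partition_key):
    """Write one record and return its sequence number (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without the SDK

    client = boto3.client("kinesis", region_name=region)
    response = client.put_record(**put_record_params(stream_name, log_line, partition_key))
    return response["SequenceNumber"]


if __name__ == "__main__":
    print(put_record_params("AccessLogStream", "GET /index.html 200", "host-1"))
```

The log4j appender linked earlier in this deck serves the same purpose for Java applications without any hand-written sending code.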

General Purpose: M1, M3 (and T2)
Compute Optimized: C1, CC2, C3, C4
Memory Optimized: M2, CR1, R3
Storage Optimized: HI1, HS1, I2
GPU: CG1, G2
Micro: T1, T2

[Timeline of Amazon EC2 instance types by year, 2006 through December 2014. Existing families shown: M1, M2, C1, CC1, CC2, CG1, T1, HI1, HS1, CR1, M3, C3, G2, I2, R3, T2. Introducing now: c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge.]

The next generation of Amazon EC2 Compute-optimized instances
• Based on Intel Xeon E5-2666 v3 (Haswell) processors
• 2.9 GHz, peaking at 3.5 GHz with Turbo Boost
Ideal for running tier 1 applications, gaming and web servers, transcoding, and high performance computing workloads.
EBS-optimized by default, at no additional cost.

Instance Name   vCPU Count   RAM        Network Performance
c4.large        2            3.75 GiB   Moderate
c4.xlarge       4            7.5 GiB    Moderate
c4.2xlarge      8            15 GiB     High
c4.4xlarge      16           30 GiB     High
c4.8xlarge      36           60 GiB     10 Gbps

Preliminary specifications; may change prior to release.

Increases to the performance and capacity of General Purpose (SSD) and Provisioned IOPS (SSD) volumes.

EBS Name                            Capacity                 IOPS                               Throughput
Amazon EBS General Purpose (SSD)    16 TB (up from 1 TB)     10,000 IOPS (up from 3,000 IOPS)   160 MBps *
Amazon EBS Provisioned IOPS (SSD)   16 TB (up from 1 TB)     20,000 IOPS (up from 4,000 IOPS)   320 MBps *

* When attached to EBS-optimized instances
