AWS re:Invent re:Cap - Data Analysis: Amazon EC2 C4 Instance + Amazon EBS
December 10, 2014 | Korea 
김 일호, Solutions Architect
BDT201 - Big Data and HPC State of the Union
BDT202 - HPC Now Means 'High Personal Computing'
BDT203 - From Zero to NoSQL Hero: Amazon DynamoDB Tutorial
BDT204 - Rendering a Seamless Satellite Map of the World with AWS and NASA Data
BDT205 - Your First Big Data Application on AWS
BDT206 - See How Amazon Redshift is Powering Business Intelligence in the Enterprise
BDT207 - Use Streaming Analytics to Exploit Perishable Insights
BDT208 - Finding High Performance in the Cloud for HPC
BDT209 - Intel’s Healthcare Cloud Solution Using Wearables for Parkinson’s Disease Research
BDT302 - Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift
BDT303 - Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift
BDT305 - Lessons Learned and Best Practices for Running Hadoop on AWS
BDT306 - Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesis
BDT307 - Running NoSQL on Amazon EC2
BDT308 - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse
BDT308-JT - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse - Japanese Track
BDT309 - Delivering Results with Amazon Redshift, One Petabyte at a Time
BDT309-JT - Delivering Results with Amazon Redshift, One Petabyte at a Time - Japanese Track
BDT310 - Big Data Architectural Patterns and Best Practices on AWS
BDT311 - MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads
BDT312 - Using the Cloud to Scale from a Database to a Data Platform
BDT401 - Big Data Orchestra - Harmony within Data Analysis Tools
BDT402 - Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm
BDT403 - Netflix's Next Generation Big Data Platform
Collect: AWS Direct Connect, AWS Import/Export, Amazon Kinesis
Store: S3, Glacier, DynamoDB
Process & Analyze: EMR, Redshift, EC2
Automate: AWS Data Pipeline
[Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, visualization tools, business intelligence tools, GIS tools, AWS Data Pipeline]
Log4J | EMR-Kinesis Connector | Hive with Amazon S3 | Amazon Redshift parallel COPY from Amazon S3 | Amazon Kinesis processing state
Launch a 3-instance Hadoop 2.4 cluster with Hive installed: m3.xlarge YOUR-AWS-REGION YOUR-AWS-SSH-KEY YOUR-BUCKET-NAME
Create an Amazon Kinesis stream to hold incoming data: 
aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 2
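The shard count of 2 above is a sizing decision. As a rough sketch of how such a number might be estimated, the Python below assumes the commonly documented per-shard write limits of 1,000 records per second and 1 MiB per second; these limits and the helper name `shards_needed` are assumptions for illustration and do not come from the slides.

```python
import math

# Assumed per-shard write limits (not stated in the slides).
MAX_WRITE_RECORDS_PER_SEC = 1_000
MAX_WRITE_BYTES_PER_SEC = 1_048_576  # 1 MiB


def shards_needed(records_per_sec, avg_record_bytes):
    """Estimate the shard count needed to absorb a given write load."""
    by_count = math.ceil(records_per_sec / MAX_WRITE_RECORDS_PER_SEC)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / MAX_WRITE_BYTES_PER_SEC)
    # Whichever limit is hit first decides the count; a stream needs at least one shard.
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    # Hypothetical load: 1,500 log lines per second of about 500 bytes each.
    print(shards_needed(1_500, 500))
```

Under these assumptions, a load of 1,500 small records per second is bounded by the record-count limit rather than by bytes, which is one way a 2-shard stream could be justified for a small access-log demo.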
CHOOSE-A-REDSHIFT-PASSWORD
YOUR-IAM-ACCESS-KEY YOUR-IAM-SECRET-KEY
Log4J
YOUR-AWS-SSH-KEY YOUR-EMR-MASTER-PRIVATE-DNS YOUR-EMR-MASTER-PRIVATE-DNS YOUR-EMR-HOSTNAME
Start Hive: 
hive
YOUR-IAM-ACCESS-KEY YOUR-IAM-SECRET-KEY; YOUR-AWS-REGION
hive> 
hive> 
hive> 
hive> 
hive> 
hive>
hive> 
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler' 
TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
-- return the first row in the stream 
hive> 
-- return count all items in the Stream 
hive> 
-- return count of all rows with given host
hive>
Log4J | EMR-Kinesis Connector
http://127.0.0.1:19026/cluster 
http://127.0.0.1:19101
hive> YOUR-S3-BUCKET/emroutput
-- set up Hive's "dynamic partitioning"
-- splits output files when writing to Amazon S3 
hive> 
hive>
-- compress output files on Amazon S3 using Gzip 
hive> 
hive> 
hive> 
hive>
-- convert the Apache log timestamp to a UNIX timestamp 
-- split files in Amazon S3 by the hour in the log lines 
hive>
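The two comments above describe converting the Apache log timestamp to a UNIX timestamp and splitting output by the hour in the log line. The Hive statement itself is not preserved here, so the following is only a Python sketch of the same conversion, not the deck's query. The function names are illustrative, and the sample value is the timestamp from the confirmation record shown later in this deck.

```python
from datetime import datetime

# Apache common/combined log time format, e.g. "08/Nov/2014:21:59:54 -0500".
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_apache_time(raw):
    """Parse the bracketed time field of an Apache access log line."""
    return datetime.strptime(raw, APACHE_TIME_FORMAT)


def unix_timestamp(raw):
    """Seconds since the UNIX epoch for the given log time."""
    return int(parse_apache_time(raw).timestamp())


def hour_partition(raw):
    """Hour bucket taken from the log line itself (in the line's own UTC offset)."""
    return parse_apache_time(raw).strftime("%Y-%m-%d-%H")


if __name__ == "__main__":
    sample = "08/Nov/2014:21:59:54 -0500"
    print(unix_timestamp(sample), hour_partition(sample))
```

For the sample line this yields 1415501994 and the bucket 2014-11-08-21. The bucket uses the hour as written in the log line; bucketing by UTC hour instead would be an equally reasonable choice.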
Log4J | EMR-Kinesis Connector | Hive with Amazon S3
YOUR-S3-BUCKET YOUR-S3-BUCKET
# using the PostgreSQL CLI YOUR-REDSHIFT-ENDPOINT 
Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support 
•Aginity Workbench for Amazon Redshift 
•SQL Workbench/J
YOUR-S3-BUCKET YOUR-IAM-ACCESS-KEY YOUR-IAM-SECRET-KEY
-- show all requests from a given IP address 
-- count all requests on a given day 
-- show all requests referred from other sites
Log4J | EMR-Kinesis Connector | Hive with Amazon S3 | Amazon Redshift parallel COPY from Amazon S3
Bonus:
hive> 
hive> 
hive> 
hive> 
hive>
-- Create an external table on Amazon S3 
-- to hold query results. 
-- Partition (split files on Amazon S3) by iteration 
hive> YOUR-S3-BUCKET
-- set up a first iteration 
-- create OS-ERROR_COUNT result (404 error codes) under dynamic partition 0
-- set up a second iteration over the data in the Kinesis Stream 
-- create OS-ERROR_COUNT result under dynamic partition 1.
-- if file is empty, the previous iteration read all remaining stream data
Log4J | EMR-Kinesis Connector | Hive with Amazon S3 | Amazon Redshift parallel COPY from Amazon S3 | Amazon Kinesis processing state
YOUR-S3-BUCKET YOUR-S3-BUCKET
YOUR-S3-BUCKET YOUR-PREFIX.gz . YOUR-PREFIX.gz
DataXu
DataXu Records 
tx_id: "AFTfN0uAWZ"
exchange: "APPNEXUS"
request_id: "bb656107-3bf7-47a7-8548-8229563e9dc9"
…
adslot: {slot_id: "2686449714718898993", uuid: "9d2403f1-fc6c-4d38-b6b1-839fe4b42455", price_micro_cpm: 661385, currency: "USD", seat_id: "12-914", campaign_id: "C0513n7", creative_id: "R53a537"}
…
time_stamp: 1415393474434
serviced_by_host: "cr02.us-east-01"
Confirmation Record
[- 69.120.26.172 - - [08/Nov/2014:21:59:54 -0500] "GET /rs?id=fc6f2106175a43df8ae4f3b7e6fa8c37&t=marketing&cbust=1415502000191662 HTTP/1.1" 302 - "http://ads-by.madadsmedia.com/tags/25628/10217/iframe/728x90.html" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)" "wfivefivec=c876d00e-1831-4eba-b78d-cd99188e951a" "OWW=-"
Fraud Record
[Architecture diagram. Stages: Producers, Aggregator, Continuous Processing, Storage, Analytics. Components: CDN, Real-time Bidding, Retargeting Platform, Reporting, Qubole, Real Time Apps, KCL Apps, Archiver, Amazon Kinesis, Event Replay, Amazon S3, Redshift]
Client/Sensor | Aggregator | Continuous Processing | Storage | Analytics + Reporting
https://github.com/awslabs/kinesis-log4j-appender
Amazon Kinesis storage is replicated across Availability Zones
•Durable, highly consistent storage replicates data across three data centers (Availability Zones)
•Front end: authentication and authorization
•Millions of sources producing 100s of terabytes per hour
•Ordered stream of events supports multiple readers:
–Real-time dashboards and alarms
–Machine learning algorithms or sliding window analytics
–Aggregate analysis in Hadoop or a data warehouse
–Aggregate and archive to S3
•Inexpensive: $0.028 per million puts
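To make the per-put price concrete, here is a small arithmetic sketch. It covers PUT charges only, using the $0.028 per million puts figure from the slide; the 30-day month, the event rate, and the function name `monthly_put_cost` are assumptions for illustration. It does not model shard-hours or any other charge, so it is not a reconstruction of any total cost figure quoted in this deck.

```python
PRICE_PER_MILLION_PUTS = 0.028        # USD, from the slide
SECONDS_PER_MONTH = 30 * 24 * 3600    # assumes a 30-day month


def monthly_put_cost(events_per_sec, events_per_put=1):
    """PUT charges per month when several events are packed into each put."""
    puts = events_per_sec * SECONDS_PER_MONTH / events_per_put
    return puts / 1_000_000 * PRICE_PER_MILLION_PUTS


if __name__ == "__main__":
    # Hypothetical producer emitting 10,000 events per second.
    print(f"one event per put: ${monthly_put_cost(10_000):,.2f}")
    print(f"50 events per put: ${monthly_put_cost(10_000, events_per_put=50):,.2f}")
```

Packing N events into each put divides the PUT charge by N, which is why batching on the producer side is one of the first levers to consider.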
[Chart: 1 KB messages/sec (axis 0 to 1,200,000) plotted against shards (axis 0 to 1,100)]
TCO for average 1M events/second: 
with 50:1 packing and 10:1 compression: $6351/month 
raw: $28610/month
[Figure: an Amazon Kinesis shard (Shard-i) holding records with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. One table has columns Shard ID, Lock, and Seq num; another has columns Shard ID and Last Archived. Hosts shown: Host A and Host B. Sample record: {Event 10, …}]
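One possible reading of the figure above is that a worker records the last sequence number it has handled for Shard-i, so that a second host can resume from that point if the first one stops. The sketch below simulates that idea in plain Python using the sequence numbers from the figure. The `lease` dictionary, `run_worker`, and the `stop_after` failure switch are illustrative names, not part of the Kinesis Client Library API, and checkpointing after every record is a simplification.

```python
SEQUENCE_NUMBERS = [2, 3, 5, 8, 10, 14, 17, 18, 21, 23]


def run_worker(host, lease, records, stop_after=None):
    """Take the lease, skip records at or below the checkpoint, process the rest.

    stop_after simulates a host failing after that many records.
    """
    lease["owner"] = host
    processed = []
    for seq in records:
        if seq <= lease["checkpoint"]:
            continue  # already handled by an earlier owner
        if stop_after is not None and len(processed) == stop_after:
            break     # simulated failure
        processed.append(seq)
        lease["checkpoint"] = seq  # checkpoint after each record (simplification)
    return processed


if __name__ == "__main__":
    lease = {"shard_id": "Shard-i", "owner": None, "checkpoint": 0}
    print(run_worker("Host A", lease, SEQUENCE_NUMBERS, stop_after=5))
    print(run_worker("Host B", lease, SEQUENCE_NUMBERS))
    print(lease)
```

In this simulation Host A handles the records up to sequence number 10 and Host B continues from 14, so no record is processed twice. A real worker that checkpoints less often would reprocess whatever arrived after its last checkpoint.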
•Unordered processing 
–Randomize partition key to distribute events over many shards and use multiple workers 
•Exact order processing 
–Control the partition key to ensure events are grouped onto the same shard and read by the same worker. 
•Need both? Get global sequence number 
[Diagram labels: Producer, Get Global Sequence, Get Event Metadata, Unordered Stream, Campaign Centric Stream, Fraud Inspection Stream]
Id   Event          Stream – partition key
1    confirmation   Campaign-centric stream – UUID
2    fraud          Unordered stream; Fraud-inspection stream – sessionid
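The partition key choices above can be sketched as follows. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. This sketch assumes the shards split that range evenly, which is only an approximation of a real stream; `shard_index`, `unordered_key`, and `ordered_key` are hypothetical helper names.

```python
import hashlib
import uuid


def shard_index(partition_key, shard_count):
    """Map a partition key to a shard, assuming evenly split hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")  # 0 .. 2**128 - 1
    return hash_key * shard_count // 2**128


def unordered_key():
    """Random key: spreads events over all shards, with no ordering guarantee."""
    return uuid.uuid4().hex


def ordered_key(session_id):
    """Stable key: every event of a session lands on the same shard, in order."""
    return session_id


if __name__ == "__main__":
    shards = 4
    print({shard_index(unordered_key(), shards) for _ in range(1000)})
    print({shard_index(ordered_key("session-42"), shards) for _ in range(1000)})
```

The first set should contain several shard indexes while the second contains exactly one. A stable key keeps per-session order at the cost of possible hot shards when a few sessions dominate traffic.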
Sending: HTTP Post, AWS SDK, Log4J, Flume, Fluentd
Reading: Get* APIs, Apache Storm, Amazon Elastic MapReduce
Other labels: Amazon EMR, Playback, Amazon S3, Archiver
http://bit.ly/aws-bdt205
General Purpose: M1, M3 (and T2)
Compute Optimized: C1, CC2, C3, C4 
Memory Optimized: M2, CR1, R3 
Storage Optimized: HI1, HS1, I2 
GPU: CG1, G2 
Micro: T1, T2
[Timeline figure: Amazon EC2 instance types by year (2006, 2007, 2008, 2009, 2010, 2011, 2012-2013, December 2014), with a legend marking new versus existing types. Families shown: M1, M2, M3, C1, C3, CC1, CC2, CG1, CR1, HI1, HS1, G2, I2, R3, T1, T2, and C4. Introducing now: c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge]
The next generation of Amazon EC2 Compute-optimized instances
•Based on Intel Xeon E5-2666 v3 (Haswell) processors
•2.9 GHz – peaking at 3.5 GHz with Turbo Boost
Ideal for running tier 1 applications, gaming and web servers, transcoding, and high performance computing workloads.
EBS-optimized by default… and at no additional cost!
Instance Name   vCPU Count   RAM        Network Performance
c4.large        2            3.75 GiB   Moderate
c4.xlarge       4            7.5 GiB    Moderate
c4.2xlarge      8            15 GiB     High
c4.4xlarge      16           30 GiB     High
c4.8xlarge      36           60 GiB     10 Gbps
Preliminary specifications. May change prior to release.
Increases to the performance and capacity of General Purpose (SSD) and Provisioned IOPS (SSD) volumes.
EBS Name                            Capacity               IOPS                               Throughput
Amazon EBS General Purpose (SSD)    16 TB (up from 1 TB)   10,000 IOPS (up from 3,000 IOPS)   160 MBps *
Amazon EBS Provisioned IOPS (SSD)   16 TB (up from 1 TB)   20,000 IOPS (up from 4,000 IOPS)   320 MBps *
* When attached to EBS Optimized instances

Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 

AWS re:Invent re:Cap - Data Analytics: Amazon EC2 C4 Instance + Amazon EBS - 김일호

  • 1. December 10, 2014 | Korea 김 일호, Solutions Architect
  • 2.
      – BDT201 - Big Data and HPC State of the Union
      – BDT202 - HPC Now Means 'High Personal Computing'
      – BDT203 - From Zero to NoSQL Hero: Amazon DynamoDB Tutorial
      – BDT204 - Rendering a Seamless Satellite Map of the World with AWS and NASA Data
      – BDT205 - Your First Big Data Application on AWS
      – BDT206 - See How Amazon Redshift is Powering Business Intelligence in the Enterprise
      – BDT207 - Use Streaming Analytics to Exploit Perishable Insights
      – BDT208 - Finding High Performance in the Cloud for HPC
      – BDT209 - Intel’s Healthcare Cloud Solution Using Wearables for Parkinson’s Disease Research
      – BDT302 - Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift
      – BDT303 - Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift
      – BDT305 - Lessons Learned and Best Practices for Running Hadoop on AWS
      – BDT306 - Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesis
      – BDT307 - Running NoSQL on Amazon EC2
      – BDT308 - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse
      – BDT308-JT - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse - Japanese Track
      – BDT309 - Delivering Results with Amazon Redshift, One Petabyte at a Time
      – BDT309-JT - Delivering Results with Amazon Redshift, One Petabyte at a Time - Japanese Track
      – BDT310 - Big Data Architectural Patterns and Best Practices on AWS
      – BDT311 - MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads
      – BDT312 - Using the Cloud to Scale from a Database to a Data Platform
      – BDT401 - Big Data Orchestra - Harmony within Data Analysis Tools
      – BDT402 - Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm
      – BDT403 - Netflix's Next Generation Big Data Platform
  • 3. EMR Redshift EC2 Process & Analyze Store AWS Direct Connect S3 Amazon Kinesis Glacier AWS Import/Export DynamoDB Collect Automate AWS Data Pipeline
  • 4. Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Visualization tools Business Intelligence Tools Business Intelligence Tools GIS tools Amazon data pipeline
  • 5.
  • 6.
  • 7. Log4J EMR-Kinesis Connector Hive with Amazon S3 Amazon Redshift parallel COPY from Amazon S3 Amazon Kinesis processing state
  • 8.
  • 9. Launch a 3-instance Hadoop 2.4 cluster with Hive installed: m3.xlarge YOUR-AWS-REGION YOUR-AWS-SSH-KEY
  • 11. Create an Amazon Kinesis stream to hold incoming data: aws kinesis create-stream --stream-name AccessLogStream --shard-count 2
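Once the stream exists, producers write records into it. The following is a minimal Python sketch, assuming a boto3-style Kinesis client whose `put_record` call takes `StreamName`, `Data`, and `PartitionKey`. The helper `send_log_line` and the `FakeKinesisClient` stand-in are hypothetical names used only for illustration, and the random partition key is just one way to spread records over the two shards.

```python
import uuid

# Stream name taken from the `aws kinesis create-stream` command on slide 11.
STREAM_NAME = "AccessLogStream"


def send_log_line(client, line, stream_name=STREAM_NAME):
    """Write one log line to the stream through a boto3-style Kinesis client."""
    return client.put_record(
        StreamName=stream_name,
        Data=line.encode("utf-8"),
        # Assumption: a random key, so records spread across all shards.
        PartitionKey=str(uuid.uuid4()),
    )


class FakeKinesisClient:
    """Hypothetical in-memory stand-in so the sketch runs without AWS access."""

    def __init__(self):
        self.records = []

    def put_record(self, StreamName, Data, PartitionKey):
        self.records.append((StreamName, Data, PartitionKey))
        return {
            "ShardId": "shardId-000000000000",
            "SequenceNumber": str(len(self.records)),
        }
```

With AWS credentials configured, the same helper should work with a real client, for example `send_log_line(boto3.client("kinesis", region_name="YOUR-AWS-REGION"), line)`.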
  • 14.
  • 15. Log4J
  • 17.
  • 19.
  • 20. hive> STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler' TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
  • 21. -- return the first row in the stream hive> -- return count all items in the Stream hive> -- return count of all rows with given host hive>
  • 24.
  • 26. -- set up Hive's "dynamic partitioning" -- splits output files when writing to Amazon S3 hive> hive>
  • 27. -- compress output files on Amazon S3 using Gzip hive> hive> hive> hive>
  • 28.
  • 29. -- convert the Apache log timestamp to a UNIX timestamp -- split files in Amazon S3 by the hour in the log lines hive>
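The Hive statement for this step is not preserved in the transcript. As an illustration of the same idea outside Hive, here is a small Python sketch that converts an Apache log timestamp to a UNIX timestamp and derives an hourly partition label. The function names and the `YYYY-MM-DD-HH` label format are assumptions, and the hour is taken in the log line's own UTC offset.

```python
from datetime import datetime

# Apache common/combined log timestamp, e.g. "08/Nov/2014:21:59:54 -0500".
# Note: %b relies on English month names (the default C locale).
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def apache_to_unix(ts):
    """Return the UNIX timestamp (seconds since the epoch) for a log timestamp."""
    return int(datetime.strptime(ts, APACHE_TIME_FORMAT).timestamp())


def hour_partition(ts):
    """Return an hourly partition label in the log line's own time zone."""
    return datetime.strptime(ts, APACHE_TIME_FORMAT).strftime("%Y-%m-%d-%H")
```

For the timestamp in the sample record on slide 52, `apache_to_unix("08/Nov/2014:21:59:54 -0500")` gives 1415501994 and `hour_partition` gives `2014-11-08-21`.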
  • 30. Log4J EMR-Kinesis Connector Hive with Amazon S3
  • 31.
  • 33. # using the PostgreSQL CLI YOUR-REDSHIFT-ENDPOINT Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support •Aginity Workbench for Amazon Redshift •SQL Workbench/J
  • 34.
  • 35.
  • 37. -- show all requests from a given IP address -- count all requests on a given day -- show all requests referred from other sites
  • 38.
  • 39. Log4J EMR-Kinesis Connector Hive with Amazon S3 Amazon Redshift parallel COPY from Amazon S3
  • 41.
  • 42. hive> hive> hive> hive> hive>
  • 43. -- Create an external table on Amazon S3 -- to hold query results. -- Partition (split files on Amazon S3) by iteration hive> YOUR-S3-BUCKET
  • 44. -- set up a first iteration -- create OS-ERROR_COUNT result (404 error codes) under dynamic partition 0
  • 45. -- set up a second iteration over the data in the Kinesis Stream -- create OS-ERROR_COUNT result under dynamic partition 1. -- if file is empty, the previous iteration read all remaining stream data
  • 46. Log4J EMR-Kinesis Connector Hive with Amazon S3 Amazon Redshift parallel COPY from Amazon S3 Amazon Kinesis processing state
  • 49.
  • 51.
  • 52. DataXu Records tx_id: "AFTfN0uAWZ" exchange: "APPNEXUS" request_id: "bb656107-3bf7-47a7-8548-8229563e9dc9" …. adslot: {slot_id: "2686449714718898993", uuid: "9d2403f1-fc6c-4d38-b6b1-839fe4b42455", price_micro_cpm: 661385, currency: "USD", seat_id: "12-914", campaign_id: "C0513n7", creative_id: "R53a537"} … time_stamp: 1415393474434 serviced_by_host: "cr02.us-east-01" Confirmation Record [- 69.120.26.172 - - [08/Nov/2014:21:59:54 -0500] "GET /rs?id=fc6f2106175a43df8ae4f3b7e6fa8c37&t=marketing&cbust=1415502000191662 HTTP/1.1" 302 - "http://ads-by.madadsmedia.com/tags/25628/10217/iframe/728x90.html" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)" "wfivefivec=c876d00e-1831-4eba-b78d-cd99188e951a" "OWW=-" Fraud Record
  • 53. Continuous Processing CDN Real-time Bidding Retargeting Platform Reporting Qubole Real Time KCL Apps Apps Archiver Amazon Kinesis Event Replay Amazon S3 Producers Aggregator Storage Analytics Redshift
  • 54.
  • 55.
  • 56. Client/Sensor Aggregator Continuous Processing Storage Analytics + Reporting
  • 57.
  • 59. Client/Sensor Aggregator Continuous Processing Storage Analytics + Reporting
  • 60. Amazon Kinesis storage is replicated across Availability Zones Amazon Web Services AZ AZ AZ Durable, highly consistent storage replicates data across three data centers (availability zones) Aggregate and archive to S3 Millions of sources producing 100s of terabytes per hour Front End Authentication Authorization Ordered stream of events supports multiple readers Real-time dashboards and alarms Machine learning algorithms or sliding window analytics Aggregate analysis in Hadoop or a data warehouse Inexpensive: $0.028 per million puts
  • 61.
  • 62. [Chart: 1 KB messages/sec (0 to 1,200,000) plotted against shards (0 to 1,100)] TCO for an average of 1M events/second: raw: $28610/month; with 50:1 packing and 10:1 compression: $6351/month
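The relationship between message rate and shard count on slide 62 can be sketched in Python. This assumes the documented per-shard write limits of 1 MB/s and 1,000 records/s. The function name and parameters are hypothetical, and the sketch only estimates shard counts; it does not attempt to reproduce the dollar figures on the slide.

```python
import math

# Per-shard write limits (documented Kinesis limits; treat as assumptions here).
SHARD_WRITE_BYTES_PER_SEC = 1024 * 1024
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(events_per_sec, event_bytes, packing=1, compression=1.0):
    """Estimate shards for a given event rate.

    packing:     number of events batched into one Kinesis record
    compression: compression ratio applied to each packed record
    """
    records_per_sec = events_per_sec / packing
    record_bytes = event_bytes * packing / compression
    by_bytes = records_per_sec * record_bytes / SHARD_WRITE_BYTES_PER_SEC
    by_records = records_per_sec / SHARD_WRITE_RECORDS_PER_SEC
    # The tighter of the two limits determines the shard count.
    return math.ceil(max(by_bytes, by_records))
```

Under these assumptions, 1M raw 1 KB events per second needs about 1,000 shards (`shards_needed(1_000_000, 1024)`), while 50:1 packing with 10:1 compression brings the estimate down to 98 (`shards_needed(1_000_000, 1024, packing=50, compression=10)`), which is why packing and compression lower the cost so sharply.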
  • 63. Client/Sensor Aggregator Continuous Processing Storage Analytics + Reporting
  • 64.
  • 65. [Diagram: an Amazon Kinesis shard (Shard-i) holding events with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23; a lock table with columns Shard ID, Lock, Seq num, in which Host A and Host B take the lock on Shard-i; and an archive table with columns Shard ID, Last Archived]
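One plausible reading of the slide 65 diagram is that each shard has a row recording which host holds the lock and the last sequence number processed, so a second host can take over where the first stopped. The Python sketch below illustrates that idea with an in-memory table. The class and function names are hypothetical and this is not the Kinesis Client Library's actual implementation.

```python
class CheckpointTable:
    """In-memory stand-in for a per-shard lock and checkpoint table."""

    def __init__(self):
        self.rows = {}  # shard_id -> {"owner": host, "seq": last processed seq}

    def acquire(self, shard_id, host):
        """Give the shard lock to `host` and return the last checkpointed seq."""
        row = self.rows.setdefault(shard_id, {"owner": None, "seq": 0})
        row["owner"] = host
        return row["seq"]

    def checkpoint(self, shard_id, host, seq):
        """Record progress; only the current lock owner may checkpoint."""
        row = self.rows[shard_id]
        if row["owner"] != host:
            raise RuntimeError(f"{host} no longer holds the lock on {shard_id}")
        row["seq"] = seq


def unprocessed(sequence_numbers, last_seq):
    """Return the sequence numbers that come after the last checkpoint."""
    return [s for s in sequence_numbers if s > last_seq]
```

With the sequence numbers from the diagram, if Host A checkpoints at 10 and then loses the lock, Host B's `acquire` call returns 10 and `unprocessed` yields 14, 17, 18, 21, 23, so no event is skipped and already archived events are not reprocessed.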
  • 66. Client/Sensor Aggregator Continuous Processing Storage Analytics + Reporting
  • 67. CDN Real Time Bidding Retargeting Platform Reporting Qubole Real Time KCL Apps Apps Archiver Kinesis Event Replay S3
  • 68. Producers Aggregator Continuous Processing Storage Analytics CDN Real-time Bidding Retargeting Platform Reporting Qubole Real Time KCL Apps Apps Archiver Amazon Kinesis Event Replay Amazon S3 Amazon Redshift
  • 69. Producers Aggregator Continuous Processing Storage Analytics CDN Real-time Bidding Retargeting Platform Reporting Qubole Real Time KCL Apps Apps Archiver Amazon Kinesis Event Replay Amazon S3 Redshift
  • 70. Client/Sensor Aggregator Continuous Processing Storage Analytics + Reporting
  • 71.
      • Unordered processing
          – Randomize partition key to distribute events over many shards and use multiple workers
      • Exact order processing
          – Control the partition key to ensure events are grouped onto the same shard and read by the same worker.
      • Need both? Get global sequence number
      [Diagram: Producer, Get Global Sequence, Unordered Stream, Campaign Centric Stream, Fraud Inspection Stream, Get Event Metadata]
      Id   event          Stream – partition key
      1    confirmation   Campaign-centric stream - UUID
      2    fraud          Unordered Stream; Fraud-inspection stream – sessionid
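The two partition-key strategies on slide 71 can be illustrated with a short Python sketch. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch assumes the shards split that range evenly, and the function names are hypothetical.

```python
import hashlib
import uuid


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")  # value in 0 .. 2**128 - 1
    return (hash_key * shard_count) >> 128


def unordered_key():
    """Unordered processing: a random key spreads events over all shards."""
    return str(uuid.uuid4())


def ordered_key(group_id):
    """Exact-order processing: one key per group (e.g. a session id) keeps
    that group's events on a single shard, read by a single worker."""
    return group_id
```

Every event sent with `ordered_key("session-42")` lands on the same shard, while events sent with `unordered_key()` scatter across shards and can be consumed by many workers in parallel.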
  • 72. HTTP Post AWS SDK LOG4J Flume Fluentd Get* APIs Apache Storm Amazon Elastic MapReduce Sending Reading Amazon EMR Playback Amazon S3 Archiver
  • 73. Client/Sensor Aggregator Continuous Processing Storage Analytics + Reporting
  • 75.
  • 76. General Purpose: M1, M3 (, T2) Compute Optimized: C1, CC2, C3, C4 Memory Optimized: M2, CR1, R3 Storage Optimized: HI1, HS1, I2 GPU: CG1, G2 Micro: T1, T2
  • 77. [Timeline figure: growth of the Amazon EC2 instance type lineup by year (2006, 2007, 2008, 2009, 2010, 2011, 2012-2013, December 2014), starting from m1.small alone in 2006. Legend: new, existing, introducing now. Most recent types listed: g2.2xlarge, m3.medium, m3.large, i2.large, i2.xlarge, i2.4xlarge, i2.8xlarge, r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, t2.micro, t2.small, t2.medium, c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge]
  • 78. The next generation of Amazon EC2 Compute-optimized instances
      • Based on Intel Xeon E5-2666 v3 (Haswell) processors
      • 2.9 GHz – peaking at 3.5 GHz with Turbo Boost
      Ideal for running tier 1 applications, gaming and web servers, transcoding, and high performance computing workloads.
      EBS-optimized by default… and at no additional cost!
      Instance Name   vCPU Count   RAM        Network Performance
      c4.large        2            3.75 GiB   Moderate
      c4.xlarge       4            7.5 GiB    Moderate
      c4.2xlarge      8            15 GiB     High
      c4.4xlarge      16           30 GiB     High
      c4.8xlarge      36           60 GiB     10 Gbps
      Preliminary specifications. May change prior to release
  • 79. Increases to the performance and capacity of General Purpose (SSD) and Provisioned IOPS (SSD) volumes.
      EBS Name                            Capacity                 IOPS                             Throughput
      Amazon EBS General Purpose (SSD)    16 TB (up from 1 TB)     10000 IOPS (up from 3000 IOPS)   160 MBps *
      Amazon EBS Provisioned IOPS (SSD)   16 TB (up from 1 TB)     20000 IOPS (up from 4000 IOPS)   320 MBps *
      * When attached to EBS Optimized instances