Big Data Analytics
Abhishek Sinha
Business Development Manager,
AWS
@abysinha
sinhaar@amazon.com
An engineer’s definition
When your data sets become so large that you have to start
innovating how to collect, store, organize, analyze and share it
What does big data look like?
Volume
Velocity
Variety
3Vs
Where is this data coming from?
Human generated
Machine generated
Tweet
Surf the internet
Buy and sell products
Upload images and videos
Play games
Check in at restaurants
Search for cafes
Find deals
Watch content online
Look for directions
Use social media
Human generated
Machine generated
Networks and security devices
Mobile phones
Cell phone towers
Smart grids
Smart meters
Telematics from cars
Sensors on machines
Videos from traffic and security cameras
What are people using this for?
Big Data Verticals and Use cases
Media/Advertising: Targeted Advertising, Image and Video Processing
Oil & Gas: Seismic Analysis
Retail: Recommendations, Transactions Analysis
Life Sciences: Genome Analysis
Financial Services: Monte Carlo Simulations, Risk Analysis
Security: Anti-virus, Fraud Detection, Image Recognition
Social Network/Gaming: User Demographics, Usage Analysis, In-game Metrics
Why is big data hard?
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Lower cost,
higher throughput
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Highly
constrained
Lower cost,
higher throughput
Big Gap in turning data into actionable
information
Amazon Web Services helps remove constraints
Big Data + Cloud = Awesome Combination
Big data:
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Data is a combination of structured and unstructured data in many formats
AWS Cloud:
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Tools for managing structured and unstructured data
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Data size
• Global reach
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
Stack
Application stack: Scala/Liftweb API machines, WWW machines, batch jobs; Scala application code; Mongo/Postgres/flat-file databases; logs
Data stack: Amazon S3 holding database dumps (mongoexport, postgres dump) and log files (via Flume); Hadoop / Elastic MapReduce running Hive/Ruby/Mahout MapReduce jobs; an analytics dashboard
Stack – Front-end Application
Stack – Collection and Storage
Stack – Analysis and Sharing
Users Over Time
“Who is using our
service?”
Identified early mobile usage
Invested heavily in mobile development
Finding signal in the noise of logs
9,432,061 unique mobile devices
used the Yelp mobile app.
4 million+ calls. 5 million+ directions.
In January 2013
Autocomplete Search
Recommendations
Automatic spelling
corrections
“What kind of movies do people like?”
More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3 months
4 million ratings per day
3 million searches
Device location, time, day, week etc.
Social data
10 TB of streaming data per day
Data consumed in multiple ways
S3 → Prod Cluster (EMR) → Recommendation Engine, Ad-hoc Analysis, Personalization
Corporate data center → AWS Import/Export → Amazon Simple Storage Service (S3) → Amazon Elastic MapReduce → BI Users
Clickstream data from 500+ websites and VoD platform
“Who buys video games?”
Who is Razorfish
• Full service Digital Agency
• Developed an Ad-Serving Platform compatible with most browsers
• Clickstream analysis of data, current historical trends and segmentation of users
• Segmentation is used to serve ads and cross-sell
• 45 TB of log data
• Problems at scale:
– Giant datasets
– Building infrastructure requires large continuous investment
– Build for peak holiday season
– Traditional data stores are not scaling
3.5 billion records
13 TB of click stream logs
71 million unique cookies
Per day:
Previously in 2009
Today
Today
This happens in 8 hours everyday
Why AWS + EMR
• Perfect clarity of cost
• No upfront infrastructure investment
• No client processing contention
• Without EMR/Hadoop it takes 3 days; with EMR, 8 hours
– Scalability: 1 node x 100 hours = 100 nodes x 1 hour
• Meet SLAs
Playfish improves in-game experience for its users
through data mining
Challenge:
Must understand player usage trends across 50M monthly users, multiple platforms, 10s of games, and in the face of rapid growth. This drives both in-game improvements and defines what games to target next.
Solution:
EMR provides Playfish the flexibility to experiment and rapidly ask new questions. All usage data is stored in S3 and analysts run ad-hoc Hive queries that can slice the data by time, game, and user.
Data Driven Game Design
Data is being used to understand what gamers are doing
inside the game (behavioral analysis)
– What features people like (rely on data instead of forum posts)
– What features are abandoned
– A/B testing
– Monetization: in-game analytics
Building a big data architecture
Design Patterns
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Getting your Data into AWS
Amazon S3
Corporate Data
Center
• Console Upload
• FTP
• AWS Import Export
• S3 API
• Direct Connect
• Storage Gateway
• 3rd-party commercial apps
• Tsunami UDP
Write directly to a data source
Your application Amazon S3
DynamoDB
Any other data
store
Amazon S3
Amazon EC2
Queue, pre-process and then write to data source
Amazon Simple Queue Service (SQS)
Amazon S3
DynamoDB
Any other data store
Agency Customer: Video Analytics on AWS
Elastic Load
Balancer
Edge Servers
on EC2
Workers on
EC2
Logs Reports
HDFS Cluster
Amazon Simple Queue Service (SQS)
Amazon Simple Storage Service (S3)
Amazon Elastic MapReduce
Aggregate and write to data source
Flume running
on EC2
Amazon S3
Any other data
store
HDFS
What is Flume
• Collection and aggregation of streaming event data
– Typically used for log data, sensor data, GPS data, etc.
• Significant advantages over ad-hoc solutions
– Reliable, scalable, manageable, customizable and high performance
– Declarative, dynamic configuration
– Contextual routing
– Feature rich
– Fully extensible
Typical Aggregation Flow
[Client]+ → Agent → [Agent]* → Destination
Flume uses a multi-tier approach where multiple agents can send data to another agent which acts as an aggregator. For each agent, data can come from either an agent or a client, and can be sent to another agent or a sink.
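The multi-tier flow above can be sketched in a few lines: first-tier agents forward events to an aggregating agent, which routes each event to a sink based on the event itself, a toy version of Flume's contextual routing. All class and variable names here are hypothetical:

```python
class Agent:
    """A forwarding hop: hands each event to its downstream (agent or sink)."""
    def __init__(self, downstream):
        self.downstream = downstream  # another Agent's send, or a sink function

    def send(self, event):
        self.downstream(event)

log_sink, metric_sink = [], []

def aggregator(event):
    # Contextual routing: choose the destination from the event's content.
    (log_sink if event["type"] == "log" else metric_sink).append(event)

tier2 = Agent(aggregator)                       # the aggregating agent
tier1 = [Agent(tier2.send) for _ in range(3)]   # one agent per client host

for i, agent in enumerate(tier1):
    agent.send({"type": "log", "msg": f"line {i}"})
tier1[0].send({"type": "metric", "value": 42})
```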
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3 as a “single source of truth”
Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log-aggregation tools: choose depending upon design
Choice of storage systems (Structure and Volume)
(Chart: storage options by structure (low–high) and data size (small–large): S3, RDS, DynamoDB and other NoSQL stores, EBS.)
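One possible reading of that chart as code: pick a store from how structured the data is and how big it is. This is a rule of thumb for illustration, not an official AWS decision tree; the threshold and quadrant assignments are invented:

```python
def choose_store(structure: str, size_tb: float) -> str:
    """Map (structure, size) to a storage option, per one reading of the chart."""
    if structure == "high":
        # Structured data: relational for modest volumes, NoSQL at scale.
        return "RDS" if size_tb < 1 else "DynamoDB / NoSQL"
    # Low structure: object storage for large data, block storage for small.
    return "S3" if size_tb >= 1 else "EBS"
```

For example, 50 TB of unstructured logs lands on S3, while a 100 GB relational dataset lands on RDS.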
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Hadoop based Analysis
Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log-aggregation tools → Amazon EMR
EMR is Hadoop in the Cloud
What is Amazon Elastic MapReduce (EMR)?
A framework
Splits data into pieces
Lets processing occur
Gathers the results
distributed computing
(Chart: difficulty vs. number of machines, from 1 machine to 10⁶.)
distributed computing
is hard
distributed computing
requires god-like engineers
Innovation #1:
Hadoop is…
The MapReduce computational paradigm
Hadoop is…
The MapReduce computational paradigm
… implemented as an Open-source, Scalable,
Fault-tolerant, Distributed System
Person Start End
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
Compute a Duration column, record by record (“end time minus start time”):
Person Start End Duration
Bob 00:44:48 00:45:11 23
Charlie 02:16:02 02:16:18 16
Charlie 11:16:59 11:17:17 18
Charlie 11:17:24 11:17:38 14
Bob 11:23:10 11:23:25 15
Alice 16:26:46 16:26:54 8
David 17:20:28 17:20:45 17
Alice 18:16:53 18:17:00 7
Charlie 19:33:44 19:33:59 15
Bob 21:13:32 21:13:43 11
David 22:36:22 22:36:34 12
Alice 23:42:01 23:42:11 10
Person Duration
Bob 23
Charlie 16
Charlie 18
Charlie 14
Bob 15
Alice 8
David 17
Alice 7
Charlie 15
Bob 11
David 12
Alice 10
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
Person Total
David 29
Charlie 63
Bob 49
Alice 25
Person Total
Alice 25
Bob 49
Charlie 63
David 29
reduce
map: works on one record (in this case, “end time minus start time”), in parallel over all the records.
reduce: groups together common records (e.g. “Alice”, “Bob”) and adds all the results.
Hadoop is…
The MapReduce computational paradigm
Hadoop is…
The MapReduce computational paradigm
… implemented as an Open-source, Scalable,
Fault-tolerant, Distributed System
distributed computing
requires god-like engineers
distributed computing (with Hadoop)
requires talented engineers
Launch a Hadoop cluster from the CLI:
elastic-mapreduce --create --alive \
  --instance-type m1.xlarge \
  --num-instances 5
The Hadoop Ecosystem
EMR makes it easy to use Hive and Pig
Pig:
• High-level programming
language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
Hive:
• Data warehouse for Hadoop
• SQL-like query language (HiveQL)
R:
• Language and software
environment for statistical
computing and graphics
• Open source
EMR makes it easy to use other tools and applications
Mahout:
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining
Hive
Schema on read
Launch a Hive cluster from the CLI (step 1/1):
./elastic-mapreduce --create --alive \
  --name "Test Hive" \
  --hadoop-version 0.20 \
  --num-instances 5 --instance-type m1.large \
  --hive-interactive \
  --hive-versions 0.7.1
SQL Interface for working with data
Simple way to use Hadoop
Create Table statement references data location on S3
Language called HiveQL, similar to SQL
An example query: SELECT COUNT(1) FROM sometable;
Requires setting up a mapping to the input data
Uses SerDes to make different input formats queryable
Powerful data types (Array & Map)
SQL vs. HiveQL:
Updates: SQL has UPDATE, INSERT, DELETE; HiveQL has INSERT OVERWRITE TABLE
Transactions: supported in SQL; not supported in HiveQL
Indexes: supported in SQL; not supported in HiveQL
Latency: sub-second in SQL; minutes in HiveQL
Functions: hundreds in SQL; dozens in HiveQL
Multi-table inserts: not supported in SQL; supported in HiveQL
Create table as select: not valid SQL-92; supported in HiveQL
./elastic-mapreduce --create \
  --name "Hive job flow" \
  --hive-script \
  --args s3://myawsbucket/myquery.q \
  --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output
HiveQL to execute
./elastic-mapreduce --create --alive \
  --name "Hive job flow" \
  --num-instances 5 --instance-type m1.large \
  --hive-interactive
Interactive Hive session
{
  requestBeginTime: "19191901901",
  requestEndTime: "19089012890",
  browserCookie: "xFHJK21AS6HLASLHAS",
  userCookie: "ajhlasH6JASLHbas8",
  searchPhrase: "digital cameras"
  adId: "jalhdahu789asashja",
  impresssionId: "hjakhlasuhiouasd897asdh",
  referrer: "http://cooking.com/recipe?id=10231",
  hostname: "ec2-12-12-12-12.ec2.amazonaws.com",
  modelId: "asdjhklasd7812hjkasdhl",
  processId: "12901",
  threadId: "112121",
  timers: { requestTime: "1910121", modelLookup: "1129101" },
  counters: { heapSpace: "1010120912012" }
}
{
  requestBeginTime: "19191901901",
  requestEndTime: "19089012890",
  browserCookie: "xFHJK21AS6HLASLHAS",
  userCookie: "ajhlasH6JASLHbas8",
  adId: "jalhdahu789asashja",
  impresssionId: "hjakhlasuhiouasd897asdh",
  clickId: "ashda8ah8asdp1uahipsd",
  referrer: "http://recipes.com/",
  directedTo: "http://cooking.com/"
}
CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde'
WITH serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions';
Table structure to create (happens fast, as it is just a mapping to the source)
Source data in S3
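Conceptually, the SerDe's 'paths' property maps JSON fields to table columns. A rough Python sketch of that mapping follows; the record below is a valid-JSON stand-in for the impression log shown earlier, and real SerDes run inside Hive rather than like this:

```python
import json

# The 'paths' list from the CREATE TABLE statement above.
paths = ["requestBeginTime", "adId", "impressionId",
         "referrer", "userAgent", "userCookie", "ip"]

record = json.loads("""
{"requestBeginTime": "19191901901",
 "adId": "jalhdahu789asashja",
 "impressionId": "hjakhlasuhiouasd897asdh",
 "referrer": "http://cooking.com/recipe?id=10231",
 "userAgent": "Mozilla/5.0",
 "userCookie": "ajhlasH6JASLHbas8",
 "ip": "10.0.0.1",
 "threadId": "112121"}
""")

# Project the record onto the mapped columns; unmapped fields
# (threadId here) are simply ignored, just as the table ignores them.
row = tuple(record.get(p) for p in paths)
```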
Hadoop lowers the cost of developing
a distributed system.
hive> select * from impressions limit 5;
Selecting from source
data directly via Hadoop
What about the cost of operating
a distributed system?
November traffic at amazon.com
76%
24%
Innovation #2:
EMR is Hadoop in the Cloud
What is Amazon Elastic MapReduce (EMR)?
1 instance x 100 hours = 100 instances x 1 hour
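The identity above is just per-instance-hour arithmetic; a quick sketch, with a made-up hourly rate:

```python
RATE_CENTS = 48  # cents per instance-hour, hypothetical

def cost_cents(instances: int, hours: int) -> int:
    """Per-instance-hour billing: cost depends only on instances * hours."""
    return instances * hours * RATE_CENTS

slow = cost_cents(1, 100)    # one machine, about four days
fast = cost_cents(100, 1)    # a hundred machines, one hour
```

Same spend, results 100x sooner: that symmetry is what makes renting a large cluster for a short job economical.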
How does EMR work?
Put the data into S3
Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
Launch the cluster using the EMR console, CLI, SDK, or APIs
Get the output from S3
You can also store everything in HDFS
S3
What can you run on EMR…
EMR Cluster
Resize Nodes
EMR Cluster
You can easily add and
remove nodes
Workload patterns: on and off, fast growth, variable peaks, predictable peaks. Provisioning fixed capacity for these patterns means waste.
Your choice of tools on Hadoop/EMR
SQL based processing
Amazon EMR (pre-processing framework) feeding Amazon Redshift (petabyte-scale columnar data warehouse)
Massively Parallel Columnar Datawarehouses
• Columnar Data stores
• MPP
– Parallel Ingest
– Parallel Query
– Scale Out
– Parallel Backup
Columnar data stores
• Data alignment and block size in row stores vs. column stores
• Compression based on each column
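The row-versus-column layout and per-column compression can be sketched with plain lists; the simple run-length encoder below stands in for the real codecs columnar warehouses use:

```python
# A row store: one tuple per record.
rows = [("Bob", "US", 23), ("Alice", "US", 8), ("Charlie", "US", 16)]

# A column store: one contiguous array per column.
columns = {name: list(vals) for name, vals in
           zip(("person", "country", "duration"), zip(*rows))}

def rle(values):
    """Run-length encode a column: runs of equal values compress well."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Three identical values collapse to a single run.
compressed_country = rle(columns["country"])
```

A query touching only `duration` scans one compact array instead of every record, and low-cardinality columns like `country` shrink dramatically.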
MPP Data warehouse parallelizes and distributes
everything
• Query
• Load
• Backup
• Restore
• Resize
10 GigE
(HPC)
Ingest...
But Data-warehouses are
• Hard to manage
• Very expensive
• Difficult to scale
• Difficult to get performance
Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
Parallelize ...
Protect Oper...
Start Small ...
Easy to prov...
Amazon Redshift is priced to let you analyze all your data
Price Per Hour for HS1.XL
Single Node
Effective Hourly Price
Pe...
Your choice of BI Tools on the cloud
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing insights
Sharing results and visualizations
Sharing results and visualizations and scale
Geospatial Visualizations
Rinse Repeat every day or hour
Rinse and Repeat
The complete architecture
How do you start?
Where do you start?
• Where is your data? (S3, SQL, NoSQL?)
– Are you collecting all your data?
– What is the format (...
Thank You
sinhaar@amazon.com
  1. 1. Big Data Analytics Abhishek Sinha Business Development Manager, AWS @abysinha sinhaar@amazon.com
  2. 2. An engineer’s definition When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
  3. 3. What does big data look like ?
  4. 4. Volume Velocity Variety 3Vs
  5. 5. Where is this data coming from ?
  6. 6. Human generated Machine generated Tweet Surf the internet Buy and sell products Upload images and videos Play games Check in at restaurants Search for cafes Find deals Watch content online Look for directions Use social media
  7. 7. Human generated Machine generated Networks and security devices Mobile phones Cell phone towers Smart grids Smart meters Telematics from cars Sensors on machines Videos from traffic and security cameras
  8. 8. What are people using this for ?
  9. 9. Big Data Verticals and Use cases Media/Advertising Targeted Advertising Image and Video Processing Oil & Gas Seismic Analysis Retail Recommendati ons Transactions Analysis Life Sciences Genome Analysis Financial Services Monte Carlo Simulations Risk Analysis Security Anti-virus Fraud Detection Image Recognition Social Network/Gaming User Demographi cs Usage analysis In-game metrics
  10. 10. Why is big data hard ?
  11. 11. Generation Collection & storage Analytics & computation Collaboration & sharing
  12. 12. Generation Collection & storage Analytics & computation Collaboration & sharing Lower cost, higher throughput
  13. 13. Generation Collection & storage Analytics & computation Collaboration & sharing Highly constrained Lower cost, higher throughput
  14. 14. Big Gap in turning data into actionable information
  15. 15. Amazon Web Services helps remove constraints
  16. 16. Big Data + Cloud = Awesome Combination Big data: • Potentially massive datasets • Iterative, experimental style of data manipulation and analysis • Frequently not a steady-state workload; peaks and valleys • Data is a combination of structured and unstructured data in many formats AWS Cloud: • Massive, virtually unlimited capacity • Iterative, experimental style of infrastructure deployment/usage • At its most efficient with highly variable workloads • Tools for managing structured and unstructured data
  17. 17. Generation Collection & storage Analytics & computation Collaboration & sharing
  18. 18. Data size • Global reach • Native app for almost every smartphone, SMS, web, mobile-web • 10M+ users, 15M+ venues, ~1B check-ins • Terabytes of log data
  19. 19. Stack ApplicationStack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases LogsDataStack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  20. 20. Stack – Front end Application ApplicationStack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases LogsDataStack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  21. 21. Stack – Collection and Storage ApplicationStack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases LogsDataStack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  22. 22. Stack – analysis and sharing ApplicationStack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases LogsDataStack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  23. 23. Users Overtime
  24. 24. “Who is using our service?”
  25. 25. Identified early mobile usage Invested heavily in mobile development Finding signal in the noise of logs
  26. 26. 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions. In January 2013
  27. 27. Autocomplete Search Recommendations Automatic spelling corrections
  28. 28. “What kind of movies do people like ?”
  29. 29. More than 25 Million Streaming Members 50 Billion Events Per Day 30 Million plays every day 2 billion hours of video in 3 months 4 million ratings per day 3 million searches Device location , time , day, week etc. Social data
  30. 30. 10 TB of streaming data per day
  31. 31. Data consumed in multiple ways S3 EMR Prod Cluster (EMR) Recommendati on Engine Ad-hoc Analysis Personalization
  32. 32. AWS Import/Export Corporate data center Amazon Elastic MapReduce Amazon Simple Storage Service (S3) BI Users Clickstream data from 500+ websites and VoD platform
  33. 33. “Who buys video games?”
  34. 34. Who is Razorfish • Full service Digital Agency • Developed an Ad-Serving Platform compatible with most browsers • Clickstream analysis of data , current historical trends and segmentation of users • Segmentation is used to serve ads and cross sell • 45TB of Log data • Problems at scale – Giant Datasets – Building Infrastructure requires large continuous investment – Build for peak holiday season – Traditional Data stores are not scaling
  35. 35. 3.5 billion records 13 TB of click stream logs 71 million unique cookies Per day:
  36. 36. Previously in 2009
  37. 37. Today
  38. 38. Today
  39. 39. This happens in 8 hours everyday
  40. 40. Why AWS + EMR • Prefect Clarity of Cost • No upfront infrastructure investment • No client processing contention • Without EMR/Hadoop it takes 3 days , with EMR 8 hours – Scalability 1 node x 100 hours = 100 nodes x 1 hour • Meet SLA
  41. 41. Playfish improves in-game experience for its users through data mining Challenge: Must understand player usage trends across 50M month users, multiple platforms, 10s of games, and in the face of rapid growth. This drives both in-game improvements and defines what games to target next. Solution: EMR provides Playfish the flexibility to experiment and rapidly ask new questions. All usage data is stored in S3 and analysts run ad-hoc hive queries that can slice the data by time, game, and user.
  42. 42. Data Driven Game Design Data is being used to understand what gamers are doing inside the game (behavioral analysis) - What features people like (rely on data instead of forum posts) - What features are abandoned - A/B testing - Monetization – In Game Analytics
  43. 43. Building a big data architecture Design Patterns
  44. 44. Generation Collection & storage Analytics & computation Collaboration & sharing
  45. 45. Generation Collection & storage Analytics & computation Collaboration & sharing
  46. 46. Getting your Data into AWS Amazon S3 Corporate Data Center • Console Upload • FTP • AWS Import Export • S3 API • Direct Connect • Storage Gateway • 3rd Party Commercial Apps • Tsunami UDP 1
  47. 47. Write directly to a data source Your application Amazon S3 DynamoDB Any other data store Amazon S3 Amazon EC2 2
  48. 48. Queue , pre-process and then write to data source Amazon Simple Queue Service (SQS) Amazon S3 DynamoDB Any other data store 3
  49. 49. Agency Customer: Video Analytics on AWS Elastic Load Balancer Edge Servers on EC2 Workers on EC2 Logs Reports HDFS Cluster Amazon Simple Queue Service (SQS) Amazon Simple Storage Service (S3) Amazon Elastic MapReduce
  50. 50. Aggregate and write to data source Flume running on EC2 Amazon S3 Any other data store HDFS 4
  51. 51. What is Flume • Collection, Aggregation of streaming Event Data – Typically used for log data, sensor data , GPS data etc • Significant advantages over ad-hoc solutions – Reliable, Scalable, Manageable, Customizable and High Performance – Declarative, Dynamic Configuration – Contextual Routing – Feature rich – Fully extensible
  52. 52. Typical Aggregation Flow [Client]+  Agent [ Agent]*  Destination Flume uses a multi-tier approach where multiple agents can send data to another agent which acts as a aggregator. For each agent , data can from either an agent or a client or can be sent to another agent or a sink
  53. 53. Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html S3 as a “single source of truth” S3
  54. 54. Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Choose depending upon design
  55. 55. Choice of storage systems (Structure and Volume) Structure LowHigh Large Small Size S3 RDS Dynamo DB NoSQL EBS 1
  56. 56. Generation Collection & storage Analytics & computation Collaboration & sharing
  57. 57. Hadoop based Analysis Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR
  58. 58. EMR is Hadoop in the Cloud What is Amazon Elastic MapReduce (EMR)?
  59. 59. A framework Splits data into pieces Lets processing occur Gathers the results
  60. 60. distributed computing
  61. 61. Difficulty Number of Machines 1 1
  62. 62. Difficulty Number of Machines 1 1 106 2
  63. 63. Difficulty Number of Machines 1 1 106 2
  64. 64. distributed computing is hard
  65. 65. distributed computing requires god-like engineers
  66. 66. Innovation #1:
  67. 67. Hadoop is… The MapReduce computational paradigm
  68. 68. Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
  69. 69. Person Start End Bob 00:44:48 00:45:11 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  70. 70. Person Start End Duration Bob 00:44:48 00:45:11 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  71. 71. Person Start End Duration Bob 00:44:48 00:45:11 23 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  72. 72. Person Start End Duration Bob 00:44:48 00:45:11 23 Charlie 02:16:02 02:16:18 16 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  73. 73. Person Start End Duration Bob 00:44:48 00:45:11 23 Charlie 02:16:02 02:16:18 16 Charlie 11:16:59 11:17:17 18 Charlie 11:17:24 11:17:38 14 Bob 11:23:10 11:23:25 15 Alice 16:26:46 16:26:54 8 David 17:20:28 17:20:45 17 Alice 18:16:53 18:17:00 7 Charlie 19:33:44 19:33:59 15 Bob 21:13:32 21:13:43 11 David 22:36:22 22:36:34 12 Alice 23:42:01 23:42:11 10
  74. 74. Person Duration Bob 23 Charlie 16 Charlie 18 Charlie 14 Bob 15 Alice 8 David 17 Alice 7 Charlie 15 Bob 11 David 12 Alice 10
  75. 75. Person Duration Bob 23 Charlie 16 Charlie 18 Charlie 14 Bob 15 Alice 8 David 17 Alice 7 Charlie 15 Bob 11 David 12 Alice 10 Person Start End Bob 00:44:48 00:45:11 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11 map
  76. 76. Person Duration Bob 23 Charlie 16 Charlie 18 Charlie 14 Bob 15 Alice 8 David 17 Alice 7 Charlie 15 Bob 11 David 12 Alice 10
  77. 77. Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  78. 78. Person Total Alice 25 Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  79. 79. Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17 Person Total Bob 49 Alice 25
  80. 80. Person Total Charlie 63 Bob 49 Alice 25 Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  81. 81. Person Total David 29 Charlie 63 Bob 49 Alice 25 Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  82. 82. Person Total David 29 Charlie 63 Bob 49 Alice 25
  83. 83. Person Total Alice 25 Bob 49 Charlie 63 David 29 Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17 reduce
  84. 84. Person Start End Bob 00:44:48 00:45:11 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  85. 85. Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  86. 86. map reduce Works on one record. In this case it does “end time minus start time” In parallel over all the records Group together common records (e.g “Alice, Bob”) and add all the results
  87. 87. Hadoop is… The MapReduce computational paradigm
  88. 88. Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
  89. 89. distributed computing requires god-like engineers
  90. 90. distributed computing (with Hadoop) requires god-like talented engineers
  91. 91. Launch a Hadoop cluster from the CLI ( elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 5
  92. 92. The Hadoop Ecosystem
  93. 93. EMR makes it easy to use Hive and Pig Pig: • High-level programming language (Pig Latin) • Supports UDFs • Ideal for data flow/ETL Hive: • Data Warehouse for Hadoop • SQL-like query language (HiveQL)
  94. 94. R: • Language and software environment for statistical computing and graphics • Open source EMR makes it easy to use other tools and applications Mahout: • Machine learning library • Supports recommendation mining, clustering, classification, and frequent itemset mining
  95. 95. Hive Schema on read
  96. 96. Launch a Hive cluster from the CLI (step 1/1) ./elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1
  97. 97. SQL Interface for working with data Simple way to use Hadoop Create Table statement references data location on S3 Language called HiveQL, similar to SQL An example of a query could be: SELECT COUNT(1) FROM sometable; Requires to setup a mapping to the input data Uses SerDe:s to make different input formats queryable Powerful data types (Array & Map..)
  98. 98. SQL HiveQL Updates UPDATE, INSERT, DELETE INSERT, OVERWRITE TABLE Transactions Supported Not supported Indexes Supported Not supported Latency Sub-second Minutes Functions Hundreds Dozens Multi-table inserts Not supported Supported Create table as select Not valid SQL-92 Supported
  99. 99. ./elastic-mapreduce –create --name "Hive job flow” --hive-script --args s3://myawsbucket/myquery.q --args -d,INPUT=s3://myawsbucket/input,- d,OUTPUT=s3://myawsbucket/output HiveQL to execute
  100. 100. ./elastic-mapreduce --create --alive --name "Hive job flow” --num-instances 5 --instance-type m1.large --hive-interactive Interactive hive session
  101. 101. 114 { requestBeginTime: "19191901901", requestEndTime: "19089012890", browserCookie: "xFHJK21AS6HLASLHAS", userCookie: "ajhlasH6JASLHbas8", searchPhrase: "digital cameras" adId: "jalhdahu789asashja", impresssionId: "hjakhlasuhiouasd897asdh", referrer: "http://cooking.com/recipe?id=10231", hostname: "ec2-12-12-12-12.ec2.amazonaws.com", modelId: "asdjhklasd7812hjkasdhl", processId: "12901", threadId: "112121", timers: { requestTime: "1910121", modelLookup: "1129101" } counters: { heapSpace: "1010120912012" } }
  102. 102. 115 { requestBeginTime: "19191901901", requestEndTime: "19089012890", browserCookie: "xFHJK21AS6HLASLHAS", userCookie: "ajhlasH6JASLHbas8", adId: "jalhdahu789asashja", impresssionId: hjakhlasuhiouasd897asdh", clickId: "ashda8ah8asdp1uahipsd", referrer: "http://recipes.com/", directedTo: "http://cooking.com/" }
103. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://mybucketsource/tables/impressions' ;
104. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://mybucketsource/tables/impressions' ; (Table structure to create; this is fast, since it only maps onto the source)
105. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://mybucketsource/tables/impressions' ; (Source data is in S3)
106. Hadoop lowers the cost of developing a distributed system.
107. hive> select * from impressions limit 5; (Selecting from the source data directly via Hadoop)
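The JsonSerde's 'paths' property tells Hive which JSON fields to surface as which columns. A rough local sketch of that mapping, in Python (extract_columns is an illustrative stand-in, not the SerDe itself):

```python
import json

# The fields listed in the CREATE TABLE's 'paths' serde property.
PATHS = ["requestBeginTime", "adId", "impressionId", "referrer",
         "userAgent", "userCookie", "ip"]

def extract_columns(json_line, paths=PATHS):
    """Pull the named top-level fields out of one JSON log record,
    substituting None where a field is missing -- roughly what the
    JsonSerde does when Hive reads a row from the impression logs."""
    record = json.loads(json_line)
    return [record.get(p) for p in paths]

line = '{"requestBeginTime": "19191901901", "adId": "jalhdahu789asashja", "referrer": "http://cooking.com/recipe?id=10231"}'
print(extract_columns(line))
```

Fields absent from a record come back as NULL in Hive; the sketch mirrors that with None.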
108. What about the cost of operating a distributed system?
109. November traffic at amazon.com
110. November traffic at amazon.com
111. November traffic at amazon.com 76% 24%
112. Innovation #2:
113. What is Amazon Elastic MapReduce (EMR)? EMR is Hadoop in the Cloud.
114. 1 instance x 100 hours = 100 instances x 1 hour
115. How does EMR work? Put the data into S3. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS.
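The launch flow above can be sketched as the request you would hand to an EMR API client. This is a hedged illustration: build_cluster_request is a hypothetical helper, the bucket name and sizes are placeholders, and the field names follow boto3's run_job_flow parameters (a later API than the elastic-mapreduce CLI shown earlier in the deck).

```python
# Hypothetical helper that assembles the parameters you could pass to a
# boto3 EMR client as client("emr").run_job_flow(**params). All values
# here are placeholders, not figures from the talk.
def build_cluster_request(name, num_instances, instance_type,
                          log_bucket, keep_alive=False):
    return {
        "Name": name,
        "LogUri": "s3://%s/logs/" % log_bucket,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": num_instances,
            # Equivalent of the old CLI's --alive flag: keep the
            # cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
    }

params = build_cluster_request("Hive job flow", 5, "m1.large",
                               "myawsbucket", keep_alive=True)
print(params["Instances"]["InstanceCount"])
```

The same dict shape also makes the "choose # of nodes, types of nodes" step concrete: resizing later is just another API call against the running cluster.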
116. What can you run on EMR… (S3 feeding an EMR cluster)
117. Resize Nodes: you can easily add and remove nodes in an EMR cluster.
118. On and Off. Fast Growth. Predictable peaks. Variable peaks. WASTE
119. On and Off. Fast Growth. Predictable peaks. Variable peaks.
120. Your choice of tools on Hadoop/EMR: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR
121. SQL-based processing: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR (pre-processing framework), Amazon Redshift (petabyte-scale columnar data warehouse)
122. Massively Parallel Columnar Data Warehouses • Columnar data stores • MPP – Parallel Ingest – Parallel Query – Scale Out – Parallel Backup
123. Columnar data stores • Data alignment and block size in row stores vs. column stores • Compression based on each column
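Why per-column compression works so well can be shown with a toy example: values within a single column are homogeneous and repetitive, so even a simple run-length encoding collapses them, whereas a row layout interleaves unrelated values and defeats this. A minimal sketch (illustrative data, not Redshift's actual encodings):

```python
from itertools import groupby

# Toy table stored as rows of (country, browser).
rows = [("us", "chrome"), ("us", "firefox"), ("us", "chrome"),
        ("uk", "chrome"), ("uk", "chrome"), ("uk", "chrome")]

def rle(column):
    """Run-length encode one column: consecutive repeats collapse
    into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

# Pivot to columnar layout, then compress each column independently.
country_col = [r[0] for r in rows]
print(rle(country_col))
```

Six country values collapse to two pairs; a row store could not apply the same encoding because each row mixes value domains. Real column stores pick an encoding per column (run-length, dictionary, delta, etc.) on the same principle.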
124. An MPP data warehouse parallelizes and distributes everything: • Query • Load • Backup • Restore • Resize (10 GigE (HPC) interconnect; Ingestion; Backup; Restore; JDBC/ODBC access)
125. But data warehouses are • Hard to manage • Very expensive • Difficult to scale • Difficult to get performance from
126. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
127. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud. Parallelize and Distribute Everything: MPP Load, Query, Resize, Backup, Restore. Dramatically Reduce I/O.
128. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud. Parallelize and Distribute Everything: MPP Load, Query, Resize, Backup, Restore. Dramatically Reduce I/O: Direct-attached storage, Large data block sizes, Column data store, Data compression, Zone maps.
129. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud. Protect Operations: Redshift data is encrypted; Continuously backed up to S3; Automatic node recovery; Transparent disk failure handling.
130. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud. Protect Operations: Redshift data is encrypted; Continuously backed up to S3; Automatic node recovery; Transparent disk failure handling. Simplify Provisioning: Create a cluster in minutes; Automatic OS and software patching; Scale up to 1.6PB with a few clicks and no downtime.
131. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud. Start Small and Grow Big. Extra Large Node (XL): 3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE; from 1 node (2TB) to a 2-32 node cluster (64TB). 8 Extra Large Node (8XL): 24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE; 2-100 node cluster (1.6PB).
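The cluster-size figures on the slide follow directly from the per-node capacities; a quick check of the 64TB and 1.6PB arithmetic:

```python
# Verifying the slide's cluster-capacity arithmetic.
xl_per_node_tb = 2         # Extra Large (XL) node: 2 TB each
xl_max_nodes = 32          # largest XL cluster
eightxl_per_node_tb = 16   # 8 Extra Large (8XL) node: 16 TB each
eightxl_max_nodes = 100    # largest 8XL cluster

xl_max_tb = xl_per_node_tb * xl_max_nodes                        # 64 TB
eightxl_max_pb = eightxl_per_node_tb * eightxl_max_nodes / 1000  # 1.6 PB
print(xl_max_tb, eightxl_max_pb)
```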
132. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud. Easy to provision and scale; No upfront costs, pay as you go; High performance at a low price; Open and flexible with support for popular BI tools.
133. Amazon Redshift is priced to let you analyze all your data. Simple pricing: Number of Nodes x Cost per Hour; no charge for the Leader Node; no upfront costs; pay as you go.
     HS1.XL single node:    Price per hour | Effective hourly price per TB | Effective annual price per TB
     On-Demand:             $0.850         | $0.425                        | $3,723
     1 Year Reservation:    $0.500         | $0.250                        | $2,190
     3 Year Reservation:    $0.228         | $0.114                        | $999
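The effective prices in the table follow from the 2TB capacity of a single HS1.XL node; a quick check of that arithmetic:

```python
# Reproducing the effective-price columns: dollars per TB-hour is the
# hourly node rate divided by the node's 2 TB, and the annual figure
# is that rate times the 8,760 hours in a year.
HOURS_PER_YEAR = 24 * 365  # 8760
NODE_TB = 2                # usable capacity of one HS1.XL node

def effective_prices(hourly_rate):
    per_tb_hour = hourly_rate / NODE_TB
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return round(per_tb_hour, 3), round(per_tb_year)

print(effective_prices(0.850))  # On-Demand
print(effective_prices(0.228))  # 3 Year Reservation
```

All three rows of the table round to the advertised figures (the 3-year row works out to $998.64, shown as $999).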
134. Your choice of BI tools on the cloud: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR (pre-processing framework), Amazon Redshift
135. Generation. Collection & storage. Analytics & computation. Collaboration & sharing.
136. Collaboration and sharing insights: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift
137. Sharing results and visualizations: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Web App Server, Visualization tools
138. Sharing results and visualizations at scale: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Web App Server, Visualization tools
139. Sharing results and visualizations: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Business Intelligence Tools
140. Geospatial visualizations: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Business Intelligence Tools, GIS tools on Hadoop, GIS tools, Visualization tools
141. Rinse and repeat every day or hour
142. Rinse and Repeat: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Visualization tools, Business Intelligence Tools, GIS tools on Hadoop, GIS tools, Amazon Data Pipeline
143. The complete architecture: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Visualization tools, Business Intelligence Tools, GIS tools on Hadoop, GIS tools, Amazon Data Pipeline
144. How do you start?
145. Where do you start? • Where is your data? (S3, SQL, NoSQL?) – Are you collecting all your data? – What is the format (structured or unstructured)? – How much is this data going to grow? • How do you want to process it? – SQL (Hive) or scripts (Python/Ruby/Node.js) on Hadoop? • How do you want to use this data? – Visualization tools • Do it yourself or engage an AWS partner • Write to me: sinhaar@amazon.com
146. Thank You sinhaar@amazon.com