AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and Mobilewalla
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Published in: Technology
  1. Big Data Analytics • Abhishek Sinha, Business Development Manager, AWS • July 18, 2013 • @abysinha • sinhaar@amazon.com
  2. Overview • The Big Data Challenge • Turning data into actionable information • Building a big data platform • Mobilewalla – Big data system in AWS for mobile app audience measurement • Intel technology on big data
  3. Generation • Collection & storage • Analytics & computation • Collaboration & sharing
  4. Generation • Collection & storage • Analytics & computation • Collaboration & sharing • Lower cost, higher throughput
  5. Generation • Collection & storage • Analytics & computation • Collaboration & sharing • Highly constrained • Lower cost, higher throughput
  6. Generated data • Available for analysis • Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  7. Big gap in turning data into actionable information
  8. Amazon Web Services helps remove constraints
  9. 1 instance x 100 hours = 100 instances x 1 hour
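
The equivalence on the slide above is simple instance-hour arithmetic, and can be sketched as follows. This is a minimal illustration, assuming the work parallelizes perfectly and that billing is per instance-hour; the hourly rate is a made-up figure, not an AWS price.

```python
def instance_hours(instances: int, hours: float) -> float:
    """Total billable instance-hours for a fleet that runs for `hours`."""
    return instances * hours


def on_demand_cost(instances: int, hours: float, hourly_rate: float) -> float:
    """Cost of the fleet at a flat per-instance-hour rate."""
    return instance_hours(instances, hours) * hourly_rate


# Hypothetical rate in USD per instance-hour, for illustration only.
HYPOTHETICAL_RATE = 0.10

serial = on_demand_cost(1, 100, HYPOTHETICAL_RATE)    # one instance, 100 hours
parallel = on_demand_cost(100, 1, HYPOTHETICAL_RATE)  # 100 instances, one hour
print(serial, parallel)
```

Under these assumptions both runs consume the same 100 instance-hours and cost the same, but the parallel run returns its answer in one hour instead of 100.
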
  10. Big Data Verticals and Use Cases • Media/Advertising: targeted advertising, image and video processing • Oil & Gas: seismic analysis • Retail: recommendation, transactions analysis • Life Sciences: genome analysis • Financial Services: Monte Carlo simulations, risk analysis • Security: anti-virus, fraud detection, image recognition • Social Network/Gaming: user demographics, usage analysis, in-game metrics
  11. From data to actionable information
  12. “Who is using our service?”
  13. Identified early mobile usage • Invested heavily in mobile development • Finding signal in the noise of logs
  14. In January 2013: 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions.
  15. “What kind of movies do people like?”
  16. More than 25 million streaming members • 50 billion events per day • 30 million plays every day • 2 billion hours of video in 3 months • 4 million ratings per day • 3 million searches • Device location, time, day, week, etc. • Social data
  17. Query complements the R3 solution by providing granular search-and-retrieval functionality for structured and unstructured data stored in FinQloud
  18. Building a Big-Data Architecture
  19. Generation • Collection & storage • Analytics & computation • Collaboration & sharing
  20. Generation • Collection & storage • Analytics & computation • Collaboration & sharing
  21. Getting your data into AWS (1) • Amazon S3 • Corporate Data Center • Console Upload • FTP • AWS Import/Export • S3 API • Direct Connect • Storage Gateway • 3rd Party Commercial Apps • Tsunami UDP
  22. Write directly to a data source (2) • Your application • Amazon S3 • DynamoDB • Any other data store • Amazon S3 • Amazon EC2
  23. Queue, pre-process and then write to data source (3) • Amazon Simple Queue Service (SQS) • Amazon S3 • DynamoDB • Any other data store
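
The queue, pre-process, write pattern can be sketched with standard-library stand-ins. This is only an illustration of the flow: `queue.Queue` stands in for an SQS queue and a plain list stands in for S3, DynamoDB, or another store; the event fields and the `preprocess` and `drain` helpers are hypothetical.

```python
import json
import queue


def preprocess(raw: str) -> dict:
    """Parse one raw JSON event and normalise the fields we keep."""
    event = json.loads(raw)
    return {"user": event["user"].strip().lower(), "action": event["action"]}


def drain(inbox: queue.Queue, store: list) -> int:
    """Pull messages until the queue is empty, appending cleaned records to `store`."""
    written = 0
    while True:
        try:
            raw = inbox.get_nowait()
        except queue.Empty:
            return written
        store.append(preprocess(raw))
        written += 1


inbox = queue.Queue()  # stand-in for an SQS queue
data_store = []        # stand-in for S3, DynamoDB, or any other data store

# Producers (for example, application servers) enqueue raw events.
inbox.put('{"user": " Alice ", "action": "play"}')
inbox.put('{"user": "BOB", "action": "search"}')

# A worker drains the queue, cleans each event, and writes it out.
count = drain(inbox, data_store)
print(count, data_store)
```

Decoupling producers from the writer this way lets the pre-processing workers scale independently of the applications that generate the events.
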
  24. Agency Customer: Video Analytics on AWS • Elastic Load Balancer • Edge Servers on EC2 • Workers on EC2 • Logs • Reports • HDFS Cluster • Amazon Simple Queue Service (SQS) • Amazon Simple Storage Service (S3) • Amazon Elastic MapReduce
  25. Aggregate and write to data source (4) • Flume running on EC2 • Amazon S3 • Any other data store • HDFS
  26. S3 as a “single source of truth” • Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
  27. Choose depending upon design • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools
  28. Generation • Collection & storage • Analytics & computation • Collaboration & sharing
  29. Hadoop based Analysis • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR
  30. What is Amazon Elastic MapReduce (EMR)? • EMR is Hadoop in the Cloud
  31. How does EMR work? • Put the data into S3 • Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. • Launch the cluster using the EMR console, CLI, SDK, or APIs • Get the output from S3 • You can also store everything in HDFS
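
To make the kind of job a Hadoop cluster runs more concrete, here is a minimal in-process sketch of the map, shuffle, and reduce phases for a word count. It does not use the EMR or Hadoop APIs; the function names and input lines are hypothetical, and a real cluster would spread these phases across many nodes reading from S3 or HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce over the input lines in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


input_lines = ["big data on aws", "big data analytics"]
counts = run_job(input_lines)
print(counts)
```
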
  32. What can you run on EMR… • S3 • EMR Cluster
  33. Resize Nodes • EMR Cluster • You can easily add and remove nodes
  34. On and Off • Fast Growth • Predictable peaks • Variable peaks • WASTE
  35. Fast Growth • On and Off • Predictable peaks • Variable peaks
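
The "WASTE" label on the demand-pattern slides can be illustrated with a small calculation: a fixed fleet sized for peak demand leaves capacity idle whenever demand is lower. The hourly demand figures below are hypothetical, and the elastic case is idealised (capacity tracks demand exactly, with no scaling lag).

```python
def fixed_capacity_waste(demand, capacity):
    """Unused capacity-hours when a fixed fleet of `capacity` serves hourly demand."""
    return sum(max(capacity - d, 0) for d in demand)


def elastic_capacity(demand):
    """Capacity-hours consumed when the fleet tracks demand exactly (idealised)."""
    return sum(demand)


# Hypothetical number of instances needed in each of eight hours.
hourly_demand = [2, 2, 10, 3, 2, 9, 2, 2]

peak = max(hourly_demand)
provisioned = peak * len(hourly_demand)  # fixed fleet sized for the peak
used = elastic_capacity(hourly_demand)   # what an elastic fleet would consume
waste = fixed_capacity_waste(hourly_demand, peak)
print(provisioned, used, waste)
```

In this made-up example the fixed fleet provisions 80 instance-hours while only 32 are needed, so 48 are idle.
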
  36. Your choice of tools on Hadoop/EMR • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR
  37. SQL based processing • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift • Pre-processing framework • Petabyte scale columnar data warehouse
  38. What is Amazon Redshift? • Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud • Easy to provision and scale • No upfront costs, pay as you go • High performance at a low price • Open and flexible with support for popular BI tools
  39. Amazon Redshift is priced to let you analyze all your data
      Plan                 Price per hour, HS1.XL single node   Effective hourly price per TB   Effective annual price per TB
      On-Demand            $0.850                               $0.425                          $3,723
      1 Year Reservation   $0.500                               $0.250                          $2,190
      3 Year Reservation   $0.228                               $0.114                          $999
      Simple pricing: Number of nodes x cost per hour • No charge for leader node • No upfront costs • Pay as you go
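
The per-TB figures on the pricing slide follow from the per-node hourly prices. The sketch below reproduces them, assuming 2 TB of storage per HS1.XL node (the ratio implied by the slide's own numbers) and 8,760 hours per year; the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760
TB_PER_NODE = 2            # assumed HS1.XL capacity, implied by the slide's per-TB prices

# Per-node hourly prices in USD, as listed on the slide.
NODE_PRICE_PER_HOUR = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def price_per_tb_hour(node_price):
    """Effective hourly price per TB for one node."""
    return node_price / TB_PER_NODE


def price_per_tb_year(node_price):
    """Effective annual price per TB, rounded to the nearest dollar."""
    return round(price_per_tb_hour(node_price) * HOURS_PER_YEAR)


def cluster_cost_per_hour(nodes, node_price):
    """Simple pricing from the slide: number of nodes x cost per hour."""
    return nodes * node_price


annual = {plan: price_per_tb_year(p) for plan, p in NODE_PRICE_PER_HOUR.items()}
print(annual)
```

The same helper gives a cluster's hourly bill; for example, `cluster_cost_per_hour(4, 0.500)` is the hourly cost of four nodes at the 1-year reserved price, since the leader node is not charged.
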
  40. Your choice of BI Tools on the cloud • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift • Pre-processing framework
  41. Generation • Collection & storage • Analytics & computation • Collaboration & sharing
  42. Collaboration and Sharing insights • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift
  43. Sharing results and visualizations • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift • Web App Server • Visualization tools
  44. Sharing results and visualizations and scale • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift • Web App Server • Visualization tools
  45. Sharing results and visualizations • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift • Business Intelligence Tools • Business Intelligence Tools
  46. Geospatial Visualizations • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift • Business Intelligence Tools • Business Intelligence Tools • GIS tools on Hadoop • GIS tools • Visualization tools
  47. Rinse and repeat every day or hour
  48. Rinse and Repeat • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift • Visualization tools • Business Intelligence Tools • Business Intelligence Tools • GIS tools on Hadoop • GIS tools • AWS Data Pipeline
  49. The complete architecture • Amazon SQS • Amazon S3 • DynamoDB • Any SQL or NoSQL Store • Log Aggregation tools • Amazon EMR • Amazon Redshift • Visualization tools • Business Intelligence Tools • Business Intelligence Tools • GIS tools on Hadoop • GIS tools • AWS Data Pipeline
  50. Mobilewalla – App Audience Measurement With Amazon EC2 Infrastructure • Kaushik Dutta, CTO • 18 July, 2013
  51. Mobilewalla • Seattle-based big data venture that has accumulated the largest volumetric database of app market data in the industry • Applying data science techniques to this data, Mobilewalla generates actionable intelligence of importance to ad agencies, ad tech companies, and app publishers • Measuring audience in mobile apps
  52. Traditional audience measurement - Panels & Popularity Persistence • Fundamental to panel driven measurement: the idea of popularity persistence • Large pool of options, “small” set of popular choices: the 99 – 1 rule • Objects popular today → popular 30-60-90 days from today • Panel can be assumed to eventually gravitate towards the persistent popular set
  53. Mobilewalla Use Case – App Publishers • How is my app doing? – Rank by Category and Country, Reviews, Ratings, Feature mentions, Sentiment Analysis, Social Media, Audience Profile, Negative Review Analysis, Upgrades • Competitive Tracking – All of the above for competitors presented as overlays • Audience Analysis – Demographics, Psychographics • Alerts – Notifications upon specific events: review spikes, Twitter spikes
  54. Mobilewalla Use Case – Mobile Ad Tech • New Publisher Acquisition – Top N apps & Publishers for a Category / Geography – Top publishers by audience • Optimal Traffic Allocation – Related apps by content – Related apps by Audience profile – Behavioral profiles of network apps • Real-Time, Programmatic Delivery – API driven access – Sub 100ms response times
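
A "Top N apps for a Category / Geography" query like the one listed above can be sketched with a heap-based selection. This is not Mobilewalla's implementation; the record fields (`name`, `category`, `country`, `audience`) and the sample values are hypothetical.

```python
import heapq


def top_n_apps(apps, category, country, n):
    """Return the n apps with the largest audience in one category and country."""
    matching = (
        a for a in apps
        if a["category"] == category and a["country"] == country
    )
    return heapq.nlargest(n, matching, key=lambda a: a["audience"])


# Hypothetical app records, for illustration only.
apps = [
    {"name": "app_a", "category": "Games", "country": "SG", "audience": 120_000},
    {"name": "app_b", "category": "Games", "country": "SG", "audience": 450_000},
    {"name": "app_c", "category": "Games", "country": "US", "audience": 900_000},
    {"name": "app_d", "category": "News", "country": "SG", "audience": 300_000},
    {"name": "app_e", "category": "Games", "country": "SG", "audience": 80_000},
]

top_two = [a["name"] for a in top_n_apps(apps, "Games", "SG", 2)]
print(top_two)
```

`heapq.nlargest` avoids sorting the full candidate set, which matters when N is small relative to the number of apps; a service targeting sub-100ms responses would more likely precompute such rankings than scan records per request.
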
  55. Mobilewalla Approach • Social media / web • Web Crawler • Cloud Storage • Amazon S3 • Amazon EBS • Amazon RDS
  56. Mobilewalla Approach – Map-Reduce based analytics • Analytics • Analytics • Analytics • Analytics • Map Reduce Analytics • Cloud Storage (30+ Terabyte) • Amazon S3 • Amazon EBS • Amazon RDS
  57. Mobilewalla – Amazon EC2 Infrastructure: Web Crawler • 700+ micro to small instances • Elastic map-reduce – flexibility of allocating a large number of instances for a distributed program running for a short time • Spot Instance – reduces the cost
  58. Mobilewalla – Amazon EC2 Infrastructure: Cloud Storage • 50+ Medium to Large instances • Cassandra DB Nodes – EBS backed • Distributed in two availability zones in two different geographical regions • Flexibility to add nodes as and when required – allows you to grow with the business • Region based fail-over • Tiered storage systems – Local storage – Elastic Block Storage – S3 Storage • Considering Amazon Redshift • Amazon S3 • Amazon EBS • Amazon RDS
  59. Mobilewalla – Amazon EC2 Infrastructure: Map Reduce Framework • Complex analytics jobs on Hadoop systems in EC2 nodes • Elastic map-reduce for jobs requiring a large number of nodes on S3 storage systems • Analytics • Analytics • Analytics • Analytics
  60. Mobilewalla – Amazon EC2 Infrastructure: Analytics Delivery • Multiple application servers with load balancers • High read throughput from data nodes • Load balancers (ELB) and fail-over
  61. Amazon Web Services for Mobilewalla - Advantages • On-Demand and reserved nodes – Flexibility to add, modify, delete nodes as your business changes • Tiered storage systems to store and manage terabytes of data – Flexibility to change the data parameters (reliability, read throughput, write throughput) by varying the storage systems of your choice • Elastic Map-Reduce – Large scale map-reduce cluster without getting into the details of managing individual nodes and the map-reduce framework • Amazon EC2 allowed us to size our infrastructure as per our need and data growth.
  62. Amazon Web Services for Mobilewalla - Suggestions • Take the initial time to explore all the various offerings of Amazon in data storage and management, before developing a solution • Changing solution architecture for terabytes of data at a later time is a challenge
  63. Thank You
  64. Big Data Analytics • Eddie Toh, Regional Platform Marketing Manager, Pricing & Product Marketing Group, Intel APAC • July 18, 2013
  65. Analysis of Data Can Transform Society • Create new business models and improve organizational processes. • Enhance scientific understanding, drive innovation, and accelerate medical cures. • Increase public safety and improve energy efficiency with smart grids.
  66. Democratizing Analytics gets Value out of Big Data • Unlock Value in Silicon • Support Open Platforms • Deliver Software Value
  67. Intel at the Intersection of Big Data • HPC: Enabling exascale computing on massive data sets • Cloud: Helping enterprises build open interoperable clouds • Open Source: Contributing code and fostering ecosystem
  68. Intel at the Heart of the Cloud • Server • Storage • Network
  69. Scale-Out Platform Optimizations for Big Data • Cost-effective performance • Intel® Advanced Vector Extensions Technology • Intel® Turbo Boost Technology 2.0 • Intel® Advanced Encryption Standard New Instructions Technology
  70. Intel® Advanced Vector Extensions Technology • Newest in a long line of processor instruction innovations • Increases floating point operations per clock, up to 2X performance [1] • [1] Performance comparison using Linpack benchmark. See backup for configuration details. For more legal information on performance forecasts go to http://www.intel.com/performance • Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
  71. Intel® Turbo Boost Technology 2.0 • More Performance • Higher turbo speeds maximize performance for single and multi-threaded applications
  72. Intel® Advanced Encryption Standard New Instructions • Processor assistance for performing AES encryption - 7 new instructions • Makes enabled encryption software faster and stronger
  73. Power of the Platform built by Intel • Richer user experiences • TeraSort for 1TB sort • 4HRS • 50% Reduction • 80% Reduction • 50% Reduction • 40% Reduction • ~7MIN • Previous Intel® Xeon® Processor • Intel® Xeon® Processor E5 2600 • Solid-State Drive • 10G Ethernet • Intel® Distribution for Apache Hadoop
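
The TeraSort figures on the Intel platform slide are consistent with the four reductions compounding on the running time, which can be checked with a short calculation. This is a sketch under stated assumptions: that each percentage applies to the time remaining after the previous upgrade, and that the four percentages correspond to the four listed components. The slide text does not state either point explicitly.

```python
BASELINE_MINUTES = 4 * 60  # "4HRS" on the previous-generation platform

# Reductions listed on the slide, assumed to compound in sequence.
REDUCTIONS = [0.50, 0.80, 0.50, 0.40]


def apply_reductions(minutes, reductions):
    """Shrink a running time by each fractional reduction in turn."""
    for r in reductions:
        minutes *= (1 - r)
    return minutes


final_minutes = apply_reductions(BASELINE_MINUTES, REDUCTIONS)
print(round(final_minutes, 1))
```

Under these assumptions the remaining time is 240 x 0.5 x 0.2 x 0.5 x 0.6, or about 7.2 minutes, which matches the slide's "~7MIN". Because multiplication is commutative, the order in which the upgrades are applied does not change the result.
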
  74. Virtuous Cycle of Data-Driven Experience • Cloud • Intelligent Systems • Clients
  75. Thank You
  76. Technical Track
  77. Break • Technical Track
