Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
2. Overview
• The Big Data Challenge
• Turning data into actionable information
• Building a big data platform
• Mobilewalla – Big data system on AWS for mobile app audience measurement
• Intel technology for big data
7. Generated data
Available for analysis
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
8. Big Gap in turning data into actionable information
12. Big Data Verticals and Use cases
• Media/Advertising
– Targeted Advertising
– Image and Video Processing
• Oil & Gas
– Seismic Analysis
• Retail
– Recommendation
– Transactions Analysis
• Life Sciences
– Genome Analysis
• Financial Services
– Monte Carlo Simulations
– Risk Analysis
• Security
– Anti-virus
– Fraud Detection
– Image Recognition
• Social Network/Gaming
– User Demographics
– Usage analysis
– In-game metrics
18. More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3 months
4 million ratings per day
3 million searches
Device location, time, day, week, etc.
Social data
22. Query complements the R3 solution by providing granular search-and-retrieval functionality for structured and unstructured data stored in FinQloud
27. Getting your Data into AWS
Diagram: Corporate Data Center to Amazon S3, using any of:
• Console Upload
• FTP
• AWS Import/Export
• S3 API
• Direct Connect
• Storage Gateway
• 3rd Party Commercial Apps
• Tsunami UDP
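The "S3 API" option above can be sketched in Python. This is a minimal sketch, not the presenters' code: it assumes a boto3 S3 client is passed in as `s3_client`, and the helper names `object_key` and `upload_files`, the bucket name and the key prefix are hypothetical. The only AWS call used is the client's `upload_file(Filename, Bucket, Key)`.

```python
import os


def object_key(prefix, relative_path):
    """Join an optional key prefix and a relative file path into an S3 object key."""
    parts = [prefix.strip("/"), relative_path.replace("\\", "/").lstrip("/")]
    return "/".join(part for part in parts if part)


def upload_files(s3_client, local_root, relative_paths, bucket, prefix=""):
    """Upload each file under local_root to the bucket and return the keys written.

    s3_client is assumed to behave like boto3.client("s3").
    """
    uploaded = []
    for relative_path in relative_paths:
        key = object_key(prefix, relative_path)
        # upload_file handles multipart transfers for large files in boto3.
        s3_client.upload_file(os.path.join(local_root, relative_path), bucket, key)
        uploaded.append(key)
    return uploaded
```

With boto3 installed and credentials configured, a call might look like `upload_files(boto3.client("s3"), "exports", ["a.csv"], "example-bucket", "raw/2013-07-18")`, where the bucket and prefix are placeholders.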
28. Write directly to a data source
Diagram labels: Your application, Amazon EC2, Amazon S3, DynamoDB, any other data store
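Writing directly to a data source might look like the sketch below. It is an illustration under assumptions: the clients are expected to behave like `boto3.client("s3")` and `boto3.client("dynamodb")`, and the function name `write_event`, the `events/` key layout and the attribute names are hypothetical. It uses only `put_object` and `put_item`.

```python
import json
import time
import uuid


def write_event(s3_client, dynamodb_client, bucket, table, event):
    """Store the full event as a JSON object in S3 and an index row in DynamoDB."""
    event_id = str(uuid.uuid4())
    received_at = int(time.time())
    record = dict(event, event_id=event_id, received_at=received_at)
    key = f"events/{event_id}.json"  # hypothetical key layout

    # Raw payload goes to S3.
    s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(record).encode("utf-8"))

    # A small lookup row goes to DynamoDB (low-level client, typed attributes).
    dynamodb_client.put_item(
        TableName=table,
        Item={
            "event_id": {"S": event_id},
            "received_at": {"N": str(received_at)},
            "s3_key": {"S": key},
        },
    )
    return key
```

Keeping the payload in S3 and only a pointer in DynamoDB is one possible split; the slide itself does not prescribe how the two stores are combined.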
29. Queue, pre-process and then write to data source
Diagram labels: Amazon Simple Queue Service (SQS), Amazon S3, DynamoDB, any other data store
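The queue, pre-process, write pattern can be sketched as a small consumer loop. This is a sketch under assumptions: `sqs_client` is expected to behave like `boto3.client("sqs")`, message bodies are assumed to be JSON, and `drain_queue`, `preprocess` and `write` are hypothetical names supplied by the caller.

```python
import json


def drain_queue(sqs_client, queue_url, preprocess, write, max_batches=10):
    """Read messages from SQS, pre-process each one, write it, then delete it.

    preprocess(record) returns the cleaned record, or None to drop it.
    write(record) persists the record to S3, DynamoDB or any other store.
    """
    processed = 0
    for _ in range(max_batches):
        response = sqs_client.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # queue is empty for now
        for message in messages:
            record = preprocess(json.loads(message["Body"]))
            if record is not None:
                write(record)
            # Delete only after the write returned, so a failure leads to redelivery.
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            processed += 1
    return processed
```

Deleting after the write gives at-least-once delivery, so the `write` step should tolerate duplicates.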
30. Agency Customer: Video Analytics on AWS
Diagram labels: Elastic Load Balancer, Edge Servers on EC2, Workers on EC2, Logs, Reports, HDFS Cluster, Amazon Simple Queue Service (SQS), Amazon Simple Storage Service (S3), Amazon Elastic MapReduce
31. Aggregate and write to data source
Diagram labels: Flume running on EC2, Amazon S3, HDFS, any other data store
36. What is Amazon Elastic MapReduce (EMR)?
EMR is Hadoop in the Cloud
37. How does EMR work?
• Put the data into S3
• Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
• Launch the cluster using the EMR console, CLI, SDK, or APIs
• Get the output from S3
• You can also store everything in HDFS
Diagram labels: EMR Cluster, S3
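Launching through the SDK can be sketched as building the request for boto3's EMR `run_job_flow` call. Everything specific here is an assumption for illustration: the function name `build_job_flow_request`, the release label, the instance type, the node count, the default role names and all S3 URIs. The step runs a Hive script that reads from and writes to S3.

```python
def build_job_flow_request(name, log_uri, hive_script_uri, input_uri, output_uri,
                           core_nodes=2, instance_type="m5.xlarge",
                           release_label="emr-6.15.0"):
    """Return keyword arguments for boto3.client("emr").run_job_flow(**request)."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,  # chooses the Hadoop distribution version
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Terminate the cluster once the steps finish; output stays in S3.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-hive-script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", hive_script_uri,
                         "-d", f"INPUT={input_uri}",
                         "-d", f"OUTPUT={output_uri}"],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

Passing the result as `boto3.client("emr").run_job_flow(**request)` starts the cluster and returns a response containing its `JobFlowId`; the input and output URIs would point at the S3 locations from the first and fourth bullets.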
42. Your choice of tools on Hadoop/EMR
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR
43. SQL based processing
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift
Annotations: pre-processing framework; petabyte-scale columnar data warehouse
44. What is Amazon Redshift?
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
• Easy to provision and scale
• No upfront costs, pay as you go
• High performance at a low price
• Open and flexible with support for popular BI tools
45. Amazon Redshift is priced to let you analyze all your data
                     Price per hour         Effective hourly   Effective annual
                     (HS1.XL single node)   price per TB       price per TB
On-Demand            $0.850                 $0.425             $3,723
1 Year Reservation   $0.500                 $0.250             $2,190
3 Year Reservation   $0.228                 $0.114             $999
Simple Pricing
Number of Nodes x Cost per Hour
No charge for Leader Node
No upfront costs
Pay as you go
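The effective per-TB figures in the table follow from simple arithmetic, sketched below. The 2 TB of storage per HS1.XL node is an assumption inferred from the table itself (0.850 / 0.425 = 2), and a year is taken as 24 × 365 = 8,760 hours. The function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours
TB_PER_HS1_XL_NODE = 2      # assumption inferred from the table: 0.850 / 0.425


def effective_prices(node_price_per_hour, tb_per_node=TB_PER_HS1_XL_NODE):
    """Return (hourly price per TB, annual price per TB rounded to whole dollars)."""
    hourly_per_tb = node_price_per_hour / tb_per_node
    annual_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return round(hourly_per_tb, 3), round(annual_per_tb)


def cluster_cost_per_hour(nodes, node_price_per_hour):
    """Simple pricing: number of nodes x cost per hour (no charge for the leader node)."""
    return nodes * node_price_per_hour


for label, price in [("On-Demand", 0.850), ("1 Year Reservation", 0.500),
                     ("3 Year Reservation", 0.228)]:
    print(label, effective_prices(price))
```

For example, the 3 Year Reservation row is 0.228 / 2 = 0.114 per TB-hour, and 0.114 × 8,760 = 998.64, which the slide rounds to $999.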
46. Your choice of BI Tools on the cloud
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift
Annotation: pre-processing framework
48. Collaboration and Sharing insights
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift
49. Sharing results and visualizations
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Web App Server, visualization tools
50. Sharing results and visualizations and scale
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Web App Server, visualization tools
51. Sharing results and visualizations
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Business Intelligence Tools (shown twice)
52. Geospatial Visualizations
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, Business Intelligence Tools (shown twice), GIS tools on Hadoop, GIS tools, visualization tools
54. Rinse and Repeat
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, visualization tools, Business Intelligence Tools (shown twice), GIS tools on Hadoop, GIS tools, AWS Data Pipeline
55. The complete architecture
Diagram labels: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, visualization tools, Business Intelligence Tools (shown twice), GIS tools on Hadoop, GIS tools, AWS Data Pipeline
57. Mobilewalla
• Seattle-based big data venture that has accumulated the largest volumetric database of app market data in the industry.
• Applying data science techniques on this data, Mobilewalla generates actionable intelligence of importance to ad agencies, ad tech companies, and app publishers
• Measuring audience in mobile apps
58. Traditional audience measurement - Panels & Popularity Persistence
• Fundamental to panel-driven measurement: the idea of popularity persistence
• From a large pool of options, a “small” set of popular choices emerges (99 – 1 rule)
• Objects popular today remain popular 30, 60 and 90 days from today
• The panel can be assumed to eventually gravitate towards the persistent popular set
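Popularity persistence can be made concrete as the share of today's most popular items that are still in the popular set some days later. The sketch below is only an illustration of that idea, not Mobilewalla's method; the function names and the sample counts are hypothetical.

```python
def top_k(counts, k):
    """Return the k most popular item names (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {name for name, _ in ranked[:k]}


def persistence(counts_today, counts_later, k):
    """Fraction of today's top-k items that are still in the top-k later."""
    today = top_k(counts_today, k)
    later = top_k(counts_later, k)
    return len(today & later) / len(today) if today else 0.0


# Hypothetical usage counts for four apps, today and 30 days later.
usage_today = {"app_a": 100, "app_b": 90, "app_c": 5, "app_d": 1}
usage_day_30 = {"app_a": 80, "app_b": 2, "app_c": 70, "app_d": 1}
print(persistence(usage_today, usage_day_30, k=2))
```

A value close to 1.0 at 30, 60 and 90 days is what the persistence assumption relies on; a low value means a fixed panel would drift away from what is actually popular.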
59. Mobilewalla Use Case – App Publishers
• How is my app doing?
– Rank by Category and Country, Reviews, Ratings, Feature mentions, Sentiment Analysis, Social Media, Audience Profile, Negative Review Analysis, Upgrades
• Competitive Tracking
– All of the above for competitors presented as overlays
• Audience Analysis
– Demographics, Psychographics
• Alerts
– Notifications upon specific events: review spikes, Twitter spikes
60. Mobilewalla Use Case – Mobile Ad Tech
• New Publisher Acquisition
– Top N apps & Publishers for a Category / Geography
– Top publishers by audience
• Optimal Traffic Allocation
– Related apps by content
– Related apps by Audience profile
– Behavioral profiles of network apps
• Real-Time, Programmatic Delivery
– API driven access
– Sub 100ms response times
63. Mobilewalla – Amazon EC2 Infrastructure
Web Crawler
• 700+ micro to small instances
• Elastic MapReduce – flexibility to allocate a large number of instances for a distributed program that runs for a short time
• Spot Instances – reduce the cost
64. Mobilewalla – Amazon EC2 Infrastructure
Cloud Storage
• 50+ Medium to Large instances
• Cassandra DB Nodes – EBS backed
• Distributed in two availability zones in two different geographical regions
• Flexibility to add nodes as and when required
– allows you to grow with the business
• Region based fail-over
• Tiered storage systems
– Local storage
– Elastic Block Store (EBS)
– S3 Storage
• Considering Amazon Redshift
Diagram labels: Amazon S3, Amazon EBS, Amazon RDS
65. Mobilewalla – Amazon EC2 Infrastructure
Map Reduce Framework
• Complex analytics jobs on Hadoop systems in EC2 nodes
• Elastic MapReduce for jobs requiring a large number of nodes, on S3 storage systems
Diagram labels: Analytics (repeated four times)
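The map-reduce model behind these analytics jobs can be sketched in a few lines of single-process Python. This is a teaching sketch only: Hadoop and Elastic MapReduce distribute the same three phases across many nodes, and the review records, field names and functions below are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over records, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce


def rating_mapper(review):
    """Emit (category, rating) for one app review."""
    yield review["category"], review["rating"]


def mean_reducer(_category, ratings):
    """Average all ratings seen for one category."""
    return sum(ratings) / len(ratings)


# Hypothetical app reviews.
reviews = [
    {"app": "app_a", "category": "games", "rating": 5},
    {"app": "app_b", "category": "games", "rating": 3},
    {"app": "app_c", "category": "news", "rating": 3},
]
average_rating = map_reduce(reviews, rating_mapper, mean_reducer)
print(average_rating)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, the same job can be spread over as many nodes as the data volume requires.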
66. Mobilewalla – Amazon EC2 Infrastructure
Analytics Delivery
• Multiple application servers with load balancers
• High read throughput from data nodes
• Load balancers (ELB) and fail-over
67. Amazon Web Services for Mobilewalla - Advantages
• On-Demand and reserved nodes
– Flexibility to add, modify, delete nodes as your business changes
• Tiered storage systems to store and manage terabytes of data
– Flexibility to change the data parameters (reliability, read throughput, write throughput) by varying the storage systems of your choice
• Elastic MapReduce
– Large-scale MapReduce clusters without getting into the details of managing individual nodes and the MapReduce framework
Amazon EC2 allowed us to size our infrastructure as per our need and data growth.
68. Amazon Web Services for Mobilewalla - Suggestions
• Take time up front to explore Amazon's data storage and management offerings before developing a solution
• Changing the solution architecture for terabytes of data at a later time is a challenge
70. Big Data Analytics
Eddie Toh
Regional Platform Marketing Manager
Pricing & Product Marketing Group
Intel APAC
July 18, 2013
71. Analysis of Data Can Transform Society
• Create new business models and improve organizational processes.
• Enhance scientific understanding, drive innovation, and accelerate medical cures.
• Increase public safety and improve energy efficiency with smart grids.
73. Intel at the Intersection of Big Data
• HPC: Enabling exascale computing on massive data sets
• Cloud: Helping enterprises build open interoperable clouds
• Open Source: Contributing code and fostering ecosystem
74. Intel at the Heart of the Cloud
Server
Storage
Network
75. Scale-Out Platform Optimizations for Big Data
Cost-effective performance
• Intel® Advanced Vector Extensions Technology
• Intel® Turbo Boost Technology 2.0
• Intel® Advanced Encryption Standard New Instructions Technology
76. Intel® Advanced Vector Extensions Technology
• Newest in a long line of processor instruction innovations
• Increases floating point operations per clock, up to 2X performance [1]
[1] Performance comparison using Linpack benchmark. See backup for configuration details.
For more legal information on performance forecasts go to http://www.intel.com/performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to
any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products.
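The "up to 2X" floating point figure for Intel® AVX matches the change in vector register width, which the arithmetic below illustrates. It compares 256-bit AVX registers with the 128-bit SSE registers that preceded them and counts values per instruction only; it is a peak ratio, not a measured speedup, and the function names are illustrative.

```python
SSE_REGISTER_BITS = 128   # register width of the earlier SSE instructions
AVX_REGISTER_BITS = 256   # register width introduced with AVX


def lanes(register_bits, element_bits):
    """Number of values one vector instruction can operate on at once."""
    return register_bits // element_bits


def peak_ratio(element_bits):
    """Peak AVX-to-SSE ratio of floating point values processed per instruction."""
    return lanes(AVX_REGISTER_BITS, element_bits) / lanes(SSE_REGISTER_BITS, element_bits)


print(lanes(SSE_REGISTER_BITS, 64), lanes(AVX_REGISTER_BITS, 64))  # doubles per instruction
print(peak_ratio(64), peak_ratio(32))
```

Real workloads reach that ratio only when the code is vectorized and not limited by memory bandwidth, which is why the slide says "up to".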
77. More Performance
Higher turbo speeds maximize performance for single and multi-threaded applications
Intel® Turbo Boost Technology 2.0
78. Intel® Advanced Encryption Standard New Instructions
• Processor assistance for performing AES encryption - 7 new instructions
• Makes enabled encryption software faster and stronger
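Whether a given Linux server or EC2 instance exposes these instructions can be checked from the CPU flags, as in the sketch below. It assumes an x86 Linux system, where `/proc/cpuinfo` lists a `flags` line containing `aes` for AES-NI and `avx` for AVX; the function name and the sample text are illustrative.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        name, separator, value = line.partition(":")
        if separator and name.strip() == "flags":
            return set(value.split())
    return set()


# Shortened, hypothetical excerpt in the /proc/cpuinfo format.
sample = "processor\t: 0\nmodel name\t: Example CPU\nflags\t\t: fpu sse sse2 aes avx\n"
flags = cpu_flags(sample)
print("AES-NI:", "aes" in flags, "AVX:", "avx" in flags)
```

On a real machine the text would come from reading the file, for example `cpu_flags(open("/proc/cpuinfo").read())`.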