AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and Mobilewalla

Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Published in: Technology

  1. 1. Abhishek Sinha Business Development Manager, AWS July 18, 2013 @abysinha sinhaar@amazon.com Big Data Analytics
  2. 2. Overview • The Big Data Challenge • Turning data into actionable information • Building a big data platform • Mobilewalla – Big data system in AWS for mobile app audience measurement • Intel technology on big data
  3. 3. Generation Collection & storage Analytics & computation Collaboration & sharing
  4. 4. Generation Collection & storage Analytics & computation Collaboration & sharing Lower cost, higher throughput
  5. 5. Generation Collection & storage Analytics & computation Collaboration & sharing Highly constrained Lower cost, higher throughput
  6. 6. Generated data Available for analysis Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011 IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  7. 7. Big Gap in turning data into actionable information
  8. 8. Amazon Web Services helps remove constraints
  9. 9. 1 instance x 100 hours = 100 instances x 1 hour
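
The equivalence on slide 9 is instance-hour arithmetic: under per-instance-hour billing, cost depends on the product of instances and hours, so spreading a job across more instances shortens the wall-clock time without changing the bill. A minimal sketch, assuming a perfectly parallel job and a hypothetical hourly rate (the rate is illustrative, not an AWS price):

```python
def job_cost(instances: int, hours: float, hourly_rate: float) -> float:
    """Cost of a job billed per instance-hour."""
    return instances * hours * hourly_rate


HOURLY_RATE = 0.10  # hypothetical USD per instance-hour, for illustration only

serial = job_cost(instances=1, hours=100, hourly_rate=HOURLY_RATE)
parallel = job_cost(instances=100, hours=1, hourly_rate=HOURLY_RATE)

# Same number of instance-hours, so the same cost; only elapsed time differs.
print(f"1 x 100 h: ${serial:.2f}   100 x 1 h: ${parallel:.2f}")
```

In practice jobs rarely parallelize perfectly, so the 100-instance run may need somewhat more than one hour.
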
  10. 10. Media/Advertising Targeted Advertising Image and Video Processing Oil & Gas Seismic Analysis Retail Recommendation Transactions Analysis Life Sciences Genome Analysis Financial Services Monte Carlo Simulations Risk Analysis Security Anti-virus Fraud Detection Image Recognition Social Network/Gaming User Demographics Usage analysis In-game metrics Big Data Verticals and Use cases
  11. 11. From data to actionable information
  12. 12. “Who is using our service?”
  13. 13. Identified early mobile usage Invested heavily in mobile development Finding signal in the noise of logs
  14. 14. 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions. In January 2013
  15. 15. “What kind of movies do people like?”
  16. 16. More than 25 Million Streaming Members 50 Billion Events Per Day 30 Million plays every day 2 billion hours of video in 3 months 4 million ratings per day 3 million searches Device location, time, day, week, etc. Social data
  17. 17. Query complements the R3 solution by providing granular search-and-retrieval functionality for structured and unstructured data stored in FinQloud
  18. 18. Building a Big-Data Architecture
  19. 19. Generation Collection & storage Analytics & computation Collaboration & sharing
  20. 20. Generation Collection & storage Analytics & computation Collaboration & sharing
  21. 21. Getting your Data into AWS Amazon S3 Corporate Data Center • Console Upload • FTP • AWS Import Export • S3 API • Direct Connect • Storage Gateway • 3rd Party Commercial Apps • Tsunami UDP 1
  22. 22. Write directly to a data source Your application Amazon S3 DynamoDB Any other data store Amazon S3 Amazon EC2 2
  23. 23. Queue, pre-process, and then write to data source Amazon Simple Queue Service (SQS) Amazon S3 DynamoDB Any other data store 3
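
The queue, pre-process, write pattern of slide 23 can be sketched locally. This is a minimal sketch in which Python's standard `queue.Queue` stands in for Amazon SQS and a plain dict stands in for the data store; the record format and the `preprocess` function are hypothetical, and no AWS API is called.

```python
import json
import queue


def preprocess(raw: str) -> dict:
    """Hypothetical pre-processing step: parse JSON and normalise the event name."""
    record = json.loads(raw)
    record["event"] = record["event"].strip().lower()
    return record


def drain(buffer: queue.Queue, store: dict) -> int:
    """Consume every queued message, pre-process it, and write it to the store."""
    written = 0
    while True:
        try:
            raw = buffer.get_nowait()
        except queue.Empty:
            return written
        record = preprocess(raw)
        store[record["id"]] = record
        written += 1


buffer = queue.Queue()  # stand-in for the SQS queue
store = {}              # stand-in for S3, DynamoDB, or another data store

# Producers enqueue raw events without waiting for the store.
buffer.put(json.dumps({"id": "a1", "event": " Play "}))
buffer.put(json.dumps({"id": "b2", "event": "SEARCH"}))

count = drain(buffer, store)
print(count, store)
```

The queue decouples producers from the store, so bursts of incoming events are absorbed by the buffer while workers write at their own pace.
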
  24. 24. Agency Customer: Video Analytics on AWS Elastic Load Balancer Edge Servers on EC2 Workers on EC2 Logs Reports HDFS Cluster Amazon Simple Queue Service (SQS) Amazon Simple Storage Service (S3) Amazon Elastic MapReduce
  25. 25. Aggregate and write to data source Flume running on EC2 Amazon S3 Any other data store HDFS 4
  26. 26. Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html S3 as a “single source of truth” S3
  27. 27. Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Choose depending upon design
  28. 28. Generation Collection & storage Analytics & computation Collaboration & sharing
  29. 29. Hadoop based Analysis Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR
  30. 30. EMR is Hadoop in the Cloud What is Amazon Elastic MapReduce (EMR)?
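
For readers new to Hadoop, the programming model that EMR runs can be illustrated without a cluster. The following is a minimal single-process sketch of the map, shuffle, and reduce phases applied to a word count; the function names and sample log lines are illustrative assumptions, and a real EMR job would distribute these phases across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


logs = ["GET /home", "GET /search", "POST /login"]  # hypothetical sample input
counts = word_count(logs)
print(counts)
```
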
  31. 31. How does EMR work? • Put the data into S3 • Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. • Launch the cluster using the EMR console, CLI, SDK, or APIs • Get the output from S3 • You can also store everything in HDFS (Diagram: S3, EMR Cluster)
  32. 32. S3 What can you run on EMR… EMR Cluster
  33. 33. Resize Nodes EMR Cluster You can easily add and remove nodes
  34. 34. On and Off Fast Growth Predictable peaks Variable peaks WASTE
  35. 35. Fast Growth On and Off Predictable peaks Variable peaks
  36. 36. Your choice of tools on Hadoop/EMR Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR
  37. 37. SQL based processing Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Pre-processing framework Petabyte scale columnar data warehouse
  38. 38. What is Amazon Redshift? Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud • Easy to provision and scale • No upfront costs, pay as you go • High performance at a low price • Open and flexible with support for popular BI tools
  39. 39. Amazon Redshift is priced to let you analyze all your data
      Plan               | Price Per Hour for HS1.XL Single Node | Effective Hourly Price Per TB | Effective Annual Price per TB
      On-Demand          | $0.850                                | $0.425                        | $3,723
      1 Year Reservation | $0.500                                | $0.250                        | $2,190
      3 Year Reservation | $0.228                                | $0.114                        | $999
      Simple Pricing • Number of Nodes x Cost per Hour • No charge for Leader Node • No upfront costs • Pay as you go
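
The per-terabyte figures on the pricing slide follow from the node price. The slide's own ratios (for example $0.850 / $0.425) imply 2 TB per HS1.XL node, and the annual figure is the hourly per-TB price multiplied by the hours in a year. A short check, assuming a 365-day year (8,760 hours); the node prices are copied from the slide and reflect 2013 pricing:

```python
TB_PER_NODE = 2            # implied by the slide: $0.850 per node / $0.425 per TB
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year

# Hourly price per HS1.XL node, as listed on the slide.
node_price_per_hour = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(node_hourly: float) -> tuple:
    """Return (hourly price per TB, annual price per TB) for one node price."""
    per_tb_hour = node_hourly / TB_PER_NODE
    return per_tb_hour, per_tb_hour * HOURS_PER_YEAR


annual_per_tb = {}
for plan, price in node_price_per_hour.items():
    hourly, annual = effective_prices(price)
    annual_per_tb[plan] = round(annual)
    print(f"{plan}: ${hourly:.3f} per TB-hour, ${annual:,.0f} per TB-year")
```

Rounded to whole dollars, the computed annual values agree with the slide's $3,723, $2,190, and $999.
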
  40. 40. Your choice of BI Tools on the cloud Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Pre-processing framework
  41. 41. Generation Collection & storage Analytics & computation Collaboration & sharing
  42. 42. Collaboration and Sharing insights Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift
  43. 43. Sharing results and visualizations Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Web App Server Visualization tools
  44. 44. Sharing results and visualizations and scale Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Web App Server Visualization tools
  45. 45. Sharing results and visualizations Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Business Intelligence Tools Business Intelligence Tools
  46. 46. Geospatial Visualizations Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Business Intelligence Tools Business Intelligence Tools GIS tools on hadoop GIS tools Visualization tools
  47. 47. Rinse Repeat every day or hour
  48. 48. Rinse and Repeat Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Visualization tools Business Intelligence Tools Business Intelligence Tools GIS tools on hadoop GIS tools Amazon data pipeline
  49. 49. The complete architecture Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Visualization tools Business Intelligence Tools Business Intelligence Tools GIS tools on hadoop GIS tools Amazon data pipeline
  50. 50. Kaushik Dutta CTO 18 July, 2013 Mobilewalla – App Audience Measurement With Amazon EC2 Infrastructure
  51. 51. Mobilewalla • Seattle-based big data venture that has accumulated the largest volumetric database of app market data in the industry. • Applying data science techniques on this data, Mobilewalla generates actionable intelligence of importance to ad agencies, ad tech companies, and app publishers • Measuring audience in mobile apps
  52. 52. Traditional audience measurement - Panels & Popularity Persistence • Fundamental to panel driven measurement: the idea of popularity persistence • Large pool of options, “small” set of popular choices (99 – 1 rule) • Objects popular today → popular 30-60-90 days from today • Panel can be assumed to eventually gravitate towards the persistent popular set
  53. 53. Mobilewalla Use Case – App Publishers • How is my app doing? – Rank by Category and Country, Reviews, Ratings, Feature mentions, Sentiment Analysis, Social Media, Audience Profile, Negative Review Analysis, Upgrades • Competitive Tracking – All of the above for competitors presented as overlays • Audience Analysis – Demographics, Psychographics • Alerts – Notifications upon specific events: review spikes, Twitter spikes
  54. 54. Mobilewalla Use Case – Mobile Ad Tech • New Publisher Acquisition – Top N apps & Publishers for a Category / Geography – Top publishers by audience • Optimal Traffic Allocation – Related apps by content – Related apps by Audience profile – Behavioral profiles of network apps • Real-Time, Programmatic Delivery – API driven access – Sub 100ms response times
  55. 55. Mobilewalla Approach Social media / web Web Crawler Cloud Storage Amazon S3 Amazon EBS Amazon RDS
  56. 56. Mobilewalla Approach – Map-Reduce based analytics Analytics Analytics Analytics Analytics Map Reduce Analytics Cloud Storage ( 30+ Terabyte) Amazon S3 Amazon EBS Amazon RDS
  57. 57. Mobilewalla – Amazon EC2 Infrastructure Web Crawler • 700+ micro to small instances • Elastic map-reduce – flexibility of allocating a large number of instances for a distributed program running for a short time • Spot Instances – reduce the cost
  58. 58. Mobilewalla – Amazon EC2 Infrastructure Cloud Storage • 50+ Medium to Large instances • Cassandra DB Nodes – EBS backed • Distributed in two availability zones in two different geographical regions • Flexibility to add nodes as and when required – allows you to grow with the business • Region based fail-over • Tiered storage systems – Local storage – Elastic Block Storage – S3 Storage • Considering Amazon Redshift Amazon S3 Amazon EBS Amazon RDS
  59. 59. Mobilewalla – Amazon EC2 Infrastructure Map Reduce Framework • Complex analytics jobs on Hadoop systems in EC2 nodes • Elastic map-reduce for jobs requiring large number of nodes on S3 storage systems Analytics Analytics Analytics Analytics
  60. 60. Mobilewalla – Amazon EC2 Infrastructure Analytics Delivery • Multiple application servers with load balancers • High read throughput from data nodes • Load balancers (ELB) and fail-over
  61. 61. Amazon Web Services for Mobilewalla - Advantages • On-Demand and reserved nodes – Flexibility to add, modify, delete nodes as your business changes • Tiered storage systems to store and manage terabytes of data – Flexibility to change the data parameters (reliability, read throughput, write throughput) by varying the storage systems of your choice • Elastic Map-Reduce – Large scale map-reduce cluster without getting into the details of managing individual nodes and the map-reduce framework. Amazon EC2 allowed us to size our infrastructure according to our needs and data growth.
  62. 62. Amazon Web Services for Mobilewalla - Suggestions • Take the time up front to explore Amazon's various data storage and management offerings before developing a solution • Changing the solution architecture for terabytes of data at a later time is a challenge
  63. 63. Thank You
  64. 64. Big Data Analytics Eddie Toh Regional Platform Marketing Manager Pricing & Product Marketing Group Intel APAC July 18, 2013
  65. 65. Create new business models and improve organizational processes. Enhance scientific understanding, drive innovation, and accelerate medical cures. Increase public safety and improve energy efficiency with smart grids. Analysis of Data Can Transform Society
  66. 66. Unlock Value in Silicon Support Open Platforms Deliver Software Value Democratizing Analytics gets Value out of Big Data
  67. 67. Intel at the Intersection of Big Data Enabling exascale computing on massive data sets Helping enterprises build open interoperable clouds Cloud HPC Contributing code and fostering ecosystem Open Source
  68. 68. Intel at the Heart of the Cloud Server Storage Network
  69. 69. Scale-Out Platform Optimizations for Big Data Cost-effective performance • Intel® Advanced Vector Extensions Technology • Intel® Turbo Boost Technology 2.0 • Intel® Advanced Encryption Standard New Instructions Technology
  70. 70. Intel® Advanced Vector Extensions Technology • Newest in a long line of processor instruction innovations • Increases floating point operations per clock, up to 2X performance [1] [1] Performance comparison using Linpack benchmark. See backup for configuration details. For more legal information on performance forecasts go to http://www.intel.com/performance Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
  71. 71. More Performance Higher turbo speeds maximize performance for single and multi-threaded applications Intel® Turbo Boost Technology 2.0
  72. 72. Intel® Advanced Encryption Standard New Instructions • Processor assistance for performing AES encryption - 7 new instructions • Makes enabled encryption software faster and stronger
  73. 73. Richer user experiences 4HRS 50% Reduction ~7MIN 80% Reduction 50% Reduction 40% Reduction TeraSort for 1TB sort Intel® Xeon® Processor E5 2600 Solid-State Drive 10G Ethernet Intel® Distribution for Apache Hadoop Previous Intel® Xeon® Processor Power of the Platform built by Intel
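
The TeraSort figures on slide 73 are consistent with the four reductions compounding multiplicatively: each upgrade removes a fraction of the time that remains after the previous one. A quick check, taking the 4-hour baseline and the four percentages from the slide; the transcript does not say which percentage belongs to which component, so the list below is deliberately unlabeled:

```python
baseline_minutes = 4 * 60               # "4HRS" baseline from the slide
reductions = [0.50, 0.80, 0.50, 0.40]   # the four reductions listed on the slide

remaining = float(baseline_minutes)
for r in reductions:
    # Each reduction applies to whatever time is left after the previous step.
    remaining *= (1 - r)

print(f"{remaining:.1f} minutes")
```

The product works out to roughly 7.2 minutes, in line with the slide's "~7MIN"; because multiplication is commutative, the order in which the upgrades are applied does not affect the result.
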
  74. 74. Cloud Intelligent Systems Clients Virtuous Cycle of Data-Driven Experience
  75. 75. Thank You
  76. 76. Technical Track
  77. 77. Break Technical Track
