Aaum Analytics event - Big data in the cloud

Presentation given at an Aaum Analytics event on how to do Big Data analytics in the cloud.

Published in: Data & Analytics, Technology

  1. Big Data in the Cloud. Ganesh Raja, Solutions Architect, AWS, graja@amazon.com. ©Amazon.com, Inc. and its affiliates. All rights reserved.
  2. What is Big Data? When you do a directory listing on a folder and the lights start to flicker.
  3. What is Big Data? When your data sets become so large that you have to start innovating how to Collect, Store, Organize, Analyze and Share it. It's tough because of Velocity, Volume and Variety.
  4. The Big Data pipeline: Generation, Collection & storage, Analytics & computation, Collaboration & sharing.
  5. The cost of data generation is falling.
  6. Unconstrained data growth (GB, TB, PB, EB, ZB). Big Data is now moving fast: IT/application server logs (IT infrastructure logs, metering, audit logs, change logs); web sites, mobile apps and ads (clickstream, user engagement); sensor data (weather, smart grids, wearables); social media and user content (450MM+ tweets/day).
  7. The volume of data is increasing.
  8. Generation, Collection & storage, Analytics & computation, Collaboration & sharing: lower cost, higher throughput.
  9. Generation has become lower cost and higher throughput, but Collection & storage, Analytics & computation, and Collaboration & sharing remain highly constrained.
  10. Data volume: generated data vs. data available for analysis. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares".
  11. Cloud Computing: elastic & highly scalable + no capital expense + pay-per-use + on-demand = remove the constraints.
  12. Generation, Collection & storage, Analytics & computation, Collaboration & sharing: accelerated.
  13. Case Study: Razorfish. 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads per day. Analyzed customer clicks and impressions with Elastic MapReduce. "…no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired." "Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before." Example targeted ad: the user recently purchased a sports movie and is searching for video games.
  14. Results: 500% return on ad spend; procurement time cut from 2 months to a minute.
  15. How?
  16. Big Data technology: technologies and techniques for working productively with data, at any scale.
  17. AWS services across the pipeline. Collection & storage: Amazon S3, Amazon Glacier, AWS Storage Gateway, Amazon DynamoDB, Amazon Redshift, Amazon RDS, Amazon Kinesis (with AWS Data Pipeline for orchestration). Analytics & computation: Amazon EC2, Amazon Elastic MapReduce. Collaboration & sharing: Amazon EC2, Amazon S3, Amazon Redshift, Amazon RDS.
  18. Amazon Simple Storage Service (S3).
  19. What is Amazon Elastic MapReduce (EMR)? EMR is Hadoop in the cloud. Hadoop is an open-source framework for processing huge amounts of data in parallel on a cluster of machines.
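For readers unfamiliar with the model behind Hadoop: MapReduce splits a job into a map phase that emits key-value pairs and a reduce phase that aggregates them by key. A minimal single-machine sketch in plain Python, showing the programming model only (this is not EMR's API):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key, then sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data in the cloud", "the cloud scales big data"]
counts = reduce_phase(map_phase(lines))
print(counts["cloud"])  # 2
```

On a real cluster the map tasks run in parallel across nodes and the framework handles the shuffle between phases; the structure of the user code stays this simple.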
  20. How does it work? 1. Put the data into S3 (or HDFS). 2. Launch your cluster, choosing the Hadoop distribution, how many nodes, the node type (hi-CPU, hi-memory, etc.) and the Hadoop apps (Hive, Pig, HBase). 3. Get the results.
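The launch choices above map directly onto an API request. A rough sketch of what such a request could look like for the AWS SDK for Python (boto3's `run_job_flow`); the cluster name, log bucket, instance types and counts are illustrative placeholders, and the dict is only assembled here, not sent:

```python
# Parameters that could be passed as emr_client.run_job_flow(**cluster_config)
# using boto3; every name and size below is a hypothetical placeholder.
cluster_config = {
    "Name": "analytics-cluster",
    "LogUri": "s3://my-log-bucket/emr-logs/",   # hypothetical log bucket
    "Instances": {
        "MasterInstanceType": "m1.large",       # node type choice
        "SlaveInstanceType": "m1.large",
        "InstanceCount": 10,                    # how many nodes
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the steps finish
    },
    "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],  # Hadoop apps
}

print(cluster_config["Instances"]["InstanceCount"])  # 10
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what gives the "stop paying when the work is done" behaviour described on the later slide.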
  21. How does it work? You can easily resize the cluster.
  22. How does it work? Use Spot Instances to save time and money.
  23. How does it work? Launch parallel clusters against the same data source.
  24. How does it work? When the work is complete, you can terminate the cluster (and stop paying).
  25. Cost to run a 100-node Elastic MapReduce cluster: INR 450/hour ($7.5/hour).
  26. Cost to run a 100-node Elastic MapReduce cluster. Photos: renee_mcgurk, https://www.flickr.com/photos/51018933@N08/5355664961/in/photostream/; Calgary Reviews, https://www.flickr.com/photos/calgaryreviews/6328302248/in/photostream/
  27. Each day AWS adds the equivalent server capacity of a global $7B enterprise.
  28. Thousands of customers, 5+ million clusters.
  29. The Zoo: Apache Kafka, Amazon Kinesis, Apache Flume, Storm, Apache Spark, Apache Spark Streaming, Hadoop/EMR, Redshift, S3, DynamoDB, Hive, Pig, Shark, HDFS, Impala… which to choose?
  30. EMR makes it easy to use Hive and Pig. Hive: a data warehouse for Hadoop; SQL-like query language (HiveQL); initially developed at Facebook. Pig: a high-level programming language (Pig Latin); supports UDFs; ideal for data flow/ETL.
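To make the Hive/Pig contrast concrete, here is the same hypothetical "top pages by hits" job written once in HiveQL and once in Pig Latin, held as Python strings for illustration; the table name, S3 path and field names are all invented:

```python
# HiveQL: SQL-like, good when the data already looks like a table.
hive_query = """
SELECT page, COUNT(*) AS hits
FROM clickstream
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
"""

# Pig Latin: a step-by-step data flow, good for ETL pipelines.
pig_script = """
clicks      = LOAD 's3://my-bucket/clickstream/' AS (user:chararray, page:chararray);
by_page     = GROUP clicks BY page;
hits        = FOREACH by_page GENERATE group AS page, COUNT(clicks) AS hits;
sorted_hits = ORDER hits BY hits DESC;
top10       = LIMIT sorted_hits 10;
DUMP top10;
"""

print("GROUP BY" in hive_query)  # True
```

Both would run as steps on the same EMR cluster; the choice is mostly about whether the team thinks in SQL or in dataflow pipelines.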
  31. EMR makes it easy to use HBase and Mahout. HBase: a column-oriented database; runs on top of HDFS; random read/write access; ideal for sparse data and for very large tables (billions of rows, millions of columns). Mahout: a machine learning library; supports recommendation mining, clustering, classification, and frequent itemset mining.
  32. EMR makes it easy to use Spark and Shark. Spark: in-memory MapReduce; up to 100x faster than Hadoop; accesses HDFS, HBase, S3; developed at UC Berkeley; installed via a bootstrap action (BA). Shark: Hive on Spark; up to 100x faster; compatible with Hive; used at Yahoo, Airbnb, etc.; installed via a bootstrap action (BA).
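Spark's speedup comes from keeping intermediate results in memory and chaining transformations over a distributed dataset (an RDD). A rough single-machine sketch of that chained map/filter/reduce style in plain Python; PySpark would write roughly the same pipeline as `sc.parallelize(data).map(...).filter(...)` against a real cluster:

```python
from functools import reduce

data = [1, 2, 3, 4, 5, 6]

# Chained transformations, evaluated lazily until the final action,
# which is the shape of a typical Spark job.
squared = map(lambda x: x * x, data)          # map:    1, 4, 9, 16, 25, 36
large   = filter(lambda x: x > 10, squared)   # filter: 16, 25, 36
total   = reduce(lambda a, b: a + b, large)   # action: sum the survivors

print(total)  # 77
```

In Spark the intermediate datasets stay partitioned across the cluster's memory between steps, instead of being written to disk after every MapReduce round; that is where the "up to 100x" claim on the slide comes from.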
  33. EMR makes it easy to use other tools and applications. Ganglia: scalable distributed monitoring; view the performance of the cluster and of individual nodes; open source. R: a language and software environment for statistical computing and graphics; open source.
  34. EMR also supports MapR's Hadoop distributions. In addition to the Amazon Hadoop distribution, EMR supports the MapR Hadoop distributions M3, M5 and M7. Key features of MapR: NFS, no NameNode, JobTracker high availability, cluster mirroring (disaster recovery), compression.
  35. EMR integrates with the Hadoop ecosystem.
  36. Amazon S3 → Amazon EMR.
  37. Amazon S3 → Amazon EMR → Amazon Redshift.
  38. Amazon Redshift: data warehousing the AWS way. Easy to provision, scale and operate; no up-front costs, pay-as-you-go; fast (columnar storage, compression, specialized nodes); clusters of 1–100 nodes, 2 TB to 1.6 PB; $999 per TB per year; transparent backups, restore and failover; security in transit, at rest, for backups, and in VPC.
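The usual way to move EMR output from S3 into Redshift is the `COPY` command, which loads files from a bucket in parallel across the cluster's nodes. A sketch that only assembles such a statement; the table, bucket and IAM role names are hypothetical placeholders:

```python
# Build a Redshift COPY statement that would load gzipped CSV files
# from S3; every identifier below is a made-up example.
table   = "clickstream"
s3_path = "s3://my-data-bucket/clicks/"
role    = "arn:aws:iam::123456789012:role/RedshiftLoadRole"

copy_sql = (
    f"COPY {table} FROM '{s3_path}' "
    f"IAM_ROLE '{role}' "
    "FORMAT AS CSV GZIP;"
)

print(copy_sql.startswith("COPY clickstream"))  # True
```

In practice this statement would be executed over a normal PostgreSQL-protocol connection to the Redshift cluster, which is what makes the S3 → EMR → Redshift pipeline on the surrounding slides fit together.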
  39. Amazon S3 → Amazon EMR → Amazon Redshift.
  40. Amazon S3 → Amazon EMR → Amazon Redshift → Reporting + BI.
  41. Amazon Redshift → AWS Marketplace and AWS BI partners.
  42. Amazon Redshift → Reporting + BI → on-premise systems.
  43. 3 million (100 × 30 thousand) queries are executed per hour on 100 HANA instances.
  44. Each SAP HANA instance is deployed on an Amazon Web Services (AWS) Linux server. Each AWS server has 16 cores, 60 GB of RAM and 600 million rows (total: 1,776 cores, 6.6 TB of RAM and 60 billion rows).
  45. Rolls-Royce performance for the cost of a Ford. USD 400/hour.
  46. Experiment often. Fail quickly, at low cost. More innovation.
  47. Data is the new oil.
  48. How does Rajinikanth do Big Data?
  49. Thank you! Ganesh Raja, Solutions Architect, AWS, graja@amazon.com. ©Amazon.com, Inc. and its affiliates. All rights reserved.