AWS Summit Milan - Data Analysis


  1. 1. AWS Summit 2013 Milan, October 31, 2013. DATA ANALYSIS ON AWS. Hakan Gurel, Solutions Architecture
  2. 2. THE COST OF GENERATING DATA IS FALLING
  3. 3. THE MORE DATA YOU COLLECT THE MORE VALUE YOU CAN DERIVE FROM IT
  4. 4. GENERATE → STORE → ANALYZE → SHARE. Generate: lower cost, higher throughput. Store, analyze, share: highly constrained.
  5. 5. DATA VOLUME: generated data vs. data available for analysis. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
  6. 6. ACCELERATE: GENERATE → STORE → ANALYZE → SHARE
  7. 7. + ELASTIC AND HIGHLY SCALABLE + NO UPFRONT CAPITAL EXPENSE + PAY FOR ONLY WHAT YOU USE + AVAILABLE ON-DEMAND = REMOVE CONSTRAINTS
  8. 8. GENERATE → STORE → ANALYZE → SHARE: AWS Import / Export, AWS Direct Connect
  9. 9. Generated and stored in AWS; inbound data transfer is free; multipart upload to S3; physical media; AWS Direct Connect; regional replication of AMIs and snapshots
  10. 10. GENERATE → STORE → ANALYZE → SHARE: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2
  11. 11. AMAZON S3 SIMPLE STORAGE SERVICE
  12. 12. AMAZON DYNAMODB HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
  13. 13. DURABLE & AVAILABLE CONSISTENT, DISK-ONLY WRITES (SSD)
  14. 14. LOW LATENCY: AVERAGE READS < 5 ms, WRITES < 10 ms
  15. 15. NO ADMINISTRATION
  16. 16. 500,000 WRITES PER SECOND DURING SUPER BOWL
  17. 17. AMAZON REDSHIFT: FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
  18. 18. AMAZON REDSHIFT DESIGN OBJECTIVES: a petabyte-scale data warehouse service that was a lot faster, a lot cheaper, a whole lot simpler
  19. 19. AMAZON REDSHIFT RUNS ON OPTIMIZED HARDWARE. HS1.8XL: 128 GB RAM, 16 cores, 16 TB compressed user storage, 2 GB/sec scan rate. HS1.XL: 16 GB RAM, 2 cores, 2 TB compressed user storage.
  20. 20. 30 MINUTES DOWN TO 12 SECONDS
  21. 21. AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG. Extra Large Node (HS1.XL): single node (2 TB) or cluster of 2-32 nodes (4 TB – 64 TB). Eight Extra Large Node (HS1.8XL): cluster of 2-100 nodes (32 TB – 1.6 PB).
  22. 22. CREATE A DATA WAREHOUSE IN MINUTES
  23. 23. JDBC/ODBC
  24. 24. Pricing:
      Plan                 Price per hour for HS1.XL single node   Effective hourly price per TB   Effective annual price per TB
      On-Demand            $0.850                                  $0.425                          $3,723
      1 Year Reservation   $0.500                                  $0.250                          $2,190
      3 Year Reservation   $0.228                                  $0.114                          $999
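The per-TB figures on the pricing slide appear to follow from dividing the hourly node price by the node's storage and multiplying by the hours in a year. The sketch below reproduces that arithmetic, assuming 2 TB per HS1.XL node (from slide 19) and a 365-day year; the function and variable names are illustrative, not part of any AWS API.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming a non-leap year
NODE_STORAGE_TB = 2        # HS1.XL compressed storage per node (slide 19)

# Hourly prices for a single HS1.XL node, as listed on the slide.
hourly_price_per_node = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_node_price):
    """Return (hourly price per TB, annual price per TB) for one node."""
    per_tb_hour = hourly_node_price / NODE_STORAGE_TB
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return per_tb_hour, per_tb_year


for plan, price in hourly_price_per_node.items():
    per_tb_hour, per_tb_year = effective_prices(price)
    print(f"{plan}: ${per_tb_hour:.3f} per TB-hour, ${per_tb_year:,.0f} per TB-year")
```

Rounded to whole dollars, the annual values match the slide's figures.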
  25. 25. DATA WAREHOUSING DONE THE AWS WAY: easy to provision and scale up massively; no upfront costs, pay as you go; really fast performance at a really low price; open and flexible with support for popular tools
  26. 26. USAGE SCENARIOS
  27. 27. Cloud ETL for Big Data: S3 → EMR → Redshift → Reporting and BI. Maintain online SQL access to historical logs; transformation and enrichment with EMR; longer history ensures better insight.
  28. 28. Live archive for (structured) Big Data: OLTP web apps → DynamoDB → Redshift → Reporting and BI. Direct integration with COPY command; high-velocity data; data ages into Redshift; low-cost, high-scale option for new apps.
  29. 29. Reporting Warehouse: OLTP, ERP, RDBMS → Redshift → Reporting and BI. Accelerated operational reporting; support for short-time use cases; data compression, index redundancy.
  30. 30. On-Premises Integration: OLTP, ERP, RDBMS → Redshift + Reporting & BI
  31. 31. GENERATE → STORE → ANALYZE → SHARE: Amazon EC2, Amazon Elastic MapReduce
  32. 32. AMAZON EC2 ELASTIC COMPUTE CLOUD
  33. 33. CLUSTER GPU QUADRUPLE EXTRA LARGE: 2x Intel Xeon X5570, quad-core Nehalem architecture; 2x NVIDIA Tesla Fermi M2050 GPUs; 22 GB of memory; 1.7 TB of storage
  34. 34. ON A SINGLE INSTANCE COMPUTE TIME: 4h COST: 4h x $2.1 = $8.4
  35. 35. ON MULTIPLE INSTANCES COMPUTE TIME: 1h COST: 1h x 4 x $2.1 = $8.4
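The two slides above make the point that, with hourly pay-per-use pricing, spreading a job over more instances can shorten the wall-clock time without raising the bill. A minimal sketch of that arithmetic follows, assuming the workload parallelizes perfectly and that the $2.1 hourly rate applies per instance; `job_cost` is a hypothetical helper, not an AWS function.

```python
HOURLY_RATE_USD = 2.1  # per-instance hourly rate used on the slides


def job_cost(hours, instances, hourly_rate=HOURLY_RATE_USD):
    """Total cost of running `instances` machines for `hours` each."""
    return hours * instances * hourly_rate


# Single instance: 4 hours of compute time.
single = job_cost(hours=4, instances=1)

# Four instances: the same work finishes in 1 hour (assumes linear scaling).
parallel = job_cost(hours=1, instances=4)

print(f"single instance:    4h, ${single:.2f}")
print(f"multiple instances: 1h, ${parallel:.2f}")
```

In practice scaling is rarely perfectly linear, so the parallel run may cost somewhat more than this idealized figure.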
  36. 36. $4,828.85/hr for 3 hours instead of $20+ MILLION in infrastructure
  37. 37. AMAZON ELASTIC MAPREDUCE HADOOP AS A SERVICE
  38. 38. A FRAMEWORK; SPLITS DATA INTO PIECES; LETS PROCESSING OCCUR; GATHERS THE RESULTS
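The split, process, gather pattern described on the slide above can be illustrated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop or the Elastic MapReduce API; the function names and the word-count task are assumptions chosen for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_pieces(records, n_pieces):
    """Split the input into n_pieces roughly equal chunks."""
    return [records[i::n_pieces] for i in range(n_pieces)]


def map_piece(piece):
    """Process one chunk independently: count the words it contains."""
    counts = Counter()
    for line in piece:
        counts.update(line.lower().split())
    return counts


def gather(partials):
    """Merge the partial results from every chunk into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


lines = ["words matter", "lyrics are words", "data matter"]
pieces = split_into_pieces(lines, n_pieces=2)
partials = [map_piece(p) for p in pieces]  # on a cluster, each piece runs on its own node
totals = gather(partials)
print(totals.most_common())
```

Because each piece is processed independently, adding nodes lets more pieces run at the same time, which is what makes the elastic cluster sizes shown later in the deck useful.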
  39. 39. [Corporate Data Center | Elastic Data Center] Application data and logs for analysis pushed to S3
  40. 40. [Corporate Data Center | Elastic Data Center] Amazon Elastic MapReduce name node (N) to control analysis
  41. 41. [Corporate Data Center | Elastic Data Center] Hadoop cluster started by Elastic MapReduce
  42. 42. [Corporate Data Center | Elastic Data Center] Adding many hundreds or thousands of nodes
  43. 43. [Corporate Data Center | Elastic Data Center] Disposed of when job completes
  44. 44. [Corporate Data Center | Elastic Data Center] Results of analysis pulled back into your systems
  45. 45. GENERATE → STORE → ANALYZE → SHARE: Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2
  46. 46. GENERATE → STORE → ANALYZE → SHARE: AWS Data Pipeline
  47. 47. AWS Data Pipeline: data-intensive orchestration and automation; reliable and scheduled; easy to use, drag and drop; execution and retry logic; map data dependencies; create and manage compute resources
  48. 48. GENERATE → STORE → ANALYZE → SHARE, summary. Moving data in: AWS Import / Export, AWS Direct Connect. Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2. Analyze: Amazon EC2, Amazon Elastic MapReduce. Share: Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2. Across stages: AWS Data Pipeline.
  49. 49. FROM DATA TO ACTIONABLE INFORMATION
  50. 50. Stefano Rodighiero
  51. 51. MXM FACTS: 7+ million lyrics catalogue in more than 50 distinct languages. Currently musiXmatch is the only lyrics platform allowed for worldwide licensing and has deals with top music publishers: Warner Chappell, Universal, BMG, EMI Publishing, Sony ATV, Peer Music, ... Daily updated with more than 1 million artists and more than 20 million music tracks. Synced lyrics! Music discography metadata: lyrics, artists, albums, songs, biographies, worldwide charts. Words matter
  52. 52. SYNCED LYRICS
  53. 53. OUR DATA MUSIC METADATA: RECORDING & PUBLISHING
  54. 54. OUR DATA CONTENT USAGE
  55. 55. OUR DATA OTHER SOURCES
  56. 56. DATA ANALYSIS @ MXM CONTENT USAGE: REPORTING & ANALYTICS
  57. 57. DATAFLOW (batch path). Components: Frontend, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Post process, Publishing catalogue
  58. 58. BATCH REPORTING (Hive, Post process). Step 1: aggregation of views by country, application and content type. Step 2: join with a 500M+ rows table. It takes approx. 1 hour with 5 c1.xlarge instances. It used to take days with traditional techniques! The SQL interface makes it easier to review and share the process.
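The two batch steps described above (aggregate views, then join against a large table) map naturally onto SQL, which is what Hive exposes. The sketch below shows the shape of such a query using Python's built-in `sqlite3` as a stand-in for Hive; the table names, columns, join key and sample rows are all hypothetical, since the slides do not show the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: one row per content view, plus a catalogue to join against.
CREATE TABLE views (track_id INTEGER, country TEXT, application TEXT, content_type TEXT);
CREATE TABLE catalogue (track_id INTEGER PRIMARY KEY, publisher TEXT);
INSERT INTO views VALUES
  (1, 'IT', 'mobile', 'lyrics'),
  (1, 'IT', 'mobile', 'lyrics'),
  (2, 'IT', 'web', 'lyrics'),
  (2, 'US', 'web', 'synced');
INSERT INTO catalogue VALUES (1, 'Publisher A'), (2, 'Publisher B');
""")

report = conn.execute("""
-- Step 1: aggregate views by country, application and content type (per track).
WITH aggregated AS (
  SELECT track_id, country, application, content_type, COUNT(*) AS views
  FROM views
  GROUP BY track_id, country, application, content_type
)
-- Step 2: join the aggregate with the (in production, very large) catalogue table.
SELECT c.publisher, a.country, a.application, a.content_type, a.views
FROM aggregated AS a
JOIN catalogue AS c ON c.track_id = a.track_id
ORDER BY c.publisher, a.country
""").fetchall()

for row in report:
    print(row)
```

Aggregating before joining keeps the joined row count small, which matters when the second table has hundreds of millions of rows.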
  59. 59. DATAFLOW (interactive path). Components: Frontend proxy, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Publishing catalogue, Post process
  60. 60. INTERACTIVE ANALYTICS (Redis (real time analytics), Redshift, Analytics). SQL interface like Hive, accessible with any PostgreSQL client... but faster! Flexible costs. With Redshift doing all the heavy lifting, it's easier to build analytics tools.
  61. 61. DATAFLOW (interactive and batch paths). Components: Frontend proxy, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Post process, Publishing catalogue
  62. 62. MUSIXMATCH: Stefano Rodighiero, stefano@musixmatch.com, @larsen
  63. 63. MUSIXMATCH THANK YOU!
  64. 64. THANK YOU hakan@amazon.lu
