AWS Summit Milan - Data Analysis


Transcript

  • 1. AWS Summit 2013 Milan, 31 October 2013. DATA ANALYSIS ON AWS. Hakan Gurel, Solutions Architecture
  • 2. THE COST OF GENERATING DATA IS FALLING
  • 3. THE MORE DATA YOU COLLECT THE MORE VALUE YOU CAN DERIVE FROM IT
  • 4. Lower cost, higher throughput  GENERATE  STORE  ANALYZE  SHARE Highly constrained
  • 5. DATA VOLUME: Generated data; Available for analysis. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  • 6. ACCELERATE GENERATE  STORE  ANALYZE  SHARE
  • 7. + ELASTIC AND HIGHLY SCALABLE + NO UPFRONT CAPITAL EXPENSE + PAY FOR ONLY WHAT YOU USE + AVAILABLE ON-DEMAND = REMOVE CONSTRAINTS
  • 8. AWS Import / Export AWS Direct Connect GENERATE  STORE  ANALYZE  SHARE
  • 9. Generated and stored in AWS • Inbound data transfer is free • Multipart upload to S3 • Physical media • AWS Direct Connect • Regional replication of AMIs and snapshots
  • 10. Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2 GENERATE  STORE  ANALYZE  SHARE
  • 11. AMAZON S3 SIMPLE STORAGE SERVICE
  • 12. AMAZON DYNAMODB HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
  • 13. DURABLE & AVAILABLE CONSISTENT, DISK-ONLY WRITES (SSD)
  • 14. LOW LATENCY: AVERAGE READS < 5 ms, WRITES < 10 ms
  • 15. NO ADMINISTRATION
  • 16. 500,000 WRITES PER SECOND DURING SUPER BOWL
  • 17. AMAZON REDSHIFT: FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
  • 18. AMAZON REDSHIFT DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was… • A Lot Faster • A Lot Cheaper • A Whole Lot Simpler
  • 19. AMAZON REDSHIFT RUNS ON OPTIMIZED HARDWARE • HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate • HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage
  • 20. 30 MINUTES DOWN TO 12 SECONDS
  • 21. AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG • Extra Large Node (HS1.XL): Single Node (2 TB); Cluster 2-32 Nodes (4 TB – 64 TB) • Eight Extra Large Node (HS1.8XL): Cluster 2-100 Nodes (32 TB – 1.6 PB)
  • 22. CREATE A DATA WAREHOUSE IN MINUTES
  • 23. JDBC/ODBC
  • 24. Price Per Hour for HS1.XL Single Node | Effective Hourly Price Per TB | Effective Annual Price per TB
      On-Demand: $0.850 | $0.425 | $3,723
      1 Year Reservation: $0.500 | $0.250 | $2,190
      3 Year Reservation: $0.228 | $0.114 | $999
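The per-TB figures on slide 24 appear to follow from the hourly node price and the 2 TB of compressed storage quoted for an HS1.XL node on slide 19. The sketch below reproduces that arithmetic; it assumes a non-leap year of 8,760 hours, and the function and constant names are illustrative only, not part of any AWS API.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 8,760 hours, leap years ignored
NODE_STORAGE_TB = 2        # HS1.XL compressed storage per node (slide 19)

# Hourly node prices in USD, as quoted on slide 24.
HOURLY_NODE_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_node_price, storage_tb=NODE_STORAGE_TB):
    """Return (hourly price per TB, annual price per TB) for one node."""
    hourly_per_tb = hourly_node_price / storage_tb
    annual_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return hourly_per_tb, annual_per_tb


for plan, price in HOURLY_NODE_PRICE.items():
    hourly, annual = effective_prices(price)
    print(f"{plan}: ${hourly:.3f} per TB-hour, ${annual:,.0f} per TB-year")
```

Rounded to whole dollars, the annual values match the slide's $3,723, $2,190 and $999.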
  • 25. DATA WAREHOUSING DONE THE AWS WAY • Easy to provision and scale up massively • No upfront costs, pay as you go • Really fast performance at a really low price • Open and flexible with support for popular tools
  • 26. USAGE SCENARIOS
  • 27. Cloud ETL for Big Data S3 EMR Redshift Reporting and BI • Maintain online SQL access to historical logs • Transformation and enrichment with EMR • Longer history ensures better insight
  • 28. Live archive for (structured) Big Data OLTP Web Apps DynamoDB Redshift Reporting and BI • Direct integration with copy command • High velocity data • Data ages into Redshift • Low cost, high scale option for new apps
  • 29. Reporting Warehouse OLTP ERP RDBMS Redshift • Accelerated operational reporting • Support for short-time use cases • Data compression, index redundancy Reporting and BI
  • 30. On-Premises Integration OLTP ERP RDBMS Redshift + Reporting & BI
  • 31. GENERATE  STORE  ANALYZE  SHARE Amazon EC2 Amazon Elastic MapReduce
  • 32. AMAZON EC2 ELASTIC COMPUTE CLOUD
  • 33. CLUSTER GPU QUADRUPLE EXTRA LARGE • 2x Intel Xeon X5570, quad-core Nehalem architecture • 2x NVIDIA Tesla Fermi M2050 GPUs • 22 GB of memory – 1.7 TB of storage
  • 34. ON A SINGLE INSTANCE COMPUTE TIME: 4h COST: 4h x $2.1 = $8.4
  • 35. ON MULTIPLE INSTANCES COMPUTE TIME: 1h COST: 1h x 4 x $2.1 = $8.4
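Slides 34 and 35 make the point that, with per-hour pricing, running on four instances for one hour costs the same as running on one instance for four hours. A minimal sketch of that arithmetic follows. It assumes the workload parallelizes perfectly and ignores rounding up to whole billing hours; `job_cost` is a hypothetical helper, not an AWS API.

```python
HOURLY_RATE_USD = 2.1  # per-instance hourly price quoted on the slides


def job_cost(total_instance_hours, instances, hourly_rate=HOURLY_RATE_USD):
    """Return (wall-clock hours, total cost in USD) for a job.

    Assumes perfect linear speed-up: the work is divided evenly
    across instances with no coordination overhead.
    """
    wall_clock_hours = total_instance_hours / instances
    cost = wall_clock_hours * instances * hourly_rate
    return wall_clock_hours, cost


single = job_cost(total_instance_hours=4, instances=1)
multiple = job_cost(total_instance_hours=4, instances=4)
print(f"1 instance:  {single[0]:.0f}h, ${single[1]:.2f}")
print(f"4 instances: {multiple[0]:.0f}h, ${multiple[1]:.2f}")
```

In practice speed-up is rarely perfectly linear, so the multi-instance cost is a lower bound rather than a guarantee.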
  • 36. $4,828.85/hr for 3 hours, instead of $20+ MILLION in infrastructure
  • 37. AMAZON ELASTIC MAPREDUCE HADOOP AS A SERVICE
  • 38. • A FRAMEWORK • SPLITS DATA INTO PIECES • LETS PROCESSING OCCUR • GATHERS THE RESULTS
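The split / process / gather pattern on slide 38 can be illustrated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop or Elastic MapReduce code; the word-count task, the sample lines and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_pieces(records, pieces):
    """Split the input into roughly equal chunks (the 'split' step)."""
    return [records[i::pieces] for i in range(pieces)]


def process_chunk(chunk):
    """Process one chunk independently: count the words in its lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def gather(partials):
    """Merge the per-chunk counts into one result (the 'gather' step)."""
    return reduce(lambda a, b: a + b, partials, Counter())


lines = ["words matter", "data matter", "more data more value"]
chunks = split_into_pieces(lines, pieces=2)
result = gather(process_chunk(chunk) for chunk in chunks)
print(result.most_common())
```

In a real cluster each chunk would be processed on a different node; here the chunks are simply processed one after another.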
  • 39. Application data and logs for analysis pushed to S3 (Corporate Data Center, Elastic Data Center)
  • 40. Amazon Elastic MapReduce name node to control analysis (Corporate Data Center, Elastic Data Center)
  • 41. Hadoop cluster started by Elastic MapReduce (Corporate Data Center, Elastic Data Center)
  • 42. Adding many hundreds or thousands of nodes (Corporate Data Center, Elastic Data Center)
  • 43. Disposed of when job completes (Corporate Data Center, Elastic Data Center)
  • 44. Results of analysis pulled back into your systems (Corporate Data Center, Elastic Data Center)
  • 45. Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2 GENERATE  STORE  ANALYZE  SHARE
  • 46. GENERATE  STORE  ANALYZE  SHARE AWS Data Pipeline
  • 47. AWS Data Pipeline • Data-intensive orchestration and automation • Reliable and scheduled • Easy to use, drag and drop • Execution and retry logic • Map data dependencies • Create and manage compute resources
  • 48. GENERATE  STORE  ANALYZE  SHARE • AWS Import / Export, AWS Direct Connect • Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2 • Amazon EC2, Amazon Elastic MapReduce • Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2 • AWS Data Pipeline
  • 49. FROM DATA TO ACTIONABLE INFORMATION
  • 50. Stefano Rodighiero
  • 51. MXM FACTS • 7+ million lyrics catalogue in more than 50 distinct languages • Currently musiXmatch is the only lyrics platform allowed for worldwide licensing and has deals with top Music Publishers: Warner Chappell, Universal, BMG, EMI Publishing, Sony ATV, Peer Music, ... • Daily updated with more than 1 million artists and more than 20 million music tracks • Synced lyrics! • Music Discography Meta Data: Lyrics, Artists, Albums, Songs, Biographies, Worldwide Charts • Words matter
  • 52. SYNCED LYRICS
  • 53. OUR DATA MUSIC METADATA: RECORDING & PUBLISHING
  • 54. OUR DATA CONTENT USAGE
  • 55. OUR DATA OTHER SOURCES
  • 56. DATA ANALYSIS @ MXM CONTENT USAGE: REPORTING & ANALYTICS
  • 57. DATAFLOW: Frontend, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Post process, Publishing catalogue, Batch
  • 58. BATCH REPORTING (Hive, Post process, Batch) • Step 1. Aggregation of views by country, application and content type • Step 2. Join with a 500M+ row table • It takes approx. 1 hour with 5 c1.xlarge instances • It used to take days with traditional techniques! • SQL interface makes it easier to review and share the process
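The two batch steps on slide 58 (aggregate views, then join against a large table) map naturally onto SQL. The sketch below uses Python's built-in `sqlite3` as a stand-in for Hive so that it runs anywhere. The table names, column names, sample rows and the join key `track_id` are all hypothetical; the slides do not show the real schema or query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE views (
    country TEXT, application TEXT, content_type TEXT, track_id INTEGER
);
CREATE TABLE catalogue (track_id INTEGER PRIMARY KEY, publisher TEXT);
INSERT INTO views VALUES
    ('IT', 'ios', 'lyrics', 1),
    ('IT', 'ios', 'lyrics', 1),
    ('IT', 'web', 'lyrics', 2),
    ('US', 'ios', 'synced', 1);
INSERT INTO catalogue VALUES (1, 'Publisher A'), (2, 'Publisher B');
""")

REPORT_SQL = """
WITH aggregated AS (
    -- Step 1: aggregate views by country, application and content type
    -- (track_id is kept in the grouping so Step 2 has a join key).
    SELECT country, application, content_type, track_id,
           COUNT(*) AS view_count
    FROM views
    GROUP BY country, application, content_type, track_id
)
-- Step 2: join the aggregate with the (much larger) catalogue table.
SELECT a.country, a.application, a.content_type, c.publisher, a.view_count
FROM aggregated AS a
JOIN catalogue AS c ON c.track_id = a.track_id
ORDER BY a.country, a.application, a.content_type, c.publisher
"""

rows = conn.execute(REPORT_SQL).fetchall()
for row in rows:
    print(row)
```

Aggregating before the join keeps the joined row count small, which is one plausible reason for ordering the steps this way.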
  • 59. DATAFLOW: Frontend proxy, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Publishing catalogue, Post process, Interactive
  • 60. INTERACTIVE ANALYTICS (Redis (real time analytics), Redshift, Analytics, Interactive) • SQL interface like Hive, accessible with any PostgreSQL client... but faster! • Flexible costs • With Redshift doing all the heavy lifting, it's easier to build analytics tools
  • 61. DATAFLOW: Frontend proxy, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Post process, Publishing catalogue, Interactive, Batch
  • 62. MUSIXMATCH: Stefano Rodighiero, stefano@musixmatch.com, @larsen
  • 63. MUSIXMATCH THANK YOU!
  • 64. THANK YOU hakan@amazon.lu