SoundCloud - User & Partner Conference - AT Internet

Big Data with Amazon Redshift and ATI - AT Internet

Published in: Technology

  1. Big Data with Amazon Redshift and ATI, November 27th, 2013
  2. HI, I’M OLE
  3. SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM
  4. Every minute, creators upload 12 hours of audio
  5. reaching over 250 million people every month
  6. 8% of the internet
  7. PRESIDENT OBAMA, FOO FIGHTERS, SNOOP LION, MADONNA, SKRILLEX, MACKLEMORE, JOHN OLIVER (DAILY SHOW/BUGLE)
  8. How’s the sales funnel performing in Brazil and what’s the split between products?
  9. DATA DEMOCRATIZATION • Avoid silos • Remove unnecessary restrictions • Provide simple tools • Teach people how to use data
  10. DATA DEMOCRATIZATION In one sentence: Deliver the right information to the right person at the right time.
  11. DATA ANALYSIS AND REPORTING 2010-2012: Production DB, Analytics DB, AT Internet
  12. DATA ANALYSIS AND REPORTING: Listens, Sounds, Users, Comments, Favorites, Shares, Reposts, Impressions, Clicks, Conversions, Suggestions, Downloads, Taggings, Uploads
  13. DATA ANALYSIS AND REPORTING Listens: timestamp, duration, sound, owner, listener, API-key, (location) country
  14. DATA ANALYSIS AND REPORTING Additional metadata: • location within sound • context (location on site) • segmentation. Listening creates >6000 events/s: BIG DATA
  15. HADOOP TO THE RESCUE: 2 data centers in AMS, 200+ nodes
  16. HADOOP TO THE RESCUE: listen data, listen metadata, search data, recommender data, product testing data, backend production data, backend logs
  17. HADOOP AND DATA DEMOCRATIZATION • Data is siloed on Hadoop • Data governance does not exist • Technical hurdles for access • Not real-time • Slow access
  18. AMAZON REDSHIFT • Fast, fully managed DW service • Optimized for datasets of a petabyte or more • Fast query and I/O performance • Columnar storage technology
  19. BI INFRASTRUCTURE 2013 Source Systems -> Staging Area -> Data Warehouse -> Data Exploration. Source systems: Amazon EMR Hadoop, MySQL (production db), AT Internet, External Systems. Loading: Pig/Ruby scripts, ETL scripts, COPY. Job execution powered by:
  20. How’s the sales funnel performing in Brazil and what’s the split between products?
  21. ATI Data Query Create query: 1. filter on funnel pages 2. select metrics and dimensions 3. add REST URL to ETL pipeline
  22. Source Systems -> Staging Area -> Data Warehouse -> Data Exploration. Source systems: Amazon EMR Hadoop, MySQL (production db), AT Internet, External Systems. Loading: Pig/Ruby scripts, ETL scripts, COPY. Job execution powered by:
  23. DATA EXPLORATION • Simple and fast access to data • More time for “deep dives” into data • Individualized reporting • Allows interactivity between users • Integrated with Redshift
  24. DATA DEMOCRATIZATION • Reports designed by end users • Central repository for data analysis • User interaction • Data from one source only • Scalable solution • Data to the people!
  25. QUESTIONS?
  26. THANK YOU! P.S. WE’RE HIRING. SOUNDCLOUD.COM/JOBS
  27. APPENDIX
  28. IMPORT DATA FROM SOURCE SYSTEMS First: Gather data from the several source systems into S3 (full/daily imports). Source systems: Hadoop, MySQL (production db), External Systems. MapReduce for: - Listens - Plays - Impressions - Affiliations - ...
  29. IMPORT DATA FROM SOURCE SYSTEMS Second: Rebuild staging area tables for full imports • Based on configuration files (yaml config files) • Create statements generated • Re-create DISTKEYs and SORTKEYs • Full control over changes to the data model. Staging Area tables: tracks, users, client applications, ...
  30. IMPORT DATA FROM SOURCE SYSTEMS Third: Import the data from S3 to Redshift • Full import: TRUNCATE & COPY • Daily import: COPY. Staging Area tables: tracks, users, client applications, ...
  31. ETL AND DW DATAMODEL ETL scripts divided into layers: - Layer 1: Staging -> DW (dimensions) - Layer 2: Staging -> DW (fact tables - raw data) - Layer 3: DW -> DW (aggregated fact tables) - Layer 4: DW -> Reporting Data Cubes (reporting data)
  32. ETL AND DW DATAMODEL Staging Area -> ETL Layer 1 & 2 -> Data Warehouse (ETL Layer 3 runs within the DW) -> ETL Layer 4 -> Data Exploration. Tasks: Data Cleaning, Data Transformation, Data Aggregation, Data Presentation. Tools: SQL, Ruby/SQL scripts
  33. JOB SCHEDULE AND EXECUTION • Job-scheduling tool developed internally • Set dependencies between jobs • Execution on multiple machines • Supports all the ETL layers
  34. TIMELINE (Week 2 to Week 16, in two-week steps) Stages: Analysis Stage, Design & Build, Test & Deploy. • Gap Analysis • Business Exploration (requirements interviews) Milestone: Requirement Analysis • Information Mapping Design • Solution Design (Draft) Milestone: End of Analysis Stage • Define Infrastructure • Design Data Model Milestone: Infrastructure Ready! • Build ETL • Build Data Cubes • Design Reports/Dashboards (Presentation Layer) Milestone: BI 1.0 is built! • System/Integration Tests • User Acceptance Milestone: BI 1.0 is tested! • User Workshops • BI 1.0 Evaluation Milestone: BI 1.0 is ready to use!
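
The appendix slides on importing data describe rebuilding staging tables from config files (with DISTKEYs and SORTKEYs) and then loading from S3 with TRUNCATE & COPY for full imports or a plain COPY for daily imports. A rough sketch of that idea is shown here. It is an illustration only, not SoundCloud's actual tooling: the deck mentions Ruby/SQL scripts and YAML config files, whereas this sketch uses Python and a plain dict standing in for a parsed YAML file. The table, column, bucket, and IAM role names are all hypothetical.

```python
# Stand-in for a parsed YAML config file; every name below is hypothetical.
TABLE_CONFIG = {
    "name": "staging_tracks",
    "columns": [
        ("track_id", "BIGINT"),
        ("user_id", "BIGINT"),
        ("created_at", "TIMESTAMP"),
    ],
    "distkey": "user_id",
    "sortkeys": ["created_at"],
    "s3_prefix": "s3://example-bucket/tracks/",
}

# Placeholder role ARN, assumed for illustration only.
IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-load"


def create_statements(cfg):
    """Return DROP and CREATE statements that rebuild one staging table."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in cfg["columns"])
    sortkeys = ", ".join(cfg["sortkeys"])
    return [
        f"DROP TABLE IF EXISTS {cfg['name']};",
        f"CREATE TABLE {cfg['name']} (\n  {cols}\n)\n"
        f"DISTKEY ({cfg['distkey']})\n"
        f"SORTKEY ({sortkeys});",
    ]


def load_statements(cfg, full_import):
    """Full import: TRUNCATE then COPY. Daily import: COPY only."""
    copy = (
        f"COPY {cfg['name']} FROM '{cfg['s3_prefix']}' "
        f"IAM_ROLE '{IAM_ROLE}' GZIP;"
    )
    if full_import:
        return [f"TRUNCATE {cfg['name']};", copy]
    return [copy]


if __name__ == "__main__":
    for stmt in create_statements(TABLE_CONFIG) + load_statements(TABLE_CONFIG, True):
        print(stmt)
```

Generating the statements from one config entry per table keeps the DISTKEY and SORTKEY choices in a single place, which is one way to get the "full control over changes to the data model" the slide mentions.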

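The job-scheduling slide in the appendix lists dependencies between jobs and execution on multiple machines. A minimal sketch of that idea, assuming a simple dependency graph over the four ETL layers, is given here. The job names and the edges between them are assumptions made for illustration; the internal tool itself is not described in the deck. The sketch relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs it depends on.
JOBS = {
    "staging_import": set(),
    "layer1_dimensions": {"staging_import"},
    "layer2_raw_facts": {"staging_import"},
    "layer3_aggregates": {"layer1_dimensions", "layer2_raw_facts"},
    "layer4_reporting_cubes": {"layer3_aggregates"},
}


def execution_batches(jobs):
    """Group jobs into batches; jobs in one batch do not depend on each other."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(JOBS), start=1):
        print(number, batch)
```

Jobs that land in the same batch have no dependencies on each other, so a scheduler could hand them to different machines and start the next batch once all of them report done.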