Data Democratization at SoundCloud - Bruno Sá (SoundCloud)

SoundCloud is the world’s leading social sound platform, where anyone can create sounds and share them everywhere. 200 million people listen to sounds on SoundCloud every month; that is eight percent of the Internet. 12 hours of audio are uploaded to SoundCloud every minute. As a result, SoundCloud not only handles a lot of data (on the order of hundreds of terabytes) but also embraces “data democratization”: all data must be available to anyone in the company who needs to access and work with it.

Published in: Technology

  1. 1. DATA DEMOCRATIZATION @ SOUNDCLOUD October 29th, 2013
  2. 2. HI, I’M BRUNO
  3. 3. SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM
  4. 4. Every minute, creators upload 12hrs of audio
  5. 5. reaching over 200m people every month
  6. 6. 8% of the internet
  7. 7. PRESIDENT OBAMA, FOO FIGHTERS, SNOOP LION, MADONNA, SKRILLEX, MACKLEMORE, JOHN OLIVER (DAILY SHOW/BUGLE)
  8. 8. • what gets listened to where?
 • how much revenue do we make in Brazil?
 • how many new users did we get from that campaign?
 • what makes a sound successful?
 • how do users use our iOS and Android apps?
 • do comments on tracks correlate with listens?
 • did the product change affect feature x?
 • what makes an artist successful?
  9. 9. DATA DEMOCRATIZATION • Avoid Silos
 • Remove unnecessary restrictions
 • Provide simple tools
 • Teach People how to use data
  10. 10. DATA DEMOCRATIZATION In one sentence: Deliver the right information to the right person at the right time.
  11. 11. DATA ANALYSIS AND REPORTING 2010-2012: PRODUCTION DB, ANALYTICS DB
  12. 12. DATA ANALYSIS AND REPORTING Listens
 Sounds
 Users
 Comments
 Favorites
 Shares
 Reposts
 Impressions
 Clicks
 Conversions
 Suggestions
 Downloads
 Taggings
 Uploads
  13. 13. DATA ANALYSIS AND REPORTING Listens:
 timestamp
 duration
 sound
 owner
 listener
 API-key
 (location)
 country
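The listen fields above could be modeled as a single record type. This is only a sketch: the slide gives field names, so every type, the `_ms` and `_id` suffixes, and the sample values below are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ListenEvent:
    """One listen, using the field names from the slide (types are assumed)."""
    timestamp: datetime              # when the listen happened
    duration_ms: int                 # how long the sound was played (unit assumed)
    sound_id: int                    # the sound that was played
    owner_id: int                    # the user who owns the sound
    listener_id: Optional[int]       # None for an anonymous listener (assumption)
    api_key: str                     # client application that made the request
    country: str                     # country of the listener
    location: Optional[str] = None   # optional; shown in parentheses on the slide


# Hypothetical sample record.
event = ListenEvent(
    timestamp=datetime(2013, 10, 29, 12, 0, tzinfo=timezone.utc),
    duration_ms=183_000,
    sound_id=42,
    owner_id=7,
    listener_id=None,
    api_key="hypothetical-client-key",
    country="BR",
)
```

Keeping `location` optional mirrors the parentheses on the slide, which suggest the field is not always present.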
  14. 14. DATA ANALYSIS AND REPORTING additional metadata:
 • location within sound
 • context (location on site)
 • segmentation
 Listening creates >6000 events/s
 BIG DATA
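To put the event rate in perspective, a quick back-of-the-envelope calculation helps. It assumes the quoted rate of 6000 events/s holds as a constant lower bound over a full day, which the slide does not state.

```python
EVENTS_PER_SECOND = 6000            # lower bound quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds

# Minimum number of listen events per day if the rate were constant.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
print(f"{events_per_day:,} events per day at minimum")
# prints "518,400,000 events per day at minimum"
```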
  15. 15. HADOOP TO THE RESCUE 2 datacenters in AMS
 200+ Nodes
  16. 16. HADOOP TO THE RESCUE listen data
 listen metadata
 search data
 recommender data
 product testing data
 backend production data
 backend logs
  17. 17. HADOOP AND DATA DEMOCRATIZATION Data is siloed on hadoop
 No data governance in place
 Technical hurdles for access
 Not realtime
 Slow access
  18. 18. AMAZON REDSHIFT Fast fully managed DW service
 Optimized for datasets of a petabyte or more
 Fast query and I/O performance
 Columnar storage technology
  19. 19. BI INFRASTRUCTURE 2013 (diagram labels)
 Source Systems: Hadoop, MySQL (production db), External Systems, ...
 Amazon EMR, Pig/Ruby Scripts, COPY
 Staging Area, DataWarehouse, Data Exploration
 ETL Scripts
 Job execution powered by:
  20. 20. IMPORT DATA FROM SOURCE SYSTEMS First: Gather data from the several source systems into S3
 Source systems: Hadoop, MySQL (production db), External Systems
 Full/Daily Imports
 MapReduce for: Listens, Plays, Impressions, Affiliations, ...
  21. 21. IMPORT DATA FROM SOURCE SYSTEMS Second: Rebuild staging area tables for full imports
 Based on configuration files (yaml config files)
 Create statements generated for the Staging Area tables: tracks, users, client applications, ...
 Re-create DISTKEYS and SORTKEYS
 Full control over changes to the data model
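The idea of generating create statements from configuration files can be sketched as follows. This is not the tooling from the talk: a plain dict stands in for one parsed yaml config file, and the table name, column names, key choices, and the `rebuild_statements` helper are all hypothetical. Dropping the table before re-creating it is also an assumption about what "rebuild" means here.

```python
# Stand-in for one parsed yaml config file; every name here is hypothetical.
TRACKS_CONFIG = {
    "table": "staging.tracks",
    "columns": [
        {"name": "id", "type": "BIGINT"},
        {"name": "user_id", "type": "BIGINT"},
        {"name": "title", "type": "VARCHAR(255)"},
        {"name": "created_at", "type": "TIMESTAMP"},
    ],
    "distkey": "user_id",
    "sortkeys": ["created_at", "id"],
}


def rebuild_statements(config: dict) -> list[str]:
    """Return DROP and CREATE statements that rebuild one staging table."""
    columns = ",\n".join(
        f"    {col['name']} {col['type']}" for col in config["columns"]
    )
    create = f"CREATE TABLE {config['table']} (\n{columns}\n)"
    # DISTKEY and SORTKEY are re-created from the config on every rebuild.
    if config.get("distkey"):
        create += f"\nDISTKEY ({config['distkey']})"
    if config.get("sortkeys"):
        create += f"\nSORTKEY ({', '.join(config['sortkeys'])})"
    return [f"DROP TABLE IF EXISTS {config['table']};", create + ";"]


for statement in rebuild_statements(TRACKS_CONFIG):
    print(statement)
```

Because the keys live in the config rather than in hand-written DDL, changing a distribution or sort key becomes a one-line config change followed by a full import.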
  22. 22. IMPORT DATA FROM SOURCE SYSTEMS Third: Import the data from S3 to RedShift
 Full import: TRUNCATE & COPY
 Daily import: COPY
 Staging Area tables: tracks, users, client applications, ...
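The difference between the two import modes can be sketched as a small statement builder. The helper name, the S3 paths, and the IAM role below are hypothetical, and a real COPY would also need data format options, which the slide does not specify.

```python
def import_statements(table: str, s3_path: str, iam_role: str, full: bool) -> list[str]:
    """Build the SQL for one staging table import.

    A full import empties the table first (TRUNCATE & COPY);
    a daily import only appends the new data (COPY).
    """
    copy = f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}';"
    if full:
        return [f"TRUNCATE {table};", copy]
    return [copy]


# Hypothetical bucket and role, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/hypothetical-redshift-role"
full_import = import_statements("staging.tracks", "s3://example-bucket/tracks/", ROLE, full=True)
daily_import = import_statements("staging.listens", "s3://example-bucket/listens/2013-10-29/", ROLE, full=False)
```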
  23. 23. ETL AND DW DATAMODEL ETL scripts divided into layers:
 - Layer 1: Staging -> DW (dimensions)
 - Layer 2: Staging -> DW (fact tables - raw data)
 - Layer 3: DW -> DW (aggregated fact tables)
 - Layer 4: DW -> Reporting Data Cubes (reporting data)
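As an illustration of what a Layer 3 step (DW -> DW, aggregated fact tables) might look like, the sketch below rolls a raw fact table up into a daily aggregate. It uses an in-memory SQLite database as a stand-in for the warehouse so that it runs anywhere; the table names, columns, and rows are invented for the example.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw fact table, as a Layer 2 script might have loaded it.
    CREATE TABLE fact_listens (listen_date TEXT, country TEXT, sound_id INTEGER);
    INSERT INTO fact_listens VALUES
        ('2013-10-28', 'BR', 1),
        ('2013-10-28', 'BR', 2),
        ('2013-10-28', 'DE', 1),
        ('2013-10-29', 'BR', 1);

    -- Layer 3: raw facts rolled up into an aggregated fact table.
    CREATE TABLE agg_listens_daily AS
    SELECT listen_date, country, COUNT(*) AS listens
    FROM fact_listens
    GROUP BY listen_date, country;
""")

rows = conn.execute(
    "SELECT listen_date, country, listens FROM agg_listens_daily "
    "ORDER BY listen_date, country"
).fetchall()
print(rows)
```

Reports can then read the small aggregate instead of scanning the raw fact table for every query.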
  24. 24. ETL AND DW DATAMODEL (diagram labels)
 Staging Area, DataWarehouse, Data Exploration
 ETL Layer 1 & 2, ETL Layer 3, ETL Layer 4
 Data Cleaning, Data Transformation, Data Aggregation, Data Presentation
 SQL, Ruby/SQL Scripts
  25. 25. JOB SCHEDULE AND EXECUTION Job-scheduling tool developed internally
 Set dependencies between jobs
 Execution on multiple machines
 Supports all the ETL layers
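The internal job-scheduling tool is not described further, but the core idea of setting dependencies between jobs can be sketched with a topological sort. The job names and the dependency graph below are hypothetical; each "wave" groups jobs whose dependencies are all finished and which could therefore run in parallel on multiple machines.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each job maps to the set of jobs it depends on.
JOBS = {
    "import_to_s3": set(),
    "rebuild_staging": {"import_to_s3"},
    "copy_to_staging": {"rebuild_staging"},
    "layer1_dimensions": {"copy_to_staging"},
    "layer2_facts": {"copy_to_staging"},
    "layer3_aggregates": {"layer1_dimensions", "layer2_facts"},
    "layer4_cubes": {"layer3_aggregates"},
}


def execution_waves(jobs: dict) -> list[list[str]]:
    """Group jobs into waves; all jobs in one wave can run concurrently."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs with all dependencies done
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves


for number, wave in enumerate(execution_waves(JOBS), start=1):
    print(number, wave)
```

In this sketch the two staging-to-DW layers land in the same wave, since neither depends on the other.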
  26. 26. DATA EXPLORATION Simple and fast access to data
 More time for “deep dives” into data
 Individualized Reporting
 Allows interactivity between users
 Integrated with RedShift
  27. 27. TIMELINE Week 2 to Week 16, in two-week steps
 Stages: Analysis Stage, Design & Build, Test & Deploy
 Tasks, each group followed by its milestone:
 • Gap Analysis, Business Exploration (requirements interviews): Requirement Analysis
 • Information Mapping Design, Solution Design (Draft): End of Analysis Stage
 • Define Infrastructure, Design Data Model: Infrastructure Ready!
 • Build ETL, Build Data Cubes, Design Reports/Dashboards (Presentation Layer): BI 1.0 is built!
 • System/Integration Tests, User Acceptance: BI 1.0 is tested!
 • User Workshops, BI 1.0 Evaluation: BI 1.0 is ready to use!
  28. 28. DATA DEMOCRATIZATION • Reports designed by end users
 • Central repository for data analysis
 • User interaction
 • Data from one source only
 • Scalable solution
 • Data to the people!
  29. 29. • what gets listened to where?
 • how much revenue do we make in Brazil?
 • how many new users did we get from that campaign?
 • what makes a sound successful?
 • how do users use our iOS and Android apps?
 • do comments on tracks correlate with listens?
 • did the product change affect feature x?
 • what makes an artist successful?
  30. 30. QUESTIONS?
  31. 31. THANK YOU! P.S. WE’RE HIRING.
 SOUNDCLOUD.COM/JOBS
