SoundCloud - User & Partner Conference - AT Internet

Big Data with Amazon Redshift and ATI - AT Internet

Usage Rights

© All Rights Reserved

Presentation Transcript

  • Big Data with Amazon Redshift and ATI, November 27th, 2013
  • HI, I’M OLE
  • SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM
  • Every minute, creators upload 12 hours of audio
  • reaching over 250m people every month
  • 8% of the internet
  • PRESIDENT OBAMA, FOO FIGHTERS, SNOOP LION, MADONNA, SKRILLEX, MACKLEMORE, JOHN OLIVER (DAILY SHOW/BUGLE)
  • How’s the sales funnel performing in Brazil and what’s the split between products?
  • DATA DEMOCRATIZATION: avoid silos; remove unnecessary restrictions; provide simple tools; teach people how to use data
  • DATA DEMOCRATIZATION in one sentence: deliver the right information to the right person at the right time.
  • DATA ANALYSIS AND REPORTING 2010-2012: production DB, analytics DB, AT Internet
  • DATA ANALYSIS AND REPORTING: listens, sounds, users, comments, favorites, shares, reposts, impressions, clicks, conversions, suggestions, downloads, taggings, uploads
  • DATA ANALYSIS AND REPORTING: Listens: timestamp, duration, sound, owner, listener, API-key (location), country
  • DATA ANALYSIS AND REPORTING: additional metadata: location within sound; context (location on site); segmentation. Listening creates >6000 events/s: BIG DATA
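
To put the quoted rate in perspective, the following back-of-the-envelope sketch converts it into a daily volume. It assumes the 6000 events/s lower bound from the slide holds as a constant average over a full day, which the talk does not state.

```python
# Lower bound quoted on the slide (">6000 events/s").
EVENTS_PER_SECOND = 6_000
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(rate_per_second: int) -> int:
    """Daily event count, ASSUMING the rate is a constant average."""
    return rate_per_second * SECONDS_PER_DAY


daily_events = events_per_day(EVENTS_PER_SECOND)
print(f"{daily_events:,} events per day")  # prints 518,400,000 events per day
```
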
  • HADOOP TO THE RESCUE: 2 data centers in AMS; 200+ nodes
  • HADOOP TO THE RESCUE: listen data, listen metadata, search data, recommender data, product testing data, backend production data, backend logs
  • HADOOP AND DATA DEMOCRATIZATION: data is siloed on Hadoop; data governance does not exist; technical hurdles for access; not real-time; slow access
  • AMAZON REDSHIFT: fast, fully managed DW service; optimized for datasets of a petabyte or more; fast query and I/O performance; columnar storage technology
  • BI INFRASTRUCTURE 2013 (architecture diagram). Stages: Source Systems, Staging Area, Data Warehouse, Data Exploration. Source systems: Hadoop, MySQL (production db), AT Internet, External Systems. Diagram labels: Amazon EMR; Pig/Ruby scripts; COPY; ETL scripts; "Job execution powered by:"
  • How’s the sales funnel performing in Brazil and what’s the split between products?
  • ATI DATA QUERY: create query: 1. filter on funnel pages; 2. select metrics and dimensions; 3. add REST URL to ETL pipeline
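
Step 3 could look roughly like the sketch below: the query definition is turned into a REST URL, and that URL is registered as one more source for the ETL pipeline to fetch. The base URL and every parameter name are placeholders chosen for illustration and are not AT Internet's actual API; in practice the URL would be produced by the Data Query tool itself.

```python
from urllib.parse import urlencode

# Placeholder endpoint, NOT a real AT Internet URL (assumption).
BASE_URL = "https://reporting.example.com/data"


def build_query_url(page_filter, metrics, dimensions):
    """Encode a saved query as a URL; parameter names are hypothetical."""
    params = {
        "filter": page_filter,
        "metrics": ",".join(metrics),
        "dimensions": ",".join(dimensions),
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Funnel pages, split by country and product (illustrative names only).
funnel_url = build_query_url(
    "funnel", ["visits", "conversions"], ["country", "product"]
)

# The ETL pipeline is modelled here as a plain list of source URLs.
etl_sources = [funnel_url]
```
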
  • (Architecture diagram, shown again.) Stages: Source Systems, Staging Area, Data Warehouse, Data Exploration. Source systems: Hadoop, MySQL (production db), AT Internet, External Systems. Diagram labels: Amazon EMR; Pig/Ruby scripts; COPY; ETL scripts; "Job execution powered by:"
  • DATA EXPLORATION: simple and fast access to data; more time for “deep dives” into data; individualized reporting; allows interactivity between users; integrated with Redshift
  • DATA DEMOCRATIZATION: reports designed by end users; central repository for data analysis; user interaction; data from one source only; scalable solution; data to the people!
  • QUESTIONS?
  • THANK YOU! P.S. WE’RE HIRING. SOUNDCLOUD.COM/JOBS
  • APPENDIX
  • IMPORT DATA FROM SOURCE SYSTEMS, first: gather data from the several source systems into S3 (full/daily imports). Source systems: Hadoop, MySQL (production db), External Systems. MapReduce for: listens, plays, impressions, affiliations, ...
  • IMPORT DATA FROM SOURCE SYSTEMS, second: rebuild staging area tables for full imports. Based on YAML configuration files; CREATE statements are generated; DISTKEYs and SORTKEYs are re-created; full control over changes in the data model. Staging area tables: tracks, users, client applications, ...
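
A minimal sketch of how a CREATE statement with its DISTKEY and SORTKEY might be generated from a table definition. The talk keeps these definitions in YAML files; to stay dependency-free, a plain dict stands in here for the parsed YAML. The config layout, table name, and column names are all assumptions made for illustration.

```python
# Hypothetical table definition; stands in for one parsed YAML config file.
TRACKS_CONFIG = {
    "table": "staging_tracks",
    "columns": [
        ("id", "BIGINT"),
        ("user_id", "BIGINT"),
        ("created_at", "TIMESTAMP"),
    ],
    "distkey": "user_id",
    "sortkeys": ["created_at", "id"],
}


def build_create_statement(config: dict) -> str:
    """Render a Redshift CREATE TABLE statement from a config dict."""
    columns = ",\n".join(
        f"    {name} {sql_type}" for name, sql_type in config["columns"]
    )
    statement = f"CREATE TABLE {config['table']} (\n{columns}\n)"
    # Distribution and sort keys are optional table attributes.
    if config.get("distkey"):
        statement += f"\nDISTKEY ({config['distkey']})"
    if config.get("sortkeys"):
        statement += f"\nSORTKEY ({', '.join(config['sortkeys'])})"
    return statement + ";"


ddl = build_create_statement(TRACKS_CONFIG)
print(ddl)
```

Keeping the keys in configuration means a change to the data model is a config edit followed by a rebuild, which matches the "full control" point on the slide.
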
  • IMPORT DATA FROM SOURCE SYSTEMS, third: import the data from S3 to Redshift. Full import: TRUNCATE & COPY; daily import: COPY. Staging area tables: tracks, users, client applications, ...
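
The difference between the two import modes can be sketched as a small statement builder. The bucket path and IAM role below are placeholders, and a real COPY would normally also carry format options matching the files in S3; the talk does not show the actual statements.

```python
from typing import List


def build_import_statements(
    table: str, s3_path: str, iam_role: str, full_import: bool
) -> List[str]:
    """Return the SQL for one table: TRUNCATE + COPY (full) or COPY (daily)."""
    statements = []
    if full_import:
        # A full import replaces the table contents before loading.
        statements.append(f"TRUNCATE {table};")
    statements.append(f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}';")
    return statements


# Placeholder values (assumptions), not taken from the talk.
S3_PATH = "s3://example-bucket/staging/tracks/"
IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-copy"

full_sql = build_import_statements("tracks", S3_PATH, IAM_ROLE, full_import=True)
daily_sql = build_import_statements("tracks", S3_PATH, IAM_ROLE, full_import=False)
```
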
  • ETL AND DW DATA MODEL: ETL scripts divided into layers: Layer 1: Staging -> DW (dimensions); Layer 2: Staging -> DW (fact tables, raw data); Layer 3: DW -> DW (aggregated fact tables); Layer 4: DW -> Reporting Data Cubes (reporting data)
  • ETL AND DW DATA MODEL (diagram). Areas: Staging Area, Data Warehouse, Data Exploration. Steps: ETL Layer 1 & 2, ETL Layer 3, ETL Layer 4. Diagram labels: Data Cleaning; Data Transformation; Data Aggregation; Data Presentation; SQL; Ruby/SQL scripts
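
One way to picture the layering is to tag each ETL script with its layer and run the scripts in layer order. This is a sketch only: the script names are invented, and running strictly layer by layer is an assumption, since the talk does not say how ordering is enforced.

```python
from dataclasses import dataclass
from typing import List

# The four layers as listed on the slide.
LAYERS = {
    1: "Staging -> DW (dimensions)",
    2: "Staging -> DW (fact tables, raw data)",
    3: "DW -> DW (aggregated fact tables)",
    4: "DW -> Reporting Data Cubes (reporting data)",
}


@dataclass(frozen=True)
class EtlScript:
    name: str   # hypothetical script name
    layer: int  # key into LAYERS


def execution_order(scripts: List[EtlScript]) -> List[EtlScript]:
    """Sort scripts by layer, then by name for a stable order."""
    unknown = [s.name for s in scripts if s.layer not in LAYERS]
    if unknown:
        raise ValueError(f"scripts with unknown layer: {unknown}")
    return sorted(scripts, key=lambda s: (s.layer, s.name))


scripts = [
    EtlScript("daily_listens_by_country", 3),
    EtlScript("dim_users", 1),
    EtlScript("fact_listens", 2),
    EtlScript("cube_listens", 4),
]
ordered_names = [s.name for s in execution_order(scripts)]
```
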
  • JOB SCHEDULE AND EXECUTION: job-scheduling tool developed internally; sets dependencies between jobs; executes on multiple machines; supports all the ETL layers
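
The internal scheduling tool is not described further, so the following is only a sketch of the dependency idea, using Python's standard-library `graphlib` (Python 3.9+) rather than anything SoundCloud built. The job names and the dependency graph are assumptions, and distribution across multiple machines is not modelled.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each maps to the set of jobs it depends on.
JOB_DEPENDENCIES = {
    "import_to_s3": set(),
    "copy_to_staging": {"import_to_s3"},
    "etl_layer_1": {"copy_to_staging"},
    "etl_layer_2": {"copy_to_staging"},
    "etl_layer_3": {"etl_layer_1", "etl_layer_2"},
    "etl_layer_4": {"etl_layer_3"},
}


def plan_jobs(dependencies):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


plan = plan_jobs(JOB_DEPENDENCIES)
print(plan)
```

Jobs with no dependency between them (here `etl_layer_1` and `etl_layer_2`) may appear in either order, which is also what would allow them to run in parallel. A circular dependency makes `static_order()` raise `graphlib.CycleError`.
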
  • TIMELINE (weeks 2 to 16; stages: Analysis Stage, Design & Build, Test & Deploy)
    - Gap Analysis; Business Exploration (requirements interviews)
    - Requirement Analysis
    - Information Mapping Design; Solution Design (Draft). Milestone: End of Analysis Stage
    - Define Infrastructure; Design Data Model. Milestone: Infrastructure Ready!
    - Build ETL; Build Data Cubes; Design Reports/Dashboards (Presentation Layer). Milestone: BI 1.0 is built!
    - System/Integration Tests; User Acceptance. Milestone: BI 1.0 is tested!
    - User Workshops; BI 1.0 Evaluation. Milestone: BI 1.0 is ready to use!