Sound cloud - User & Partner Conference - AT Internet
 

Big Data with Amazon Redshift and ATI - AT Internet


Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

    Sound cloud - User & Partner Conference - AT Internet Presentation Transcript

    • Big Data with Amazon Redshift and ATI • November 27th, 2013
    • HI, I’M OLE
    • SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM
    • Every minute, creators upload 12 hours of audio
    • reaching over 250 million people every month
    • 8% of the internet
    • PRESIDENT OBAMA • FOO FIGHTERS • SNOOP LION • MADONNA • SKRILLEX • MACKLEMORE • JOHN OLIVER (DAILY SHOW/BUGLE)
    • How’s the sales funnel performing in Brazil, and what’s the split between products?
    • DATA DEMOCRATIZATION • Avoid silos • Remove unnecessary restrictions • Provide simple tools • Teach people how to use data
    • DATA DEMOCRATIZATION • In one sentence: Deliver the right information to the right person at the right time.
    • DATA ANALYSIS AND REPORTING 2010-2012 • Production DB • Analytics DB • AT Internet
    • DATA ANALYSIS AND REPORTING • Listens • Sounds • Users • Comments • Favorites • Shares • Reposts • Impressions • Clicks • Conversions • Suggestions • Downloads • Taggings • Uploads
    • DATA ANALYSIS AND REPORTING • Listens: timestamp, duration, sound, owner, listener, API-key, (location) country
    • DATA ANALYSIS AND REPORTING • Additional metadata: location within sound, context (location on site), segmentation • Listening creates >6000 events/s: BIG DATA
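To give a feel for the scale behind these slides, here is a minimal Python sketch of a listen record and the daily volume implied by the stated rate. The field names follow the slide; the class name `ListenEvent`, the field types and units, and the helper `events_per_day` are assumptions made for illustration, not SoundCloud's actual schema. The rate of 6000 events/s is treated as a lower bound, since the slide says ">6000".

```python
from dataclasses import dataclass
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


@dataclass(frozen=True)
class ListenEvent:
    """One listen, with the fields named on the slide (types are assumed)."""
    timestamp: int               # assumed: Unix epoch seconds
    duration_ms: int             # assumed unit: milliseconds
    sound_id: int
    owner_id: int
    listener_id: Optional[int]   # assumed: None for anonymous listeners
    api_key: str
    country: Optional[str]       # derived from location when available


def events_per_day(events_per_second: int) -> int:
    """Scale a sustained per-second rate to a daily event count."""
    return events_per_second * SECONDS_PER_DAY


# Lower bound implied by ">6000 events/s".
daily_lower_bound = events_per_day(6000)
print(f"{daily_lower_bound:,} events per day")  # 518,400,000 events per day
```

At more than half a billion listen events per day, a single production or analytics database is a poor fit, which is consistent with the move to Hadoop on the following slides.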
    • HADOOP TO THE RESCUE • 2 datacenters in AMS • 200+ nodes
    • HADOOP TO THE RESCUE • listen data • listen metadata • search data • recommender data • product testing data • backend production data • backend logs
    • HADOOP AND DATA DEMOCRATIZATION • Data is siloed on Hadoop • No data governance • Technical hurdles for access • Not real-time • Slow access
    • AMAZON REDSHIFT • Fast, fully managed DW service • Optimized for datasets of a petabyte or more • Fast query and I/O performance • Columnar storage technology
    • BI INFRASTRUCTURE 2013 [diagram] • Stages: Source Systems, Staging Area, DataWarehouse, Data Exploration • Labels: Amazon EMR, Hadoop, Pig/Ruby Scripts, COPY, MySql (production db), AT Internet, ETL Scripts, External Systems • Job execution powered by: (logo not transcribed)
    • How’s the sales funnel performing in Brazil, and what’s the split between products?
    • ATI Data Query • Create query: 1. filter on funnel pages 2. select metrics and dimension 3. add REST URL to ETL pipeline
    • [diagram, repeated] • Stages: Source Systems, Staging Area, DataWarehouse, Data Exploration • Labels: Amazon EMR, Hadoop, Pig/Ruby Scripts, COPY, MySql (production db), AT Internet, ETL Scripts, External Systems • Job execution powered by: (logo not transcribed)
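The three query steps can be sketched in Python as follows. This is an illustration under stated assumptions only: the endpoint `BASE_URL`, the parameter names, the page names, and the `etl_sources` list are all hypothetical and do not reflect AT Internet's actual REST API or SoundCloud's pipeline.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; the real AT Internet REST URL is not given in the slides.
BASE_URL = "https://example.invalid/data-query"


def build_query_url(pages, metrics, dimensions, country):
    """Build a REST URL from a page filter, metrics, and dimensions.

    All parameter names here are assumptions for illustration.
    """
    params = {
        "pages": ",".join(pages),            # step 1: filter on funnel pages
        "metrics": ",".join(metrics),        # step 2: select metrics...
        "dimensions": ",".join(dimensions),  # ...and dimension
        "country": country,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Stand-in for the ETL pipeline's list of REST sources (step 3).
etl_sources = []

url = build_query_url(
    pages=["signup", "checkout", "confirmation"],  # hypothetical funnel pages
    metrics=["visits", "conversions"],
    dimensions=["product"],
    country="BR",
)
etl_sources.append(url)

# Round-trip check: the encoded query decodes back to the chosen filter.
decoded = parse_qs(urlsplit(url).query)
print(decoded["country"])
```

Keeping the query as a plain URL means the ETL pipeline only needs to fetch and load it, with no query logic of its own.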
    • DATA EXPLORATION • Simple and fast access to data • More time for “deep dives” into data • Individualized reporting • Allows interactivity between users • Integrated with Redshift
    • DATA DEMOCRATIZATION • Reports designed by end users • Central repository for data analysis • User interaction • Data from one source only • Scalable solution • Data to the people!
    • QUESTIONS?
    • THANK YOU! P.S. WE’RE HIRING. SOUNDCLOUD.COM/JOBS
    • APPENDIX
    • IMPORT DATA FROM SOURCE SYSTEMS • First: Gather data from the several source systems into S3 • Sources: Hadoop, MySql (production db), External Systems • Full/Daily Imports • MapReduce for: Listens, Plays, Impressions, Affiliations, ...
    • IMPORT DATA FROM SOURCE SYSTEMS • Second: Rebuild staging area tables for full imports • Based on yaml configuration files • Create statements generated • Re-create DISTKEYS and SORTKEYS • Full control over changes in the data model • Staging area tables: tracks, users, client applications, ...
    • IMPORT DATA FROM SOURCE SYSTEMS • Third: Import the data from S3 to Redshift • Full import: TRUNCATE & COPY • Daily import: COPY • Staging area tables: tracks, users, client applications, ...
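A minimal sketch of how create statements might be generated from configuration, assuming a config shape that is not shown in the slides. The dict `STAGING_CONFIG` stands in for a parsed yaml file, and its keys (`columns`, `distkey`, `sortkeys`), the column names, and the function `rebuild_statements` are hypothetical. The generated SQL uses Redshift's `DISTKEY` and `SORTKEY` table attributes.

```python
# Stand-in for a parsed yaml config file; the real file layout is not shown
# in the slides, so this structure is an assumption.
STAGING_CONFIG = {
    "tracks": {
        "columns": {
            "id": "BIGINT",
            "user_id": "BIGINT",
            "created_at": "TIMESTAMP",
        },
        "distkey": "user_id",
        "sortkeys": ["created_at"],
    },
}


def rebuild_statements(table, spec):
    """Return the DROP and CREATE statements that rebuild one staging table."""
    columns = ",\n".join(
        f"    {name} {sql_type}" for name, sql_type in spec["columns"].items()
    )
    create = f"CREATE TABLE {table} (\n{columns}\n)"
    if spec.get("distkey"):
        create += f"\nDISTKEY({spec['distkey']})"
    if spec.get("sortkeys"):
        create += f"\nSORTKEY({', '.join(spec['sortkeys'])})"
    return [f"DROP TABLE IF EXISTS {table};", create + ";"]


for statement in rebuild_statements("tracks", STAGING_CONFIG["tracks"]):
    print(statement)
```

Generating the DDL from config keeps data model changes in one reviewed place: editing a key in the file changes the distribution or sort key on the next full import.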
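The difference between the full and daily import can be sketched as a small statement builder. The function `load_statements`, the bucket path, and the IAM role ARN are placeholders, and file format options for `COPY` are omitted because the slides do not state the file format.

```python
def load_statements(table, s3_path, iam_role, full_import):
    """Return the SQL for loading one staging table from S3.

    A full import empties the table first (TRUNCATE & COPY); a daily import
    only appends the new files (COPY).
    """
    statements = []
    if full_import:
        statements.append(f"TRUNCATE {table};")
    statements.append(f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}';")
    return statements


# Placeholder values, not real resources.
ROLE = "arn:aws:iam::123456789012:role/example-redshift-load"

full = load_statements("tracks", "s3://example-bucket/full/tracks/", ROLE, True)
daily = load_statements("tracks", "s3://example-bucket/daily/tracks/", ROLE, False)

for statement in full + daily:
    print(statement)
```

Note that `TRUNCATE` commits immediately in Redshift, so a full import leaves the table empty until the following `COPY` finishes.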
    • ETL AND DW DATAMODEL • ETL scripts divided into layers: • Layer 1: Staging -> DW (dimensions) • Layer 2: Staging -> DW (fact tables - raw data) • Layer 3: DW -> DW (aggregated fact tables) • Layer 4: DW -> Reporting Data Cubes (reporting data)
    • ETL AND DW DATAMODEL [diagram] • Components: Staging Area, ETL Layer 1 & 2, DataWarehouse, ETL Layer 3, ETL Layer 4, Data Exploration • Labels: Data Cleaning, Data Transformation, Data Aggregation, Data Presentation, SQL, Ruby/SQL Scripts
    • JOB SCHEDULE AND EXECUTION • Job-scheduling tool developed internally • Set dependencies between jobs • Execution on multiple machines • Supports all the ETL layers
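The internal scheduling tool is not described further, so the following is only a sketch of the general idea of dependency-ordered execution, using Python's standard `graphlib` module (Python 3.9+). The job names and the dependency graph are assumptions loosely modelled on the import steps and ETL layers above; jobs in the same batch have no dependencies on each other and could run on different machines.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs it depends on.
JOBS = {
    "import_to_s3": set(),
    "load_staging": {"import_to_s3"},
    "layer1_dimensions": {"load_staging"},
    "layer2_facts": {"load_staging"},
    "layer3_aggregates": {"layer1_dimensions", "layer2_facts"},
    "layer4_cubes": {"layer3_aggregates"},
}


def execution_batches(dependencies):
    """Group jobs into batches; every job in a batch is ready at the same time."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorter.get_ready()
        batches.append(sorted(ready))
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


for number, batch in enumerate(execution_batches(JOBS), start=1):
    print(number, batch)
```

With this graph, the two staging-to-DW layers land in the same batch, while aggregation and cube building wait for both to finish.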
    • TIMELINE [chart, Week 2 to Week 16] • Stages: Analysis Stage, Design & Build, Test & Deploy • Gap Analysis, Business Exploration (requirements interviews) • Requirement Analysis • Information Mapping Design, Solution Design (Draft) • Milestone: End of Analysis Stage • Define Infrastructure, Design Data Model • Milestone: Infrastructure Ready! • Build ETL, Build Data Cubes, Design Reports/Dashboards (Presentation Layer) • Milestone: BI 1.0 is built! • System/Integration Tests, User Acceptance • Milestone: BI 1.0 is tested! • User Workshops, BI 1.0 Evaluation • Milestone: BI 1.0 is ready to use!