
Hemispheres of Data


Published in: Technology, Business
  1. Hemispheres of Data
     FOX Audience Network
     Brian Dolan, Director of Research Analytics
  2. What is FOX Audience Network?
       • Formerly a division of FOX Interactive with sister company MySpace, we are now an independent ad network
       • Exclusive consumer of MySpace profile data
       • Owner of two massive data stores:
           – ~500 TB Hadoop instance containing MySpace user data
           – ~250 TB (1 PB w/ redundancy) of ad serving events in a Greenplum data warehouse
  3. FAN's Data Challenge
       • 3-5 billion ad serving events captured today, not including hundreds of millions of dimension
       • Updating 30-50 million user profiles today
       • Training over 2,000 sophisticated mathematical models weekly against multi-TB data sets
  4. Data Character Varies Dramatically
       • User Data
           – Very sparse
           – Intermittent
           – Unstructured and user generated
           – Untrustworthy
           – Enormous
       • Advertiser Data
           – Dense
           – Current
           – Defined business dimensions
           – Verified
           – Enormous
  5. Not Separate, Isolated
       • Hadoop
           – "I love Horror Movies!"
           – "I need a cell phone"
           – "It's Miley!"
       • Greenplum
           – "What is responding to my ad?"
           – "How much revenue did I generate today?"
           – "Is this campaign fatigued?"
  6. Platform Tasks Also Differ Dramatically
       • User Data
           – Long strings parsed with regexp routines
           – No more than three passes through the data
           – Unreliable data feed where dimensions change weekly
           – Complicated APIs
       • Advertiser Data
           – Hundreds of 1st Normal Form dimension tables
           – Self-joining a routine task
           – Views and temporary tables
           – Reporting needs
           – User management
  7. Communicating
       • Now
           – Flat files passed between the systems
           – BAD!
       • Soon
           – Hive to provide better structured output from Hadoop
           – Greenplum to release HDFS reader/writer
       • Message Bus methods can feed both systems all the data, but transfer will always be necessary
       • Don't drink the Kool-Aid
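
The "Now" workflow described in slides 6 and 7 (regexp parsing of long user strings, then a flat-file handoff to the warehouse) can be illustrated with a small sketch. This is not FAN's actual pipeline: the tab-separated input layout, the `PATTERN` regular expression, the pipe delimiter, and the function names `parse_profiles` and `write_flat_file` are all assumptions made for illustration, and the sample strings are borrowed from slide 5.

```python
import csv
import io
import re

# Hypothetical pattern: pulls "I love X" / "I need X" statements out of free text.
# Real user-generated text would need far more robust rules than this.
PATTERN = re.compile(
    r"\bI\s+(?P<verb>love|need)\s+(?:an?\s+)?(?P<thing>[a-z][a-z ]*?)\s*(?=[!.?]|$)",
    re.IGNORECASE,
)


def parse_profiles(lines):
    """Make one pass over raw lines, yielding (user_id, verb, thing) rows.

    Each input line is assumed to look like "<user_id>\\t<free text>".
    Lines with no recognizable statement yield nothing (sparse data).
    """
    for line in lines:
        user_id, _, text = line.rstrip("\n").partition("\t")
        for match in PATTERN.finditer(text):
            yield user_id, match.group("verb").lower(), match.group("thing").lower()


def write_flat_file(rows, handle):
    """Write rows as a pipe-delimited flat file and return the row count."""
    writer = csv.writer(handle, delimiter="|", lineterminator="\n")
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Sample input: the user statements quoted on slide 5, with made-up user ids.
raw = [
    "u1\tI love Horror Movies!",
    "u2\tI need a cell phone",
    "u3\tIt's Miley!",
]

buf = io.StringIO()  # stands in for the file handed to the warehouse loader
n = write_flat_file(parse_profiles(raw), buf)
print(buf.getvalue(), end="")
```

Because `parse_profiles` is a generator, the raw strings are read once and streamed straight into the output file, which fits the "no more than three passes" constraint. The third sample line produces no row at all, a small instance of how sparse the user data is compared with the dense advertiser side.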