
Big Data Spain 2013 - Ad Networks analytics


Ad Networks act as middlemen between advertisers and publishers on the Internet. The advertiser is the agent that wants to place a particular ad in different media; the publisher is the agent who owns the media. These media are usually web pages or mobile applications.
Each time an ad is shown on a web page or in a mobile application, an impression event is generated. These impressions and other events feed the analytical panels that the agents (advertisers and publishers) use to analyze the performance of their campaigns or their web pages.

Presenting these panels to the agents is a technical challenge, because Ad Networks have to deal with billions of events each day and have to present interactive panels to thousands of agents. The scale of the problem requires distributed tools. Hadoop, with its storage and computing capacity, may come to the rescue: it can be used to precompute several statistics that are later presented in the panels.
But that is not enough for the agents. In order to perform exploratory analytics they need an interactive panel that allows them to filter down by a particular web page, country, and device in a particular time frame, or by any other ad-hoc filter.
Therefore, something more than Hadoop is needed to store the data and perform the statistical precomputations. At Datasalt, we have addressed this problem for several clients, and we have found a solution that is presented in this talk.
The solution includes two modules: the off-line module and the on-line module.
The off-line module is in charge of storing the received events and performing the most costly operations: cleaning the dataset, performing some aggregations in order to reduce the size of the data, and creating the file structures that will later be used to serve the on-line analytics. All these tasks are handled well by Hadoop. The most innovative part of this process is the last step, where file structures are created to be exported to the on-line side in order to serve the analytical panels.
The on-line module is in charge of serving the analytical queries received from the agents' panel webapp. The queries are basic statistics (count, count distinct, stddev, sum, etc.) run over a subset of the input dataset defined by an ad-hoc filter. The challenge here is that the system has to serve statistics for filters "on the fly", which makes it impossible to precalculate everything on the off-line side; part of the calculations must be done on demand. That would not be a problem if the scale of the data were not so big. Some kind of scalable database is needed for this task.
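The kind of on-demand query the on-line module serves can be sketched as follows. This is an illustrative example, not the talk's actual schema: the table and column names (impressions, country, device, day, cost) are assumptions, and an in-memory SQLite database stands in for the serving layer.

```python
import sqlite3

# Toy dataset standing in for the impression events (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE impressions
                (day TEXT, country TEXT, device TEXT, cost REAL)""")
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?, ?)",
    [("2013-11-01", "ES", "mobile", 0.10),
     ("2013-11-01", "ES", "desktop", 0.20),
     ("2013-11-02", "FR", "mobile", 0.15)])

# An ad-hoc filter chosen by the agent "on the fly": Spanish mobile
# traffic in a given time frame, with basic statistics on top.
row = conn.execute(
    """SELECT COUNT(*), SUM(cost)
       FROM impressions
       WHERE country = 'ES' AND device = 'mobile'
         AND day BETWEEN '2013-11-01' AND '2013-11-02'""").fetchone()
print(row)  # → (1, 0.1)
```

Because the filter is arbitrary, results like this cannot all be precomputed off-line; the aggregation itself must run at query time, which is why a scalable SQL-serving layer is needed.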

Published in: Technology

  1. Ad Networks analytics using Hadoop and Splout SQL – Iván de Prado Alonso, CEO of Datasalt (@ivanprado, @datasalt)
  2. Big Data consulting & training
  3. Agenda: 1. Analytics for Ad Networks; 2. Our solution (Hadoop + Splout SQL; Splout SQL in detail; pre-aggregations vs. sampling); 3. Conclusions
  4. Analytics for Ad Networks
  5. Ad Networks: • Principal agents: the advertiser and the publisher (web pages, mobile apps). • An Ad Network is a network of agents that mediate between advertisers and publishers: DSPs, SSPs, DMPs, ADTs, ITDs, etc.
  6. For the sake of simplicity... • Let's consider a monolithic Ad Network: a single agent between advertisers and publishers. • But the exposed solution is also useful for DSPs, SSPs, DMPs, etc.
  7. Need for analytics: • For advertisers: monitoring campaigns; improving ROI. • For publishers: improving ad placement. • But there can be tens of thousands of advertisers and hundreds of thousands of publishers.
  8. Analytics: counting impressions, clicks and CPC • For a given range of dates • Filtered by campaign, location, language, browser/device, ad type... or any combination of the above!
  9. Two-fold usage: • Operational: for invoicing, accounting, etc.; a limited set of parameter variations (fixed date ranges and common aggregations); exact results expected. • Exploratory: unlimited variations of parameters (ad-hoc filtering); approximated results are enough.
  10. Challenges: • Billions of events and hundreds of gigabytes per day: need for a distributed system. • Query flexibility: need to cope with both operational and exploratory queries. • Web latencies: queries must return in milliseconds.
  11. Exploding: • The data needed to serve the analytics panels is Big Data: thousands of advertiser panels, and even more publisher panels. • But individually, each agent's panel can be served from one machine (at least for 98% of advertisers/publishers): horizontal partitioning is a good strategy.
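The partitioning idea on this slide can be sketched in a few lines. This is an illustrative toy, not Splout SQL's actual routing code: the hash function and partition count are assumptions; the point is only that all of one agent's events land on the same partition, so a single machine can serve that agent's panel.

```python
def partition_for(agent_id: str, n_partitions: int) -> int:
    # Stable hash of the agent id: the same agent always maps to the
    # same partition (illustrative; not Splout SQL's real scheme).
    return sum(agent_id.encode()) % n_partitions

# A few toy events keyed by agent id.
events = [("U20", "impression"), ("U40", "click"), ("U20", "impression")]

partitions = {}
for agent, event in events:
    partitions.setdefault(partition_for(agent, 4), []).append((agent, event))

# Each partition now holds complete per-agent data.
print(partitions)
```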
  12. Our solution
  13. Our solution
  14. Hadoop: • Scalable: storage of raw data; computing capabilities. • Good for: creating pre-computed aggregations (views); generating samples of data. • Bad for: serving data; on-line aggregations.
  15. Splout SQL: • Scalable serving of full SQL queries (unlike NoSQLs). • Good for: ad-hoc aggregations over pre-computed views; serving low-latency web pages with concurrency.
  16. A well-balanced solution: • Hadoop provides a scalable repository for impressions and performs off-line pre-aggregations and sampling. • Splout SQL serves queries and performs on-line aggregations at sub-second latencies: each partition contains only data for a few agents, which ensures performance.
  17. Splout SQL (in detail)
  18. Splout SQL in detail: isolation between generation and serving
  19. Splout SQL architecture
  20. Generation: "Generate tablespace T_ADVERTISERS with 2 partitions for table ADVERTISERS partitioned by CID, table IMPRESSIONS partitioned by CID." The source tables ADVERTISERS (AID, Name: U20 Doug; U21 Ted; U40 John) and IMPRESSIONS (PID, AID, Amount: S100 U20 102; S101 U20 60; S223 U40 99) are split into partition U10–U35 (ADVERTISERS: U20 Doug, U21 Ted; IMPRESSIONS: S100 U20 102, S101 U20 60) and partition U36–U60 (ADVERTISERS: U40 John; IMPRESSIONS: S223 U40 99).
  21. API – Generation: command line (loading CSV files): $ hadoop jar splout-*-hadoop.jar generate … Also: Java API, HCatalog, Hive, Pig.
  22. Serving: for key = 'U20', tablespace = 'T_ADVERTISERS': SELECT Name, sum(Amount) FROM ADVERTISERS a, IMPRESSIONS i WHERE a.AID = i.AID AND AID = 'U20'; (routed to partition U10–U35, which holds U20's rows).
  23. Serving: for key = 'U40', tablespace = 'T_ADVERTISERS': SELECT Name, sum(Amount) FROM ADVERTISERS a, IMPRESSIONS i WHERE a.AID = i.AID AND AID = 'U40'; (routed to partition U36–U60, which holds U40's rows).
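The serving examples on slides 22–23 can be reproduced end to end with in-memory SQLite databases standing in for the two Splout SQL partitions. The routing function here is an illustrative sketch, not Splout SQL's real router; the table contents match the slides.

```python
import sqlite3

def make_partition(rows_adv, rows_imp):
    # Each partition is its own small SQL database, as in Splout SQL.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ADVERTISERS (AID TEXT, Name TEXT)")
    db.execute("CREATE TABLE IMPRESSIONS (PID TEXT, AID TEXT, Amount INT)")
    db.executemany("INSERT INTO ADVERTISERS VALUES (?, ?)", rows_adv)
    db.executemany("INSERT INTO IMPRESSIONS VALUES (?, ?, ?)", rows_imp)
    return db

partitions = {
    "U10-U35": make_partition([("U20", "Doug"), ("U21", "Ted")],
                              [("S100", "U20", 102), ("S101", "U20", 60)]),
    "U36-U60": make_partition([("U40", "John")],
                              [("S223", "U40", 99)]),
}

def route(key):
    # Simplistic range routing on the key (illustrative only).
    return partitions["U10-U35"] if "U10" <= key <= "U35" else partitions["U36-U60"]

QUERY = """SELECT Name, SUM(Amount) FROM ADVERTISERS a, IMPRESSIONS i
           WHERE a.AID = i.AID AND a.AID = ?"""
print(route("U20").execute(QUERY, ("U20",)).fetchone())  # → ('Doug', 162)
print(route("U40").execute(QUERY, ("U40",)).fetchone())  # → ('John', 99)
```

Since each query touches only one small partition, the join and aggregation stay cheap no matter how many agents the whole system holds.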
  24. API – Service: REST API, JSON response
  25. API – Console
  26. Pre-aggregations vs. sampling
  27. Operational usage: • Invoicing, accounting, monitoring, etc.: exact results; a constrained space of aggregations. • Pre-computed aggregates done in Hadoop (for example: per day; per day per location). • Extended aggregations done on-line using Splout SQL (for example, aggregate per week based on daily stats).
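The "aggregate per week based on daily stats" idea can be sketched as one small on-line SQL query over a pre-aggregated table. The table name daily_stats and its contents are illustrative, and SQLite stands in for the serving layer; the pattern is the slide's: Hadoop materializes the daily rows off-line, and the weekly figure is derived on demand.

```python
import sqlite3

# Pre-aggregated daily rows, as Hadoop would have produced them off-line.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_stats (day TEXT, impressions INT)")
db.executemany("INSERT INTO daily_stats VALUES (?, ?)",
               [("2013-11-04", 100),   # Monday
                ("2013-11-05", 150),   # Tuesday, same week
                ("2013-11-11", 80)])   # next Monday

# On-line extended aggregation: roll the daily rows up into weeks.
rows = db.execute(
    """SELECT strftime('%Y-%W', day) AS week, SUM(impressions)
       FROM daily_stats GROUP BY week ORDER BY week""").fetchall()
print(rows)
```

Because the on-line query scans a handful of pre-aggregated rows instead of billions of raw events, it comfortably meets web latencies.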
  28. Why not pre-compute everything? • That would mean one table per dimension combination. For two dimensions (day, location): day; location; location + day. • For n dimensions: 2^n − 1 combinations. It explodes!
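The slide's 2^n − 1 count is just the number of non-empty subsets of the n dimensions; a few lines make the explosion concrete (the dimension names beyond day and location are illustrative):

```python
from itertools import combinations

def n_tables(dims):
    # One pre-computed table per non-empty subset of the dimensions.
    return sum(1 for r in range(1, len(dims) + 1)
               for _ in combinations(dims, r))

print(n_tables(["day", "location"]))                    # → 3
print(n_tables(["day", "location", "device",
                "browser", "campaign"]))                # → 31 (2^5 - 1)
```

With the six-plus filter dimensions listed earlier (campaign, location, language, browser/device, ad type, date), pre-computing every combination is already impractical, which is why the on-line aggregation layer is needed.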
  29. Exploratory usage: • Ad-hoc filters to learn from data: approximated results are enough. • Intensive use of sampling: it can provide good accuracy with fast response. • Confidence interval for a sampled proportion: p ± z_{α/2} · √(p(1−p)/n), where p = proportion, n = sample size, z_{α/2} = normal distribution quantile.
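The slide's confidence interval is a one-liner; here it is as runnable code, with an example of the accuracy a modest sample already gives (the sample figures are illustrative):

```python
import math

def proportion_ci(p, n, z=1.96):
    # p ± z_{α/2} * sqrt(p * (1 - p) / n); z = 1.96 for a 95% interval.
    margin = z * math.sqrt(p * (1 - p) / n)
    return (p - margin, p + margin)

# E.g. 10% of a 10,000-event sample matches the ad-hoc filter:
low, high = proportion_ci(p=0.10, n=10_000)
print(f"{low:.4f} .. {high:.4f}")  # → 0.0941 .. 0.1059
```

A 10,000-row sample pins the true proportion to within about ±0.6 percentage points at 95% confidence, which is plenty for exploratory panels.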
  30. Samples: • Created on Hadoop: different sample sets (for the last X days; for the last year). • Splout SQL for serving them: on-line analytics over samples; 1 million records per second* (44 bytes per row); faster with data in memory (warming data prior to use: 2.7 million records per second*). *Measured on a laptop
  31. Pre-aggregations pros & cons: • Advantages: exact results; good for exploring the long tail. • Limitations: only for a constrained amount of aggregation combinations; not good for exploratory analysis.
  32. Sampling pros & cons: • Advantages: fast filtering for any set of dimensions; good accuracy for Top-N queries. • Limitations: bad for narrow dimension filters; bad for exploring the long tail; approximated results.
  33. Conclusions
  34. Conclusions: • Analytics in Ad Networks is a complex problem: due to the amount of data; due to the amount of agents. • It can be solved using Hadoop + Splout SQL: by the use of partitioning; using pre-aggregations (for operational usages); using sampling (for exploratory profiles).
  35. Iván de Prado Alonso – CEO of Datasalt (@ivanprado, @datasalt). Questions?