Yahoo! Display Ads Attribution Framework:
A Problem Of Efficient Sparse Joins On Massive Data



            Supreeth, Sundeep, Chenjie, Chinmay

                    Data Team, Yahoo!




Agenda

§  Problem description
 ›    Serves, impressions, clicks
 ›    Attribution
§  Class of problems and application in other use cases
§  Attribution framework
§  Performance comparison
§  Conclusion




Serves, Impressions, Clicks

[Diagram: web servers and ad servers]

§  Client side events
 ›    Impressions – client side event for an ad shown
 ›    Clicks – client side event for a click on an ad
 ›    Interactions – client side events for interactions within an ad
 ›    Impressions, clicks and conversions are a few bytes
§  Server side events
 ›    Serves – server logged event for an ad served; a serve has complete context
 ›    Serve events are heavy, a few 10s of KBs
§  Join
 ›    Client side: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
 ›    Server side: Serve Guid + Serve timestamp + {other fields of serve}

  * Guid is a globally unique identifier
Need For Attribution

[Diagram: serves in 5m instances, with older instances going back several hours to days; impressions/clicks arrive every 5 mins]

§  Attribute an impression/click with the serve
Distribution Of % Impressions Arrived From The Client Side wrt Serves

[Chart: % of impressions for a serve (y-axis, 0 to 90) against the time period from when the serves happened (x-axis, t1 to t13), where t1 -> 201205301000, t2 -> 201205300955, t3 -> 201205300950, ...]
Distribution Of % Clicks Arrived From The Client Side wrt Serves

[Chart: % of clicks for a serve (y-axis, 0 to 45) against the time period from when the serves happened (x-axis, t1 to t13), where t1 -> 201205301000, t2 -> 201205300955, t3 -> 201205300950, ...]
Class Of Problems


§  Sparse joins spanning TBs of data on grid
§  Few MBs to a few TBs
§  Left outer join or any other outer join


      Data Set                      Impressions   Serves (5m*288)
      Data Size (compressed size)   400MB         20GB * 288 ~= 5.6 TB
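
The serve volume in the table can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the 288 in "5m*288" is the number of 5-minute serve instances in 24 hours and that the ~5.6 TB figure uses binary units (1 TB = 1024 GB); all variable names are illustrative.

```python
# Figures from the table above; the binary unit conversion is an assumption.
SERVE_INSTANCE_GB = 20      # compressed size of one 5-minute serve instance
INSTANCE_MINUTES = 5
IMPRESSIONS_MB = 400        # compressed size of the impressions data set

# Number of 5-minute instances in a 24 hour lookback.
instances_per_day = 24 * 60 // INSTANCE_MINUTES

# Total serve data for the lookback, in GB and TB.
serves_gb = SERVE_INSTANCE_GB * instances_per_day
serves_tb = serves_gb / 1024

# How many times larger the serve side is than the impression side.
size_ratio = serves_gb * 1024 / IMPRESSIONS_MB

print(instances_per_day, serves_gb, serves_tb, size_ratio)
```

The size ratio is what makes the join sparse: the small side is a tiny fraction of the data that would otherwise have to be scanned.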




Similar Use Cases

§  Associating video, click, social interactions back to the
    activity data
§  Attribute back a small size client beacon to a large
    dataset
§  Within Yahoo
 ›    Audience view/click attribution
 ›    Weblog based investigation
 ›    Joining dimensional data with web traffic data




Pig Joins And Problem Fit


   Join Strategy     Comments                        Cost
   Merge join        The datasets are not sorted     High
   Hash join         Shuffle and reduce time         High
   Replicated Join   Does not meet performance       High
                     needs; left outer join on the
                     replicated dataset
   Skewed Join       Data set is not skewed          N/A




Problem Statement




 To do a sparse outer join on a very large dataset, with high performance
 requirements, for display ad attribution on the grid




Attribution Framework - Overview


§  Smart Instrumentation Strategies
§  Aggressive partitioning and selection
§  Partition Aware Efficient Join Query Plan




Instrument For Attribution

                                                                    Ø Smart Instrumentation
                                                                           Strategies
 §  Serve guid                                                     Aggressive partitioning and
                                                                            selection
 §  Clues which can help you partition better                      Partition Aware Efficient Join
                                                                              Query Plan
     ›    Timestamp of the serve
 §  Partition keys used in event instrumentation
 §  In the impression attribution example:

            Impression                                              Serves


Serve Guid + Serve timestamp + {other fields of        Serve Guid + Serve timestamp + {other
       impressions/clicks/interactions}                           fields of serve}




Partitioning approach

§  Join key based partitioning
§  Keys for leveraging physical partitioning
 ›    timestamp
§  Use of hashes in partitioning
 ›    HashFNV, Murmur

         Key                           Partition Type
         Join keys                     Hash
         Timestamp                     Range
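
The two partition types in the table can be sketched in Python as follows. This is an illustration under stated assumptions, not the framework's code: it uses the 32-bit FNV-1a variant (the slide only names "HashFNV, Murmur"), assumes 5-minute range buckets labelled in the YYYYMMDDHHMM form seen in the charts, and the function names and the `part-NNNNN` path layout are hypothetical.

```python
from datetime import datetime

FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def hash_partition(serve_guid: str, num_partitions: int) -> int:
    """Hash partition of the join key (the serve guid)."""
    return fnv1a_32(serve_guid.encode("utf-8")) % num_partitions


def time_partition(serve_timestamp: datetime, minutes: int = 5) -> str:
    """Range partition of the serve timestamp, floored to a 5-minute bucket."""
    floored_minute = serve_timestamp.minute - serve_timestamp.minute % minutes
    floored = serve_timestamp.replace(minute=floored_minute, second=0, microsecond=0)
    return floored.strftime("%Y%m%d%H%M")


def partition_path(serve_guid: str, serve_timestamp: datetime, num_partitions: int = 16) -> str:
    """Hypothetical directory layout: <time bucket>/part-<hash partition>."""
    return "{}/part-{:05d}".format(
        time_partition(serve_timestamp),
        hash_partition(serve_guid, num_partitions),
    )


print(partition_path("guid-1", datetime(2012, 5, 30, 9, 57, 12)))
```

Because the impression carries the same serve guid and serve timestamp as the serve, the same two functions applied to an impression point at the partition where its serve lives.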




Pruning/Selection

§  Hashing of keys in the data sets
§  Pruning of partitions
 ›    Timestamp
 ›    Hash of the join key
§  IO costs and partitions
§  Configurable partitions

        Key                   Partition Type   Pruning
        Join keys             Hash             Yes
        Timestamp             Range            Yes
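
Pruning on both keys can be sketched as computing, from the impression keys alone, the set of serve partitions that need to be read. The sketch below assumes a 32-bit FNV-1a hash, 5-minute time buckets and 16 hash partitions; the function names are hypothetical, and how much is actually pruned depends on how the impression keys are distributed.

```python
from datetime import datetime, timedelta


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash, used here as an assumed stand-in for HashFNV."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def time_bucket(ts: datetime, minutes: int = 5) -> str:
    """Label of the 5-minute serve instance that contains ts."""
    floored = ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)
    return floored.strftime("%Y%m%d%H%M")


def required_partitions(impression_keys, num_partitions: int):
    """Set of (time bucket, hash partition) pairs that hold matching serves."""
    return {
        (time_bucket(ts), fnv1a_32(guid.encode("utf-8")) % num_partitions)
        for guid, ts in impression_keys
    }


def total_partitions(lookback: timedelta, num_partitions: int, minutes: int = 5) -> int:
    """Number of serve partitions in the whole lookback window."""
    return (lookback // timedelta(minutes=minutes)) * num_partitions


# Illustrative impression keys: (serve guid, serve timestamp).
keys = [
    ("guid-1", datetime(2012, 5, 30, 9, 57)),
    ("guid-1", datetime(2012, 5, 30, 9, 58)),
    ("guid-2", datetime(2012, 5, 29, 23, 1)),
]
selected = required_partitions(keys, num_partitions=16)
print(len(selected), "of", total_partitions(timedelta(hours=24), 16), "partitions selected")
```

Every partition outside the selected set is skipped without being read, which is where the IO saving comes from.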




Partition Aware Efficient Join Query Plans

[Diagram: two-step join plan; the legend marks each input as in memory or stream]

§  Inner join
 ›    Impression event keys (Size: MBs)
 ›    Stream the selected serve event partitions (Size: TBs)
 ›    Output: annotated impression (Size: MBs)
§  Left outer join
 ›    Stream full impression event (Size: hundreds of MBs)
 ›    Annotated impression (Size: MBs)
 ›    Output: complete annotated impression data with serve data
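
The two steps of the plan can be sketched with plain Python dictionaries and generators. The slide states that the serve partitions and the full impression events are streamed; holding the impression keys and the annotated impressions in memory is an assumption of this sketch, as are the record field names and function names.

```python
def inner_join_keys_with_serves(impression_keys, serve_stream):
    """Step 1: stream serves and keep only those whose guid is a wanted key."""
    wanted = set(impression_keys)          # small side, assumed to fit in memory
    annotated = {}
    for serve in serve_stream:             # large side, read once as a stream
        guid = serve["serve_guid"]
        if guid in wanted:
            annotated[guid] = serve
    return annotated


def left_outer_join(impression_stream, annotated):
    """Step 2: every impression yields a record, with or without a serve."""
    for impression in impression_stream:
        serve = annotated.get(impression["serve_guid"])
        yield {**impression, "serve": serve}


# Illustrative records only.
impressions = [
    {"serve_guid": "g1", "event": "impression"},
    {"serve_guid": "g9", "event": "click"},
]
serves = [
    {"serve_guid": "g1", "ad": "A"},
    {"serve_guid": "g2", "ad": "B"},
]

annotated = inner_join_keys_with_serves(
    (imp["serve_guid"] for imp in impressions), iter(serves)
)
result = list(left_outer_join(iter(impressions), annotated))
print(result)
```

Splitting the work this way keeps the lookup table in each step small: the first step holds only keys, and the second holds only the serves that actually matched.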
Attribution Framework: Capabilities

§  Left outer on impression/click/interaction
 ›    As long as the impression/click/interaction exists, we will get a record in output
§  Complete annotation with the serve
§  Distinct join with serves
§  Sparse joins achieved by pruning the partitions
§  Map side joins




Attribution Framework: Implementation

§  Python embedded Pig
§  Dynamic partitioning/pruning (UDFs)
§  Configurable parameters
 ›    Lookbacks
 ›    Partitions
 ›    CombinedSplitSize




Attribution Framework: Tuning Parameters

§  Serve Partitions: trade off between IO & namespace used
    (lookback = 24 hours)

[Chart: bytes read (GB, left axis, 0 to 4000) and namespace used (number of files, right axis, 0 to 180000) against the number of partitions (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)]
Attribution Framework: Tuning Parameters

§  Split Size: trade off between number of mappers and map
    task run time
    (partitions = 16, lookback = 24 hours)

[Chart: number of mappers (left axis, 0 to 35000) and time taken (s, right axis, 0 to 1200) against split size (128MB, 1 GB, 2 GB, 3 GB, 4 GB)]
Comparison With Other Pig Joins

Join                    Mappers   Reducers   Lookback   Input Size                              Time to complete
Left Outer Hash Join    2800      45         40mins     180GB                                   42.5m*
Replicated Join         5680      0          5hours     1TB                                     7m**
Attribution Framework   5760      0          24hours    Effective 5.6 TB; with pruning 1.1 TB   6m***

 * Best case for hash join 1.5m+15.5m+25.5m (Mapper + Shuffle + Reducer)
 ** Map time taken
 *** 1 min + 2mins + 3mins (Selection/Pruning + Impression partitioning + Join)



Conclusion


§  For the sparse lookup problem, the attribution framework
    works well and meets the performance needs
§  Effective partitioning allows longer lookbacks and reduces
    IO
§  The levers in the framework allow for tuning based on the
    computation/IO requirements




Future Steps


§  Use HBase/Cassandra to store the event grain serve data
    and do lookups
§  Use of a Bloom filter along with an index format
§  Compare the strategy with what Hive does and come up
    with a framework using Hive




Questions?





More Related Content

Viewers also liked

State of digital ad fraud 2017 by augustine fou
State of digital ad fraud 2017 by augustine fouState of digital ad fraud 2017 by augustine fou
State of digital ad fraud 2017 by augustine fou
Dr. Augustine Fou - Independent Ad Fraud Researcher
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
Carl Steinbach
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
Hortonworks
 
Chango - DDM Alliance Summit Marketing on Facebook
Chango - DDM Alliance Summit Marketing on FacebookChango - DDM Alliance Summit Marketing on Facebook
Chango - DDM Alliance Summit Marketing on Facebook
DDM Alliance
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
sudhakara st
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
Edureka!
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
Rahul Jain
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 

Viewers also liked (9)

State of digital ad fraud 2017 by augustine fou
State of digital ad fraud 2017 by augustine fouState of digital ad fraud 2017 by augustine fou
State of digital ad fraud 2017 by augustine fou
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Chango - DDM Alliance Summit Marketing on Facebook
Chango - DDM Alliance Summit Marketing on FacebookChango - DDM Alliance Summit Marketing on Facebook
Chango - DDM Alliance Summit Marketing on Facebook
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to Yahoo Display Advertising Attribution

MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks
 
Softtek Break Through Savings No Need Offshore 2011 Asug Final
Softtek Break Through Savings No Need Offshore 2011 Asug FinalSofttek Break Through Savings No Need Offshore 2011 Asug Final
Softtek Break Through Savings No Need Offshore 2011 Asug Final
Mauro Okamoto
 
Dreamforce'12 - Automate Business Processes with Force.com
Dreamforce'12 - Automate Business Processes with Force.comDreamforce'12 - Automate Business Processes with Force.com
Dreamforce'12 - Automate Business Processes with Force.com
Mudit Agarwal
 
Samanage Benchmarking: Better Service Performance Starts Here
Samanage Benchmarking: Better Service Performance Starts HereSamanage Benchmarking: Better Service Performance Starts Here
Samanage Benchmarking: Better Service Performance Starts Here
Samanage
 
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
Compuware APM
 
Managed services
Managed servicesManaged services
Managed services
rakeysh001
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck Overview
Rundeck
 
What does performance mean in the cloud
What does performance mean in the cloudWhat does performance mean in the cloud
What does performance mean in the cloud
Michael Kopp
 
Managed services
Managed servicesManaged services
Managed services
rakeysh001
 
CCCC Neustar Lenny Rachitsky
CCCC Neustar Lenny RachitskyCCCC Neustar Lenny Rachitsky
CCCC Neustar Lenny Rachitsky
Cloud Congress
 
Prelim survey data 9 17-11
Prelim survey data 9 17-11Prelim survey data 9 17-11
Prelim survey data 9 17-11
tmartinez12
 
Pinning Down Cloud Computing
Pinning Down Cloud ComputingPinning Down Cloud Computing
Pinning Down Cloud Computing
Yankee Group
 
Warranty Outsourcing For Strategic Gains
Warranty Outsourcing For Strategic GainsWarranty Outsourcing For Strategic Gains
Warranty Outsourcing For Strategic Gains
ImranMasood
 
JDX Suite - A Product by Ad2pro Group
JDX Suite - A Product by Ad2pro GroupJDX Suite - A Product by Ad2pro Group
JDX Suite - A Product by Ad2pro Group
ShivaKumar1803
 
Soa To The Rescue
Soa To The RescueSoa To The Rescue
Soa To The Rescue
David Linthicum
 
IT Infrastructure Outsourcing Benefits Demystified
IT Infrastructure Outsourcing Benefits Demystified IT Infrastructure Outsourcing Benefits Demystified
IT Infrastructure Outsourcing Benefits Demystified
CTRLS
 
Daniel Jasník - ITSMF pro cloudové služby - AID2019
Daniel Jasník - ITSMF pro cloudové služby - AID2019Daniel Jasník - ITSMF pro cloudové služby - AID2019
Daniel Jasník - ITSMF pro cloudové služby - AID2019
ALVAO
 
IT Service Level Agreement
IT Service Level AgreementIT Service Level Agreement
IT Service Level Agreement
KHNOG
 
Brotight China - Professional Service
Brotight China - Professional ServiceBrotight China - Professional Service
Brotight China - Professional Service
Allen He
 
Sciencelogic - A Leader in IT Transformation
Sciencelogic - A Leader in IT Transformation Sciencelogic - A Leader in IT Transformation
Sciencelogic - A Leader in IT Transformation
Chris Phillips
 

Similar to Yahoo Display Advertising Attribution (20)

MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!
 
Softtek Break Through Savings No Need Offshore 2011 Asug Final
Softtek Break Through Savings No Need Offshore 2011 Asug FinalSofttek Break Through Savings No Need Offshore 2011 Asug Final
Softtek Break Through Savings No Need Offshore 2011 Asug Final
 
Dreamforce'12 - Automate Business Processes with Force.com
Dreamforce'12 - Automate Business Processes with Force.comDreamforce'12 - Automate Business Processes with Force.com
Dreamforce'12 - Automate Business Processes with Force.com
 
Samanage Benchmarking: Better Service Performance Starts Here
Samanage Benchmarking: Better Service Performance Starts HereSamanage Benchmarking: Better Service Performance Starts Here
Samanage Benchmarking: Better Service Performance Starts Here
 
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
 
Managed services
Managed servicesManaged services
Managed services
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck Overview
 
What does performance mean in the cloud
What does performance mean in the cloudWhat does performance mean in the cloud
What does performance mean in the cloud
 
Managed services
Managed servicesManaged services
Managed services
 
CCCC Neustar Lenny Rachitsky
CCCC Neustar Lenny RachitskyCCCC Neustar Lenny Rachitsky
CCCC Neustar Lenny Rachitsky
 
Prelim survey data 9 17-11
Prelim survey data 9 17-11Prelim survey data 9 17-11
Prelim survey data 9 17-11
 
Pinning Down Cloud Computing
Pinning Down Cloud ComputingPinning Down Cloud Computing
Pinning Down Cloud Computing
 
Warranty Outsourcing For Strategic Gains
Warranty Outsourcing For Strategic GainsWarranty Outsourcing For Strategic Gains
Warranty Outsourcing For Strategic Gains
 
JDX Suite - A Product by Ad2pro Group
JDX Suite - A Product by Ad2pro GroupJDX Suite - A Product by Ad2pro Group
JDX Suite - A Product by Ad2pro Group
 
Soa To The Rescue
Soa To The RescueSoa To The Rescue
Soa To The Rescue
 
IT Infrastructure Outsourcing Benefits Demystified
IT Infrastructure Outsourcing Benefits Demystified IT Infrastructure Outsourcing Benefits Demystified
IT Infrastructure Outsourcing Benefits Demystified
 
Daniel Jasník - ITSMF pro cloudové služby - AID2019
Daniel Jasník - ITSMF pro cloudové služby - AID2019Daniel Jasník - ITSMF pro cloudové služby - AID2019
Daniel Jasník - ITSMF pro cloudové služby - AID2019
 
IT Service Level Agreement
IT Service Level AgreementIT Service Level Agreement
IT Service Level Agreement
 
Brotight China - Professional Service
Brotight China - Professional ServiceBrotight China - Professional Service
Brotight China - Professional Service
 
Sciencelogic - A Leader in IT Transformation
Sciencelogic - A Leader in IT Transformation Sciencelogic - A Leader in IT Transformation
Sciencelogic - A Leader in IT Transformation
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

Yahoo Display Advertising Attribution

  • 1. Yahoo! Display Ads Attribution Framework: A Problem Of Efficient Sparse Joins On Massive Data. Supreeth, Sundeep, Chenjie, Chinmay (Data Team, Yahoo!)
  • 2. Agenda
      §  Problem description
       ›  Serves impressions clicks
       ›  Attribution
      §  Class of problems and application in other use cases
      §  Attribution framework
      §  Performance comparison
      §  Conclusion
  • 3. Serves Impressions Clicks
      [Diagram: Web Servers and Ad Servers]
      Impressions - client side event for an ad shown
      Clicks - client side event for a click on an ad
      Interactions - client side events for interactions within an ad
      Impressions, clicks and conversions are a few bytes
      Serves - server logged event for an ad served. A serve has complete context
      Serve events are heavy: a few 10s of KBs each
      Join: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions} with Serve Guid + Serve timestamp + {other fields of serve}
      * Guid is a globally unique identifier
  • 4. Need For Attribution
      [Diagram labels: Serves in 5m instances, with older instances going back several hours to days; Impressions/Clicks arriving every 5 mins]
      Attribute an impression/click with the serve
  • 5. Distribution Of % Impressions Arrived From The Client Side wrt Serves
      [Chart: % of impressions for a serve (y-axis, 0 to 90) against the time period from when the serves happened (x-axis, t1 to t13), where t1 -> 201205301000, t2 -> 201205300955, t3 -> 201205300950, ...]
  • 6. Distribution Of % Clicks Arrived From The Client Side wrt Serves
      [Chart: % of clicks for a serve (y-axis, 0 to 45) against the time period from when the serves happened (x-axis, t1 to t13), where t1 -> 201205301000, t2 -> 201205300955, t3 -> 201205300950, ...]
  • 7. Class Of Problems
      §  Sparse joins spanning TBs of data on grid
      §  Few MBs to a few TBs
      §  Left outer join or any other outer join
      Data Set                      Impressions   Serves (5m*288)
      Data Size (compressed size)   400MB         20GB * 288 ~= 5.6 TB
  • 8. Similar Use Cases
      §  Associating video, click, social interactions back to the activity data
      §  Attribute back a small size client beacon to a large dataset
      §  Within Yahoo
       ›  Audience view/click attribution
       ›  Weblog based investigation
       ›  Joining dimensional data with web traffic data
  • 9. Pig Joins And Problem Fit
      Join Strategy     Comments                                                                       Cost
      Merge join        The datasets are not sorted                                                    High
      Hash join         Shuffle and reduce time                                                        High
      Replicated Join   Does not meet performance needs; left outer join on the replicated dataset     High
      Skewed Join       Data set is not skewed                                                         N/A
  • 10. Problem Statement: To do a sparse outer join on a very large dataset with high performance requirements for display ad attribution on grid
  • 11. Attribution Framework - Overview
      §  Smart Instrumentation Strategies
      §  Aggressive partitioning and selection
      §  Partition Aware Efficient Join Query Plan
  • 12. Instrument For Attribution (stage: Smart Instrumentation Strategies)
      §  Serve guid
      §  Clues which can help you partition better
       ›  Timestamp of the serve
      §  Partition keys used in event instrumentation
      §  In the impression attribution example:
           Impression: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
           Serves:     Serve Guid + Serve timestamp + {other fields of serve}
  • 13. Partitioning approach (stage: Aggressive partitioning and selection)
      §  Join key based partitioning
      §  Keys for leveraging physical partitioning
       ›  timestamp
      §  Use of hashes in partitioning
       ›  HashFNV, Murmur
      Key          Partition Type
      Join keys    Hash
      Timestamp    Range
  • 14. Pruning/Selection (stage: Aggressive partitioning and selection)
      §  Hashing of keys in the data sets
      §  Pruning of partitions
       ›  Timestamp
       ›  Hash of the join key
      §  IO costs and partitions
      §  Configurable partitions
      Key          Partition Type   Pruning
      Join keys    Hash             Yes
      Timestamp    Range            Yes
  • 15. Partition Aware Efficient Join Query Plans (stage: Partition Aware Efficient Join Query Plan)
      [Diagram, first join] Inner join of the impression event keys (size: MBs) with the streamed, selected serve event partitions (size: TBs), producing the annotated impression
      [Diagram, second join] Left outer join of the streamed full impression event with the annotated impression (size labels in the diagram: MBs and hundreds of MBs)
      Output: complete annotated impression data with serve data
      Legend: in memory, stream
  • 16. Attribution Framework: Capabilities
      §  Left outer on impression/click/interaction
       ›  As long as the impression/click/interaction exists, we will get a record in output
      §  Complete annotation with the serve
      §  Distinct join with serves
      §  Sparse joins achieved by pruning the partitions
      §  Map side joins
  • 17. Attribution Framework: Implementation
      §  Python embedded PIG
      §  Dynamic partitioning/pruning (UDFs)
      §  Configurable parameters
       ›  Lookbacks
       ›  Partitions
       ›  CombinedSplitSize
  • 18. Attribution Framework: Tuning Parameters
      §  Serve Partitions: trade off between IO & namespace used (lookback = 24 hours)
      [Chart: bytes read (GB, axis 0 to 4000) and namespace used (number of files, axis 0 to 180000) against the number of partitions (2 to 1024)]
  • 19. Attribution Framework: Tuning Parameters
      §  Split Size: trade off between number of mappers and map task run time (partitions = 16, lookback = 24 hours)
      [Chart: number of mappers (axis 0 to 35000) and time taken (s, axis 0 to 1200) against split size (128MB, 1 GB, 2 GB, 3 GB, 4 GB)]
  • 20. Comparison With Other PIG Joins
      Join                    Mappers   Reducers   Lookback   Input Size                               Time to complete
      Left Outer Hash Join    2800      45         40mins     180GB                                    42.5m*
      Replicated Join         5680      0          5hours     1TB                                      7m**
      Attribution Framework   5760      0          24hours    Effective 5.6 TB; With Pruning 1.1 TB    6m***
      *   Best case for hash join: 1.5m + 15.5m + 25.5m (Mapper + Shuffle + Reducer)
      **  Map time taken
      *** 1 min + 2 mins + 3 mins (Selection/Pruning + Impression partitioning + Join)
  • 21. Conclusion
      §  For the sparse lookup problem, the attribution framework works very well and meets the performance needs
      §  Effective partitioning enables longer lookbacks and reduces IO
      §  The levers in the framework allow for tuning based on the computation/IO requirements
  • 22. Future Steps
      §  Use HBase/Cassandra to store the event grain serve data and do lookups
      §  Use of a bloom filter along with an index format
      §  Compare the strategy with what Hive does and come up with a framework using Hive
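
The partitioning and pruning steps on slides 13 and 14 can be illustrated with a minimal Python sketch. It is not the deck's implementation, which uses Python embedded Pig with UDFs. The function names, the sample guids, and the choice of 32-bit FNV-1a are assumptions: the slides name "HashFNV" without giving the variant. The 5-minute bucket format follows the timestamps shown on the charts (for example 201205301000).

```python
from datetime import datetime

FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash (assumed variant; the slides only say 'HashFNV')."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def hash_partition(serve_guid: str, num_partitions: int) -> int:
    """Hash partition of the join key (serve guid)."""
    return fnv1a_32(serve_guid.encode("utf-8")) % num_partitions


def time_bucket(serve_ts: datetime, minutes: int = 5) -> str:
    """Range partition: floor the serve timestamp to its 5-minute instance."""
    floored = serve_ts.replace(
        minute=serve_ts.minute - serve_ts.minute % minutes,
        second=0,
        microsecond=0,
    )
    return floored.strftime("%Y%m%d%H%M")


def partitions_to_read(impressions, num_partitions):
    """Pruning: only serve partitions referenced by some impression are read."""
    return {
        (time_bucket(ts), hash_partition(guid, num_partitions))
        for guid, ts in impressions
    }


# Hypothetical impressions as (serve guid, serve timestamp) pairs.
impressions = [
    ("guid-0001", datetime(2012, 5, 30, 10, 3, 27)),
    ("guid-0002", datetime(2012, 5, 30, 9, 58, 2)),
    ("guid-0001", datetime(2012, 5, 30, 10, 3, 27)),  # same serve seen twice
]
NUM_PARTITIONS = 16                  # hash partitions per 5-minute instance
selected = partitions_to_read(impressions, NUM_PARTITIONS)
total = 288 * NUM_PARTITIONS         # 288 five-minute instances in 24 hours
print(f"read {len(selected)} of {total} serve partitions")
```

Because the impression carries both the serve guid and the serve timestamp, the set of serve partitions to read is known before any serve data is touched. That is what keeps the join sparse.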
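
The two-stage query plan on slide 15 can be sketched in plain Python as well. This is a simplified illustration under assumptions, not the deck's Pig plan: the record layout (dicts with a `serve_guid` field) and the function names are hypothetical, and keeping the first serve seen per guid is one possible reading of "distinct join with serves".

```python
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

Record = Dict[str, str]


def select_serves(impression_keys: Set[str],
                  serve_stream: Iterable[Record]) -> Dict[str, Record]:
    """Stage 1 (inner join): stream the selected serve partitions past the
    small in-memory set of impression keys and keep one serve per guid."""
    matched: Dict[str, Record] = {}
    for serve in serve_stream:
        guid = serve["serve_guid"]
        if guid in impression_keys and guid not in matched:
            matched[guid] = serve
    return matched


def annotate(impression_stream: Iterable[Record],
             matched: Dict[str, Record]) -> Iterator[Tuple[Record, Optional[Record]]]:
    """Stage 2 (left outer join): stream the full impression events and attach
    the matching serve, or None when no serve was found."""
    for impression in impression_stream:
        yield impression, matched.get(impression["serve_guid"])


# Hypothetical sample data.
impressions = [
    {"serve_guid": "g1", "event": "impression"},
    {"serve_guid": "g3", "event": "click"},
]
serves = [
    {"serve_guid": "g1", "ad": "ad-A"},
    {"serve_guid": "g1", "ad": "ad-A"},  # duplicate serve record
    {"serve_guid": "g2", "ad": "ad-B"},  # never referenced by an impression
]

keys = {imp["serve_guid"] for imp in impressions}
matched = select_serves(keys, iter(serves))
annotated = list(annotate(iter(impressions), matched))
for impression, serve in annotated:
    print(impression["serve_guid"], serve["ad"] if serve else None)
```

Every impression appears in the output even when no serve matches, which is the left outer behaviour listed on slide 16. Only the small side of each join is held in memory, so both stages can run map side without a shuffle.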