SMART VIDEO ADVERTISING
Processing Complex Workflows
in Advertising Using Hadoop
June 3rd, 2014
Who we are
Rahul Ravindran, Data Team (rahul@brightroll.com)
Bernardo de Seabra, Data Team (bernardo@brightroll.com, @bseabra)
Agenda
• Introduction to BrightRoll
• Data Consumer Requirements
• Motivation
• Design
– Streaming log data into HDFS
– Anatomy of an event
– Event de-duplication
– Configuration driven processing
– Auditing
• Future
Introduction: BrightRoll
• Largest Online Video Advertisement Platform
• BrightRoll builds technology that improves and automates video advertising globally
• Reaching 53.9% of US audience, 168MM unique viewers
• 3+ Billion video ads / month
• 20+ Billion events processed / day
Data Consumer Requirements
• Processing results
– Campaign delivery
– Analytics
– Telemetry
• Consumers of processed data
– Delivery algorithms to augment decision behavior
– Campaign managers to monitor/tweak campaigns
– Billing system
– Forecasting/planning tools
– Business Analysts: long/short term analysis
Motivation – legacy data pipeline
• Not linearly scale-able
• Unit of processing was single campaign
• Not HA
• Lots of moving parts, no centralized control and monitoring
• Failure recovery was time consuming
Motivation – legacy data pipeline
• Lots of boilerplate code
– hard to onboard new data/computations
• Interval based processing
– 2 hour sliding window
– Inherited delay
– Inefficient use of resources
• All data must be retrieved prior to processing
Performance requirements
• Low end-to-end delivery of aggregated metrics
– Feedback loop into delivery algorithm
– Campaign managers can react faster to their campaign performance
• Linearly scalable
Design decisions
• Streaming model
– Data is continuously being written
– Process data once
– Checkpoint states
– Low end-to-end latency (5 mins)
• Idempotent
– Jobs can fail, tasks can fail, allows repeatability
• Configuration driven join semantics
– Ease of on-boarding new data/computations
Overview: Data Processing Pipeline
(diagram) Data Producers → Flume NG → HDFS → De-duplicate and Process (M/R) → Store (HBase) → Data Warehouse
Stream log data into HDFS using Flume
(diagram: adserv machines → Flume → HDFS)
• Flume rolls files every 2 minutes
• Files lexicographically ordered
• Treat the files written by Flume as a stream
• Maintain a marker which points to the current location in the input stream
• Enables us to always process new data
Files written by Flume
(diagram) File.1234 → File.1235 → File.1236 → File.1237 → File.1238 → File.1239
Marker: all files up to File.1235 have been processed
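To make the marker idea concrete, here is a minimal sketch (not BrightRoll's actual code) of treating Flume's rolled files as a stream: list the directory, sort lexicographically, and process only files that sort after the last marker. The directory path, marker value and process helper are assumptions for illustration.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: lexicographic file names plus a persisted marker give an append-only
// stream over the files Flume rolls every 2 minutes.
public class FlumeFileStream {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path logDir = new Path("/flume/adserv");      // assumed directory layout
    String marker = "File.1235";                  // last fully processed file

    FileStatus[] files = fs.listStatus(logDir);
    Arrays.sort(files, (a, b) -> a.getPath().getName().compareTo(b.getPath().getName()));

    for (FileStatus f : files) {
      String name = f.getPath().getName();
      if (name.compareTo(marker) <= 0) continue;  // already consumed
      process(fs, f.getPath());                   // e.g. load events into the raw-log HBase table
      marker = name;                              // advance the marker (persist it in practice)
    }
  }

  static void process(FileSystem fs, Path p) { /* read the file and emit its events */ }
}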
Anatomy of an event
• Event header: Event ID, Event timestamp, Event type, Machine id
• Event payload
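As a purely hypothetical illustration (the slide names the header fields but not their types), the header could be modeled as:

// Hypothetical model of the event header described above; field types are
// assumptions, since the slide only names the fields.
public class EventHeader {
  public final String eventId;    // globally unique, assigned when the event is logged
  public final long timestamp;    // event timestamp
  public final String eventType;  // e.g. auction or impression
  public final String machineId;  // adserv machine that produced the event

  public EventHeader(String eventId, long timestamp, String eventType, String machineId) {
    this.eventId = eventId;
    this.timestamp = timestamp;
    this.eventType = eventType;
    this.machineId = machineId;
  }
}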
De-duplication
(diagram: raw logs → HBase table)
• We load raw logs into an HBase table
• We use the HBase table as a stream
• We keep track of a time-based marker per table which represents a point in time up to which we have processed data
HBase table (diagram: start time, end time)
• Next run will read data which was inserted from start time to end time (the window of TO_BE_PROCESSED data)
• Rowkey is <salt, event timestamp, event id>
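A minimal sketch of what that rowkey and the windowed read could look like with the HBase client API; the one-byte salt derived from a hash of the event id comes from the speaker notes, while the class and helper names are assumptions.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RawLogRowKey {
  // Rowkey = <salt, event timestamp, event id>. The one-byte salt is a hash of
  // the event id, so writes spread across regions instead of hotspotting on the
  // newest timestamps.
  static byte[] rowKey(String eventId, long eventTimestamp) {
    byte salt = (byte) (eventId.hashCode() & 0xff);
    return Bytes.add(new byte[] { salt },
                     Bytes.toBytes(eventTimestamp),
                     Bytes.toBytes(eventId));
  }

  // Read only the TO_BE_PROCESSED window: rows inserted between the start-time
  // and end-time markers. HFiles entirely outside the range can be skipped.
  static Scan windowScan(long startTime, long endTime) throws IOException {
    Scan scan = new Scan();
    scan.setTimeRange(startTime, endTime);
    return scan;
  }
}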
(diagram: Chunk 1, Chunk 2, Chunk 3)
• Break up data in WINDOW_TO_BE_PROCESSED into chunks
• Each chunk has the same salt and contiguous event timestamps
• Each chunk is sorted – an artifact of HBase storage

  Salt  Time  Id
  4     1234  foobar1
  4     1234  foobar2
  4     1235  foobar3
  6     1234  foobar4
  7     1235  foobar5
  7     1236  foobar6
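A rough sketch of the chunking step, under the assumption that a chunk boundary is simply a change in the salt prefix within the scanned window (the real pipeline may also bound chunk size):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Result;

// Sketch: split the rows scanned for the TO_BE_PROCESSED window into chunks.
// Rows arrive sorted by rowkey, so grouping on the leading salt byte yields
// chunks with one salt each and contiguous, ascending event timestamps.
public class Chunker {
  static List<List<Result>> chunkBySalt(Iterable<Result> scannedRows) {
    List<List<Result>> chunks = new ArrayList<>();
    List<Result> current = new ArrayList<>();
    int currentSalt = -1;
    for (Result row : scannedRows) {
      int salt = row.getRow()[0] & 0xff;           // first byte of the rowkey
      if (salt != currentSalt && !current.isEmpty()) {
        chunks.add(current);                       // salt changed: close the chunk
        current = new ArrayList<>();
      }
      currentSalt = salt;
      current.add(row);
    }
    if (!current.isEmpty()) chunks.add(current);
    return chunks;
  }
}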
Historical scan (diagram: StartRow, EndRow)
• A new Scan object without a time range, reading multiple versions, gives the historical view
• Perform de-duplication of the data in a chunk based on that historical view
• Example keys: 4,1234,foobar1; 4,1234,foobar2; 4,1235,foobar3 (each with its event payload)
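The sketch below shows how a chunk's first and last rowkeys might bound that historical scan; the duplicate test and the emit step are placeholders, since the slides do not spell out the exact comparison.

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

// Sketch: the historical scan is bounded by the chunk's start and end rowkeys
// (so it only touches that salt and timestamp span), sets no time range and
// requests all versions, so it sees every copy of those keys ever written.
public class HistoricalScan {
  static void dedupChunk(HTableInterface table, byte[] startRow, byte[] endRow) throws IOException {
    Scan historical = new Scan(startRow, endRow);  // keyspace constrained to the chunk
    historical.setMaxVersions();                   // all versions, no time range
    try (ResultScanner scanner = table.getScanner(historical)) {
      for (Result row : scanner) {
        if (isFirstOccurrence(row)) {
          emit(row);                               // only the first copy flows downstream
        }
      }
    }
  }

  // Placeholder policy: a key carrying more than one version has been written
  // before (e.g. by a replay) and is treated as a duplicate.
  static boolean isFirstOccurrence(Result row) { return row.size() <= 1; }

  static void emit(Result row) { /* write the de-duplicated event downstream */ }
}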
De-duplication performance
• High dedup throughput – 1.2+ million events per second
• Dedup across 4 days of historical data
• (diagram: StartRow/EndRow scan, TimeRange scan, compaction co-processor to compact files older than table start time)
Processing - Joins
(diagram: Impression + Auction → Computation)
Arbitrary joins
• Uses a mechanism very similar to the de-duplication previously described
• Historical scan now checks for other events specified in the join
• Business level de-duplication – duplicate impressions for the same auction are handled here as well
• “Session debugging”
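The speaker notes later in this deck describe the joins as configuration driven: the events to join and the fields to use live in a config file, 24 financial computations exist, and each computation writes to its own HBase table. As a purely hypothetical sketch of such a definition, with invented class, field and table names:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a configuration-driven join: which event types must be
// present, which fields feed the computation, and which HBase table stores the
// result. The real pipeline loads this from a config file whose format is not
// shown in the deck.
public class JoinDefinition {
  final Set<String> requiredEventTypes;  // e.g. auction + impression
  final List<String> joinFields;         // fields pulled from the joined events (invented names below)
  final String outputTable;              // each computation writes to its own HBase table

  JoinDefinition(Set<String> requiredEventTypes, List<String> joinFields, String outputTable) {
    this.requiredEventTypes = requiredEventTypes;
    this.joinFields = joinFields;
    this.outputTable = outputTable;
  }

  // The last event type to arrive for a join key triggers the computation.
  boolean readyToCompute(Set<String> eventTypesSeenSoFar) {
    return eventTypesSeenSoFar.containsAll(requiredEventTypes);
  }

  public static void main(String[] args) {
    JoinDefinition spend = new JoinDefinition(
        new HashSet<>(Arrays.asList("auction", "impression")),
        Arrays.asList("auction_id", "clearing_price"),
        "computation_spend");
    System.out.println(spend.readyToCompute(new HashSet<>(Arrays.asList("auction", "impression"))));
  }
}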
Auditing
(diagram: adserv machines send metadata (machine id, # events, time interval) to the Auditor; the Auditor also reads the deduped files (Deduped.1, Deduped.2, Deduped.3, …) and can trigger a replay from disk)
What we have now
• All of the above, plus a system which:
– Scales linearly
– Is HA within our data center
– Is HA across data centers (by switching traffic)
– Allows us to on-board new computations easily
– Provides guarantees on consumption of data in the pipeline
Future
• Move to HBase 0.98/1.x
• Further improvements to De-duplication algorithm
• Dynamic definition of join semantics
• HDFS Federation
Questions
Speaker notes
  • Good afternoon everyone. Thanks for joining us for this talk, Processing Complex Workflows in Advertising Using Hadoop.
  • My name is Bernardo de Seabra, this is Rahul Ravindran and we are part of the Data Team at BrightRoll. Our team is responsible for all Big Data related things in the company, including the most recent project we undertook to rebuild the data processing pipeline that powers many of the critical components of the BrightRoll technology stack. That data processing pipeline will be the focus of this talk.
  • To give the audience some more context, we’ll take a minute to explain what BrightRoll does for those of you who are not familiar with the company. We will then cover the requirements of the different consumers of data throughout the platform, the motivation to develop a new data processing pipeline, and the design decisions made to meet those requirements.
  • Smaller chance of under-delivery or over-delivery, which costs us money.
  • Bernardo to cover up to this slide.


    All files up to File.1235 have been processed.
    The arrows between files represent time: each older file is followed by the next file written.
  • A globally unique event id is generated for each event at the point when the event is logged.
  • We have a requirement to consume all logs. We have a separate audit mechanism to verify if all logs were consumed by the pipeline. We automatically replay log lines if we find missing log lines. On replay, we may have duplicates which need to be de-duped.
  • Historical perspective: we began with a naïve dedup algorithm where we would look up each event id to check if it already exists; if so, it is a duplicate, otherwise we emit it. This was too slow: the large number of random lookups was expensive and each lookup went over the entire keyspace. We needed a mechanism to constrain the keyspace and perform a range query, but with event IDs being random this was hard. Hence, we needed the event timestamp at the beginning of the rowkey, but that would result in hotspotting, so we added a one-byte salt, generated from the hash of the event id, as the prefix to distribute load across all the regions.
  • The StartRow and EndRow of each chunk are used to construct a new Scan object with no constraints on time. This scan constrains the keyspace of the query using startRow and endRow.
  • As the number of HFiles increases, timerange scans benefit, since the many HFiles entirely outside the timerange are ignored. However, as the number of HFiles increases, the historical scan gets slower because all the HFiles need to be scanned. In the opposite scenario, if we have one giant HFile (say, after a major compaction run), the timerange scan has to scan the entire HFile, which is slow. So we use a co-processor which lets us use the number of HFiles as a coarse index on time for recent data (where we do a timerange scan), while the older, large HFiles provide an index on the rowkey.
  • Allow arbitrary event joins
    Events to be joined, along with the fields to be used, are defined in a configuration file to make it easy to add new computations
    All financial computations are expressed via config. Currently, 24 different computations exist
    On-boarding a new computation is just a change to the config file
    Each computation is an entry in a different HBase table

  • Also allows us to perform joins across events generated over arbitrary and possibly long time windows (currently 2 hours), since mobile clients frequently cache the auction results and show an ad later (as much as 2 hours after auction time); hence an impression can be generated 2 hours after its auction. This does not require us to compare against all the old data, whereas the old pipeline required us to load all the data for the full 2 hours to perform the join.
    The last event type which is part of the join triggers the computation.
    Since we have a view into the joined data, other engineering teams can query it for better debugging at large scale. Arbitrary joins across event types also enable engineering to deal with new events.
  • The Auditor processes the deduped stream and compares it with the metadata it has received from the adserving machines. If they do not match, we force a replay of the files from the adserv box; the replayed data is deduped, removing all the duplicates and ensuring that all the data makes it through to the processing pipeline.
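A hypothetical sketch of that comparison (the metadata fields come from the Auditing slide; the class and method names, and the per-key string encoding, are invented):

import java.util.Map;

// Hypothetical sketch of the auditor check: for each (machine id, time interval)
// the adserv machines report how many events they logged; the auditor counts
// what actually arrived in the deduped stream and requests a replay on mismatch.
public class Auditor {
  static void audit(Map<String, Long> reportedCounts,   // key: machineId + timeInterval
                    Map<String, Long> dedupedCounts) {
    for (Map.Entry<String, Long> e : reportedCounts.entrySet()) {
      long seen = dedupedCounts.getOrDefault(e.getKey(), 0L);
      if (seen < e.getValue()) {
        // Missing events: replay that interval's files from the adserv box.
        // Downstream de-duplication removes any copies that did arrive.
        requestReplay(e.getKey());
      }
    }
  }

  static void requestReplay(String machineAndInterval) { /* trigger replay from disk */ }
}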
  • If we can provide something about how this has impacted business