From 6 hours to 1 minute... in 2 days!
How we managed to stream our (long)
Hadoop batches
Sofian DJAMAA - Software engineer
@sdjamaa
Phase 1 - Buy displays
Phase 2 - Sell clicks
Phase 3 - ???
Phase 4 - Profit
Criteo
How does it work?

Advertiser side: the user browses an eCommerce website ("Let's go on
Amazon!!"). The website contains a « pixel » used to put information on
the user's cookie; this is how we gather the information for the
retargeting process.

Publisher side: the user visits a publisher website ("What's on Bild's
website today?"). Using the information we have on the user, tagged by
a cookie, we display the right ad.
HOW DO THEY
KNOOOOOOOOWW????
Our constraints

- 6 datacenters
- 3 billion events a day
- +50 PB of data in our Hadoop cluster
- 800K HTTP requests/second
- JIRA ticket generation
How do we use the data?

Events (clicks, displays, purchases…) go through a (heavy) data
transformation that feeds internal reports, client reports and billing.

Finance: "Where's my money?" Sales: "WHERE'S MY F@*# MONEY?!?!"
Business dev: "Where's my client's money?" And the client, on the web
dashboard: "YAY!! MAKIN' MONEY!!"
But this takes time…
Where does the data go?

Events (clicks, displays, purchases…) flow from the IIS web servers
into SQL Server DBs (multiple instances per DC) and into Graphite for
monitoring. When a production alert fires, business escalation says
"There's an issue… let's investigate" and release management says
"WTF?!?!?"

- Data not aggregated cross-DC
- Granularity limited due to the volume of data
- Load time can be huge even if we bulk insert (or move files)
- Scaling is limited, therefore only the most aggregated level is kept
  in Graphite
Up to 6 hours for some metrics

- Volume being huge, processing and storage take time (SQL Server
  replication hell…)
- Multiple datacenters containing data, with latency issues
6 hours to get business data
+ 1 hour to check data/raise alerts
+ 1 hour to find root cause
= Some big money lost (up to a million €)
Finance, sales, release management, the client: a lot of people need
the same aggregated data, but all with their own constraints. Some
require consistency, others quick feedback, and behind it all sits a
(heavy) data transformation with PBs of data to replay.
Batching on Hadoop or SQL Server doesn't fulfill the requirements:

- We need our checks to run as soon as something goes wrong
- We need to handle both real-time and batch mode
But who can do it?
Internal hackathon

Some amazing projects, such as:

- Ads in GIF format
- Embedded ad banners in movies
- A streaming service that stops a movie several times to display an ad
- Ponies
- And something about Chuck Norris ('cause he's awesome)
No team wanted to take responsibility for the project, so we built our
own:

- 3 1/2 developers
- 1 business release manager (as a product owner)
- 2 business intelligence engineers (the guys who write SQL queries all
  day long)
- 1 business developer (who obviously doesn't code a business layer)
- 1 creative
- 1 release manager
- 1 technical solutions guy (the guy who helps put the pixels in place)
Turbo
What people think a hackathon is…
What a hackathon really is…
SummingBird computations

Metric aggregations at banner/zone level: clicks, displays, sales,
revenue…

Real-time part
- Aggregates are updated and sent after each event
- 30K messages/second

Batch part
- Computes the expected trend of the data for each period (reference
  data)
- Using the lasso: minimize the sum of squared errors, with a bound on
  the sum of the absolute values of the coefficients (written out
  below)
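Written out, that batch-side fit is the standard lasso (a generic
formulation, not anything Criteo-specific):

  \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \big(y_i - x_i^{\top}\beta\big)^2
  \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t

where the bound t on the L1 norm of the coefficients shrinks small
effects to zero, keeping the expected-trend model simple.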
Data processing is done in batches on each side. Offline batches
(batch #1 covering e.g. 1H, then batch #2, batch #3…) insert aggregated
results for 1 hour of events; online batches are computed in streaming
and insert AND update aggregated results for each event. When an
offline batch is ready, it becomes the truth for the whole system, as
sketched below.
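A minimal sketch of that read path, assuming hypothetical in-memory
stores keyed by (bannerId, hourBucket); for every hour the offline
batch has fully covered, it overrides the streaming result:

  case class Aggregate(clicks: Long, displays: Long, revenue: Double)

  // Hypothetical stores: (bannerId, hourBucket) -> aggregate.
  def merged(
      batch: Map[(String, Long), Aggregate],     // offline truth
      streaming: Map[(String, Long), Aggregate], // online inserts/updates
      lastClosedHour: Long                       // last hour covered by an offline batch
  )(key: (String, Long)): Option[Aggregate] =
    if (key._2 <= lastClosedHour) batch.get(key) // the offline batch is the truth
    else streaming.get(key)                      // otherwise serve the streaming result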
λ = SummingBird
http://github.com/twitter/summingbird

Platform[T]

def job = source.map {
  /* your job here */
}.store()

[Diagram: the P#Source redirects input to the job, the job is executed
on the platform, and the job results are sent to the P#Store]
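For context, here is roughly what a complete job looks like against
that API: a word-count-style sketch close to SummingBird's own README
example, where the caller supplies a platform-specific source and
store:

  import com.twitter.summingbird._

  // One job definition, two execution modes: Platform[P] abstracts
  // over Storm (real-time) and Scalding/Hadoop (batch).
  def wordCount[P <: Platform[P]](
      source: Producer[P, String],
      store: P#Store[String, Long]) =
    source
      .flatMap { sentence => sentence.split("\\s+").map(_ -> 1L) } // emit (word, 1)
      .sumByKey(store) // monoid-sum counts into the platform's store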
Why not stream everything, then?

- Streaming costs a lot in infrastructure
- Sometimes we need to replay (backfill) events from the past to
  correct a bug or adjust a formula
- With PBs of data generated per day, a streaming architecture needs to
  be massively parallelized for replays (much more than a batching
  architecture)
- The Lambda Architecture is a good way to move towards a full
  streaming architecture
Rule engine

- Developed in Prolog
- 10K decisions/minute
- Linked to the real-time flow to compute the discrepancies with
  expected values and tag abnormal data (sketched below)
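The engine itself is Prolog, but the shape of a rule can be sketched in
a few lines of Scala (hypothetical names; the 20% tolerance is purely
illustrative):

  case class Observation(metric: String, expected: Double, observed: Double)

  // Tag a data point as abnormal when it deviates too far from the
  // batch-computed reference value.
  def tag(obs: Observation, tolerance: Double = 0.2): Option[String] =
    if (obs.expected == 0.0) None // no reference value, nothing to compare
    else {
      val discrepancy = math.abs(obs.observed - obs.expected) / obs.expected
      if (discrepancy > tolerance) Some(s"abnormal:${obs.metric}")
      else None
    }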
Vizatra

- In-house analytic visualization stack: world map, graphs, real-time
  curves…
- AngularJS, Bootstrap, Scala, Finagle, any DB supporting SQL
- Web-component oriented: easy customization
- Query deconstruction and NOT query building :-)
- Open-source release coming soon
Riemann

- Monitoring system with a « powerful stream processing language »
  (nah, just Clojure configuration files)
- Sends alerts based on tags sent by the rule engine
- Scoped alerts
- Alerts are sent as emails and SMS to on-duty people
- JIRA ticket generation
Awesomeness

✗ Data granularity: checks only at publisher website level (only on
  RTB), no checks on the client side
✓ Data granularity: checks at banner/zone levels for better
  investigations

✗ Latency up to 6 hours        ✓ Latency of 1 minute
✗ Data aggregated hourly       ✓ Data aggregated in 5-min periods
✗ Money in the bank: $         ✓ Money in the bank: $$$$$
Even more…

Thanks to the hackathon, we are now able to provide real-time feedback
to sales, business developers, MDs and VPs, which led us to:

- Get more clients, as they love having quick feedback on their
  campaigns
- Adjust CPC in real time:
  - For special occasions like sales
  - To aggressively test our prediction models in an A/B test
- And more…
Some feedback

✗ Exponential learning curve across all the frameworks
✗ A lot of features are missing (e.g. stores)
✗ Very limited documentation and tutorials
✗ Testing the error rate between Hadoop and Storm is really too long
  for a 2-day development period
✗ Cassandra was a bad choice because of the data model needed for the
  visualization part (lots of joins)
#Paris, #BigData, #MachineLearning, #NerfGuns, #Hadoop, #Storm, #Spark,
#Cassandra, #MongoDB, #Riemann, #Scala, #C# (?!), #Java

Sofian DJAMAA - Software engineer
@sdjamaa

WE'RE HIRING!!!!
We have many open positions in R&D:

•	Data Processing Systems Manager
•	Senior Software Engineer (Grenoble)
•	Software Development Lead/Manager
•	SRE OPS Manager
•	Senior Software Engineer (Palo Alto, CA)
•	Python Software Lead Engineer (Paris)
•	Software Development Engineer (Paris)
•	Machine Learning Scientist
