Copyright © 2014 Improve Digital - All Rights Reserved
Approximation algorithms for stream and batch processing

At Improve Digital (http://www.improvedigital.com) we collect and process large amounts of machine generated and behavioral data. Our systems address a variety of use cases that involve both batch and streaming technologies. One common denominator of the overall architecture is the need to share models and workflows across both worlds. Another one is that the analysis of large amounts of data often requires trade-offs; for instance trading accuracy for timeliness in streaming applications. One approach to satisfy these constraints is to make "big data" small. In this talk we will review a number of approximation methods for sketching, summarization and clustering and discuss how they are starting to change the way we think about certain types of analytics, and how they are being integrated into our data pipelines.

Published in: Data & Analytics

  1. Approximation algorithms for stream and batch processing
     Gabriele Modena, Data Scientist, Improve Digital
     E: g.modena@improvedigital.com
  2. Real Time Advertisement Technology
     • Media Owners
     • Advertisers
  3. Adtech 101
     • <150 msec
     • Geographically distributed adserver fleet
     • 200+ billion events / month
     • Hundreds of TB in a Hadoop cluster
  4. Improve Digital data platform
     • Reporting & BI
       – How much revenue did publisher X generate last month? Which are the top advertisers?
     • Trend analysis
       – Is the day-to-day traffic on site Y increasing or decreasing?
     • Fraud detection
       – Is the traffic legit or coming from a botnet?
     • Predictive modelling
       – How likely is this impression to generate a click or a conversion?
     • Pattern recognition
       – How are advertisers bidding and buying on inventory? Who is our audience?
  5. Historically
     • Batch pipelines
     • Incremental processing
     • Realtime pipelines
     • Monitoring and trend analysis
     Batch dataset != Realtime dataset
     Batch models != Realtime models
  6. Goals
     • Write jobs once
     • Unify models and
       – Analytics codebase
       – Dataset semantics
       – Experimentation
  7. Analytics Architecture
     [Diagram. Stages: Real-time log collection; Brokerage (Kafka + Samza); Processing (YARN + Spark + MapReduce). Stores: Database, HDFS, Redis. Arrow labels: Push, Expose, Publish (×3).]
  8. Kafka and Samza
     • Kafka (http://kafka.apache.org) as a distributed message queue
     • Topic-based
     • Producers write, consumers read
     • Messages are persistently stored; topics can be re-read
     • We use Samza for coordinating ingestion, ETL and distributed stream processing
  9. Apache Spark
     • Spark (Zaharia et al. 2010)
     • “Iterative” computing
     • Generalization of MapReduce (Isard 2007)
     • Runs atop Hadoop (YARN)
     • Spark Streaming
     • Break data into batches and pass them to the Spark engine (same API & data structures)
  10. Challenges
      • Conceptually everything is a stream
      • Satisfy a tradeoff between
        – Latency
        – Memory
        – Accuracy
      • On infinitely expanding datasets
  11. Make big data small
      Samples, sketches and summaries
  12. Reservoir Sampling (Vitter, 1985)
      • Hard to parallelize
      • How to use samples to answer certain queries? Count distinct? TopK?
      • From an infinitely expanding dataset
      • With constant memory and in a single pass
  13. Cardinality estimation (count distinct)
      How many users are visiting a site?
  14. Claim
      The cardinality of a multiset of uniformly-distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set.
  15. Intuitively: HyperLogLog (Flajolet et al. 2008)
      1. Apply a hash function to each element and take the binary representation of the output
      2. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2^n
      3. Account for variance by averaging over subsets
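  16. HyperLogLog (with Spark + Algebird)
        import com.twitter.algebird.HyperLogLogMonoid

        // 12 bits of precision, i.e. 2^12 registers
        val hll = new HyperLogLogMonoid(12)

        // Assumes users is a Spark Streaming DStream[String] of user ids:
        // build one sketch per id, then merge them within each batch
        val approxUsers = users.mapPartitions(user => user.map(uuid => hll(uuid.getBytes))).reduce(_ + _)

        // Running sketch, merged across batches
        var h = hll.zero
        approxUsers.foreachRDD(rdd => {
          if (rdd.count() != 0) {
            val partial = rdd.first()
            h += partial
          }
        })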
  17. HyperLogLog (< 2% error rate in 15kB)
      [Figure labels: Count, Memory; Exact, Approximate]
  18. Frequency estimation
      Top 10 most visited sites (out of a few million)?
  19. Count Min Sketch (Cormode and Muthukrishnan, 2005)
      It’s the hashing trick!
  20. CMS (with Spark + Algebird)
        import com.twitter.algebird.CountMinSketchMonoid

        val eps = 0.01    // error bound, as a fraction of the total count
        val delta = 1E-3  // probability of exceeding that bound
        val seed = 1
        val perc = 0.003  // heavy hitters: ids with at least this share of traffic

        // Assumes publishers is a DStream of numeric publisher ids.
        // The Algebird constructor takes (eps, delta, seed, heavyHittersPct).
        val approxImpressions = publishers.mapPartitions(publisher => {
          val cms = new CountMinSketchMonoid(eps, delta, seed, perc)
          publisher.map(publisher_id => cms.create(publisher_id.toLong))
        }).reduce(_ ++ _)

        var globalCMS = new CountMinSketchMonoid(eps, delta, seed, perc).zero
        approxImpressions.foreachRDD(rdd => {
          if (rdd.count() != 0) {
            val partial = rdd.first()
            globalCMS ++= partial
            // Five heavy hitters with the highest estimated frequency
            val globalTopK = globalCMS.heavyHitters.map(id => (id, globalCMS.frequency(id).estimate)).toSeq.sortBy(_._2).reverse.slice(0, 5)
          }
        })
  21. CMS results
      [Figure labels: Exact, Approximate]
  22. Learning from data
  23. Iterative methods are hard to scale in MapReduce
  24. Using sketches
      • Linear Regression
        – OLS + SGD on batches of data
        – Recursive Least Squares with Forgetting (Vahidi et al. 2005)
      • Streaming k-means (Ailon et al. 2009, Shindler et al. 2011, Ostrovsky et al. 2012)
        – Single iteration-to-convergence
        – Use sketches to reduce dimensionality (k log N centroids)
        – Mini batch updates + forgetfulness
  25. Conclusion
      • Streaming is part of the broader system
      • Approximation can help us scale both streaming and batch loads
        – Make “big data” small
        – Unification
      • Data collection and distribution is key
        ▪ Publishing results follows
      • Large scale analytics = Architecture + Algos + Data Structures
  26. Approximation algorithms for stream and batch processing
      Gabriele Modena, Data Scientist, Improve Digital
      E: g.modena@improvedigital.com
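
Slide 12 names reservoir sampling as a way to keep a sample from an infinitely expanding dataset with constant memory and in a single pass. The talk shows no code for it, so the following is only a minimal sketch of Vitter's Algorithm R in plain Scala. The object name `ReservoirSampling`, the method `sample` and the optional `rng` parameter are assumptions made for this illustration, not part of the deck.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSampling {
  /** Algorithm R: a uniform random sample of k items from a stream of
    * unknown length, in one pass and with O(k) memory. */
  def sample[A](stream: Iterator[A], k: Int, rng: Random = new Random()): Vector[A] = {
    require(k > 0, "k must be positive")
    val reservoir = ArrayBuffer.empty[A]
    var seen = 0L
    for (item <- stream) {
      seen += 1
      if (reservoir.size < k) {
        // The first k items fill the reservoir directly
        reservoir += item
      } else {
        // Item number `seen` replaces a random slot with probability k / seen
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = item
      }
    }
    reservoir.toVector
  }
}
```

For example, `ReservoirSampling.sample(events.iterator, 1000)` would keep 1000 items from a hypothetical `events` collection. Each step depends on the running count `seen`, which is one reason the slide calls the method hard to parallelize.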

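
The three steps on slide 15 (hash, track the maximum run of leading zeros, average over subsets) can be sketched without Spark or Algebird. The code below is a toy HyperLogLog over a 32-bit hash, written only to make the intuition concrete. The names `HllSketch`, `Hll`, `add`, `merge` and `estimate` are assumptions for this example and do not mirror the Algebird API. The estimator uses the harmonic mean and the bias constant from the HyperLogLog paper, and it omits the large-range correction.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  /** Toy HyperLogLog with m = 2^p one-byte registers. */
  final class Hll(val p: Int, val registers: Array[Byte]) {
    private val m = 1 << p

    /** Step 1 and 2: hash the element, choose a register from the first
      * p bits, and keep the largest "leading zeros + 1" of the rest. */
    def add(element: String): Hll = {
      val h = MurmurHash3.stringHash(element)
      val idx = h >>> (32 - p)
      val rest = h << p
      val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
      if (rank > registers(idx)) registers(idx) = rank.toByte
      this
    }

    /** Sketches merge by register-wise maximum, so partial sketches
      * built on different partitions can be combined. */
    def merge(other: Hll): Hll = {
      require(p == other.p, "precision must match")
      new Hll(p, registers.zip(other.registers).map { case (a, b) => if (a > b) a else b })
    }

    /** Step 3: combine the registers with a harmonic mean. */
    def estimate: Double = {
      val alpha = 0.7213 / (1.0 + 1.079 / m) // bias constant, valid for m >= 128
      val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r.toDouble)).sum
      val zeros = registers.count(_ == 0)
      // Small-range correction: fall back to linear counting
      if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    }
  }

  def empty(p: Int): Hll = {
    require(p >= 7 && p <= 16, "p must be between 7 and 16")
    new Hll(p, new Array[Byte](1 << p))
  }
}
```

With `p = 10` the sketch holds 1024 one-byte registers, and the expected relative error is roughly 1.04 / sqrt(1024), about 3%. Adding the same element twice never changes a register, which is what makes the structure count distinct values.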
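
Slide 19 describes the Count Min Sketch as "the hashing trick". As a hedged illustration of what that means, the following plain-Scala sketch keeps a small table of counters and hashes every item into one column per row. The names `CountMin`, `Cms`, `add`, `estimate` and `merge` are assumptions for this example and are not the Algebird API. The table is sized with the usual rule of width = ceil(e / eps) and depth = ceil(ln(1 / delta)).

```scala
import scala.util.hashing.MurmurHash3

object CountMin {
  /** depth x width table of counters; row i hashes with seed (seed + i). */
  final class Cms(val width: Int, val depth: Int, val seed: Int,
                  val table: Array[Array[Long]]) {
    private def col(item: String, row: Int): Int =
      java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed + row), width)

    /** Increment one counter per row. */
    def add(item: String, count: Long = 1L): Cms = {
      for (row <- 0 until depth) table(row)(col(item, row)) += count
      this
    }

    /** Collisions can only inflate a counter, so the minimum over the
      * rows never underestimates the true count. */
    def estimate(item: String): Long =
      (0 until depth).map(row => table(row)(col(item, row))).min

    /** Sketches with the same shape and seed merge by adding counters. */
    def merge(other: Cms): Cms = {
      require(width == other.width && depth == other.depth && seed == other.seed)
      val sum = table.zip(other.table).map { case (r1, r2) =>
        r1.zip(r2).map { case (x, y) => x + y }
      }
      new Cms(width, depth, seed, sum)
    }
  }

  /** With probability at least 1 - delta, the overcount stays below
    * eps times the total number of items added. */
  def empty(eps: Double, delta: Double, seed: Int): Cms = {
    val width = math.ceil(math.E / eps).toInt
    val depth = math.ceil(math.log(1.0 / delta)).toInt
    new Cms(width, depth, seed, Array.fill(depth)(new Array[Long](width)))
  }
}
```

With the parameters from slide 20 (`eps = 0.01`, `delta = 1E-3`) this allocates 7 rows of 272 counters, regardless of how many distinct sites are observed. A top-K query would additionally need a small candidate set of heavy hitters, which this sketch leaves out.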