[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms

Rakuten Technology Conference 2013
"Counting Big Data by Streaming Algorithms"
Yusaku Kaneta (Rakuten)

Published in: Technology

  1. Counting Big Data by Streaming Algorithms
     2013/10/26 @ Rakuten Technology Conference 2013
     Rakuten Institute of Technology, Rakuten, Inc.
     Yusaku Kaneta
     http://www.rakuten.co.jp/
  2. Who am I?
     • Yusaku Kaneta (@yusakukaneta)
       – Joined Rakuten in April 2012.
       – Rakuten Institute of Technology (RIT)
     • Interests:
       – String processing (esp. pattern matching)
       – Hardware design using FPGA
       – Bitwise tricks & techniques
     • Love TAOCP 7.1.3 & Hacker's Delight
  3. Problem: Count Big Data
     • Counting:
       – A fundamental operation in data analysis.
     • Big data is difficult to simply count
       – Because exact counting needs a huge amount of memory.
       – E.g., 400GB+ is needed for one year of access logs.
  4. Batch Processing
     • Batch processing can solve this.
       – E.g., [image not captured]
     • Two issues:
       – High latency
       – Requirement for a cluster of machines = high cost
  5. Our Goals
     1. Reduce memory
        – Cost reduction.
     2. Reduce latency
        – Quick business decisions.
     3. Achieve high accuracy
        – Correct business decisions.
  6. Our Approach
     • Streaming algorithms
       – Can fulfill all our goals!
       – Are becoming common in Web companies.
         • See the paper on Google's PowerDrill and the code of Twitter's Algebird for examples of how to use them.
     • Keys:
       – Limited memory
       – Low latency
       – Theoretical guarantee for accuracy
  7. Streaming Algorithm Library
     • RIT internally provides libsketch, a C library for streaming algorithms.
     • Three advantages:
       – Memory efficient
       – High speed
       – High accuracy
     • Bindings for … & …
  8. Why C?
     • Our target: Python & Ruby users, for data analysis and for stream processing!
       – But most existing libraries are written in Scala (algebird), Java (stream-lib), ...
     • This is why our library is written in C: C libraries are easy to incorporate into Python & Ruby.
  9. Application
  10. Count Query in Rakuten
      • Example: We want to know...
        1. How many unique users checked an item in one day (month, or year)?
        2. How many products were sold in one day (month, or year)?
      • Streaming algorithms for the queries
        1. HyperLogLog algorithm
        2. Count-Min Sketch algorithm
  11. Count Query in Rakuten
      • Example: We want to know...
        1. How many unique users checked an item in one day (month, or year)?
        2. How many products were sold in one day (month, or year)?
      • Streaming algorithms for the queries
        1. HyperLogLog algorithm
        2. Count-Min Sketch algorithm
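
The slides name Count-Min Sketch for the second query but do not show it. As a rough illustration only, here is a minimal Python sketch of the standard Count-Min Sketch idea: several hash rows of counters, with the minimum across rows taken as the estimate. The class name, the default `width` and `depth`, and the use of SHA-1 as the hash are assumptions made for this example; this is not libsketch's API.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in fixed memory (illustrative only)."""

    def __init__(self, width=2048, depth=4):
        # width and depth are hypothetical defaults chosen for the example.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One column per row, derived by salting the key with the row number.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._columns(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest estimate and never falls below the true count.
        return min(self.table[row][col] for row, col in self._columns(key))


sales = CountMinSketch()
for product, sold in [("item-A", 3), ("item-B", 5), ("item-C", 1)]:
    sales.add(product, sold)
```

Memory use depends only on `width * depth`, not on the number of distinct products, which is the property the slides rely on.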
  12. Problem: Unique Item Count
      • Naïve approach:
        – Uses dict in Python: "dict[key] += 1"
        – This can require a large amount of memory, because every distinct key is stored.
      • Streaming algorithm: HyperLogLog
        – Counts unique items approximately.
        – This needs only a fixed amount of memory.
      • Google recently proposed an improved version of HyperLogLog, called HyperLogLog++.
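
For reference, the naïve approach on this slide might look like the following in Python. The function name is made up for the example, and `defaultdict` is used because `counts[key] += 1` on a plain `dict` raises `KeyError` the first time a key is seen.

```python
from collections import defaultdict


def count_unique_naive(items):
    """Exact counting: memory grows with the number of distinct keys."""
    counts = defaultdict(int)
    for key in items:
        counts[key] += 1
    # The number of distinct keys is the unique-item count.
    return len(counts), counts


n_unique, counts = count_unique_naive(["user1", "user2", "user1"])
```

The result is exact, but the dictionary holds one entry per distinct key, which is what makes this approach expensive on large logs.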
  13. HyperLogLog
      • Basic ideas:
        – Hash function
        – Harmonic mean
        – Stochastic averaging
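  14. HyperLogLog
      • Algorithm (for each key)
        1. Set i to the upper bits of the hashed key.
        2. Set A[i] to max(j, A[i]), where j = (# leading 0s in the lower bits) + 1.
        3. Estimate # unique items from E = α·m² / Σ 2^(-A[i]), where m is the size of array A and α is a bias-correction constant.
           (In practice, we use heuristics for corrections.)
      • Example: hash(Item1) = 0001 00000110 (upper bits, lower bits)
        – i = (0001)_2 = 1
        – j = (# leading 0s) + 1 = 6
        – A[1] = 6
      [Figure: keys Item1, Item2, Item3, ... hashed into array A]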
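
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the libsketch implementation: the names `index_and_rank` and `HyperLogLog`, the choice of SHA-1 truncated to 64 bits, and the default of 2^12 registers are all made up for the example. The normalising factor α·m² and the small-range correction follow the original HyperLogLog paper (Flajolet et al.); the slide's "heuristics for corrections" may well differ.

```python
import hashlib
import math


def index_and_rank(x, bits, p):
    """Split a bits-wide hash x into (i, j): i is the top p bits,
    j is (# leading 0s in the remaining bits) + 1."""
    width = bits - p
    i = x >> width
    w = x & ((1 << width) - 1)
    j = width - w.bit_length() + 1
    return i, j


class HyperLogLog:
    def __init__(self, p=12):
        # p index bits give m = 2**p registers; p=12 is an assumed default.
        self.p = p
        self.m = 1 << p
        self.A = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")  # use 64 bits of the hash
        i, j = index_and_rank(x, 64, self.p)
        self.A[i] = max(j, self.A[i])

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # constant valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -a for a in self.A)
        zeros = self.A.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


# The slide's worked example: a 12-bit hash with 4 index bits.
i, j = index_and_rank(0b000100000110, 12, 4)
```

With m registers the relative standard error is roughly 1.04/√m, so the memory footprint is fixed by the accuracy wanted rather than by the size of the input.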
  15. Demo
      • Naïve vs. HyperLogLog
  16. Performance
      • Task: Count unique items in an item set.

                    Naïve     HyperLogLog   Comparison
        Memory      1193MB    5MB           1%  (memory efficient)
        Time        419sec    108sec        4x  (high speed)
        Accuracy    100%      99%           -1% (high accuracy)

      • This data set is small, but we are using HyperLogLog for bigger data.
  17. Conclusion
      • Streaming algorithms in Rakuten
        – We are using them for data analysis.
        – We have an internal C library with bindings.
          • HyperLogLog, Count-Min Sketch, and so on.
        – Future: Plan to implement other algorithms.
  18. Reference
      • HyperLogLog & HyperLogLog++
        – [Flajolet et al., AOFA 2007], [Heule et al., EDBT 2013]
      • Count-Min Sketch
        – [Cormode, Muthukrishnan, J. Algorithms, 2005]
      • An excellent slide deck by Alex Smola
        – http://alex.smola.org/teaching/berkeley2012/slides/3_Streams.pdf
      • AK TECH BLOG by Aggregate Knowledge
        – http://blog.aggregateknowledge.com/
      • Stream-lib by Clearspring
        – https://github.com/clearspring/stream-lib
