Counting Big Data
by Streaming Algorithms
2013/10/26 @ Rakuten Technology Conference 2013
Rakuten Institute of Technology, Rakuten, Inc.,
Yusaku Kaneta
http://www.rakuten.co.jp/
Who am I?
• Yusaku Kaneta (@yusakukaneta)
– Joined Rakuten in April 2012.
– Rakuten Institute of Technology (RIT)

• Interests:
– String processing (esp., Pattern matching)
– Hardware design using FPGA
– Bitwise tricks & techniques
• Love TAOCP 7.1.3 & Hacker's Delight
Problem: Count Big Data
• Counting:
– Fundamental operation in data analysis.

• Big data is difficult to count naïvely
– Because it needs a huge amount of memory.
– E.g., 400GB+ is needed for one-year access logs.

Batch Processing
• Batch processing can solve this.
– E.g.,

• Two issues:
– High latency
– Requirement for a cluster of machines = High cost
(Figure: a cluster of six machines, each labeled "Batch")

Our Goals
1. Reduce memory
– Cost reduction.

2. Reduce latency
– Quick business decisions.

3. Achieve high accuracy
– Correct business decisions.
Our Approach
• Streaming algorithms
– Can fulfill all our goals!
– Becoming common at Web companies.
• See the paper on Google’s PowerDrill & the code of Twitter’s Algebird for examples of how they are used.

• Keys:
– Limited memory
– Low latency
– Theoretical guarantee for accuracy
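To make the "theoretical guarantee" point concrete: for HyperLogLog, the relative standard error is roughly 1.04/√m for m registers [Flajolet et al., AOFA 2007]. The sketch below only evaluates that formula; the function names are hypothetical and are not part of any library mentioned in this talk.

```python
import math


def hll_standard_error(num_registers: int) -> float:
    """Approximate relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(num_registers)


def registers_for_error(target_error: float) -> int:
    """Smallest power-of-two register count whose standard error is <= target."""
    needed = (1.04 / target_error) ** 2
    return 1 << math.ceil(math.log2(needed))


if __name__ == "__main__":
    for bits in (10, 12, 14, 16):
        m = 1 << bits
        print(f"m = {m:6d} registers -> error ~ {hll_standard_error(m):.2%}")
    print("registers for ~1% error:", registers_for_error(0.01))
```

The accuracy depends only on the number of registers, not on the size of the input, which is why memory stays fixed as the data grows.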
Streaming Algorithm Library
• RIT internally provides libsketch, a C library for streaming algorithms.
• Three advantages:
– Memory efficient
– High speed
– High accuracy
• Bindings for [logo] & [logo]
Why C?
• Our target: Python & Ruby users!
– For data analysis and for stream processing.
– But most existing libraries are written in Scala (Algebird), Java (stream-lib), ...

This is one reason why our library is written in C!
It is easy to incorporate C libraries in Python & Ruby.
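As a rough illustration of how little glue a C library needs in Python, the sketch below calls `strlen` from the C standard library through `ctypes`. libsketch is internal and its API is not shown in these slides, so libc stands in for it here; the wrapper name `c_strlen` is hypothetical, and the library lookup assumes a POSIX system (Linux or macOS).

```python
import ctypes
import ctypes.util

# Locate the C standard library. If it cannot be found by name, fall back to
# the symbols already loaded into the current process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Return the byte length of text as computed by C's strlen."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("libsketch"))
```

A binding for a sketch library would follow the same pattern: load the shared object, declare argument and return types, and wrap the calls in a small Python function or class.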
Application
Count Query in Rakuten
• Example: We want to know...
1. How many unique users checked an item in one day (month, or year)?
2. How many products were sold in one day (month, or year)?

• Streaming algorithms for the queries
1. HyperLogLog algorithm
2. Count-Min Sketch algorithm
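For the second query, a minimal Count-Min Sketch [Cormode, Muthukrishnan, J. Algorithms, 2005] can be sketched as follows. This is an illustration only, not libsketch's implementation: the class name, the default width and depth, and the use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in a fixed depth x width table."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch()
    for product in ["item-A", "item-B", "item-A", "item-A"]:
        sketch.add(product)
    print("item-A sold about", sketch.estimate("item-A"), "times")
```

Memory is fixed at width × depth counters regardless of how many distinct products appear; the price is that estimates may be too high, never too low.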
Problem: Unique Item Count
• Naïve approach:
– Uses dict in Python: "dict[key] += 1"
– This can require a large amount of memory, because every distinct key is stored.

• Streaming algorithm: HyperLogLog
– Counts unique items approximately.
– This needs a fixed amount of memory.
• Google recently proposed an improved version of
HyperLogLog, called HyperLogLog++.
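The naïve approach can be sketched in a few lines. Note that `dict[key] += 1` on a plain `dict` raises `KeyError` the first time a key is seen, so this sketch uses `collections.defaultdict`; the function name is hypothetical.

```python
from collections import defaultdict


def count_unique_naive(keys):
    """Exact counting: one dictionary entry per distinct key."""
    counts = defaultdict(int)
    for key in keys:
        counts[key] += 1
    # The number of distinct keys is the number of dictionary entries.
    return len(counts), counts


if __name__ == "__main__":
    access_log = ["user1", "user2", "user1", "user3", "user2"]
    unique, counts = count_unique_naive(access_log)
    print("unique users:", unique)
```

The result is exact, but the dictionary holds every distinct key, so memory grows with the number of unique items.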

HyperLogLog
• Basic ideas:

– Hash function
– Harmonic mean
– Stochastic averaging
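One way to see why the harmonic mean is used: when the stream is split into buckets (stochastic averaging), a single bucket with an unusually large value can dominate an arithmetic mean, while the harmonic mean stays close to the typical value. The per-bucket values below are made up for illustration.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-bucket estimates (2 ** A[i]); one bucket is an outlier.
per_bucket = [4, 4, 4, 4, 4, 4, 4, 1024]

arithmetic = mean(per_bucket)
harmonic = harmonic_mean(per_bucket)

if __name__ == "__main__":
    print(f"arithmetic mean: {arithmetic:.1f}")
    print(f"harmonic mean:   {harmonic:.2f}")
```

The arithmetic mean is pulled far above 4 by the single outlier, whereas the harmonic mean stays below 5.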

HyperLogLog
• Algorithm
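Keys: Item1, Item2, Item3, …
1. Set i to the upper bits of hash(key).
2. Set A[i] to max(j, A[i]), where j = (# leading 0s in the lower bits) + 1.
• Example
hash(Item1) = 0001 00000110 (upper bits | lower bits)
i = (0001)_2 = 1, j = (# leading 0s) + 1 = 6
array A: A[0] = 2, A[1] = max(6, 4) = 6, ···
3. Estimate # unique items from $E = \alpha_m m^2 / \sum_i 2^{-A[i]}$, where $m$ is the number of entries in A and $\alpha_m$ is a bias-correction constant.
(In practice, we use heuristics for corrections.)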
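The three steps can be sketched in Python as follows. This is a minimal illustration of HyperLogLog as described in [Flajolet et al., AOFA 2007], not libsketch's implementation: the class and function names, the SHA-1 based 64-bit hash, and the default of 12 index bits are assumptions, and the only correction included is the small-range (linear counting) one from the paper.

```python
import hashlib
import math


def split_hash(h: int, total_bits: int, index_bits: int):
    """Return (i, j): i = upper bits, j = leading zeros of lower bits + 1."""
    lower_bits = total_bits - index_bits
    i = h >> lower_bits
    rest = h & ((1 << lower_bits) - 1)
    j = lower_bits - rest.bit_length() + 1
    return i, j


class HyperLogLog:
    def __init__(self, index_bits: int = 12) -> None:
        self.p = index_bits
        self.m = 1 << index_bits            # number of entries in A
        self.A = [0] * self.m
        # Bias-correction constant; this closed form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, key: str) -> None:
        # Use the first 8 bytes of SHA-1 as a 64-bit hash value.
        h = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")
        i, j = split_hash(h, 64, self.p)    # steps 1 and 2
        self.A[i] = max(self.A[i], j)

    def estimate(self) -> float:
        # Step 3: normalized harmonic mean of 2 ** A[i].
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -a for a in self.A)
        zeros = self.A.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for n in range(50000):
        hll.add(f"user-{n}")
        hll.add(f"user-{n}")                # duplicates do not change the sketch
    print("estimated unique items:", round(hll.estimate()))
```

`split_hash(0b000100000110, 12, 4)` reproduces the slide's example, giving i = 1 and j = 6. With 12 index bits the sketch keeps 4096 small counters no matter how many items are added.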
Demo
• Naïve vs. HyperLogLog

Performance
• Task: Count unique items in an item set.

Metric              Naïve     HyperLogLog   Change
Memory              1193MB    5MB           1%    (Memory efficient)
Time (speed-up)     419sec    108sec        4x    (High speed)
Accuracy            100%      99%           -1%   (High accuracy)

This data set is small, but we are using HyperLogLog for bigger data.
Conclusion
• Streaming algorithms in Rakuten
– We are using them for data analysis.
– We have an internal C library with bindings.
• HyperLogLog, Count-Min Sketch, and so on.

– Future: Plan to implement other algorithms.

Reference
• HyperLogLog & HyperLogLog++
– [Flajolet et al., AOFA 2007], [Heule et al., EDBT 2013]

• Count-Min Sketch
– [Cormode, Muthukrishnan, J. Algorithms, 2005]

• An excellent slide by Alex Smola
– http://alex.smola.org/teaching/berkeley2012/slides/3_Streams.pdf

• AK TECH BLOG by Aggregate Knowledge
– http://blog.aggregateknowledge.com/

• Stream-lib by Clearspring
– https://github.com/clearspring/stream-lib


[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
