- 1. The Data Streaming Problem
  M.Zhanikeev -- maratishe@gmail.com -- Efficiency Tricks for Hashing and Blooming in Streaming Algorithms -- http://bit.do/marat140516
- 2. Data Streaming Problem
  - based on traditional Information Theory [01][02]
  - but a new formulation altogether [04]
  - data streaming: processes input in real time (no storage), producing space-efficient sketches as output
  - an alternative to database, indexing, offline processing, and similar technologies
  - [01] C.Shannon "A Mathematical Theory of Communication" The Bell System Tech.J (1948)
  - [02] D.MacKay "Information Theory, Inference, and Learning Algorithms" Cambridge UniPress (2003)
  - [04] S.Muthukrishnan "Data Streams: Algorithms and Applications" Theoretical Comp.Science (2005)
- 3. Data Streaming Problems
  - fast hashing [08]
  - efficient blooming [09][10]
  - space-efficient streaming algorithms
  - [Figure: diagram relating Data Streaming, Bloom Filter, Fast Hashing, Other Types of Hashing, and Other Uses]
  - [08] D.Lemire+1 "Strongly Universal String Hashing is Fast" Cornell Techreport (2013)
  - [09] F.Putze+2 "Cache- Hash- and Space-Efficient Bloom Filters" JEA Journal (2009)
  - [10] A.Kirsch+1 "Less Hashing, Same Performance: Building a Better Bloom Filter" Inderscience (2007)
- 4. Hashing and Blooming
- 5. Hashing Technology
  - perfect hashing
  - minimal perfect hashing
    - applied to blooming, but relatively inefficient [11]
  - universal hashing ← this is the one we use
    - with many efficiency tricks
    - bitwise fast hashing, etc. [12]
  - [11] G.Antichi+4 "Blooming Trees for Minimal Perfect Hashing" GLOBECOM (2008)
  - [12] F.Bonomi+4 "Bloom Filters via d-Left Hashing and Dynamic Bit Reassignment" 44th ACCCC (2006)
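The slide does not say which universal family is used. As a minimal sketch, assuming integer keys and the classic Carter-Wegman construction `((a*x + b) mod p) mod m`, a universal hash function could be drawn at random like this. The names `make_universal_hash` and `MERSENNE_PRIME` are hypothetical and chosen only for illustration.

```python
import random

# Assumption: keys are non-negative integers smaller than this prime.
MERSENNE_PRIME = (1 << 61) - 1


def make_universal_hash(num_buckets, rng=random):
    """Draw one function from the family h(x) = ((a*x + b) mod p) mod m."""
    a = rng.randrange(1, MERSENNE_PRIME)  # a must be non-zero
    b = rng.randrange(0, MERSENNE_PRIME)

    def h(key):
        return ((a * key + b) % MERSENNE_PRIME) % num_buckets

    return h


if __name__ == "__main__":
    h = make_universal_hash(16, random.Random(1))
    print([h(x) for x in range(8)])
```

Drawing `a` and `b` at random per table is what makes the family universal: for any two distinct keys, the chance of a collision over the random choice is roughly `1/num_buckets`.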
- 6. Hashing Quality Metrics
  - uniform distribution
  - avalanche condition
    - a change in one input bit changes about half of the output bits
  - no partial correlation
    - hard to achieve: head and tail bits have different qualities in common algorithms like CRC24, etc.
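The avalanche condition can be checked empirically. The sketch below is an illustration, not the author's measurement method: it flips each input bit in turn and reports the average fraction of output bits that change. The helper names `avalanche_ratio` and `sha256_32` are hypothetical, and SHA-256 truncated to 32 bits stands in for "a hash with good avalanche behaviour".

```python
import hashlib


def sha256_32(key):
    """Assumed reference hash: first 32 bits of SHA-256 over a 4-byte key."""
    digest = hashlib.sha256(key.to_bytes(4, "little")).digest()
    return int.from_bytes(digest[:4], "little")


def avalanche_ratio(hash_fn, keys, in_bits=32, out_bits=32):
    """Average fraction of output bits that flip when one input bit flips."""
    flipped = 0
    total = 0
    for key in keys:
        base = hash_fn(key)
        for bit in range(in_bits):
            diff = base ^ hash_fn(key ^ (1 << bit))
            flipped += bin(diff).count("1")  # number of differing output bits
            total += out_bits
    return flipped / total


if __name__ == "__main__":
    print("identity:", avalanche_ratio(lambda x: x, range(8)))
    print("sha256_32:", avalanche_ratio(sha256_32, range(64)))
```

A ratio close to 0.5 matches the avalanche condition. The identity function, where each input bit affects exactly one output bit, sits at the other extreme.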
- 7. Blooming Quality Metrics
  - true positives work, but false positives are also possible
  - a Bloom filter answers the question "have you seen this before?"
  - the time it takes to "fill up" a Bloom structure matters: the structure is useless afterwards
- 8. Bloom Filter Types
  - stop additions filter
  - deletion filter
  - counting filter
  - ... a very active research area [09][10]
  - ... reality: most of them are inefficient!
  - [09] F.Putze+2 "Cache- Hash- and Space-Efficient Bloom Filters" JEA Journal (2009)
  - [10] A.Kirsch+1 "Less Hashing, Same Performance: Building a Better Bloom Filter" Inderscience (2007)
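To make the counting and deletion variants concrete, here is a minimal sketch of a counting Bloom filter. It is not taken from the slides: the class name `CountingBloomFilter` is hypothetical, keys are assumed to be strings, and the `k` positions are derived from two base hashes (the double-hashing idea associated with [10]) using BLAKE2b as an assumed base hash.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with small counters instead of bits, so items can be removed."""

    def __init__(self, num_counters, num_hashes):
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item):
        # Double hashing: position_i = (h1 + i * h2) mod m.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only meaningful for items that were added; skip if clearly absent.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.counters[pos] > 0 for pos in self._positions(item))


if __name__ == "__main__":
    f = CountingBloomFilter(num_counters=1024, num_hashes=4)
    f.add("alice")
    print("alice" in f)
    f.remove("alice")
    print("alice" in f)
```

The counters cost several times the memory of a plain bit array, which is one concrete instance of the inefficiency the slide complains about.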
- 9. Efficiency
- 10. Efficiency (1) : Hash/Bloom
- 11. Efficiency (2) : Hash/Bloom
  - how many hash functions $k$? With $m$ bits and $n$ stored items: $k = \ln 2 \cdot \frac{m}{n} \approx 0.7\,\frac{m}{n}$
  - the "fill-up" rate, which shows when the filter becomes useless, follows from the probability that a given bit is still zero: $p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$
  - FP probability: $p_{FP} = (1 - p)^k \approx \left(1 - e^{-kn/m}\right)^k \approx \frac{1}{2^k}$ (the last step holds at the optimal $k$)
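The three formulas on this slide are easy to evaluate numerically. The sketch below assumes `m` is the number of bits and `n` the number of inserted items. The function names are hypothetical, and the sample values (`m = 10_000`, `n = 1_000`) are picked only for illustration.

```python
import math


def optimal_num_hashes(m, n):
    """k = ln(2) * m / n, rounded to a usable integer (at least 1)."""
    return max(1, round(math.log(2) * m / n))


def zero_bit_probability(m, n, k):
    """Exact probability that a given bit is still zero after n insertions."""
    return (1 - 1 / m) ** (k * n)


def false_positive_probability(m, n, k):
    """Approximate FP probability (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    m, n = 10_000, 1_000
    k = optimal_num_hashes(m, n)
    print("k =", k)
    print("p =", zero_bit_probability(m, n, k))
    print("p_FP =", false_positive_probability(m, n, k))
```

With 10 bits per item this gives `k = 7` and a false-positive probability of roughly 0.8%, close to the `1/2^k` rule of thumb.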
- 12. Efficiency (2) : Hotspot Input
  - most data today follows a hotspot distribution, modeled as an SB process
- 13. Efficiency (3) : DLL and Collisions
  - a practical alternative to perfect hashing
  - catch and resolve collisions using a sideways DLL
  - hotspots: move changed items to the top of the DLL
  - common in C/C++
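A minimal sketch of the collision-chain idea, assuming "DLL" means a doubly linked list hanging off each bucket. A Python list stands in for the linked list here, so moving an entry costs more than the pointer swap it would be in C/C++. The class name `MoveToFrontTable` is hypothetical.

```python
class MoveToFrontTable:
    """Hash table with per-bucket collision chains; updated keys move to the front."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]  # unlink the old entry
                break
        bucket.insert(0, (key, value))  # changed item goes to the top of the chain

    def get(self, key, default=None):
        # Hot (recently changed) keys sit near the front, so the scan ends early.
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


if __name__ == "__main__":
    table = MoveToFrontTable(num_buckets=1)  # one bucket forces every key to collide
    for name, count in [("a", 1), ("b", 2), ("c", 3), ("a", 10)]:
        table.put(name, count)
    print(table.buckets[0])
```

Under hotspot input a few keys receive most of the updates, so keeping them at the head of the chain keeps the average scan short even when the chains grow.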
- 14. Data Streaming Examples
- 15. Examples (1) : Heavy Hitters
  - Objective: find heavy hitters in a hotspot-distributed input.
  - find the k most frequently accessed items in a list
  - good algorithms can be found in [04]
  - [04] S.Muthukrishnan "Data Streams: Algorithms and Applications" Theoretical Comp.Science (2005)
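The slide points to [04] without naming a specific algorithm. As one well-known example, swapped in here purely for illustration, the Misra-Gries summary keeps a fixed number of counters and guarantees that any item occurring more than `n / (num_counters + 1)` times in a stream of length `n` survives. The function name `misra_gries` and its parameters are hypothetical.

```python
def misra_gries(stream, num_counters):
    """Return candidate heavy hitters using at most num_counters counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop counters that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    stream = []
    for i in range(10):
        stream += ["a"] * 5 + ["b"] * 3 + [f"noise{i}x", f"noise{i}y"]
    print(misra_gries(stream, num_counters=3))
```

The surviving counts are underestimates, so a second pass (or an accepted error margin) is needed if exact frequencies or a strict top-k ranking matter.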
- 16. Examples (2) : Superspreaders
  - Objective: detect superspreaders, i.e. items which access or are accessed by exceedingly many other items.
  - computer viruses, botnets, etc.
  - one source, many destinations
  - short lifespan
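As a baseline for the definition above, the sketch below counts distinct destinations per source with exact sets. This is deliberately not space-efficient; a streaming version would presumably replace each set with a compact sketch such as a Bloom filter. The function name `find_superspreaders` and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict


def find_superspreaders(pairs, threshold):
    """Return sources that contact at least `threshold` distinct destinations."""
    destinations = defaultdict(set)
    for source, destination in pairs:
        destinations[source].add(destination)
    return {src for src, dsts in destinations.items() if len(dsts) >= threshold}


if __name__ == "__main__":
    pairs = [("x", i) for i in range(10)] + [("y", 1)] * 10
    print(find_superspreaders(pairs, threshold=5))
```

Note the difference from heavy hitters: source `y` appears just as often as `x` in the example but always contacts the same destination, so only `x` qualifies.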
- 17. Examples (3) : M2M Patterns
  - Objective: detect M2M patterns, a more generic case of heavy hitters and superspreaders in which the patterns are not known in advance.
  - m2m communication patterns
  - space efficiency is important
  - selective filtering: pick only the interesting m2m units
- 18. The Why : Practical Application - BigData
- 19. BigData: Today
  - [05] K.Shvachko "HDFS Scalability: the Limits to Growth" the Magazine of USENIX (2012)
- 20. BigData Replay (new)
- 21. BigData on Multicore
- 22. That’s all, thank you ...
- 23.
  - [01] C.Shannon (1948) "A Mathematical Theory of Communication" The Bell System Tech.J
  - [02] D.MacKay (2003) "Information Theory, Inference, and Learning Algorithms" Cambridge UniPress
  - [03] A.Konheim (2010) "Hashing in Computer Science: Fifty Years of Slicing and Dicing" Wiley
  - [04] S.Muthukrishnan (2005) "Data Streams: Algorithms and Applications" Theoretical Comp.Science
  - [05] K.Shvachko (2012) "HDFS Scalability: the Limits to Growth" the Magazine of USENIX
- 24.
  - [06] S.Heinz+2 (2002) "Burst Tries: A Fast, Efficient Data Structure for String Keys" ACM TOIS
  - [07] M.Ramakrishna+1 (1997) "Performance in Practice of String Hashing Functions" 5th ICDSAA
  - [08] D.Lemire+1 (2013) "Strongly Universal String Hashing is Fast" Cornell Techreport
  - [09] F.Putze+2 (2009) "Cache- Hash- and Space-Efficient Bloom Filters" JEA Journal
  - [10] A.Kirsch+1 (2007) "Less Hashing, Same Performance: Building a Better Bloom Filter" Inderscience
  - [11] G.Antichi+4 (2008) "Blooming Trees for Minimal Perfect Hashing" GLOBECOM
- 25.
  - [12] F.Bonomi+4 (2006) "Bloom Filters via d-Left Hashing and Dynamic Bit Reassignment" 44th ACCCC
