Count-Min Tree Sketch : Approximate counting for NLP tasks

Guillaume Pitel, Tech'Mentor BigData - Epitech Innovation Hub
PAGE1
www.exensa.com
PRESENTER: GUILLAUME PITEL, 2016 JUNE 9
Approximate counting for NLP
Count-Min Tree Sketch
Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand, Abdul Mouhamadsultane
[Figure: an 8-counter tree of shared counters; two counters are read as b=2/c=110 and b=4/c=01011001, with a conflict between counters 4 and 7]
PAGE2
A bit of context
Why do we need to count?
Data analysis platform: eXenGine.
It processes different kinds of data (mostly text).
We need to create relevant cross-features; to do that, we need to count the occurrences of all possible
cross-features. For text data, a particular kind of cross-feature is known as an n-gram.
There are many different measures for deciding whether an n-gram is interesting. All of them require counting the
occurrences of the cross-feature and of the features themselves (i.e. counting bigrams and the words in those
bigrams).
Exact counting is easy and distributable, but very slow because of its memory usage. Holding the
whole data structure containing the counts in memory is impossible, so one has to resort to
huge map/reduce jobs with joins.
PAGE3
A bit of context
What kind of data are we talking about?
Google N-grams
  tokens                  1,024 billion
  sentences               95 billion
  1-grams (count > 200)   14 million
  2-grams (count > 40)    314 million
  3-grams                 977 million
  4-grams                 1.3 billion
  5-grams                 1.2 billion
PAGE4
A bit of context
What kind of data are we talking about?
Zipfian distribution [Le Quan et al. 2003]
PAGE5
A bit of context
What kind of measures are we talking about?
PMI (pointwise mutual information), TF-IDF (term frequency-inverse document frequency), LLR (log-likelihood ratio)
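To make the role of counts concrete, here is a minimal sketch of PMI computed from raw counts. It assumes the usual definition, log2 of p(x,y) / (p(x) p(y)), and for simplicity uses a single total for both unigram and bigram probabilities. The function name and the numbers are illustrative, not taken from the talk.

```python
import math

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information of a bigram (x, y) from raw counts.

    Simplifying assumption: one `total` normalises both bigram and
    unigram counts.
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts, for illustration only.
score = pmi(count_xy=50, count_x=1_000, count_y=2_000, total=1_000_000)
print(round(score, 3))
```

Because the counts only appear inside a logarithm, an error on their order of magnitude matters far more than a small absolute error.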
PAGE6
A bit of context
Summary / Goals
o Many counts: we need to store a large number of counts.
o Logarithms in measures: we care about the order of magnitude.
o Fast and memory controlled: we don’t want a distributed memory for the counts.
o Zipfian counts: many very small counts that will be filtered out later.
PAGE7
A bit of context
Summary / Goals
Given these goals, we can use probabilistic structures.
PAGE8
Count-Min Sketch
A probabilistic data structure to store counts [Cormode & Muthukrishnan 2005]
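As a reminder of how the structure works, here is a minimal Count-Min Sketch: `depth` rows of `width` counters, one hash per row, updates hit one cell per row, and a query returns the minimum over those cells. The class name, the choice of salted BLAKE2b as the hash family, and the sizes are assumptions made for this sketch, not details from the talk.

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min Sketch backed by plain Python lists."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, obtained by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def query(self, item: str) -> int:
        # Collisions can only inflate a cell, so the minimum is the best estimate.
        return min(self.rows[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch(width=1024, depth=4)
for token in ["the", "cat", "the", "mat", "the"]:
    cms.add(token)
print(cms.query("the"))  # never below the true count of 3
```

The estimate is never lower than the true count; it is higher only when the item collides with other items in every row.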
PAGE9
Count-Min Sketch
A probabilistic data structure to store counts
Conservative update: improve the CMS by incrementing only the counters that currently hold the minimum value.
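Conservative update can be sketched as follows. The helper names (`cell`, `conservative_add`, `estimate`) and the salted-BLAKE2b hashing are assumptions for illustration; only the rule itself (touch just the minimum cells) comes from the slide.

```python
import hashlib

def cell(item: str, row: int, width: int) -> int:
    """Column index of `item` in a given row (salted BLAKE2b, an assumption)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % width

def conservative_add(rows: list, item: str) -> None:
    """Increment only the cells of `item` that hold the current minimum."""
    width = len(rows[0])
    cells = [(r, cell(item, r, width)) for r in range(len(rows))]
    current_min = min(rows[r][c] for r, c in cells)
    for r, c in cells:
        if rows[r][c] == current_min:
            rows[r][c] += 1

def estimate(rows: list, item: str) -> int:
    width = len(rows[0])
    return min(rows[r][cell(item, r, width)] for r in range(len(rows)))

rows = [[0] * 16 for _ in range(3)]
for _ in range(4):
    conservative_add(rows, "cat")
print(estimate(rows, "cat"))
```

A cell that is already inflated by collisions is left alone, so it does not drift further from the true count.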
PAGE10
Count-Min Log Sketch
A probabilistic data structure to store logarithmic counts
[Pitel & Fouquier, 2015]: the same idea as [Talbot, 2009], applied to a Count-Min Sketch.
Instead of regular 32-bit counters, we use 8- or 16-bit “Morris” counters that count
logarithmically.
Since counts are used inside logarithms anyway, the error on PMI, TF-IDF, etc. is almost the same, but we can
fit more counters in the same space.
However, a count of 1 still uses as much memory as a count of 10,000. Also, beyond some
point the error stops improving with space (there is an inherent residual error).
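A Morris-style logarithmic counter can be sketched as below. This is the textbook scheme (increment with probability base^-c, estimate (base^c - 1) / (base - 1)); the base of 1.08 and the function names are assumptions, and the exact variant used in the Count-Min Log Sketch may differ.

```python
import random

def morris_increment(c: int, base: float, rng: random.Random) -> int:
    """Advance the stored exponent with probability base ** -c."""
    if rng.random() < base ** (-c):
        return c + 1
    return c

def morris_estimate(c: int, base: float) -> float:
    """Expected number of events that led to the stored exponent c."""
    return (base ** c - 1) / (base - 1)

rng = random.Random(42)   # fixed seed so the run is repeatable
c = 0
for _ in range(10_000):
    c = morris_increment(c, base=1.08, rng=rng)

# The stored exponent stays small enough for an 8-bit cell,
# while the estimate tracks the order of magnitude of 10,000.
print(c, round(morris_estimate(c, 1.08)))
```

The stored value grows roughly with the logarithm of the count, which is why a small cell covers a very large range, at the price of a relative error that does not vanish.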
PAGE11
Count-Min Tree Sketch
A Count-Min Sketch with shared counters
Idea: use a hierarchical storage where the most significant bits are shared
between counters.
Somewhat similar to TOMB counters [Van Durme, 2009], except that
overflow is managed very differently.
PAGE12
Tree Shared Counters
Sharing most significant bits
Or: how can we store counts with an average approaching 4 bits / counter?
8 counters structure
o A tree is made of three kinds of storage:
  o Counting bits
  o Barrier bits
  o Spire (not required except for performance)
o Several layers alternate counting and barrier bits.
o Here we have a <[(8,8),(4,4),(2,2),(1,1)],4> counter.
[Figure: the 8-counter tree, with labels for barrier bits, counting bits, the spire and the base layer]
PAGE13
Tree Shared Counters
Sharing most significant bits
Or: how can we store counts with an average approaching 4 bits / counter?
8 counters structure
o 8 counters in 30 bits + spire
o Without a spire, n bits can count up to $3 \times 2^{1 + \log_2 \frac{n}{4}}$
o Many small shared counters with spires are more efficient than a large shared counter
[Figure: the 8-counter tree, with labels for barrier bits, counting bits, the spire and the base layer]
PAGE14
Tree Shared Counters
Reading values
o A counter stops at the first ZERO barrier
o When two barrier paths meet, there is a conflict
o Barrier length (b) is evaluated in unary
o Counter bits (c) are evaluated in a more classical way, as an ordinary binary number
[Figure: the 8-counter tree with two counters read out as b=2/c=110 and b=4/c=01011001, and a conflict between counters 4 and 7]
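The two readings in the figure (b=2/c=110 and b=4/c=01011001) can be turned into values with a small sketch. The decoding rule used here, value = c + 2 × (2^b − 1), is an assumption: it is not stated on the slides, but it is consistent with the increment sequence 0 to 6 and with the bit weights 1, 2, 2, 4, 4, 8 shown later in the deck. The function name is hypothetical.

```python
def counter_value(b: int, c_bits: str) -> int:
    """Decode one tree counter.

    b      -- barrier length (number of consecutive set barrier bits), unary
    c_bits -- counting bits gathered along the path, most significant first

    Assumed rule: value = c + 2 * (2**b - 1).
    """
    c = int(c_bits, 2)
    return c + 2 * (2 ** b - 1)

print(counter_value(2, "110"))       # 6 + 2 * 3 = 12
print(counter_value(4, "01011001"))  # 89 + 2 * 15 = 119
```

Under this rule each set barrier bit contributes a fixed offset (2, then 4, then 8, and so on), and the counting bits add an ordinary binary number on top.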
PAGE15
Tree Shared Counters
Incrementing (counter 5)
[Figure: three snapshots of the 8-counter tree as counter 5 is incremented, showing the values 0, 1 and 2]
PAGE16
Tree Shared Counters
Incrementing (counter 5)
[Figure: three snapshots of the 8-counter tree as counter 5 is incremented, showing the values 3, 4 and 5]
PAGE17
Tree Shared Counters
Incrementing (counter 5)
[Figure: the 8-counter tree once counter 5 reaches the value 6, annotated "A bit at that level is worth …" with the weights 1, 2, 2, 4, 4, 8]
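The increment sequence for counter 5 can be replayed on an isolated counter. This is a simplified sketch under stated assumptions: the counter is represented only by its barrier length b and counting value c, sharing with neighbouring counters, conflicts and the spire are ignored, and the decoding rule value = c + 2 × (2^b − 1) is inferred from the slides rather than quoted from them. The authors mention at least four ways to increment, and this is not claimed to be one of them.

```python
def value(b: int, c: int) -> int:
    """Assumed decoding rule: value = c + 2 * (2**b - 1)."""
    return c + 2 * (2 ** b - 1)

def increment(b: int, c: int) -> tuple:
    """Add one to an isolated counter.

    With b barrier bits set, the counter owns b + 1 counting bits.
    When they are all ones, one more barrier bit is set and the
    counting bits are cleared.
    """
    if c < 2 ** (b + 1) - 1:
        return b, c + 1
    return b + 1, 0

state = (0, 0)
trace = []
for _ in range(7):
    trace.append((state, value(*state)))
    state = increment(*state)

for (b, c), v in trace:
    print(f"b={b} c={c:b} -> {v}")
```

The trace goes through (b, c) = (0,0), (0,1), (1,0), (1,1), (1,2), (1,3), (2,0), which matches the seven tree states drawn for the values 0 to 6.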
PAGE18
Count-Min Tree Sketches
Experiments
Results !
• 140M tokens from English Wikipedia*
• 14.7M words (unigrams + bigrams)
• Reference counts stored in an UnorderedMap: 815MiB
Perfect storage size: suppose we have a perfect hash function and store the counts in 32-bit
counters. For 14.7M words, this amounts to 59MiB.
Performance: our implementation of a CMTS using <[(128,128),(64,64)…],32> counters performs on par with the native
UnorderedMap.
We use 3-layer sketches (a good performance/precision tradeoff).
* We preferred to test our counters with a large number of parameters rather than with a large
corpus, so we limited the corpus to 5% of Wikipedia.
PAGE19
Count-Min Tree Sketches
Average Relative Error
Results !
PAGE20
Count-Min Tree Sketches
RMSE
Results !
PAGE21
Count-Min Tree Sketches
RMSE on PMI
Results !
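The error measures named on these result slides can be sketched as follows. The definitions are the standard ones (mean of |estimate − truth| / truth, and root of the mean squared error); it is an assumption that the plots use exactly these, and the function names and sample numbers are illustrative.

```python
import math

def average_relative_error(true_counts: dict, estimates: dict) -> float:
    """Mean of |estimate - truth| / truth over all reference items."""
    return sum(
        abs(estimates[key] - truth) / truth
        for key, truth in true_counts.items()
    ) / len(true_counts)

def rmse(true_counts: dict, estimates: dict) -> float:
    """Root mean squared error over all reference items."""
    return math.sqrt(
        sum((estimates[key] - truth) ** 2 for key, truth in true_counts.items())
        / len(true_counts)
    )

# Hypothetical reference counts and sketch estimates.
truth = {"the": 10, "cat": 2}
approx = {"the": 12, "cat": 2}
print(average_relative_error(truth, approx))  # (0.2 + 0.0) / 2
print(rmse(truth, approx))                    # sqrt(4 / 2)
```

For "RMSE on PMI", the same `rmse` function would be applied to PMI scores computed from exact counts and from sketch estimates, rather than to the counts themselves.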
PAGE22
Count-Min Tree Sketch
Question: are CMTS really useful in real life?
1 – CMTS are better on the whole vocabulary, but what happens if we
skip the least frequent words / bigrams?
2 – CMTS are better on average, but what happens quantile by quantile?
PAGE23
Count-Min Tree Sketches
PMI Error per quantile
(sketches at 50% of the perfect size, evaluation limited to f > $10^{-7}$)
Results !
PAGE24
Count-Min Tree Sketches
Relative Error per log2-quantile
(sketches at 50% of the perfect size, evaluation limited to f > $10^{-7}$)
Results !
PAGE25
Conclusion
Where are we?
CMTS significantly outperforms other methods for storing and updating Zipfian counts
efficiently.
Because most of the time spent in sketch accesses is due to memory access, its speed is on par with
other methods.
• Main drawback: at very high (and impractical anyway) pressures (less than 10% of the perfect storage
size), the error skyrockets.
• Other drawback: the implementation is not straightforward. We have devised at least 4 different ways to
increment the counters.
Merging (and thus distributing) is easy once you can read and set a counter.
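The remark on merging can be illustrated with a generic sketch: given a way to read counter i from two sketches and a way to set counter i in the result, merging is a loop of read, add, set. Plain lists stand in for the real counter storage here, and the `merge` function and its parameters are hypothetical; the sketch assumes both inputs use the same hash functions and dimensions.

```python
from typing import Callable

def merge(
    read_a: Callable[[int], int],
    read_b: Callable[[int], int],
    set_out: Callable[[int, int], None],
    n_counters: int,
) -> None:
    """Write the cell-wise sum of two sketches into a third one."""
    for i in range(n_counters):
        set_out(i, read_a(i) + read_b(i))

# Plain lists as stand-ins for two sketches built on different shards.
shard_a = [1, 0, 3]
shard_b = [2, 5, 0]
merged = [0] * 3
merge(shard_a.__getitem__, shard_b.__getitem__, merged.__setitem__, 3)
print(merged)  # [3, 5, 3]
```

Since only `read` and `set` are needed, the same loop applies whatever the internal bit layout of the counters is.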
PAGE26
Conclusion
Where are we going?
Dynamic: we are working on a CMTS version that can grow automatically (more layers added below).
Pressure control: when we detect that the pressure is becoming too high, we can divide and subsample to
stop collisions from cascading.
An open-source Python package is on its way.