Qubism and scala nlp

Qubism and NLP at Scale
Jerome Banks, Principal Big Data Engineer
March 26, 2020

AGENDA
▪ INTENT AT DEMANDBASE
▪ WHAT IS THE PROBLEM?
▪ WHAT IS THE APPROACH?
▪ QUBISM AS A SOLUTION

© 2019 DEMANDBASE
B2B Real-Time Intent
▪ Buying Signals: 4.55 trillion yearly signals
▪ People Coverage: 80% of B2B employees
▪ Content Coverage: 940 million web pages from over 2.9 million publishers
▪ 50x the scale, 20x the granularity, and 9x more accounts than Bombora (as of 1 May 2019)

Bags O’Keywords
▪ Classical NLP technique
▪ Bags of Words are sparse vectors
▪ Represented as Map[String,Double]
▪ Task is to generate lots of BOKs
• Per Domain
• Per Publisher
• Globally (for TF-IDF)
• Combined with other possible attributes (geo, language, industry)
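
As a rough illustration of the bullets above, here is a minimal sketch in plain Scala (no Spark) of a bag of keywords as a Map[String,Double], with a global document-frequency bag used for TF-IDF weighting. The object and function names are hypothetical and are not taken from Qubism; the weighting tf * ln(N / df) is one common TF-IDF variant, assumed here for the example.

```scala
object BagOfKeywordsSketch {
  // A bag of keywords is a sparse vector: keyword -> weight.
  type Bag = Map[String, Double]

  // Count how often each keyword occurs in one document.
  def termFrequencies(tokens: Seq[String]): Bag =
    tokens.groupBy(identity).map { case (word, hits) => word -> hits.size.toDouble }

  // Global bag: in how many documents does each keyword appear?
  def documentFrequencies(docs: Seq[Bag]): Bag =
    docs.flatMap(_.keys).groupBy(identity).map { case (word, hits) => word -> hits.size.toDouble }

  // Weight a bag by inverse document frequency: tf * ln(N / df).
  def tfIdf(bag: Bag, docFreq: Bag, numDocs: Int): Bag =
    bag.map { case (word, tf) =>
      val df = docFreq.getOrElse(word, 1.0) // assumption: unseen keywords count as df = 1
      word -> tf * math.log(numDocs / df)
    }
}
```

A keyword that appears in every document gets weight 0, which is why the global bag is needed alongside the per-domain and per-publisher bags.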

Aggregation is Feature Extraction
▪ Feature Extraction is Aggregation
▪ Aggregation is dimensionality reduction
• Lots of events reduce to a smaller number of aggregates
▪ Aggregates are more than Dashboards
• Graphs and charts are nice but often not actionable
▪ Generate lots of features to drive machine learning
• Model Development
• Clustering/Similarity
• Outliers/Indexing

In the Beginning was Brickhouse
▪ Library of Hive UDFs and UDAFs
▪ Used for generating the Klout Score
▪ Open-sourced
▪ http://github.com/klout/brickhouse
▪ Used by pipelines around the world

Next generation is Qubism
▪ Scala Spark Library
▪ Re-usable transformers (DataFrame) -> DataFrame
▪ Focus on Aggregation/Feature transformation
▪ XUnits and YPaths
• Multi-dimensional feature representation
▪ Bridge to Algebird
• (Aggregator) -> UserDefinedAggregateFunction
▪ Exotic Aggregators
• Collect - ArgMax
• Cardinality estimation - KMV, HLL sketches
• Vectors
• Timeseries

XUnits and YPaths
XUnit strings represent slice-and-dice segments (YPaths)
Single event row explodes to multiple XUnits
Dimensions (YPaths) can be added or removed
(domain="db.com",
page="home.html",
account="1234",
country="US",
city="San Francisco",
industry="AdTech")
/page/domain=db.com
/account/id=1234
/industry/type=AdTech
/geo/country=US
/geo/country=US/city=San Francisco
/page/domain=db.com/page=home.html
/geo/country=US,/page/domain=db.com
/geo/country=US,/page/domain=db.com/page=home.html
/geo/country=US/city=San Francisco,/page/domain=db.com
/account/id=1234,/page/domain=db.com
/account/id=1234,/page/domain=db.com/page=home.html
/industry/type=AdTech,/page/domain=db.com
/industry/type=AdTech,/page/domain=db.com/page=home.html
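
The explosion shown above can be sketched in plain Scala as follows. This is only an illustration under stated assumptions: the YPath case class, its (label, attribute) field mapping, and the explode function are hypothetical names invented for this example and are not the Qubism DSL. It assumes an XUnit is either one YPath or two YPaths joined by a comma, as in the listing.

```scala
object XUnitExplodeSketch {
  type Event = Map[String, String]

  // Hypothetical YPath definition: a dimension name plus (label, event attribute) pairs.
  final case class YPath(dimension: String, fields: Seq[(String, String)]) {
    // Render "/dimension/label=value/..." or None if the event lacks an attribute.
    def render(event: Event): Option[String] = {
      val parts = fields.map { case (label, attr) => event.get(attr).map(v => s"$label=$v") }
      if (parts.forall(_.isDefined)) Some(s"/$dimension/" + parts.flatten.mkString("/")) else None
    }
  }

  // Explode one event into single-YPath XUnits plus the requested two-YPath combinations.
  def explode(event: Event, singles: Seq[YPath], pairs: Seq[(YPath, YPath)]): Seq[String] = {
    val single = singles.flatMap(_.render(event))
    val paired = pairs.flatMap { case (a, b) =>
      for (x <- a.render(event); y <- b.render(event)) yield s"$x,$y"
    }
    single ++ paired
  }
}
```

Adding or removing a dimension then means adding or removing a YPath from the lists passed to explode; the event schema itself does not change.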

XUnits and YPaths
▪ Event rows are exploded to multiple XUnits in the map phase
▪ Annotated rows are distributed by XUnit in the shuffle/sort phase
▪ XUnit aggregates are produced in the reduce phase
(domain="db.com",
page="home.html",
account="1234", country="US",
city="San Francisco",
industry="AdTech")
/page/domain=db.com
/account/id=1234
/industry/type=AdTech
/geo/country=US
/geo/country=US/city=San Francisco
/page/domain=db.com/page=home.html
/geo/country=US,/page/domain=db.com
/geo/country=US,/page/domain=db.com/page=home.html
/geo/country=US/city=San Francisco,/page/domain=db.com
/account/id=1234,/page/domain=db.com
/account/id=1234,/page/domain=db.com/page=home.html
/industry/type=AdTech,/page/domain=db.com
/industry/type=AdTech,/page/domain=db.com/page=home.html
Event row -> Explode!!! -> XUnit -> count(*) group by XUnit
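
The map, shuffle and reduce phases above can be mimicked on ordinary Scala collections. The sketch below is an assumption-laden stand-in for the Spark job: countByXUnit is a hypothetical name, and the input is taken to be rows that were already exploded into XUnit strings.

```scala
object XUnitCountSketch {
  // Each input row is one event, already exploded into its XUnit strings (the "map" phase).
  def countByXUnit(explodedRows: Seq[Seq[String]]): Map[String, Long] =
    explodedRows
      .flatten                 // one record per (event, XUnit) pair
      .groupBy(identity)       // the "shuffle": bring equal XUnits together
      .map { case (xunit, hits) => xunit -> hits.size.toLong } // the "reduce": count(*)
}
```

Because the XUnit is a single string key, the same grouping works for any mix of dimensions without changing the table schema.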

XUnits and YPaths
Advantages
▪ Single string key to represent arbitrary segment
▪ Maps nicely to key/value stores
▪ Dimensions can be easily added or removed
▪ Simplifies table schemas
▪ Qubism provides tools for using XUnits
▪ DSL for specifying YPath dimensions
▪ Transforms for exploding/aggregating XUnit DataFrame
▪ Common operations on XUnit DataFrame
• Ranking, Outlier detection, Indexing, Clustering
▪ UDFs for parsing/manipulating XUnit strings
▪ FilterRules for controlling size of explosion
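
To make the idea of parsing XUnit strings concrete, here is a minimal sketch in plain Scala. The names ParsedYPath, parseYPath, parseXUnit and hasDimension are hypothetical and do not come from Qubism. It assumes the format seen in the examples: YPaths separated by commas, each of the form /dimension/label=value/..., with no commas or slashes inside values.

```scala
object XUnitParseSketch {
  // One YPath: a dimension name plus ordered label=value pairs.
  final case class ParsedYPath(dimension: String, attributes: List[(String, String)])

  // Parse "/geo/country=US/city=San Francisco" into its dimension and attributes.
  def parseYPath(ypath: String): ParsedYPath = {
    val segments = ypath.split("/").toList.filter(_.nonEmpty)
    val attrs = segments.drop(1).map { seg =>
      seg.split("=", 2) match {
        case Array(label, value) => label -> value
        case _                   => seg -> "" // segment without '=': keep label, empty value
      }
    }
    ParsedYPath(segments.headOption.getOrElse(""), attrs)
  }

  // An XUnit is a comma-separated list of YPaths.
  def parseXUnit(xunit: String): List[ParsedYPath] =
    xunit.split(",").toList.filter(_.nonEmpty).map(parseYPath)

  // Does the XUnit contain a YPath for the given dimension?
  def hasDimension(xunit: String, dimension: String): Boolean =
    parseXUnit(xunit).exists(_.dimension == dimension)
}
```

A predicate such as hasDimension is the kind of building block a filter rule could use to limit which combinations are kept, although how Qubism's FilterRules work is not described here.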

Aggregator
▪ Analogous to Algebird
▪ Monoid in Category Theory
▪ Supports Associative operations
▪ Qubism implements an easy transformation to Spark’s (painful) UserDefinedAggregateFunction
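
Why associativity matters can be shown with a small sketch in plain Scala. The Agg trait below is hypothetical: it is loosely modelled on the prepare/present shape of an Algebird Aggregator but is not Algebird's or Qubism's actual interface. ArgMax is used as the example aggregate.

```scala
object AggregatorSketch {
  // Hypothetical monoid-style aggregator: zero is the identity, combine is associative.
  trait Agg[In, Buf, Out] {
    def zero: Buf                    // identity element
    def prepare(in: In): Buf         // lift one input row into the buffer type
    def combine(a: Buf, b: Buf): Buf // associative merge of two partial results
    def present(buf: Buf): Out       // turn the final buffer into the output
  }

  // ArgMax: keep the label with the largest score (first one wins on ties).
  object ArgMax extends Agg[(String, Double), Option[(String, Double)], Option[String]] {
    val zero: Option[(String, Double)] = None
    def prepare(in: (String, Double)): Option[(String, Double)] = Some(in)
    def combine(a: Option[(String, Double)], b: Option[(String, Double)]): Option[(String, Double)] =
      (a, b) match {
        case (Some(x), Some(y)) => if (y._2 > x._2) b else a
        case (None, _)          => b
        case _                  => a
      }
    def present(buf: Option[(String, Double)]): Option[String] = buf.map(_._1)
  }

  // Aggregate each partition on its own, then merge the partial buffers.
  def run[In, Buf, Out](agg: Agg[In, Buf, Out], partitions: Seq[Seq[In]]): Out = {
    val partials = partitions.map(_.map(agg.prepare).foldLeft(agg.zero)(agg.combine))
    agg.present(partials.foldLeft(agg.zero)(agg.combine))
  }
}
```

Since combine is associative and zero is an identity, the result does not depend on how rows are split across partitions, which is the property a distributed aggregate function relies on.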

Vector - What’s the Vector, Victor?
Qubism models Sparse Vectors as Map[String,Double]
▪ Aggregate vectors by collecting keywords
▪ Merge vectors by doing vector sums
▪ UDFs for vector operators
• Scalar multiply, normalize
• Dot-product, cosine-similarity
▪ VectorBuffer
• Efficient data-structure for Serialization
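
The vector operators listed above are straightforward on Map[String,Double]. The following is a minimal sketch in plain Scala with hypothetical function names; it does not show Qubism's UDFs or its VectorBuffer.

```scala
object SparseVectorSketch {
  type Vec = Map[String, Double]

  // Vector sum: add weights key by key; missing keys count as 0.
  def add(a: Vec, b: Vec): Vec =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  // Scalar multiply.
  def scale(a: Vec, factor: Double): Vec = a.map { case (k, v) => k -> v * factor }

  // Dot product: iterate over the smaller map so cost follows the sparser vector.
  def dot(a: Vec, b: Vec): Double = {
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (k, v) => v * large.getOrElse(k, 0.0) }.sum
  }

  // Euclidean (L2) length.
  def norm(a: Vec): Double = math.sqrt(dot(a, a))

  // L2-normalize; an all-zero vector is returned unchanged.
  def normalize(a: Vec): Vec = {
    val n = norm(a)
    if (n == 0.0) a else scale(a, 1.0 / n)
  }

  // Cosine similarity; defined as 0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot(a, b) / denom
  }
}
```

The add function is associative with the empty map as identity, so keyword vectors can be merged in any order.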

KMV Sketch Set
Qubism provides implementation of KMV sketch set
▪ Estimate cardinality of large sets in a fixed amount of space
▪ Exact for small reach, within 1% for sets > 10,000
▪ Jaccard set similarity
▪ Collaborative filtering
▪ LongBufferSeq provides fast merges, serialization
Hash range: -MaxLong ... 0 ... +MaxLong
Reach = (K * 2 * MaxLong) / (Kth Max Hash + MaxLong)
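
The reach estimate above can be tried out with a small K-minimum-values sketch in plain Scala. Everything here is an assumption made for illustration: the names, the use of an immutable SortedSet instead of LongBufferSeq, and the choice of the first 8 bytes of an MD5 digest as the 64-bit hash. It is not the Qubism implementation.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import scala.collection.immutable.SortedSet

object KmvDemo {
  // Hash an item to a signed 64-bit value (first 8 bytes of its MD5 digest).
  def hash64(item: String): Long =
    ByteBuffer.wrap(
      MessageDigest.getInstance("MD5").digest(item.getBytes(StandardCharsets.UTF_8))
    ).getLong

  // Keeps only the k smallest hash values seen so far.
  final case class Kmv(k: Int, hashes: SortedSet[Long]) {
    def add(item: String): Kmv = copy(hashes = (hashes + hash64(item)).take(k))

    // Union of two sketches: pool the hashes and keep the k smallest again.
    def merge(other: Kmv): Kmv = copy(hashes = (hashes ++ other.hashes).take(k))

    // Exact below k distinct items; otherwise K * 2 * MaxLong / (Kth hash + MaxLong).
    def estimate: Double =
      if (hashes.size < k) hashes.size.toDouble
      else {
        val maxLong = Long.MaxValue.toDouble
        k * 2.0 * maxLong / (hashes.last.toDouble + maxLong)
      }
  }

  def empty(k: Int): Kmv = Kmv(k, SortedSet.empty[Long])
}
```

Merging two sketches yields the same K hashes as sketching the union directly, which is what makes per-XUnit sketches combinable across time ranges. The accuracy of the estimate depends on K.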

What about Intent?
▪ Generate XUnits based on parsed document attributes
▪ Publisher, Domain, Geo, Language
▪ Aggregate Keyword Vectors per XUnit
▪ Generate Global Vectors for TF-IDF
▪ Merge Vectors over various time ranges
▪ Aggregate KMV sketches of various UUIDs
▪ Sizing
▪ Clustering and Collaborative filtering
▪ Calculate scores by comparing Vectors
▪ Cosine similarity, Dot-product

Conclusion
▪ Data Engineering is really all about aggregation
▪ You can’t sort the universe
▪ Re-use today what you did yesterday
▪ Generate as many aggregates/features as possible
▪ Gain insight by analyzing everything at once
▪ Qubism is a re-usable Scala Spark Library
▪ Unit of re-use in Data Engineering is Functional
• (DataFrame) -> DataFrame
▪ Generate and manipulate XUnits and YPaths
▪ Implement exotic and efficient Aggregators
