Qubism and NLP at Scale
Jerome Banks, Principal Big Data Engineer
March 26, 2020
AGENDA
▪ INTENT AT DEMANDBASE
▪ WHAT IS THE PROBLEM?
▪ WHAT IS THE APPROACH?
▪ QUBISM AS A SOLUTION
© 2019 DEMANDBASE|SLIDE 3
B2B Real-Time Intent
Buying Signals
4.55 Trillion
Yearly signals
80% B2B Employees
People Coverage
940 million web pages
From over 2.9 million publishers
Content Coverage
50x Scale
20x Granularity
9x More Accounts
Than Bombora
(As of 1 May 2019)
Bags O’Keywords
▪ Classical technique of NLP
▪ Bags of Words are sparse vectors
▪ Represented as Map[String,Double]
▪ Task is to generate lots of BOKs (bags of keywords)
▪ Per Domain
▪ Per Publisher
▪ Globally (for TF-IDF)
▪ Combined with other possible attributes (geo, language, industry)
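A minimal sketch of the bag-of-keywords idea in plain Scala, assuming documents arrive already tokenized. The names `BagOfKeywords`, `fromTokens`, `merge`, and `perKey` are illustrative only and are not taken from Qubism.

```scala
object BagOfKeywords {
  // A sparse vector: keyword -> weight. Absent keywords are implicitly 0.0.
  type Bok = Map[String, Double]

  // Count keyword occurrences in one tokenized document.
  def fromTokens(tokens: Seq[String]): Bok =
    tokens.groupBy(identity).map { case (word, hits) => word -> hits.size.toDouble }

  // Merge two bags by summing the weights of shared keywords.
  def merge(a: Bok, b: Bok): Bok =
    b.foldLeft(a) { case (acc, (word, w)) => acc.updated(word, acc.getOrElse(word, 0.0) + w) }

  // Build one bag per grouping key (for example per domain) from (key, tokens) pairs.
  def perKey(docs: Seq[(String, Seq[String])]): Map[String, Bok] =
    docs.groupBy(_._1).map { case (key, group) =>
      key -> group.map(d => fromTokens(d._2)).foldLeft(Map.empty[String, Double])(merge)
    }
}
```

Grouping by domain, by publisher, or by a single constant key (the global bag) only changes the key passed to `perKey`.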
Aggregation is Feature Extraction
▪ Feature Extraction is Aggregation
▪ Aggregation is dimensionality reduction
▪ Lots of events reduced to a smaller number of aggregates
▪ Aggregates are more than Dashboards
▪ Graphs and charts are nice but often not actionable
▪ Generate lots of features to drive machine learning
▪ Model Development
▪ Clustering/Similarity
▪ Outliers/Indexing
In the Beginning was Brickhouse
▪ Library of Hive UDFs and UDAFs
▪ Used for generating the Klout Score
▪ Open-sourced
▪ http://github.com/klout/brickhouse
▪ Used by pipelines around the world
Next generation is Qubism
▪ Scala Spark Library
▪ Re-usable transformers (DataFrame) -> DataFrame
▪ Focus on Aggregation/Feature transformation
▪ XUnits and YPaths
• Multi-dimensional feature representation
▪ Bridge to Algebird
• (Aggregator) -> UserDefinedAggregateFunction
▪ Exotic Aggregators
• Collect - ArgMax
• Cardinality estimation - KMV, HLL sketches
• Vectors
• Timeseries
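The "(DataFrame) -> DataFrame" unit of re-use can be illustrated without Spark. In this sketch a `Table` (a list of string maps) stands in for a DataFrame, and `filterEq`, `withColumn`, and `pipeline` are hypothetical helpers, not Qubism functions. It only shows that such steps compose like ordinary functions.

```scala
object Transformers {
  // Stand-in for Spark's DataFrame so the sketch runs without Spark: a list of rows.
  type Row = Map[String, String]
  type Table = Seq[Row]
  type Transformer = Table => Table

  // Hypothetical reusable step: keep rows where a column has a given value.
  def filterEq(col: String, value: String): Transformer =
    _.filter(_.get(col).contains(value))

  // Hypothetical reusable step: add a column derived from each row.
  def withColumn(name: String)(f: Row => String): Transformer =
    _.map(row => row + (name -> f(row)))

  // Steps compose like ordinary functions, applied left to right.
  def pipeline(steps: Transformer*): Transformer =
    Function.chain(steps)
}
```

With real Spark types, the same shape would be a function from DataFrame to DataFrame, which can be applied with the `transform` method on a Dataset.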
XUnits and YPaths
XUnit strings represent slice-and-dice segments (YPaths)
Single event row explodes to multiple XUnits
Dimensions (YPaths) can be added or removed
(domain="db.com",
page="home.html",
account="1234",
country="US",
city="San Francisco",
industry="AdTech")
/page/domain=db.com
/account/id=1234
/industry/type=AdTech
/geo/country=US
/geo/country=US/city=San Francisco
/page/domain=db.com/page=home.html
/geo/country=US,/page/domain=db.com
/geo/country=US,/page/domain=db.com/page=home.html
/geo/country=US/city=San Francisco,/page/domain=db.com
/account/id=1234,/page/domain=db.com
/account/id=1234,/page/domain=db.com/page=home.html
/industry/type=AdTech,/page/domain=db.com
/industry/type=AdTech,/page/domain=db.com/page=home.html
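The explosion from one event row to many XUnit strings can be sketched in plain Scala. `YPathSpec`, `ypaths`, and `explode` are hypothetical names, not Qubism's DSL. The pairing rule is also an assumption: this sketch emits every cross-dimension pair, whereas the list above only pairs with the page dimension.

```scala
object XUnitExplode {
  type Event = Map[String, String]

  // A YPath spec: a dimension name plus the ordered (label, event field) levels under it.
  final case class YPathSpec(dimension: String, levels: Seq[(String, String)])

  // Render every prefix of a spec, e.g. /geo/country=US and /geo/country=US/city=San Francisco.
  def ypaths(spec: YPathSpec, event: Event): Seq[String] = {
    val parts = spec.levels.map { case (label, field) => event.get(field).map(v => s"${label}=${v}") }
    // Stop at the first level whose field is missing from the event.
    val present = parts.takeWhile(_.isDefined).flatten
    (1 to present.size).map(n => s"/${spec.dimension}/" + present.take(n).mkString("/"))
  }

  // One event explodes to single-YPath XUnits plus comma-joined pairs across dimensions.
  def explode(specs: Seq[YPathSpec], event: Event): Seq[String] = {
    // Sort dimensions so the same pair always renders as the same string key.
    val byDim = specs.sortBy(_.dimension).map(s => ypaths(s, event))
    val singles = byDim.flatten
    val pairs = for {
      i <- byDim.indices
      j <- byDim.indices if i < j
      a <- byDim(i)
      b <- byDim(j)
    } yield s"$a,$b"
    singles ++ pairs
  }
}
```

Adding or removing a dimension is then a change to the list of specs rather than to a table schema.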
XUnits and YPaths
▪ Event Rows exploded to multiple XUnits in map phase
▪ Annotated Rows distributed by XUnit in shuffle/sort phase
▪ XUnit aggregates produced in reduce phase
(domain="db.com",
page="home.html",
account="1234",
country="US",
city="San Francisco",
industry="AdTech")
/page/domain=db.com
/account/id=1234
/industry/type=AdTech
/geo/country=US
/geo/country=US/city=San Francisco
/page/domain=db.com/page=home.html
/geo/country=US,/page/domain=db.com
/geo/country=US,/page/domain=db.com/page=home.html
/geo/country=US/city=San Francisco,/page/domain=db.com
/account/id=1234,/page/domain=db.com
/account/id=1234,/page/domain=db.com/page=home.html
/industry/type=AdTech,/page/domain=db.com
/industry/type=AdTech,/page/domain=db.com/page=home.html
[Diagram: each event row passes through an XUnit explode step, followed by count(*) grouped by XUnit]
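The map, shuffle, and reduce phases for a simple count can be sketched with Scala collections. `countByXUnit` is a hypothetical helper and the explode function is supplied by the caller. In Spark the same shape would typically be an explode followed by a group-by and count.

```scala
object XUnitCounts {
  // Map phase: each event contributes one (xunit, 1) pair per XUnit it explodes to.
  // Shuffle and reduce: group the pairs by XUnit and sum, i.e. count(*) GROUP BY xunit.
  def countByXUnit[E](events: Seq[E])(explode: E => Seq[String]): Map[String, Long] =
    events
      .flatMap(e => explode(e).map(xunit => xunit -> 1L))
      .groupBy(_._1)
      .map { case (xunit, ones) => xunit -> ones.map(_._2).sum }
}
```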
XUnits and YPaths
Advantages
▪ Single string key to represent arbitrary segment
▪ Maps nicely to key/value stores
▪ Dimensions can be easily added or removed
▪ Simplifies table schemas
▪ Qubism provides tools for using XUnits
▪ DSL for specifying YPath dimensions
▪ Transforms for exploding/aggregating XUnit DataFrame
▪ Common operations on XUnit DataFrame
• Ranking, Outlier detection, Indexing, Clustering
▪ UDFs for parsing/manipulating XUnit strings
▪ FilterRules for controlling size of explosion
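Parsing an XUnit string back into its YPaths might look like the sketch below. It assumes values never contain `/` or `,` and that the input is well formed. `YPath`, `parseYPath`, `parseXUnit`, and `maxDimensions` are hypothetical names, and the last is only one guess at what a rule limiting the explosion could check.

```scala
object XUnitParse {
  // One YPath: a dimension (e.g. "geo") and its ordered attribute=value levels.
  final case class YPath(dimension: String, levels: Seq[(String, String)])

  // Parse a single YPath such as "/geo/country=US/city=San Francisco".
  def parseYPath(s: String): YPath = {
    val segments = s.split('/').toSeq.filter(_.nonEmpty)
    val levels = segments.drop(1).map { seg =>
      val i = seg.indexOf('=')
      seg.take(i) -> seg.drop(i + 1)
    }
    YPath(segments.head, levels)
  }

  // An XUnit is one or more YPaths joined by commas.
  def parseXUnit(xunit: String): Seq[YPath] =
    xunit.split(',').toSeq.map(parseYPath)

  // Hypothetical filter rule: cap the number of dimensions combined in one XUnit.
  def maxDimensions(n: Int)(xunit: String): Boolean =
    parseXUnit(xunit).size <= n
}
```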
Aggregator
▪ Analogous to Algebird
▪ Monoid in Category Theory
▪ Supports Associative operations
▪ Qubism implements easy transformation to Spark’s (painful) UserDefinedAggregateFunction
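A monoid-style aggregator can be sketched as a small trait. This is an illustration in the spirit of Algebird's Aggregator, not Algebird's or Qubism's actual interface, and all names are assumptions. Because `combine` is associative and `zero` is its identity, partial results from separate partitions can be merged in any grouping, which is what a distributed aggregate function needs.

```scala
object AggregatorSketch {
  // A minimal aggregator: lift each input into B, combine Bs associatively, then present C.
  trait Aggregator[A, B, C] {
    def prepare(input: A): B
    def zero: B                       // identity element: combine(zero, b) == b
    def combine(left: B, right: B): B // must be associative
    def present(reduced: B): C

    def apply(inputs: Iterable[A]): C =
      present(inputs.map(prepare).foldLeft(zero)(combine))
  }

  // Mean expressed as a monoid over (sum, count) pairs.
  val mean: Aggregator[Double, (Double, Long), Double] =
    new Aggregator[Double, (Double, Long), Double] {
      def prepare(x: Double): (Double, Long) = (x, 1L)
      def zero: (Double, Long) = (0.0, 0L)
      def combine(l: (Double, Long), r: (Double, Long)): (Double, Long) =
        (l._1 + r._1, l._2 + r._2)
      def present(b: (Double, Long)): Double = if (b._2 == 0L) 0.0 else b._1 / b._2
    }
}
```

Roughly, `prepare` plus `combine` plays the role of a per-row update, `combine` alone the role of a buffer merge, and `present` the role of the final evaluation.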
Vector - What’s the Vector, Victor?
Qubism models Sparse Vectors as Map[String,Double]
▪ Aggregate vectors by collecting keywords
▪ Merge vectors by doing vector sums
▪ UDFs for vector operators
▪ Scalar multiply, normalize
▪ Dot-product, cosine-similarity
▪ VectorBuffer
▪ Efficient data-structure for Serialization
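The vector operators listed above are short to write over Map[String,Double]. The sketch below uses illustrative names in an object called `SparseVectors`. It does not reproduce Qubism's UDFs or its VectorBuffer.

```scala
object SparseVectors {
  type Vec = Map[String, Double]

  // Vector sum: weights of shared keys add, other keys carry over.
  def sum(a: Vec, b: Vec): Vec =
    b.foldLeft(a) { case (acc, (k, w)) => acc.updated(k, acc.getOrElse(k, 0.0) + w) }

  // Scalar multiply.
  def scale(v: Vec, s: Double): Vec = v.map { case (k, w) => k -> w * s }

  // Dot product: iterate the smaller map, look up in the larger one.
  def dot(a: Vec, b: Vec): Double = {
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (k, w) => w * large.getOrElse(k, 0.0) }.sum
  }

  def norm(v: Vec): Double = math.sqrt(dot(v, v))

  // Scale to unit length; the zero vector is returned unchanged.
  def normalize(v: Vec): Vec = {
    val n = norm(v)
    if (n == 0.0) v else scale(v, 1.0 / n)
  }

  // Cosine similarity, defined as 0.0 when either vector is zero.
  def cosine(a: Vec, b: Vec): Double = {
    val d = norm(a) * norm(b)
    if (d == 0.0) 0.0 else dot(a, b) / d
  }
}
```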
KMV Sketch Set
Qubism provides an implementation of a KMV (K Minimum Values) sketch set
▪ Estimate cardinality of large sets in a fixed amount of space
▪ Exact for small reach, within 1% for sets > 10,000
▪ Jaccard set similarity
▪ Collaborative filtering
▪ LongBufferSeq provides fast merges, serialization
[Diagram: hash values on a number line from -MaxLong through 0 to +MaxLong]
Reach = (K * 2 * MaxLong) / (Kth Max Hash + MaxLong)
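A compact KMV sketch in plain Scala, assuming the reach formula reads K * 2 * MaxLong / (Kth Max Hash + MaxLong) with hashes spread over the full signed 64-bit range. The MD5-based `hash64` is a stand-in hash and the `Sketch` class is hypothetical. This is not Qubism's LongBufferSeq implementation, and the sketch makes no claim about the accuracy figure quoted above.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import scala.collection.immutable.SortedSet

object KmvSketch {
  // Stand-in 64-bit hash (first 8 bytes of MD5); any well-mixed 64-bit hash would do.
  def hash64(s: String): Long =
    ByteBuffer.wrap(MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8))).getLong

  // Keeps the k smallest distinct hashes seen so far.
  final case class Sketch(k: Int, mins: SortedSet[Long] = SortedSet.empty[Long]) {
    def add(item: String): Sketch = copy(mins = (mins + hash64(item)).take(k))

    // Merging two sketches is a set union truncated back to k values.
    def merge(other: Sketch): Sketch = copy(mins = (mins ++ other.mins).take(k))

    // Below k values the sketch is an exact set; at k, extrapolate from the largest kept hash.
    def estimate: Double =
      if (mins.size < k) mins.size.toDouble
      else k * 2.0 * Long.MaxValue / (mins.last.toDouble + Long.MaxValue.toDouble)
  }
}
```

The arithmetic is done in Double because 2 * MaxLong does not fit in a Long.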
What about Intent?
▪ Generate XUnits based on parsed document attributes
▪ Publisher, Domain, Geo, Language
▪ Aggregate Keyword Vectors per XUnit
▪ Generate Global Vectors for TF-IDF
▪ Merge Vectors over various time ranges
▪ Aggregate KMV sketches of various UUIDs
▪ Sizing
▪ Clustering and Collaborative filtering
▪ Calculate scores by comparing Vectors
▪ Cosine similarity, Dot-product
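The scoring steps can be sketched end to end: derive IDF weights from the per-XUnit keyword vectors, reweight one XUnit's vector, and compare it with a topic vector by cosine similarity. The TF-IDF variant used here (raw term weight times log(N / df)) is an assumption, as are the names `IntentScore`, `idf`, and `tfidf`.

```scala
object IntentScore {
  type Vec = Map[String, Double]

  // Inverse document frequency over per-XUnit keyword vectors: log(N / df).
  def idf(vectors: Iterable[Vec]): Vec = {
    val n = vectors.size.toDouble
    val df = vectors.flatMap(_.keys).groupBy(identity).map { case (k, hits) => k -> hits.size.toDouble }
    df.map { case (k, d) => k -> math.log(n / d) }
  }

  // Reweight a term-frequency vector; keywords unknown to the IDF table get weight 0.0.
  def tfidf(tf: Vec, idfWeights: Vec): Vec =
    tf.map { case (k, w) => k -> w * idfWeights.getOrElse(k, 0.0) }

  // Cosine similarity, defined as 0.0 when either vector is zero.
  def cosine(a: Vec, b: Vec): Double = {
    def dot(x: Vec, y: Vec): Double = x.iterator.map { case (k, w) => w * y.getOrElse(k, 0.0) }.sum
    val d = math.sqrt(dot(a, a) * dot(b, b))
    if (d == 0.0) 0.0 else dot(a, b) / d
  }
}
```

A keyword that appears in every XUnit gets an IDF of zero, so it stops contributing to the score.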
Conclusion
▪ Data Engineering is really all about aggregation
▪ You can’t sort the universe
▪ Re-use today what you did yesterday
▪ Generate as many aggregates/features as possible
▪ Gain insight by analyzing everything at once
▪ Qubism is a re-usable Scala Spark Library
▪ Unit of re-use in Data Engineering is Functional
• (DataFrame) -> DataFrame
▪ Generate and manipulate XUnits and YPaths
▪ Implement exotic and efficient Aggregators
Questions?
THANK YOU
