Qubism and NLP at Scale
Jerome Banks, Principal Big Data Engineer
March 26, 2020
AGENDA
▪ INTENT AT DEMANDBASE
▪ WHAT IS THE PROBLEM?
▪ WHAT IS THE APPROACH?
▪ QUBISM AS A SOLUTION
© 2019 DEMANDBASE|SLIDE 3
B2B Real-Time Intent
Buying Signals
4.55 Trillion
Yearly signals
80% B2B Employees
People Coverage
940 million web pages
From over 2.9 million publishers
Content Coverage
50x Scale
20x Granularity
9x More Accounts
Than Bombora
(As of 1 May 2019)
Bags O’Keywords
▪ Classical technique of NLP
▪ Bags of words are sparse vectors
▪ Represented as Map[String,Double]
▪ Task is to generate lots of BOKs (bags of keywords)
▪ Per Domain
▪ Per Publisher
▪ Globally (for TF-IDF)
▪ Combined with other possible attributes (geo, language, industry)
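A minimal sketch of a bag of keywords as a Map[String,Double], assuming a naive tokenizer that lower-cases text and splits on non-alphanumeric characters. The names BagOfKeywords, fromText and merge are hypothetical and are not taken from Qubism.

```scala
object BagOfKeywords {
  // Sparse vector: keyword -> weight (here, a raw term count).
  type Bok = Map[String, Double]

  // Assumed tokenizer: lower-case, split on anything that is not a letter or digit.
  def fromText(text: String): Bok =
    text.toLowerCase
      .split("[^a-z0-9]+")
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (term, hits) => term -> hits.length.toDouble }

  // Merging two bags is a key-wise sum, so the same operation can build
  // per-domain, per-publisher, or global bags from smaller ones.
  def merge(a: Bok, b: Bok): Bok =
    b.foldLeft(a) { case (acc, (term, w)) =>
      acc.updated(term, acc.getOrElse(term, 0.0) + w)
    }
}
```

Because merge is a plain key-wise sum, the order in which partial bags are combined does not change the result.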
Aggregation is Feature Extraction
▪ Feature Extraction is Aggregation
▪ Aggregation is dimensionality reduction
▪ Lots of events reduced to a smaller number of aggregates
▪ Aggregates are more than Dashboards
▪ Graphs and charts are nice but often not actionable
▪ Generate lots of features to drive machine learning
▪ Model Development
▪ Clustering/Similarity
▪ Outliers/Indexing
In the Beginning was Brickhouse
▪ Library of Hive UDFs and UDAFs
▪ Used for generating the Klout Score
▪ Open-sourced
▪ http://github.com/klout/brickhouse
▪ Used by pipelines around the world
Next generation is Qubism
▪ Scala Spark Library
▪ Re-usable transformers (DataFrame) -> DataFrame
▪ Focus on Aggregation/Feature transformation
▪ XUnits and YPaths
• Multi-dimensional feature representation
▪ Bridge to Algebird
• (Aggregator) -> UserDefinedAggregateFunction
▪ Exotic Aggregators
• Collect - ArgMax
• Cardinality estimation - KMV, HLL sketches
• Vectors
• Timeseries
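A rough sketch of the (DataFrame) -> DataFrame idea, written against a plain Scala stand-in so it runs without Spark. Frame, Transformer, filterEquals, withColumn and pipeline are hypothetical names for illustration only; with real Spark, functions of this shape can be applied through Dataset.transform.

```scala
object Transformers {
  // Stand-in for a Spark DataFrame: a sequence of rows keyed by column name.
  type Frame = Seq[Map[String, String]]
  // The unit of re-use: a function from frame to frame.
  type Transformer = Frame => Frame

  // Keep only rows whose column equals the given value.
  def filterEquals(col: String, value: String): Transformer =
    _.filter(_.get(col).contains(value))

  // Add (or overwrite) a column computed from each row.
  def withColumn(name: String)(f: Map[String, String] => String): Transformer =
    _.map(row => row + (name -> f(row)))

  // Function.chain applies the steps left to right.
  def pipeline(steps: Transformer*): Transformer = Function.chain(steps)
}
```

Small transformers of this kind can be tested in isolation and then chained into larger pipelines.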
XUnits and YPaths
XUnit strings represent slice-and-dice segments (YPaths)
Single event row explodes to multiple XUnits
Dimensions (YPaths) can be added or removed
(domain="db.com",
page="home.html",
account="1234",
country="US",
city="San Francisco",
industry="AdTech")
/page/domain=db.com
/account/id=1234
/industry/type=AdTech
/geo/country=US
/geo/country=US/city=San Francisco
/page/domain=db.com/page=home.html
/geo/country=US,/page/domain=db.com
/geo/country=US,/page/domain=db.com/page=home.html
/geo/country=US/city=San Francisco,/page/domain=db.com
/account/id=1234,/page/domain=db.com
/account/id=1234,/page/domain=db.com/page=home.html
/industry/type=AdTech,/page/domain=db.com
/industry/type=AdTech,/page/domain=db.com/page=home.html
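The explosion of one event row into XUnit strings can be sketched as follows. This is an illustration under assumptions, not Qubism's DSL: the YPath case class, the paths and explode functions, and the rule "emit every single path plus every pair of paths from two different dimensions" are all hypothetical. The slide's own list is a filtered subset of what this sketch would produce.

```scala
object XUnitExplode {
  // Hypothetical YPath spec: a dimension name plus ordered (label, eventField) levels.
  final case class YPath(dimension: String, levels: List[(String, String)])

  // All prefixes of one dimension, e.g. /geo/country=US and
  // /geo/country=US/city=San Francisco. Stops at the first missing field.
  def paths(event: Map[String, String], yp: YPath): List[String] = {
    val segments = yp.levels
      .map { case (label, field) => event.get(field).map(v => s"$label=$v") }
      .takeWhile(_.isDefined)
      .flatten
    (1 to segments.length).toList
      .map(n => s"/${yp.dimension}/" + segments.take(n).mkString("/"))
  }

  // Single-dimension XUnits, then every pair of paths from two different
  // dimensions, joined with a comma in sorted order.
  def explode(event: Map[String, String], yps: List[YPath]): List[String] = {
    val perDim = yps.map(paths(event, _))
    val singles = perDim.flatten
    val pairs = for {
      (left, i)  <- perDim.zipWithIndex
      (right, j) <- perDim.zipWithIndex
      if i < j
      a <- left
      b <- right
    } yield List(a, b).sorted.mkString(",")
    singles ++ pairs
  }
}
```

Sorting the paths inside a combined XUnit keeps the key stable regardless of the order in which dimensions are declared.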
XUnits and YPaths
▪ Event Rows exploded to multiple XUnits in map phase
▪ Annotated Rows distributed by XUnit in shuffle/sort phase
▪ XUnit aggregates produced in reduce phase
(domain="db.com",
page="home.html",
account="1234",
country="US",
city="San Francisco",
industry="AdTech")
/page/domain=db.com
/account/id=1234
/industry/type=AdTech
/geo/country=US
/geo/country=US/city=San Francisco
/page/domain=db.com/page=home.html
/geo/country=US,/page/domain=db.com
/geo/country=US,/page/domain=db.com/page=home.html
/geo/country=US/city=San Francisco,/page/domain=db.com
/account/id=1234,/page/domain=db.com
/account/id=1234,/page/domain=db.com/page=home.html
/industry/type=AdTech,/page/domain=db.com
/industry/type=AdTech,/page/domain=db.com/page=home.html
[Diagram: event row -> XUnit explode -> group by XUnit -> count(*)]
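The three phases can be imitated with plain Scala collections. This is a sketch only: the explode function here is a deliberately tiny, hypothetical stand-in, and a real pipeline would run the same steps as a Spark explode followed by a group-by.

```scala
object XUnitCounts {
  type Event = Map[String, String]

  // Hypothetical, minimal explode step: one XUnit per available dimension,
  // plus one combined XUnit when both country and domain are present.
  def explode(e: Event): List[String] =
    List(
      e.get("domain").map(d => s"/page/domain=$d"),
      e.get("country").map(c => s"/geo/country=$c"),
      for (c <- e.get("country"); d <- e.get("domain"))
        yield s"/geo/country=$c,/page/domain=$d"
    ).flatten

  // Map phase: flatMap each event to its XUnits.
  // Shuffle and reduce phases: group by the XUnit key, then count rows per key.
  def countByXUnit(events: Seq[Event]): Map[String, Long] =
    events
      .flatMap(explode)
      .groupBy(identity)
      .map { case (xunit, rows) => xunit -> rows.size.toLong }
}
```

Every segment is counted in one pass, because each event contributes a row to every XUnit it belongs to.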
XUnits and YPaths
Advantages
▪ Single string key to represent arbitrary segment
▪ Maps nicely to key/value stores
▪ Dimensions can be easily added or removed
▪ Simplifies table schemas
▪ Qubism provides tools for using XUnits
▪ DSL for specifying YPath dimensions
▪ Transforms for exploding/aggregating XUnit DataFrame
▪ Common operations on XUnit DataFrame
• Ranking, Outlier detection, Indexing, Clustering
▪ UDFs for parsing/manipulating XUnit strings
▪ FilterRules for controlling size of explosion
Aggregator
▪ Analogous to Algebird
▪ Monoid in Category Theory
▪ Supports Associative operations
▪ Qubism implements easy transformation to Spark’s (painful) UserDefinedAggregateFunction
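A sketch of a monoid-style aggregator, loosely in the spirit of Algebird. The trait below and its method names (zero, prepare, combine, present) are hypothetical and are not Qubism's or Algebird's actual API. The point is that an associative combine with an identity lets partial results from different partitions be merged in any grouping.

```scala
object Aggregators {
  // Hypothetical aggregator: prepare each input into a buffer, combine
  // buffers associatively (zero is the identity), then present the result.
  trait Aggregator[In, Buf, Out] {
    def zero: Buf
    def prepare(in: In): Buf
    def combine(a: Buf, b: Buf): Buf
    def present(buf: Buf): Out

    def apply(inputs: Iterable[In]): Out =
      present(inputs.map(prepare).foldLeft(zero)(combine))
  }

  // Example: a mean whose buffer is (sum, count). The mean itself is not
  // associative, but the (sum, count) pair is.
  object Mean extends Aggregator[Double, (Double, Long), Double] {
    val zero: (Double, Long) = (0.0, 0L)
    def prepare(in: Double): (Double, Long) = (in, 1L)
    def combine(a: (Double, Long), b: (Double, Long)): (Double, Long) =
      (a._1 + b._1, a._2 + b._2)
    def present(buf: (Double, Long)): Double =
      if (buf._2 == 0L) 0.0 else buf._1 / buf._2
  }
}
```

The zero, prepare, combine and present steps correspond roughly to the initialize, update, merge and evaluate callbacks that a Spark UserDefinedAggregateFunction asks for.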
Vector - What’s the Vector, Victor?
Qubism models Sparse Vectors as Map[String,Double]
▪ Aggregate vectors by collecting keywords
▪ Merge vectors by doing vector sums
▪ UDFs for vector operators
▪ Scalar multiply, normalize
▪ Dot-product, cosine-similarity
▪ VectorBuffer
▪ Efficient data-structure for Serialization
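The vector operators listed here can be sketched over Map[String,Double] in plain Scala. The object and function names are hypothetical; this does not show Qubism's UDFs or its VectorBuffer.

```scala
object SparseVectors {
  type Vec = Map[String, Double]

  // Vector sum: key-wise addition, missing keys count as 0.
  def add(a: Vec, b: Vec): Vec =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  // Scalar multiply.
  def scale(a: Vec, s: Double): Vec = a.map { case (k, v) => k -> v * s }

  // Dot product: iterate the smaller map and look up in the larger one.
  def dot(a: Vec, b: Vec): Double = {
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (k, v) => v * large.getOrElse(k, 0.0) }.sum
  }

  // Euclidean length.
  def norm(a: Vec): Double = math.sqrt(dot(a, a))

  // Scale to unit length; the zero vector is returned unchanged.
  def normalize(a: Vec): Vec = {
    val n = norm(a)
    if (n == 0.0) a else scale(a, 1.0 / n)
  }

  // Cosine similarity; defined as 0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val d = norm(a) * norm(b)
    if (d == 0.0) 0.0 else dot(a, b) / d
  }
}
```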
KMV Sketch Set
Qubism provides an implementation of a KMV sketch set
▪ Estimate cardinality of large sets in a fixed amount of space
▪ Exact for small reach, within 1% for sets > 10,000
▪ Jaccard set similarity
▪ Collaborative filtering
▪ LongBufferSeq provides fast merges, serialization
[Diagram: hash values spread over the range -MaxLong .. 0 .. +MaxLong]
Reach = (K * 2 * MaxLong) / (Kth Max Hash + MaxLong)
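A minimal KMV (K minimum values) sketch in plain Scala, using the slide's Reach estimate, K * 2 * MaxLong / (Kth Max Hash + MaxLong), where the Kth max hash is the largest of the K smallest hashes kept. Everything else is an assumption for illustration: the hash64 function built from two seeded MurmurHash3 values, the TreeSet storage, and the names Kmv and Sketch. Qubism's LongBufferSeq is not shown.

```scala
import scala.collection.immutable.TreeSet
import scala.util.hashing.MurmurHash3

object Kmv {
  // Assumed 64-bit hash: two seeded 32-bit MurmurHash3 values glued together.
  def hash64(s: String): Long =
    (MurmurHash3.stringHash(s, 17).toLong << 32) |
      (MurmurHash3.stringHash(s, 31) & 0xffffffffL)

  // Keeps only the k smallest distinct hash values seen so far.
  final case class Sketch(k: Int, mins: TreeSet[Long] = TreeSet.empty[Long]) {
    def add(item: String): Sketch = copy(mins = (mins + hash64(item)).take(k))

    // The k smallest of a union are among the k smallest of each side,
    // so merging two sketches loses nothing.
    def merge(other: Sketch): Sketch = copy(mins = (mins ++ other.mins).take(k))

    // Exact while fewer than k values are held; otherwise extrapolate from
    // how far the k-th smallest hash sits above -MaxLong.
    def estimate: Double =
      if (mins.size < k) mins.size.toDouble
      else {
        val max = Long.MaxValue.toDouble
        k * 2.0 * max / (mins.last.toDouble + max)
      }
  }
}
```

Accuracy depends on K: a larger K tightens the estimate at the cost of more space per sketch.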
What about Intent?
▪ Generate XUnits based on parsed document attributes
▪ Publisher, Domain, Geo, Language
▪ Aggregate Keyword Vectors per XUnit
▪ Generate Global Vectors for TF-IDF
▪ Merge Vectors over various time ranges
▪ Aggregate KMV sketches of various UUIDs
▪ Sizing
▪ Clustering and Collaborative filtering
▪ Calculate scores by comparing Vectors
▪ Cosine similarity, Dot-product
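One way the scoring step could look, as a sketch under assumptions: keyword counts are re-weighted with a plain log(N / df) inverse document frequency taken from a global vector, then compared by cosine similarity. The weighting formula and the names IntentScore, tfIdf and score are assumptions; the slides do not say which TF-IDF variant is used.

```scala
object IntentScore {
  type Vec = Map[String, Double]

  // tf: keyword counts for one XUnit.
  // docFreq: global vector, keyword -> number of documents containing it.
  // numDocs: total number of documents. Unknown keywords are treated as df = 1.
  def tfIdf(tf: Vec, docFreq: Vec, numDocs: Double): Vec =
    tf.map { case (term, count) =>
      val df = math.max(docFreq.getOrElse(term, 1.0), 1.0)
      term -> count * math.log(numDocs / df)
    }

  private def dot(a: Vec, b: Vec): Double =
    a.iterator.map { case (k, v) => v * b.getOrElse(k, 0.0) }.sum

  // Cosine similarity between two weighted vectors; 0 if either is empty.
  def score(a: Vec, b: Vec): Double = {
    val d = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if (d == 0.0) 0.0 else dot(a, b) / d
  }
}
```

With this weighting, a keyword that appears in every document gets weight 0 and drops out of the comparison.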
Conclusion
▪ Data Engineering is really all about aggregation
▪ You can’t sort the universe
▪ Re-use today what you did yesterday
▪ Generate as many aggregates/features as possible
▪ Gain insight by analyzing everything at once
▪ Qubism is a re-usable Scala Spark Library
▪ Unit of re-use in Data Engineering is functional
• (DataFrame) -> DataFrame
▪ Generate and manipulate XUnits and YPaths
▪ Implement exotic and efficient Aggregators
Questions?
THANK YOU
