SlideShare a Scribd company logo
2/1/17  ©  2016  CareerBuilder
How to Integrate Spark MLlib and Apache
Solr to Build Real-Time Entity Type
Recognition System for Better Query
Understanding
Walid Shalaby, Khalifeh AlJadda,
Mohammed Korayem, Trey Grainger
Khalifeh AlJadda
Lead Data Scientist, Search Data Science
•  Joined CareerBuilder in 2013
•  PhD, Computer Science – University of Georgia
•  BSc, MSc, Computer Science, Jordan University of Science and Technology
Activities:
•  Conference Chair of Southern Data Science Conf (www.southerndatascience.com)
•  Founder and Chairman of CB Data Science Council
•  Frequent public speaker in data science and Big Data conferences.
•  Creator of GELATO (Glycomic Elucidation and Annotation Tool)
About	
  Me	
  
Search Technology at CareerBuilder by Numbers
3	
  
Powering	
  50+	
  Search	
  Experiences	
  Including:	
  
100million +
Searches	
  per	
  day	
  
30+
SoAware	
  Developers,	
  Data	
  
ScienEsts	
  +	
  Analysts	
  
500+
Search	
  Servers	
  
1,5billion +
Documents	
  indexed	
  and	
  
searchable	
  
1Global	
  Search	
  	
  
Technology	
  plaIorm	
  
...and  many  more
Agenda:
•  Problem & Challenges
•  Proposed System
•  Experiments and Results
•  Conclusions
•  Identifying regions of text corresponding to entities
•  Categorizing recognized entities into predefined classes
Entity Type Recognition (ETR)
5
Source:  Ward,  Ma?hew  O.,  Georges  Grinstein,  and  Daniel  Keim.  Interac(ve  data  visualiza(on:  founda(ons,  techniques,  and  applica(ons.  CRC  Press,  2010.
•  Understand entity types in search queries
•  Company, Job titles, School, Skill
•  Search queries are short, context-less
•  “data scientist hadoop careerbuilder”.
•  Some entities have multiple surface forms
•  university of north carolina charlotte, unc charlotte, uncc
•  Domain specific entities
•  registered nurse (rn), licensed practical nurse (lpn), director of nurse (don)
Problem & Challenges
6
•  Wikipedia,Wikipedia,Wikipedia
•  Title and first paragraph
•  Title and categories
•  Infobox
•  DBpedia, Freebase
Prior Work
7
•  Entities with no page/entry (java developer)
•  Non-standard categories (skill)
•  Domain specific knowledge (e.g., job posts)
Limitations
8
Enrich entity representation using 4 clues (features):
•  Real-world knowledge (Wikipedia)
•  Domain specific knowledge (job posts)
•  Ontology (DBpedia)
•  Lexical DB (Wordnet)
Methodology
9
Architecture
10
Query  Parser
Query
Phrase  IdenOfier
(Bayes)
[e1,e2…]
Wiki  
Index
DBpedia
 WordNet
Ontological  
features
Word  Embeddings  Vector
Classifier
 EnOty  Type
Offline
•  Wikipedia index (title, length, text, categories)
•  Word2Vec trained on ~100M job posts
•  SVM classifier
Our System
11
Online
•  top 3 hits + 10 fragments for each + categories
•  Word2Vec synset (most similar 20 words)
•  tf-idf vector
•  is_company (binary feature)
•  is_agent_noun (binary feature)
Offline
•  Wikipedia index (title, length, text, categories)
•  Word2Vec trained on ~25m job posts
•  SVM classifier
Our System
12
Online
•  top 3 hits + 10 fragments for each + categories
•  Word2Vec synset (most similar 20 words)
•  tf-idf vector
•  is_company (binary feature)
•  has_agent_noun (binary feature)
Offline
Architecture
13
Query
Query  Parser
Phrase  IdenOfier
(Bayes)
Wiki  
Index
DBpedia
Ontological  
features
WordNet
Word  Embeddings  Vector
Classifier
 EnOty  Type
[e1,e2…]
Online
Architecture
14
Query
Query  Parser
Phrase  IdenOfier
(Bayes)
Classifier
 EnOty  Type
[e1,e2…]
Wiki  
Features
Synset
Lexical  
and  
Onotolog
icl
•  IBM <company> •  LPN <job title>
Context Features
15
The	
  InternaEonal	
  Business	
  Machines	
  CorporaEon	
  (IBM)	
  is	
  an	
  American	
  
mul'na'onal	
  technology	
  and	
  consul'ng	
  corpora'on,	
  with	
  
headquarters	
  in	
  Armonk,	
  New	
  York.	
  IBM	
  manufactures	
  and	
  
markets	
  computer	
  hardware	
  had	
  originated	
  with	
  CTR's	
  Canadian	
  subsidiary.	
  
The	
  iniEalism	
  IBM	
  followed.	
  SecuriEes	
  analysts.	
  	
  In	
  2012,	
  'Fortune'	
  ranked	
  IBM	
  the	
  
No.	
  2	
  largest	
  U.S.	
  firm	
  in	
  terms	
  of	
  number	
  of	
  employees	
  (435,000	
  	
  ('Barron's'),	
  No.	
  5	
  
most	
  admired	
  company	
  ('Fortune'),	
  and	
  No.	
  18	
  most	
  innovaEve	
  company	
  
('Fast	
  Company').	
  The	
  company	
  held	
  the	
  record	
  to	
  Lenovo	
  (2005,	
  2014).	
  In	
  
2014	
  IBM	
  announced	
  that	
  it	
  would	
  go	
  'fabless'	
  by	
  offloading	
  IBM	
  Micro	
  Electronics	
  
form	
  the	
  core	
  of	
  what	
  would	
  become	
  InternaEonal	
  Business	
  Machines	
  (IBM).	
  Julius	
  
E.	
  Pitrat	
  patented	
  was	
  renamed	
  the	
  'InternaEonal	
  Business	
  Machines	
  
Corpora'on'	
  (IBM),	
  ciEng	
  the	
  need	
  to	
  align	
  its	
  name	
  the	
  company	
  	
  
produced	
  small	
  arms	
  ...	
  
Lee	
  Presson	
  and	
  the	
  Nails	
  (also	
  known	
  as	
  LPN)	
  is	
  a	
  swing	
  band	
  that	
  
formed	
  in	
  the	
  San	
  Francisco.	
  As	
  of	
  2010,	
  the	
  band	
  has	
  released	
  five	
  albums.	
  
LPN	
  differenEated	
  themselves	
  from	
  of	
  band	
  leader	
  Lee	
  Presson…	
  
•  IBM <company> •  LPN <job title>
Embedding Features (Synsets)
16
fusion
iis 
jms 
virtualizaOon  
emc 
weblogic 
atg 
esb        
mq 
nosql 
voip 
Obco 
jboss  
Tivoli 
hadoop 
avaya 
citrix  
tomcat  hp 
websphere
pracOOoner 
lvn 
nurse  
registered 
rn  vocaOonal  
psychologist 
midwife            aide      can  
licensed 
licensure 
icu  
psych 
arnp
•  2 baselines vs. combinations of various representations
•  10 fold cross-validation
•  Report P, R, and micro-averaged F1
•  Data set (177K entities)
Evaluation
17
Category Number of instances
Company 40000+
Job Title 3500+
School 100000+
Skill 25000+
Baseline Results
18
Category/Model Company Job Title School Skill
Bow 85.19 87.42 96.59 76.57
wikiw 5.35 2.24 1.06 11.18
•  Absolute increase (F1)
•  Absolute decrease (F1)
Context Features Results
19
Category/Model Company Job Title School Skill
Bow 85.19 87.42 96.59 76.57
wikiw 5.35 2.24 1.06 11.18
wikix 10.79 -0.14 1.93 15.64
•  Absolute increase (F1)
•  Absolute decrease (F1)
Context + Embedding Features Results
20
Category/Model Company Job Title School Skill
Bow 85.19 87.42 96.59 76.57
wikiw 5.35 2.24 1.06 11.18
wikix 10.79 -0.14 1.93 15.64
wikix + jobw 10.99 2.70 2.09 16.24
•  Absolute increase (F1)
•  Absolute decrease (F1)
Context + Embedding + Ontology + Lexical Features Results
21
Category/Model Company Job Title School Skill
Bow 85.19 87.42 96.59 76.57
wikiw 5.35 2.24 1.06 11.18
wikix 10.79 -0.14 1.93 15.64
wikix + jobw 10.99 2.70 2.09 16.24
wikix + jobw+ ont+ lex 11.37 3.15 2.10 16.57
•  Absolute increase (F1)
•  Absolute decrease (F1)
•  Effective approach for ETR of search query entities
•  Tailored to the job search and recruitment domain
•  Combine real-world and domain-specific knowledge
•  The ensemble entity representation
•  contextual information using Wikipedia
•  semantic information in millions of job postings
•  class type in DBpedia for Company entities
•  linguistic properties in WordNet for Job Title entities
•  Ensemble features gave 97% micro-averaged F1
•  Online ETR takes 30ms per entity type request
Conclusion
22
©  2016    CareerBuilder
Thank you!
www.aljadda.com
Twitter: @aljadda

More Related Content

What's hot

Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Lessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsLessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark Workloads
Databricks
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Databricks
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Databricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Databricks
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
Spark Summit
 
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Databricks
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
Databricks
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
Andy Petrella
 
Spark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteSpark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin Keynote
Databricks
 
Databricks @ Strata SJ
Databricks @ Strata SJDatabricks @ Strata SJ
Databricks @ Strata SJ
Databricks
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Databricks
 
Introduction to TitanDB
Introduction to TitanDB Introduction to TitanDB
Introduction to TitanDB
Knoldus Inc.
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
Data Con LA
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
Databricks
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Databricks
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
Databricks
 

What's hot (20)

Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Lessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsLessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark Workloads
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
 
Spark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteSpark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin Keynote
 
Databricks @ Strata SJ
Databricks @ Strata SJDatabricks @ Strata SJ
Databricks @ Strata SJ
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
 
Introduction to TitanDB
Introduction to TitanDB Introduction to TitanDB
Introduction to TitanDB
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 

Viewers also liked

Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
Databricks
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Spark Summit
 
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Spark Summit
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Spark Summit
 
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Spark Summit
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Spark Summit
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
Spark Summit
 
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Spark Summit
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
Spark Summit
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
Spark Summit
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Spark Summit
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark Summit
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Spark Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Spark Summit
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Spark Summit
 

Viewers also liked (20)

Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
 
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
 
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
 

Similar to How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type Recognition System for Better Query Understanding: Spark Summit East talk by Khalifeh Aljadda

An Overview of Deserialization Vulnerabilities in the Java Virtual Machine (J...
An Overview of Deserialization Vulnerabilities in the Java Virtual Machine (J...An Overview of Deserialization Vulnerabilities in the Java Virtual Machine (J...
An Overview of Deserialization Vulnerabilities in the Java Virtual Machine (J...
joaomatosf_
 
From keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchFrom keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic search
CareerBuilder.com
 
Graphs fun vjug2
Graphs fun vjug2Graphs fun vjug2
Graphs fun vjug2
Neo4j
 
Intro to xAPI - TORRANCE - LTUK20
Intro to xAPI - TORRANCE - LTUK20Intro to xAPI - TORRANCE - LTUK20
Intro to xAPI - TORRANCE - LTUK20
TorranceLearning
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
Paco Nathan
 
Feature driven agile oriented web applications
Feature driven agile oriented web applicationsFeature driven agile oriented web applications
Feature driven agile oriented web applications
Ram G Athreya
 
Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...
Artificial Intelligence Institute at UofSC
 
Information Governance and ediscovery in office 365 ediscovery deep dive
Information Governance and ediscovery in office 365 ediscovery deep diveInformation Governance and ediscovery in office 365 ediscovery deep dive
Information Governance and ediscovery in office 365 ediscovery deep dive
bilgore
 
Shall we search? Lviv.
Shall we search? Lviv. Shall we search? Lviv.
Shall we search? Lviv.
Vira Povkh
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine Learning
Paco Nathan
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
Paco Nathan
 
Structure, Personalization, Scale: A Deep Dive into LinkedIn Search
Structure, Personalization, Scale: A Deep Dive into LinkedIn SearchStructure, Personalization, Scale: A Deep Dive into LinkedIn Search
Structure, Personalization, Scale: A Deep Dive into LinkedIn Search
C4Media
 
PHP/MySQL Training Course in Delhi, India by IT People
PHP/MySQL Training Course in Delhi, India by IT PeoplePHP/MySQL Training Course in Delhi, India by IT People
PHP/MySQL Training Course in Delhi, India by IT People
Abhishekve
 
A Brief History of e-Learning Standards in the United States
A Brief History of e-Learning Standards in the United StatesA Brief History of e-Learning Standards in the United States
A Brief History of e-Learning Standards in the United States
Eytan Klawer
 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customers
richwig
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
Vladimir Alexiev, PhD, PMP
 
xAPI Intro for Instructional Designers Learning While Working 2019
xAPI Intro for Instructional Designers Learning While Working 2019xAPI Intro for Instructional Designers Learning While Working 2019
xAPI Intro for Instructional Designers Learning While Working 2019
TorranceLearning
 
MWLUG 2014: Modern Domino (workshop)
MWLUG 2014: Modern Domino (workshop)MWLUG 2014: Modern Domino (workshop)
MWLUG 2014: Modern Domino (workshop)
Peter Presnell
 
Data-driven Approach to Launching your Career
Data-driven Approach to Launching your CareerData-driven Approach to Launching your Career
Data-driven Approach to Launching your Career
Viral Kadakia
 
SDWest2005Goetsch
SDWest2005GoetschSDWest2005Goetsch
SDWest2005Goetsch
Mark Goetsch
 

Similar to How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type Recognition System for Better Query Understanding: Spark Summit East talk by Khalifeh Aljadda (20)

An Overview of Deserialization Vulnerabilities in the Java Virtual Machine (J...
An Overview of Deserialization Vulnerabilities in the Java Virtual Machine (J...An Overview of Deserialization Vulnerabilities in the Java Virtual Machine (J...
An Overview of Deserialization Vulnerabilities in the Java Virtual Machine (J...
 
From keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchFrom keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic search
 
Graphs fun vjug2
Graphs fun vjug2Graphs fun vjug2
Graphs fun vjug2
 
Intro to xAPI - TORRANCE - LTUK20
Intro to xAPI - TORRANCE - LTUK20Intro to xAPI - TORRANCE - LTUK20
Intro to xAPI - TORRANCE - LTUK20
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
 
Feature driven agile oriented web applications
Feature driven agile oriented web applicationsFeature driven agile oriented web applications
Feature driven agile oriented web applications
 
Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...
 
Information Governance and ediscovery in office 365 ediscovery deep dive
Information Governance and ediscovery in office 365 ediscovery deep diveInformation Governance and ediscovery in office 365 ediscovery deep dive
Information Governance and ediscovery in office 365 ediscovery deep dive
 
Shall we search? Lviv.
Shall we search? Lviv. Shall we search? Lviv.
Shall we search? Lviv.
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine Learning
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
 
Structure, Personalization, Scale: A Deep Dive into LinkedIn Search
Structure, Personalization, Scale: A Deep Dive into LinkedIn SearchStructure, Personalization, Scale: A Deep Dive into LinkedIn Search
Structure, Personalization, Scale: A Deep Dive into LinkedIn Search
 
PHP/MySQL Training Course in Delhi, India by IT People
PHP/MySQL Training Course in Delhi, India by IT PeoplePHP/MySQL Training Course in Delhi, India by IT People
PHP/MySQL Training Course in Delhi, India by IT People
 
A Brief History of e-Learning Standards in the United States
A Brief History of e-Learning Standards in the United StatesA Brief History of e-Learning Standards in the United States
A Brief History of e-Learning Standards in the United States
 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customers
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
 
xAPI Intro for Instructional Designers Learning While Working 2019
xAPI Intro for Instructional Designers Learning While Working 2019xAPI Intro for Instructional Designers Learning While Working 2019
xAPI Intro for Instructional Designers Learning While Working 2019
 
MWLUG 2014: Modern Domino (workshop)
MWLUG 2014: Modern Domino (workshop)MWLUG 2014: Modern Domino (workshop)
MWLUG 2014: Modern Domino (workshop)
 
Data-driven Approach to Launching your Career
Data-driven Approach to Launching your CareerData-driven Approach to Launching your Career
Data-driven Approach to Launching your Career
 
SDWest2005Goetsch
SDWest2005GoetschSDWest2005Goetsch
SDWest2005Goetsch
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
davidpietrzykowski1
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
GeorgiiSteshenko
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 

Recently uploaded (20)

Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 

How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type Recognition System for Better Query Understanding: Spark Summit East talk by Khalifeh Aljadda

  • 1. 2/1/17  ©  2016  CareerBuilder How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type Recognition System for Better Query Understanding Walid Shalaby, Khalifeh AlJadda, Mohammed Korayem, Trey Grainger
  • 2. Khalifeh AlJadda Lead Data Scientist, Search Data Science •  Joined CareerBuilder in 2013 •  PhD, Computer Science – University of Georgia •  BSc, MSc, Computer Science, Jordan University of Science and Technology Activities: •  Conference Chair of Southern Data Science Conf (www.southerndatascience.com) •  Founder and Chairman of CB Data Science Council •  Frequent public speaker in data science and Big Data conferences. •  Creator of GELATO (Glycomic Elucidation and Annotation Tool) About  Me  
  • 3. Search Technology at CareerBuilder by Numbers 3   Powering  50+  Search  Experiences  Including:   100million + Searches  per  day   30+ SoAware  Developers,  Data   ScienEsts  +  Analysts   500+ Search  Servers   1,5billion + Documents  indexed  and   searchable   1Global  Search     Technology  plaIorm   ...and  many  more
  • 4. Agenda: •  Problem & Challenges •  Proposed System •  Experiments and Results •  Conclusions
  • 5. •  Identifying regions of text corresponding to entities •  Categorizing recognized entities into predefined classes Entity Type Recognition (ETR) 5 Source:  Ward,  Ma?hew  O.,  Georges  Grinstein,  and  Daniel  Keim.  Interac(ve  data  visualiza(on:  founda(ons,  techniques,  and  applica(ons.  CRC  Press,  2010.
  • 6. •  Understand entity types in search queries •  Company, Job titles, School, Skill •  Search queries are short, context-less •  “data scientist hadoop careerbuilder”. •  Some entities have multiple surface forms •  university of north carolina charlotte, unc charlotte, uncc •  Domain specific entities •  registered nurse (rn), licensed practical nurse (lpn), director of nurse (don) Problem & Challenges 6
  • 7. •  Wikipedia,Wikipedia,Wikipedia •  Title and first paragraph •  Title and categories •  Infobox •  DBpedia, Freebase Prior Work 7
  • 8. •  Entities with no page/entry (java developer) •  Non-standard categories (skill) •  Domain specific knowledge (e.g., job posts) Limitations 8
  • 9. Enrich entity representation using 4 clues (features): •  Real-world knowledge (Wikipedia) •  Domain specific knowledge (job posts) •  Ontology (DBpedia) •  Lexical DB (Wordnet) Methodology 9
  • 10. Architecture 10 Query  Parser Query Phrase  IdenOfier (Bayes) [e1,e2…] Wiki   Index DBpedia WordNet Ontological   features Word  Embeddings  Vector Classifier EnOty  Type
  • 11. Offline •  Wikipedia index (title, length, text, categories) •  Word2Vec trained on ~100M job posts •  SVM classifier Our System 11 Online •  top 3 hits + 10 fragments for each + categories •  Word2Vec synset (most similar 20 words) •  tf-idf vector •  is_company (binary feature) •  is_agent_noun (binary feature)
  • 12. Offline •  Wikipedia index (title, length, text, categories) •  Word2Vec trained on ~25m job posts •  SVM classifier Our System 12 Online •  top 3 hits + 10 fragments for each + categories •  Word2Vec synset (most similar 20 words) •  tf-idf vector •  is_company (binary feature) •  has_agent_noun (binary feature)
  • 13. Offline Architecture 13 Query Query  Parser Phrase  IdenOfier (Bayes) Wiki   Index DBpedia Ontological   features WordNet Word  Embeddings  Vector Classifier EnOty  Type [e1,e2…]
  • 14. Online Architecture 14 Query Query  Parser Phrase  IdenOfier (Bayes) Classifier EnOty  Type [e1,e2…] Wiki   Features Synset Lexical   and   Onotolog icl
  • 15. •  IBM <company> •  LPN <job title> Context Features 15 The  InternaEonal  Business  Machines  CorporaEon  (IBM)  is  an  American   mul'na'onal  technology  and  consul'ng  corpora'on,  with   headquarters  in  Armonk,  New  York.  IBM  manufactures  and   markets  computer  hardware  had  originated  with  CTR's  Canadian  subsidiary.   The  iniEalism  IBM  followed.  SecuriEes  analysts.    In  2012,  'Fortune'  ranked  IBM  the   No.  2  largest  U.S.  firm  in  terms  of  number  of  employees  (435,000    ('Barron's'),  No.  5   most  admired  company  ('Fortune'),  and  No.  18  most  innovaEve  company   ('Fast  Company').  The  company  held  the  record  to  Lenovo  (2005,  2014).  In   2014  IBM  announced  that  it  would  go  'fabless'  by  offloading  IBM  Micro  Electronics   form  the  core  of  what  would  become  InternaEonal  Business  Machines  (IBM).  Julius   E.  Pitrat  patented  was  renamed  the  'InternaEonal  Business  Machines   Corpora'on'  (IBM),  ciEng  the  need  to  align  its  name  the  company     produced  small  arms  ...   Lee  Presson  and  the  Nails  (also  known  as  LPN)  is  a  swing  band  that   formed  in  the  San  Francisco.  As  of  2010,  the  band  has  released  five  albums.   LPN  differenEated  themselves  from  of  band  leader  Lee  Presson…  
  • 16. •  IBM <company> •  LPN <job title> Embedding Features (Synsets) 16 fusion iis jms virtualizaOon   emc weblogic atg esb         mq nosql voip Obco jboss   Tivoli hadoop avaya citrix   tomcat  hp websphere pracOOoner lvn nurse   registered rn  vocaOonal   psychologist midwife            aide      can   licensed licensure icu   psych arnp
  • 17. •  2 baselines vs. combinations of various representations •  10 fold cross-validation •  Report P, R, and micro-averaged F1 •  Data set (177K entities) Evaluation 17 Category Number of instances Company 40000+ Job Title 3500+ School 100000+ Skill 25000+
  • 18. Baseline Results 18 Category/Model Company Job Title School Skill Bow 85.19 87.42 96.59 76.57 wikiw 5.35 2.24 1.06 11.18 •  Absolute increase (F1) •  Absolute decrease (F1)
  • 19. Context Features Results 19 Category/Model Company Job Title School Skill Bow 85.19 87.42 96.59 76.57 wikiw 5.35 2.24 1.06 11.18 wikix 10.79 -0.14 1.93 15.64 •  Absolute increase (F1) •  Absolute decrease (F1)
  • 20. Context + Embedding Features Results 20 Category/Model Company Job Title School Skill Bow 85.19 87.42 96.59 76.57 wikiw 5.35 2.24 1.06 11.18 wikix 10.79 -0.14 1.93 15.64 wikix + jobw 10.99 2.70 2.09 16.24 •  Absolute increase (F1) •  Absolute decrease (F1)
  • 21. Context + Embedding + Ontology + Lexical Features Results 21 Category/Model Company Job Title School Skill Bow 85.19 87.42 96.59 76.57 wikiw 5.35 2.24 1.06 11.18 wikix 10.79 -0.14 1.93 15.64 wikix + jobw 10.99 2.70 2.09 16.24 wikix + jobw+ ont+ lex 11.37 3.15 2.10 16.57 •  Absolute increase (F1) •  Absolute decrease (F1)
  • 22. •  Effective approach for ETR of search query entities •  Tailored to the job search and recruitment domain •  Combine real-world and domain-specific knowledge •  The ensemble entity representation •  contextual information using Wikipedia •  semantic information in millions of job postings •  class type in DBpedia for Company entities •  linguistic properties in WordNet for Job Title entities •  Ensemble features gave 97% micro-averaged F1 •  Online ETR takes 30ms per entity type request Conclusion 22
  • 23. ©  2016    CareerBuilder Thank you! www.aljadda.com Twitter: @aljadda