RELXSearchSummIt2018
Content Engineering Tutorial
Sujit Pal, Elsevier Labs
September 25-27, 2018
Outline
• Background (Vector Space Model / Lucene)
• Basic Text Processing
• Keyword Extraction
• Content Search and Discovery Enhancement
• Dimensionality Reduction
• Ontology Construction
• Content based Recommendations
Pre-requisites
• Notebooks and web tool available at https://github.com/sujitpal/content-engineering-tutorial
• See data/README.md and models/README.md for the locations of the data and models to download.
• Parts of the tutorial depend on Solr 7.x; please download and install it.
• Data and models (along with all processing) available at
• https://drive.google.com/file/d/1uGbrGu5v9yaUKB_26oz_asKe2-SRGtsw/view?usp=sharing
• https://drive.google.com/file/d/1xTB2Qx6roKYxUCWR_uTu4pwGp4iW-rY7/view?usp=sharing
• Code mostly written using Anaconda Python 3, see requirements.txt for full
list of additional libraries needed.
Vector Space Model
• A collection of documents can be thought of as a term-document matrix.
• Vector space – each token represents a dimension, weight represents
the value along that dimension for the document.
• Term Frequency, Inverse Document Frequency
• Inverted Index for Search
• Dimensionality Reduction as a useful technique for extracting
semantics of data.
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/00-vector-space-model.ipynb
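The ideas on this slide can be sketched in a few lines of plain Python. This is only an illustration on a made-up three-document corpus, not the notebook's code; the `docs` contents, the whitespace tokenizer and the unsmoothed `log(N / df)` IDF are all assumptions (Lucene uses a smoothed variant).

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not the NIPS data used in the tutorial).
docs = {
    "d1": "neural network learns features",
    "d2": "neural network with attention",
    "d3": "kernel methods for regression",
}

def tokenize(text):
    return text.lower().split()

# Term frequencies per document: one column of the term-document matrix each.
tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

# Document frequency: number of documents that contain each term.
df = Counter(term for counts in tf.values() for term in counts)

def idf(term):
    # Plain log(N / df), chosen here for simplicity.
    return math.log(len(docs) / df[term])

def tfidf_vector(doc_id):
    # Sparse vector: one dimension per token, TF-IDF weight as the value.
    return {term: count * idf(term) for term, count in tf[doc_id].items()}

# Inverted index: term -> list of documents containing it.
inverted_index = {}
for doc_id, counts in tf.items():
    for term in counts:
        inverted_index.setdefault(term, []).append(doc_id)
```

Looking up `inverted_index["neural"]` returns the documents to score for that query term, and `tfidf_vector` gives a rarer term such as "learns" more weight than a common one such as "neural".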
Processing Data
• Preprocessing NIPS data for Solr
• Solr Field Types
• Solr Analyzer Chain
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/01-preprocess.ipynb
• Baseline search implemented
• http://localhost:5000/search0
• Entire search string must appear in title or body.
Search Improvement – tokenize query
• Baseline search may be too restrictive.
• Example: “neural network with attention mechanism” will return 0 results.
• Solution – tokenize search string and do OR query with each token.
• Example in tool - http://localhost:5000/search1
• High recall – too high?
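The difference between the baseline and the tokenized search can be sketched as two query builders. This is a guess at the shape of the queries rather than the tool's actual code; the field names `title` and `body` and both helper functions are assumptions, and real input would also need Lucene special characters escaped.

```python
def phrase_query(text, fields=("title", "body")):
    """Baseline: the whole search string must match as a phrase in some field."""
    escaped = text.replace('"', '\\"')
    return " OR ".join('{}:"{}"'.format(field, escaped) for field in fields)

def token_or_query(text, fields=("title", "body")):
    """Relaxed: any single token may match in any field."""
    tokens = text.lower().split()
    clauses = ["{}:{}".format(field, token) for token in tokens for field in fields]
    return " OR ".join(clauses)
```

Every extra token adds more OR clauses, so the result set can only grow, which is why recall goes up and precision tends to drop.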
Exploring Data
• Word Counts
• Figuring out what to keep and what to discard
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/02-wordcounts.ipynb
Custom Stopwords
• Removing stopwords may be a way to make the tokenized OR search less noisy?
• Leveraging IDF to detect potential stopwords for the corpus.
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/03-stopwords.ipynb
• See it applied in tool - http://localhost:5000/search2
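One way to read "leveraging IDF" is that terms appearing in nearly every document have an IDF close to zero and are stopword candidates for this corpus. The sketch below assumes that reading; the `max_idf` threshold, the function name and the toy corpus are illustrative choices, not values from the notebook.

```python
import math
from collections import Counter

def low_idf_terms(tokenized_docs, max_idf):
    """Return (term, idf) pairs whose IDF is at or below max_idf, lowest first."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    candidates = [(term, score) for term, score in idf.items() if score <= max_idf]
    return sorted(candidates, key=lambda pair: (pair[1], pair[0]))

# Hypothetical mini-corpus: "we" and "a" occur in every document.
corpus = [
    "we propose a neural model".split(),
    "we show a kernel method".split(),
    "we analyze a neural sampler".split(),
]
stopword_candidates = [term for term, _ in low_idf_terms(corpus, max_idf=0.1)]
```

The candidates still deserve a manual look before being added to the stopword list, since a domain term can also occur in almost every document.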
How about keywords?
• Detect keywords in text using Log Likelihood Ratio (LLR)
• Frequent bigrams (can be extended to longer grams)
• Likely collocations – Log Likelihood Ratio method
• General family of statistical methods – others are Chi-squared, t-test, etc.
• Notebooks
• Generating Frequent Bigrams - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/04-bigrams.ipynb
• Finding likely bigrams - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/05-bigrams-llr.ipynb
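For readers who have not met the Log Likelihood Ratio test, here is a minimal sketch of Dunning's G² statistic applied to bigram counts. It is not taken from the notebooks; the `score_bigrams` helper and the toy token list are assumptions made for illustration.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of (w1, w2); k12: (w1, not w2); k21: (not w1, w2); k22: neither.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    score = 0.0
    for observed, row, col in cells:
        if observed > 0:  # a zero cell contributes nothing
            score += observed * math.log(observed * total / (row * col))
    return 2.0 * score

def score_bigrams(tokens):
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        first[w1] += count
        second[w2] += count
    scores = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11
        k21 = second[w2] - k11
        k22 = n - k11 - k12 - k21
        scores[(w1, w2)] = llr(k11, k12, k21, k22)
    return scores

# Hypothetical token stream in which "neural network" recurs.
tokens = "neural network a b neural network c d neural network".split()
scores = score_bigrams(tokens)
```

A high score only says the two words are far from independent, so in practice it is usually combined with a minimum frequency cut-off, as the "frequent bigrams" step suggests.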
Other keyword detection algorithms
• RAKE – rule based, unsupervised.
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/06-rake.ipynb
• MAUI – machine learning based
• Provide text and keywords as labels for training – used combination of
keywords from bigrams LLR and RAKE.
• Apply trained MAUI model on same text to predict more keywords
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/07-maui.ipynb
• Keywords from LLR, RAKE and MAUI were merged and manually curated.
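To make the rule-based side concrete, the following is a compact sketch of the RAKE scoring scheme: split text into candidate phrases at stopwords and punctuation, score each word as degree divided by frequency, and sum word scores per phrase. The tiny `STOPWORDS` set and the sample text are assumptions, and the notebook presumably relies on an existing RAKE library rather than code like this.

```python
import re
from collections import defaultdict

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}

def candidate_phrases(text):
    """Split on punctuation, then on stopwords, keeping runs of content words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake(text):
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {word: degree[word] / freq[word] for word in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))

sample = ("Attention mechanism for neural machine translation. "
          "The attention mechanism is simple.")
ranked = rake(sample)
```

Longer phrases built from words that rarely appear alone rise to the top, which is why "neural machine translation" outranks "attention mechanism" in this sample.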
Removing near-duplicate keywords
• Using simhash algorithm to detect keywords that are very close to
each other, e.g., data model and data models.
• Uses hashing techniques (min-hash, sim-hash)
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/08-keyword-neardup.ipynb
• Using dedupe algorithm to detect keywords that are semantically
equivalent, e.g., data set and dataset.
• Uses Machine Learning
• Training code (active learning) - https://github.com/sujitpal/content-engineering-tutorial/blob/master/scripts/dedupe_keyword_train.py
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/09-keyword-dedupe.ipynb
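The simhash idea can be sketched with the standard library alone. The version below is an assumption-laden illustration (character trigrams as features, MD5 as the hash, a 64-bit fingerprint) and is not the implementation used in the notebook.

```python
import hashlib

def shingles(text, n=3):
    """Character n-grams; a short string yields a single shingle."""
    text = text.lower()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def simhash(text, bits=64):
    weights = [0] * bits
    for shingle in shingles(text):
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    # Keep the bits where positive votes won.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Keywords whose fingerprints differ in only a few bits, such as "data model" and "data models", become near-duplicate candidates; the cut-off distance has to be tuned on real data.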
Incorporating keywords into search
• Detect keywords in search query using Aho-Corasick algorithm.
• Solr supports Aho-Corasick via SolrTextTagger.
• Expand query to incorporate keywords
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/11-query-parsing-expansion.ipynb
• See it applied in tool - http://localhost:5000/search4
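Aho-Corasick finds every occurrence of every dictionary keyword in a single pass over the query. The sketch below is a plain character-level implementation meant only to show the mechanics; the tutorial uses SolrTextTagger for this step, and the function names here are assumptions.

```python
from collections import deque

def build_automaton(keywords):
    # Per state: outgoing transitions, failure link, keywords ending here.
    goto, fail, out = [{}], [0], [[]]
    for keyword in keywords:
        state = 0
        for ch in keyword:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(keyword)
    # Breadth-first pass sets failure links, shallow states first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_keywords(text, automaton):
    """Return (start_offset, keyword) for every match, in scan order."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for keyword in out[state]:
            matches.append((i - len(keyword) + 1, keyword))
    return matches

automaton = build_automaton(["neural network", "network", "attention mechanism"])
matches = find_keywords("neural network with attention mechanism", automaton)
```

Overlapping matches such as "network" inside "neural network" are reported too, so a query expander would typically keep the longest match at each position.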
Extracting organizations
• NLTK + Stanford
• NLTK with Stanford NER backend – very slow
• Stanford NER command line
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/12-org-ner-nltk-stanford.ipynb
• spaCy NER
• Quite fast and good quality
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/13-org-ner-spacy.ipynb
• Pruning predictions against dictionaries
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/14-org-ner-ahocorasick.ipynb
Incorporating Facets into Solr
• We extracted keywords and ORGs for each document
• We already have authors for each document
• We can enhance search by offering facet functionality
• See in tool - http://localhost:5000/search4
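A facet request is an ordinary Solr select query with a few extra parameters. The sketch below builds such a URL and reshapes the flat facet list Solr returns by default in JSON; the core name `nips`, the port and the field names `keywords`, `authors` and `orgs` are assumptions, not values taken from the tutorial's schema.

```python
from urllib.parse import urlencode

def facet_query_url(query, facet_fields,
                    base_url="http://localhost:8983/solr/nips/select"):
    # base_url is a hypothetical local core; adjust to the real one.
    params = [("q", query), ("rows", 10), ("facet", "true"), ("facet.mincount", 1)]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)

def parse_facet_counts(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))

url = facet_query_url("body:attention", ["keywords", "authors", "orgs"])
```

The parsed counts can then be rendered as clickable filters, each adding an `fq` clause to the next request.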
Make Content more discoverable
• Similar item functionality
• More Like This (MLT) – built into Solr
• Similar keywords, authors and ORGs
• Custom MLT using (keywords, authors and ORGs).
• Setting up new fields and MLT handler
• New fields - https://github.com/sujitpal/content-engineering-tutorial/blob/master/scripts/create_schema_2.sh
• Handler configuration - https://github.com/sujitpal/content-engineering-tutorial/blob/master/scripts/create_config_2.sh
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/15-load-solr.ipynb
• See in tool - http://localhost:5000/content1?id=6295
Topic Modeling
• Uses gensim
• Preprocess text for cleaner topic models
• Dimensionality Reduction – projects original document into smaller
and denser “taste” space.
• Find similar documents
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/16-topic-modeling.ipynb
• See in tool - http://localhost:5000/content1?id=6295
Word and Document Embeddings
• Project document vectors into smaller embedding space
• Lookup third party embeddings
• Create your own embeddings (we created our own).
• Another dimensionality reduction technique, projects document into
smaller, denser “meaning” space.
• Document vectors
• Average of word vectors for word2vec (BoW model)
• Model based for doc2vec.
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/17-word-embeddings.ipynb
• See in tool - http://localhost:5000/content1?id=6295
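The bag-of-words document vector mentioned above is simply the mean of the word vectors. Here is a minimal sketch with made-up 3-dimensional vectors; the `word_vectors` values and both helper functions are illustrative assumptions, whereas the notebook trains real embeddings.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only.
word_vectors = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "kernel": [0.1, 0.9, 0.2],
    "regression": [0.0, 0.8, 0.3],
}

def document_vector(tokens, vectors):
    """Average the vectors of in-vocabulary tokens (word order is ignored)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token is in the vocabulary
    dims = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_a = document_vector(["neural", "network", "unknown"], word_vectors)
doc_b = document_vector(["neural"], word_vectors)
doc_c = document_vector(["kernel", "regression"], word_vectors)
```

Similar documents are then the nearest neighbours of a document vector under cosine similarity; here `doc_a` is much closer to `doc_b` than to `doc_c`.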
Building a keyword ontology
• Use keyword collocations across documents to build an ontology of
keywords
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/18-build-ontology.ipynb
• Applications
• Query expansion
• Ranking keyword similarity
• Reverse approach yields navigable network of documents.
• Explore options for Content Recommendations.
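One simple reading of "keyword collocations across documents" is a co-occurrence graph: two keywords are linked when they are assigned to the same document often enough. The sketch below follows that reading; the `min_count` threshold, the function names and the toy keyword assignments are assumptions, and the notebook may weight edges differently.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(doc_keywords, min_count=2):
    """Count how many documents each keyword pair shares; keep frequent pairs."""
    pair_counts = Counter()
    for keywords in doc_keywords.values():
        for pair in combinations(sorted(set(keywords)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

def neighbors(edges, keyword):
    """Related keywords ranked by co-occurrence count, e.g. for query expansion."""
    related = Counter()
    for (a, b), n in edges.items():
        if a == keyword:
            related[b] = n
        elif b == keyword:
            related[a] = n
    return [k for k, _ in sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))]

# Hypothetical keyword assignments for four documents.
doc_keywords = {
    "1": ["neural network", "attention mechanism", "machine translation"],
    "2": ["neural network", "attention mechanism"],
    "3": ["neural network", "machine translation"],
    "4": ["kernel method", "neural network"],
}
edges = cooccurrence_edges(doc_keywords)
```

Swapping the roles of documents and keywords in the same procedure links documents that share keywords, which is the "reverse approach" mentioned above.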
Content Recommendations
• Push Mechanism – recommend documents that a user might like to
read, given the documents he has already read.
• Extension of Similar Documents, except this has multiple documents
as input.
• Most techniques for computing similarity covered earlier (e.g., Topic Modeling, Word Vectors, etc.) can be reused. Here we use NMF (Non-negative Matrix Factorization).
• See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/19-content-recommender.ipynb
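As a rough sketch of the idea, the code below factorizes a small document-term matrix with NMF (Lee and Seung multiplicative updates, written in plain Python) and recommends unread documents closest to the average of the documents already read. The matrix `v`, the choice of `k=2`, the iteration count and the `recommend` helper are all assumptions; the notebook most likely uses a library implementation of NMF.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factor non-negative v (docs x terms) into w (docs x k) and h (k x terms)."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(len(w))]
    return w, h

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sum(x * x for x in a) ** 0.5, sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def recommend(w, read_ids, top_n=2):
    """Rank unread documents by similarity to the mean vector of read ones."""
    k = len(w[0])
    profile = [sum(w[i][d] for i in read_ids) / len(read_ids) for d in range(k)]
    scored = [(cosine(profile, w[i]), i) for i in range(len(w)) if i not in read_ids]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))[:top_n]]

# Hypothetical counts: documents 0-2 share one vocabulary, documents 3-4 another.
v = [[2, 2, 0, 0], [1, 1, 0, 0], [3, 3, 0, 0], [0, 0, 1, 1], [0, 0, 2, 2]]
w, h = nmf(v, k=2)
suggestions = recommend(w, read_ids=[0])
```

Because the profile is an average over several read documents, the same function covers the single-document "similar items" case as well.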
• Contact: sujit.pal@elsevier.com
Thank you!

More Related Content

What's hot

Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Lucidworks
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
Bitly
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Trey Grainger
 
Vectors in Search - Towards More Semantic Matching
Vectors in Search - Towards More Semantic MatchingVectors in Search - Towards More Semantic Matching
Vectors in Search - Towards More Semantic Matching
Simon Hughes
 
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine LearningLucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Joaquin Delgado PhD.
 
Haystacks slides
Haystacks slidesHaystacks slides
Haystacks slides
Ted Sullivan
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
 
AI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsAI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analytics
DataWorks Summit
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
Paco Nathan
 
Exploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, LucidworksExploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, Lucidworks
Lucidworks
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
Spark Summit
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
Trey Grainger
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Trey Grainger
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Lucidworks
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
Trey Grainger
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
Jake Mannix
 
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITIONNAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
live_and_let_live
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
Philippe Mizrahi
 
Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Luc...
Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Luc...Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Luc...
Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Luc...
Lucidworks
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Trey Grainger
 

What's hot (20)

Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
 
Vectors in Search - Towards More Semantic Matching
Vectors in Search - Towards More Semantic MatchingVectors in Search - Towards More Semantic Matching
Vectors in Search - Towards More Semantic Matching
 
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine LearningLucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
 
Haystacks slides
Haystacks slidesHaystacks slides
Haystacks slides
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
 
AI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsAI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analytics
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Exploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, LucidworksExploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, Lucidworks
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITIONNAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Luc...
Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Luc...Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Luc...
Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Luc...
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
 

Similar to Search summit-2018-content-engineering-slides

Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Simon Hughes
 
Improving your team's source code searching capabilities - Voxxed Thessalonik...
Improving your team's source code searching capabilities - Voxxed Thessalonik...Improving your team's source code searching capabilities - Voxxed Thessalonik...
Improving your team's source code searching capabilities - Voxxed Thessalonik...
Nikos Katirtzis
 
Improving your team’s source code searching capabilities
Improving your team’s source code searching capabilitiesImproving your team’s source code searching capabilities
Improving your team’s source code searching capabilities
Nikos Katirtzis
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
Barbara Fusinska
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Lucidworks
 
02-Lifecycle.pptx
02-Lifecycle.pptx02-Lifecycle.pptx
02-Lifecycle.pptx
Shree Shree
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Flink Forward
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
Databricks
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
QuantUniversity
 
OpenCms Days 2014 - Introducing the 9.5 OpenCms documentation
OpenCms Days 2014 - Introducing the 9.5 OpenCms documentationOpenCms Days 2014 - Introducing the 9.5 OpenCms documentation
OpenCms Days 2014 - Introducing the 9.5 OpenCms documentation
Alkacon Software GmbH & Co. KG
 
TC Dojo Open Session: Are You Getting the Most Out of DITA Content Reuse?
TC Dojo Open Session: Are You Getting the Most Out of DITA Content Reuse? TC Dojo Open Session: Are You Getting the Most Out of DITA Content Reuse?
TC Dojo Open Session: Are You Getting the Most Out of DITA Content Reuse?
IXIASOFT
 
Cataloging GitHub Repositories
Cataloging GitHub RepositoriesCataloging GitHub Repositories
Cataloging GitHub Repositories
Pavneet Singh Kochhar
 
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
Bruce Kozuma
 
Learn from my Mistakes - Building Better Solutions in SPFx
Learn from my  Mistakes - Building Better Solutions in SPFxLearn from my  Mistakes - Building Better Solutions in SPFx
Learn from my Mistakes - Building Better Solutions in SPFx
Thomas Daly
 
06 integrate elasticsearch
06 integrate elasticsearch06 integrate elasticsearch
06 integrate elasticsearch
Erhwen Kuo
 
Introduction to Microdata & Google Rich Snippets
Introduction to Microdata  & Google Rich SnippetsIntroduction to Microdata  & Google Rich Snippets
Introduction to Microdata & Google Rich Snippets
Plus91 Technologies Pvt. Ltd.
 
20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge
Karin Patenge
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
MongoDB
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
MongoDB
 

Similar to Search summit-2018-content-engineering-slides (20)

Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
Improving your team's source code searching capabilities - Voxxed Thessalonik...
Improving your team's source code searching capabilities - Voxxed Thessalonik...Improving your team's source code searching capabilities - Voxxed Thessalonik...
Improving your team's source code searching capabilities - Voxxed Thessalonik...
 
Improving your team’s source code searching capabilities
Improving your team’s source code searching capabilitiesImproving your team’s source code searching capabilities
Improving your team’s source code searching capabilities
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
02-Lifecycle.pptx
02-Lifecycle.pptx02-Lifecycle.pptx
02-Lifecycle.pptx
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
 
OpenCms Days 2014 - Introducing the 9.5 OpenCms documentation
OpenCms Days 2014 - Introducing the 9.5 OpenCms documentationOpenCms Days 2014 - Introducing the 9.5 OpenCms documentation
OpenCms Days 2014 - Introducing the 9.5 OpenCms documentation
 
TC Dojo Open Session: Are You Getting the Most Out of DITA Content Reuse?
TC Dojo Open Session: Are You Getting the Most Out of DITA Content Reuse? TC Dojo Open Session: Are You Getting the Most Out of DITA Content Reuse?
TC Dojo Open Session: Are You Getting the Most Out of DITA Content Reuse?
 
Cataloging GitHub Repositories
Cataloging GitHub RepositoriesCataloging GitHub Repositories
Cataloging GitHub Repositories
 
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
 
Learn from my Mistakes - Building Better Solutions in SPFx
Learn from my  Mistakes - Building Better Solutions in SPFxLearn from my  Mistakes - Building Better Solutions in SPFx
Learn from my Mistakes - Building Better Solutions in SPFx
 
06 integrate elasticsearch
06 integrate elasticsearch06 integrate elasticsearch
06 integrate elasticsearch
 
Introduction to Microdata & Google Rich Snippets
Introduction to Microdata  & Google Rich SnippetsIntroduction to Microdata  & Google Rich Snippets
Introduction to Microdata & Google Rich Snippets
 
20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
 

More from Sujit Pal

Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSupporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Sujit Pal
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
Sujit Pal
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
Sujit Pal
 
Cheap Trick for Question Answering
Cheap Trick for Question AnsweringCheap Trick for Question Answering
Cheap Trick for Question Answering
Sujit Pal
 
Searching Across Images and Test
Searching Across Images and TestSearching Across Images and Test
Searching Across Images and Test
Sujit Pal
 
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Sujit Pal
 
Backprop Visualization
Backprop VisualizationBackprop Visualization
Backprop Visualization
Sujit Pal
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn Cloud
Sujit Pal
 
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Sujit Pal
 
Leslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubLeslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal Club
Sujit Pal
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Search summit-2018-content-engineering-slides

  • 2. Outline
      • Background (Vector Space Model / Lucene)
      • Basic Text Processing
      • Keyword Extraction
      • Content Search and Discovery Enhancement
      • Dimensionality Reduction
      • Ontology Construction
      • Content based Recommendations
  • 3. Pre-requisites
      • Notebooks and web tool available at https://github.com/sujitpal/content-engineering-tutorial
      • Instructions in data/README.md and models/README.md about locations of data and models to download.
      • Parts of the tutorial depend on Solr 7.x; please download and install it.
      • Data and models (along with all processing) available at
      • https://drive.google.com/file/d/1uGbrGu5v9yaUKB_26oz_asKe2-SRGtsw/view?usp=sharing
      • https://drive.google.com/file/d/1xTB2Qx6roKYxUCWR_uTu4pwGp4iW-rY7/view?usp=sharing
      • Code mostly written using Anaconda Python 3, see requirements.txt for full list of additional libraries needed.
  • 4. Vector Space Model
      • Collection of documents can be thought of as term-document matrix.
      • Vector space - each token represents a dimension, weight represents the value along that dimension for the document.
      • Term Frequency, Inverse Document Frequency
      • Inverted Index for Search
      • Dimensionality Reduction as a useful technique for extracting semantics of data.
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/00-vector-space-model.ipynb
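The term-document view above can be illustrated with a minimal sketch in plain Python. This is not the notebook's code: the whitespace tokenizer, the smoothed IDF formula and the three toy documents are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of sparse TF-IDF dicts), one dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return sorted(df), vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus, made up for illustration.
docs = [
    "neural network training",
    "neural network with attention",
    "support vector machine kernel",
]
vocab, vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing "neural network"
sim_02 = cosine(vecs[0], vecs[2])  # documents with no terms in common
```

Each distinct token is one dimension of the space, so documents that share no tokens are orthogonal and their cosine similarity is zero.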
  • 5. Processing Data
      • Preprocessing NIPS data for Solr
      • Solr Field Types
      • Solr Analyzer Chain
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/01-preprocess.ipynb
      • Baseline search implemented
      • http://localhost:5000/search0
      • Entire search string must appear in title or body.
  • 6. Search Improvement - tokenize query
      • Baseline search may be too restrictive.
      • Example: “neural network with attention mechanism” will return 0 results.
      • Solution - tokenize search string and do OR query with each token.
      • Example in tool - http://localhost:5000/search1
      • High recall - too high?
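A small sketch of the difference between the baseline phrase query and the tokenized OR query. The field names `title` and `body` and both helper functions are hypothetical, and the sketch does not escape Lucene special characters, which a real query builder would need to do.

```python
def build_phrase_query(query, fields=("title", "body")):
    """Baseline: the whole search string must match as a phrase in some field."""
    return " OR ".join('{}:"{}"'.format(f, query) for f in fields)

def build_or_query(query, fields=("title", "body")):
    """Tokenize on whitespace and OR every token against every field."""
    tokens = query.lower().split()
    clauses = ["{}:{}".format(f, t) for t in tokens for f in fields]
    return " OR ".join(clauses)

q = "neural network with attention mechanism"
phrase_q = build_phrase_query(q)
or_q = build_or_query(q)
```

The OR form matches any document containing any single token, including common words such as "with", which is why recall rises sharply.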
  • 7. Exploring Data
      • Word Counts
      • Figuring out what to keep and what to discard
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/02-wordcounts.ipynb
  • 8. Custom Stopwords
      • Removing stopwords may be a way to make the tokenized OR search less noisy.
      • Leveraging IDF to detect potential stopwords for the corpus.
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/03-stopwords.ipynb
      • See it applied in tool - http://localhost:5000/search2
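One way to read the IDF idea: terms that occur in nearly every document have an IDF close to zero and carry little discriminating power, so they are stopword candidates for this corpus. The sketch below assumes an unsmoothed IDF of log(N / df) and an arbitrary threshold of 0.5; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def idf_stopword_candidates(docs, max_idf=0.5):
    """Return terms whose IDF falls at or below max_idf, sorted alphabetically."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency of each term across the corpus.
    df = Counter(t for tokens in tokenized for t in tokens)
    idf = {t: math.log(n_docs / df[t]) for t in df}
    return sorted(t for t, value in idf.items() if value <= max_idf)

# Toy corpus, made up for illustration.
docs = [
    "we propose a model for vision",
    "we train a model on speech",
    "we evaluate a kernel model",
    "we study a bayesian model",
]
candidates = idf_stopword_candidates(docs)
```

In this toy corpus "model" is flagged alongside "we" and "a": a word that is informative in general English can still behave like a stopword within a specialized collection.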
  • 9. How about keywords?
      • Detect keywords in text using Log Likelihood Ratio (LLR)
      • Frequent bigrams (can be extended to longer grams)
      • Likely collocations - Log Likelihood Ratio method
      • General family of statistical methods - others are Chi-squared, t-test, etc.
      • Notebooks
      • Generating Frequent Bigrams - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/04-bigrams.ipynb
      • Finding likely bigrams - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/05-bigrams-llr.ipynb
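A rough sketch of LLR scoring for bigrams, using the entropy form of Dunning's log-likelihood ratio over a 2x2 contingency table. It is an illustration under simplifying assumptions (whitespace tokens, a single token stream), not the notebook's implementation; the function names and the toy sentence are made up.

```python
import math
from collections import Counter

def _xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def _entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """LLR for a 2x2 table: k11 = (a, b) together, k12 = a without b,
    k21 = b without a, k22 = neither."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def score_bigrams(tokens):
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), count in bigrams.items():
        first[a] += count    # bigrams that start with a
        second[b] += count   # bigrams that end with b
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores

tokens = "neural network learns fast neural network wins big neural network".split()
scores = score_bigrams(tokens)
best = max(scores, key=scores.get)
```

A table in which the two words occur independently scores zero, while a pair that almost always appears together scores high, which is what makes LLR useful for separating collocations from merely frequent bigrams.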
  • 10. Other keyword detection algorithms
      • RAKE - rule based, unsupervised.
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/06-rake.ipynb
      • MAUI - machine learning based
      • Provide text and keywords as labels for training - used combination of keywords from bigrams LLR and RAKE.
      • Apply trained MAUI model on same text to predict more keywords
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/07-maui.ipynb
      • Keywords merged from LLR, RAKE and MAUI were then manually curated.
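To give a feel for the rule-based approach, here is a stripped-down sketch of the RAKE idea: split text into candidate phrases at stopwords and punctuation, score each word by degree / frequency, and score each phrase as the sum of its word scores. The tiny stopword list and the sample text are assumptions, and this is not the library used in the notebook.

```python
import re
from collections import defaultdict

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "we", "with"}

def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake(text):
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    # Highest score first; ties broken alphabetically.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))

text = "We train a recurrent neural network for speech recognition. The neural network is fast."
ranked = rake(text)
```

Longer phrases built from words that tend to appear in long phrases rise to the top, so multi-word technical terms outrank isolated words.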
  • 11. Removing near-duplicate keywords
      • Using simhash algorithm to detect keywords that are very close to each other, e.g., data model and data models.
      • Uses hashing techniques (min-hash, sim-hash)
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/08-keyword-neardup.ipynb
      • Using dedupe algorithm to detect keywords that are semantically equivalent, e.g., data set and dataset.
      • Uses Machine Learning
      • Training code (active learning) - https://github.com/sujitpal/content-engineering-tutorial/blob/master/scripts/dedupe_keyword_train.py
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/09-keyword-dedupe.ipynb
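The simhash idea can be sketched in a few lines: hash each character trigram, let every hash vote on each bit position, and compare the resulting fingerprints by Hamming distance. The trigram features, the 64-bit width and the use of MD5 as the feature hash are assumptions for this sketch, not choices taken from the notebook.

```python
import hashlib

def _features(text, n=3):
    """Character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(max(1, len(text) - n + 1))]

def simhash(text, bits=64):
    weights = [0] * bits
    for feat in _features(text):
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

near = hamming(simhash("data model"), simhash("data models"))
far = hamming(simhash("data model"), simhash("support vector machine"))
```

Keywords that share most of their trigrams end up with fingerprints a few bits apart, so a small Hamming-distance threshold can flag pairs such as singular and plural forms for review.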
  • 12. Incorporating keywords into search
      • Detect keywords in search query using Aho-Corasick algorithm.
      • Solr supports Aho-Corasick via SolrTextTagger.
      • Expand query to incorporate keywords
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/11-query-parsing-expansion.ipynb
      • See it applied in tool - http://localhost:5000/search4
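For intuition about what the tagger does, here is a compact token-level Aho-Corasick automaton in plain Python, followed by one possible expansion scheme. It is a sketch only: `KeywordTagger` and `expand_query` are hypothetical names, the keyword list is made up, and the expansion rule (append each multi-word keyword as a quoted phrase) is an assumption rather than the tutorial's actual query logic.

```python
from collections import deque

class KeywordTagger:
    """Aho-Corasick automaton over whitespace tokens instead of characters."""

    def __init__(self, keywords):
        self.goto = [{}]   # trie transitions per state
        self.fail = [0]    # failure links
        self.out = [[]]    # keywords recognized at each state
        for kw in keywords:
            self._add(kw.lower().split())
        self._build_failure_links()

    def _add(self, tokens):
        state = 0
        for tok in tokens:
            if tok not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][tok] = len(self.goto) - 1
            state = self.goto[state][tok]
        self.out[state].append(" ".join(tokens))

    def _build_failure_links(self):
        # Breadth-first, so a state's failure target is finished before it is used.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for tok, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and tok not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(tok, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def tag(self, text):
        """Return (start_token_index, keyword) for every keyword found."""
        state, matches = 0, []
        for i, tok in enumerate(text.lower().split()):
            while state and tok not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(tok, 0)
            for kw in self.out[state]:
                matches.append((i - len(kw.split()) + 1, kw))
        return matches

def expand_query(text, tagger):
    # Assumed scheme: keep the token OR query, add multi-word keywords as phrases.
    phrases = ['"{}"'.format(kw) for _, kw in tagger.tag(text) if " " in kw]
    return " OR ".join(text.lower().split() + phrases)

tagger = KeywordTagger(["neural network", "attention mechanism", "network"])
query = "neural network with attention mechanism"
matches = tagger.tag(query)
expanded = expand_query(query, tagger)
```

The automaton finds all keywords, including overlapping ones, in a single pass over the query, which is the property that makes this approach practical with large keyword dictionaries.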
  • 13. Extracting organizations
      • NLTK + Stanford
      • NLTK with Stanford NER backend - very slow
      • Stanford NER command line
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/12-org-ner-nltk-stanford.ipynb
      • Spacy NER
      • Quite fast and good quality
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/13-org-ner-spacy.ipynb
      • Pruning predictions against dictionaries
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/14-org-ner-ahocorasick.ipynb
  • 14. Incorporating Facets into Solr
      • We extracted keywords and ORGs for each document
      • We already have authors for each document
      • We can enhance search by offering facet functionality
      • See in tool - http://localhost:5000/search4
  • 15. Make Content more discoverable
      • Similar item functionality
      • More Like This (MLT) - built into Solr
      • Similar keywords, authors and ORGs
      • Custom MLT using (keywords, authors and ORGs).
      • Setting up new fields and MLT handler
      • New fields - https://github.com/sujitpal/content-engineering-tutorial/blob/master/scripts/create_schema_2.sh
      • Handler configuration - https://github.com/sujitpal/content-engineering-tutorial/blob/master/scripts/create_config_2.sh
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/15-load-solr.ipynb
      • See in tool - http://localhost:5000/content1?id=6295
  • 16. Topic Modeling
      • Uses gensim
      • Preprocess text for cleaner topic models
      • Dimensionality Reduction - projects original document into smaller and denser “taste” space.
      • Find similar documents
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/16-topic-modeling.ipynb
      • See in tool - http://localhost:5000/content1?id=6295
  • 17. Word and Document Embeddings
      • Project document vectors into smaller embedding space
      • Lookup third party embeddings
      • Create your own embeddings (we created our own).
      • Another dimensionality reduction technique, projects document into smaller, denser “meaning” space.
      • Document vectors
      • Average of word vectors for word2vec (BoW model)
      • Model based for doc2Vec.
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/17-word-embeddings.ipynb
      • See in tool - http://localhost:5000/content1?id=6295
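The "average of word vectors" approach to document vectors can be sketched without any embedding library. The three-dimensional vectors below are made-up toy values standing in for trained word2vec embeddings, and `doc_vector` is a hypothetical helper name.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
WORD_VECTORS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.0],
    "kernel": [0.1, 0.9, 0.0],
    "svm": [0.0, 1.0, 0.1],
}

def doc_vector(text, vectors=WORD_VECTORS):
    """Average the vectors of known words; return None if no word is known."""
    rows = [vectors[t] for t in text.lower().split() if t in vectors]
    if not rows:
        return None
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_a = doc_vector("neural network")
doc_b = doc_vector("kernel svm")
doc_c = doc_vector("deep neural model")  # only "neural" is in the toy vocabulary
sim_ac = cosine(doc_a, doc_c)
sim_ab = cosine(doc_a, doc_b)
```

Averaging ignores word order, which is why the slide calls it a bag-of-words model; out-of-vocabulary words are simply skipped in this sketch.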
  • 18. Building a keyword ontology
      • Use keyword collocations across documents to build an ontology of keywords
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/18-build-ontology.ipynb
      • Applications
      • Query expansion
      • Ranking keyword similarity
      • Reverse approach yields navigable network of documents.
      • Explore options for Content Recommendations.
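One simple reading of "keyword collocations across documents" is a co-occurrence graph: two keywords are linked when they are assigned to the same document often enough. The sketch below assumes that reading; the minimum count of 2, the function names and the toy keyword assignments are all hypothetical, and the notebook may build its ontology differently.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(doc_keywords, min_count=2):
    """Count keyword pairs that share a document; keep pairs seen min_count times."""
    counts = Counter()
    for keywords in doc_keywords.values():
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

def related_keywords(keyword, edges):
    """Neighbors of a keyword, strongest link first, ties broken alphabetically."""
    neighbors = {}
    for (a, b), count in edges.items():
        if a == keyword:
            neighbors[b] = count
        elif b == keyword:
            neighbors[a] = count
    return sorted(neighbors, key=lambda k: (-neighbors[k], k))

# Toy document-to-keyword assignments, made up for illustration.
doc_keywords = {
    1: ["neural network", "backpropagation", "gradient descent"],
    2: ["neural network", "backpropagation"],
    3: ["neural network", "gradient descent", "attention mechanism"],
}
edges = cooccurrence_edges(doc_keywords)
related = related_keywords("neural network", edges)
```

The neighbor list is the kind of structure that could feed query expansion or keyword similarity ranking; swapping the roles of documents and keywords gives the document network mentioned in the slide.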
  • 19. Content Recommendations
      • Push Mechanism - recommend documents that a user might like to read, given the documents they have already read.
      • Extension of Similar Documents, except this has multiple documents as input.
      • Most techniques for computing similarity covered earlier (e.g., Topic Modeling, Word Vectors, etc.) can be reused. Here we use NMF (Non-negative Matrix Factorization).
      • See notebook - https://github.com/sujitpal/content-engineering-tutorial/blob/master/notebooks/19-content-recommender.ipynb
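A minimal sketch of NMF-based recommendation in plain Python, assuming Lee-Seung multiplicative updates on a small document-term matrix. The user profile (mean of the latent vectors of the documents already read), the cosine ranking, the rank of 2 and the toy matrix are all assumptions for illustration; the notebook may use a library implementation and a different scoring rule.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (docs x terms) into W (docs x rank) and H (rank x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(w, read_ids, top_n=1):
    """Rank unread documents by similarity to the mean latent vector of read ones."""
    profile = [sum(w[i][k] for i in read_ids) / len(read_ids)
               for k in range(len(w[0]))]
    scored = [(cosine(profile, w[i]), i) for i in range(len(w)) if i not in read_ids]
    return [i for _, i in sorted(scored, reverse=True)[:top_n]]

# Toy document-term counts: documents 0-1 and 2-3 use disjoint vocabularies.
V = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
W0, H0 = nmf(V, rank=2, iters=0)   # random starting point
W, H = nmf(V, rank=2)
recs = recommend(W, read_ids={0})
```

Because the input is a set of read documents rather than a single one, the same latent space serves both "similar documents" and recommendations; only the profile construction differs.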