SlideShare a Scribd company logo
IR: open source
state
Dmitry Kan, AlphaSense, Insider
Solutions
University of Helsinki, Information Retrieval and
Search Engines course, Feb 21, 2017
About me
● PhD in CS (Saint Petersburg State University), 2011
● Running a Search Engine team at AlphaSense since 2014
● Founded Insider Solutions in 2009: text analytics solutions +
consulting
● Co-committer on luke project: toolbox for Lucene index since 2013
What is AlphaSense
● Google for financial analysts
● Semantic research engine
● Edit, tag, annotate, share you data in a team
● Oracle, JP Morgan, Credit Suisse
● Engineering is 98% in Helsinki + 1% NYC + 1% India
● #1 fastest growing IT startup in Finland by Deloitte
(2015)
www.alpha-sense.com
● Founded 2009
● BigText Analytics APIs and on-premise solutions
○ Sentiment analysis: Russian, Chinese, English
○ Searchable trend extraction
● Consulting: startups and corporates
https://semanticanalyzer.info
Insider Solutions
Outline
● Search engine architecture
● Open source search ecosystem
● Research directions for applied IR
Search engine: building blocks
● Web crawler: Apache Nutch (based on Hadoop)
● Data ingestion pipeline: receiving, cleaning, data
extraction
● SolrCloud OR Elasticsearch (both based on Lucene)
● Shards: storing index on disk and / or memory
Lucene / Solr history timeline
Inject URLs
Create
segments
New URLs
Search Engine Software Components
● Schema
● Query parser
● Scoring algorithm
● Snippet highlighter
● Index (on-disk or in-memory)
Query analysis and suggestions
British vs US English handling
One shard of the index
Content extraction
Apache Tika for parsing formats:
● Html, XML
● PDF
● Microsoft Office & iWorks document formats
● Audio, image, video
● Mail
● Source code
Inspecting Lucene index with Luke
Implemented by Andrzej Bialecki. Since 2013 → by Dmitry Kan (Finland) and
Tomoko Uchida (Japan)
● Perform index maintenance
● Prototype similarity functions
● Search for documents, reconstruct field values from the index
● Read index from HDFS (Hadoop’s distributed file system)
● Supports Apache Solr and Elasticsearch
Learning to rank: Solr
Contributed by Bloomberg
Machine learnt model for reranking documents based on user feedback
Trained on features: views, popularity, was hit in the title, length, can view on
mobile device?
LamdaMART, RankSVM
Lucene scoring formula
Feature: is person and executive?
Feature: recency of the document
Features as signal of result importance
Learnt model
Word vectors with Lucene
Word2vec was released by Google to open source
Possible to train word2vec on Lucene index:
https://github.com/kojisekig/word2vec-lucene
● NO need to provide a text file besides Lucene index
● NO need to normalize text. Normalization already done in the index or
Analyzer does it for you when processing
● Use part of the index by specifying a filter query
Questions?
Reach me at:
dk@semanticanalyzer.info
Twitter: @dmitrykan
Quora: https://www.quora.com/profile/Dmitry-Kan
References
1. Luke: https://github.com/DmitryKey/luke
2. My blog: http://dmitrykan.blogspot.fi/
3. Solr vs Elasticsearch (overview): https://sematext.com/blog/2015/01/30/solr-elasticsearch-comparison/
4. Solr vs Elasticsearch (in-depth): https://sematext.com/blog/2012/08/23/solr-vs-elasticsearch-part-1-overview/
5. Introduction to Apache Solr http://www.slideshare.net/ChristosManios/introduction-to-apache-solr-54076189
6. Word2vec-lucene: https://github.com/kojisekig/word2vec-lucene
7. Apache Tika: https://tika.apache.org/
8. Apache Solr: http://lucene.apache.org/solr/
9. Elasticsearch: https://github.com/elastic/elasticsearch
10. Learning to rank in Solr (video): https://www.youtube.com/watch?v=M7BKwJoh96s
11. Learning to rank in Solr (slides): https://lucidworks.com/2016/08/17/learning-to-rank-solr/
12. Word2vec: https://en.wikipedia.org/wiki/Word2vec#Analysis
13. Lucene scoring formula:
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

More Related Content

What's hot

Resume ashay
Resume ashayResume ashay
Resume ashay
ashay sharma
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...
Lifeng (Aaron) Han
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
Lifeng (Aaron) Han
 
Entity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph EmbeddingsEntity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph Embeddings
Sease
 
DATA SCIENCE TRAINING IN CHENNAI
DATA SCIENCE TRAINING IN CHENNAIDATA SCIENCE TRAINING IN CHENNAI
DATA SCIENCE TRAINING IN CHENNAI
shivajirao12345
 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Lifeng (Aaron) Han
 

What's hot (6)

Resume ashay
Resume ashayResume ashay
Resume ashay
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
Entity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph EmbeddingsEntity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph Embeddings
 
DATA SCIENCE TRAINING IN CHENNAI
DATA SCIENCE TRAINING IN CHENNAIDATA SCIENCE TRAINING IN CHENNAI
DATA SCIENCE TRAINING IN CHENNAI
 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
 

Viewers also liked

MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine Translation
Dmitry Kan
 
SentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaSentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social media
Dmitry Kan
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupDmitry Kan
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeup
Dmitry Kan
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryDmitry Kan
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
Dmitry Kan
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1Dmitry Kan
 
Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian language
Dmitry Kan
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation systemDmitry Kan
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsDmitry Kan
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for English
Dmitry Kan
 
Linguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian languageLinguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian language
Dmitry Kan
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine TranslationDmitry Kan
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
Dmitry Kan
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Dmitry Kan
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Dmitry Kan
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian language
Dmitry Kan
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slides
Dmitry Kan
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use cases
Dmitry Kan
 
Proformas informatica
Proformas informaticaProformas informatica
Proformas informatica
yarlin04
 

Viewers also liked (20)

MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine Translation
 
SentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaSentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social media
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer Group
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeup
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational Dictionary
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1
 
Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian language
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation system
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwords
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for English
 
Linguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian languageLinguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian language
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine Translation
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian language
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slides
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use cases
 
Proformas informatica
Proformas informaticaProformas informatica
Proformas informatica
 

Similar to IR: Open source state

SG_UserGroup_Oct20_2022_NLP_AzureLangStudio.pptx
SG_UserGroup_Oct20_2022_NLP_AzureLangStudio.pptxSG_UserGroup_Oct20_2022_NLP_AzureLangStudio.pptx
SG_UserGroup_Oct20_2022_NLP_AzureLangStudio.pptx
PriyankaShah668821
 
Nirdesh_Developer_2.0_Years_6_months_Exp
Nirdesh_Developer_2.0_Years_6_months_ExpNirdesh_Developer_2.0_Years_6_months_Exp
Nirdesh_Developer_2.0_Years_6_months_ExpNirdesh Kulshreshtha
 
Gdsc IIIT Surat Orientation 2022.pdf
Gdsc IIIT Surat Orientation 2022.pdfGdsc IIIT Surat Orientation 2022.pdf
Gdsc IIIT Surat Orientation 2022.pdf
SparshJhariya2
 
From Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank StoryFrom Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank Story
Alessandro Benedetti
 
Sudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta_Mukherjee_Resume-Nov_2022.pdfSudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta Mukherjee
 
Sudipta mukherjee 2016_2017
Sudipta mukherjee 2016_2017Sudipta mukherjee 2016_2017
Sudipta mukherjee 2016_2017
Sudipta Mukherjee
 
Updated Ayushi Tongiya Resume(1)
Updated Ayushi Tongiya Resume(1)Updated Ayushi Tongiya Resume(1)
Updated Ayushi Tongiya Resume(1)Ayushi Tongiya
 
ResumeAmanRajJuly2016
ResumeAmanRajJuly2016ResumeAmanRajJuly2016
ResumeAmanRajJuly2016Aman Raj
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna Workflows
Andrea Wiggins
 
Shrey_Kumar_Resume_01072016
Shrey_Kumar_Resume_01072016Shrey_Kumar_Resume_01072016
Shrey_Kumar_Resume_01072016Shrey Kumar
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Neo4j
 
Resume sb
Resume sbResume sb
Resume sb
Surbhi Bhatnagar
 
Autopsy 3.0 - Open Source Digital Forensics Conference
Autopsy 3.0 - Open Source Digital Forensics ConferenceAutopsy 3.0 - Open Source Digital Forensics Conference
Autopsy 3.0 - Open Source Digital Forensics Conference
Basis Technology
 
PriyankaDighe_Resume_new
PriyankaDighe_Resume_newPriyankaDighe_Resume_new
PriyankaDighe_Resume_newPriyanka Dighe
 

Similar to IR: Open source state (20)

SG_UserGroup_Oct20_2022_NLP_AzureLangStudio.pptx
SG_UserGroup_Oct20_2022_NLP_AzureLangStudio.pptxSG_UserGroup_Oct20_2022_NLP_AzureLangStudio.pptx
SG_UserGroup_Oct20_2022_NLP_AzureLangStudio.pptx
 
Nirdesh_Developer_2.0_Years_6_months_Exp
Nirdesh_Developer_2.0_Years_6_months_ExpNirdesh_Developer_2.0_Years_6_months_Exp
Nirdesh_Developer_2.0_Years_6_months_Exp
 
Gdsc IIIT Surat Orientation 2022.pdf
Gdsc IIIT Surat Orientation 2022.pdfGdsc IIIT Surat Orientation 2022.pdf
Gdsc IIIT Surat Orientation 2022.pdf
 
From Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank StoryFrom Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank Story
 
Sudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta_Mukherjee_Resume-Nov_2022.pdfSudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta_Mukherjee_Resume-Nov_2022.pdf
 
Resume (2)
Resume (2)Resume (2)
Resume (2)
 
RohitSharmaResume
RohitSharmaResumeRohitSharmaResume
RohitSharmaResume
 
Sudipta mukherjee 2016_2017
Sudipta mukherjee 2016_2017Sudipta mukherjee 2016_2017
Sudipta mukherjee 2016_2017
 
Sudipta_Mukherjee_2016_2017
Sudipta_Mukherjee_2016_2017Sudipta_Mukherjee_2016_2017
Sudipta_Mukherjee_2016_2017
 
Updated Ayushi Tongiya Resume(1)
Updated Ayushi Tongiya Resume(1)Updated Ayushi Tongiya Resume(1)
Updated Ayushi Tongiya Resume(1)
 
ResumeAmanRajJuly2016
ResumeAmanRajJuly2016ResumeAmanRajJuly2016
ResumeAmanRajJuly2016
 
TejasveeBolisetty
TejasveeBolisettyTejasveeBolisetty
TejasveeBolisetty
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna Workflows
 
Shrey_Kumar_Resume_01072016
Shrey_Kumar_Resume_01072016Shrey_Kumar_Resume_01072016
Shrey_Kumar_Resume_01072016
 
Radhakrishnan Moni
Radhakrishnan MoniRadhakrishnan Moni
Radhakrishnan Moni
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
 
Resume sb
Resume sbResume sb
Resume sb
 
Nirbhay Singh
Nirbhay SinghNirbhay Singh
Nirbhay Singh
 
Autopsy 3.0 - Open Source Digital Forensics Conference
Autopsy 3.0 - Open Source Digital Forensics ConferenceAutopsy 3.0 - Open Source Digital Forensics Conference
Autopsy 3.0 - Open Source Digital Forensics Conference
 
PriyankaDighe_Resume_new
PriyankaDighe_Resume_newPriyankaDighe_Resume_new
PriyankaDighe_Resume_new
 

Recently uploaded

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 

Recently uploaded (20)

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 

IR: Open source state

  • 1. IR: open source state Dmitry Kan, AlphaSense, Insider Solutions University of Helsinki, Information Retrieval and Search Engines course, Feb 21, 2017
  • 2. About me ● PhD in CS (Saint Petersburg State University), 2011 ● Running a Search Engine team at AlphaSense since 2014 ● Founded Insider Solutions in 2009: text analytics solutions + consulting ● Co-committer on luke project: toolbox for Lucene index since 2013
  • 3. What is AlphaSense ● Google for financial analysts ● Semantic research engine ● Edit, tag, annotate, share you data in a team ● Oracle, JP Morgan, Credit Suisse ● Engineering is 98% in Helsinki + 1% NYC + 1% India ● #1 fastest growing IT startup in Finland by Deloitte (2015) www.alpha-sense.com
  • 4. ● Founded 2009 ● BigText Analytics APIs and on-premise solutions ○ Sentiment analysis: Russian, Chinese, English ○ Searchable trend extraction ● Consulting: startups and corporates https://semanticanalyzer.info Insider Solutions
  • 5. Outline ● Search engine architecture ● Open source search ecosystem ● Research directions for applied IR
  • 6. Search engine: building blocks ● Web crawler: Apache Nutch (based on Hadoop) ● Data ingestion pipeline: receiving, cleaning, data extraction ● SolrCloud OR Elasticsearch (both based on Lucene) ● Shards: storing index on disk and / or memory
  • 7. Lucene / Solr history timeline
  • 8.
  • 10. Search Engine Software Components ● Schema ● Query parser ● Scoring algorithm ● Snippet highlighter ● Index (on-disk or in-memory)
  • 11. Query analysis and suggestions
  • 12. British vs US English handling
  • 13. One shard of the index
  • 14. Content extraction Apache Tika for parsing formats: ● Html, XML ● PDF ● Microsoft Office & iWorks document formats ● Audio, image, video ● Mail ● Source code
  • 15. Inspecting Lucene index with Luke Implemented by Andrzej Bialecki. Since 2013 → by Dmitry Kan (Finland) and Tomoko Uchida (Japan) ● Perform index maintenance ● Prototype similarity functions ● Search for documents, reconstruct field values from the index ● Read index from HDFS (Hadoop’s distributed file system) ● Supports Apache Solr and Elasticsearch
  • 16.
  • 17. Learning to rank: Solr Contributed by Bloomberg Machine learnt model for reranking documents based on user feedback Trained on features: views, popularity, was hit in the title, length, can view on mobile device? LamdaMART, RankSVM
  • 19.
  • 20.
  • 21.
  • 22. Feature: is person and executive?
  • 23. Feature: recency of the document
  • 24. Features as signal of result importance
  • 26. Word vectors with Lucene Word2vec was released by Google to open source Possible to train word2vec on Lucene index: https://github.com/kojisekig/word2vec-lucene ● NO need to provide a text file besides Lucene index ● NO need to normalize text. Normalization already done in the index or Analyzer does it for you when processing ● Use part of the index by specifying a filter query
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33. Questions? Reach me at: dk@semanticanalyzer.info Twitter: @dmitrykan Quora: https://www.quora.com/profile/Dmitry-Kan
  • 34. References 1. Luke: https://github.com/DmitryKey/luke 2. My blog: http://dmitrykan.blogspot.fi/ 3. Solr vs Elasticsearch (overview): https://sematext.com/blog/2015/01/30/solr-elasticsearch-comparison/ 4. Solr vs Elasticsearch (in-depth): https://sematext.com/blog/2012/08/23/solr-vs-elasticsearch-part-1-overview/ 5. Introduction to Apache Solr http://www.slideshare.net/ChristosManios/introduction-to-apache-solr-54076189 6. Word2vec-lucene: https://github.com/kojisekig/word2vec-lucene 7. Apache Tika: https://tika.apache.org/ 8. Apache Solr: http://lucene.apache.org/solr/ 9. Elasticsearch: https://github.com/elastic/elasticsearch 10. Learning to rank in Solr (video): https://www.youtube.com/watch?v=M7BKwJoh96s 11. Learning to rank in Solr (slides): https://lucidworks.com/2016/08/17/learning-to-rank-solr/ 12. Word2vec: https://en.wikipedia.org/wiki/Word2vec#Analysis 13. Lucene scoring formula: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html