A document and page level retrieval solution powered by ElasticSearch
proposed to handle a business requirement in Mobius
Building an unstructured data
management solution with ElasticSearch
and Amazon Web Services
Topics Covered
❖ The Business need we faced
❖ Why ElasticSearch to meet our challenge?
❖ Adopting the Parent-Child relationship in ElasticSearch
❖ ElasticSearch Document Database Architecture
❖ Technical Implementation of the solution
■ Plugin Creation
■ Index Creation
■ Indexing parent document
■ Indexing child document
■ Retrieving documents by query
❖ Possible Search Types in ElasticSearch
❖ How we adapted the phrase search
The Business need we faced
❖ A UK-based energy intelligence company required a document store database to hold
analysis and research documents.
❖ The documents could be in various file formats such as PDF, Excel and text files.
❖ Two kinds of retrieval were needed -
➢ Page level Retrieval - To retrieve specific pages that matched the search content
and tags.
➢ Document Level Retrieval - To retrieve an entire document based on the searched
content and tags.
Why ElasticSearch to meet our challenge?
❖ Other document level tagging and retrieval solutions like Aleph and OverviewDocs did
not have a clear feature for page level retrieval
❖ Likeable Features of ElasticSearch include -
➢ Open-source, broadly-distributable, readily-scalable, enterprise-grade search
engine.
➢ Can power extremely fast and accurate full-text searches for data discovery
applications.
➢ Multiple configurations and variations are available to tag and index documents such as
PDF and Excel files in ElasticSearch.
➢ Capable of handling up to petabytes of data and scalable to a large extent.
Adopting the Parent-Child relationship in ElasticSearch
❖ Indexing at the document level was a common feature, while page level indexing
was not available by default.
❖ A tailor-made solution for page level retrieval had to be built.
❖ We adopted the Parent-Child relationship in ElasticSearch to cater to our needs.
How would this work?
➢ In the Parent, Document meta information and Document Tags can be saved.
➢ The Child refers to the Parent type and can also index Page tags, Page content
and page level meta information.
Example of the Parent-Child relationship
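The relationship can be pictured with plain data structures. The sketch below is illustrative only: `File_Name` and `Page_Number` appear in the queries later in the deck, while `tags`, `page_count`, the ids and the helper names are assumptions made for this example.

```python
# Parent: one record per document, holding document-level meta information and tags.
parent_document = {
    "_id": "report-42",            # hypothetical document id
    "File_Name": "q1_outlook.pdf",
    "tags": ["gas", "forecast"],   # hypothetical tag field
    "page_count": 2,
}


def make_page(parent_id, page_number, content, tags):
    """Build a child (page) record that points back at its parent document."""
    return {
        "_id": f"{parent_id}-p{page_number}",
        "parent": parent_id,
        "Page_Number": page_number,
        "content": content,
        "tags": tags,
    }


# Children: one record per page, each carrying page-level content and tags.
pages = [
    make_page("report-42", 1, "Executive summary ...", ["summary"]),
    make_page("report-42", 2, "Price forecast for 1Q17 ...", ["forecast"]),
]


def pages_of(parent_id, all_pages):
    """Return every page whose parent is the given document."""
    return [p for p in all_pages if p["parent"] == parent_id]
```

A page level search returns child records directly; a document level search returns the parent whose children match.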
ElasticSearch Document
Database Architecture
ElasticSearch serves as the
core search engine, but splitting,
encoding and merging pages
during retrieval calls for a
proper document database system.
The architecture comprises four
main parts -
❖ Parser
❖ AWS S3 Storage
❖ ElasticSearch
❖ Query Processor
Overview of the ElasticSearch Document Database Architecture
1. Parser:
❖ Parses the documents, splits them into pages and encodes each page to base64
❖ Pushes the original (unencoded) page to AWS S3, and the base64-encoded page to
ElasticSearch along with its AWS S3 location.
2. AWS S3 Storage:
❖ The document and pages of the document are saved here for later retrieval
by the user.
❖ This is done so that when a user searches for a document, we first query
ElasticSearch, fetch the meta information about the document from
there and then retrieve the corresponding document/page from AWS S3.
3. ElasticSearch:
ElasticSearch serves as the core search engine for searching tags, documents and
pages.
4. Query Processor:
❖ The end user will query the document from here.
❖ When a search query is given, the query processor -
➢ Queries ElasticSearch and gets the meta information
➢ Retrieves the actual document/page from AWS S3. This is done to attain
maximum speed and performance.
❖ The result will then be published to the end user.
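The flow between the four parts can be sketched offline, with a dictionary standing in for the S3 bucket and a list standing in for the ElasticSearch index. The key layout, the `s3_location` field name and the function names are assumptions for illustration; the deck does not specify them.

```python
import base64


def s3_key(doc_id, page_number):
    # Hypothetical key layout; the slides do not specify one.
    return f"documents/{doc_id}/pages/{page_number}.pdf"


def parse_document(doc_id, page_blobs):
    """Parser: take per-page bytes, return (s3_objects, es_records)."""
    s3_objects = {}
    es_records = []
    for number, blob in enumerate(page_blobs, start=1):
        key = s3_key(doc_id, number)
        s3_objects[key] = blob  # the raw page goes to storage
        es_records.append({
            "Page_Number": number,
            "s3_location": key,  # lets the query processor find the raw page
            "data": base64.b64encode(blob).decode("ascii"),  # encoded page for indexing
        })
    return s3_objects, es_records


def resolve_hits(hits, s3_objects):
    """Query processor: map metadata-only search hits to raw pages in storage."""
    return [s3_objects[hit["s3_location"]] for hit in hits]
```

In a real deployment the dictionary lookups would be replaced by S3 reads and the records would be sent to ElasticSearch; the point here is only that search hits carry a storage location rather than the page itself.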
Technical
Implementation
of the solution
The indexing and retrieval process built on
the ElasticSearch engine can be broadly
broken down into the following 5 steps -
● Plugin Creation
● Index Creation
● Indexing parent document
● Indexing child document
● Retrieving documents by query
1. Plugin Creation - To build the database in ElasticSearch we have to convert the pages
into base64-encoded content. We need the ingest attachment plugin and an ingest pipeline
(named parser here) that uses its attachment processor to ingest base64-encoded PDF,
Word and similar files and index them in ElasticSearch.
URL: http://localhost:9200/_ingest/pipeline/parser
Method: PUT
Body: {
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
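The same call can be prepared from a script. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and the request is built but not sent so that the example runs without a cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # same host as in the slides


def pipeline_request(name="parser", field="data"):
    """Build the PUT request that registers the attachment pipeline."""
    body = {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }
    return urllib.request.Request(
        url=f"{ES_URL}/_ingest/pipeline/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = pipeline_request()
# urllib.request.urlopen(req) would send it to a running cluster.
```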
2. Index Creation - An index has to be created to hold the documents. Since there were no
special search requirements, a default index with a parent and child mapping was created.
URL: http://localhost:9200/Index_name
Method: PUT
Body: {
  "mappings": {
    "document": {},
    "pages": {
      "_parent": {
        "type": "document"
      }
    }
  }
}
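The `_parent` mapping above is the syntax used by ElasticSearch versions before 6.0; from 6.0 onward it was replaced by the `join` field type. The sketch below shows what an equivalent setup might look like in the newer, typeless (7.x style) form. The join field name `relation` and the helper names are assumptions, and the exact request shape should be checked against the documentation for the version in use.

```python
def join_mapping(parent="document", child="pages", join_field="relation"):
    """Mapping body declaring a parent/child relation via a join field."""
    return {
        "mappings": {
            "properties": {
                join_field: {"type": "join", "relations": {parent: child}}
            }
        }
    }


def parent_body(fields, join_field="relation"):
    """Document body for a parent: marks itself as the 'document' side."""
    return {**fields, join_field: {"name": "document"}}


def child_body(fields, parent_id, join_field="relation"):
    """Document body for a page: names its parent inside the join field."""
    return {**fields, join_field: {"name": "pages", "parent": parent_id}}
```

With a join field, a child is indexed with `?routing=parent_id` rather than `?parent=parent_id`, so that it lands on the same shard as its parent.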
3. Indexing parent document - When a new document is added, we index its document
level details in the parent document using the API call below.
URL: http://localhost:9200/Index_name/document/parent_id
Method: POST
Body: {
  "key": "value"
}
4. Indexing child document - Once the parent is created, the pages and their related
information can be indexed using the API call below. The data field holds the
base64-encoded page.
URL: http://localhost:9200/Index_name/pages/child_id?parent=parent_id&pipeline=parser
Method: POST
Body: {
  "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
  "title": "Quick",
  "data": "SElHSEFDQ1VSQUNZUE9TVEFMQUREUkVTU0VYVFJBQ1RJT05GUk9NV0VCUEFHRVNieVpoZXl1YW5ZdVN1Ym1pdHRlZGlucGFydGlhbGZ1bGxsbWVudG9mdGhlcmVxdWlyZW1lbnRzZm9ydGhlZGVncmVlb2ZNYXN0ZXJvZkNvbXB1dGVyU2NpZW5jZWF0RGFsaG91c2llVW5pdmVyc2l0eUhhbGlmYXgsTm92YVNjb3RpYU1hcmNoMjAwN2NDb3B5cmlnaHRieVpoZXl1YW5ZdSwyMDA3"
}
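Producing the `data` value and the query string by hand is error-prone, so a small helper can assemble both. This is a sketch under assumptions: the function name, index name and ids are made up for the example, and the request is only constructed, not sent.

```python
import base64
import json
import urllib.parse
import urllib.request


def child_request(index, child_id, parent_id, page_bytes, filename, title):
    """Build the POST request that indexes one page under its parent."""
    # parent= ties the page to its document; pipeline= runs the attachment processor.
    query = urllib.parse.urlencode({"parent": parent_id, "pipeline": "parser"})
    body = {
        "filename": filename,
        "title": title,
        "data": base64.b64encode(page_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        url=f"http://localhost:9200/{index}/pages/{child_id}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = child_request("document_db", "bh1-p1", "bh1",
                    b"%PDF-1.4 sample page", "bh1.pdf", "Quick")
```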
5. Retrieving documents by query - A document can be queried by text, title or tags,
and the method below can be used for all three.
URL: http://localhost:9200/Index_name/pages/_search
Method: POST
Body: {
  "query": {
    "match": {
      "attachment.content": {
        "query": "lorem"
      }
    }
  }
}
Possible Search Types in ElasticSearch
There are many search types in ElasticSearch by default. Below are a few of them -
How we adapted the phrase search
❖ Our business requirement was to perform a phrase search for content matching and
exact match for tag matching.
❖ We used two types of phrase searches
➢ Page Phrase Search
➢ Document Phrase Search
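Combining the two requirements, a phrase match on content and an exact match on tags, might look like the sketch below. The tag field name `tags.keyword` is an assumption: it presumes pages carry a `tags` field that is mapped with a keyword sub-field, which the deck does not show.

```python
def page_search(phrase, tags=()):
    """Phrase match on page content, plus one exact-match filter per tag."""
    query = {
        "bool": {
            # match_phrase requires the words to appear together and in order.
            "must": [{"match_phrase": {"attachment.content": {"query": phrase}}}],
            # term compares the stored value as-is, without analysis.
            "filter": [{"term": {"tags.keyword": tag}} for tag in tags],
        }
    }
    return {"query": query}


body = page_search("1Q17", ["forecast"])
```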
Page Phrase Search
URL: http://localhost:9200/document_db/pages/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "attachment.content": {
              "query": "1Q17"
            }
          }
        }
      ]
    }
  },
  "_source": [
    "_type",
    "_id",
    "Page_Number",
    "type",
    "File_Name"
  ],
  "highlight": {
    "fields": {
      "attachment.content": {}
    }
  }
}
Note:
In this page search we return only the fields we need
by listing them in the _source field. This avoids
retrieving the page and its base64-encoded content,
which would increase both the size of the response
and the latency.
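Because only a few fields come back, the response is easy to reduce to a result list. The sketch below works on a hand-written sample shaped like an ElasticSearch search response; the sample values and the function name are made up for illustration.

```python
def summarize_hits(response):
    """Reduce a search response to id, file, page and first highlight snippet."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        snippets = hit.get("highlight", {}).get("attachment.content", [])
        rows.append({
            "id": hit["_id"],
            "file": source.get("File_Name"),
            "page": source.get("Page_Number"),
            "snippet": snippets[0] if snippets else None,
        })
    return rows


# Hand-written sample, not real output.
sample_response = {
    "hits": {"hits": [{
        "_id": "bh1-p3",
        "_source": {"File_Name": "bh1.pdf", "Page_Number": 3},
        "highlight": {"attachment.content": ["outlook for <em>1Q17</em>"]},
    }]}
}
```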
Document Phrase Search
URL: http://localhost:9200/document_db/document/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "has_child": {
            "type": "pages",
            "query": {
              "match_phrase": {
                "attachment.content": "1-800-SEC-0330."
              }
            }
          }
        }
      ]
    }
  }
}
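A document level search returns parents only. If the matching pages are also wanted in the same response, the `has_child` query accepts an `inner_hits` option. The builder below is a sketch; the function name and the `with_pages` flag are hypothetical, and the deck does not say whether the authors used `inner_hits`.

```python
def document_search(phrase, child_type="pages", with_pages=False):
    """Find parent documents that have at least one page matching the phrase."""
    has_child = {
        "type": child_type,
        "query": {"match_phrase": {"attachment.content": phrase}},
    }
    if with_pages:
        # Ask for the matching child pages too, limited to their page numbers.
        has_child["inner_hits"] = {"_source": ["Page_Number"]}
    return {"query": {"bool": {"must": [{"has_child": has_child}]}}}


plain = document_search("1-800-SEC-0330.")
detailed = document_search("1-800-SEC-0330.", with_pages=True)
```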
Concluding Thoughts
❖ The solution outlined here is used as our document store database for document/page
retrieval.
❖ Its response time ranges from a few milliseconds to a few seconds.
❖ Though the current scope of the solution is limited to PDF documents, we are planning
to extend the same to other document types like spreadsheets and text files.
❖ Do you have a different or similar approach to document retrieval? Share your ideas
in the comment section or mail us at support@mobiusservices.com.
Do visit our blog on the topic here:
https://blog.mobiusdata.com/building-unstructured-data-management-solution-with-elasticsearch-and-aws/
Thank You

More Related Content

What's hot

Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
ishmecse13
 
How a search engine works slide
How a search engine works slideHow a search engine works slide
How a search engine works slide
Sovan Misra
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search Engine
NIKHIL NAIR
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
vinay arora
 
Technologies for Websites
Technologies for WebsitesTechnologies for Websites
Technologies for Websites
Compare Infobase Limited
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
A. LE
 
Webinar: What's new in the .NET Driver
Webinar: What's new in the .NET DriverWebinar: What's new in the .NET Driver
Webinar: What's new in the .NET Driver
MongoDB
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
saurabh goel
 
SharePoint Saturday Durban Presentation
SharePoint Saturday Durban PresentationSharePoint Saturday Durban Presentation
SharePoint Saturday Durban Presentation
Warren Marks
 
Web Presen
Web PresenWeb Presen
Web Presen
guest79a91d
 
Technical Utilities for your Site
Technical Utilities for your SiteTechnical Utilities for your Site
Technical Utilities for your Site
Compare Infobase Limited
 
Introduction to mongodb
Introduction to mongodbIntroduction to mongodb
Introduction to mongodb
neela madheswari
 
Document databases
Document databasesDocument databases
Document databases
Qframe
 
WT - Web & Working of Search Engine
WT - Web & Working of Search EngineWT - Web & Working of Search Engine
WT - Web & Working of Search Engine
vinay arora
 
SharePoint Saturday 2010 - SharePoint 2010 Content Organizer Feature
SharePoint Saturday 2010 - SharePoint 2010 Content Organizer FeatureSharePoint Saturday 2010 - SharePoint 2010 Content Organizer Feature
SharePoint Saturday 2010 - SharePoint 2010 Content Organizer Feature
Roy Kim
 
Introduction to Search Engines
Introduction to Search EnginesIntroduction to Search Engines
Introduction to Search Engines
Nitin Pande
 
working of search engine & SEO
working of search engine & SEOworking of search engine & SEO
working of search engine & SEO
Deepak Singh
 
Working of search engine
Working of search engineWorking of search engine
Working of search engine
Nikhil Deswal
 
Mongodb tutorial at Easylearning Guru
Mongodb tutorial  at Easylearning GuruMongodb tutorial  at Easylearning Guru
Mongodb tutorial at Easylearning Guru
KCC Software Ltd. & Easylearning.guru
 
Training Project Report on Search Engines
Training Project Report on Search EnginesTraining Project Report on Search Engines
Training Project Report on Search Engines
Shivam Saxena
 

What's hot (20)

Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
How a search engine works slide
How a search engine works slideHow a search engine works slide
How a search engine works slide
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search Engine
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Technologies for Websites
Technologies for WebsitesTechnologies for Websites
Technologies for Websites
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
 
Webinar: What's new in the .NET Driver
Webinar: What's new in the .NET DriverWebinar: What's new in the .NET Driver
Webinar: What's new in the .NET Driver
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
SharePoint Saturday Durban Presentation
SharePoint Saturday Durban PresentationSharePoint Saturday Durban Presentation
SharePoint Saturday Durban Presentation
 
Web Presen
Web PresenWeb Presen
Web Presen
 
Technical Utilities for your Site
Technical Utilities for your SiteTechnical Utilities for your Site
Technical Utilities for your Site
 
Introduction to mongodb
Introduction to mongodbIntroduction to mongodb
Introduction to mongodb
 
Document databases
Document databasesDocument databases
Document databases
 
WT - Web & Working of Search Engine
WT - Web & Working of Search EngineWT - Web & Working of Search Engine
WT - Web & Working of Search Engine
 
SharePoint Saturday 2010 - SharePoint 2010 Content Organizer Feature
SharePoint Saturday 2010 - SharePoint 2010 Content Organizer FeatureSharePoint Saturday 2010 - SharePoint 2010 Content Organizer Feature
SharePoint Saturday 2010 - SharePoint 2010 Content Organizer Feature
 
Introduction to Search Engines
Introduction to Search EnginesIntroduction to Search Engines
Introduction to Search Engines
 
working of search engine & SEO
working of search engine & SEOworking of search engine & SEO
working of search engine & SEO
 
Working of search engine
Working of search engineWorking of search engine
Working of search engine
 
Mongodb tutorial at Easylearning Guru
Mongodb tutorial  at Easylearning GuruMongodb tutorial  at Easylearning Guru
Mongodb tutorial at Easylearning Guru
 
Training Project Report on Search Engines
Training Project Report on Search EnginesTraining Project Report on Search Engines
Training Project Report on Search Engines
 

Similar to Building an unstructured data management solution with elastic search and amazon web services

Houston tech fest dev intro to sharepoint search
Houston tech fest   dev intro to sharepoint searchHouston tech fest   dev intro to sharepoint search
Houston tech fest dev intro to sharepoint search
Michael Oryszak
 
Share Point2007 Best Practices Final
Share Point2007 Best Practices FinalShare Point2007 Best Practices Final
Share Point2007 Best Practices Final
Marianne Sweeny
 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and Retrieval
Optum
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membase
Ardak Shalkarbayuli
 
Modern web search: Web Information Systems
Modern web search: Web Information SystemsModern web search: Web Information Systems
Modern web search: Web Information Systems
Artificial Intelligence Institute at UofSC
 
Modern web search: Lecture 11
Modern web search: Lecture 11Modern web search: Lecture 11
Modern web search: Lecture 11
Artificial Intelligence Institute at UofSC
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic Web
IOSR Journals
 
Spsvb Developer Intro to SharePoint Search
Spsvb   Developer Intro to SharePoint SearchSpsvb   Developer Intro to SharePoint Search
Spsvb Developer Intro to SharePoint Search
Michael Oryszak
 
Spsvb Developer Intro to SharePoint Search
Spsvb   Developer Intro to SharePoint SearchSpsvb   Developer Intro to SharePoint Search
Spsvb Developer Intro to SharePoint Search
Michael Oryszak
 
Essentials for the SharePoint Power User - SharePoint Engage Raleigh 2017
Essentials for the SharePoint Power User - SharePoint Engage Raleigh 2017Essentials for the SharePoint Power User - SharePoint Engage Raleigh 2017
Essentials for the SharePoint Power User - SharePoint Engage Raleigh 2017
Drew Madelung
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
ObjectRocket
 
B365 saturday practical guide to building a scalable search architecture in s...
B365 saturday practical guide to building a scalable search architecture in s...B365 saturday practical guide to building a scalable search architecture in s...
B365 saturday practical guide to building a scalable search architecture in s...
Thuan Ng
 
Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"
George Stathis
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit
 
Google indexing
Google indexingGoogle indexing
Google indexing
tahoor71
 
Implementing Site Search in CQ5 / AEM
Implementing Site Search in CQ5 / AEMImplementing Site Search in CQ5 / AEM
Implementing Site Search in CQ5 / AEM
rtpaem
 
Effective Searching by Dominik Kornas
Effective Searching by Dominik KornasEffective Searching by Dominik Kornas
Effective Searching by Dominik Kornas
AEM HUB
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
Ismail Mayat
 
Visualizing Austin's data with Elasticsearch and Kibana
Visualizing Austin's data with Elasticsearch and KibanaVisualizing Austin's data with Elasticsearch and Kibana
Visualizing Austin's data with Elasticsearch and Kibana
ObjectRocket
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
Nitin Pande
 

Similar to Building an unstructured data management solution with elastic search and amazon web services (20)

Houston tech fest dev intro to sharepoint search
Houston tech fest   dev intro to sharepoint searchHouston tech fest   dev intro to sharepoint search
Houston tech fest dev intro to sharepoint search
 
Share Point2007 Best Practices Final
Share Point2007 Best Practices FinalShare Point2007 Best Practices Final
Share Point2007 Best Practices Final
 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and Retrieval
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membase
 
Modern web search: Web Information Systems
Modern web search: Web Information SystemsModern web search: Web Information Systems
Modern web search: Web Information Systems
 
Modern web search: Lecture 11
Modern web search: Lecture 11Modern web search: Lecture 11
Modern web search: Lecture 11
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic Web
 
Spsvb Developer Intro to SharePoint Search
Spsvb   Developer Intro to SharePoint SearchSpsvb   Developer Intro to SharePoint Search
Spsvb Developer Intro to SharePoint Search
 
Spsvb Developer Intro to SharePoint Search
Spsvb   Developer Intro to SharePoint SearchSpsvb   Developer Intro to SharePoint Search
Spsvb Developer Intro to SharePoint Search
 
Essentials for the SharePoint Power User - SharePoint Engage Raleigh 2017
Essentials for the SharePoint Power User - SharePoint Engage Raleigh 2017Essentials for the SharePoint Power User - SharePoint Engage Raleigh 2017
Essentials for the SharePoint Power User - SharePoint Engage Raleigh 2017
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
 
B365 saturday practical guide to building a scalable search architecture in s...
B365 saturday practical guide to building a scalable search architecture in s...B365 saturday practical guide to building a scalable search architecture in s...
B365 saturday practical guide to building a scalable search architecture in s...
 
Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Google indexing
Google indexingGoogle indexing
Google indexing
 
Implementing Site Search in CQ5 / AEM
Implementing Site Search in CQ5 / AEMImplementing Site Search in CQ5 / AEM
Implementing Site Search in CQ5 / AEM
 
Effective Searching by Dominik Kornas
Effective Searching by Dominik KornasEffective Searching by Dominik Kornas
Effective Searching by Dominik Kornas
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Visualizing Austin's data with Elasticsearch and Kibana
Visualizing Austin's data with Elasticsearch and KibanaVisualizing Austin's data with Elasticsearch and Kibana
Visualizing Austin's data with Elasticsearch and Kibana
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 

Recently uploaded

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 

Recently uploaded (20)

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 

Building an unstructured data management solution with elastic search and amazon web services

  • 1. A document and page level retrieval solution powered by ElasticSearch proposed to handle a business requirement in Mobius Building an unstructured data management solutions with ElasticSearch and Amazon Web Services
  • 2. Topics Covered ❖ The Business need we faced ❖ Why ElasticSearch to meet our challenge? ❖ Adopting the Parent-Child relationship in ElasticSearch ❖ ElasticSearch Document Database Architecture ❖ Technical Implementation of the solution ■ Plugin Creation ■ Index Creation ■ Indexing parent document ■ Indexing child document ■ Retrieving documents by query ❖ Possible Search Types in ElasticSearch ❖ How we adapted the phrase search
  • 3. The Business need we faced ❖ A UK based energy intelligence company required a document store database to hold analysis and research documents ❖ The document could be in various file formats likePDF’s, Excel, text file etc.,. ❖ Two kinds of retrieval were needed - ➢ Page level Retrieval - To retrieve specific pages that matched the search content and tags. ➢ Document Level Retrieval - To retrieve an entire document based on the searched content and tags.
  • 4. Why ElasticSearch to meet our challenge? ❖ Other document level tagging and retrieval solutions like Aleph and OverviewDocs did not have a clear feature for page level retrieval ❖ Likeable Features of ElasticSearch include - ➢ Open-source, broadly-distributable, readily-scalable, enterprise-grade search engine. ➢ Can power extremely fast and accurate full-text searches for data discovery applications. ➢ Multiple configurations and variations available to tag and index documents in ElasticSearch like PDF’s, Excel etc., ➢ Capable to handle up to Petabytes of data and scalable to a large extent.
  • 5. Adopting the Parent-Child relationship in ElasticSearch ❖ Indexing in the document level was a common feature while page level indexing was not available by default ❖ A tailor-made solution for page level retrieval was to be built ❖ We adopted the Parent-Child relationship in ElasticSearch to cater to our needs. How would this work? ➢ In the Parent, Document meta information and Document Tags can be saved. ➢ Child can refer to the Parent type and can also index Page tags, Page content and page level Page meta information.
  • 6. Example of the Parent-Child relationship
  • 7. ElasticSearch Document Database Architecture Though ElasticSearch serves as the core search engine, to facilitate splitting, encoding and merging of pages during retrieval calls for a proper document database system The architecture comprises of four main parts - ❖ Parser ❖ AWS S3 Storage ❖ ElasticSearch ❖ Query Processor
  • 8. Overview of the ElasticSearch Document Database Architecture
  • 9. 1. Parser: ❖ Parses the documents, splits them, encodes them to base64 ❖ Pushes actual page without base64 encode to AWS S3 and encoded page to ElasticSearch along with AWS s3 location. 2. AWS S3 Storage: ❖ The document and pages of the document are saved here for later retrieval by the user. ❖ This is done so that when a user searches for a document, we initially hit the ElasticSearch, fetch the meta information about the document from there and then retrieve the corresponding document/page from AWS S3.
  • 10. 3. ElasticSearch: ElasticSearch serves as the core search engine for searching tags, documents and pages. 4. Query Processor: ❖ The end user queries the documents from here. ❖ When a search query is given, the query processor would - ➢ Query ElasticSearch and get the meta information. ➢ Retrieve the actual document/page from AWS S3. This two-step lookup keeps the search response small and fast. ❖ The result is then returned to the end user.
  • 11. Technical Implementation of the solution The retrieval process built on the ElasticSearch engine can be broken down into the following 5 steps - ● Plugin Creation ● Index Creation ● Indexing parent document ● Indexing child document ● Retrieving documents by query
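The two-step flow of the query processor can be sketched as follows. The search and fetch steps are passed in as callables so the sketch does not assume any particular client library; `search_fn`, `fetch_fn` and the `s3_location` field are hypothetical names.

```python
def process_query(text, search_fn, fetch_fn):
    """Run the two-step lookup described above.

    search_fn(text) is expected to return a list of hit dicts, each
    carrying an 's3_location' key (a hypothetical field name).
    fetch_fn(location) is expected to return the stored bytes.
    """
    # Step 1: query the search engine for meta information only.
    hits = search_fn(text)
    # Step 2: fetch the actual page/document for each hit from storage.
    return [
        {"meta": hit, "content": fetch_fn(hit["s3_location"])}
        for hit in hits
    ]
```

In a real deployment `search_fn` would wrap an ElasticSearch request and `fetch_fn` an S3 download; in-memory fakes are enough to exercise the flow.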
  • 12. 1. Plugin Creation - To create the database in ElasticSearch we have to convert the pages into base64-encoded content. We need to create an ingest pipeline, based on the attachment processor plugin, to ingest base64-encoded PDF, Word and similar files and index them into ElasticSearch. URL: http://localhost:9200/_ingest/pipeline/parser Method: PUT Body: { "description" : "Extract attachment information", "processors" : [ { "attachment" : { "field" : "data" } } ] }
  • 13. 2. Index Creation - An index has to be created to hold the documents. Since there were no special search requirements, a default index with a parent and child mapping was created. URL: http://localhost:9200/Index_name Method: PUT Body: { "mappings": { "document": {}, "pages": { "_parent": { "type": "document" } } } }
  • 14. 3. Indexing parent document - When a new document is added, we have to index the document-level details in the parent document using the API call below. URL: http://localhost:9200/Index_name/document/parent_id Method: POST Body: { Key:value }
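The pipeline call above could be prepared from Python roughly like this, using only the standard library. The function name and its defaults are assumptions for illustration; the URL and body are the ones shown on the slide. Building the request does not send it.

```python
import json
import urllib.request


def build_pipeline_request(base_url="http://localhost:9200", name="parser"):
    """Build (but do not send) the PUT request that creates the pipeline."""
    body = {
        "description": "Extract attachment information",
        # The attachment processor reads base64 content from the "data" field.
        "processors": [{"attachment": {"field": "data"}}],
    }
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the result to `urllib.request.urlopen` would send it to a running cluster, which is assumed to have the ingest attachment plugin installed.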
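A small sketch of building the mapping body and index URL in Python. The helper names are hypothetical. Two points are worth stating as context rather than as part of the original solution: ElasticSearch index names must be lowercase, so `Index_name` in the URL is a placeholder, and the `_parent` mapping with several types per index applies to ElasticSearch versions before 6.0.

```python
def build_index_mapping(parent_type="document", child_type="pages"):
    """Return the index body with a parent type and a child type."""
    return {
        "mappings": {
            parent_type: {},
            # The child type declares which type acts as its parent.
            child_type: {"_parent": {"type": parent_type}},
        }
    }


def index_url(index_name, base_url="http://localhost:9200"):
    """Return the index URL, rejecting names ElasticSearch would refuse."""
    if index_name != index_name.lower():
        raise ValueError("ElasticSearch index names must be lowercase")
    return f"{base_url}/{index_name}"
```

For example, `index_url("document_db")` matches the index name used in the phrase search slides.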
  • 15. 4. Indexing child document - Once the parent is created, the pages and the related page-level information can be indexed using the API call below. URL: http://localhost:9200/Index_name/pages/child_id?parent=parent_id&pipeline=parser Method: POST Body: { "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf", "title" : "Quick", "data": "SElHSEFDQ1VSQUNZUE9TVEFMQUREUkVTU0VYVFJBQ1RJT05GUk9NV0VCUEFHRVNieVpoZXl1YW5ZdVN1Ym1pdHRlZGlucGFydGlhbGZ1bGxsbWVudG9mdGhlcmVxdWlyZW1lbnRzZm9ydGhlZGVncmVlb2ZNYXN0ZXJvZkNvbXB1dGVyU2NpZW5jZWF0RGFsaG91c2llVW5pdmVyc2l0eUhhbGlmYXgsTm92YVNjb3RpYU1hcmNoMjAwN2NDb3B5cmlnaHRieVpoZXl1YW5ZdSwyMDA3" } Note: the "data" field holds the base64-encoded page.
  • 16. 5. Retrieving documents by query - A document can be queried by text, title or tags, and the method below can be used for all of them. URL: http://localhost:9200/Index_name/pages/_search Method: POST Body: { "query": { "match": { "attachment.content": { "query": "lorem" } } } }
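The child-indexing call combines a URL carrying the `parent` and `pipeline` parameters with a body whose `data` field is base64-encoded. A hedged sketch of assembling both in Python follows; the function name and its arguments are assumptions, while the URL shape follows the slide.

```python
import base64
from urllib.parse import quote, urlencode


def build_child_request(index, child_id, parent_id, page_bytes,
                        base_url="http://localhost:9200",
                        pipeline="parser", extra_fields=None):
    """Return (url, body) for indexing one page as a child document."""
    # parent routes the page to its document; pipeline runs the
    # attachment processor on the "data" field at ingest time.
    query = urlencode({"parent": parent_id, "pipeline": pipeline})
    url = f"{base_url}/{index}/pages/{quote(str(child_id), safe='')}?{query}"
    body = dict(extra_fields or {})
    body["data"] = base64.b64encode(page_bytes).decode("ascii")
    return url, body
```

The returned body can be serialised with `json.dumps` and sent as the POST payload.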
  • 17. Possible Search Types in ElasticSearch There are many search types in ElasticSearch by default. Below are a few of them -
  • 18.
  • 19. How we adapted the phrase search ❖ Our business requirement was to perform a phrase search for content matching and exact match for tag matching. ❖ We used two types of phrase searches ➢ Page Phrase Search ➢ Document Phrase Search
  • 20. Page Phrase Search URL: http://localhost:9200/document_db/pages/_search { "query": { "bool": { "must": [ { "match_phrase": { "attachment.content":{ "query":"1Q17" } } } ] } },
  • 21. "_source": [ "_type", "_id", "Page_Number", "type", "File_Name" ], "highlight" : { "fields" : { "attachment.content" : {} } } } Note: In this page search we return only the needed fields by listing them in the _source field. This avoids retrieving the page's base64-encoded content, which would increase both the size of the response and the latency.
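The page phrase search body from the two slides above can be generated for any phrase with a small helper. This is a sketch with a hypothetical function name. It leaves `_type` and `_id` out of the `_source` list on the assumption that ElasticSearch returns those as hit metadata regardless of source filtering.

```python
def page_phrase_query(phrase, source_fields=("Page_Number", "type", "File_Name")):
    """Return a search body that phrase-matches page content."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"attachment.content": {"query": phrase}}}
                ]
            }
        },
        # Return only small fields, not the base64 page content.
        "_source": list(source_fields),
        # Ask for highlighted snippets of the matched content.
        "highlight": {"fields": {"attachment.content": {}}},
    }
```

Calling `page_phrase_query("1Q17")` reproduces the query on the slide, apart from the trimmed `_source` list.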
  • 22. Document Phrase Search URL: http://localhost:9200/document_db/document/_search { "query": { "bool": { "must": [{ "has_child": { "type": "pages", "query": { "match_phrase": { "attachment.content": "1-800-SEC-0330." } } } } ] } } }
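The document phrase search wraps the same phrase match in a `has_child` clause, so the hits are parent documents whose pages contain the phrase. A sketch of a builder for that body, with a hypothetical function name:

```python
def document_phrase_query(phrase, child_type="pages"):
    """Return a search body matching parents with a child page containing phrase."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "has_child": {
                            "type": child_type,
                            # The inner query runs against the child pages.
                            "query": {
                                "match_phrase": {"attachment.content": phrase}
                            },
                        }
                    }
                ]
            }
        }
    }
```

The body is posted to the `document` type's `_search` endpoint, as in the URL on the slide, because the results are parents rather than pages.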
  • 23. Concluding Thoughts ❖ The solution outlined here is used as our document store database for document/page retrieval. ❖ Its response time ranges from a few milliseconds to a few seconds. ❖ Though the current scope of the solution is limited to PDF documents, we are planning to extend it to other document types such as spreadsheets and text files. ❖ Do you have another or a similar workaround for document retrieval? Share your ideas in the comment section or mail us at support@mobiusservices.com.
  • 24. Do visit our blog on the topic here https://blog.mobiusdata.com/building-unstructured-data-management-solution-with-elasticsearch-and-aws/ Thank You