Building an unstructured data management solution with ElasticSearch and Amazon Web Services

A document and page level retrieval solution powered by ElasticSearch,
proposed to handle a business requirement at Mobius

Topics Covered
❖ The Business need we faced
❖ Why ElasticSearch to meet our challenge?
❖ Adopting the Parent-Child relationship in ElasticSearch
❖ ElasticSearch Document Database Architecture
❖ Technical Implementation of the solution
■ Plugin Creation
■ Index Creation
■ Indexing parent document
■ Indexing child document
■ Retrieving documents by query
❖ Possible Search Types in ElasticSearch
❖ How we adapted the phrase search

The Business need we faced
❖ A UK based energy intelligence company required a document store database to hold
analysis and research documents
❖ The documents could be in various file formats such as PDF, Excel and plain text.
❖ Two kinds of retrieval were needed -
➢ Page level Retrieval - To retrieve specific pages that matched the search content
and tags.
➢ Document Level Retrieval - To retrieve an entire document based on the searched
content and tags.

Why ElasticSearch to meet our challenge?
❖ Other document level tagging and retrieval solutions like Aleph and OverviewDocs did
not have a clear feature for page level retrieval
❖ Likeable Features of ElasticSearch include -
➢ Open-source, broadly-distributable, readily-scalable, enterprise-grade search
engine.
➢ Can power extremely fast and accurate full-text searches for data discovery
applications.
➢ Multiple configurations and variations are available to tag and index documents
such as PDF and Excel files in ElasticSearch.
➢ Capable of handling up to petabytes of data and highly scalable.

Adopting the Parent-Child relationship in ElasticSearch
❖ Indexing at the document level was a common feature, while page level indexing
was not available by default
❖ A tailor-made solution for page level retrieval had to be built
❖ We adopted the Parent-Child relationship in ElasticSearch to cater to our needs.
How would this work?
➢ The Parent stores the document meta information and document tags.
➢ Each Child refers to its Parent and indexes the page tags, page content
and page level meta information.
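The parent/child split described above can be pictured with a small data model. This is only an illustrative sketch: the class names, field names and sample values are assumptions made for the example, not the schema used in the actual solution.

```python
from dataclasses import dataclass, field


@dataclass
class ParentDocument:
    """Document-level record: meta information and document tags."""
    doc_id: str
    file_name: str
    tags: list = field(default_factory=list)


@dataclass
class ChildPage:
    """Page-level record that points back to its parent document."""
    parent_id: str
    page_number: int
    content: str
    tags: list = field(default_factory=list)


def pages_of(parent, pages):
    """Return only the pages whose parent_id matches the given document."""
    return [p for p in pages if p.parent_id == parent.doc_id]


# Hypothetical sample data.
report = ParentDocument("doc-1", "report.pdf", tags=["energy"])
pages = [
    ChildPage("doc-1", 1, "Executive summary", tags=["summary"]),
    ChildPage("doc-1", 2, "Quarterly outlook"),
    ChildPage("doc-2", 1, "Unrelated page"),
]
print([p.page_number for p in pages_of(report, pages)])
```

Keeping the page records separate from the document record is what lets a search return either individual pages or the document that owns them.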

Example of the Parent-Child relationship

ElasticSearch Document Database Architecture
ElasticSearch serves as the core search engine, but splitting, encoding and
merging pages during retrieval calls for a proper document database system
around it.
The architecture comprises four main parts -
❖ Parser
❖ AWS S3 Storage
❖ ElasticSearch
❖ Query Processor

Overview of the ElasticSearch Document Database Architecture

1. Parser:
❖ Parses the documents, splits them into pages and encodes each page to base64
❖ Pushes the original (unencoded) page to AWS S3, and the base64 encoded page
to ElasticSearch along with its AWS S3 location.
2. AWS S3 Storage:
❖ The document and pages of the document are saved here for later retrieval
by the user.
❖ This is done so that when a user searches for a document, we first query
ElasticSearch, fetch the meta information about the document from
there and then retrieve the corresponding document/page from AWS S3.
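The parser's hand-off to the two stores might look roughly like the sketch below. It assumes the pages have already been split into raw bytes, and the bucket name, the S3 key layout and the `s3_location` field are hypothetical choices for illustration. No upload is performed here; the function only prepares what would be sent to each store.

```python
import base64


def build_page_records(doc_id, pages, bucket="example-bucket"):
    """For each raw page, pair the S3 object (raw bytes) with the
    ElasticSearch body (base64 copy plus the S3 location)."""
    records = []
    for number, raw in enumerate(pages, start=1):
        s3_key = f"{doc_id}/page-{number}.pdf"  # hypothetical key layout
        es_body = {
            "Page_Number": number,
            "s3_location": f"s3://{bucket}/{s3_key}",  # hypothetical field
            "data": base64.b64encode(raw).decode("ascii"),
        }
        records.append({"s3_key": s3_key, "s3_bytes": raw, "es_body": es_body})
    return records


records = build_page_records("doc-1", [b"page one", b"page two"])
print(records[1]["s3_key"], records[1]["es_body"]["s3_location"])
```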

3. ElasticSearch:
ElasticSearch serves as the core search engine for searching tags, documents and
pages.
4. Query Processor:
❖ The end user will query the document from here.
❖ When a search query is given, the query processor would -
➢ Query ElasticSearch and get the meta information
➢ Retrieve the actual document/page from AWS S3. This is done to attain
maximum speed and performance.
❖ The result will then be published to the end user.
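The two-step lookup performed by the query processor can be sketched as follows. The `s3_location` field name and the injected `fetch_from_s3` callable are assumptions for the example; in a real deployment the callable would wrap an S3 client, while here a dictionary stands in for the bucket.

```python
def resolve_hits(search_response, fetch_from_s3):
    """Turn search hits into results by fetching each hit's stored object."""
    results = []
    for hit in search_response["hits"]["hits"]:
        location = hit["_source"]["s3_location"]  # hypothetical field name
        results.append({
            "id": hit["_id"],
            "location": location,
            "content": fetch_from_s3(location),
        })
    return results


# Stand-ins for a real search response and a real S3 client.
fake_store = {"s3://example-bucket/doc-1/page-2.pdf": b"%PDF page 2"}
fake_response = {
    "hits": {
        "hits": [
            {
                "_id": "doc-1-p2",
                "_source": {"s3_location": "s3://example-bucket/doc-1/page-2.pdf"},
            }
        ]
    }
}
results = resolve_hits(fake_response, fake_store.__getitem__)
print(results[0]["id"], len(results[0]["content"]))
```

Only the small meta record travels through the search engine; the heavier page or document bytes are fetched from storage once the hit is known.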

Technical Implementation of the solution
The retrieval process built on the ElasticSearch engine can be broadly
broken down into the following 5 steps -
● Plugin Creation
● Index Creation
● Indexing parent document
● Indexing child document
● Retrieving documents by query

1. Plugin Creation - To create the database in ElasticSearch we have to convert the pages
into base64 encoded content. We need an ingest pipeline, backed by the attachment
processor plugin, to ingest base64 encoded PDF, Word and similar files and index them
into ElasticSearch.
URL: http://localhost:9200/_ingest/pipeline/parser
Method: PUT
Body: {
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data"
}
}
]
}
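For readers who prefer to script this call, here is a minimal sketch that builds the same PUT request with Python's standard library. The function name is made up for the example, and the request is only constructed, not sent, so it can be inspected without a running cluster.

```python
import json
import urllib.request


def build_pipeline_request(base_url="http://localhost:9200", name="parser"):
    """Build (but do not send) the PUT request that registers the pipeline."""
    body = {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": "data"}}],
    }
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_pipeline_request()
print(request.get_method(), request.full_url)
# To send it against a running cluster: urllib.request.urlopen(request)
```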

2. Index Creation - An index has to be created to hold the documents. Since there were no
special search requirements, a default index with parent and child mappings was created.
URL: http://localhost:9200/Index_name
Method: PUT
Body: {
"mappings": {
"document": {},
"pages": {
"_parent": {
"type": "document"
}
}
}
}
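The mapping body above can also be generated in code. The sketch below assumes an ElasticSearch version that still supports the `_parent` mapping, and the helper name is hypothetical. It also guards against one practical detail: real index names must be lowercase, so a placeholder such as Index_name has to be replaced with a lowercase name like the document_db index used in the search examples.

```python
def build_parent_child_mapping(index_name, parent_type="document", child_type="pages"):
    """Return the request path and body for a parent/child index."""
    if index_name != index_name.lower():
        raise ValueError("index names must be lowercase")
    body = {
        "mappings": {
            parent_type: {},
            child_type: {"_parent": {"type": parent_type}},
        }
    }
    return f"/{index_name}", body


path, body = build_parent_child_mapping("document_db")
print(path, body["mappings"]["pages"])
```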

3. Indexing parent document - When a new document is added, we index its document
level details in the parent document using the API call below.
URL: http://localhost:9200/Index_name/document/parent_id
Method: POST
Body: {
"key": "value"
}
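As a rough illustration of what the key/value body could contain, the sketch below assembles a parent request from document meta information and tags. The `tags` field and the sample values are assumptions; the source only states that document level details are stored as key/value pairs.

```python
import json
from urllib.parse import quote


def build_parent_request(index_name, parent_id, meta, tags):
    """Return the request path and JSON payload for a parent document."""
    path = f"/{quote(index_name)}/document/{quote(parent_id, safe='')}"
    body = {**meta, "tags": list(tags)}  # "tags" is a hypothetical field name
    return path, json.dumps(body)


path, payload = build_parent_request(
    "document_db", "report 2017", {"File_Name": "report.pdf"}, ["energy", "outlook"]
)
print(path)
print(payload)
```

Percent-encoding the identifier keeps the URL valid when a document id contains spaces or slashes.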

4. Indexing child document - Once the parent is created, the pages and the related
information in the pages can be indexed using the API below.
URL: http://localhost:9200/Index_name/pages/child_id?parent=parent_id&pipeline=parser
METHOD: POST
Body: {
"filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
"title" : "Quick",
"data": "SElHSEFDQ1VSQUNZUE9TVEFMQUREUkVTU0VYVFJBQ1RJT05GUk9NV0VCUEFHRVNieVpoZXl1YW5ZdVN1Ym1pdHRlZGlucGFydGlhbGZ1bGxsbWVudG9mdGhlcmVxdWlyZW1lbnRzZm9ydGhlZGVncmVlb2ZNYXN0ZXJvZkNvbXB1dGVyU2NpZW5jZWF0RGFsaG91c2llVW5pdmVyc2l0eUhhbGlmYXgsTm92YVNjb3RpYU1hcmNoMjAwN2NDb3B5cmlnaHRieVpoZXl1YW5ZdSwyMDA3"
}
Note: "data" holds the base64 encoded page.
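The child request can be assembled programmatically along these lines. This is a sketch with a hypothetical helper name and sample identifiers; it base64 encodes the page bytes and adds the `parent` and `pipeline` query parameters used in the URL above.

```python
import base64
from urllib.parse import urlencode


def build_child_request(index_name, child_id, parent_id, page_bytes, title,
                        pipeline="parser"):
    """Return the request path and body for indexing one page as a child."""
    query = urlencode({"parent": parent_id, "pipeline": pipeline})
    path = f"/{index_name}/pages/{child_id}?{query}"
    body = {
        "title": title,
        "data": base64.b64encode(page_bytes).decode("ascii"),
    }
    return path, body


path, body = build_child_request(
    "document_db", "doc-1-p1", "doc-1", b"Quick brown fox", "Quick"
)
print(path)
```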

5. Retrieving documents by query - A document can be queried by text, title or tags;
the method below can be used for all of them.
URL: http://localhost:9200/Index_name/pages/_search
METHOD: POST
Body: {
"query": {
"match": {
"attachment.content": {
"query": "lorem"
}
}
}
}
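A small helper pair can build this query body and pull the matching ids out of a response. The response used here is a hand-written stand-in that follows the usual `hits.hits` layout; its contents are invented for the example.

```python
def match_query(text, field="attachment.content"):
    """Build a match query body for the given field."""
    return {"query": {"match": {field: {"query": text}}}}


def hit_ids(response):
    """Collect the _id of every hit in a search response."""
    return [hit["_id"] for hit in response.get("hits", {}).get("hits", [])]


body = match_query("lorem")
sample_response = {"hits": {"hits": [{"_id": "doc-1-p1"}, {"_id": "doc-1-p4"}]}}
print(body)
print(hit_ids(sample_response))
```

Passing a different `field` value lets the same helper search a title or tag field instead of the extracted page content.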

Possible Search Types in ElasticSearch
There are many search types in ElasticSearch by default. Below are a few of them -

How we adapted the phrase search
❖ Our business requirement was to perform a phrase search for content matching and
exact match for tag matching.
❖ We used two types of phrase searches
➢ Page Phrase Search
➢ Document Phrase Search

Page Phrase Search
URL: http://localhost:9200/document_db/pages/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"attachment.content":{
"query":"1Q17"
}
}
}
]
}
},

"_source": [
"_type",
"_id",
"Page_Number",
"type",
"File_Name"
],
"highlight" : {
"fields" : {
"attachment.content" : {}
}
}
}
Note:
In this page search we select only the fields we need by
listing them in the _source field. This avoids retrieving the
page and its base64 encoded content, which would increase
both the size of the response and the latency.
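The effect of the _source filter can be sketched in code: the request asks only for light fields, and each hit is then reduced to an id, a page number and its highlighted snippets. The helper names and the sample hit are assumptions for illustration.

```python
def page_phrase_search(phrase, fields=("Page_Number", "type", "File_Name")):
    """Build a phrase search that returns only light fields and highlights."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"attachment.content": {"query": phrase}}}
                ]
            }
        },
        "_source": list(fields),
        "highlight": {"fields": {"attachment.content": {}}},
    }


def summarize(hit):
    """Reduce one hit to its id, page number and highlighted snippets."""
    return {
        "id": hit["_id"],
        "page": hit["_source"].get("Page_Number"),
        "snippets": hit.get("highlight", {}).get("attachment.content", []),
    }


body = page_phrase_search("1Q17")
sample_hit = {
    "_id": "doc-1-p3",
    "_source": {"Page_Number": 3, "File_Name": "report.pdf"},
    "highlight": {"attachment.content": ["results for <em>1Q17</em>"]},
}
print(body["_source"])
print(summarize(sample_hit))
```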

Document Phrase Search
URL: http://localhost:9200/document_db/document/_search
{
"query": {
"bool": {
"must": [{
"has_child": {
"type": "pages",
"query": {
"match_phrase": {
"attachment.content": "1-800-SEC-0330."
}
}
}
}
]
}
}
}
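The document level query can be wrapped in a helper that also covers the exact tag match mentioned in the requirement. The `tags` field name is hypothetical, and the term filter assumes that field is indexed as an exact (non-analyzed) value; treat this as a sketch rather than the query used in production.

```python
def document_phrase_search(phrase, tag=None, child_type="pages"):
    """Find parent documents that have a child page containing the phrase,
    optionally restricted to documents carrying an exact tag."""
    has_child = {
        "has_child": {
            "type": child_type,
            "query": {"match_phrase": {"attachment.content": phrase}},
        }
    }
    body = {"query": {"bool": {"must": [has_child]}}}
    if tag is not None:
        # Hypothetical "tags" field, matched exactly with a term filter.
        body["query"]["bool"]["filter"] = [{"term": {"tags": tag}}]
    return body


plain = document_phrase_search("1-800-SEC-0330.")
tagged = document_phrase_search("1-800-SEC-0330.", tag="energy")
print(tagged["query"]["bool"]["filter"])
```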

Concluding Thoughts
❖ The solution outlined here is used as our document store database for document/page
retrieval.
❖ Its response time ranges from a few milliseconds to a few seconds.
❖ Though the current scope of the solution is limited to PDF documents, we are planning
to extend the same to other document types like spreadsheets and text files.
❖ Do you have another or similar workaround for document retrieval? Share your ideas
in the comment section or mail us at support@mobiusservices.com.

Do visit our blog on the topic here
https://blog.mobiusdata.com/building-unstructured-data-management-solution-with-elasticsearch-and-aws/
Thank You
