DesignSafe: Using Elasticsearch to Share and Search Data on a Science Web Portal
Josue Balandrano Coronel
Stephen Mock
Texas Advanced Computing Center
Context
- What is DesignSafe?
- Natural Hazards Engineering Research Infrastructure
- Shared-use research infrastructure
- Users within a project
- Users and Experimental Facilities
- Infrastructure
Context: DesignSafe Architecture
- Science Gateway: Django middleware
- Distributed Services: Agave, Elasticsearch, RabbitMQ, Custom APIs
- HPC: Stampede, Maverick, Corral
Context
- Infrastructure
- Data Depot
- Workspace
- Reconnaissance
Context
- What is Agave?
- Provides a holistic view of core computing concepts
- Abstraction layer on top of HPC systems (execution and storage)
- File permissions and access
- Simpler ACL interface
Data Depot
Problem
- Discoverable and searchable data
- Main queries:
- Give me every file/folder I have access to that is not in my home dir
- Search within the context of the UI
Elasticsearch
- Search engine based on Lucene
- RESTful API
- Schema-free JSON documents
- Distributed
- Near Realtime
Elasticsearch - Analyzers
- Consists of 3 blocks:
- Character filters, e.g. removing HTML tags
- Tokenizers, e.g. hierarchical: “username/path/to/file.txt” => [“username”, “username/path”, “username/path/to”, “username/path/to/file.txt”]
- Token filters, e.g. case insensitivity (lowercasing) or removing stop words
- Out of the box or custom
- Standard: Divides terms on word boundaries and applies a lowercase token filter
- Keyword: No-op analyzer
- Custom Hierarchical: Breaks on a specific character
- Language: removes stop words, excludes keywords, applies stemming
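The hierarchical tokenizer output listed above can be reproduced with a few lines of Python. This is only a sketch of the token output, not Elasticsearch code, and the function name `path_hierarchy_tokens` is hypothetical. Elasticsearch ships a built-in `path_hierarchy` tokenizer with a configurable `delimiter`, and a custom hierarchical analyzer would presumably wrap it.

```python
def path_hierarchy_tokens(value, delimiter="/"):
    """Return every ancestor prefix of `value`, shortest first.

    Illustrative only: mimics the tokens a hierarchical tokenizer emits
    for simple inputs; it is not Elasticsearch's implementation.
    """
    # Drop empty segments so a trailing delimiter does not add a token.
    parts = [part for part in value.split(delimiter) if part]
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


print(path_hierarchy_tokens("username/path/to/file.txt"))
print(path_hierarchy_tokens("designsafe.storage.default", delimiter="."))
```

Because every ancestor becomes its own token, an exact-term lookup for a parent folder matches everything stored beneath it.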
Elasticsearch
“name”: “file.txt” => “file.txt”
[“file”, “txt”]
“systemId”: “designsafe.storage.default” => “designsafe.storage.default”
[“designsafe”, “designsafe.storage”, “designsafe.storage.default”]
“path”: “username/path/to” => “username/path/to”
[“username”, “username/path”, “username/path/to”]
Elasticsearch - Mappings
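The mapping slides were images and their content did not survive, so the following is a hedged sketch of what index settings and mappings for the three fields might look like, written as a plain Python dict. Every analyzer, tokenizer and sub-field name here is an assumption, and the typeless `mappings.properties` layout assumes Elasticsearch 7 or later. The built-in `simple` analyzer is used for `name` because it splits on non-letter characters, which would give [“file”, “txt”]; the analyzer the authors actually used is not shown.

```python
import json

# Hypothetical index definition: custom hierarchical analyzers for
# `path` and `systemId`, plus a raw keyword copy of each field.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"},
                "system_tokenizer": {"type": "path_hierarchy", "delimiter": "."},
            },
            "analyzer": {
                "path_analyzer": {"type": "custom", "tokenizer": "path_tokenizer"},
                "system_analyzer": {"type": "custom", "tokenizer": "system_tokenizer"},
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "simple",
                "fields": {"raw": {"type": "keyword"}},
            },
            "systemId": {
                "type": "text",
                "analyzer": "system_analyzer",
                "fields": {"raw": {"type": "keyword"}},
            },
            "path": {
                "type": "text",
                "analyzer": "path_analyzer",
                "fields": {"raw": {"type": "keyword"}},
            },
        }
    },
}

# This JSON is what would be sent as the body of an index-creation request.
print(json.dumps(index_body, indent=2))
```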
Elasticsearch
- List all the files/folders I have access to in a specific system which are not in my home directory
- List all the files/folders I have access to in a specific system under a specific folder
- List all the files/folders which match a specific query string
- List all the files/folders in my home directory which match a specific query string
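The query bodies shown on these slides were images, so here is a minimal sketch of how the first two listings could be written with the Elasticsearch `bool` query. It assumes `path` and `systemId` are indexed with a hierarchical analyzer, so a `term` lookup on a parent matches everything below it. The `permissions.username` field and both function names are hypothetical.

```python
def shared_with_user_query(username, system_id):
    """Files the user can access in a system, excluding the home directory."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"systemId": system_id}},
                    # Hypothetical field holding the users granted access.
                    {"term": {"permissions.username": username}},
                ],
                # With hierarchical tokens, the home folder name is a token
                # of every path beneath it, so this excludes the whole tree.
                "must_not": [{"term": {"path": username}}],
            }
        }
    }


def folder_listing_query(username, system_id, folder):
    """Files the user can access in a system under a specific folder."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"systemId": system_id}},
                    {"term": {"path": folder}},
                    {"term": {"permissions.username": username}},
                ]
            }
        }
    }
```

Using `filter` rather than `must` keeps these clauses out of relevance scoring, which suits yes/no conditions such as system, folder and access.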
Elasticsearch - Simple Query String
- Simple language:
+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
- Will never return an error, discards invalid parts of the query.
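As a sketch of how user input could be passed to `simple_query_string` while staying inside the home directory, the request body can be built as a plain dict. The function name is hypothetical, and searching only the `name` field is an assumption; the notes mention that extra metadata is searched as well.

```python
def home_directory_search(username, user_input):
    """Search file names under the user's home directory."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "simple_query_string": {
                            # Passed through untouched: invalid syntax is
                            # discarded by Elasticsearch rather than raising.
                            "query": user_input,
                            "fields": ["name"],
                            "default_operator": "and",
                        }
                    }
                ],
                # Assumes `path` is indexed hierarchically, so the home
                # folder name matches every path beneath it.
                "filter": [{"term": {"path": username}}],
            }
        }
    }


body = home_directory_search("alice", 'report* + "final draft" -old')
print(body["query"]["bool"]["must"][0]["simple_query_string"]["query"])
```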
Elasticsearch - Caveats
- Manage deduplication
- Not a persistent DB: need a way to recreate the index quickly
- Synchronizing data
- Access management
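One common way to handle the deduplication and re-indexing caveats is to derive each document ID deterministically from the file's location, so that indexing the same file again overwrites the existing document instead of adding a copy. The deck does not say whether DesignSafe does this; the sketch below and the name `document_id` are assumptions.

```python
import hashlib


def document_id(system_id, path, name):
    """Build a stable ID from the storage system, parent path and file name."""
    # Normalise slashes so "folder" and "folder/" produce the same key.
    key = "/".join(part.strip("/") for part in (system_id, path, name) if part)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


print(document_id("designsafe.storage.default", "username/path/to", "file.txt"))
```

Indexing with an explicit ID replaces any document that already has that ID, so a full re-crawl of the storage system can rebuild the index without producing duplicates.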
Elasticsearch - Other Uses
- Site-wide search
- Publications metadata
- Quick metrics calculations
Thank You
Special thanks to:
- DesignSafe Team
- TACC
- Stephen Mock
- PEARC
- My wife: Gigimaria Flores
Email: jcoronel@tacc.utexas.edu
Twitter: @eusoj_xirdneh
IRC: josuebc @ freenode

PEARC17: Designsafe: Using Elasticsearch to Share and Search Data on a Science Web Portal

Editor's Notes

  1. Before diving into what Elasticsearch is and how we use it, let’s explain a little bit of context.
  2. What is DesignSafe?
  3. DesignSafe is a Science Gateway for the Natural Hazards Engineering community.
  4. At its core DesignSafe is a Shared-use research infrastructure,
  5. allowing users to share data and applications and to collaborate with other users within a project
  6. and with remote experimental facilities
  7. Now, let’s take a quick look at the architecture so we can have a better idea of how we manage data.
  8. Starting from what the user sees, we have a middleware implemented with Django and Python. This is the actual web portal.
  9. Behind it we have multiple distributed services: Elasticsearch, message queues, custom APIs and Agave (I’ll talk about Agave in a minute).
  10. Behind that we have all of our HPC systems: execution systems like Stampede and Maverick, and storage like Corral.
  11. The main components of DesignSafe’s infrastructure are:
  12. the Data Depot, which is where a user can manage, discover and share data.
  13. the Workspace, where a user has access to different applications which run on different HPC systems
  14. and the Reconnaissance portal where users can upload and visualize geospatial data.
  15. I mentioned Agave. So, what is Agave?
  16. As we can see in this graphic, we use Agave as our main point of interaction with our HPC systems.
  17. It is basically an abstraction layer on top of all the HPC resources we use.
  18. This is an important concept because Agave allows us to easily manage file permissions and access,
  19. as well as providing a simple ACL interface, all through friendly REST endpoints.
  20. Now, let’s focus on the Data Depot. As we can see we have different sections in the data depot.
  21. My Data is all your private data, this is your home directory.
  22. Here you can share data with any user
  23. and give it read or read/write permission through this interface
  24. Everything that has been shared with you will appear here. All of this data is also searchable.
  25. We also offer a collaboration section called My Projects. Here, a set of users are members of a project, and every member automatically has full access to everything within that project. This section also allows users to curate data and eventually create a publication, but that is outside the scope of this presentation.
  26. There’s the Published section, where we list all the publications we have. All of these publications have DOIs and the metadata is properly rendered.
  27. I won’t go into much detail about the different types of publications that we have, but keep in mind that all of the published metadata is also stored in Elasticsearch. We have some legacy publications which look like this
  28. While newer publications look like this. As we can see these are two different data models.
  29. As a counterpart we have Community Data, which is data that is public but is not a proper publication. We mainly store tutorials and examples there.
  30. Finally, we also allow users to connect external services like Box or Dropbox so they can move data to and from these external resources.
  31. Now that we have an idea of all the different types of data we manage in the Data Depot, we can better grasp what the issue is.
  32. All of this data has to be searchable and discoverable.
  33. So, after a lot of thinking about this we realized that we are mainly implementing two queries.
  34. One is: give me everything I have access to that is not in my home directory. With this query we get everything that has been shared with a specific user and we can work within that context.
  35. The other query is to get everything pertaining to the Data Depot section the user currently is in.
  36. In order to create these queries we decided to use Elasticsearch,
  37. which is a search engine based on Lucene.
  38. Elasticsearch gives you a nice RESTful API
  39. and allows us to store schema-free JSON documents
  40. as well as being distributed. These last two characteristics are really important to us because the only thing we were sure about is that we did not know the structure of the data we were going to manage and we did not know how fast it was going to grow.
  41. Elasticsearch is also near real-time, which means that a document becomes searchable shortly after being written. With the default refresh interval this takes about one second.
  42. So, let’s take a look at how we are indexing files with Elasticsearch so we can query that information. This is called a document in Elasticsearch. As we can see most of this information is what we get from the “stat” command. Name, length, last modified, etc…
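To make the shape of such a document concrete, here is a minimal sketch that builds one from the metadata a “stat” call returns. The field names name, path, systemId, length and lastModified follow the ones mentioned in the talk; the function name, the "type" field and the system id value in the usage note are assumptions made for illustration, not DesignSafe's actual schema.

```python
import os
from datetime import datetime, timezone


def build_file_document(root, relative_path, system_id):
    """Build a search document for one file or folder from its stat() metadata.

    `root` is the directory the storage system is mounted on and
    `relative_path` is the path inside that storage system.
    """
    full_path = os.path.join(root, relative_path)
    info = os.stat(full_path)
    return {
        "name": os.path.basename(relative_path),
        "path": relative_path,
        "systemId": system_id,
        "length": info.st_size,
        # Store the modification time as an ISO 8601 string in UTC.
        "lastModified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Hypothetical extra field: distinguish folders from files.
        "type": "dir" if os.path.isdir(full_path) else "file",
    }
```

A call such as build_file_document("/corral/mydata", "jdoe/report.pdf", "hypothetical.storage.default") would return a plain dictionary that can be serialized to JSON and sent to Elasticsearch.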
  43. We are going to focus on three specific fields: name, systemId and path. Most of our queries are going to target these fields. There are some other metrics that can be aggregated from other fields shown here. But the thing is that indexing files in Elasticsearch requires planning. We need to figure out how we are going to use the fields that get indexed. Since we already have an idea about the queries we are going to be executing, we know how we are going to use these fields. We know that we are going to filter documents depending on one or more parent folders, and as we can see we are storing this information in the field “path”. We also need to filter files depending on a specific Data Depot section. What we are doing here is creating a storage system for each one of the Data Depot sections previously described. This helps us differentiate where every file is, and it is easier to manage with Agave and Corral. So, we can also see a systemId, which is the identifier for that specific storage system. Finally, we need to pay extra attention to the name because we want the user to be able to query filenames as well as extensions, and even extra metadata that we are not showing here to keep the example simple. By extra metadata I mean information like user-defined keywords, descriptions and other community-specific data.
  44. Then we have to see if we need to manipulate any of these fields in order to make our queries faster. It is usually better to store the data already transformed instead of transforming it on the fly. Elasticsearch introduces the concept of analyzers. Analyzers transform data as it is being stored, so it is easier and faster to apply different queries to the same data.
  45. Analyzers consist of three blocks:
  46. Character filters, which receive the data as a stream of characters and can be used to add, remove or change characters
  47. e.g. removing HTML tags
  48. Tokenizers, which receive a stream of characters, break it up into individual tokens and output these tokens.
  49. e.g. we can use a tokenizer to store a better representation of a file path. This is called a hierarchical tokenizer (path_hierarchy in Elasticsearch). It receives the path as a string and outputs an array with every level of the hierarchy on that path. This is what allows us to quickly filter all the files under a specific folder, regardless of how many children or subfolders that folder has.
  50. Then we have token filters, which receive token streams and may add, remove or change tokens.
  51. Can be used to lower case tokens or remove stop words.
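The hierarchical tokenizer described above is easy to picture with a small sketch. The function below is a pure-Python approximation of what Elasticsearch's path_hierarchy tokenizer emits for a path with the default "/" delimiter; it is only an illustration of the output, not the tokenizer's implementation, and the function name and sample path are assumptions.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return every ancestor prefix of `path`, shortest first.

    Approximates the output of a path-hierarchy tokenizer: one token per
    level of the hierarchy, each token containing all the levels above it.
    """
    leading = delimiter if path.startswith(delimiter) else ""
    parts = [part for part in path.split(delimiter) if part]
    return [
        leading + delimiter.join(parts[:depth])
        for depth in range(1, len(parts) + 1)
    ]


print(path_hierarchy_tokens("jdoe/projects/run1"))
# ['jdoe', 'jdoe/projects', 'jdoe/projects/run1']
```

Because every ancestor folder is stored as its own token, checking whether a file lives somewhere under "jdoe/projects" becomes a single exact-match lookup on the token list, no matter how deep the file is.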
  52. There are plenty of analyzers Elasticsearch offers out of the box and one can create a custom analyzer.
  53. The main analyzers we use are: Standard, which divides terms on word boundaries and lowercases the stream.
  54. Keyword, which is basically a noop analyzer, meaning that the string will not be touched when being stored.
  55. A custom one which only has a hierarchical tokenizer
  56. and an English-specific analyzer, which removes common stop words, stems words, and can exclude custom keywords from stemming.
  57. As an example, let’s take a look at a simple file document and how analyzers transform some of these fields. We are using the standard analyzer on the file name. This transforms the data by lowercasing everything, which makes it case insensitive, and breaking the name into words. This allows the user to search on extensions or partial names. When we store this field we store two values: one is the transformed value and the other one uses the keyword analyzer, which is the same string untouched.
  58. For the system id we use the hierarchical analyzer. This is because we use internal namespaces for different storage systems. Most of the time we query against the un-analyzed value, meaning the keyword analyzer output value. This is the field which allows us to filter files depending on the context of the UI.
  59. Every one of these sections represents a different system id
  60. And we are also using the hierarchical and keyword analyzers for the path field.
  61. Now, we also need to index and filter files based on permissions. The way we manage these values is a bit simpler because we really only need a set of flags, as in “read”, “write”, “execute” and a username.
  62. This is what the permissions for a file look like. It is an array of objects with the username and the actual permissions stored in boolean flags. With this data we can easily list all the files a user has access to and show them in a nice interface like this.
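As a rough illustration of that structure, the sketch below models a permissions array with a username and boolean flags, and filters a list of documents down to the ones a given user can access. The usernames, file names and function name are made up for the example; in practice this filtering is done by Elasticsearch, not in application code.

```python
def files_visible_to(documents, username, permission="read"):
    """Return the documents on which `username` holds `permission`."""
    return [
        doc
        for doc in documents
        if any(
            entry["username"] == username and entry.get(permission, False)
            for entry in doc.get("permissions", [])
        )
    ]


# Hypothetical sample documents, trimmed to the fields used here.
documents = [
    {
        "name": "report.pdf",
        "permissions": [
            {"username": "jdoe", "read": True, "write": True, "execute": True},
            {"username": "asmith", "read": True, "write": False, "execute": False},
        ],
    },
    {
        "name": "notes.txt",
        "permissions": [
            {"username": "jdoe", "read": True, "write": True, "execute": True},
        ],
    },
]

print([doc["name"] for doc in files_visible_to(documents, "asmith")])
# ['report.pdf']
```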
  63. Assigning these analyzers to specific fields is done through mappings. Elasticsearch has an API to set up mappings. I’ve mentioned that we usually use multiple analyzers on one field, like a hierarchical analyzer and a keyword analyzer. The way we do this is to create what is called a multifield, so that we can specify which transformed data we want to query.
  64. In this example we use the HTTP PUT verb to set the mapping of a specific field. We have to specify the index and document type in the URL as well as the properties we are updating.
  65. Here we are creating a multifield with two fields, one which will reference the hierarchical value (underscore path) and another one which will reference the string unmodified (underscore exact).
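A request body along those lines could be assembled as follows. This is a sketch under assumptions, not the mapping shown on the slide: it uses the text and keyword field types of recent Elasticsearch versions (older releases used the string type and a document type in the URL), and the names path_tokenizer and path_analyzer are invented for the example. The sub-field names _path and _exact are the ones mentioned in the talk.

```python
import json

# Index settings: a custom analyzer that only applies the built-in
# path_hierarchy tokenizer (analyzer and tokenizer names are hypothetical).
index_settings = {
    "analysis": {
        "tokenizer": {
            "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"}
        },
        "analyzer": {
            "path_analyzer": {"type": "custom", "tokenizer": "path_tokenizer"}
        },
    }
}

# Mapping: one source value, indexed in more than one way (a multifield).
path_mapping = {
    "properties": {
        "path": {
            "type": "text",
            "fields": {
                # Hierarchical tokens, used to match any ancestor folder.
                "_path": {"type": "text", "analyzer": "path_analyzer"},
                # The untouched string, used for exact matches.
                "_exact": {"type": "keyword"},
            },
        }
    }
}

print(json.dumps(path_mapping, indent=2))
```

Queries can then address path._path or path._exact depending on which representation they need.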
  66. Now, let’s take a look at some of the actual queries we are executing. First we have the query that allows us to create the Shared With Me listing. We want to list all the files/folders a user has access to in a specific system that are not children of the user’s home directory. I’ll show two possible ways to do this query.
  67. First, we create what is called a bool query. This type of query allows us to combine different sub queries and filters.
  68. Here we can see the filter we are using. This filter will return every document which has these specific values of username in the permissions object array and of system id. We can see how we are specifying the underscore exact field from the multifield we configured before. We want to use filters as much as possible because filters are cached.
  69. After we filter the necessary documents we retrieve all documents whose path does not start with the username value. And this is going to return all the documents we are looking for.
  70. Another way to do it is to take advantage of the hierarchical analyzer we set up and match all documents which do not have the value username in the hierarchical path array.
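The first approach might look roughly like the body below: a bool query with two term filters and a must_not prefix clause. The field names permissions.username, systemId._exact and path._exact are assumptions based on the fields described in the talk, and the function name and sample values are hypothetical.

```python
def shared_with_me_query(username, system_id):
    """Build a bool query for files shared with `username` on one system."""
    return {
        "query": {
            "bool": {
                # Filters do not affect scoring and can be cached.
                "filter": [
                    {"term": {"permissions.username": username}},
                    {"term": {"systemId._exact": system_id}},
                ],
                # Exclude anything whose untouched path starts with the
                # username, i.e. the user's own home directory.
                "must_not": [
                    {"prefix": {"path._exact": username}},
                ],
            }
        }
    }


query = shared_with_me_query("jdoe", "hypothetical.storage.default")
```

One caveat of a plain prefix match is that "jdoe" also matches a path such as "jdoe2/file.txt", which is one reason the hierarchical variant can be preferable.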
  71. We can also leverage the hierarchical analyzer to retrieve all the files/folders a user has access to under a specific folder like this.
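Both uses of the hierarchical field can be sketched as term clauses against the tokenized sub-field. As before, the field names (permissions.username, systemId._exact, path._path) and function names are assumptions for illustration, and the sketch assumes paths are indexed without a leading slash so that the first token is the username itself.

```python
def _access_filters(username, system_id):
    """Term filters shared by both queries below."""
    return [
        {"term": {"permissions.username": username}},
        {"term": {"systemId._exact": system_id}},
    ]


def shared_with_me_hierarchical(username, system_id):
    """Files the user can access that are not under their home folder."""
    return {
        "query": {
            "bool": {
                "filter": _access_filters(username, system_id),
                # The home folder is one of the hierarchy tokens of every
                # file stored beneath it, so an exact token match suffices.
                "must_not": [{"term": {"path._path": username}}],
            }
        }
    }


def files_under_folder(username, system_id, folder):
    """Files the user can access anywhere beneath `folder`."""
    filters = _access_filters(username, system_id)
    filters.append({"term": {"path._path": folder}})
    return {"query": {"bool": {"filter": filters}}}
```

Since a term clause compares against whole tokens, "jdoe" no longer matches "jdoe2/file.txt", and the cost of the lookup does not depend on how deep the folder tree is.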
  72. Another query we use a lot is to grab a query string from the user and get all documents matching that query string.
  73. For this we use Elasticsearch’s simple query string.
  74. It is really easy to use: we need to specify the query string and the set of fields to search on.
  75. This type of query has its own small language
  76. This type of query has its own small language and it will never return an error. If there’s any part of the query string that is not valid it will discard it.
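A request body for this kind of search could be built as in the sketch below. The simple_query_string query and its query, fields and default_operator parameters are part of Elasticsearch's query DSL; the function name and the default field list are assumptions for the example.

```python
def user_search_query(query_string, fields=("name",)):
    """Build a simple_query_string search over the given fields."""
    return {
        "query": {
            "simple_query_string": {
                # Raw text typed by the user; invalid syntax is ignored
                # by Elasticsearch instead of raising an error.
                "query": query_string,
                "fields": list(fields),
                # Require every term to match unless the user writes "|".
                "default_operator": "and",
            }
        }
    }


# "+" means AND, "|" means OR, "-" negates a term, quotes make a phrase
# and a trailing "*" is a prefix match.
query = user_search_query('pdf + "final report" -draft')
```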
  77. Here is an example of what it looks like in DesignSafe when we search for any pdf files.
  78. There are a few caveats when using Elasticsearch
  79. Especially when indexing documents representing files in a file system, one has to be extra careful with duplicate and stale documents. This has to be managed externally since Elasticsearch does not do it automatically.
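One common way to avoid duplicates, shown here only as a sketch and not necessarily what DesignSafe does, is to derive the document id from the file's identity. Indexing a document under an id that already exists replaces the old document instead of creating a second one. The function name and the choice of SHA-256 are assumptions.

```python
import hashlib


def document_id(system_id, path):
    """Derive a stable document id from the storage system and file path."""
    # Normalize so "/jdoe/a.txt" and "jdoe/a.txt/" map to the same id.
    normalized = path.strip("/")
    key = f"{system_id}:{normalized}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()
```

Re-indexing the same file then always targets the same id, so repeated indexing runs cannot pile up copies of one file.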
  80. Elasticsearch should not be treated the same way as a persistent DB. This is because it is really easy to delete an entire index or a bunch of documents. There should always be a strategy to quickly rebuild any index and, of course, regular backups.
  81. It is always difficult to synchronize a search index with the actual data, especially when building a search index for data in a file system. The way we tackle this is to have different scripts that periodically index newly created data as well as permissions.
  82. Finally, special attention should be paid to access management for Elasticsearch. There are different ways to protect your cluster: it could be using firewalls, basic HTTP authentication or one of the multiple tools you can add to Elasticsearch for authorization.
  83. We also use Elasticsearch in other parts of DesignSafe
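The core of such a synchronization script can be sketched as a comparison between what is on disk and what is in the index. The functions below are an illustrative assumption, not DesignSafe's actual scripts: they walk a directory tree and report which paths still need indexing and which indexed paths have gone stale.

```python
import os


def list_relative_paths(root):
    """Return every file and folder under `root` as a '/'-separated path."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full_path = os.path.join(dirpath, name)
            relative = os.path.relpath(full_path, root)
            found.add(relative.replace(os.sep, "/"))
    return found


def plan_sync(indexed_paths, root):
    """Compare indexed paths with the file system and plan the changes."""
    on_disk = list_relative_paths(root)
    indexed = set(indexed_paths)
    return {
        # On disk but missing from the index: newly created data.
        "to_index": sorted(on_disk - indexed),
        # In the index but gone from disk: stale documents.
        "to_delete": sorted(indexed - on_disk),
    }
```

The list of indexed paths would come from a query against the index; running the comparison on a schedule keeps the drift between index and file system bounded.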
  84. like site-wide search,
  85. search and rendering of publications metadata
  86. and quick metrics calculations.