
PEARC17: Designsafe: Using Elasticsearch to Share and Search Data on a Science Web Portal


DesignSafe is a web portal that helps the Natural Hazards Engineering community conduct research. Natural Hazards research spans multiple physical locations, where the experiments take place, and multiple disciplines. Sharing and searching data is an essential capability when doing research across multiple physical locations. We handle these research needs by using a distributed search engine (Elasticsearch) to index important features extracted from the data.



  1. Designsafe: Using Elasticsearch to Share and Search Data on a Science Web Portal Josue Balandrano Coronel Stephen Mock Texas Advanced Computing Center
  2. Context
  3. - What is DesignSafe? Context
  4. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure Context
  5. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure - Shared-use research infrastructure Context
  6. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure - Shared-use research infrastructure - Users within a project Context
  7. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure - Shared-use research infrastructure - Users within a project - Users and Experimental Facilities Context
  8. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure - Shared-use research infrastructure - Users within a project - Users and Experimental Facilities - Infrastructure Context
  9. Context: DesignSafe Architecture Django Middleware Science Gateway
  10. Context: DesignSafe Architecture Django Middleware Agave Elasticsearch RabbitMQ Custom APIs Science Gateway Distributed Services
  11. Context: DesignSafe Architecture Django Middleware Agave Elasticsearch RabbitMQ Stampede Maverick Custom APIs Corral Science Gateway Distributed Services HPC
  12. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure - Shared-use research infrastructure - Users within a project - Users and Experimental Facilities - Infrastructure Context
  13. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure - Shared-use research infrastructure - Users within a project - Users and Experimental Facilities - Infrastructure - Data Depot Context
  14. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure - Shared-use research infrastructure - Users within a project - Users and Experimental Facilities - Infrastructure - Data Depot - Workspace Context
  15. - What is DesignSafe? - Natural Hazards Engineering Research Infrastructure - Shared-use research infrastructure - Users within a project - Users and Experimental Facilities - Infrastructure - Data Depot - Workspace - Reconnaissance Context
  16. - What is Agave? Context
  17. Context: DesignSafe Architecture Django Middleware Agave Elasticsearch RabbitMQ Stampede Maverick Custom APIs Corral Science Gateway Distributed Services HPC
  18. - What is Agave? - Provides a holistic view of core computing concepts Context
  19. - What is Agave? - Provides a holistic view of core computing concepts - Abstraction layer on top of HPC systems (execution and storage) Context
  20. - What is Agave? - Provides a holistic view of core computing concepts - Abstraction layer on top of HPC systems (execution and storage) - File permissions and access Context
  21. - What is Agave? - Provides a holistic view of core computing concepts - Abstraction layer on top of HPC systems (execution and storage) - File permissions and access - Simpler ACL interface Context
  22. Data Depot
  23. Data Depot
  24. Data Depot
  25. Data Depot
  26. Data Depot
  27. Data Depot
  28. Data Depot
  29. Data Depot
  30. Data Depot
  31. Data Depot
  32. Data Depot
  33. Problem
  34. - Discoverable and searchable data Problem
  35. - Discoverable and searchable data - Main queries: Problem
  36. - Discoverable and searchable data - Main queries: - Give me every file/folder I have access to and it’s not in my home dir Problem
  37. - Discoverable and searchable data - Main queries: - Give me every file/folder I have access to and it’s not in my home dir - Search within context of the UI Problem
  38. Elasticsearch
  39. - Search engine based on Lucene Elasticsearch
  40. - Search engine based on Lucene - RESTful API Elasticsearch
  41. - Search engine based on Lucene - RESTful API - Schema-free JSON documents Elasticsearch
  42. - Search engine based on Lucene - RESTful API - Schema-free JSON documents - Distributed Elasticsearch
  43. - Search engine based on Lucene - RESTful API - Schema-free JSON documents - Distributed - Near Realtime Elasticsearch
  44. Elasticsearch
  45. Elasticsearch
  46. Elasticsearch - Analyzers
  47. - Consists of 3 blocks: Elasticsearch - Analyzers
  48. - Consists of 3 blocks: - Character filters Elasticsearch - Analyzers
  49. - Consists of 3 blocks: - Character filters Removing HTML tags. Elasticsearch - Analyzers
  50. - Consists of 3 blocks: - Character filters - Tokenizers Elasticsearch - Analyzers
  51. - Consists of 3 blocks: - Character filters - Tokenizers Hierarchical “username/path/to/file.txt” [“username”, “username/path”, “username/path/to”, “username/path/to/file.txt”] Elasticsearch - Analyzers
  52. - Consists of 3 blocks: - Character filters - Tokenizers - Token filters Elasticsearch - Analyzers
  53. - Consists of 3 blocks: - Character filters - Tokenizers - Token filters Case insensitive, i.e. lower case, or removing stop words Elasticsearch - Analyzers
  54. - Consists of 3 blocks: - Character filters - Tokenizers - Token filters - Out of the box or custom Elasticsearch - Analyzers
  55. - Consists of 3 blocks: - Character filters - Tokenizers - Token filters - Out of the box or custom - Standard: Divides terms on word boundaries and lowercase token filter Elasticsearch - Analyzers
  56. - Consists of 3 blocks: - Character filters - Tokenizers - Token filters - Out of the box or custom - Standard: Divides terms on word boundaries and lowercase token filter - Keyword: Noop analyzer Elasticsearch - Analyzers
  57. - Consists of 3 blocks: - Character filters - Tokenizers - Token filters - Out of the box or custom - Standard: Divides terms on word boundaries and lowercase token filter - Keyword: Noop analyzer - Custom Hierarchical: Breaks on specific character Elasticsearch - Analyzers
  58. - Consists of 3 blocks: - Character filters - Tokenizers - Token filters - Out of the box or custom - Standard: Divides terms on word boundaries and lowercase token filter - Keyword: Noop analyzer - Custom Hierarchical: Breaks on specific character - Language: remove stop words, exclude keywords, stemming Elasticsearch - Analyzers
  59. Elasticsearch “name”: “file.txt” => “file.txt” [“file”, “txt”]
  60. Elasticsearch “name”: “file.txt” => “file.txt” [“file”, “txt”] “systemId”: “designsafe.storage.default” => “designsafe.storage.default” [“designsafe”, “designsafe.storage”, “designsafe.storage.default”]
  61. Data Depot
  62. Elasticsearch “name”: “file.txt” => “file.txt” [“file”, “txt”] “systemId”: “designsafe.storage.default” => “designsafe.storage.default” [“designsafe”, “designsafe.storage”, “designsafe.storage.default”] “path”: “username/path/to” => “username/path/to” “username/path/to” [“username”, “username/path”, “username/path/to”]
  63. Elasticsearch “name”: “file.txt” => “file.txt” [“file”, “txt”]
  64. Elasticsearch
  65. Data Depot
  66. Elasticsearch - Mappings
  67. Elasticsearch - Mappings
  68. Elasticsearch - Mappings
  69. Elasticsearch - List all the files/folders I have access to in a specific system AND are not in my home directory
  70. Elasticsearch - List all the files/folders I have access to in a specific system which are not in my home directory
  71. Elasticsearch - List all the files/folders I have access to in a specific system which are not in my home directory
  72. Elasticsearch - List all the files/folders I have access to in a specific system which are not in my home directory
  73. Elasticsearch - List all the files/folders I have access to in a specific system which are not in my home directory
  74. Data Depot
  75. Elasticsearch - List all the files/folders I have access to in a specific system under a specific folder
  76. Elasticsearch - List all the files/folders I have access to in a specific system under a specific folder
  77. Elasticsearch - List all the files/folders which match a specific query string
  78. Elasticsearch - List all the files/folders in my home directory which match a specific query string
  79. Elasticsearch - List all the files/folders in my home directory which match a specific query string
  80. Elasticsearch - Simple Query String
  81. Elasticsearch - Simple Query String - Simple language: + signifies AND operation | signifies OR operation - negates a single token " wraps a number of tokens to signify a phrase for searching * at the end of a term signifies a prefix query ( and ) signify precedence ~N after a word signifies edit distance (fuzziness) ~N after a phrase signifies slop amount - Will never return an error, discards invalid parts of the query.
  82. Elasticsearch
  83. Elasticsearch - Caveats
  84. Elasticsearch - Caveats - Manage dedup
  85. Elasticsearch - Caveats - Manage dedup - Not a persistent DB. How to recreate index quickly
  86. Elasticsearch - Caveats - Manage dedup - Not a persistent DB. How to recreate index quickly - Synchronizing data
  87. Elasticsearch - Caveats - Manage dedup - Not a persistent DB. How to recreate index quickly - Synchronizing data - Access management
  88. Elasticsearch - Other Uses
  89. Elasticsearch - Other Uses - Site-wide search
  90. Elasticsearch - Other Uses - Site-wide search - Publications metadata
  91. Elasticsearch - Other Uses - Site-wide search - Publications metadata - Quick metrics calculations
  92. Thank You Special thanks to: - DesignSafe Team - TACC - Stephen Mock - PEARC - My wife: Gigimaria Flores Email: jcoronel@tacc.utexas.edu Twitter: @eusoj_xirdneh IRC: josuebc @ freenode

Editor's Notes

  • Before diving into what Elasticsearch is and how we use it, let’s explain a little bit of context.
  • What is DesignSafe?
  • DesignSafe is a Science Gateway for the Natural Hazards Engineering community.
  • At its core DesignSafe is a Shared-use research infrastructure,
  • allowing users to share data, applications and collaborate with other users within a project
  • and with remote experimental facilities
  • Now, let’s take a quick look at the architecture so we can have a better idea of how we manage data.
  • Starting from what the user sees, we have a middleware implemented with Django and Python. This is the actual web portal.
  • Behind it we have multiple distributed services. Elasticsearch, message queues, custom APIs and Agave -- I’ll talk about Agave in a minute --.
  • Behind that we have all of our HPC systems: execution systems like Stampede and Maverick, and storage like Corral.
  • The main components of DesignSafe’s infrastructure are:
  • the Data Depot, which is where a user can manage, discover and share data.
  • The workspace, where a user has access to different applications which run in different HPC systems
  • and the Reconnaissance portal where users can upload and visualize geospatial data.
  • I mentioned Agave. So, what is agave?
  • As we can see in this graphic, we use Agave as our main point of interaction with our HPC systems.
  • It basically is an abstraction layer on top of everything HPC we use.
  • This is an important concept because Agave allows us to easily manage file permissions and access,
  • as well as providing a simple ACL interface. All of this is exposed through friendly REST endpoints.
  • Now, let’s focus on the Data Depot. As we can see we have different sections in the data depot.
  • My Data is all your private data, this is your home directory.
  • Here you can share data with any user
  • and give it read or read/write permission through this interface
  • Everything that has been shared with you will appear here. All of this data is also searchable.
  • We also offer a collaboration section called My Projects. Here, a set of users are members of a project. Every user automatically has full access to everything within that project. This section also allows users to curate data and eventually create a publication, but this is not the aim of this presentation.
  • Then there’s the Published section, where we list all the publications we have. All of these publications have DOIs and the metadata is properly rendered.
  • I won’t go much into the details of the different types of publications that we have, but I want you to take into consideration that all of the published metadata is also stored in Elasticsearch. And we have some legacy publications which look like this
  • While newer publications look like this. As we can see these are two different data models.
  • As a counterpart we have Community Data, which is data that is public but is not a proper publication. Mainly we store tutorials and examples.
  • Finally, we also allow users to connect external services like Box or Dropbox so they can move data from and to these external resources.
  • Now that we have an idea of all the different types of data we manage in the Data Depot we can have a better grasp of what the issue is
  • All of this data has to be searchable and discoverable.
  • So, after a lot of thinking about this we realized that we are mainly implementing two queries.
  • One is: give me everything I have access to that is not in my home directory. With this query we get everything that has been shared with a specific user and we can work within that context.
  • The other query is to get everything pertaining to the Data Depot section the user currently is in.
  • In order to create these queries we decided to use Elasticsearch,
  • which is a search engine based on Lucene.
  • Elasticsearch gives you a nice RESTful API
  • and allows us to store schema-free JSON documents
  • as well as being distributed. These last two characteristics are really important to us because the only thing we were sure about is that we did not know the structure of the data we were going to manage and we did not know how fast it was going to grow.
  • Elasticsearch is also near realtime, which means that a document is available almost immediately after being written. It usually takes a minute at most.
  • So, let’s take a look at how we are indexing files with Elasticsearch so we can query that information. This is called a document in Elasticsearch.
    As we can see most of this information is what we get from the “stat” command. Name, length, last modified, etc…
  • We are going to focus on three specific fields. Name, systemId and path. Most of our queries are going to target these fields. There are some other metrics that can be aggregated from other fields shown here. But the thing is that indexing files in Elasticsearch requires planning.
    We need to figure out how we are going to use the fields that get indexed. Since we already have an idea about the queries we are going to be executing, we know how we are going to use these fields. We know that we are going to filter documents depending on one or more parent folders, and as we can see we are storing this information in the field “path”. We also need to filter files depending on a specific Data Depot section. What we are doing here is creating a storage system for each one of the Data Depot sections previously described. This helps us differentiate where every file is and it is easier to manage with Agave and Corral. So, we can also see a systemId which is the identifier for that specific storage system. Finally, we need to pay extra attention to the name because we want the user to be able to query filenames as well as extensions, and even extra metadata that we are not showing here to keep this simplified. By extra metadata I mean information like user-defined keywords, descriptions and other community-specific data.
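As a rough sketch of the indexed file document described above (the three fields named in the talk plus stat-like metadata; the exact field names and values here are illustrative, not the portal's real schema):

```python
# Hypothetical sketch of one indexed file document. "name", "systemId"
# and "path" are the fields the talk says most queries target; the
# remaining stat-like fields are illustrative.
file_doc = {
    "name": "file.txt",                        # searchable by filename/extension
    "path": "username/path/to",                # parent folders, hierarchical
    "systemId": "designsafe.storage.default",  # one storage system per section
    "length": 1024,
    "lastModified": "2017-07-10T12:00:00-06:00",
    "type": "file",
}

# The three fields most queries target:
query_fields = ["name", "systemId", "path"]
```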
  • Then we have to see if we need to manipulate any of these fields in order to make our queries faster. It is always better to store the data transformed instead of transforming it on the fly. Elasticsearch introduces the concept of analyzers. Analyzers transform data as it is being stored that way it is easier and faster to apply different queries to the same data.
  • Analyzers consist of 3 blocks:
  • character filters, which receive the data as a stream of characters and can be used to add, remove or change characters,
  • e.g. removing HTML tags.
  • Tokenizers, which receive a stream of characters, break it up into individual tokens and output these tokens.
  • e.g. we can use a tokenizer to store a better representation of a file path. This is called a hierarchical tokenizer.
    It receives the path as a string and outputs an array of every hierarchy level on that path. This is what allows us to filter all the files under a specific folder faster, regardless of how many children or subfolders a specific folder has.
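As a toy illustration of what such a hierarchical tokenizer emits (Elasticsearch ships one as the built-in path_hierarchy tokenizer), the prefix expansion can be sketched in a few lines:

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Emit every prefix of a delimited path, mimicking the output of
    Elasticsearch's path_hierarchy tokenizer."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]

print(path_hierarchy_tokens("username/path/to/file.txt"))
# ['username', 'username/path', 'username/path/to', 'username/path/to/file.txt']
```

Because every descendant document carries all of its ancestor prefixes as tokens, "everything under this folder" becomes a single exact-term match rather than a wildcard scan.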
  • Then we have token filters, which receive token streams and may add, remove or change tokens.
  • They can be used to lowercase tokens or remove stop words.
  • There are plenty of analyzers Elasticsearch offers out of the box and one can create a custom analyzer.
  • The main analyzers we use are: Standard which divides terms on word boundaries and lower cases the stream.
  • Keyword, which is basically a noop analyzer, meaning that the string will not be touched when being stored.
  • A custom one which only has a hierarchical tokenizer
  • and an English-specific analyzer, which helps to remove common stop words, exclude custom keywords, and stem words.
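A sketch of index settings wiring the built-in path_hierarchy tokenizer into a custom hierarchical analyzer like the one described above. This is not the portal's actual configuration; the analyzer and tokenizer names are made up for illustration:

```python
import json

# Hypothetical index settings: a custom analyzer built from the
# built-in path_hierarchy tokenizer.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "path_tokenizer": {          # name is illustrative
                    "type": "path_hierarchy",
                    "delimiter": "/",
                }
            },
            "analyzer": {
                "path_analyzer": {           # name is illustrative
                    "type": "custom",
                    "tokenizer": "path_tokenizer",
                }
            },
        }
    }
}

# This body would be sent when creating the index, e.g. PUT /<index>
print(json.dumps(index_settings, indent=2))
```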
  • As an example let’s take a look at a simple file document and how analyzers transform some of these fields. We are using the standard analyzer on the file name; this transforms the data by making it case insensitive, lower casing everything, and breaking the name into words. This allows the user to search on extensions or partial names. When we store this field we store two values: one is the transformed value and the other, using the keyword analyzer, is the same string untouched.
  • For the system id we use the hierarchical analyzer, this is because we use internal namespaces for different storage systems. Most of the time we query against the un-analyzed value, meaning the keyword analyzer output value. This is the field which allows us to filter files depending on the context of the UI.
  • Every one of these sections represents a different system id
  • And we are also using the hierarchical and keyword analyzer for the path field.
  • Now, we also need to index and filter files based on permissions. The way we manage these values is a bit simpler because we really only need a set of flags, as in “read”, “write”, “execute” and a username.
  • This is what the permissions for a file look like. It is an array of objects with the username and the actual permissions stored as boolean flags. With this data we can easily list all the files a user has access to and show them in a nice interface like this.
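The permissions array just described can be sketched like this (the usernames and exact field names are illustrative assumptions, not the portal's real schema):

```python
# Hypothetical permissions array for one file document: one object per
# user, with the actual permissions stored as boolean flags.
permissions = [
    {"username": "jdoe",
     "permission": {"read": True, "write": True, "execute": True}},
    {"username": "collaborator",
     "permission": {"read": True, "write": False, "execute": False}},
]

def readers(perms):
    """Usernames that are allowed to read this file."""
    return [p["username"] for p in perms if p["permission"]["read"]]

print(readers(permissions))  # ['jdoe', 'collaborator']
```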
  • Setting up these analyzers on specific fields is called mappings. Elasticsearch has an API to set up different mappings. I’ve mentioned that we usually use multiple analyzers on one field, like a hierarchical analyzer and a keyword analyzer. The way we do this is to create what is called a multifield, so that we can specify which transformed data we want to query.
  • In this example we use the HTTP PUT verb to set the mapping of a specific field. We have to specify the index and document type in the URL as well as the properties we are updating.
  • Here we are creating a multifield with two fields, one which will reference the hierarchical value (underscore path) and another one which will reference the string unmodified (underscore exact).
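Following the note above, the multifield mapping for the path field might look like this sketch. It uses modern text/keyword mapping syntax (older 2.x-era syntax differed), and the _path and _exact subfield names follow the "underscore path" / "underscore exact" naming mentioned in the notes; the analyzer name is an assumption:

```python
# Hypothetical mapping body, e.g. for PUT /<index>/_mapping/<doc_type>:
# "path" becomes a multifield with a hierarchical subfield (_path)
# and an untouched, exact-match subfield (_exact).
path_mapping = {
    "properties": {
        "path": {
            "type": "text",
            "fields": {
                "_path": {"type": "text", "analyzer": "path_analyzer"},
                "_exact": {"type": "keyword"},
            },
        }
    }
}
```

Queries then pick the variant they need: `path._exact` for exact comparisons, `path._path` for subtree matches.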
  • Now, let’s take a look at some of the actual queries we are executing. First we have the query that allows us to create the Shared With Me listing. We want to list all the files/folders a user has access to in a specific system which are not children of the user’s home directory. I’ll show two possible ways to do this query.
  • First, we create what is called a bool query. This type of query allows us to combine different sub queries and filters.
  • Here we can see the filter we are using, this filter will return every document which has these specific values of username in the permissions object array and the system id. We can see how we are specifying the underscore exact field from the multifield we configured before. We want to use filters as much as possible because filters are cached.
  • After we filter the necessary documents we retrieve all documents whose path does not start with the username value. And this is going to return all the documents we are looking for.
  • Another way to do it is to take advantage of the hierarchical analyzer we setup and match all documents which do not have the value username in the hierarchical path array.
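The second variant can be sketched as a bool query like this. The subfield names (systemId._exact, path._path) follow the multifield naming described in the notes, and the sketch assumes permissions is indexed as a plain object array (if it were a nested type, the term filter on permissions.username would need a nested query instead):

```python
def shared_with_me_query(username, system_id):
    """Sketch: files/folders the user can access on a system, excluding
    the user's own home tree via the hierarchical path field."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Filters are cached, so we push as much as possible here.
                    {"term": {"permissions.username": username}},
                    {"term": {"systemId._exact": system_id}},
                ],
                "must_not": [
                    # The hierarchical tokens of "path" include every parent
                    # folder, so matching the bare username excludes the
                    # whole home-directory subtree at any depth.
                    {"term": {"path._path": username}},
                ],
            }
        }
    }
```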
  • We can also leverage the hierarchical analyzer to retrieve all the files/folders a user has access to under a specific folder like this.
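The "everything under a specific folder" listing can be sketched the same way, again assuming the _exact/_path multifield naming described in the notes:

```python
def folder_listing_query(username, system_id, folder_path):
    """Sketch: files/folders the user can access under folder_path."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"permissions.username": username}},
                    {"term": {"systemId._exact": system_id}},
                    # folder_path is one of the hierarchical tokens of every
                    # descendant's "path", however deeply it is nested.
                    {"term": {"path._path": folder_path}},
                ]
            }
        }
    }
```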
  • Another query we use a lot is to grab a query string from the user and get all documents matching that query string.
  • For this we use Elasticsearch’s simple query string.
  • It is really easy to use, we need to specify the query string and the set of fields to search on.
  • This type of query has its own small language
  • This type of query has its own small language and it will never return an error. If there’s any part of the query string that is not valid it will discard it.
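A sketch of such a search request. simple_query_string is a real Elasticsearch query type; the fields list and the permission filter shown here are illustrative assumptions:

```python
def user_search_query(username, query_string):
    """Sketch: full-text search restricted to documents the user can read."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"permissions.username": username}},
                ],
                "must": [
                    {
                        "simple_query_string": {
                            # e.g. 'report + pdf*'; any invalid syntax in the
                            # query string is silently discarded, never an error.
                            "query": query_string,
                            "fields": ["name", "keywords"],  # illustrative
                        }
                    }
                ],
            }
        }
    }
```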
  • Here is an example of what it looks like in DesignSafe when we search for any pdf files.
  • There are a few caveats when using Elasticsearch
  • Especially when indexing documents representing files in a file system, one has to be extra careful with duplicate and stale documents. This has to be managed externally since Elasticsearch does not do it automatically.
  • Elasticsearch should not be treated the same way as a persistent DB. This is because it is really easy to delete an entire index or a bunch of documents.
    There should always be a strategy to quickly rebuild any index and of course recurrent backups.
  • It is always difficult to synchronize a search index with the actual data, especially when building a search index for data in a file system. The way we tackle this is to have different scripts that recurrently index newly created data as well as permissions.
  • Finally, special attention should be paid to access management for Elasticsearch. There are different ways to protect your cluster: firewalls, basic HTTP authentication, or one of the multiple tools you can add to Elasticsearch for authorization.
  • We also use Elasticsearch in other parts of DesignSafe
  • like site-wide search,
  • search and rendering of publications metadata
  • and quick metrics calculations.
