High Performance Computing for
Distributed Indexing of Scientific Data
Giuseppe Totaro, PhD
Scientific Applications Software Engineer
Computer Science for Data Intensive Applications (398M)
j p l . n a s a . g o v
Instrument Software and Science Data
Systems (3980)
• The section is responsible for
– Controlling the remote sensing instruments deployed on
and around planet Earth and beyond
– Transforming instrument observations into precision
measurements and accurate and timely information
– Maintaining and making the observations, measurements
and information usable for a broad set of applications
• Design and development of complex instrument
flight and ground software and scientific applications
• Integration of new algorithms and models into high
performance science data systems
• Development of complex test plans and unit tests
• Our teams of technologists work closely with mission
science and instrument teams
ESA-ESTEC Nov 17, 2017 High Performance Computing for Distributed Indexing of Scientific Data 2
Summary
• The simplest form of information retrieval is
to linearly scan through documents
• The way to speed up search operations is to
build an inverted index in advance
• Building an index of scientific big data is
very time consuming
• High Performance Computing for indexing
and searching heterogeneous big data
Information Retrieval
• IR systems can be distinguished by the scale
– web search
– personal information retrieval
– domain-specific search
• Challenging blend of science and engineering
• First major concept in IR: the inverted index
“Information retrieval (IR) is finding material (usually documents)
of an unstructured (or semistructured) nature (usually text)
that satisfies an information need from within large collections
(usually stored on computers).”
[Manning, Raghavan and Schütze (2008)]
Inverted Index
1. Collect the documents to be indexed
2. Tokenize the text
3. Produce a list of normalized tokens
4. Index the documents

Original documents:
Doc 1 New home sales top forecasts
Doc 2 HOME SALES RISE IN JULY
Doc 3 Increase in home sales in July
Doc 4 July new home sales rise

Normalized tokens:
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise

Inverted index (DICTIONARY: term and document frequency; POSTINGS: document:character offset):
term        doc. freq.   documents
forecasts   1            1:20
home        4            1:5, 2:1, 3:13, 4:10
in          2            2:17, 3:10-24
increase    1            3:1
july        3            2:20, 3:27, 4:1
new         2            1:1, 4:6
rise        2            2:12, 4:21
sales       4            1:10, 2:6, 3:18, 4:15
top         1            1:16
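The four indexing steps can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the talk: the function name `build_positional_index` and the regular-expression tokenizer are assumptions, and positions are stored as 1-based character offsets, which is how the offsets in the example postings appear to be counted.

```python
import re
from collections import defaultdict

# Step 1: collect the documents to be indexed (the four example documents).
DOCS = {
    1: "New home sales top forecasts",
    2: "HOME SALES RISE IN JULY",
    3: "Increase in home sales in July",
    4: "July new home sales rise",
}


def build_positional_index(docs):
    """Map each normalized term to {doc_id: [1-based character offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Step 2: tokenize on runs of word characters (an assumed tokenizer).
        for match in re.finditer(r"\w+", text):
            # Step 3: normalize the token by lowercasing it.
            term = match.group().lower()
            # Step 4: record where the term occurs in this document.
            index[term].setdefault(doc_id, []).append(match.start() + 1)
    return dict(index)


index = build_positional_index(DOCS)
for term in sorted(index):
    postings = index[term]
    # Print: term, document frequency, postings.
    print(term, len(postings), postings)
```

The document frequency of a term is simply the number of documents in its postings, so it does not need to be stored separately in this sketch.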
Scientific Big Data
The Gartner Group proposed the “4V” definition for big data
• “Datasets that cannot be acquired, managed, or processed on
common devices within an acceptable time”
• Scientific Big Data [Guo et al. (2014)]
– complexity, comprehensiveness, globalization, and high integration
– multidisciplinary approaches
– object and tool of research
• “3H” (High dimension, High complexity, High uncertainty)
[Figure: a special issue of Nature on Big Data]
Automatic Classification and Interpretation of Polar Datasets
Scientific Case Study
Thwaites Glacier bed topography derived from airborne radar data.
[Joughin, Smith and Medley (2014)]
Automatic Classification and Interpretation of Polar Datasets
Scientific Case Study
[Burgess and Mattmann (2014)]
NSF ADC, https://arcticdata.io
NASA AMD, https://gcmd.gsfc.nasa.gov
NSIDC ADE, http://arctic-data-explorer.labs.nsidc.org
Automatic Classification and Interpretation of Polar Datasets
Scientific Case Study
• TREC Dynamic Domain: Polar Science
– Data Triage
– Web Crawl
– Data Preparation
– Inverted Indexing
– Querying the Data
Figure 3. More precise extracted content using Apache Tika and Apache Nutch for the
ACADIS Polar data repository. Our approach was able to extract 6% more of the data by
correctly identifying many previously “unknown” MIME types in ACADIS.
http://trec-dd.org/2015/2015.html
Dataset of 50,000+ crawled web pages,
scientific data (HDF, NetCDF files, Grib files),
zip files, PDFs, images and science code
related to the polar sciences and available
publicly from the NSF ADC, NASA AMD, and
NSIDC ADE.
[Burgess, Mattmann, Totaro, McGibbney and Ramirez (2014)]
Indexing Data Flow
• Main bottlenecks
– parsing of documents
– text analysis
– merging of segments
• Detection complexity
– parser failures
– parser error
– parser inability
[Figure: indexing data flow. data source → Extraction → raw data → Parsing → text + metadata → Indexing → inverted index]
• Extraction: extracting raw files from data sources (e.g., HDF, NetCDF, GRIB, PDF, etc.)
• Parsing operations:
– mimetype detection
– language detection
– content handling
– metadata extraction
• Indexing operations:
– tokenization
– stop words removal
– stemming
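The three indexing operations listed for this data flow (tokenization, stop words removal, stemming) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the stop-word list is hypothetical, and `naive_stem` is a toy suffix stripper, not the Porter algorithm or any stemmer used by the author.

```python
import re

# Hypothetical stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning for retrieval."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Toy stemmer: strip 'ing' or a final 's' if at least 3 letters remain."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the three indexing operations in order."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


terms = analyze("Glaciers are retreating in the Arctic")
print(terms)
```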
ISODAC: A High Performance Solution
• How can the inverted indexing problem be tackled efficiently?
– Extract-Parse-Index in-memory pipelines
– Parallel Technologies (e.g., GPU, Pthreads)
– Enhancing de facto IR standard technologies
• ISODAC [Totaro, Bernaschi, Carbone, et al. (2016)]
1. New high performance solution for indexing
2. Only inverted indexes are written on disk
3. Scalable according to hardware resources
EPI In-Memory Pipeline
[Figure: EPI in-memory pipeline with three stages. S1: EXTRACT (one Image-Extractor), S2: PARSE (four Docu-Parsers), S3: INDEXING (four Docu-Indexers). Other labels: The Sleuth Kit, JNI-based buffers.]
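The idea of an Extract-Parse-Index pipeline that keeps intermediate data in memory can be sketched with bounded queues and threads. This is only a conceptual sketch under stated assumptions: the real system is described as using The Sleuth Kit and JNI-based buffers, whereas the queue sizes, the function name `run_pipeline`, the byte-decoding "parser", and the whitespace tokenizer below are all hypothetical stand-ins.

```python
import queue
import threading
from collections import defaultdict

SENTINEL = None  # marks the end of a stream


def run_pipeline(files, n_parsers=4, n_indexers=4):
    """Index {file_id: raw_bytes} through three in-memory stages."""
    raw_q = queue.Queue(maxsize=8)    # S1 -> S2 buffer
    text_q = queue.Queue(maxsize=8)   # S2 -> S3 buffer
    index = defaultdict(set)          # term -> set of file ids
    lock = threading.Lock()

    def extractor():                  # S1: EXTRACT
        for file_id, raw in files.items():
            raw_q.put((file_id, raw))
        for _ in range(n_parsers):    # one end marker per parser
            raw_q.put(SENTINEL)

    def parser():                     # S2: PARSE
        while True:
            item = raw_q.get()
            if item is SENTINEL:
                break
            file_id, raw = item
            text_q.put((file_id, raw.decode("utf-8", errors="replace")))

    def indexer():                    # S3: INDEXING
        while True:
            item = text_q.get()
            if item is SENTINEL:
                break
            file_id, text = item
            for token in text.lower().split():
                with lock:
                    index[token].add(file_id)

    parsers = [threading.Thread(target=parser) for _ in range(n_parsers)]
    indexers = [threading.Thread(target=indexer) for _ in range(n_indexers)]
    ext = threading.Thread(target=extractor)
    for t in [ext, *parsers, *indexers]:
        t.start()
    ext.join()
    for t in parsers:
        t.join()
    for _ in range(n_indexers):       # all parsers are done: stop the indexers
        text_q.put(SENTINEL)
    for t in indexers:
        t.join()
    return dict(index)


result = run_pipeline({0: b"home sales", 1: b"Home prices"})
print(result)
```

Nothing is written to disk between stages in this sketch; only the final index would be persisted, which mirrors the stated design goal that only inverted indexes are written to disk.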
Recovery Procedure
EPI In-Memory Pipeline
• Provides resilience against both missing
files (M) and parsing failures (rounds)
• Each global fileID can be calculated by using the following formula:
\[
\mathit{fileID} =
\begin{cases}
M \times N[0] + O_k[0], & \mathit{rounds} = 0 \\
\mathit{sent}[z] + (M - C_k[z]) \times N[z+1] + O_k[z+1], & \mathit{rounds} > 0
\end{cases}
\qquad
z = \operatorname{argmax}_i \, \bigl( C_k[i] < M \bigr)
\]
[Figure: recovery workflow involving the Docu-Indexer, the Image-Extractor, and the Coordinator]
• Detects local missing files (M)
• Uploads pipe id (Pk) and list of M
• Uploads pipe seq. (Ok) and counters (Ck)
• Determines list of global missing files
• Starts an additional job
Recovery Procedure
EPI In-Memory Pipeline
• First round
i = 0
N[0] = 6
O[0] = <0,1,2,3,4,5>
C[0] = <x0,x1,x2,x3,x4,x5>
• Second round
i+1 = 1
N[1] = 4
O[1] = <0,null,1,2,null,3>
C[1] = <y0,0,y1,y2,0,y3>
[Figure: files 0 to 5 flowing from the Image-Extractor through the Docu-Parsers to the Docu-Indexers; two entries are null.]
Optimizations: Text Analysis on GPU
• A Token object consists of:
<text, start, end, position, type> [Wetzel (2014)]
• Text Analysis is a time-consuming function
Example of analyzer output:
Input: XY&Z Corporation - xyz@example.com
WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer: [xy] [z] [corporation] [example] [com]
StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
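To make the difference between analyzers concrete, the following Python sketch imitates the behavior of two of them on the example input. It is not the Lucene or CLucene API: `whitespace_tokens` and `simple_tokens` are hypothetical helpers that only approximate WhitespaceAnalyzer (split on whitespace) and SimpleAnalyzer (split on non-letters, then lowercase) for ASCII text.

```python
import re

TEXT = "XY&Z Corporation - xyz@example.com"


def whitespace_tokens(text):
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()


def simple_tokens(text):
    """Lowercase the text and keep maximal runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


print(whitespace_tokens(TEXT))
print(simple_tokens(TEXT))
```

Even this small comparison shows why text analysis is costly: every character of every document has to be inspected, and often transformed, before a single term reaches the index.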
Optimizations: Text Analysis on GPU
• CUDA-based Text Analysis
– Text analysis including tokenization, filtering and stop
words removal processes
– Exploits the computational power of GPU cards
– Extends CLucene StandardAnalyzer with CUDA
kernels that perform text analysis
CLucene Libraries: http://clucene.sourceforge.net/
Optimizations: Text Analysis on GPU
[Figure: GPU text analysis of the input "My name is Bob." (followed by a terminating 0)]
1. One CUDA thread per character. Each thread applies the LowerCase filter: "My name is Bob." becomes "my name is bob."
2. Each CUDA thread performs tokenization by locating delimiter positions:
-1 -1 2 -1 -1 -1 -1 7 -1 -1 10 -1 -1 -1 14 15
3. Vector processing creates two vectors representing the start and end token indexes (related to the input text):
Start indexes: 0 3 8 11
End indexes: 2 7 10 14
Tokens: my, name, is, bob
4. One CUDA thread per token. Each thread applies the StopWords filter.
Remaining tokens: my, name, bob
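The data-parallel steps of this GPU text analysis can be emulated sequentially to check the numbers in the example. The Python sketch below is not CUDA and not the author's kernels: each list comprehension stands in for what one thread per character (or per token) would compute, and the delimiter set and the one-word stop list are assumptions chosen to reproduce the example.

```python
STOP_WORDS = {"is"}               # hypothetical stop list for this example
DELIMITERS = {" ", ".", "\0"}     # assumed delimiter characters


def to_lower(text):
    # GPU version: one thread per character applies the LowerCase filter.
    return [ch.lower() for ch in text]


def mark_delimiters(chars):
    # Each position holds its own index if it is a delimiter, otherwise -1.
    return [i if ch in DELIMITERS else -1 for i, ch in enumerate(chars)]


def token_bounds(marks):
    # Compact the marks into start/end index vectors, skipping empty tokens.
    starts, ends = [], []
    begin = 0
    for pos in marks:
        if pos == -1:
            continue
        if pos > begin:
            starts.append(begin)
            ends.append(pos)
        begin = pos + 1
    return starts, ends


def analyze(text):
    chars = to_lower(text)
    marks = mark_delimiters(chars)
    starts, ends = token_bounds(marks)
    tokens = ["".join(chars[s:e]) for s, e in zip(starts, ends)]
    # GPU version: one thread per token applies the StopWords filter.
    kept = [t for t in tokens if t not in STOP_WORDS]
    return marks, starts, ends, kept


marks, starts, ends, kept = analyze("My name is Bob.\0")
print(marks, starts, ends, kept)
```

In a real kernel the compaction in `token_bounds` would typically be done with a parallel scan rather than a loop; the loop is used here only to keep the sketch short.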
Optimizations: Parallel Segment Merging
• A Lucene index may consist of multiple
sub-indexes or segments
• Merging algorithms are usually based on
priority queues
[Figure: the sorted terms of four segments are matched and merged into a single INDEX]
segment1: car, courage
segment2: car, money, sound
segment3: car, group, money
segment4: courage, group
match: car, car, car → merge → INDEX
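A priority-queue-based merge of sorted segments can be sketched with Python's `heapq.merge`, which performs a heap-based k-way merge. This is a generic illustration, not Lucene's merge code: the grouping of the example terms into four segments is an assumption, and `merge_segments` is a hypothetical helper that only tracks which segments contain each term.

```python
import heapq
from itertools import groupby

# Sorted term lists of four segments (illustrative data; grouping assumed).
SEGMENTS = [
    ["car", "courage"],
    ["car", "money", "sound"],
    ["car", "group", "money"],
    ["courage", "group"],
]


def merge_segments(segments):
    """Return [(term, [segment numbers containing it])] in term order."""
    # Tag every term with its 1-based segment number, then let the
    # heap-based k-way merge interleave the already sorted streams.
    streams = [
        [(term, seg_no) for term in terms]
        for seg_no, terms in enumerate(segments, start=1)
    ]
    merged = heapq.merge(*streams)
    # Equal terms are now adjacent, so they can be matched and combined.
    return [
        (term, [seg_no for _, seg_no in group])
        for term, group in groupby(merged, key=lambda pair: pair[0])
    ]


merged_index = merge_segments(SEGMENTS)
for term, seg_nos in merged_index:
    print(term, seg_nos)
```

In a real index the matched entries would carry postings lists that are concatenated during the merge; the segment numbers here only show where each term was found.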
Optimizations: Parallel Segment Merging
• Perform multi-threaded heapify and merging of extracted data in parallel
– Based on Parallel P.P.Q. [Deo and Prasad (1992)]
– Pthreads-based implementation [IEEE Std 1003.1c-1995]
[Figure: heap whose nodes each hold r = 2 items. Node 1: 18, 28; node 2: 31, 32; node 3: 25, 29; node 4: 37, 42; node 5: 51, 52; node 6: 37, 39; node 7: 51, 52; node 8: 41, 47; node 9: 54, 57. Other labels: Thread_0, Thread_1, E, O, levels 0, 1, 2, ...]
ISODAC vs. Spark Indexing
Indexing time (seconds) by dataset size:
Dataset   ISODAC   Spark
4 GB      75       242
8 GB      125      577
16 GB     239      917
32 GB     418      1927
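The relative advantage can be read off the reported indexing times with a short calculation. The sketch below assumes that the first series of values (75, 125, 239, 418 seconds) belongs to ISODAC and the second (242, 577, 917, 1927 seconds) to Spark, which agrees with the 16 GB figures in the fault recovery results; the `speedup` helper is hypothetical.

```python
# Indexing times in seconds: dataset size -> (ISODAC, Spark).
TIMES = {
    "4 GB": (75, 242),
    "8 GB": (125, 577),
    "16 GB": (239, 917),
    "32 GB": (418, 1927),
}


def speedup(isodac_seconds, spark_seconds):
    """How many times faster ISODAC is than Spark for one dataset."""
    return spark_seconds / isodac_seconds


speedups = {
    size: round(speedup(isodac, spark), 2)
    for size, (isodac, spark) in TIMES.items()
}
for size, factor in speedups.items():
    print(f"{size}: {factor}x")
```

Under these assumptions the ratio stays between roughly 3 and 5 across all four dataset sizes.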
Other Group Projects at NASA JPL
Design and development of complex instrument flight and
ground software and scientific applications to analyze scientific
data and produce products within mission software systems.
DHS Automatic Semantic
Search Engine for
Suitable Standards
DARPA Memex for fight
against and prevention of
Human Trafficking
Apache committer and PMC member since April 2015
jpl.nasa.gov
https://github.com/giuseppetotaro
https://www.researchgate.net/profile/Giuseppe_Totaro
https://www.linkedin.com/in/giuseppetotaro
Giuseppe.U.Totaro@jpl.nasa.gov
Backup Slides
Overview of the Indexing Problem
• Fast indexing and searching of Big Data
– inverted index construction is a burdensome
operation
• What is the Hadoop ecosystem?
– MapReduce [Dean and Ghemawat (2010)]
– HDFS [Shafer, Rixner and Cox (2010)]
– Ecosystem projects: Hive, Spark, Ambari, Pig,
Kafka, Flume, Zookeeper, etc
• Why not Hadoop?
– Hadoop performance
– Data Not Distributed
[Jiang et al. (2010)] [Lin et al. (2012)]
[Dong et al. (2014)]
Software Architecture
[Figure: software architecture. Labeled components: Web GUI, MEDIATOR, SEARCHER, SearchAdmin, INDEX REPO, DATABASE, DBMS, COORDINATOR (Status Manager, Job Scheduler), Worker Nodes (WORKER AGENT), HPC Cluster. Connections' legend: DB input/output, HPC cluster.]
Evaluation: ISODAC performance
[Figure: indexing time (seconds) by dataset size (4 GB, 8 GB, 16 GB, 32 GB) for 1 worker node (4 pipes), 2 worker nodes (8 pipes), and 3 worker nodes (12 pipes)]
Fault Recovery
• Simulation of two kinds of failure:
– Transient: node restarts after the failure
– Permanent: node becomes permanently
unavailable
Stress pattern (16 GB dataset):
               ISODAC                    Spark
               Time (secs)   Overhead    Time (secs)   Overhead
No Failure     239           -           917           -
Transient      387           61%         1131          45%
Permanent      415           73%         2542          177%
Similarities and Differences
Inverted Index vs. Database
When to use an index
• High volume of free form
text data to be searched
• High volume of
interactive queries
• Demand for very flexible
full text search querying
• Basic capabilities for
CRUD operations
• Basic capabilities for
organizing the data
Overlapping features
• Storage of data
• CRUD operations
• Search capabilities
• Unit of search and index
– document = table row
– field = table column
• Schema-oriented
approach
Metadata Interoperability
• Prerequisite for uniform access to media
objects
“Metadata interoperability is a qualitative
property of metadata information objects that
enables systems and applications to work
with or use these objects across system
boundaries.” [Haslhofer and Klas (2010)]
Metadata Heterogeneities
Predominant heterogeneities have been originally identified by:
[Sheth, Larson (1990)] [Ouksel, Sheth (1999)] [Wache (2003)] [Visser et al. (1997)]
Example of Metadata Heterogeneities
Metadata Mapping
• Interoperability technique that subsumes:
– schema mapping (crosswalks)
– instance transformation (functions)
“Given a source schema $S^s \in \mathcal{S}$ and a target schema $S^t \in \mathcal{S}$, each consisting of a set of schema elements, $e^s \in S^s$ and $e^t \in S^t$ respectively, a mapping $M \in \mathcal{M}$ is a directional relationship between a set of elements $e_i^s \in S^s$ and a set of elements $e_j^t \in S^t$.” [Haslhofer and Klas (2010)]
Mapping Relationship
Each mapping relationship $m \in M$ is described by:
– Cardinality (e.g., 1:1, 1:n, n:1)
– Mapping expression $p \in P$
– Instance transformation function $f \in F$
• Mapping expressions [Spaccapietra et al. (1992)]
– Exclude: $I(e_i^s) \cap I(e_j^t) = \emptyset$
– Equivalent: $I(e_i^s) \equiv I(e_j^t)$
– Include: $I(e_i^s) \subseteq I(e_j^t) \lor I(e_j^t) \subseteq I(e_i^s)$
– Overlap: $I(e_i^s) \cap I(e_j^t) \neq \emptyset \land I(e_i^s) \not\subseteq I(e_j^t) \land I(e_j^t) \not\subseteq I(e_i^s)$
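The four mapping expressions can be checked mechanically if the interpretation I(e) of a schema element is modelled as a finite set of instances, which is an assumption made here for illustration. The function name `mapping_expression` is hypothetical. Because the Include condition also holds for equal sets, and for an empty set paired with a non-empty one, the sketch tests the most specific cases first.

```python
def mapping_expression(source_instances, target_instances):
    """Classify the relation between I(e_i^s) and I(e_j^t)."""
    s, t = set(source_instances), set(target_instances)
    if s == t:
        # Equivalent: both elements denote the same instances.
        return "equivalent"
    if s.isdisjoint(t):
        # Exclude: the intersection is empty.
        return "exclude"
    if s <= t or t <= s:
        # Include: one interpretation is contained in the other.
        return "include"
    # Overlap: non-empty intersection, but neither contains the other.
    return "overlap"


# Illustrative instance sets (made-up values).
print(mapping_expression({"2014", "2015"}, {"2014", "2015"}))
print(mapping_expression({"2014"}, {"2016"}))
print(mapping_expression({"2014"}, {"2014", "2015"}))
print(mapping_expression({"2014", "2015"}, {"2015", "2016"}))
```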
Elements of Metadata Mapping
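[Figure: elements of a metadata mapping $M$ between a source schema $S^s$ (elements $e_1^s, e_2^s, e_3^s, \ldots, e_n^s$) and a target schema $S^t$ (elements $e_1^t, e_2^t, e_3^t, \ldots, e_m^t$), connected by mapping relationships $m_1(p_a)$, $m_2(p_b)$, ..., $m_k(p_c)$ with instance transformation functions $f_x$, $f_y$, $f_z$. Source: A Survey of Techniques for Achieving Metadata Interoperability, p. 29.]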
Highly Configurable, Fine-Grained Mapping
To be integrated into Tika (TIKA-1691) as a new component

More Related Content

What's hot

Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Anubhav Jain
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Anubhav Jain
 
A Recommender Story: Improving Backend Data Quality While Reducing Costs
A Recommender Story: Improving Backend Data Quality While Reducing CostsA Recommender Story: Improving Backend Data Quality While Reducing Costs
A Recommender Story: Improving Backend Data Quality While Reducing CostsDatabricks
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Anubhav Jain
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...BrianDeCost
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Anubhav Jain
 
The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningRafael Ferreira da Silva
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsAnubhav Jain
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Anubhav Jain
 
Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Research Data Alliance
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningAnubhav Jain
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Anubhav Jain
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...Ian Foster
 
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting LiStanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting LiPacificResearchPlatform
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representationsMarco Quartulli
 

What's hot (20)

Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
A Recommender Story: Improving Backend Data Quality While Reducing Costs
A Recommender Story: Improving Backend Data Quality While Reducing CostsA Recommender Story: Improving Backend Data Quality While Reducing Costs
A Recommender Story: Improving Backend Data Quality While Reducing Costs
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...
 
The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource Provisioning
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
ICME Workshop Jul 2014 - The Materials Project
ICME Workshop Jul 2014 - The Materials ProjectICME Workshop Jul 2014 - The Materials Project
ICME Workshop Jul 2014 - The Materials Project
 
Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting LiStanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
 
CLIM Program: Remote Sensing Workshop, High Performance Computing and Spatial...
CLIM Program: Remote Sensing Workshop, High Performance Computing and Spatial...CLIM Program: Remote Sensing Workshop, High Performance Computing and Spatial...
CLIM Program: Remote Sensing Workshop, High Performance Computing and Spatial...
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 

Similar to ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexing of Scientific Data

Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? Robert Grossman
 
HPC I/O for Computational Scientists
HPC I/O for Computational ScientistsHPC I/O for Computational Scientists
HPC I/O for Computational Scientistsinside-BigData.com
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPCGenoveva Vargas-Solar
 
Official resume titash_mandal_
Official resume titash_mandal_Official resume titash_mandal_
Official resume titash_mandal_Titash Mandal
 
Tracking research data footprints - slides
Tracking research data footprints - slidesTracking research data footprints - slides
Tracking research data footprints - slidesARDC
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)Konstantinos Zagoris
 
Iciic 2010 114
Iciic 2010 114Iciic 2010 114
Iciic 2010 114hanums1
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...aceas13tern
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptxKtonNguyn2
 
How to use NCI's national repository of big spatial data collections
How to use NCI's national repository of big spatial data collectionsHow to use NCI's national repository of big spatial data collections
How to use NCI's national repository of big spatial data collectionsARDC
 
Performance Analysis of MapReduce Implementations on High Performance Homolog...
Performance Analysis of MapReduce Implementations on High Performance Homolog...Performance Analysis of MapReduce Implementations on High Performance Homolog...
Performance Analysis of MapReduce Implementations on High Performance Homolog...Koichi Shirahata
 
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.Anita Graser
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Advance Data Mining Project Report
Advance Data Mining Project ReportAdvance Data Mining Project Report
Advance Data Mining Project ReportArnab Mukhopadhyay
 
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfcadejaumafiq
 
NASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & EngineeringNASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & Engineeringinside-BigData.com
 

ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexing of Scientific Data

  • 1. High Performance Computing for Distributed Indexing of Scientific Data
      Giuseppe Totaro, PhD
      Scientific Applications Software Engineer
      Computer Science for Data Intensive Applications (398M)
  • 2. Instrument Software and Science Data Systems (3980)
      • The section is responsible for
          – Controlling the remote sensing instruments deployed on and around planet Earth and beyond
          – Transforming instrument observations into precision measurements and accurate and timely information
          – Maintaining and making the observations, measurements and information usable for a broad set of applications
      • Design and development of complex instrument flight and ground software and scientific applications
      • Integration of new algorithms and models into high performance science data systems
      • Development of complex test plans and unit tests
      • Our teams of technologists work closely with mission science and instrument teams
      ESA-ESTEC, Nov 17, 2017
  • 3. Summary
      • The simplest form of information retrieval is to linearly scan through documents
      • The way to speed up search operations is to build an inverted index in advance
      • Building an index of scientific big data is very time consuming
      • High Performance Computing for indexing and searching heterogeneous big data
  • 4. Information Retrieval
      “Information retrieval (IR) is finding material (usually documents) of an unstructured (or semistructured) nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” [Manning, Raghavan and Schutze (2008)]
      • IR systems can be distinguished by the scale
          – web search
          – personal information retrieval
          – domain-specific search
      • Challenging blend of science and engineering
      • First major concept in IR: the inverted index
  • 5. Inverted Index
      1. Collect the documents to be indexed
      2. Tokenize the text
      3. Produce a list of normalized tokens
      4. Index the documents
      Documents as collected:
          Doc 1  New home sales top forecasts
          Doc 2  HOME SALES RISE IN JULY
          Doc 3  Increase in home sales in July
          Doc 4  July new home sales rise
      Normalized tokens:
          Doc 1  new home sales top forecasts
          Doc 2  home sales rise in july
          Doc 3  increase in home sales in july
          Doc 4  july new home sales rise
      Inverted index (DICTIONARY: term and doc. freq.; POSTINGS: documents):
          term       doc. freq.  documents
          forecasts  1           1:20
          home       4           1:5, 2:1, 3:13, 4:10
          in         2           2:17, 3:10-24
          increase   1           3:1
          july       3           2:20, 3:27, 4:1
          new        2           1:1, 4:6
          rise       2           2:12, 4:21
          sales      4           1:10, 2:6, 3:18, 4:15
          top        1           1:16
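The four steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the names `DOCS` and `build_index` are made up here, and the sketch assumes that the numbers after each colon in the postings are 1-based character offsets of the term within its document, which is what the slide's values appear to be.

```python
import re
from collections import defaultdict

# Step 1: collect the documents (the four example documents from the slide).
DOCS = {
    1: "New home sales top forecasts",
    2: "HOME SALES RISE IN JULY",
    3: "Increase in home sales in July",
    4: "July new home sales rise",
}

def build_index(docs):
    """Return {term: {doc_id: [1-based character offsets]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Step 2: tokenize on runs of word characters.
        for match in re.finditer(r"\w+", text):
            # Step 3: normalize the token (lower-casing only in this sketch).
            term = match.group().lower()
            # Step 4: record a posting with its offset in the document.
            index[term].setdefault(doc_id, []).append(match.start() + 1)
    # Keep the dictionary sorted by term, as on the slide.
    return dict(sorted(index.items()))

index = build_index(DOCS)
for term, postings in index.items():
    # The document frequency is the number of documents containing the term.
    print(term, len(postings), postings)
```

With this sketch, `index["in"]` is `{2: [17], 3: [10, 24]}`, which corresponds to the slide's entry `2:17, 3:10-24`.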
  • 6. Scientific Big Data
      • The Gartner Group proposed the “4V” definition for big data
      • “Datasets that cannot be acquired, managed, or processed on common devices within an acceptable time”
      • Scientific Big Data [Guo et al. (2014)]
          – complexity, comprehensiveness, globalization, and high integration
          – multidisciplinary approaches
          – object and tool of research
      • “3H” (High dimension, High complexity, High uncertainty)
      A special issue of Nature on Big Data
  • 7. Scientific Case Study: Automatic Classification and Interpretation of Polar Datasets
      Thwaites Glacier bed topography derived from airborne radar data. [Joughin, Smith and Medley (2014)]
  • 8. Scientific Case Study: Automatic Classification and Interpretation of Polar Datasets
      [Burgess and Mattmann (2014)]
      • NSF ADC, https://arcticdata.io
      • NASA AMD, https://gcmd.gsfc.nasa.gov
      • NSIDC ADE, http://arctic-data-explorer.labs.nsidc.org
  • 9. Scientific Case Study: Automatic Classification and Interpretation of Polar Datasets
      • TREC Dynamic Domain: Polar Science (http://trec-dd.org/2015/2015.html)
          – Data Triage
          – Web Crawl
          – Data Preparation
          – Inverted Indexing
          – Querying the Data
      Dataset of 50,000+ crawled web pages, scientific data (HDF, NetCDF files, Grib files), zip files, PDFs, images and science code related to the polar sciences and available publicly from the NSF ADC, NASA AMD, and NSIDC ADE. [Burgess, Mattmann, Totaro, McGibbney and Ramirez (2014)]
      Figure 3. More precise extracted content using Apache Tika and Apache Nutch for the ACADIS Polar data repository. Our approach was able to extract 6% more of the data by correctly identifying many previously “unknown” MIME types in ACADIS.
  • 10. Indexing Data Flow
      • Main bottlenecks
          – parsing of documents
          – text analysis
          – merging of segments
      • Detection complexity
          – parser failures
          – parser error
          – parser inability
      Data flow: data source → Extraction → raw data → Parsing → text + metadata → Indexing → inverted index
      • Extraction: extracting raw files from data sources (e.g., HDF, NetCDF, GRIB, PDF, etc.)
      • Parsing operations: mimetype detection, language detection, content handling, metadata extraction
      • Indexing operations: tokenization, stop words removal, stemming
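The extraction, parsing and indexing stages can be illustrated with a toy Python pipeline built from generators. Everything here is an assumption made for illustration: the stage functions, the stop-word list and the one-rule stemmer are hypothetical, the "data source" is an in-memory dictionary, and parsing is reduced to MIME type guessing plus UTF-8 decoding. The point is only to show how data flows through the stages and how a parser failure can be recorded without stopping the run.

```python
import mimetypes
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def extract(source):
    """Extraction: yield (name, raw bytes) pairs from a data source."""
    yield from source.items()

def parse(raw_files):
    """Parsing: guess the MIME type and turn raw bytes into text + metadata."""
    for name, raw in raw_files:
        mime, _encoding = mimetypes.guess_type(name)
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Parser failure: report it downstream instead of raising.
            yield name, None, {"mime": mime, "error": "undecodable"}
            continue
        yield name, text, {"mime": mime, "error": None}

def naive_stem(token):
    """Toy stemmer: strip one trailing 's' (a real stemmer does far more)."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

def build_index(parsed):
    """Indexing: tokenization, stop words removal, stemming."""
    inverted, failures = {}, []
    for name, text, _meta in parsed:
        if text is None:
            failures.append(name)
            continue
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in STOP_WORDS:
                continue
            inverted.setdefault(naive_stem(token), set()).add(name)
    return inverted, failures

source = {"a.txt": b"Sales of homes", "b.bin": b"\xff\xfe"}
inverted, failures = build_index(parse(extract(source)))
print(inverted, failures)
```

Because the stages are chained generators, each file passes through all three stages in memory before the next one is read, which loosely mirrors the in-memory pipeline idea discussed later in the talk.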
  • 11. ISODAC: A High Performance Solution
      • How to efficiently face the inverted indexing problem?
          – Extract-Parse-Index in-memory pipelines
          – Parallel Technologies (e.g., GPU, Pthreads)
          – Enhancing de facto IR standard technologies
      • ISODAC [Totaro, Bernaschi, Carbone, et al. (2016)]
          1. New high performance solution for indexing
          2. Only inverted indexes are written on disk
          3. Scalable according to hardware resources
  • 12. EPI In-Memory Pipeline
      [Diagram. Stages: S1: EXTRACT, S2: PARSE, S3: INDEXING. Components: one Image-Extractor, four Docu-Parsers, four Docu-Indexers. Other labels: The Sleuth Kit, JNI-based buffers.]
  • 13. EPI In-Memory Pipeline: Recovery Procedure
      • Provides resilience against both missing files (M) and parsing failures (rounds)
      • Each global fileID can be calculated by using the following formula:
        \[
        \mathit{fileID} =
        \begin{cases}
        M \times N[0] + O_k[0] & \mathit{rounds} = 0 \\
        \mathit{sent}[z] + (M - C_k[z]) \times N[z+1] + O_k[z+1] & \mathit{rounds} > 0
        \end{cases}
        \qquad z = \operatorname{argmax}_i \, \bigl( C_k[i] < M \bigr)
        \]
      [Diagram. Actors: Docu-Indexer, Image-Extractor, Coordinator. Steps: detects local missing files (M); uploads pipe id (Pk) and list of M; uploads pipe seq. (Ok) and counters (Ck); determines list of global missing files; starts an additional job.]
  • 14. EPI In-Memory Pipeline: Recovery Procedure
      • First round, i = 0
          N[0] = 6
          O[0] = <0,1,2,3,4,5>
          C[0] = <x0,x1,x2,x3,x4,x5>
      • Second round, i+1 = 1
          N[1] = 4
          O[1] = <0,null,1,2,null,3>
          C[1] = <y0,0,y1,y2,0,y3>
      [Diagram. Image-Extractor, Docu-Parsers and Docu-Indexers; pipelines numbered 0 to 5, two of them marked null.]
  • 15. Optimizations: Text Analysis on GPU
      • A Token object consists of: <text, start, end, position, type> [Wetzel (2014)]
      • Text Analysis is a time-consuming function
      Input: XY&Z Corporation - xyz@example.com
          WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
          SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
          StopAnalyzer:       [xy] [z] [corporation] [example] [com]
          StandardAnalyzer:   [xy&z] [corporation] [xyz@example.com]
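To make the Token tuple and the difference between analyzers concrete, here is a rough Python emulation of two of them. It is not Lucene or CLucene code: `whitespace_tokens` and `simple_tokens` are hypothetical helpers that only imitate the splitting rules (split on whitespace; split on non-letters and lower-case), and the `type` field is filled with a placeholder value.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    text: str      # token text after any normalization
    start: int     # offset of the first character in the input
    end: int       # offset one past the last character
    position: int  # ordinal position in the token stream
    type: str      # placeholder label in this sketch

def whitespace_tokens(text):
    """Imitates a whitespace analyzer: split on whitespace, keep case."""
    return [Token(m.group(), m.start(), m.end(), i, "word")
            for i, m in enumerate(re.finditer(r"\S+", text))]

def simple_tokens(text):
    """Imitates a simple analyzer: split on non-letters, then lower-case."""
    return [Token(m.group().lower(), m.start(), m.end(), i, "word")
            for i, m in enumerate(re.finditer(r"[^\W\d_]+", text))]

TEXT = "XY&Z Corporation - xyz@example.com"
for token in whitespace_tokens(TEXT):
    print(token)
for token in simple_tokens(TEXT):
    print(token)
```

The token texts produced for `TEXT` match the WhitespaceAnalyzer and SimpleAnalyzer rows of the slide's example; the other two analyzers need stop-word and grammar rules that are left out of this sketch.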
  • 16. Optimizations: Text Analysis on GPU
      • CUDA-based Text Analysis
          – Text analysis including tokenization, filtering and stop words removal processes
          – Exploits the computational power of GPU cards
          – Extends CLucene StandardAnalyzer with CUDA kernels that perform text analysis
      CLucene Libraries: http://clucene.sourceforge.net/
  • 17. Optimizations: Text Analysis on GPU
      Example input: "My name is Bob." followed by a terminating 0
      • One CUDA thread per character. Each thread applies the LowerCase filter, giving "my name is bob."
      • Each CUDA thread performs tokenization by locating delimiter positions:
          -1 -1 2 -1 -1 -1 -1 7 -1 -1 10 -1 -1 -1 14 15
      • Vector processing creates two vectors representing the start and end token indexes (related to the input text):
          Start indexes: 0 3 8 11
          End indexes:   2 7 10 14
      • One CUDA thread per token. Each thread applies the StopWords filter.
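The per-character and per-token steps on this slide can be emulated sequentially in plain Python to check the index vectors. This is not CUDA code: each list comprehension stands in for what one thread per element would compute, and the delimiter set (space, period, terminating NUL) and the stop-word list are assumptions chosen to reproduce the slide's example.

```python
TEXT = "My name is Bob.\0"
DELIMITERS = set(" .\0")   # assumed delimiter set
STOP_WORDS = {"is"}         # hypothetical stop-word list

# One "thread" per character: apply the lower-case filter.
lowered = [ch.lower() for ch in TEXT]

# One "thread" per character: write its own index if it is a delimiter, else -1.
delims = [i if ch in DELIMITERS else -1 for i, ch in enumerate(lowered)]

# Vector processing: compact the delimiter positions, then derive the
# start and end index of every token lying between two delimiters.
positions = [d for d in delims if d != -1]
starts, ends = [], []
previous = -1
for d in positions:
    if d - previous > 1:        # at least one character between delimiters
        starts.append(previous + 1)
        ends.append(d)
    previous = d

# One "thread" per token: apply the stop-words filter.
tokens = ["".join(lowered[s:e]) for s, e in zip(starts, ends)]
kept = [t for t in tokens if t not in STOP_WORDS]

print(delims)
print(starts, ends)
print(tokens, kept)
```

The computed `delims`, `starts` and `ends` agree with the vectors shown on the slide. On a GPU the compaction step would typically be done with a parallel scan rather than a loop.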
  • 18. Optimizations: Text Analysis on GPU
  • 19. Optimizations: Parallel Segment Merging
      • A Lucene index may consist of multiple sub-indexes or segments
      • Merging algorithms are usually based on priority queues
      [Diagram. Four segments (segment1 to segment4) holding the terms car, courage, money, sound and group; entries for the same term (e.g. car) are matched and merged into a single INDEX.]
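A priority-queue merge of sorted segments can be sketched with Python's `heapq`. This is a generic k-way merge, not Lucene's merge code: the `merge_segments` function, the segment layout (a sorted list of `(term, postings)` pairs) and the document ids are all assumptions made for the example, and the assignment of terms to segments is only loosely modelled on the slide's figure.

```python
import heapq

def merge_segments(segments):
    """Merge sorted (term, postings) lists into one sorted list.

    The heap holds one cursor per segment, keyed by the term it points at,
    so the smallest unmerged term is always popped next.
    """
    heap = []
    for seg_id, segment in enumerate(segments):
        if segment:
            heapq.heappush(heap, (segment[0][0], seg_id, 0))

    merged = []
    while heap:
        term, seg_id, pos = heapq.heappop(heap)
        postings = segments[seg_id][pos][1]
        if merged and merged[-1][0] == term:
            # Same term as the previous entry: match and merge postings.
            merged[-1][1].extend(postings)
        else:
            merged.append((term, list(postings)))
        # Advance this segment's cursor.
        if pos + 1 < len(segments[seg_id]):
            heapq.heappush(heap, (segments[seg_id][pos + 1][0], seg_id, pos + 1))
    return merged

# Hypothetical segments; the numbers are made-up document ids.
segment1 = [("car", [1]), ("courage", [2])]
segment2 = [("car", [3]), ("money", [4]), ("sound", [3])]
segment3 = [("car", [5]), ("group", [6]), ("money", [5])]
segment4 = [("courage", [7]), ("group", [8])]

merged_index = merge_segments([segment1, segment2, segment3, segment4])
for term, postings in merged_index:
    print(term, postings)
```

Ties on the term are broken by segment number, so postings from earlier segments come first in each merged list.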
  • 20. Optimizations: Parallel Segment Merging
      • Multi-threaded heapify and merging of the extracted data, performed in parallel
          – Based on Parallel P.P.Q. [Deo and Prasad (1992)]
          – Pthreads-based implementation [IEEE Std 1003.1c-1995]
      • Figure: heap with r = 2 items per node, levels labeled E and O, and two threads (Thread_0, Thread_1)
  • 21. ISODAC vs. Spark Indexing
      • Figure: indexing time (seconds) per dataset
          – 4 GB: ISODAC 75, Spark 242
          – 8 GB: ISODAC 125, Spark 577
          – 16 GB: ISODAC 239, Spark 917
          – 32 GB: ISODAC 418, Spark 1927
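The gap between the two systems on slide 21 is easier to read as a ratio. The short sketch below only divides the Spark times by the ISODAC times taken from the chart; the variable names are illustrative.

```python
datasets = ["4 GB", "8 GB", "16 GB", "32 GB"]
isodac_s = [75, 125, 239, 418]    # ISODAC indexing time, seconds
spark_s = [242, 577, 917, 1927]   # Spark indexing time, seconds

# Ratio of Spark time to ISODAC time for each dataset size.
speedup = {d: s / i for d, i, s in zip(datasets, isodac_s, spark_s)}

for dataset, ratio in speedup.items():
    print(f"{dataset}: Spark/ISODAC = {ratio:.2f}x")
```

On these figures ISODAC is between roughly 3.2 and 4.6 times faster than Spark, with no clear trend across dataset sizes.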
  • 22. Other Group Projects at NASA JPL
      • Design and development of complex instrument flight and ground software and scientific applications to analyze scientific data and produce products within mission software systems
      • DHS Automatic Semantic Search Engine for Suitable Standards
      • DARPA Memex, for fighting and preventing human trafficking
      • Apache committer and PMC member since April 2015
  • 25. Overview of the Indexing Problem
      • Fast indexing and searching of Big Data
          – Inverted index construction is a burdensome operation
      • What is the Hadoop ecosystem?
          – MapReduce [Dean and Ghemawat (2010)]
          – HDFS [Shafer, Rixner and Cox (2010)]
          – Ecosystem projects: Hive, Spark, Ambari, Pig, Kafka, Flume, Zookeeper, etc.
      • Why not Hadoop? [Jiang et al. (2010)] [Lin et al. (2012)] [Dong et al. (2014)]
          – Hadoop performance
          – Data not distributed
  • 26. Software Architecture
      • Figure: architecture diagram with the components Web GUI, SEARCHER, SearchAdmin, MEDIATOR, COORDINATOR (Status Manager, Job Scheduler), DBMS, INDEX REPO, and the HPC Cluster of Worker Nodes, each running a WORKER AGENT
      • Connections' legend: DB input/output, HPC cluster
  • 27. Evaluation: ISODAC performance
      • Figure: indexing time (seconds) for the 4 GB, 8 GB, 16 GB and 32 GB datasets with 1 worker node (4 pipes), 2 worker nodes (8 pipes) and 3 worker nodes (12 pipes)
  • 28. Fault Recovery
      • Simulation of two kinds of failure:
          – Transient: node restarts after the failure
          – Permanent: node becomes permanently unavailable
      • Table: stress patterns on the 16 GB dataset (time in seconds, with overhead in parentheses)
          – No failure: ISODAC 239 (-); Spark 917 (-)
          – Transient: ISODAC 387 (61%); Spark 1131 (45%)
          – Permanent: ISODAC 415 (73%); Spark 2542 (177%)
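The overhead column on slide 28 can be checked against the times. The sketch below assumes overhead is the relative increase over the no-failure run, (t_failure - t_baseline) / t_baseline; the slide does not state its formula, so this is an assumption, and `overhead_pct` is a hypothetical helper.

```python
def overhead_pct(failure_s, baseline_s):
    """Relative slowdown of a failure run over the no-failure run, in percent."""
    return 100.0 * (failure_s - baseline_s) / baseline_s


# Times in seconds for the 16 GB dataset, taken from the slide.
times = {
    "ISODAC": {"none": 239, "transient": 387, "permanent": 415},
    "Spark": {"none": 917, "transient": 1131, "permanent": 2542},
}

overheads = {
    system: {
        kind: overhead_pct(t[kind], t["none"])
        for kind in ("transient", "permanent")
    }
    for system, t in times.items()
}

for system, per_kind in overheads.items():
    for kind, pct in per_kind.items():
        print(f"{system} {kind}: {pct:.1f}%")
```

Under this assumption three of the four reported overheads (61%, 73%, 177%) agree with the times after truncation. The Spark transient case does not: 1131 s against 917 s is an increase of about 23%, not the 45% shown, so either the slide uses a different baseline for that cell or one of the two values was transcribed incorrectly.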
  • 29. Similarities and Differences: Inverted Index vs. Database
      • When to use an index
          – High volume of free-form text data to be searched
          – High volume of interactive queries
          – Demand for very flexible full-text search querying
          – Basic capabilities for CRUD operations
          – Basic capabilities for organizing the data
      • Overlapping features
          – Storage of data
          – CRUD operations
          – Search capabilities
          – Unit of search and index: document = table row, field = table column
          – Schema-oriented approach
  • 30. Metadata Interoperability
      • Prerequisite for uniform access to media objects
      • "Metadata interoperability is a qualitative property of metadata information objects that enables systems and applications to work with or use these objects across system boundaries." [Haslhofer and Klas (2010)]
  • 31. Metadata Heterogeneities
      • The predominant heterogeneities were originally identified by: [Sheth, Larson (1990)] [Ouksel, Sheth (1999)] [Wache (2003)] [Visser et al. (1997)]
  • 32. Example of Metadata Heterogeneities (figure only)
  • 33. Metadata Mapping
      • Interoperability technique that subsumes:
          – schema mapping (crosswalks)
          – instance transformation (functions)
      • "Given a source schema $S^s \in \mathcal{S}$ and a target schema $S^t \in \mathcal{S}$, each consisting of a set of schema elements, $e^s \in S^s$ and $e^t \in S^t$ respectively, a mapping $M \in \mathcal{M}$ is a directional relationship between a set of elements $e_i^s \in S^s$ and a set of elements $e_j^t \in S^t$." [Haslhofer and Klas (2010)]
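The two parts of metadata mapping on slide 33 can be shown together in a small sketch: a crosswalk says which source element maps to which target element, and a function transforms the instance value. Everything here is hypothetical (the field names `creator`, `date`, `author`, `year`, the `CROSSWALK` table and `map_record`); it is not the component proposed in the talk.

```python
# Crosswalk: source element -> (target element, instance transformation).
CROSSWALK = {
    "creator": ("author", str.strip),             # rename, trim whitespace
    "date": ("year", lambda value: int(value[:4])),  # rename, keep the year
}


def map_record(record, crosswalk):
    """Apply a directional source-to-target mapping to one metadata record."""
    mapped = {}
    for source_element, value in record.items():
        if source_element not in crosswalk:
            continue  # no mapping defined for this element: drop it
        target_element, transform = crosswalk[source_element]
        mapped[target_element] = transform(value)
    return mapped


source = {"creator": " Ada Lovelace ", "date": "1843-09-01", "pages": 12}
target = map_record(source, CROSSWALK)
print(target)
```

The mapping is directional, as in the quoted definition: applying it gives a target record, and nothing in the table says how to go back from `year` to a full `date`.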
  • 34. Mapping Relationship
      • A mapping relationship $m \in M$ carries:
          – Cardinality (e.g., 1:1, 1:n, n:1)
          – Mapping expression $p \in P$
          – Instance transformation function $f \in F$
      • Mapping expressions [Spaccapietra et al. (1992)]
          – Exclude: $I(e_i^s) \cap I(e_j^t) = \emptyset$
          – Equivalent: $I(e_i^s) \equiv I(e_j^t)$
          – Include: $I(e_i^s) \subseteq I(e_j^t) \lor I(e_j^t) \subseteq I(e_i^s)$
          – Overlap: $I(e_i^s) \cap I(e_j^t) \neq \emptyset \land I(e_i^s) \nsubseteq I(e_j^t) \land I(e_j^t) \nsubseteq I(e_i^s)$
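The four mapping expressions on slide 34 partition the possible relations between two sets of instances, which a few set operations make concrete. The sketch assumes the interpretation of an element can be represented as a finite Python set and treats "equivalent" as plain set equality; `mapping_expression` is a hypothetical helper.

```python
def mapping_expression(source_instances, target_instances):
    """Classify the relation between two instance sets."""
    s, t = set(source_instances), set(target_instances)
    if not (s & t):
        return "exclude"      # no shared instances
    if s == t:
        return "equivalent"   # checked before "include": equal sets are also subsets
    if s <= t or t <= s:
        return "include"      # one set contains the other
    return "overlap"          # shared instances, but neither contains the other


print(mapping_expression({"a", "b"}, {"c"}))
print(mapping_expression({"a", "b"}, {"a", "b"}))
print(mapping_expression({"a"}, {"a", "b"}))
print(mapping_expression({"a", "b"}, {"b", "c"}))
```

The order of the checks matters because the slide's Include condition is also satisfied by equal sets; testing equality first keeps the four outcomes mutually exclusive.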
  • 35. Elements of Metadata Mapping
      • Figure (from "A Survey of Techniques for Achieving Metadata Interoperability", p. 29): a mapping $M$ relates the source schema $S^s$, with elements $e_1^s, e_2^s, e_3^s, \dots, e_n^s$, to the target schema $S^t$, with elements $e_1^t, e_2^t, e_3^t, \dots, e_m^t$, through the mapping relationships $m_1(p_a)$, $m_2(p_b)$, ..., $m_k(p_c)$ and the functions $f_x$, $f_y$, $f_z$
  • 36. Highly Configurable, Fine-Grained Mapping (figure only)
  • 37. Highly Configurable, Fine-Grained Mapping
      • To be integrated into Tika (TIKA-1691) as a new component