Searching for information within large sets of unstructured, heterogeneous scientific data can be very challenging unless an inverted index has been created in advance. Several solutions, mainly based on the Hadoop ecosystem, have been proposed to accelerate index construction. These solutions perform well when data are already distributed across the cluster nodes involved in the computation; on the other hand, the cost of distributing data can introduce noticeable overhead. We propose ISODAC, a new approach aimed at improving efficiency without sacrificing reliability. Our solution reduces I/O operations to the bare minimum by using a stream of in-memory operations to extract and index heterogeneous data. We further improve performance by using GPUs and POSIX Threads programming for the most computationally intensive tasks of the indexing procedure. ISODAC indexes heterogeneous documents up to 10.6x faster than other widely adopted solutions, such as Apache Spark.
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexing of Scientific Data
1. High Performance Computing for
Distributed Indexing of Scientific Data
Giuseppe Totaro, PhD
Scientific Applications Software Engineer
Computer Science for Data Intensive Applications (398M)
2. j p l . n a s a . g o v
Instrument Software and Science Data
Systems (3980)
• The section is responsible for
– Controlling the remote sensing instruments deployed on
and around planet Earth and beyond
– Transforming instrument observations into precision
measurements and accurate and timely information
– Maintaining and making the observations, measurements
and information usable for a broad set of applications
• Design and development of complex instrument
flight and ground software and scientific applications
• Integration of new algorithms and models into high
performance science data systems
• Development of complex test plans and unit tests
• Our teams of technologists work closely with mission
science and instrument teams
ESA-ESTEC Nov 17, 2017 High Performance Computing for Distributed Indexing of Scientific Data 2
3.
Summary
• The simplest form of information retrieval is
to linearly scan through documents
• The way to speed up search operations is to
build an inverted index in advance
• Building an index of scientific big data is
very time consuming
• High Performance Computing for indexing
and searching heterogeneous big data
4.
Information Retrieval
• IR systems can be distinguished by the scale
– web search
– personal information retrieval
– domain-specific search
• Challenging blend of science and engineering
• First major concept in IR: the inverted index
“Information retrieval (IR) is finding material (usually documents)
of an unstructured (or semistructured) nature (usually text)
that satisfies an information need from within large collections
(usually stored on computers).”
[Manning, Raghavan and Schütze (2008)]
5.
Doc 1 New home sales top forecasts
Doc 2 HOME SALES RISE IN JULY
Doc 3 Increase in home sales in July
Doc 4 July new home sales rise
→ after tokenization and normalization:
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
Inverted Index
1. Collect the documents to be indexed
2. Tokenize the text
3. Produce a list of normalized tokens
4. Index the documents
term (DICTIONARY)   doc. freq.   documents (POSTINGS, doc:offset)
forecasts           1            1:20
home                4            1:5, 2:1, 3:13, 4:10
in                  2            2:17, 3:10, 3:24
increase            1            3:1
july                3            2:20, 3:27, 4:1
new                 2            1:1, 4:6
rise                2            2:12, 4:21
sales               4            1:10, 2:6, 3:18, 4:15
top                 1            1:16
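The four steps above can be sketched in plain Java on the example documents (a minimal illustration that records only document IDs per term, not the character offsets shown in the table):

```java
import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Step 1: collect the documents to be indexed
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "New home sales top forecasts");
        docs.put(2, "HOME SALES RISE IN JULY");
        docs.put(3, "Increase in home sales in July");
        docs.put(4, "July new home sales rise");

        // term -> sorted set of document IDs (the postings list)
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Steps 2-3: tokenize on whitespace and normalize to lower case
            for (String token : doc.getValue().toLowerCase().split("\\s+")) {
                // Step 4: append the document to the term's postings
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(doc.getKey());
            }
        }

        // Dictionary entries with document frequency and postings
        for (Map.Entry<String, SortedSet<Integer>> e : index.entrySet())
            System.out.println(e.getKey() + " df=" + e.getValue().size() + " docs=" + e.getValue());
    }
}
```

Running it reproduces the dictionary above: nine terms, with "home" and "sales" appearing in all four documents.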
6.
The Gartner Group proposed the "4V" definition for big data
• “Datasets that cannot be acquired, managed, or processed on
common devices within an acceptable time”
• Scientific Big Data [Guo et al. (2014)]
– complexity, comprehensiveness, globalization, and high integration
– multidisciplinary approaches
– object and tool of research
• “3H” (High dimension, High complexity, High uncertainty)
Scientific Big Data
A special issue of Nature on Big Data
7.
Automatic Classification and Interpretation of Polar Datasets
Scientific Case Study
Thwaites Glacier bed topography derived from airborne radar data.
[Joughin, Smith and Medley (2014)]
8.
Automatic Classification and Interpretation of Polar Datasets
Scientific Case Study
[Burgess and Mattmann (2014)]
NSF ADC, https://arcticdata.io
NASA AMD, https://gcmd.gsfc.nasa.gov
NSIDC ADE, http://arctic-data-explorer.labs.nsidc.org
9.
Automatic Classification and Interpretation of Polar Datasets
Scientific Case Study
• TREC Dynamic Domain:
Polar Science
– Data Triage
– Web Crawl
– Data Preparation
– Inverted Indexing
– Querying the Data
Figure 3. More precise extracted content using Apache Tika and Apache Nutch for the
ACADIS Polar data repository. Our approach was able to extract 6% more of the data by
correctly identifying many previously “unknown” MIME types in ACADIS.
http://trec-dd.org/2015/2015.html
Dataset of 50,000+ crawled web pages,
scientific data (HDF, NetCDF files, Grib files),
zip files, PDFs, images and science code
related to the polar sciences and available
publicly from the NSF ADC, NASA AMD, and
NSIDC ADE.
[Burgess, Mattmann, Totaro, McGibbney
and Ramirez (2014)]
10.
Indexing Data Flow
• Main bottlenecks
– parsing of documents
– text analysis
– merging of segments
• Detection complexity
– parser failures
– parser errors
– parser inability
data source
raw data
text + metadata
inverted index
Extraction
Parsing
Indexing
Extracting raw files
from data sources
(e.g., HDF, NetCDF,
GRIB, PDF, etc.)
Parsing operations:
- mimetype detection
- language detection
- content handling
- metadata extraction
Indexing operations:
- tokenization
- stop words removal
- stemming
11.
ISODAC: A High Performance Solution
• How can we efficiently tackle the inverted
indexing problem?
– Extract-Parse-Index in-memory pipelines
– Parallel Technologies (e.g., GPU, Pthreads)
– Enhancing de facto IR standard technologies
• ISODAC [Totaro, Bernaschi, Carbone, et al. (2016)]
1. New high performance solution for indexing
2. Only inverted indexes are written to disk
3. Scalable according to hardware resources
12.
EPI In-Memory Pipeline
S1: EXTRACT – Image-Extractor (based on The Sleuth Kit)
S2: PARSE – 4× Docu-Parser
S3: INDEX – 4× Docu-Indexer
The stages are connected through JNI-based in-memory buffers.
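The Extract-Parse-Index pipeline above can be sketched as a producer-consumer chain; this toy version uses Java BlockingQueues in place of the JNI-based buffers, with one thread per stage and invented stage logic (ISODAC uses pools of parsers and indexers, and its real stages do far more work):

```java
import java.util.*;
import java.util.concurrent.*;

public class EpiPipelineSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> rawFiles = new ArrayBlockingQueue<>(16);   // S1 -> S2 buffer
        BlockingQueue<String> parsedDocs = new ArrayBlockingQueue<>(16); // S2 -> S3 buffer
        List<String> index = Collections.synchronizedList(new ArrayList<>());
        final String POISON = "__EOF__"; // end-of-stream marker

        // S1: the extractor pushes raw file names into the first buffer
        Thread extractor = new Thread(() -> {
            try {
                for (String f : List.of("a.hdf", "b.nc", "c.pdf")) rawFiles.put(f);
                rawFiles.put(POISON);
            } catch (InterruptedException ignored) { }
        });

        // S2: a parser turns each raw file into "text + metadata"
        Thread parser = new Thread(() -> {
            try {
                while (true) {
                    String f = rawFiles.take();
                    if (f.equals(POISON)) break;
                    parsedDocs.put("text(" + f + ")");
                }
                parsedDocs.put(POISON);
            } catch (InterruptedException ignored) { }
        });

        // S3: an indexer consumes parsed documents; only its output would hit the disk
        Thread indexer = new Thread(() -> {
            try {
                while (true) {
                    String d = parsedDocs.take();
                    if (d.equals(POISON)) break;
                    index.add(d);
                }
            } catch (InterruptedException ignored) { }
        });

        extractor.start(); parser.start(); indexer.start();
        extractor.join(); parser.join(); indexer.join();
        System.out.println(index); // [text(a.hdf), text(b.nc), text(c.pdf)]
    }
}
```

The bounded queues give the key property claimed on the previous slide: data flows between stages entirely in memory, with back-pressure instead of intermediate files.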
13.
Recovery Procedure
EPI In-Memory Pipeline
• Provides resilience against both missing
files (M) and parsing failures (rounds)
• Each global fileID can be calculated by
using the following formula:
fileID = M × N[0] + O_k[0]                               if rounds = 0
fileID = sent[z] + (M − C_k[z]) × N[z+1] + O_k[z+1]      if rounds > 0
where z = argmax_i { C_k[i] < M }
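The fileID formula can be transcribed mechanically as follows; the array arguments stand in for the per-pipe bookkeeping (N, O_k, C_k, sent), and their names and exact semantics are assumptions read off the slide, not ISODAC source code:

```java
public class FileIdSketch {
    // rounds == 0: fileID = M*N[0] + Ok[0]
    // rounds  > 0: fileID = sent[z] + (M - Ck[z])*N[z+1] + Ok[z+1],
    //              where z = argmax_i { Ck[i] < M }
    static long fileId(int rounds, long M, long[] N, long[] Ok, long[] Ck, long[] sent) {
        if (rounds == 0) return M * N[0] + Ok[0];
        int z = 0;
        for (int i = 0; i < Ck.length; i++)
            if (Ck[i] < M) z = i; // last round whose counter is still below M
        return sent[z] + (M - Ck[z]) * N[z + 1] + Ok[z + 1];
    }

    public static void main(String[] args) {
        // First-round case with invented values: M = 2, N[0] = 6, Ok[0] = 3
        System.out.println(fileId(0, 2, new long[]{6}, new long[]{3}, null, null)); // 2*6 + 3 = 15
    }
}
```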
Recovery steps (Docu-Indexer → Image-Extractor → Coordinator):
1. The Docu-Indexer detects local missing files (M)
2. The Docu-Indexer uploads its pipe id (P_k) and the list of M
3. The Image-Extractor uploads the pipe sequence (O_k) and counters (C_k)
4. The Coordinator determines the list of global missing files
5. The Coordinator starts an additional job
14.
Recovery Procedure
EPI In-Memory Pipeline
• First round
i = 0
N[0] = 6
O[0] = <0,1,2,3,4,5>
C[0] = <x0,x1,x2,x3,x4,x5>
• Second round
i+1 = 1
N[1] = 4
O[1] = <0,null,1,2,null,3>
C[1] = <y0,0,y1,y2,0,y3>
[Figure: the Image-Extractor feeds files 0-5 through the Docu-Parsers to the Docu-Indexers; in the second round, the pipes with null entries in O[1] are no longer available]
15.
Optimizations: Text Analysis on GPU
• A Token object consists of:
<text, start, end, position, type> [Wetzel (2014)]
• Text Analysis is a time-consuming function
Input:              XY&Z Corporation - xyz@example.com
WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:       [xy] [z] [corporation] [example] [com]
StandardAnalyzer:   [xy&z] [corporation] [xyz@example.com]
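The behavior of these analyzers can be approximated in plain Java (a sketch of the tokenization rules, not the Lucene classes themselves; the stop-word list is a toy one, and StandardAnalyzer's grammar-based tokenization is not reproduced here):

```java
import java.util.*;

public class AnalyzerSketch {
    // WhitespaceAnalyzer-style: split on whitespace only, keep case
    static List<String> whitespace(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    // SimpleAnalyzer-style: split on non-letter characters and lower-case
    static List<String> simple(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.toLowerCase().split("[^a-z]+"))
            if (!t.isEmpty()) out.add(t);
        return out;
    }

    // StopAnalyzer-style: simple() plus stop-word removal
    static List<String> stop(String s, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : simple(s))
            if (!stopWords.contains(t)) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        String input = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(input)); // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simple(input));     // [xy, z, corporation, xyz, example, com]
        System.out.println(stop("My name is Bob", Set.of("is", "my"))); // [name, bob]
    }
}
```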
16.
Optimizations: Text Analysis on GPU
• CUDA-based Text Analysis
– Text analysis including tokenization, filtering and stop
words removal processes
– Exploits the computational power of GPU cards
– Extends CLucene StandardAnalyzer with CUDA
kernels that perform text analysis
CLucene Libraries: http://clucene.sourceforge.net/
17.
Optimizations: Text Analysis on GPU
Input: "My name is Bob."
1. One CUDA thread per character: each thread applies the LowerCase filter
   → "my name is bob."
2. One CUDA thread per character: each thread performs tokenization by locating delimiter positions
   → -1 -1 2 -1 -1 -1 -1 7 -1 -1 10 -1 -1 -1 14 15
3. Vector processing creates two vectors representing start and end token indexes (related to the input text)
   → start indexes: 0 3 8 11; end indexes: 2 7 10 14 (tokens: my, name, is, bob)
4. One CUDA thread per token: each thread applies the StopWords filter
   → my, name, bob
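The four kernel stages can be mimicked serially in Java, with each loop iteration standing in for one CUDA thread; this is a sketch of the data layout only (delimiter set and stop-word list are invented), not the CUDA kernels:

```java
import java.util.*;

public class GpuTextAnalysisSketch {
    public static void main(String[] args) {
        char[] text = "My name is Bob.".toCharArray();
        Set<Character> delims = Set.of(' ', '.', ',', ';');
        Set<String> stopWords = Set.of("is", "a", "the");

        // Stage 1: one "thread" per character applies the LowerCase filter
        for (int i = 0; i < text.length; i++)
            text[i] = Character.toLowerCase(text[i]);

        // Stage 2: one "thread" per character marks delimiter positions (-1 = not a delimiter)
        int[] delimPos = new int[text.length];
        for (int i = 0; i < text.length; i++)
            delimPos[i] = delims.contains(text[i]) ? i : -1;

        // Stage 3 (vector processing): build the start/end index vectors of each token
        List<Integer> starts = new ArrayList<>(), ends = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < delimPos.length; i++) {
            if (delimPos[i] >= 0) {
                if (i > start) { starts.add(start); ends.add(i); }
                start = i + 1;
            }
        }
        if (start < text.length) { starts.add(start); ends.add(text.length); }

        // Stage 4: one "thread" per token applies the StopWords filter
        List<String> tokens = new ArrayList<>();
        for (int t = 0; t < starts.size(); t++) {
            String tok = new String(text, starts.get(t), ends.get(t) - starts.get(t));
            if (!stopWords.contains(tok)) tokens.add(tok);
        }
        System.out.println(tokens); // [my, name, bob]
    }
}
```

On the GPU, stages 1, 2 and 4 are embarrassingly parallel; only stage 3, the compaction of delimiter positions into start/end vectors, requires coordination across threads.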
18.
Optimizations: Text Analysis on GPU
19.
Optimizations: Parallel Segment Merging
• A Lucene index may consist of multiple
sub-indexes or segments
• Merging algorithms are usually based on
priority queues
segment1: car, courage
segment2: car, money, sound
segment3: car, group, money
segment4: courage, group
[Figure: the term "car" is matched across segments and the matching entries are merged into a single entry of the final INDEX]
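Priority-queue-based merging of the sorted segment term lists can be sketched as a k-way merge (using the toy term lists from the figure; this is not Lucene's merge code, and postings are elided so equal terms are simply collapsed into one dictionary entry):

```java
import java.util.*;

public class SegmentMergeSketch {
    public static void main(String[] args) {
        // Sorted term lists, one iterator per segment
        List<Iterator<String>> segments = new ArrayList<>(List.of(
                List.of("car", "courage").iterator(),
                List.of("car", "money", "sound").iterator(),
                List.of("car", "group", "money").iterator(),
                List.of("courage", "group").iterator()));

        // The priority queue holds the current head term of each segment
        PriorityQueue<Map.Entry<String, Integer>> pq =
                new PriorityQueue<>(Map.Entry.comparingByKey());
        for (int i = 0; i < segments.size(); i++)
            if (segments.get(i).hasNext())
                pq.add(Map.entry(segments.get(i).next(), i));

        // Repeatedly pop the smallest term; equal terms from different segments
        // come out adjacently, where their postings would be merged
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            Map.Entry<String, Integer> head = pq.poll();
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(head.getKey()))
                merged.add(head.getKey()); // collapse duplicates into one entry
            int seg = head.getValue();
            if (segments.get(seg).hasNext())
                pq.add(Map.entry(segments.get(seg).next(), seg));
        }
        System.out.println(merged); // [car, courage, group, money, sound]
    }
}
```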
20.
Optimizations: Parallel Segment Merging
• Perform in parallel multi-thread heapify
and merging of extracted data
– Based on Parallel P.P.Q. [Deo and Prasad (1992)]
– Pthreads-based implementation [IEEE Std 1003.1c-1995]
[Figure: a parallel heap whose nodes 1-9 hold key ranges (18-28, 31-32, 25-29, 37-42, 51-52, 37-39, 51-52, 41-47, 54-57); Thread_0 and Thread_1 (r = 2) operate on alternating even (E) and odd (O) levels]
21.
ISODAC vs. Spark Indexing
Indexing time (seconds) per dataset:
Dataset   ISODAC   Spark
4 GB      75       242
8 GB      125      577
16 GB     239      917
32 GB     418      1927
22.
Other Group Projects at NASA JPL
Design and development of complex instrument flight and
ground software and scientific applications to analyze scientific
data and produce products within mission software systems.
DHS Automatic Semantic
Search Engine for
Suitable Standards
DARPA Memex for the fight
against and prevention of
Human Trafficking
Apache committer and PMC member since April 2015
25.
Overview of the Indexing Problem
• Fast indexing and searching of Big Data
– inverted index construction is a burdensome
operation
• What is the Hadoop ecosystem?
– MapReduce [Dean and Ghemawat (2010)]
– HDFS [Shafer, Rixner and Cox (2010)]
– Ecosystem projects: Hive, Spark, Ambari, Pig,
Kafka, Flume, Zookeeper, etc.
• Why not Hadoop?
– Hadoop performance
– Data Not Distributed
[Jiang et al. (2010)] [Lin et al. (2012)]
[Dong et al. (2014)]
26.
Software Architecture
[Architecture diagram, showing: a Web GUI; a MEDIATOR; a SEARCHER with SearchAdmin; a COORDINATOR with Status Manager and Job Scheduler; Worker Nodes each running a WORKER AGENT; a DBMS-backed DATABASE (DB input/output connections); the INDEX REPO; all hosted on the HPC cluster]
27. j p l . n a s a . g o v
Evaluation: ISODAC performance
[Chart: indexing time (seconds) on the 4 GB, 8 GB, 16 GB and 32 GB datasets for 1 worker node (4 pipes), 2 worker nodes (8 pipes), and 3 worker nodes (12 pipes)]
28.
Fault Recovery
• Simulation of two kinds of failure:
– Transient: node restarts after the failure
– Permanent: node becomes permanently
unavailable
Stress Pattern     ISODAC                 Spark
(16 GB dataset)    Time (s)   Overhead    Time (s)   Overhead
No Failure         239        -           917        -
Transient          387        61%         1131       45%
Permanent          415        73%         2542       177%
29.
Similarities and Differences
Inverted Index vs. Database
When to use an index
• High volume of free form
text data to be searched
• High volume of
interactive queries
• Demand for very flexible
full text search querying
• Basic capabilities for
CRUD operations
• Basic capabilities for
organizing the data
Overlapping features
• Storage of data
• CRUD operations
• Search capabilities
• Unit of search and index
– document = table row
– field = table column
• Schema-oriented
approach
30.
Metadata Interoperability
• Prerequisite for uniform access to media
objects
“Metadata interoperability is a qualitative
property of metadata information objects that
enables systems and applications to work
with or use these objects across system
boundaries.” [Haslhofer and Klas (2010)]
31.
Metadata Heterogeneities
The predominant heterogeneities were originally identified by:
[Sheth and Larson (1990)] [Ouksel and Sheth (1999)] [Wache (2003)] [Visser et al. (1997)]
32.
Example of Metadata Heterogeneities
33.
Metadata Mapping
• Interoperability technique that subsumes:
– schema mapping (crosswalks)
– instance transformation (functions)
"Given a source schema S_s ∈ 𝒮 and a target schema S_t ∈ 𝒮, each consisting of a set of schema elements, e_s ∈ S_s and e_t ∈ S_t respectively, a mapping M ∈ ℳ is a directional relationship between a set of elements e_i^s ∈ S_s and a set of elements e_j^t ∈ S_t." [Haslhofer and Klas (2010)]
34.
Mapping Relationship
• A mapping relationship m ∈ M consists of:
– a cardinality (e.g., 1:1, 1:n, n:1)
– a mapping expression p ∈ P
– an instance transformation function f ∈ F
• Mapping expressions [Spaccapietra et al. (1992)]
– Exclude: I(e_i^s) ∩ I(e_j^t) = ∅
– Equivalent: I(e_i^s) ≡ I(e_j^t)
– Include: I(e_i^s) ⊆ I(e_j^t) ∨ I(e_j^t) ⊆ I(e_i^s)
– Overlap: I(e_i^s) ∩ I(e_j^t) ≠ ∅ ∧ I(e_i^s) ⊈ I(e_j^t) ∧ I(e_j^t) ⊈ I(e_i^s)
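Reading I(e) as the instance set of a schema element, the four mapping expressions are plain set predicates; a sketch in Java (the example instance sets for a hypothetical "creator" vs. "author" element pair are invented):

```java
import java.util.*;

public class MappingExpressionsSketch {
    static boolean exclude(Set<String> a, Set<String> b) {
        return Collections.disjoint(a, b);               // I(a) ∩ I(b) = ∅
    }
    static boolean equivalent(Set<String> a, Set<String> b) {
        return a.equals(b);                              // I(a) ≡ I(b)
    }
    static boolean include(Set<String> a, Set<String> b) {
        return b.containsAll(a) || a.containsAll(b);     // one contains the other
    }
    static boolean overlap(Set<String> a, Set<String> b) {
        return !Collections.disjoint(a, b)               // non-empty intersection,
                && !b.containsAll(a) && !a.containsAll(b); // but neither contains the other
    }

    public static void main(String[] args) {
        Set<String> creator = Set.of("author", "editor"); // I(e_i^s), invented
        Set<String> author  = Set.of("author");           // I(e_j^t), invented
        System.out.println(include(creator, author));     // true: I(author) ⊆ I(creator)
        System.out.println(overlap(creator, author));     // false: one includes the other
        System.out.println(exclude(creator, Set.of("title"))); // true
    }
}
```

Note that the four predicates are mutually exclusive by construction: Overlap explicitly rules out the containment cases covered by Include and Equivalent.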
35.
Elements of Metadata Mapping
[Figure (from Haslhofer and Klas (2010)): a mapping M with relationships m1(pa)/fx, m2(pb)/fy, …, mk(pc)/fz connecting source schema elements e1^s … en^s of S^s to target schema elements e1^t … em^t of S^t]
36.
Highly Configurable, Fine-Grained Mapping
37.
Highly Configurable, Fine-Grained Mapping
To be integrated into Tika (TIKA-1691) as a new component