Large Scale Indexing
The Share-VDE Project
We believe that writing software should be fun and easy
Andrea Gazzarini
Website: https://spaziocodice.com
Blog: https://spaziocodice.com/thoughts-on-coding
Contacts: https://spaziocodice.com/contact-us/
"Software Programming is the act of writing computer code that enables computer
software to function. People who program software are called Computer
Programmers"
We are Computer Programmers, definitely
The Share-VDE Project
Domain Model
The Domain Model: Tenants and Provenances
[Diagram: a Tenant containing provenances P1, P2, P3, …, Pn]
Share-VDE manages a Knowledge Base which consists of clustered, integrated, and enriched entities.
In Share-VDE, a Tenant is a set of institutions contributing to the same Knowledge Base.
An institution within a tenant is called a provenance. We use that term because we always want to retain the relationship between Share-VDE entities and the source data that originally contributed to building them.
The Domain Model: The Share Family
[Diagram: tenants T1, T2, T3, T4 connected through a central Registry]
Multiple tenants form the Share Family. Family members interoperate through a centralized registry.
The Domain Model: The Knowledge Base
[Diagram: provenances P1, P2, P3, …, Pn each contributing data that flows into the Knowledge Base]
The Domain Model: The Prism
[Cluster example: the same entity seen as a prism, one face per provenance]
P1: title: Alice in wonderland; titleAlternative: Alice’s adventures under ground; author: https://svde.org/people/201
P2: title: Alice in wonderland; titleAlternative: Alice’s adventures under ground; author: https://svde.org/people/201
P3: title: Alice in wonderland
P4: title: Alice’s adventures under ground; titleAlternative: Journeys in Wonderland; author: https://svde.org/people/201
External links:
sameAs: http://dbpedia.org/resource/Alice's_Adventures_in_Wonderland (DBpedia)
sameAs: https://www.wikidata.org/wiki/Q189875 (Wikidata)
sameAs: https://data.bnf.fr/ark:/12148/cb358500385#about (BnF)
The Domain Model: Attributes, Relationships, Links
Attributes (Name / Value / Provenance):
title: Alice in wonderland (P1, P2, P3)
titleAlternative: Alice’s adventures under ground (P1, P2)
titleAlternative: Journeys in Wonderland (P4)

Relationships (Name / Provenance):
author: (P1, P2, P4)

Links (Name / Value / Provenance):
sameAs: http://dbpedia.org/resource/Alice's_Adventures_in_Wonderland (DBpedia)
sameAs: https://www.wikidata.org/wiki/Q189875 (Wikidata)
sameAs: https://data.bnf.fr/ark:/12148/cb358500385#about (BnF)
The Domain Model: Record-Level Provenance
Each record coming from a provenance contributes to building/enriching one or more Share-VDE clusters.
Therefore, a Share-VDE cluster can be seen as a prism where each face represents the data coming from a given provenance.
Each Share-VDE cluster maintains a link to the records it originated from.
The Idea In A Nutshell
[Diagram: input rows keyed by entity id (1,…; 2,…) go through Split, the Map phase emits key-value pairs (1 => …, 2 => …), the Reduce phase groups all pairs with the same key together, and the result is sent to the Cloud]
The RDBMS data organization
The entity table is expected to contain around 750M rows.
The attribute table is expected to contain around 750M * 50 rows (roughly 37.5 billion).
The relationship table is expected to contain around 750M * 37 rows (roughly 28 billion).
The link table is expected to contain around 750M * 20 rows (roughly 15 billion).
How??
(Failed) Approaches
Spring Batch: although the Spring philosophy usually makes things easy, the huge and complex context ended up producing a complicated overall architecture.
Simplistic multi-threaded app: mainly used as a prototype; it did not scale.
Apache Spark: problems in retrieving data from the RDBMS.
The Recipe
Ingredient #1: PostgreSQL COPY Command
The PostgreSQL COPY command moves data between tables and files. It is a built-in tool, and it offers
a very high export throughput, especially if the wrapped SELECT command is kept as simple as
possible (e.g., no ORDER BY, GROUP BY, or DISTINCT clauses).
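To make that concrete, here is a minimal sketch of driving such an export from Java through the JDBC driver's CopyManager; the connection URL, credentials, and the attribute table layout are illustrative assumptions, not the actual Share-VDE schema.

```java
import java.io.FileWriter;
import java.io.Writer;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class AttributesExport {

    public static void main(String[] args) throws Exception {
        try (Connection connection = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/svde", "svde", "secret");
             Writer out = new FileWriter("attributes.dump")) {
            CopyManager copyManager = new CopyManager((BaseConnection) connection);
            // The wrapped SELECT is deliberately plain: no ORDER BY, GROUP BY,
            // or DISTINCT, so PostgreSQL can stream rows at full speed.
            copyManager.copyOut(
                "COPY (SELECT entity_id, name, value FROM attribute) " +
                "TO STDOUT WITH (FORMAT csv)",
                out);
        }
    }
}
```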
So far, so good. We don’t need any grouping or distinct result set, but what about the order? To build a
document, everything (properties, relationships, links) belonging to a given entity should be together in
the exported file; otherwise, it is impossible to know when the entity definition ends without scrolling
the entire set, which we want to avoid.
The result is three export files: attributes.dump, relationships.dump, links.dump.
Ingredient #1: PostgreSQL COPY Command Output
Note the order: a given entity (e.g., 1) appears in multiple files, and even within the same file its rows are not consecutive.
Ingredient #2: Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
The input of our MapReduce job is:
● a small number of huge text files
● structured CSV files
● rows within files that are not sorted
You don’t need to install a Hadoop cluster: Amazon offers a powerful service called Elastic MapReduce (EMR) that allows you to configure and start ephemeral clusters with a few clicks.
Ingredient #2: Hadoop MapReduce, the Map Phase
The purpose of the Mapper is to transform the content of the original input files into a list of (K, V) pairs.
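A minimal sketch of such a Mapper, under the assumption that input rows are CSV lines whose first column is the entity identifier: using that identifier as the key guarantees that all rows of the same entity will meet at the same reducer.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EntityMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text entityId = new Text();

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        // Emit <entityId, row>: the first CSV column is the entity identifier.
        String line = row.toString();
        int firstComma = line.indexOf(',');
        if (firstComma > 0) {
            entityId.set(line.substring(0, firstComma));
            context.write(entityId, row);
        }
    }
}
```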
Ingredient #2: Hadoop MapReduce, the Shuffle & Sort Phase
The subsequent phase is called shuffle: data is transferred from mappers to reducers. Before that happens, all intermediate key-value pairs generated by the mappers are sorted and grouped by key.
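A companion Reducer sketch under the same assumptions: thanks to shuffle & sort, it receives all rows of a given entity together and can assemble them into a single per-entity value (in the real pipeline, this is where the Solr document would be built).

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class EntityReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text entityId, Iterable<Text> rows, Context context)
            throws IOException, InterruptedException {
        // All attributes, relationships, and links of the entity arrive here,
        // regardless of which dump file they came from.
        StringBuilder entity = new StringBuilder();
        for (Text row : rows) {
            entity.append(row.toString()).append('|');
        }
        context.write(entityId, new Text(entity.toString()));
    }
}
```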
Ingredient #2: Overview
Ingredient #3: Apache Solr
The last puzzle piece is Apache Solr, the popular, open-source enterprise search platform built on
Apache Lucene.
Ingredient #3: Apache Solr Challenges
The main infrastructural challenge of the Solr cluster is related to data: the initial size and the expected growth rate. Solr requires creating the distributed index with a predefined number of slices (shards), so at any given moment, with a given amount of data and a given growth rate, we should always make sure that:
● the cluster is well-balanced: each shard should manage a reasonable amount of data, where "reasonable" refers to the resources owned by the hosting node
● hardware resources (e.g., RAM, CPU, disk) are appropriately allocated according to indexing and query throughput requirements
In Share-VDE, data is expected to grow on a large scale: we started with a few million entities (15-20 million), and we know the production system will hold roughly two billion documents. Estimating in advance a cluster that should hold such an amount of data is quite complex; under- or over-estimation is a high risk that could have significant cost consequences.
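For reference, a hedged SolrJ sketch of the step this estimation drives: the collection has to be created with its shard count decided up front (the collection name, config set, ZooKeeper address, and the 16/2 numbers are purely illustrative).

```java
import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {

    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                 Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            // 16 shards with 2 replicas each: purely illustrative numbers,
            // to be replaced by whatever the estimation exercise suggests.
            CollectionAdminRequest.createCollection("entities", "_default", 16, 2)
                                  .process(client);
        }
    }
}
```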
Ingredient #3: Apache Solr Challenges
We have chosen an incremental approach:
● plan the cumulative updates in terms of incoming data from institutions
● start with X as the initial data size and Y as the expected growth rate, until the cluster reaches Z
● estimate and create a Solr cluster for holding the X-Z scenario
● when the cluster reaches Z, it becomes the new X (the new initial size), the loop restarts at a higher level, and we reindex everything into the new, bigger cluster
We found three significant benefits in this approach:
● each cluster estimation is tightly tied to the data it holds
● estimation iterations are very short
● we can understand, accurately estimate, and adapt the upper limit of each cluster instance as the data size increases
Stay Out of Trouble
Lesson Learned #1: Amazon EMR and Java 8
Java 8 is the supported Java Virtual Machine (JVM) for cluster instances created using Amazon EMR release version 5.0.0 or later.
Although Java 8 is a bit old by now, that is not an issue in 99% of cases. You just need to be aware of it in advance, to avoid (as happened to us) a considerable refactoring because your code uses some cool features introduced in later Java versions.
Lesson Learned #2: Documents Batching
The atomic indexing unit in Apache Solr is called a Document (actually a SolrInputDocument instance). Regardless of the way documents are received, once they arrive in Solr they are indexed one by one. Clients can improve the overall indexing throughput by batching documents instead of sending them individually as soon as they are created.
There are several indexing client classes available in SolrJ, the official Solr Java client, but apart from ConcurrentUpdateSolrClient, which transparently accumulates documents and creates batches, the other SolrClient subclasses require the caller to take care of that batching work.
CloudSolrClient, the client we use for targeting a SolrCloud cluster, is no exception to that rule; in a massive update scenario, it is always better to send batches of documents.
CloudSolrClient divides your batch into smaller sub-groups, each one targeting the corresponding shard leader. This is important when defining the batch size, which should take the cluster size into account.
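A minimal sketch of that batching logic with SolrJ; the collection name, batch size, and ZooKeeper address are illustrative, and a real indexer would also deal with commits and error handling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {

    private static final int BATCH_SIZE = 500; // tune against the cluster/shard layout

    public static void index(Iterable<SolrInputDocument> documents) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                 Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            for (SolrInputDocument document : documents) {
                batch.add(document);
                if (batch.size() == BATCH_SIZE) {
                    client.add("entities", batch); // one round-trip per batch
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add("entities", batch); // leftover documents
            }
        }
    }
}
```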
Lesson Learned #3: Isolate Indexing Failures (1 / 3)
Documents are always indexed one by one, regardless of whether they are sent in batches or not.
When batches of documents are sent to Apache Solr for indexing, by default the process stops at the first failure, skipping the remaining documents. This behavior is counterintuitive because the operation result is unpredictable from a client’s perspective.
In an ideal world there would be zero indexing failures, but in reality errors happen for many reasons that are not always under our direct control.
Lesson Learned #3: Isolate Indexing Failures (2 / 3)
For example, suppose we send the document list d1, d2, …, dn. The actual subset of documents that gets indexed changes depending on where the first failure occurs: if it happens at d1, no document will be indexed; if it happens at d2, only d1 will be indexed, and so on.
That could be fine if it matches the expected behavior. In our case, what we needed was a more "lenient" behavior where:
● all failed documents are logged (not just the first one)
● the batch is processed entirely, without stopping at the first failure
Lesson Learned #3: Isolate Indexing Failures (3 / 3)
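The sketch below reproduces the idea rather than our actual code: if a batch fails, fall back to indexing its documents one by one, so every failure is logged and the rest of the batch is still processed.

```java
import java.util.Collection;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LenientIndexer {

    private static final Logger LOGGER = LoggerFactory.getLogger(LenientIndexer.class);

    public static void lenientAdd(SolrClient client, String collection,
                                  Collection<SolrInputDocument> batch) {
        try {
            client.add(collection, batch);
        } catch (Exception batchFailure) {
            // The batch stopped at the first bad document: retry one by one so
            // the good documents get indexed and every failure gets logged.
            for (SolrInputDocument document : batch) {
                try {
                    client.add(collection, document);
                } catch (Exception documentFailure) {
                    LOGGER.error("Unable to index document {}",
                                 document.getFieldValue("id"), documentFailure);
                }
            }
        }
    }
}
```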
Lesson Learned #4: Avoid Sending Unnecessary Data
Although it may sound trivial, tuning this specific aspect can make a relevant difference. Data can be skipped in three ways:
● Database: using a flag or a similar mechanism that excludes a given row from being retrieved
● Indexing Client: the Hadoop Reducer can remove fields according to a given logic (see the sketch below)
● Apache Solr: IgnoreFieldUpdateProcessor, or a dynamic field with a special "ignored" datatype
Lesson Learned #5: Dry Run
We found it extremely useful to have a simple boolean parameter for excluding Solr entirely from the pipeline. That allows us to isolate the Hadoop pipeline and check whether it contains any bottlenecks.
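A sketch of how such a switch could look on the Hadoop side; the property name is invented, and the wiring into the reducer is omitted.

```java
import java.util.Collection;
import org.apache.hadoop.conf.Configuration;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DryRunAwareWriter {

    public static void write(Configuration configuration, SolrClient client,
                             String collection, Collection<SolrInputDocument> batch)
            throws Exception {
        // "indexer.dry-run" is an invented property name: when true, Solr is
        // skipped entirely and only the Hadoop side of the pipeline runs.
        if (!configuration.getBoolean("indexer.dry-run", false)) {
            client.add(collection, batch);
        }
    }
}
```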
Lesson Learned #6: Hadoop Configuration
Surprisingly, we didn’t change the Hadoop configuration much; that probably means the default values perform quite well, at least in the benchmarks we have run so far. The settings we changed are the following:
● dfs.blocksize: as data grows (3 GB, 30 GB, 300 GB), we found some minor benefit in increasing the block size from 128 to 512 MB
● mapreduce.reduce.merge.inmem.threshold (-1) and mapreduce.task.io.sort.mb (300): these changes reduced the total spilled records, initially reported as too high (three times the map output records)
As said, even using the default values without any change, the results were not much different from what you will see in the next section.
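Expressed as job configuration, those tweaks could look like the sketch below (values as above; setting them in hdfs-site.xml/mapred-site.xml works equally well).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobConfigurationExample {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setLong("dfs.blocksize", 512L * 1024 * 1024); // 512 MB, up from 128 MB
        configuration.setInt("mapreduce.reduce.merge.inmem.threshold", -1);
        configuration.setInt("mapreduce.task.io.sort.mb", 300);
        Job job = Job.getInstance(configuration, "large-scale-indexing");
        // ... mapper, reducer, and input/output setup would follow here
    }
}
```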
Links
Useful Links
Forum: the place where the community discusses topics and suggests ideas.
The Share-VDE Project: the project website.
Case Study: the blog post that originated this talk.
info@spaziocodice.com
+39 0761 1916790
https://spaziocodice.com
