Large Scale Indexing
The Share-VDE Project
We believe that writing software should be fun and easy
Andrea Gazzarini
Website: https://spaziocodice.com
Blog: https://spaziocodice.com/thoughts-on-coding
Contacts: https://spaziocodice.com/contact-us/
"Software Programming is the act of writing computer code that enables computer
software to function. People who program software are called Computer
Programmers"
We are Computer Programmers, definitely
The Share-VDE Project
Domain Model
The Domain Model: Tenants and Provenances
[Diagram: a Tenant containing provenances P1, P2, P3, …, Pn]
Share-VDE manages a Knowledge Base which consists of clustered, integrated, and enriched entities.
In Share-VDE, a Tenant is a set of institutions contributing to the same Knowledge Base.
An institution within a tenant is called a provenance. We use that term because we always want to retain the relationship between Share-VDE entities and the source data that originally contributed to building them.
The Domain Model: The Share Family
[Diagram: tenants T1, T2, T3, T4 connected through a central Registry]
Multiple tenants form the Share Family. Family members interoperate through a centralized registry.
The Domain Model: The Knowledge Base
[Diagram: provenances P1, P2, P3, …, Pn each contributing data that flows into the Knowledge Base]
The Domain Model: The Prism
[Cluster example: the same entity seen as a prism, one face per provenance]
P1: title: Alice in wonderland; titleAlternative: Alice’s adventures under ground; author: https://svde.org/people/201
P2: title: Alice in wonderland; titleAlternative: Alice’s adventures under ground; author: https://svde.org/people/201
P3: title: Alice in wonderland
P4: title: Alice’s adventures under ground; titleAlternative: Journeys in Wonderland; author: https://svde.org/people/201
External links:
sameAs: http://dbpedia.org/resource/Alice's_Adventures_in_Wonderland (DBpedia)
sameAs: https://www.wikidata.org/wiki/Q189875 (Wikidata)
sameAs: https://data.bnf.fr/ark:/12148/cb358500385#about (BnF)
The Domain Model: Attributes, Relationships, Links
Attributes (Name / Value / Provenance):
title: Alice in wonderland (P1, P2, P3)
titleAlternative: Alice’s adventures under ground (P1, P2)
titleAlternative: Journeys in Wonderland (P4)

Relationships (Name / Provenance):
author: (P1, P2, P4)

Links (Name / Value / Provenance):
sameAs: http://dbpedia.org/resource/Alice's_Adventures_in_Wonderland (DBpedia)
sameAs: https://www.wikidata.org/wiki/Q189875 (Wikidata)
sameAs: https://data.bnf.fr/ark:/12148/cb358500385#about (BnF)
The Domain Model: Record-Level Provenance
Each record coming from a provenance contributes to building/enriching one or more Share-VDE clusters.
Therefore, a Share-VDE cluster can be seen as a prism where each face represents the data coming from a given provenance.
Each Share-VDE cluster maintains a link to the records it originated from.
The Idea In A Nutshell
[Diagram: input rows keyed by entity id (1,…; 2,…) go through Split, the Map phase emits key-value pairs (1 => …, 2 => …), the Reduce phase groups all pairs with the same key together, and the result is sent to the Cloud]
The RDBMS data organization
The entity table is expected to contain around 750M rows.
The attribute table is expected to contain around 750M * 50 rows (roughly 37.5 billion).
The relationship table is expected to contain around 750M * 37 rows (roughly 28 billion).
The link table is expected to contain around 750M * 20 rows (roughly 15 billion).
How??
(Failed) Approaches
Spring Batch: although the Spring philosophy usually makes things easy, the huge and complex context ended up producing a complicated overall architecture.
Simplistic multi-threaded app: mainly used as a prototype; it did not scale.
Apache Spark: problems in retrieving data from the RDBMS.
The Recipe
Ingredient #1: PostgreSQL COPY Command
The PostgreSQL COPY command moves data between tables and files. It is a built-in tool, and it offers
a very high export throughput, especially if the wrapped SELECT command is kept as simple as
possible (e.g., no ORDER BY, GROUP BY, or DISTINCT clauses).
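To make that concrete, here is a minimal sketch of driving such an export from Java through the JDBC driver's CopyManager; the connection URL, credentials, and the attribute table layout are illustrative assumptions, not the actual Share-VDE schema.

```java
import java.io.FileWriter;
import java.io.Writer;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class AttributesExport {

    public static void main(String[] args) throws Exception {
        try (Connection connection = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/svde", "svde", "secret");
             Writer out = new FileWriter("attributes.dump")) {
            CopyManager copyManager = new CopyManager((BaseConnection) connection);
            // The wrapped SELECT is deliberately plain: no ORDER BY, GROUP BY,
            // or DISTINCT, so PostgreSQL can stream rows at full speed.
            copyManager.copyOut(
                "COPY (SELECT entity_id, name, value FROM attribute) " +
                "TO STDOUT WITH (FORMAT csv)",
                out);
        }
    }
}
```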
So far, so good. We don’t need any grouping or distinct result set, but what about the order? To build a
document, everything (properties, relationships, links) belonging to a given entity should be together in
the exported file; otherwise, it is impossible to know when the entity definition ends without scrolling
the entire set, which we want to avoid.
The result is three export files: attributes.dump, relationships.dump, links.dump.
Ingredient #1: PostgreSQL COPY Command Output
Note the order: a given entity (e.g., 1) appears in multiple files, and even within the same file its rows are not consecutive.
Ingredient #2: Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
The input of our MapReduce job is:
● a small number of huge text files
● structured CSV files
● rows within files that are not sorted
You don’t need to install a Hadoop cluster: Amazon offers a powerful service called Elastic MapReduce (EMR) that allows you to configure and start ephemeral clusters with a few clicks.
Ingredient #2: Hadoop MapReduce, the Map Phase
The purpose of the Mapper is to transform the content of the original input files into a list of (K, V) pairs.
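A minimal sketch of such a Mapper, under the assumption that input rows are CSV lines whose first column is the entity identifier: using that identifier as the key guarantees that all rows of the same entity will meet at the same reducer.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EntityMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text entityId = new Text();

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        // Emit <entityId, row>: the first CSV column is the entity identifier.
        String line = row.toString();
        int firstComma = line.indexOf(',');
        if (firstComma > 0) {
            entityId.set(line.substring(0, firstComma));
            context.write(entityId, row);
        }
    }
}
```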
Ingredient #2: Hadoop MapReduce, the Shuffle & Sort Phase
The subsequent phase is called shuffle: data is transferred from mappers to reducers. Before that happens, all intermediate key-value pairs generated by the mappers are sorted and grouped by key.
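A companion Reducer sketch under the same assumptions: thanks to shuffle & sort, it receives all rows of a given entity together and can assemble them into a single per-entity value (in the real pipeline, this is where the Solr document would be built).

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class EntityReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text entityId, Iterable<Text> rows, Context context)
            throws IOException, InterruptedException {
        // All attributes, relationships, and links of the entity arrive here,
        // regardless of which dump file they came from.
        StringBuilder entity = new StringBuilder();
        for (Text row : rows) {
            entity.append(row.toString()).append('|');
        }
        context.write(entityId, new Text(entity.toString()));
    }
}
```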
Ingredient #2: Overview
Ingredient #3: Apache Solr
The last puzzle piece is Apache Solr, the popular, open-source enterprise search platform built on
Apache Lucene.
Ingredient #3: Apache Solr Challenges
The main infrastructural challenge of the Solr cluster is related to data: the initial size and the expected growth rate. Solr requires creating the distributed index with a predefined number of slices (shards), so at any given moment, with a given amount of data and a given growth rate, we should always make sure that:
● the cluster is well-balanced: each shard should manage a reasonable amount of data, where "reasonable" refers to the resources owned by the hosting node
● hardware resources (e.g., RAM, CPU, disk) are appropriately allocated according to indexing and query throughput requirements
In Share-VDE, data is expected to grow on a large scale: we started with a few million entities (15-20 million), and we know the production system will hold roughly two billion documents. Estimating in advance a cluster that should hold such an amount of data is quite complex; under- or over-estimation is a high risk that could have significant cost consequences.
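For reference, a hedged SolrJ sketch of the step this estimation drives: the collection has to be created with its shard count decided up front (the collection name, config set, ZooKeeper address, and the 16/2 numbers are purely illustrative).

```java
import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {

    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                 Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            // 16 shards with 2 replicas each: purely illustrative numbers,
            // to be replaced by whatever the estimation exercise suggests.
            CollectionAdminRequest.createCollection("entities", "_default", 16, 2)
                                  .process(client);
        }
    }
}
```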
Ingredient #3: Apache Solr Challenges
We have chosen an incremental approach:
● plan the cumulative updates in terms of incoming data from institutions
● start with X as the initial data size and Y as the expected growth rate, until the cluster reaches Z
● estimate and create a Solr cluster for holding the X-Z scenario
● when the cluster reaches Z, it becomes the new X (the new initial size), the loop restarts at a higher level, and we reindex everything into the new, bigger cluster
We found three significant benefits in this approach:
● each cluster estimation is tightly tied to the data it holds
● estimation iterations are very short
● we can understand, accurately estimate, and adapt the upper limit of each cluster instance as the data size increases
Stay Out of Trouble
Lesson Learned #1: Amazon EMR and Java 8
Java 8 is the supported Java Virtual Machine (JVM) for cluster instances created using Amazon EMR release version 5.0.0 or later.
Although Java 8 is a bit old by now, that is not an issue in 99% of cases. You just need to be aware of it in advance, to avoid (as happened to us) a considerable refactoring because your code uses some cool features introduced in later Java versions.
Lesson Learned #2: Documents Batching
The atomic indexing unit in Apache Solr is called a Document (actually a SolrInputDocument instance). Regardless of the way documents are received, once they arrive in Solr they are indexed one by one. Clients can improve the overall indexing throughput by batching documents instead of sending them individually as soon as they are created.
There are several indexing client classes available in SolrJ, the official Solr Java client, but apart from ConcurrentUpdateSolrClient, which transparently accumulates documents and creates batches, the other SolrClient subclasses require the caller to take care of that batching work.
CloudSolrClient, the client we use for targeting a SolrCloud cluster, is no exception to that rule; in a massive update scenario, it is always better to send batches of documents.
CloudSolrClient divides your batch into smaller sub-groups, each one targeting the corresponding shard leader. This is important when defining the batch size, which should take the cluster size into account.
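A minimal sketch of that batching logic with SolrJ; the collection name, batch size, and ZooKeeper address are illustrative, and a real indexer would also deal with commits and error handling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {

    private static final int BATCH_SIZE = 500; // tune against the cluster/shard layout

    public static void index(Iterable<SolrInputDocument> documents) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                 Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
            for (SolrInputDocument document : documents) {
                batch.add(document);
                if (batch.size() == BATCH_SIZE) {
                    client.add("entities", batch); // one round-trip per batch
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add("entities", batch); // leftover documents
            }
        }
    }
}
```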
Lesson Learned #3: Isolate Indexing Failures (1 / 3)
Documents are always indexed one by one, regardless of whether they are sent in batches or not.
When batches of documents are sent to Apache Solr for indexing, by default the process stops at the first failure, skipping the remaining documents. This behavior is counterintuitive because the operation result is unpredictable from a client’s perspective.
In an ideal world there would be zero indexing failures, but in reality errors happen for many reasons that are not always under our direct control.
Lesson Learned #3: Isolate Indexing Failures (2 / 3)
For example, suppose we send the document list d1, d2, …, dn. The actual subset of documents that gets indexed changes depending on where the first failure occurs: if it happens at d1, no document will be indexed; if it happens at d2, only d1 will be indexed, and so on.
That could be fine if it matches the expected behavior. In our case, what we needed was a more "lenient" behavior where:
● all failed documents are logged (not just the first one)
● the batch is processed entirely, without stopping at the first failure
Lesson Learned #3: Isolate Indexing Failures (3 / 3)
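The sketch below reproduces the idea rather than our actual code: if a batch fails, fall back to indexing its documents one by one, so every failure is logged and the rest of the batch is still processed.

```java
import java.util.Collection;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LenientIndexer {

    private static final Logger LOGGER = LoggerFactory.getLogger(LenientIndexer.class);

    public static void lenientAdd(SolrClient client, String collection,
                                  Collection<SolrInputDocument> batch) {
        try {
            client.add(collection, batch);
        } catch (Exception batchFailure) {
            // The batch stopped at the first bad document: retry one by one so
            // the good documents get indexed and every failure gets logged.
            for (SolrInputDocument document : batch) {
                try {
                    client.add(collection, document);
                } catch (Exception documentFailure) {
                    LOGGER.error("Unable to index document {}",
                                 document.getFieldValue("id"), documentFailure);
                }
            }
        }
    }
}
```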
Lesson Learned #4: Avoid Sending Unnecessary Data
Although it may sound trivial, tuning this specific aspect can make a relevant difference. Data can be skipped in three ways:
● Database: using a flag or a similar mechanism that excludes a given row from being retrieved
● Indexing Client: the Hadoop Reducer can remove fields according to a given logic (see the sketch below)
● Apache Solr: IgnoreFieldUpdateProcessor, or a dynamic field with a special "ignored" datatype
Lesson Learned #5: Dry Run
We found it extremely useful to have a simple boolean parameter for excluding Solr entirely from the pipeline. That allows us to isolate the Hadoop pipeline and check whether it contains any bottlenecks.
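A sketch of how such a switch could look on the Hadoop side; the property name is invented, and the wiring into the reducer is omitted.

```java
import java.util.Collection;
import org.apache.hadoop.conf.Configuration;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DryRunAwareWriter {

    public static void write(Configuration configuration, SolrClient client,
                             String collection, Collection<SolrInputDocument> batch)
            throws Exception {
        // "indexer.dry-run" is an invented property name: when true, Solr is
        // skipped entirely and only the Hadoop side of the pipeline runs.
        if (!configuration.getBoolean("indexer.dry-run", false)) {
            client.add(collection, batch);
        }
    }
}
```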
Lesson Learned #6: Hadoop Configuration
Surprisingly, we didn’t change the Hadoop configuration much; that probably means the default values perform quite well, at least in the benchmarks we have run so far. The settings we changed are the following:
● dfs.blocksize: as data grows (3 GB, 30 GB, 300 GB), we found some minor benefit in increasing the block size from 128 to 512 MB
● mapreduce.reduce.merge.inmem.threshold (-1) and mapreduce.task.io.sort.mb (300): these changes reduced the total spilled records, initially reported as too high (three times the map output records)
As said, even using the default values without any change, the results were not much different from what you will see in the next section.
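Expressed as job configuration, those tweaks could look like the sketch below (values as above; setting them in hdfs-site.xml/mapred-site.xml works equally well).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobConfigurationExample {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setLong("dfs.blocksize", 512L * 1024 * 1024); // 512 MB, up from 128 MB
        configuration.setInt("mapreduce.reduce.merge.inmem.threshold", -1);
        configuration.setInt("mapreduce.task.io.sort.mb", 300);
        Job job = Job.getInstance(configuration, "large-scale-indexing");
        // ... mapper, reducer, and input/output setup would follow here
    }
}
```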
Links
Useful Links
Forum: the place where the community discusses topics and suggests ideas.
The Share-VDE Project: the project website.
Case Study: the blog post that originated this talk.
info@spaziocodice.com
+39 0761 1916790
https://spaziocodice.com
