3. Table of Contents
Introducing zanox and me
What is OLAP?
SolrCloud with Doc-Values
HBase with End-Point Coprocessors
Response Time Comparison
Summary
Questions
4. Who am I?
Chief Search Architect at Zanox
9+ years of experience building distributed search systems based on Lucene
Lucid Certified Apache Solr/Lucene Developer
6+ years of using Hadoop and 4+ years of using HBase for data mining
Cloudera Certified Developer for Apache Hadoop and HBase
During my PhD at Technical University Berlin, I applied various machine-learning
techniques, mainly to optimise resource usage in distributed search
See my book: “Beyond Centralised Search Engines: An Agent-Based Filtering Framework”
How can you reach me?
dragan.milosevic@zanox.com
http://www.linkedin.com/in/draganmilosevic
9. Lucene Stored Fields
Lucene Stored Fields are optimized for retrieving field values for a relatively small
number of documents (the top hits presented to the user)
It is difficult to use them efficiently for reporting, because many documents
(thousands, even millions) are selected and their fields are needed
If stored fields are used for reporting, machine learning has to be applied to analyze
queries, determine the optimal order of documents and optimize the loading of hits
(see “Analytics in OLAP with Lucene and Hadoop”, Lucene Revolution, San Diego 2013)
11. Doc-Values Step-by-Step – schema.xml & Search Component class
Enable Doc-Values in schema.xml for fields that should be aggregated (with stored="false")
<field name="adrank" type="tfloat" indexed="false" stored="false" docValues="true"/>
Implement Search Component that uses Doc-Values and aggregates
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;

public class SummingComponent extends SearchComponent {
  public void prepare(ResponseBuilder rb) throws IOException {
    rb.setNeedDocSet(true); // the full doc set is needed, not only the top hits
  }
  public void process(ResponseBuilder rb) throws IOException {
    DocIterator itr = rb.getResults().docSet.iterator();
    LeafReader reader = rb.req.getSearcher().getLeafReader();
    NumericDocValues docValues = reader.getNumericDocValues("adrank");
    double sum = 0;
    while (itr.hasNext()) { // tfloat doc-values hold the int bits of the float value
      sum += Float.intBitsToFloat((int) docValues.get(itr.nextDoc()));
    }
    rb.rsp.add("sum", sum);
  }
  // getDescription() and getSource() are also required but omitted here
}
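The decode step in the component above assumes the tfloat doc-value holds the raw int bits of the float value; a minimal stdlib sketch of that round trip (field value 0.75 is made up for illustration):

```java
public class FloatBitsDemo {
    public static void main(String[] args) {
        float adrank = 0.75f;
        // Encoding as done when the doc-value is written: raw int bits of the float
        int bits = Float.floatToIntBits(adrank);
        // Decoding as done in SummingComponent
        float decoded = Float.intBitsToFloat(bits);
        System.out.println(decoded); // prints 0.75
    }
}
```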
12. Doc-Values Step-by-Step – solrconfig.xml
Build a jar with the search component, put it somewhere in contrib and add it in solrconfig.xml
<lib dir="${solr.install.dir:../../../..}/contrib/zanox/lib" regex=".*\.jar" />
Register newly created search component for summing in solrconfig.xml
<searchComponent name="summer" class="com.zanox.search.SummingComponent">
<str name="…">…</str>
</searchComponent>
Use search component in request handler
<requestHandler name="/summing" class="solr.SearchHandler">
<lst name="defaults"> … </lst> …
<arr name="components">
<str>query</str>
<str>summer</str>
</arr>
</requestHandler>
13. What is HBase?
“HBase is an open source, non-relational, distributed database modeled after
Google's BigTable and written in Java. It is developed as part of Apache Software
Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed
Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a
fault-tolerant way of storing large quantities of sparse data …”
https://en.wikipedia.org/wiki/Apache_HBase
“we could simply state that HBase is a faithful, open source implementation of Google’s
BigTable. But that would be a bit too simplistic …” HBase: The Definitive Guide, Lars
George
“HBase is a database: the Hadoop database. It’s often described as a sparse,
distributed, persistent, multidimensional sorted map, which is indexed by row-key,
column key, and timestamp. You’ll hear people refer to it as a key value store, a
column family-oriented database … “ HBase in Action, Nick Dimiduk, Amandeep
Khurana
17. Column Family Challenge
There are hundreds of columns that should be assigned to a few column families
The goal is to minimize the number of column families accessed for a requested query
Fewer column families needed means fewer files to be checked
Because of the way region splitting is implemented in HBase region servers, the
difference in size between column families should be minimized
Small column families will be spread across many regions, so many files will need to
be checked while processing queries and response time will increase
The number of irrelevant columns (columns fetched but not needed for a query) should be minimized
It is allowed to duplicate data and put a column in multiple column families to optimize access
Important columns that are needed for many queries will be duplicated
But the amount of duplication should be minimized so as not to waste space
NP-hard optimization problem
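Since the exact assignment is NP-hard, a heuristic is a natural choice in practice. Below is a sketch of one possible greedy approach (illustrative only; the column names, the query sets and the algorithm itself are assumptions, not something the deck prescribes): each query starts as its own candidate group, and the two most-overlapping groups are merged until the family budget is met, so a column appearing in several groups is simply duplicated, as the slide allows.

```java
import java.util.*;

public class ColumnFamilyGreedy {
    // Greedily merges the two most-overlapping column groups until only
    // maxFamilies groups remain; each query contributes one initial group
    // containing exactly the columns it reads. Columns shared by unmerged
    // groups end up duplicated across families.
    static List<Set<String>> assign(List<Set<String>> queryColumns, int maxFamilies) {
        List<Set<String>> groups = new ArrayList<>();
        for (Set<String> q : queryColumns) {
            groups.add(new HashSet<>(q));
        }
        while (groups.size() > maxFamilies) {
            int bestI = 0, bestJ = 1, bestOverlap = -1;
            for (int i = 0; i < groups.size(); i++) {
                for (int j = i + 1; j < groups.size(); j++) {
                    Set<String> overlap = new HashSet<>(groups.get(i));
                    overlap.retainAll(groups.get(j));
                    if (overlap.size() > bestOverlap) {
                        bestOverlap = overlap.size();
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            groups.get(bestI).addAll(groups.remove(bestJ));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical queries: the columns each report needs
        List<Set<String>> queries = Arrays.asList(
            new HashSet<>(Arrays.asList("clicks", "views", "adrank")),
            new HashSet<>(Arrays.asList("clicks", "views", "sales")),
            new HashSet<>(Arrays.asList("country", "device")));
        System.out.println(assign(queries, 2));
    }
}
```

A real solution would also weigh column sizes to keep families of similar size, as the slide demands; this sketch only minimizes the families a query touches.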
19. End-Point Coprocessors
End-Point Coprocessors provide a way to run custom code on region servers, and
aggregation can be that code
They are therefore analogous to both stored procedures and map-reduce jobs, as
they execute logic where the data resides
Without them, the client would have to retrieve all values locally before applying the
logic, and network bandwidth would probably become the bottleneck
They behave as network-friendly as possible: only the summarized results from
region servers are sent to the client
HBase provides a high-level interface for calling End-Point Coprocessors and selects
the regions to be queried automatically
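The scatter-gather idea behind End-Point Coprocessors can be illustrated without a cluster: each "region" computes its partial sum where the data lives, and the client only merges one number per region. This is a stdlib simulation of the pattern, not the actual HBase coprocessor API; the region contents are made up.

```java
import java.util.*;
import java.util.concurrent.*;

public class EndpointSimulation {
    // Each "region" computes its partial sum locally (the coprocessor role);
    // the client only merges one double per region (network-friendly)
    static double distributedSum(List<double[]> regions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        try {
            List<Future<Double>> partials = new ArrayList<>();
            for (double[] region : regions) {
                partials.add(pool.submit(() -> Arrays.stream(region).sum()));
            }
            double total = 0;
            for (Future<Double> partial : partials) {
                total += partial.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Three "regions", each holding a slice of the values to aggregate
        List<double[]> regions = Arrays.asList(
            new double[]{0.5, 1.5}, new double[]{2.0}, new double[]{0.25, 0.75});
        System.out.println(distributedSum(regions)); // prints 5.0
    }
}
```

In real HBase the per-region code is deployed as a coprocessor and invoked through the client API the slide mentions; only the merging step remains on the client.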
22. HBase Reporting Guidelines
Queries have to be known and analyzed thoroughly, because:
Row-keys should be designed so that the needed records are stored next to each
other and can therefore be accessed quickly
It is necessary to decide which column is saved in which column family so that the
number of accessed column families is minimized
Unexpected queries can be very expensive, because:
The structure of the row-keys alone decides the order in which records are stored;
consequently, the records needed by an unexpected query will not be stored next to
each other and cannot be accessed quickly
Even though this sounds like a huge disadvantage, there are typically standard
reports (queries) that are the only ones made available to end users, so unexpected
queries are not possible
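Row-key design like this can be sketched with plain byte arrays: a composite key of, say, advertiser id followed by day, both encoded fixed-width big-endian, keeps all records of one advertiser contiguous and time-ordered under HBase's lexicographic byte ordering (the fields and values here are assumptions for illustration):

```java
import java.nio.ByteBuffer;

public class RowKeyDemo {
    // Composite row-key: 4-byte advertiser id followed by 4-byte day number;
    // big-endian fixed-width encoding makes unsigned lexicographic byte order
    // match numeric order (for non-negative values)
    static byte[] rowKey(int advertiserId, int day) {
        return ByteBuffer.allocate(8).putInt(advertiserId).putInt(day).array();
    }

    // Unsigned lexicographic comparison, the order HBase keeps row-keys in
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] k1 = rowKey(42, 20130101);
        byte[] k2 = rowKey(42, 20130102);
        byte[] k3 = rowKey(43, 20130101);
        // All days of advertiser 42 sort together, before advertiser 43
        System.out.println(compare(k1, k2) < 0 && compare(k2, k3) < 0); // prints true
    }
}
```

With such a key, a report over one advertiser's date range becomes a single contiguous scan; putting the day first would instead cluster records by date and serve a different set of standard reports, which is why the queries must be known up front.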
23. Response Time Comparison – Single Column
If aggregation is performed on a single attribute (column), there is no clear winner
Therefore, use the technology you are more familiar with
24. Response Time Comparison – Multiple Columns
If aggregation is performed on many big attributes (columns), there is a clear winner:
Use HBase with row-keys and column families optimized for the expected queries
25. Summary
Tip 1: Do not use Lucene Stored Fields for reporting
Tip 2: If not convinced, watch “Analytics in OLAP with Lucene and Hadoop”
Tip 3: Activate Doc-Values and enjoy great aggregation performance
Tip 4: Plug aggregation code into HBase as an End-Point Coprocessor
Tip 5: Design row-keys based on queries so that needed records are next to each other
Tip 6: Optimize the structure of column families for queries requesting many aggregations
Tip 7: Pick any technology if aggregation is performed over a single attribute (or a few)
Tip 8: Use optimized HBase for aggregating many big attributes