Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
1. Solr on HDFS - Past, Present, and Future
Mark Miller, Cloudera
2. About Me
Lucene Committer, Solr Committer.
Works for Cloudera.
A lot of work on Lucene, Solr, and SolrCloud.
3. Some Basics
Solr: a distributed, fault-tolerant search engine using Lucene as its core search library.
HDFS: a distributed, fault-tolerant filesystem that is part of the Hadoop project.
4. Solr on HDFS
Wouldn't it be nice if Solr could run on HDFS?
If you are running other things on HDFS, it simplifies operations.
If you are building indexes with MapReduce, merging them into your cluster becomes easy.
You can do some other cool things when you are using a shared filesystem.
Most attempts in the past have not really caught on.
5. Solr on HDFS in the Past
• Apache Blur is one of the more successful marriages of Lucene and HDFS.
• We borrowed some code from them to seed Solr on HDFS.
• Others have copied indexes between the local filesystem and HDFS.
• Most people felt that running Lucene or Solr straight on HDFS would be too slow.
6. How HDFS Writes Data
[diagram: one local datanode and several remote datanodes receiving a Solr write]
On a write, an attempt is made to make a local copy and as many remote copies as necessary to satisfy the replication factor configuration.
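The placement behavior described above can be sketched as follows. This is an illustrative Python model, not HDFS code: the function name and node labels are invented for the example, and real HDFS placement also considers racks, which this sketch ignores.

```python
import random

def place_replicas(local_node, remote_nodes, replication_factor):
    """Sketch of the write path described above: the writer's local
    datanode gets a copy first, then remote datanodes are chosen
    until the replication factor is satisfied."""
    placement = [local_node]
    needed = replication_factor - 1
    placement += random.sample(remote_nodes, k=min(needed, len(remote_nodes)))
    return placement

nodes = place_replicas("dn-local", ["dn-a", "dn-b", "dn-c", "dn-d"], 3)
assert nodes[0] == "dn-local" and len(nodes) == 3
```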
7. Co-Located Solr and HDFS Data Nodes
[diagram: a Solr node running alongside each HDFS datanode]
We recommend that HDFS datanodes and Solr nodes are co-located so that the default case involves fast, local data.
8. Non-Local Data
• The BlockCache is the first line of defense, but it's good to get local data again.
• Optimize is a more painful option.
• An HDFS affinity feature could be useful.
• A tool that simply wrote out a copy of the index with no merging might be interesting.
9. HdfsDirectory
• Fairly simple and straightforward implementation.
• Full support required making the Directory interface a first-class citizen in Solr.
• The largest part was making replication work with non-local filesystem directories.
• With large enough 'buffer' sizes, it works reasonably well as long as the data is local.
• Really needs some kind of cache to be reasonable, though.
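For reference, a minimal sketch of enabling this directory implementation through solrconfig.xml. The parameter names follow the solr.hdfs.* convention of Solr's HdfsDirectoryFactory; the NameNode URI, Hadoop conf path, and slab count shown here are placeholder values, not recommendations.

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <bool name="solr.hdfs.blockcache.global">true</bool>
</directoryFactory>
```

The blockcache.* settings here tie into the Block Cache discussed on the following slides.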
10. "The Block Cache"
A replacement for the OS filesystem cache, especially for the case when there is no local data.
Even with local data, making it larger will beneficially reduce HDFS traffic in many cases.
[diagram: Solr reading from HDFS through the Block Cache]
11. Inside the Block Cache
• ConcurrentLinkedHashMap<BlockCacheKey, BlockCacheLocation>
• ByteBuffer[] banks
• int numberOfBlocksPerBank
Blocks are 'blockSize' bytes each, stored in the ByteBuffer banks.
Used locations are tracked by a 'lock' bitset.
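The structure above can be modeled in a few lines. This is a toy Python sketch for illustration, not the Solr implementation: a plain access-ordered dict stands in for the ConcurrentLinkedHashMap, bytearrays stand in for the off-heap ByteBuffer banks, and a boolean list plays the role of the 'lock' bitset.

```python
from collections import OrderedDict

class BlockCacheSketch:
    """Toy model of the structure described above: fixed-size banks
    carved into blocks, a map from cache key to (bank, block)
    location, and a bitset tracking which blocks are in use."""

    def __init__(self, number_of_banks=2, blocks_per_bank=4, block_size=8):
        self.block_size = block_size
        # 'banks' stands in for the ByteBuffer[] of slabs.
        self.banks = [bytearray(blocks_per_bank * block_size)
                      for _ in range(number_of_banks)]
        # One 'used' bit per block in each bank (the 'lock' bitset).
        self.used = [[False] * blocks_per_bank for _ in range(number_of_banks)]
        # Access-ordered map plays the role of ConcurrentLinkedHashMap.
        self.locations = OrderedDict()

    def _find_free(self):
        for b, bits in enumerate(self.used):
            for i, taken in enumerate(bits):
                if not taken:
                    return b, i
        return None

    def store(self, key, data):
        assert len(data) <= self.block_size
        slot = self._find_free()
        if slot is None:
            # Evict the least-recently-used entry to free a block.
            _, (b, i) = self.locations.popitem(last=False)
            self.used[b][i] = False
            slot = (b, i)
        b, i = slot
        self.used[b][i] = True
        off = i * self.block_size
        self.banks[b][off:off + len(data)] = data
        self.locations[key] = (b, i)

    def fetch(self, key):
        if key not in self.locations:
            return None
        b, i = self.locations[key]
        self.locations.move_to_end(key)  # refresh LRU position
        off = i * self.block_size
        return bytes(self.banks[b][off:off + self.block_size])

cache = BlockCacheSketch()
cache.store(("file1", 0), b"hello000")
assert cache.fetch(("file1", 0)) == b"hello000"
```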
12. The Global Block Cache
The initial Block Cache implementation used a separate Block Cache for every unique index directory used by Solr in HDFS. There are many limitations to this strategy: it hinders capacity planning, it's not very efficient, and it bites you at the worst times.
The Global Block Cache is a single Block Cache shared by all SolrCores across every directory. This makes sizing very simple: determine how much RAM you can spare for the Block Cache, size it that way once, and forget it.
13. Performance
In many average cases, performance looks really good: very comparable to local filesystem performance, though usually somewhat slower.
In other cases, adjusting various settings for the Block Cache can help with performance.
We have recently found some changes that improve performance.
14. Tuning the Block Cache
Sizing: by default, each 'slab' is 128 MB. Raise the slab count to grow the cache in 128 MB increments.
Block Size (8 KB default): not originally configurable, but certain use cases appear to work better with 4 KB.
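The sizing arithmetic above is easy to check. A small sketch, using only the numbers stated on this slide (128 MB slabs, 8 KB default blocks); the function names are invented for illustration.

```python
def block_cache_size_mb(slab_count, slab_mb=128):
    """Total Block Cache size: the cache grows in whole slabs."""
    return slab_count * slab_mb

def blocks_per_slab(slab_mb=128, block_kb=8):
    """How many blocks fit in one slab at a given block size."""
    return slab_mb * 1024 // block_kb

assert block_cache_size_mb(16) == 2048       # 16 slabs -> 2 GB of cache
assert blocks_per_slab() == 16384            # 128 MB / 8 KB
assert blocks_per_slab(block_kb=4) == 32768  # halving the block size doubles the block count
```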
15. HDFS Transaction Log
We also moved the transaction log to HDFS.
The implementation has held up okay; some improvements are needed, and a large replay performance issue has been improved. The HdfsDirectory and Block Cache have had a much larger impact.
There is no truncate support in HDFS, so we work around it by replaying the whole log in some failed-recovery cases where the local filesystem implementation just drops the log.
16. The autoAddReplicas Feature
A new feature that is currently only available when using a shared filesystem like HDFS.
When a node goes down, the Overseer monitors the cluster state and fires off a SolrCore create command pointing to the existing data in HDFS.
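As a sketch, the feature is requested at collection-creation time through the Collections API's autoAddReplicas parameter; the host, collection name, and shard/replica counts below are placeholders for illustration.

```
http://localhost:8983/solr/admin/collections?action=CREATE
    &name=my_hdfs_collection
    &numShards=2
    &replicationFactor=1
    &autoAddReplicas=true
```

With a shared filesystem underneath, the re-created SolrCore can point at the index data the failed node left behind in HDFS rather than re-replicating it.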
17. The autoAddReplicas Feature, Part 2
[diagram: one failed HDFS/Solr node crossed out; its replica is re-created on a surviving node from the data in HDFS]
18. The Future
At Cloudera, we are building an Enterprise Data Hub.
In our vision, the more that runs on HDFS, the better.
We will continue to improve and push forward HDFS support in SolrCloud.
19. Block Cache Improvements
Apache Blur has a Block Cache V2:
• Uses variable-sized blocks.
• Optionally uses Unsafe for direct memory management.
The V1 Block Cache has some performance limitations:
• Copying bytes from off-heap memory to the IndexInput buffer.
• Concurrent access of the cache.
• Sequential reads have to pull a lot of blocks from the cache.
• Each DirectByteBuffer has some overhead, including a Cleaner object that can affect GC and add to RAM requirements.
20. HDFS-Only Replication When Using Replicas
Currently, if you want to use SolrCloud replicas, data is replicated both by HDFS and by Solr.
An HDFS replication factor of 1 is not a very good solution.
autoAddReplicas is one possible solution.
We will be working on another solution where only the leader writes to an index in HDFS while replicas read from it.
21. The End
Mark Miller
@heismark
