Cloudera Search
Mike Drob
Who Am I?
Apache Accumulo PMC
Apache Curator PMC
Hobbyist contributor
- Various Apache Projects
- JUnit, JCommander, JLine2
Volunteer with FIRST LEGO League (FLL)
Search/Solr is my ${DayJob}
Agenda
We will cover:
- Overview of projects involved
- Architectural discussion of Solr on Hadoop
We will not cover:
- Performance, Tuning, or Optimizations
- Writing custom applications
- Tutorials (kind of)
Why Search?
Hadoop for Everyone!
Typical case:
Ingest data to storage engine (HDFS, HBase, etc.)
Process data (MR, Hive, Impala)
Experts know MapReduce
Savvy users know SQL
Everyone knows Search!
Use Case
Image Credit: Alex Moundalexis; Used With Permission
Use Case
FACETING
HIGHLIGHTING
CONTENT
Use Case
Does not contain “Hadoop” in title...
SCORING
Search on Hadoop History
•Katta – Distributed Lucene
•Blur – Lucene on Hadoop
•SolBase - Lucene + HBase @ Photobucket
•HBASE-3529 – Lucene on HBase
•SOLR-1301 – MR Indexer
•Ad-Hoc
Family Tree
...
Strengthen the Family Bonds
•No need to build something radically new - we
have the pieces we need.
•Focus on integration points.
•Create high quality, first class integrations and
contribute the work to the projects involved.
•Focus on integration and quality first - then
performance and scale.
Very fast and feature rich ‘core’ search engine
library.
Compact and powerful, Lucene is an extremely
popular full-text search library.
Provides low level APIs for analyzing, indexing, and
searching text, along with a myriad of related
features.
Just the core - either you write the ‘glue’ or use a
higher level search engine built with Lucene.
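To make "analyzing, indexing, and searching" concrete, here is a toy inverted index in Python. It is only a sketch of the idea, not Lucene's API: the `analyze` function and `TinyIndex` class are hypothetical names invented for this example.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase and split on non-alphanumerics (a stand-in for an analyzer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Indexing: record which documents contain each analyzed term.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term (AND semantics)."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add(1, "Solr on Hadoop")
index.add(2, "Lucene is a search library")
index.add(3, "Search on Hadoop with Solr")
print(sorted(index.search("hadoop solr")))
```

Everything a real search engine adds (scoring, faceting, highlighting, distribution) is the "glue" layered on top of this kind of core structure.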
Solr (pronounced "solar") is an open source
enterprise search platform from the Apache Lucene
project. Its major features include full-text search,
hit highlighting, faceted search, dynamic clustering,
database integration, and rich document (e.g.,
Word, PDF) handling. Providing distributed search
and index replication, Solr is highly scalable. Solr is
the most popular enterprise search engine.
- Wikipedia
Architecture & Terms
[Diagram labels: Host, Node (JVM), Core (Index Dir); Physical vs. Logical; Collection, Shard 1, Shard 2, Shard 3, Replicas]
SolrCloud
Solr Integration
•Read and Write directly to HDFS
•First Class Custom Directory Support in Solr
•Support Solr Replication on HDFS
•Other improvements around usability and
configuration
Putting the Index in HDFS
•Extend Lucene's Directory & DirectoryFactory to
abstract HDFS implementation
•Solr relies on the FS cache to operate at full speed,
while HDFS is not known for its random access speed.
•Apache Blur has already solved this with an
HdfsDirectory that works on top of a BlockDirectory.
•The “block cache” caches the hot blocks of the index
off heap (direct byte array) and takes the place of the
FS cache.
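The block cache idea can be sketched as an LRU cache of fixed-size blocks sitting in front of a slow reader. This is an illustration only: the class name, the `read_block` callback, and the sizes are assumptions made for the example, and the cache here lives on the Python heap, whereas the slide describes an off-heap cache.

```python
from collections import OrderedDict


class BlockCache:
    """LRU cache of fixed-size file blocks, keyed by (path, block number)."""

    def __init__(self, read_block, block_size=8192, max_blocks=1024):
        # read_block is a hypothetical slow reader: (path, block_no, size) -> bytes
        self.read_block = read_block
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()
        self.misses = 0

    def _get_block(self, path, block_no):
        key = (path, block_no)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        self.misses += 1
        data = self.read_block(path, block_no, self.block_size)
        self.blocks[key] = data
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, path, offset, length):
        """Random-access read served block by block through the cache."""
        out = bytearray()
        end = offset + length
        while offset < end:
            block_no, start = divmod(offset, self.block_size)
            block = self._get_block(path, block_no)
            chunk = block[start:start + (end - offset)]
            if not chunk:
                break  # read past the end of the file
            out += chunk
            offset += len(chunk)
        return bytes(out)


# Demo with an in-memory "file" standing in for an index file in HDFS.
FILE = bytes(range(256)) * 4


def slow_read(path, block_no, size):
    return FILE[block_no * size:(block_no + 1) * size]


cache = BlockCache(slow_read, block_size=100, max_blocks=4)
first = cache.read("index-file", 95, 10)   # spans two blocks
second = cache.read("index-file", 95, 10)  # served entirely from cache
```

The second read touches only cached blocks, which is the behavior that stands in for the FS cache on hot parts of the index.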
Putting TransactionLog in HDFS
•TransactionLog is a basic WAL
•HdfsUpdateLog added - extends UpdateLog
•Triggered by setting the UpdateLog dataDir to a path
starting with hdfs:/
•Benefits from same extensive testing as used on
UpdateLog
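For readers new to the term, a write-ahead log (WAL) records each update durably before it is applied, so state can be rebuilt by replaying the log after a crash. The sketch below shows only that general pattern against a local file; `TinyLog` and its JSON-lines format are hypothetical and are not how UpdateLog or HdfsUpdateLog are implemented.

```python
import json
import os
import tempfile


class TinyLog:
    """Minimal append-only log of add/delete operations, one JSON record per line."""

    def __init__(self, path):
        self.path = path

    def append(self, op, doc_id, doc=None):
        # Write and sync the record before the caller applies the update.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"op": op, "id": doc_id, "doc": doc}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        """Rebuild the document state by applying every logged operation in order."""
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "add":
                    state[rec["id"]] = rec["doc"]
                elif rec["op"] == "delete":
                    state.pop(rec["id"], None)
        return state


log_path = os.path.join(tempfile.mkdtemp(), "tlog.jsonl")
log = TinyLog(log_path)
log.append("add", "1", {"title": "Solr on Hadoop"})
log.append("add", "2", {"title": "Lucene"})
log.append("delete", "2")

# A fresh instance simulates recovery after a restart.
recovered = TinyLog(log_path).replay()
```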
Running Solr on HDFS
•Cloudera Manager can do all of this for you.
•Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a
location in HDFS.
•Set LockType to ‘hdfs’
•Use an UpdateLog dataDir location that begins with ‘hdfs:/’
•e.g. java -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lockType=solr.HdfsLockFactory
-Dsolr.updatelog=hdfs://host:port/path -jar start.jar
Solr Replication on HDFS
•Take advantage of “distributed filesystem” and allow
for something similar to HBase regions.
•If a node goes down, the data is still available in
HDFS - allow for that index to be automatically
served by a node that is still up if it has the capacity.
[Diagram: three Solr nodes running on top of HDFS]
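The failover idea can be sketched as a small assignment function: indexes whose node is down are handed to a live node that still has capacity. This is a simplified illustration under assumed names (`reassign`, a per-node `capacity` counted in indexes); it does not describe how Solr or Cloudera Search decide placement.

```python
def reassign(cores, live_nodes, capacity):
    """Return a new index -> node mapping.

    cores: dict of index name -> node that was serving it.
    live_nodes: nodes that are still up.
    capacity: maximum number of indexes a node may serve (hypothetical limit).
    Indexes that cannot be placed map to None.
    """
    load = {node: 0 for node in live_nodes}
    for node in cores.values():
        if node in load:
            load[node] += 1

    result = {}
    for core, node in sorted(cores.items()):
        if node in load:
            result[core] = node  # still served by its original node
            continue
        candidates = [n for n in sorted(live_nodes) if load[n] < capacity]
        if not candidates:
            result[core] = None  # data is safe in HDFS, but nobody can serve it yet
            continue
        target = min(candidates, key=lambda n: load[n])
        load[target] += 1
        result[core] = target
    return result


cores = {"shard1": "nodeA", "shard2": "nodeB", "shard3": "nodeC"}
after_failure = reassign(cores, ["nodeA", "nodeC"], capacity=2)
```

Because the index files live in HDFS rather than on the failed node's local disk, reassignment needs no data copy, only a new node opening the same directory.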
MR Index Building
•Scalable index creation via map-reduce
•Many initial ‘homegrown’ implementations sent documents from
reducer to SolrCloud over HTTP
•To really scale, you want the reducers to create the indexes in
HDFS and then load them up with Solr
•The ideal implementation will allow using as many reducers as are
available in your Hadoop cluster, and then merge the indexes down to
the correct number of ‘shards’
MR Index Building
[Diagram: data flow of the indexing job]
Mappers: Parse input into indexable documents
Arbitrary reducing steps of indexing and merging
End-Reducer (shard 1): Index documents -> Index shard 1
End-Reducer (shard 2): Index documents -> Index shard 2
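The flow above can be simulated in a few lines of plain Python: mappers parse input into documents, several reducers per shard each build a partial index, and the partial indexes are merged down to the final shard count. All names here are hypothetical, lists of documents stand in for real indexes, and modulo partitioning on a CRC32 hash is a simplification chosen for the example.

```python
import zlib


def parse(line):
    """Mapper: turn a tab-separated input line into an indexable document."""
    doc_id, title = line.split("\t", 1)
    return {"id": doc_id, "title": title}


def stable_hash(doc_id):
    return zlib.crc32(doc_id.encode("utf-8"))


def partition(doc, num_shards, reducers_per_shard):
    """Pick a reducer so that each reducer only sees documents of one shard."""
    h = stable_hash(doc["id"])
    shard = h % num_shards
    sub = (h // num_shards) % reducers_per_shard
    return shard * reducers_per_shard + sub


def build_indexes(lines, num_shards, reducers_per_shard):
    num_reducers = num_shards * reducers_per_shard
    partial = [[] for _ in range(num_reducers)]

    # Map phase, then each reducer "indexes" the documents routed to it.
    for line in lines:
        doc = parse(line)
        partial[partition(doc, num_shards, reducers_per_shard)].append(doc)

    # Merge phase: fold each shard's partial indexes into one index.
    shards = []
    for s in range(num_shards):
        merged = []
        for r in range(s * reducers_per_shard, (s + 1) * reducers_per_shard):
            merged.extend(partial[r])
        shards.append(merged)
    return shards


lines = [f"doc{i}\tTitle {i}" for i in range(20)]
shards = build_indexes(lines, num_shards=2, reducers_per_shard=3)
```

Using more reducers than shards spreads the indexing work across the cluster; the merge step is what brings the result back to the shard count the Solr cluster expects.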
SolrCloud Aware
•Can ‘inspect’ ZooKeeper to learn about Solr cluster.
•What URLs to GoLive to.
•The Schema to use when building indexes.
•Match hash -> shard assignments of a Solr cluster.
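Matching hash -> shard assignments means the indexing job must route each document to the same shard the live cluster would. A rough sketch of range-based routing follows: the hash space is split into contiguous ranges, one per shard. CRC32 is used here only as a stand-in hash, and the function names are hypothetical; a real job would take both the hash function and the ranges from the cluster state in ZooKeeper.

```python
import zlib


def shard_ranges(num_shards):
    """Split the 32-bit hash space [0, 2**32) into contiguous, near-equal ranges."""
    size = 2 ** 32
    bounds = [size * i // num_shards for i in range(num_shards + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_shards)]


def shard_for(doc_id, ranges):
    """Return the index of the shard whose hash range contains the document id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, an assumption
    for i, (low, high) in enumerate(ranges):
        if low <= h <= high:
            return i
    raise ValueError("hash outside all ranges")


ranges = shard_ranges(2)
target = shard_for("doc42", ranges)
```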
GoLive
•After building your indexes with map-reduce, how do
you deploy them to your Solr cluster?
•We want it to be easy - so we built the GoLive
option.
•GoLive allows you to easily merge the indexes you
have created atomically into a live running Solr
cluster.
•Paired with the ZooKeeper Aware ability, this allows
you to simply point your map-reduce job to your Solr
cluster and it will automatically discover how many
shards to build and what locations to deliver the final
indexes to in HDFS.
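One way to picture the final delivery step is as one index-merge request per shard. The sketch below builds request URLs for Solr's CoreAdmin MERGEINDEXES action, which takes `core` and `indexDir` parameters. Whether GoLive issues exactly these requests is an assumption of this example, and the host names, core names, and paths are made up.

```python
from urllib.parse import urlencode


def golive_requests(shard_cores, index_dirs):
    """Build one merge request URL per shard.

    shard_cores: list of (solr_base_url, core_name), one entry per shard.
    index_dirs: list of built index directories, in the same shard order.
    """
    if len(shard_cores) != len(index_dirs):
        raise ValueError("need exactly one built index per shard")
    urls = []
    for (base_url, core), index_dir in zip(shard_cores, index_dirs):
        params = urlencode({
            "action": "MERGEINDEXES",
            "core": core,
            "indexDir": index_dir,
        })
        urls.append(f"{base_url}/admin/cores?{params}")
    return urls


# Hypothetical cluster layout and output paths, for illustration only.
urls = golive_requests(
    [("http://host1:8983/solr", "collection1_shard1_replica1"),
     ("http://host2:8983/solr", "collection1_shard2_replica1")],
    ["hdfs://nn:8020/out/part-00000/data/index",
     "hdfs://nn:8020/out/part-00001/data/index"],
)
```

In the ZooKeeper-aware mode described above, the list of shard URLs would come from the cluster state rather than being typed in by hand.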
HBase Integration
•Collaboration between NGData & Cloudera
•NGData created the Lily data management platform
•Lily HBase Indexer
•Service which acts as an HBase replication listener
•HBase replication features, such as filtering, supported
•Replication updates trigger indexing of updates (rows)
•Integrates Morphlines library for ETL of rows
•AL2 licensed on GitHub https://github.com/ngdata
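The listener pattern can be sketched without any HBase or Solr dependency: each replicated row update is filtered, mapped to a document, and handed to an indexing sink. Every name below (`IndexerListener`, `row_to_doc`, the field mapping) is hypothetical and only illustrates the data flow; the real indexer uses Morphlines for the mapping step.

```python
def row_to_doc(row_key, cells, field_map):
    """Map an HBase-style row to a flat document.

    cells: dict of 'family:qualifier' -> value.
    field_map: dict of 'family:qualifier' -> target field name.
    """
    doc = {"id": row_key}
    for column, field in field_map.items():
        if column in cells:
            doc[field] = cells[column]
    return doc


class IndexerListener:
    """Receives replicated row updates and forwards documents to a sink."""

    def __init__(self, tables, field_map, sink):
        self.tables = set(tables)   # replication filter: only these tables are indexed
        self.field_map = field_map
        self.sink = sink            # callable that would send the document to Solr

    def on_update(self, table, row_key, cells):
        if table not in self.tables:
            return False
        self.sink(row_to_doc(row_key, cells, self.field_map))
        return True


indexed = []
listener = IndexerListener(
    tables=["articles"],
    field_map={"meta:title": "title", "meta:author": "author"},
    sink=indexed.append,
)
listener.on_update("articles", "row1", {"meta:title": "Solr on Hadoop", "raw:body": "..."})
listener.on_update("audit", "row9", {"meta:title": "ignored"})
```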
HBase Integration
HDFS
HBase
interactive load
Indexer(s)
Triggers on updates
Solr server
Solr server
Solr server
Solr server
Solr server
Hue Integration
Hue
•Simple UI
•Navigated, faceted drill down
•Customizable display
•Full text search, standard Solr
API and query language
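Because the UI speaks the standard Solr API, a faceted, highlighted search is just a request to a collection's `/select` handler. The helper below assembles such a URL using the standard `q`, `wt`, `facet`, `facet.field`, `hl`, and `hl.fl` parameters. The function name, host, collection, and field names are assumptions for the example, and this is not a description of how Hue builds its requests.

```python
from urllib.parse import urlencode


def search_url(base_url, collection, text, facet_fields=(), highlight_fields=()):
    """Build a Solr select URL with optional faceting and highlighting."""
    params = [("q", text), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", f) for f in facet_fields]
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and fields.
url = search_url(
    "http://localhost:8983/solr",
    "books",
    "hadoop",
    facet_fields=["author", "year"],
    highlight_fields=["title", "content"],
)
```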
Hue Integration
Sentry Integration (Security)
Collection-Level (Query, Update, Admin)
Document-Level (Filter on document metadata)
Also supports Kerberos (KRB) and SSL
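The two levels of authorization can be illustrated with a toy model: a privilege table answers "may this role query, update, or administer this collection?", and a metadata filter hides individual documents. The data structures, role names, and the `auth` field below are hypothetical and do not reflect Sentry's policy format or API.

```python
# Hypothetical policy: role -> collection -> allowed actions.
PRIVILEGES = {
    "analyst": {"logs": {"query"}},
    "ingest": {"logs": {"query", "update"}},
    "ops": {"logs": {"query", "update", "admin"}},
}


def can(roles, collection, action, privileges=PRIVILEGES):
    """Collection-level check: does any of the user's roles allow the action?"""
    return any(
        action in privileges.get(role, {}).get(collection, set())
        for role in roles
    )


def visible_docs(docs, roles, auth_field="auth"):
    """Document-level check: keep documents whose auth metadata names one of the roles."""
    wanted = set(roles)
    return [d for d in docs if wanted & set(d.get(auth_field, []))]


docs = [
    {"id": "1", "auth": ["analyst", "ops"]},
    {"id": "2", "auth": ["ops"]},
    {"id": "3"},
]
analyst_view = visible_docs(docs, ["analyst"])
```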
Getting Started (Links)
Quickstart VM Download
Search Tutorials
Mike Drob, Cloudera
