© Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP Search Overview
Agenda
 HDP Search Overview
 Hadoop Integration
 Security
HDP Search
HDP 2.6 contains support for:
• Apache Solr 6.6.2 (with Lucene)
• Banana 1.6.12 (Search & Time-Series Visualization)
• Hadoop connectors (HBase, Hive, Pig)
• SDK for Spark
• Apache Ranger integration (Collection level)
Apache Solr / Lucene
What’s in Solr 6?
• Streaming Expressions
• Parallel SQL Interface
• Cross Data Center Replication (SolrCloud only)
• DocValues

curl "http://localhost:8983/solr/gettingstarted/sql?q=*:*&stmt=SELECT%20max(price)%20FROM%20gettingstarted"
curl --data-urlencode 'expr=search(enron_emails, q="from:1800flowers*", qt="/export")' http://localhost:8983/solr/enron_emails/stream

Solr / Lucene provides a fast NoSQL engine for textual search, time-series analysis, spatial and SQL queries, and many more use cases.
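DocValues are enabled per field in the schema. A minimal sketch of such a field definition (the field and type names here are examples only, not from the deck):

```xml
<!-- Enable docValues on a numeric field to speed up sorting, faceting,
     and the Parallel SQL interface (field/type names are examples only). -->
<field name="price" type="pfloat" indexed="true" stored="true" docValues="true"/>
```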
HDP Search: Deployment Options

Local storage backed Solr cluster
• Advantages: scales independently; scales easily for increased query volume; no need to carefully orchestrate resource allocation among workloads, indexing, and querying
• Disadvantages: multiple clusters to administer and manage

HDFS backed Solr cluster
• Advantages: single cluster to administer and manage; leverages Hadoop file system advantages (replication)
• Disadvantages: query response time is typically slower (e.g., 500 ms vs. 100 ms on local storage)
How to store Solr’s index on HDFS?

Solr indexes on HDFS:
 Store indexes in HDFS
 Kerberos supported
 Wise to co-locate Solr with the DataNodes

Update the core’s solrconfig.xml:

1) Set the directory factory:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>

2) Set the lock type (inside <indexConfig>):
<lockType>hdfs</lockType>
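Alternatively, the same HDFS settings can be passed as system properties when starting Solr. A minimal sketch, assuming the same sandbox host and path as the solrconfig.xml example above:

```shell
# Start Solr with HDFS-backed indexes via system properties
# (host and path are placeholders for your environment).
bin/solr start -c \
  -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=hdfs://sandbox:8020/user/solr
```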
Scalable Indexing in HDFS Using the Lucidworks Hadoop Connector

MapReduce job — supported input formats:
– CSV
– Microsoft Office files
– Grok (log data)
– Zip
– Solr XML
– Sequence files
– WARC

Apache Pig & Hive:
– Write your own Pig/Hive scripts to index content
• Use Hive/Pig for preprocessing and joining
• Output the resulting datasets to Solr

(Diagram: raw documents in HDFS → MapReduce or Pig job → Lucene indexes)
How to Index Using MapReduce

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=cat,2=name,3=price,4=instock,5=author \
  -DcsvFirstLineComment \
  -DidField=id \
  -DcsvDelimiter="," \
  -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
  -c labs \
  -i csv/* \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -zk localhost:2181

Ingest mappers include: CSV, Directory, Grok, RegEx, SequenceFile, SolrXML, WARC, Zip
Index using Apache NiFi for robust data pipelines

How Apache NiFi works with Apache Solr
• SolrCloud or standalone
• Leverages SolrJ
• GetSolr – extract new documents based on a date/time field
• PutSolrContentStream – stream data to be indexed into Solr
• Use various handlers: CSV, JSON, XML, etc.

Use Cases
Connect to any source, then translate, transform, enrich, and index it!
 Ingest various formats (JSON, Avro, XML, etc.)
 Real-time or scheduled
 Great for log ingest
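Under the hood, PutSolrContentStream streams content to Solr’s update handlers much like a plain HTTP post. A hand-rolled equivalent sketch, assuming a local Solr instance (the collection name "logs" and the document fields are examples only):

```
# Index a JSON document the way PutSolrContentStream would stream it
# (collection and fields are hypothetical).
curl "http://localhost:8983/solr/logs/update/json/docs?commit=true" \
  -H 'Content-Type: application/json' \
  -d '{"id": "log-1", "level": "ERROR", "message": "disk full"}'
```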
Index Tiering

Hot – real-time indexing and querying
Warm – active querying, no indexing
Cold – index is offline
Frozen – index is archived

Rotate indexes on a schedule: use collection aliasing by time and expire the older shards.
current -> myindex_20151225
n_1 -> myindex_20151224
n_2 -> myindex_20151223

/admin/collections?action=CREATESHARD
/admin/collections?action=CREATEALIAS
/admin/collections?action=DELETESHARD
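The rotation above can be sketched as a small script that derives the dated collection name and the Collections API call; the host, alias, and collection names are hypothetical:

```shell
# Sketch: time-based alias rotation (names are examples; assumes SolrCloud).
TODAY=$(date -u +%Y%m%d)
NEW_COLLECTION="myindex_${TODAY}"
SOLR="http://localhost:8983/solr"

# Collections API call that repoints the "current" alias at today's collection.
CREATE_ALIAS="${SOLR}/admin/collections?action=CREATEALIAS&name=current&collections=${NEW_COLLECTION}"
echo "${CREATE_ALIAS}"
# A scheduler (e.g. cron) would then issue:  curl "${CREATE_ALIAS}"
```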
Collection Security with Ranger
Setup & Options
1) Create a policy on a Solr collection
2) Assign to Users and Groups
3) Select Permissions
Read, Write, Create, Admin, Select
4) Delegate Admins
5) Limit to specific IP Addresses
6) Audit log the policy
Security Filters

 Search Component – simple record filtering against a security user/group mapping
Apache ManifoldCF
http://wiki.apache.org/solr/SolrSecurity#Manifold_CF_.28Connector_Framework.29
 PostFilter – costly, but can handle complex ACL logic
https://lucidworks.com/blog/2015/05/15/custom-security-filtering-solr-5/
 Pseudo Joins – fetch the distinct set of documents each user may view, then join it against the search results
Implemented as a Request Handler or Search Component
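The pseudo-join approach can be sketched with Solr’s join query parser; the core name "acl", the field names, and the user are hypothetical:

```
# Keep only documents whose id appears in the "acl" core for the current user
# (core, fields, and user are examples only).
curl "http://localhost:8983/solr/docs/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!join from=doc_id to=id fromIndex=acl}user:bob'
```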
