© Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP Search Overview
Agenda
 HDP Search Overview
 Hadoop Integration
 Security
HDP Search
HDP 2.6 contains support for:
• Apache Solr 6.6.2 (with Lucene)
• Banana 1.6.12 (Search & Time-Series Visualization)
• Hadoop connectors (HBase, Hive, Pig)
• SDK for Spark
• Apache Ranger integration (Collection level)
Apache Solr / Lucene
What’s in Solr 6?
• Streaming Expressions
• Parallel SQL Interface
• Cross Data Center Replication (SolrCloud only)
• DocValues

curl "http://localhost:8983/solr/gettingstarted/sql?q=*:*&stmt=SELECT%20max(price)%20FROM%20gettingstarted"
curl --data-urlencode 'expr=search(enron_emails, q="from:1800flowers*", qt="/export")' http://localhost:8983/solr/enron_emails/stream

Solr / Lucene provides a fast NoSQL engine for textual search, time-series analysis, spatial and SQL queries, and many more use cases.
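DocValues are enabled per field in the schema. A minimal sketch of such a field definition (the field and type names here are examples only, not from the deck):

```xml
<!-- Enable docValues on a numeric field to speed up sorting, faceting,
     and the Parallel SQL interface (field/type names are examples only). -->
<field name="price" type="pfloat" indexed="true" stored="true" docValues="true"/>
```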
HDP Search: Deployment Options

Local storage backed Solr cluster
• Advantages: scales independently; scales easily for increased query volume; no need to carefully orchestrate resource allocation among workloads, indexing, and querying
• Disadvantages: multiple clusters to administer and manage

HDFS backed Solr cluster
• Advantages: single cluster to administer and manage; leverages Hadoop file system advantages (replication)
• Disadvantages: query response time is typically slower (e.g., 500 ms vs. 100 ms on local storage)
How to store Solr’s index on HDFS?

Solr indexes on HDFS:
 Store indexes in HDFS
 Kerberos supported
 Wise to co-locate Solr with the DataNodes

Update the core’s solrconfig.xml:

1) Set the directory factory:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>

2) Set the lock type (inside <indexConfig>):
<lockType>hdfs</lockType>
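Alternatively, the same HDFS settings can be passed as system properties when starting Solr. A minimal sketch, assuming the same sandbox host and path as the solrconfig.xml example above:

```shell
# Start Solr with HDFS-backed indexes via system properties
# (host and path are placeholders for your environment).
bin/solr start -c \
  -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=hdfs://sandbox:8020/user/solr
```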
Scalable Indexing in HDFS Using the Lucidworks Hadoop Connector

MapReduce job — supported input formats:
– CSV
– Microsoft Office files
– Grok (log data)
– Zip
– Solr XML
– Sequence files
– WARC

Apache Pig & Hive:
– Write your own Pig/Hive scripts to index content
• Use Hive/Pig for preprocessing and joining
• Output the resulting datasets to Solr

(Diagram: raw documents in HDFS → MapReduce or Pig job → Lucene indexes)
How to Index Using MapReduce

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=cat,2=name,3=price,4=instock,5=author \
  -DcsvFirstLineComment \
  -DidField=id \
  -DcsvDelimiter="," \
  -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
  -c labs \
  -i csv/* \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -zk localhost:2181

Ingest mappers include: CSV, Directory, Grok, RegEx, SequenceFile, SolrXML, WARC, Zip
Index using Apache NiFi for robust data pipelines

How Apache NiFi works with Apache Solr
• SolrCloud or standalone
• Leverages SolrJ
• GetSolr – extract new documents based on a date/time field
• PutSolrContentStream – stream data to be indexed into Solr
• Use various handlers: CSV, JSON, XML, etc.

Use Cases
Connect to any source, then translate, transform, enrich, and index it!
 Ingest various formats (JSON, Avro, XML, etc.)
 Real-time or scheduled
 Great for log ingest
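Under the hood, PutSolrContentStream streams content to Solr’s update handlers much like a plain HTTP post. A hand-rolled equivalent sketch, assuming a local Solr instance (the collection name "logs" and the document fields are examples only):

```
# Index a JSON document the way PutSolrContentStream would stream it
# (collection and fields are hypothetical).
curl "http://localhost:8983/solr/logs/update/json/docs?commit=true" \
  -H 'Content-Type: application/json' \
  -d '{"id": "log-1", "level": "ERROR", "message": "disk full"}'
```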
Index Tiering

Hot – real-time indexing and querying
Warm – active querying, no indexing
Cold – index is offline
Frozen – index is archived

Rotate indexes on a schedule: use collection aliasing by time and expire the older shards.
current -> myindex_20151225
n_1 -> myindex_20151224
n_2 -> myindex_20151223

/admin/collections?action=CREATESHARD
/admin/collections?action=CREATEALIAS
/admin/collections?action=DELETESHARD
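The rotation above can be sketched as a small script that derives the dated collection name and the Collections API call; the host, alias, and collection names are hypothetical:

```shell
# Sketch: time-based alias rotation (names are examples; assumes SolrCloud).
TODAY=$(date -u +%Y%m%d)
NEW_COLLECTION="myindex_${TODAY}"
SOLR="http://localhost:8983/solr"

# Collections API call that repoints the "current" alias at today's collection.
CREATE_ALIAS="${SOLR}/admin/collections?action=CREATEALIAS&name=current&collections=${NEW_COLLECTION}"
echo "${CREATE_ALIAS}"
# A scheduler (e.g. cron) would then issue:  curl "${CREATE_ALIAS}"
```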
Collection Security with Ranger
Setup & Options
1) Create a policy on a Solr collection
2) Assign to Users and Groups
3) Select Permissions
Read, Write, Create, Admin, Select
4) Delegate Admins
5) Limit to specific IP Addresses
6) Audit log the policy
Security Filters

 Search Component – simple record filtering against a security user/group mapping
Apache ManifoldCF
http://wiki.apache.org/solr/SolrSecurity#Manifold_CF_.28Connector_Framework.29
 PostFilter – costly, but can handle complex ACL logic
https://lucidworks.com/blog/2015/05/15/custom-security-filtering-solr-5/
 Pseudo Joins – fetch the distinct set of documents each user may view, then join it against the search results
Implemented as a Request Handler or Search Component
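The pseudo-join approach can be sketched with Solr’s join query parser; the core name "acl", the field names, and the user are hypothetical:

```
# Keep only documents whose id appears in the "acl" core for the current user
# (core, fields, and user are examples only).
curl "http://localhost:8983/solr/docs/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!join from=doc_id to=id fromIndex=acl}user:bob'
```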
