Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP Search Workshop
Hortonworks. We do Hadoop.
1/29/2013
Page 2
Agenda
•  Hortonworks Data Platform 2.2
•  Apache Solr
•  Query & Ingest Documents with Apache Solr
•  Solr & Hadoop
•  Index on HDFS
•  MapReduce, Hive & Pig
•  Solr Cloud
•  Sizing
•  Demo
Page 3
HDP 2.2
Page 4
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
[Architecture diagram: HDP 2.2]
YARN: Data Operating System (Cluster Resource Management), running on HDFS (Hadoop Distributed File System)
•  Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive on Tez), Java/Scala (Cascading on Tez), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo on Slider), In-Memory (Spark), Others (ISV Engines)
•  Governance: Data Workflow, Lifecycle & Governance (Falcon), Sqoop, Flume, Kafka, NFS, WebHDFS
•  Operations: Provision, Manage & Monitor (Ambari), Zookeeper, Scheduling (Oozie)
•  Security: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox, Ranger
•  Deployment Choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP: it enables batch, interactive and real-time workloads, provides comprehensive enterprise capabilities, and offers the widest range of deployment options.
Delivered Completely in the OPEN
Page 5
HDP 2.2: Reliable, Consistent & Current
HDP is Apache Hadoop, not “based on” Hadoop
Page 6
HDP Search
HDP 2.2 contains support for:
•  Apache Solr 4.10 with Lucene
•  Banana (Time-series visualization)
•  Lucidworks Hadoop connector
Page 7
Apache Solr
Page 8
What is Apache Solr?
•  A system built to search text
•  A specialized type of database management system
•  A platform to build search applications on
•  Customizable, open source software
Page 9
Why Apache Solr?
Specialized tools do the job better!
•  Solr performs much better at text search than a relational database
•  Solr knows about languages
» E.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς
•  Solr has features specific to text search
» E.g. highlighting search results
Page 10
Where does Apache Solr fit?
Page 11
Apache Solr’s Architecture
Page 12
Apache Solr’s inner Architecture
Page 13
Basics Of Inverted Index
Document
Doc ID  Content
1       I like dog
2       I like cat
3       I like dog and cat

Inverted Index
Term    Doc ID
I       1, 2, 3
like    1, 2, 3
dog     1, 3
cat     2, 3

Question: Find documents with dog & cat
Answer: Intersect the postings lists for dog and cat:
(1, 3) ∩ (2, 3) = (3)
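The lookup above can be sketched in a few lines of Java (a toy illustration of the idea only; Solr/Lucene's real postings lists use far more compact, disk-friendly structures):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexDemo {
    // Build term -> sorted set of doc IDs from (docId, content) pairs
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            for (String term : e.getValue().toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    // AND query: intersect the postings lists of two terms
    static SortedSet<Integer> and(Map<String, SortedSet<Integer>> index, String a, String b) {
        SortedSet<Integer> result = new TreeSet<>(index.getOrDefault(a, new TreeSet<>()));
        result.retainAll(index.getOrDefault(b, new TreeSet<>()));
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "I like dog");
        docs.put(2, "I like cat");
        docs.put(3, "I like dog and cat");
        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println(and(index, "dog", "cat")); // prints [3]
    }
}
```

The key property is that the work scales with the lengths of the two postings lists, not with the total number of documents.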
Page 14
SOLR Indexing
•  Define document structure using schema.xml
•  Convert documents from their source format to a format supported by Solr (XML, JSON, CSV)
•  Add documents to Solr
<doc>
<field name="id">1</field>
<field name="screen_name">@thelabdude</field>
<field name="cat">post</field>
</doc>
Sample Document in XML Format
Page 15
Solr’s schema & fields
Before adding documents to Solr, you need to specify the schema, represented in a file called schema.xml.
The schema declares:
-  Fields
-  The field used as the unique/primary key
-  Field types
-  How to index and search each field
Field Types
In Solr, every field has a type. E.g.: float, long, double, date, text
Defining a field:
<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
Page 16
Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema.
A dynamic field is like a regular field, except its name contains a wildcard.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
For more field details see: http://wiki.apache.org/solr/SchemaXml
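How Solr resolves a field name like price_i against the `*_i` pattern can be pictured with a simplified matcher (a sketch only: `matchesDynamic` is a hypothetical helper, not a Solr API, and real Solr resolution also prefers exact field definitions and the longest matching pattern):

```java
public class DynamicFieldMatcher {
    // Simplified wildcard match: the pattern may start or end with '*'
    static boolean matchesDynamic(String pattern, String fieldName) {
        if (pattern.startsWith("*")) {
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(fieldName);
    }

    public static void main(String[] args) {
        // A field named price_i is picked up by the "*_i" (int) dynamic field
        System.out.println(matchesDynamic("*_i", "price_i")); // true
        System.out.println(matchesDynamic("*_i", "title"));   // false
    }
}
```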
Page 17
Query & Index Documents with Solr
Page 18
Adding & Deleting From SOLR
Solr offers a REST-like interface for indexing and searching.
Add to index:
curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'
Delete from index:
curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
CSV, JSON and XML are handled directly. Solr leverages Apache Tika for complex document types (PDF, Word, etc.)
Page 19
How To Query
Solr offers a REST-like interface for indexing and searching:
Query
http://localhost:8983/solr/select?q=name:monsters&wt=json&indent=true
Page 20
Solr Java API (SolrJ)
Index:
SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("name", "doc1", 1.0f);
server.add(doc1); // You can also stream docs in a single HTTP request by passing an Iterator to add()
server.commit();
Query:
SolrQuery solrQuery = new SolrQuery().setQuery("ipod");
QueryResponse rsp = server.query(solrQuery);
Iterator<SolrDocument> iter = rsp.getResults().iterator();
while (iter.hasNext()) {
  SolrDocument resultDoc = iter.next();
  String content = (String) resultDoc.getFieldValue("content");
}
Page 21
Solr & Hadoop
Page 22
HDP Search : Deployment Options
Solr deployed in an independent cluster
•  Advantages: scale independently; scale easily for increased query volume; no need to carefully orchestrate resource allocations among workloads, indexing, and querying
•  Disadvantages: multiple clusters to administer and manage

Solr index deployed on HDFS nodes
•  Advantages: single cluster to administer and manage; leverages Hadoop file system advantages
•  Disadvantages: not supported on kerberized clusters
Page 23
How to store Solr’s index on HDFS?
Update the core’s solrconfig.xml to set:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
<str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.blockcache.write.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
<lockType>hdfs</lockType>
Page 24
Scalable Indexing In HDFS Using Lucidworks Hadoop connector
• MapReduce job
– CSV
– Microsoft Office files
– Grok (log data)
– Zip
– Solr XML
– Sequence files
– WARC
• Apache Pig & Hive
– Write your own Pig/Hive scripts to index content
– Use Hive/Pig for preprocessing and joining
– Output the resulting datasets to Solr
[Diagram: raw documents in HDFS → MapReduce or Pig job → Lucene indexes in Solr]
Page 25
Ingest CSV files using MapReduce
Scenario I: Ingest CSV data stored on local disk (root@127.0.0.1:/root/csv)
java -classpath "/usr/hdp/2.2.0.0-2041/hadoop-yarn/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*:/opt/solr/lw/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/opt/solr/lucidworks-hadoop-lws-job-1.3.0.jar:/usr/hdp/2.2.0.0-2041/hadoop/*" \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user \
  -DcsvDelimiter="|" -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c hr -i ./csv \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -s http://172.16.227.204:8983/solr
Page 26
Query Solr Index in Hive
Scenario II: Query index data via Hive
CREATE EXTERNAL TABLE solr (id string, location string, event_timestamp string, deviceid string, heartrate bigint, user string)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.server.url' = 'http://172.16.227.204:8983/solr', 'solr.collection' = 'hr', 'solr.query' = '*:*');

SELECT user, heartrate FROM solr;
Page 27
Index existing Hive data
Scenario III: Index data stored in Hive
(copy legacydata.csv to hdfs:///user/guest/legacy)
CREATE TABLE legacyhr (id string, location string, time string, hr int, user string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

LOAD DATA INPATH '/user/guest/legacy' INTO TABLE legacyhr;

INSERT INTO TABLE solr SELECT id, location, time, 'nodevice', hr, user FROM legacyhr;
Page 28
Transform & Index documents with Pig
Scenario IV: Transform and index data stored on HDFS
data.pig:
REGISTER '/opt/hadoop-lws-job-2.0.1-0-0-hadoop2.jar';
set solr.collection '$collection';
A = load '/user/guest/pigdata' using PigStorage(';') as (id_s:chararray, location_s:chararray, event_timestamp_s:chararray, deviceid_s:chararray, heartrate_l:long, user_s:chararray);
-- ID comes first, then alternating field name and value
B = FOREACH A GENERATE $0, 'location', '29.4238889,-98.4933333', 'event_timestamp', $2, 'deviceid', $3, 'heartrate', $4, 'user', $5;
ok = store B into '$solrUrl' using com.lucidworks.hadoop.pig.SolrStoreFunc();

Run with:
pig -p solrUrl=http://172.16.227.204:8983/solr -p collection=hr data.pig
Page 29
Solr Cloud
Page 30
SolrCloud
Apache Solr includes fault tolerance and high availability via SolrCloud:
•  Distributed indexing
•  Distributed search
•  Central configuration for the entire cluster
•  Automatic load balancing and fail-over for queries
•  ZooKeeper integration for cluster coordination and configuration.
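Distributed indexing relies on routing each document to a shard by its id. The idea can be sketched as below (an assumption-laden toy: SolrCloud's actual compositeId router hashes ids with a murmur hash over a hash-range ring, not Java's hashCode):

```java
public class ShardRouter {
    // Toy router: map a document id to one of numShards shards
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        for (String id : new String[]{"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, numShards));
        }
        // The same id always routes to the same shard
        System.out.println(shardFor("doc-1", numShards) == shardFor("doc-1", numShards)); // true
    }
}
```

Deterministic routing is what lets any node in the cluster forward an update to the correct shard leader without a central lookup.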
Page 31
Sizing
Page 32
Sizing Guidelines (Handle With Care)
• 100–250 million docs per Solr server
• 4 Solr servers per physical machine
– Physical machine: 20 cores, 128 GB RAM
• Up to ~30 queries/s with double-digit-millisecond response times
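Taken together, these rules of thumb imply the per-machine capacity below (back-of-the-envelope arithmetic only; real sizing depends on document size, query mix, and replication):

```java
public class SizingEstimate {
    public static void main(String[] args) {
        long minDocsPerServer = 100_000_000L;
        long maxDocsPerServer = 250_000_000L;
        int serversPerMachine = 4;
        int cores = 20;
        int ramGb = 128;

        // 4 servers x 100-250M docs => 400M to 1B docs per physical machine
        System.out.println("Docs per machine: "
                + minDocsPerServer * serversPerMachine + " - "
                + maxDocsPerServer * serversPerMachine);

        // Each Solr server gets roughly a quarter of the box
        System.out.println("Per server: " + cores / serversPerMachine
                + " cores, " + ramGb / serversPerMachine + " GB RAM");
    }
}
```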
Page 33
Demo
