Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP Search Workshop
Hortonworks. We do Hadoop.
1/29/2013
Page 2
Agenda
•  Hortonworks Data Platform 2.2
•  Apache Solr
•  Query & Ingest Documents with Apache Solr
•  Solr & Hadoop
•  Index on HDFS
•  MapReduce, Hive & Pig
•  Solr Cloud
•  Sizing
•  Demo
Page 3
HDP 2.2
Page 4
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
[Architecture diagram: HDP 2.2]
YARN: Data Operating System (Cluster Resource Management), running on HDFS (Hadoop Distributed File System)
•  Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive on Tez), Java/Scala (Cascading on Tez), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo on Slider), In-Memory (Spark), Others (ISV Engines)
•  Governance: Data Workflow, Lifecycle & Governance (Falcon), Sqoop, Flume, Kafka, NFS, WebHDFS
•  Operations: Provision, Manage & Monitor (Ambari), Zookeeper, Scheduling (Oozie)
•  Security: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox, Ranger
•  Deployment Choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP: it enables batch, interactive and real-time workloads, provides comprehensive enterprise capabilities, and offers the widest range of deployment options.
Delivered Completely in the OPEN
Page 5
HDP 2.2: Reliable, Consistent & Current
HDP is Apache Hadoop, not “based on” Hadoop
Page 6
HDP Search
HDP 2.2 contains support for:
•  Apache Solr 4.10 with Lucene
•  Banana (Time-series visualization)
•  Lucidworks Hadoop connector
Page 7
Apache Solr
Page 8
What is Apache Solr?
•  A system built to search text
•  A specialized type of database management system
•  A platform to build search applications on
•  Customizable, open source software
Page 9
Why Apache Solr?
Specialized tools do the job better!
•  Solr performs much better at text search than a relational database
•  Solr knows about languages
» E.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς
•  Solr has features specific to text search
» E.g. highlighting search results
Page 10
Where does Apache Solr fit?
Page 11
Apache Solr’s Architecture
Page 12
Apache Solr’s inner Architecture
Page 13
Basics Of Inverted Index
Document
Doc ID  Content
1       I like dog
2       I like cat
3       I like dog and cat

Inverted Index
Term    Doc ID
I       1, 2, 3
like    1, 2, 3
dog     1, 3
cat     2, 3

Question: Find documents with dog & cat
Answer: Intersect the postings lists for dog and cat:
(1, 3) ∩ (2, 3) = (3)
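The lookup above can be sketched in a few lines of Java (a toy illustration of the idea only; Solr/Lucene's real postings lists use far more compact, disk-friendly structures):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexDemo {
    // Build term -> sorted set of doc IDs from (docId, content) pairs
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            for (String term : e.getValue().toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    // AND query: intersect the postings lists of two terms
    static SortedSet<Integer> and(Map<String, SortedSet<Integer>> index, String a, String b) {
        SortedSet<Integer> result = new TreeSet<>(index.getOrDefault(a, new TreeSet<>()));
        result.retainAll(index.getOrDefault(b, new TreeSet<>()));
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "I like dog");
        docs.put(2, "I like cat");
        docs.put(3, "I like dog and cat");
        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println(and(index, "dog", "cat")); // prints [3]
    }
}
```

The key property is that the work scales with the lengths of the two postings lists, not with the total number of documents.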
Page 14
SOLR Indexing
•  Define document structure using schema.xml
•  Convert documents from their source format to a format supported by Solr (XML, JSON, CSV)
•  Add documents to Solr
<doc>
<field name="id">1</field>
<field name="screen_name">@thelabdude</field>
<field name="cat">post</field>
</doc>
Sample Document in XML Format
Page 15
Solr’s schema & fields
Before adding documents to Solr, you need to specify the schema, represented in a file called schema.xml.
The schema declares:
-  Fields
-  The field used as the unique/primary key
-  Field types
-  How to index and search each field
Field Types
In Solr, every field has a type. E.g.: float, long, double, date, text
Defining a field:
<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
Page 16
Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema.
A dynamic field is like a regular field, except its name contains a wildcard.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
For more field details see: http://wiki.apache.org/solr/SchemaXml
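How Solr resolves a field name like price_i against the `*_i` pattern can be pictured with a simplified matcher (a sketch only: `matchesDynamic` is a hypothetical helper, not a Solr API, and real Solr resolution also prefers exact field definitions and the longest matching pattern):

```java
public class DynamicFieldMatcher {
    // Simplified wildcard match: the pattern may start or end with '*'
    static boolean matchesDynamic(String pattern, String fieldName) {
        if (pattern.startsWith("*")) {
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(fieldName);
    }

    public static void main(String[] args) {
        // A field named price_i is picked up by the "*_i" (int) dynamic field
        System.out.println(matchesDynamic("*_i", "price_i")); // true
        System.out.println(matchesDynamic("*_i", "title"));   // false
    }
}
```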
Page 17
Query & Index Documents with Solr
Page 18
Adding & Deleting From SOLR
Solr offers a REST-like interface for indexing and searching.
Add to index:
curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'
Delete from index:
curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
CSV, JSON and XML are handled directly. Solr leverages Apache Tika for complex document types (PDF, Word, etc.)
Page 19
How To Query
Solr offers a REST-like interface for indexing and searching:
Query
http://localhost:8983/solr/select?q=name:monsters&wt=json&indent=true
Page 20
Solr Java API (SolrJ)
Index:
SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("name", "doc1", 1.0f);
server.add(doc1); // You can also stream docs in a single HTTP request by passing an Iterator to add()
server.commit();
Query:
SolrQuery solrQuery = new SolrQuery().setQuery("ipod");
QueryResponse rsp = server.query(solrQuery);
Iterator<SolrDocument> iter = rsp.getResults().iterator();
while (iter.hasNext()) {
  SolrDocument resultDoc = iter.next();
  String content = (String) resultDoc.getFieldValue("content");
}
Page 21
Solr & Hadoop
Page 22
HDP Search : Deployment Options
Solr deployed in an independent cluster
•  Advantages: scale independently; scale easily for increased query volume; no need to carefully orchestrate resource allocations among workloads, indexing, and querying
•  Disadvantages: multiple clusters to administer and manage

Solr index deployed on HDFS nodes
•  Advantages: single cluster to administer and manage; leverages Hadoop file system advantages
•  Disadvantages: not supported on kerberized clusters
Page 23
How to store Solr’s index on HDFS?
Update the core’s solrconfig.xml to set:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
<str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.blockcache.write.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
<lockType>hdfs</lockType>
Page 24
Scalable Indexing In HDFS Using Lucidworks Hadoop connector
• MapReduce job
– CSV
– Microsoft Office files
– Grok (log data)
– Zip
– Solr XML
– Sequence files
– WARC
• Apache Pig & Hive
– Write your own Pig/Hive scripts to index content
– Use Hive/Pig for preprocessing and joining
– Output the resulting datasets to Solr
[Diagram: raw documents in HDFS → MapReduce or Pig job → Lucene indexes in Solr]
Page 25
Ingest CSV files using MapReduce
Scenario I: Ingest CSV data stored on local disk (root@127.0.0.1:/root/csv)
java -classpath "/usr/hdp/2.2.0.0-2041/hadoop-yarn/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*:/opt/solr/lw/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/opt/solr/lucidworks-hadoop-lws-job-1.3.0.jar:/usr/hdp/2.2.0.0-2041/hadoop/*" \
  com.lucidworks.hadoop.ingest.IngestJob \
  -DcsvFieldMapping=0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user \
  -DcsvDelimiter="|" -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c hr -i ./csv \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  -s http://172.16.227.204:8983/solr
Page 26
Query Solr Index in Hive
Scenario II: Query index data via Hive
CREATE EXTERNAL TABLE solr (id string, location string, event_timestamp string, deviceid string, heartrate bigint, user string)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.server.url' = 'http://172.16.227.204:8983/solr', 'solr.collection' = 'hr', 'solr.query' = '*:*');

SELECT user, heartrate FROM solr;
Page 27
Index existing Hive data
Scenario III: Index data stored in Hive
(copy legacydata.csv to hdfs:///user/guest/legacy)
CREATE TABLE legacyhr (id string, location string, time string, hr int, user string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

LOAD DATA INPATH '/user/guest/legacy' INTO TABLE legacyhr;

INSERT INTO TABLE solr SELECT id, location, time, 'nodevice', hr, user FROM legacyhr;
Page 28
Transform & Index documents with Pig
Scenario IV: Transform and index data stored on HDFS
data.pig:
REGISTER '/opt/hadoop-lws-job-2.0.1-0-0-hadoop2.jar';
set solr.collection '$collection';
A = load '/user/guest/pigdata' using PigStorage(';') as (id_s:chararray, location_s:chararray, event_timestamp_s:chararray, deviceid_s:chararray, heartrate_l:long, user_s:chararray);
-- ID comes first, then alternating field name and value
B = FOREACH A GENERATE $0, 'location', '29.4238889,-98.4933333', 'event_timestamp', $2, 'deviceid', $3, 'heartrate', $4, 'user', $5;
ok = store B into '$solrUrl' using com.lucidworks.hadoop.pig.SolrStoreFunc();

Run with:
pig -p solrUrl=http://172.16.227.204:8983/solr -p collection=hr data.pig
Page 29
Solr Cloud
Page 30
SolrCloud
Apache Solr includes fault tolerance and high availability via SolrCloud:
•  Distributed indexing
•  Distributed search
•  Central configuration for the entire cluster
•  Automatic load balancing and fail-over for queries
•  ZooKeeper integration for cluster coordination and configuration.
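Distributed indexing relies on routing each document to a shard by its id. The idea can be sketched as below (an assumption-laden toy: SolrCloud's actual compositeId router hashes ids with a murmur hash over a hash-range ring, not Java's hashCode):

```java
public class ShardRouter {
    // Toy router: map a document id to one of numShards shards
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        for (String id : new String[]{"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, numShards));
        }
        // The same id always routes to the same shard
        System.out.println(shardFor("doc-1", numShards) == shardFor("doc-1", numShards)); // true
    }
}
```

Deterministic routing is what lets any node in the cluster forward an update to the correct shard leader without a central lookup.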
Page 31
Sizing
Page 32
Sizing Guidelines (Handle With Care)
• 100–250 million docs per Solr server
• 4 Solr servers per physical machine
– Physical machine: 20 cores, 128 GB RAM
• Up to ~30 queries/s with double-digit-millisecond response times
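Taken together, these rules of thumb imply the per-machine capacity below (back-of-the-envelope arithmetic only; real sizing depends on document size, query mix, and replication):

```java
public class SizingEstimate {
    public static void main(String[] args) {
        long minDocsPerServer = 100_000_000L;
        long maxDocsPerServer = 250_000_000L;
        int serversPerMachine = 4;
        int cores = 20;
        int ramGb = 128;

        // 4 servers x 100-250M docs => 400M to 1B docs per physical machine
        System.out.println("Docs per machine: "
                + minDocsPerServer * serversPerMachine + " - "
                + maxDocsPerServer * serversPerMachine);

        // Each Solr server gets roughly a quarter of the box
        System.out.println("Per server: " + cores / serversPerMachine
                + " cores, " + ramGb / serversPerMachine + " GB RAM");
    }
}
```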
Page 33
Demo
