
Hortonworks Technical Workshop - HDP Search


The Enterprise Data Lake has become the defacto repository of both structured and unstructured data within an enterprise. Being able to discover information across both structured and unstructured data using search is a key capability of enterprise data lake. In this workshop, we will provide an in-depth overview of HDP Search with focus on configuration, sizing and tuning. We will also deliver a working example to showcase the usage of HDP Search along with the rest of platform capabilities to deliver real world solution.


  1. HDP Search Workshop. Hortonworks. We do Hadoop. 1/29/2013
  2. Agenda
     • Hortonworks Data Platform 2.2
     • Apache Solr
     • Query & Ingest Documents with Apache Solr
     • Solr & Hadoop
     • Index on HDFS
     • MapReduce, Hive & Pig
     • Solr Cloud
     • Sizing
     • Demo
  3. HDP 2.2
  4. HDP delivers a comprehensive data management platform
     [Architecture diagram: Hortonworks Data Platform 2.2. YARN is the data operating system (cluster resource management) on top of HDFS; batch, interactive and real-time engines run on it: Pig (script), Hive on Tez (SQL), Cascading (Java/Scala), Spark (in-memory), Storm (stream), Solr (search), HBase and Accumulo on Slider (NoSQL), plus ISV engines. Around the stack: data workflow, lifecycle and governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS), operations (Ambari, ZooKeeper, Oozie scheduling) and security (authentication, authorization, accounting, data protection via Knox and Ranger). Deployment choice: Linux or Windows, on-premises or cloud.]
     • YARN is the architectural center of HDP
     • Enables batch, interactive and real-time workloads
     • Provides comprehensive enterprise capabilities
     • The widest range of deployment options
     • Delivered completely in the open
  5. HDP 2.2: Reliable, Consistent & Current. HDP is Apache Hadoop, not "based on" Hadoop.
  6. HDP Search
     HDP 2.2 contains support for:
     • Apache Solr 4.10 with Lucene
     • Banana (time-series visualization)
     • Lucidworks Hadoop connector
  7. Apache Solr
  8. What is Apache Solr?
     • A system built to search text
     • A specialized type of database management system
     • A platform to build search applications on
     • Customizable, open source software
  9. Why Apache Solr? Specialized tools do the job better!
     • For text search, Solr performs much better than a relational database
     • Solr knows about languages
       » E.g. lowercasing ὈΔΥΣΕΎΣ produces ὀδυσεύς
     • Solr has features specific to text search
       » E.g. highlighting search results
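     For intuition on why language awareness matters: the trailing capital sigma in ὈΔΥΣΕΎΣ must lowercase to the final form ς, not the medial σ, which depends on where the letter sits in the word. A minimal plain-Java illustration of such context-sensitive lowercasing (this is only an analogy; Solr does this kind of work inside its analyzer chain, e.g. with its Greek lowercase filter):

       import java.util.Locale;

       public class CasingDemo {
           public static void main(String[] args) {
               // The last Σ lowercases to final sigma ς because it ends the word.
               System.out.println("ὈΔΥΣΕΎΣ".toLowerCase(Locale.ROOT)); // prints ὀδυσεύς
           }
       }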
  10. Where does Apache Solr fit?
  11. Apache Solr's Architecture
  12. Apache Solr's Inner Architecture
  13. Basics of an Inverted Index
     Documents:
       Doc ID | Content
       1      | I like dog
       2      | I like cat
       3      | I like dog and cat
     Inverted index:
       Term | Doc IDs
       I    | 1, 2, 3
       like | 1, 2, 3
       dog  | 1, 3
       cat  | 2, 3
     Question: find the documents that contain both dog and cat.
     Answer: intersect the postings lists for dog and cat: {1, 3} ∩ {2, 3} = {3}.
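     To make the mechanics concrete, here is a minimal sketch in plain Java (not Solr/Lucene code) that builds the inverted index above and answers the dog-and-cat query by intersecting postings lists:

       import java.util.*;

       public class InvertedIndexSketch {
           public static void main(String[] args) {
               Map<Integer, String> docs = new LinkedHashMap<>();
               docs.put(1, "I like dog");
               docs.put(2, "I like cat");
               docs.put(3, "I like dog and cat");

               // Build the inverted index: term -> sorted set of doc IDs.
               Map<String, TreeSet<Integer>> index = new HashMap<>();
               for (Map.Entry<Integer, String> e : docs.entrySet())
                   for (String term : e.getValue().split("\\s+"))
                       index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());

               // Query: intersect the postings for "dog" and "cat".
               Set<Integer> result = new TreeSet<>(index.get("dog")); // {1, 3}
               result.retainAll(index.get("cat"));                    // intersect with {2, 3}
               System.out.println(result);                            // prints [3]
           }
       }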
  14. Solr Indexing
     • Define the document structure using schema.xml
     • Convert documents from their source format to a format supported by Solr (XML, JSON, CSV)
     • Add the documents to Solr
     Sample document in XML format:
       <doc>
         <field name="id">1</field>
         <field name="screen_name">@thelabdude</field>
         <field name="cat">post</field>
       </doc>
  15. Solr's Schema & Fields
     Before adding documents to Solr, you need to specify the schema, represented in a file called schema.xml. The schema declares:
     • Fields
     • The field used as the unique/primary key
     • Field types
     • How to index and search each field
     Field types: in Solr, every field has a type, e.g. float, long, double, date, text.
     Defining a field:
       <field name="id" type="text" indexed="true" stored="true" multiValued="true"/>
  16. Dynamic Fields
     Dynamic fields allow Solr to index fields that you did not explicitly define in your schema. A dynamic field is like a regular field except that its name contains a wildcard:
       <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
     For more field details see: http://wiki.apache.org/solr/SchemaXml
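     As a small SolrJ sketch of how a dynamic field gets used (assuming a core reachable at http://localhost:8983/solr whose schema contains the *_i rule above; the document ID and the field name rating_i are made up for illustration):

       import org.apache.solr.client.solrj.SolrServer;
       import org.apache.solr.client.solrj.impl.HttpSolrServer;
       import org.apache.solr.common.SolrInputDocument;

       public class DynamicFieldExample {
           public static void main(String[] args) throws Exception {
               SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
               SolrInputDocument doc = new SolrInputDocument();
               doc.addField("id", "42");
               // "rating_i" is not declared explicitly; it matches the "*_i"
               // dynamic field, so Solr indexes it as an int.
               doc.addField("rating_i", 5);
               server.add(doc);
               server.commit();
           }
       }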
  17. Query & Index Documents with Solr
  18. Adding To & Deleting From Solr
     Solr offers a REST-like interface for indexing and searching.
     Add to the index:
       curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'
     Delete from the index:
       curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
       curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
     CSV, JSON and XML are handled directly. Solr leverages Apache Tika for complex document types (PDF, Word, etc.).
  19. How to Query
     Solr offers a REST-like interface for indexing and searching.
     Query:
       http://localhost:8983/solr/select?q=name:monsters&wt=json&indent=true
  20. Solr Java API (SolrJ)
     Index:
       SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
       SolrInputDocument doc1 = new SolrInputDocument();
       doc1.addField("id", "id1", 1.0f);
       doc1.addField("name", "doc1", 1.0f);
       // You can also stream docs in a single HTTP request by providing an Iterator to add().
       server.add(doc1);
       server.commit();
     Query:
       SolrQuery solrQuery = new SolrQuery().setQuery("ipod");
       QueryResponse rsp = server.query(solrQuery);
       Iterator<SolrDocument> iter = rsp.getResults().iterator();
       while (iter.hasNext()) {
           SolrDocument resultDoc = iter.next();
           String content = (String) resultDoc.getFieldValue("content");
       }
  21. Solr & Hadoop
  22. HDP Search: Deployment Options
     Option 1: Solr deployed in an independent cluster
       Advantages:
       • Scales independently; scales easily for increased query volume
       • No need to carefully orchestrate resource allocations among workloads, indexing, and querying
       Disadvantages:
       • Multiple clusters to administer and manage
     Option 2: Solr index deployed on HDFS nodes
       Advantages:
       • A single cluster to administer and manage
       • Leverages the advantages of the Hadoop file system
       Disadvantages:
       • Not supported for Kerberized clusters
  23. How to Store Solr's Index on HDFS
     Update the core's solrconfig.xml: (1) set the directory factory, (2) set the lock type.
       <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
         <str name="solr.hdfs.home">hdfs://sandbox:8020/user/solr</str>
         <bool name="solr.hdfs.blockcache.enabled">true</bool>
         <int name="solr.hdfs.blockcache.slab.count">1</int>
         <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
         <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
         <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
         <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
         <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
         <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
         <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
       </directoryFactory>

       <lockType>hdfs</lockType>
  24. Scalable Indexing in HDFS
     Using the Lucidworks Hadoop connector:
     • MapReduce job, with ingest mappers for:
       – CSV
       – Microsoft Office files
       – Grok (log data)
       – Zip
       – Solr XML
       – Sequence files
       – WARC
     • Apache Pig & Hive:
       – Write your own Pig/Hive scripts to index content
       – Use Hive/Pig for preprocessing and joining
       – Output the resulting datasets to Solr
     Flow: raw documents in HDFS → MapReduce or Pig job → Lucene indexes in Solr
  25. Ingest CSV Files Using MapReduce
     Scenario I: ingest CSV data stored on local disk (root@127.0.0.1:/root/csv)
       java -classpath "/usr/hdp/2.2.0.0-2041/hadoop-yarn/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*:/opt/solr/lw/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/opt/solr/lucidworks-hadoop-lws-job-1.3.0.jar:/usr/hdp/2.2.0.0-2041/hadoop/*" \
         com.lucidworks.hadoop.ingest.IngestJob \
         -DcsvFieldMapping=0=id,1=location,2=event_timestamp,3=deviceid,4=heartrate,5=user \
         -DcsvDelimiter="|" \
         -Dlww.commit.on.close=true \
         -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
         -c hr \
         -i ./csv \
         -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
         -s http://172.16.227.204:8983/solr
  26. Query Solr Index in Hive
     Scenario II: query indexed data via Hive
       CREATE EXTERNAL TABLE solr (id string, location string, event_timestamp string,
                                   deviceid string, heartrate bigint, user string)
       STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
       LOCATION '/tmp/solr'
       TBLPROPERTIES('solr.server.url' = 'http://172.16.227.204:8983/solr',
                     'solr.collection' = 'hr',
                     'solr.query' = '*:*');

       SELECT user, heartrate FROM solr;
  27. Index Existing Hive Data
     Scenario III: index data stored in Hive (copy legacydata.csv to hdfs:///user/guest/legacy)
       CREATE TABLE legacyhr (id string, location string, time string, hr int, user string)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

       LOAD DATA INPATH '/user/guest/legacy' INTO TABLE legacyhr;

       INSERT INTO TABLE solr SELECT id, location, time, 'nodevice', hr, user FROM legacyhr;
  28. Transform & Index Documents with Pig
     Scenario IV: transform and index data stored on HDFS
     data.pig:
       REGISTER '/opt/hadoop-lws-job-2.0.1-0-0-hadoop2.jar';
       set solr.collection '$collection';

       A = load '/user/guest/pigdata' using PigStorage(';')
           as (id_s:chararray, location_s:chararray, event_timestamp_s:chararray,
               deviceid_s:chararray, heartrate_l:long, user_s:chararray);

       -- ID comes first, then alternating field name and value.
       B = FOREACH A GENERATE $0, 'location', '29.4238889,-98.4933333',
           'event_timestamp', $2, 'deviceid', $3, 'heartrate', $4, 'user', $5;

       store B into '$solrUrl' using com.lucidworks.hadoop.pig.SolrStoreFunc();
     Run it:
       pig -p solrUrl=http://172.16.227.204:8983/solr -p collection=hr data.pig
  29. Solr Cloud
  30. SolrCloud
     Apache Solr includes fault tolerance and high availability: SolrCloud.
     • Distributed indexing
     • Distributed search
     • Central configuration for the entire cluster
     • Automatic load balancing and failover for queries
     • ZooKeeper integration for cluster coordination and configuration
     (A client-side sketch follows this list.)
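     A minimal SolrJ sketch of a SolrCloud client (the ZooKeeper ensemble address and the collection name "hr" are hypothetical). CloudSolrServer discovers the cluster state from ZooKeeper instead of using a fixed Solr URL, which is what gives queries automatic load balancing and failover:

       import org.apache.solr.client.solrj.SolrQuery;
       import org.apache.solr.client.solrj.impl.CloudSolrServer;
       import org.apache.solr.client.solrj.response.QueryResponse;

       public class SolrCloudExample {
           public static void main(String[] args) throws Exception {
               // Connect via ZooKeeper; requests are routed to live replicas.
               CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
               server.setDefaultCollection("hr");
               QueryResponse rsp = server.query(new SolrQuery("*:*"));
               System.out.println("Found " + rsp.getResults().getNumFound() + " docs");
               server.shutdown();
           }
       }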
  31. Sizing
  32. Sizing Guidelines (Handle With Care)
     • 100-250 million docs per Solr server
     • 4 Solr servers per physical machine (physical machine: 20 cores, 128 GB RAM)
     • Up to ~30 queries/s with double-digit-millisecond response times
     A worked example follows these guidelines.
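     As an illustrative back-of-the-envelope calculation under these guidelines (the corpus size is made up): a 1-billion-document corpus would need roughly 1,000M / 250M = 4 Solr servers at the high end of the per-server range, or 1,000M / 100M = 10 at the low end, i.e. about 1-3 physical machines at 4 Solr servers per machine.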
  33. Demo
