Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

You’ve got your Hadoop cluster, you’ve got your petabytes of unstructured data, you run mapreduce jobs and SQL-on-Hadoop queries. Something is still missing though. After all, we are not expected to enter SQL queries while looking for information on the web. Altavista and Google solved it for us ages ago. Why are we still requiring SQL or Java certification from our enterprise bigdata users? In this talk, we will look into how integration of SolrCloud into Apache Bigtop is now enabling building bigdata indexing solutions and ingest pipelines. We will dive into the details of integrating full-text search into the lifecycle of your bigdata management applications and exposing the power of Google-in-a-box to all enterprise users, not just a chosen few data scientists.


Transcript

  • 1. Building Google-in-a-box:! using Apache SolrCloud and Bigtop to index your bigdata
  • 2. Who’s this guy?
  • 3. Roman Shaposhnik! @rhatr or rvs@apache.org •  Sr. Manager at Pivotal Inc. building a team of ASF contributors •  ASF junkie •  VP of Apache Incubator, former VP of Apache Bigtop •  Hadoop/Sqoop/Giraph committer •  contributor across the Hadoop ecosystem •  Used to be root@Cloudera •  Used to be a PHB at Yahoo! •  Used to be a UNIX hacker at Sun Microsystems •  First-time author: “Giraph in action”
  • 4. What’s this all about?
  • 5. This is NOT this kind of talk
  • 6. This is this kind of a talk:
  • 7. What are we building?
  • 8. WWW analytics platform HDFS HBase MapReduce Nutch WWW Solr Cloud Lily HBase Indexer Hive Hue DataSci Replication Morphlines Pig Zookeeper
  • 9. Google papers •  GFS (Google FS) == HDFS •  MapReduce == MapReduce •  Bigtable == HBase •  Sawzall == Pig/Hive •  F1 == HAWQ/Impala
  • 10. Storage design requirements •  Low-level storage layer: KISS •  commodity hardware •  massively scalable •  highly available •  minimalistic set of APIs (non-POSIX) •  Application specific storage layer •  leverages LLSL •  Fast r/w random access (vs. immutable streaming) •  Scan operations
  • 11. Design patterns •  HDFS is the “data lake” •  Great simplification of storage administration (no SAN/NAS) •  “Stateless” distributed applications persistence layer •  Applications are “stateless” compositions of various services •  Can be instantiated anywhere (think YARN) •  Can restart serving up the state from HDFS •  Are coordinated via Zookeeper
  • 12. Application design: SolrCloud HDFS Zookeeper Solr svc … Solr svc: I am alive. Who am I? What do I do?
  • 13. Application design: SolrCloud HDFS Zookeeper Solr svc Peer is dead What do I do?
  • 14. Application design: SolrCloud HDFS Zookeeper Solr svc … Solr svc: I am alive. Who am I? What do I do? Replication kicks in.
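The three slides above show the failover loop: Zookeeper tracks which Solr services are alive, and when a peer dies its shards are reassigned to survivors, which re-serve the index state out of HDFS. A toy sketch of just the reassignment step (function and node names are illustrative, not SolrCloud internals):

```python
def reassign(assignments, live_nodes):
    """Move shards owned by dead nodes onto the remaining live nodes.

    assignments: dict of shard name -> owning node
    live_nodes:  set of nodes still heartbeating in Zookeeper
    """
    live = sorted(live_nodes)
    fixed = {}
    for i, (shard, owner) in enumerate(sorted(assignments.items())):
        # Keep live owners; spread orphaned shards round-robin over survivors.
        fixed[shard] = owner if owner in live_nodes else live[i % len(live)]
    return fixed

assignments = {"shard1": "solr-a", "shard2": "solr-b"}
# solr-b dies; "replication kicks in" and shard2 moves to solr-a.
print(reassign(assignments, {"solr-a"}))  # {'shard1': 'solr-a', 'shard2': 'solr-a'}
```

Because the index data itself lives in HDFS, reassignment is purely a metadata operation: the surviving node just opens the orphaned shard's files.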
  • 15. How do we build something like this?
  • 16. The bill of materials •  HDFS •  Zookeeper •  HBase •  Nutch •  Lily HBase indexer •  SolrCloud •  Morphlines (part of Project Kite) •  Hue •  Hive/Pig/…
  • 17. How about? $ for comp in hadoop hbase zookeeper … ; do wget http://dist.apache.org/$comp tar xzvf $comp.tar.gz cd $comp ; mvn/ant/make install scp … ssh … done
  • 18. How about? $ for comp in hadoop hbase zookeeper … ; do wget http://dist.apache.org/$comp tar xzvf $comp.tar.gz cd $comp ; mvn/ant/make install scp … ssh … done
  • 19. We’ve seen this before! GNU Software Linux kernel
  • 20. Apache Bigtop! HBase, Solr.. Hadoop
  • 21. Let’s get down to business
  • 22. Still remember this? HDFS HBase Solr Cloud Lily HBase Indexer Replication Morphlines Zookeeper Hue DataSci
  • 23. HBase: row-key design com.cnn.www/a.html <html>... content: CNN CNN.com anchor:a.com anchor:b.com
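The row key above stores the URL with its host reversed (com.cnn.www/a.html), so pages from the same domain sort next to each other in HBase's ordered key space and can be read with a single scan. A minimal sketch of that key transformation (function name and URL handling are illustrative):

```python
def reversed_domain_key(url: str) -> str:
    """Turn a URL into an HBase row key with the host segments reversed."""
    # Strip the scheme if present (assumes simple http/https URLs).
    _, _, rest = url.partition("://")
    rest = rest or url
    host, sep, path = rest.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + (sep + path if sep else "")

print(reversed_domain_key("http://www.cnn.com/a.html"))  # com.cnn.www/a.html
```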
  • 24. Indexing: schema design •  Bad news: no more “schema on write” •  Good news: you can change it on the fly •  Let’s start with the simplest one:! ! <field name="id" type="string" indexed="true" stored="true" required="true"/> <field name="text" type="text_general" indexed="true" stored="true"/> <field name="url" type="string" indexed="true" stored="true"/>
  • 25. Deployment •  Single node pseudo distributed configuration •  Puppet-driven deployment •  Bigtop comes with modules •  You provide your own cluster topology in cluster.pp
  • 26. Deploying the ‘data lake’ •  Zookeeper •  3-5 members of the ensemble # vi /etc/zookeeper/conf/zoo.cfg # service zookeeper-server init # service zookeeper-server start! •  HDFS •  tons of configurations to consider: HA, NFS, etc. •  see above, plus: /usr/lib/hadoop/libexec/init-hdfs.sh
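For the Zookeeper step above, a minimal zoo.cfg for a three-member ensemble might look like this (hostnames and data directory are placeholders, not from the talk):

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each member also needs a myid file under dataDir matching its server.N entry before `service zookeeper-server start` will bring the ensemble up.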
  • 27. HBase asynchronous indexing •  leveraging WAL for indexing •  can achieve infinite scalability of the indexer •  doesn’t slow down HBase (unlike co-processors) •  /etc/hbase/conf/hbase-site.xml: <property> <name>hbase.replication</name> <value>true</value> </property>
  • 28. Different clouds HBase Region Server HBase Region Server Lily Indexer Node Lily Indexer Node Solr Node Solr Node HBase “cloud” Lily Indexer “cloud” SolrCloud home of Morphline ETL … … … replication Solr docs
  • 29. Lily HBase indexer •  Pretends to be a region server on the receiving end •  Gets records •  Pipes them through the Morphline ETL •  Feeds the result to Solr •  All operations are managed via individual indexers
  • 30. Creating an indexer $ hbase-indexer add-indexer ! --name web_crawl ! --indexer-conf ./indexer.xml ! --connection-param solr.zk=localhost/solr ! --connection-param solr.collection=web_crawl ! --zookeeper localhost:2181
  • 31. indexer.xml <indexer table="web_crawl" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper"> <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/> <!-- <param name="morphlineId" value="morphline1"/> --> </indexer>
  • 32. Morphlines •  Part of Project Kite (look for it on GitHub) •  A very flexible ETL library (not just for HBase) •  “UNIX pipes” for bigdata •  Designed for NRT processing •  Record-oriented processing driven by HOCON definition •  Require a “pump” (most of the time) •  Have built-in syncs (e.g. loadSolr) •  Essentially a push-based data flow engine
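The "UNIX pipes for bigdata" idea above can be sketched as a push-based chain of commands, each transforming a record and handing it to the next. This toy version is illustrative only (command names echo the deck, but this is not the Kite Morphlines API):

```python
def extract_body(record, next_cmd):
    # Pull the raw cell bytes out and decode them, like extractHBaseCells.
    record["text"] = record.pop("_attachment_body").decode("utf-8")
    next_cmd(record)

def log_info(record, next_cmd):
    # Mirror the logInfo command: print, then push the record onward.
    print("output record:", record)
    next_cmd(record)

sink = []

def run_pipeline(commands, record):
    """Fold a command list into one push-based chain ending in a sink."""
    chain = sink.append
    for cmd in reversed(commands):
        chain = (lambda c, nxt: lambda rec: c(rec, nxt))(cmd, chain)
    chain(record)

run_pipeline([extract_body, log_info],
             {"id": "com.cnn.www/a.html", "_attachment_body": b"<html>CNN</html>"})
```

In the real indexer the "pump" is the Lily indexer feeding WAL entries in, and the terminal sink is loadSolr rather than a list.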
  • 33. Different clouds extractHBaseCells convertHTML WAL entries N records xquery logInfo M records P records Solr docs
  • 34. Morphline spec morphlines : [ { id : morphline1 importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"] commands : [ { extractHBaseCells {…} } { convertHTML {charset : UTF-8} } { xquery {…} } { logInfo { format : "output record: {}", args : ["@{}"] } } ] } ]
  • 35. extractHBaseCells { extractHBaseCells { mappings : [ { inputColumn : "content:*" outputField : "_attachment_body" type : "byte[]" source : value } ] } }
  • 36. xquery { xquery { fragments : [ { fragmentPath : "/" queryString : """ <fieldsToIndex> <webpage> {for $tk in //text() return concat($tk, ' ')} </webpage> </fieldsToIndex> """ } ] } }
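The xquery fragment above concatenates every text node of the crawled page into one indexable field. An illustrative Python equivalent of that extraction (not what the indexer actually runs):

```python
import xml.etree.ElementTree as ET

def all_text(xhtml: str) -> str:
    # Concatenate every text node separated by spaces, mirroring
    # `for $tk in //text() return concat($tk, ' ')`.
    root = ET.fromstring(xhtml)
    return " ".join(t.strip() for t in root.itertext() if t.strip())

print(all_text("<html><body><h1>CNN</h1><p>Breaking news</p></body></html>"))
# CNN Breaking news
```

This assumes well-formed XHTML; the real pipeline runs convertHTML first precisely so that raw crawled HTML becomes parseable.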
  • 37. SolrCloud •  Serves up lucene indices from HDFS •  A webapp running on bigtop-tomcat •  gets configured via /etc/default/solr! SOLR_PORT=8983 SOLR_ADMIN_PORT=8984 SOLR_LOG=/var/log/solr SOLR_ZK_ENSEMBLE=localhost:2181/solr SOLR_HDFS_HOME=hdfs://localhost:8020/solr SOLR_HDFS_CONFIG=/etc/hadoop/conf! !
  • 38. Collections and instancedirs •  All of these objects reside in Zookeeper •  An unfortunate trend we already saw with Lily indexers •  Collection •  a distributed set of lucene indices •  an object defined by Zookeeper configuration •  Collections require (and can share) configurations in instancedir •  Bigtop-provided tool: solrctl! $ solrctl [init|instancedir|collection|…]
  • 39. Creating a collection # solrctl init $ solrctl instancedir --generate /tmp/web_crawl $ vim /tmp/web_crawl/conf/schema.xml $ vim /tmp/web_crawl/conf/solrconfig.xml $ solrctl instancedir --create web_crawl /tmp/web_crawl $ solrctl collection --create web_crawl -s 1
  • 40. Hue •  Apache licensed, but not an ASF project •  A nice, flexible UI for the Hadoop bigdata management platform •  Follows an extensible app model
  • 41. Demo time!
  • 42. Where to go from here
  • 43. Questions?