SolrCloud on Hadoop
Cleveland Big Data and Hadoop User Group, January 2014
Alex Moundalexis
@technmsg

Disclaimer
•  Technologies, not products
•  Cloudera builds software
    •  most donated to Apache
    •  some closed-source
•  I will likely mention “Cloudera Something”
•  Cloudera “products” I reference are open source
    •  Apache Licensed
    •  Source code is on GitHub
        •  https://github.com/cloudera

What This Talk Isn’t About
•  Deploying
    •  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  Sizing & Tuning
    •  Depends heavily on data and workload
•  Coding
    •  Unless you count XML or CSV
•  Algorithms

Quick and dirty, more time for use cases.
The Apache Hadoop Ecosystem

Why “Ecosystem?”
•  In the beginning, just Hadoop
    •  HDFS
    •  MapReduce
•  Today, dozens of interrelated components
    •  I/O
    •  Processing
    •  Specialty Applications
    •  Configuration
    •  Workflow

Partial Ecosystem
[Diagram. Labels: Hadoop; external system; RDBMS / DWH; web server; device logs; API access; log collection; DB table import; batch processing; machine learning; user; DB table export; BI tool + JDBC/ODBC; Search; SQL]

HDFS
•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
    •  http://research.google.com/archive/gfs.html

Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [OSCON ’07]

MapReduce (MR)
•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
    •  http://research.google.com/archive/mapreduce.html

Under the Covers
You specify map() and reduce() functions.
The framework does the rest.

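A minimal sketch of that division of labor, in plain Python rather than the Hadoop API. The names map_words, reduce_counts, and run_job are made up for illustration; run_job stands in for the framework and does the grouping (shuffle) between the two phases in memory, which a real cluster would distribute across machines.

```python
from collections import defaultdict


def map_words(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """reduce(): collapse all values seen for one key into a total."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Stand-in for the framework: run map, group by key, run reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # the "shuffle": values grouped per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


word_counts = run_job(
    ["Hadoop stores data", "Solr indexes data"], map_words, reduce_counts
)
print(word_counts)
```

The user-supplied pieces are only the two small functions; everything in run_job is what the slide calls "the rest".
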
Apache HBase
•  Random, realtime read/write access
•  Key/value columnar store
•  (b|tr)illions of rows/columns
•  Based on Google BigTable
    •  http://research.google.com/archive/bigtable.html

Cloudera Hue
•  Hadoop User Experience
•  Hadoop is largely command line
•  Hue provides a UI for end-users
•  SDK to build your own apps on top

Apache Tika
•  Content analysis toolkit
•  Simply put, a lot of parsers
•  Detect/extract metadata/text from documents
    •  HTML
    •  XML
    •  Office
    •  PDF
    •  mbox
    •  More…

Apache ZooKeeper
•  Distributed systems are HARD
•  Everyone was trying to implement the same subsystems
•  Bugs lead to race conditions, other bad things
•  ZK: Highly reliable distributed coordination services
    •  Configuration
    •  Naming
    •  Synchronization
    •  Group Services

Cloudera Morphlines
•  In-memory transformations
•  Load, parse, transform, process
•  Records as name-value pairs w/ optional blob/pojo objects
•  Java library, embedded in your codebase
•  Used to ETL data from Flume and MR into Solr
•  Was part of CDK, now part of Kite
    •  http://kitesdk.org

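To make the "chain of in-memory transformations over name-value records" idea concrete, here is a rough analogy in plain Python. It is not the Morphlines API (that is a Java library with its own configuration format); the command names, the "key=value; key=value" input format, and the field names are all invented for this sketch.

```python
def read_line(record):
    """Parse the raw message into named fields (made-up input format)."""
    raw = record["message"][0]
    for pair in raw.split(";"):
        name, _, value = pair.partition("=")
        record.setdefault(name.strip(), []).append(value.strip())
    return record


def drop_debug(record):
    """Filter step: returning None drops the record from the pipeline."""
    return None if "DEBUG" in record.get("level", []) else record


def to_solr_doc(record):
    """Keep only the fields the (hypothetical) index schema knows about."""
    wanted = ("host", "level", "text")
    return {name: record[name][0] for name in wanted if name in record}


def run_chain(record, commands):
    """Pass one record through each command in order, stopping if dropped."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


CHAIN = [read_line, drop_debug, to_solr_doc]
doc = run_chain({"message": ["level=INFO; host=node1; text=started"]}, CHAIN)
print(doc)
```

Each record maps a field name to a list of values, so a field can carry more than one value as it moves through the chain.
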
Apache Lucene
•  Java-based index and search
    •  ranked or sorted results
        •  hits streamed through QP
        •  mem(results) << mem(collection)
    •  rich/extensible query operators
        •  bool, phrase, range, span, spatial
•  Features
    •  spellchecking
    •  hit highlighting
    •  tokenization

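One way to read the memory remark on this slide: if hits are streamed past a small bounded queue, only the best N need to be held at any time, however large the collection is. The sketch below illustrates that idea with Python's heapq. It is an assumption about what the slide means, not Lucene code, and fake_scores is a made-up stand-in for a scored hit stream.

```python
import heapq


def top_hits(scored_hits, n):
    """Return the n best (score, doc_id) pairs, best first, from a stream."""
    queue = []  # min-heap that never holds more than n hits
    for hit in scored_hits:
        if len(queue) < n:
            heapq.heappush(queue, hit)
        elif hit > queue[0]:
            # New hit beats the weakest kept hit: swap it in.
            heapq.heapreplace(queue, hit)
    return sorted(queue, reverse=True)


def fake_scores(count):
    """Made-up scoring: yields (score, doc_id) without building a list."""
    for doc_id in range(count):
        yield ((doc_id * 37) % 101) / 100.0, doc_id


best = top_hits(fake_scores(10_000), 3)
print(best)
```

Memory use depends on n, not on the number of documents scanned, because the generator never materializes the full result set.
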
Apache Solr
•  Enterprise search platform
•  Based on Apache Lucene
•  Full-text search
•  Faceting
•  NRT indexing
•  UI

Apache Solr – Simple Indexing via CLI
$ java -jar post.jar solr.xml money.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solr.xml
SimplePostTool: POSTing file money.xml
SimplePostTool: COMMITting Solr index changes..

$ post.sh *.xml

Apache Solr – Document money.xml
<add>
<doc>
  <field name="id">USD</field>
  <field name="name">One Dollar</field>
  <field name="manu">Bank of America</field>
  <field name="manu_id_s">boa</field>
  <field name="cat">currency</field>
  <field name="features">Coins and notes</field>
  <field name="price_c">1,USD</field>
  <field name="inStock">true</field>
</doc>

<doc>
  <field name="id">EUR</field>
  <field name="name">One Euro</field>

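The same add/doc/field structure can be generated rather than hand-written. A small sketch using Python's standard xml.etree.ElementTree; build_add_xml is a hypothetical helper name, and the field values are copied from the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build an <add> message with one <doc> per dict of field values."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")


usd = {
    "id": "USD",
    "name": "One Dollar",
    "manu": "Bank of America",
    "cat": "currency",
    "price_c": "1,USD",
    "inStock": "true",
}
xml_body = build_add_xml([usd])
print(xml_body)
```

The resulting string could be saved to a file and posted with the tool shown on the previous slide.
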
Apache Solr – More Advanced Indexing
•  From DB, using Data Import Handler (DIH)
•  Load a CSV file
•  POST JSON documents
•  Index binary documents (uses Tika)
•  SolrJ for programmatic document creation

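As an illustration of the "POST JSON documents" option, the sketch below prepares such a request with Python's standard library without sending it. The host and collection name are the ones used on the querying slides; the /update path, the commit=true parameter, and the helper name build_json_update are assumptions that depend on the Solr version and configuration in use.

```python
import json
import urllib.request


def build_json_update(base_url, docs, commit=True):
    """Prepare (but do not send) a POST carrying a JSON array of documents."""
    url = base_url.rstrip("/") + "/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_json_update(
    "http://solr:8983/solr/collection1",
    [{"id": "EUR", "name": "One Euro"}],
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running server.
```
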
Apache Solr – Querying
•  HTTP GET
    •  http://solr:8983/solr/collection1/select/
•  Examples
    •  ?q=timestamp:[* TO NOW]
    •  ?q=-instock:false
    •  ?q={!lucene q.op=AND df=text}myfield:foo +bar -bat

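The example queries contain spaces, brackets, and a literal "+", so they need URL encoding before they go on the wire. A short sketch with Python's urllib.parse, using the endpoint from the slide; select_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SELECT_URL = "http://solr:8983/solr/collection1/select/"


def select_url(query, **params):
    """Return a GET URL for the select handler with q and extra parameters."""
    return SELECT_URL + "?" + urlencode({"q": query, **params})


urls = [
    select_url("timestamp:[* TO NOW]"),
    select_url("-instock:false"),
    select_url("{!lucene q.op=AND df=text}myfield:foo +bar -bat"),
]
for url in urls:
    print(url)

# Decoding the last URL recovers the original query text unchanged.
decoded = parse_qs(urlsplit(urls[2]).query)["q"][0]
```

Without encoding, the "+" in "+bar" would be read as a space by the server, which changes the meaning of the query.
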
Apache Solr – Querying
•  HTTP GET
    •  http://solr:8983/solr/collection1/select/?q=video
•  Examples
    •  fl=name,id (return only name and id fields)
    •  fl=name,id,score (return relevancy score as well)
    •  fl=*,score (return all fields + relevancy score)
    •  sort=price desc&fl=name,id,price (sort by price desc)
    •  wt=json (return response in JSON format)

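A sketch of combining these parameters and reading a wt=json response in Python. The sample response is hand-written for illustration (the documents and numbers are invented, not output from a real server); its responseHeader/response/docs layout follows the JSON response shape as I understand it.

```python
import json
from urllib.parse import urlencode

# Combine several of the parameters from the slide into one query string.
query_string = urlencode(
    {"q": "video", "fl": "name,id,price", "sort": "price desc", "wt": "json"}
)

# Hand-written stand-in for what a server might return.
SAMPLE_RESPONSE = """
{"responseHeader": {"status": 0, "QTime": 1},
 "response": {"numFound": 2, "start": 0, "docs": [
   {"id": "A", "name": "Video card A", "price": 479.95},
   {"id": "B", "name": "Video card B", "price": 399.0}]}}
"""


def summarize(response_text):
    """Return the hit count and (name, price) for each returned document."""
    response = json.loads(response_text)["response"]
    return response["numFound"], [(d["name"], d["price"]) for d in response["docs"]]


found, rows = summarize(SAMPLE_RESPONSE)
print(query_string)
print(found, rows)
```
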
What the Heck is Faceting?
•  Generate counts for properties or categories
•  Links allow drill-down or refine search results

What?

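Stripped of the search engine, faceting is counting by field value and then filtering on a chosen value. A tiny in-memory illustration in Python; the documents are invented and this is not how Solr computes facets internally.

```python
from collections import Counter

docs = [
    {"name": "One Dollar", "cat": "currency", "inStock": True},
    {"name": "One Euro", "cat": "currency", "inStock": True},
    {"name": "Music player", "cat": "electronics", "inStock": False},
]


def facet_counts(documents, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in documents if field in doc)


def drill_down(documents, field, value):
    """Refine the result set to documents matching one facet value."""
    return [doc for doc in documents if doc.get(field) == value]


by_category = facet_counts(docs, "cat")
currency_docs = drill_down(docs, "cat", "currency")
print(by_category)
print([doc["name"] for doc in currency_docs])
```

The counts are what a UI shows next to each category link; clicking a link corresponds to the drill_down step.
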
Facets on Amazon.com

Apache Solr – Facets at Query Time
•  HTTP GET
    •  http://solr:8983/solr/collection1/select/?q=video
•  All docs, count by category
    q=*:*&facet=true&facet.field=cat
•  All docs, count by category and in-stock status
    q=*:*&facet=true&facet.field=cat&facet.field=inStock
•  Docs matching “ipod”, count by price (above/below $100)
    q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]

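Because facet.field and facet.query can repeat, the parameters are easier to build as a list of pairs than as a dict. The sketch below does that in Python and also converts facet counts into dictionaries. It assumes the JSON response lists each field's counts as a flat [value, count, value, count, ...] array, which I believe is the default but may differ by configuration; the sample counts are invented.

```python
from urllib.parse import urlencode


def facet_query(q, fields=(), queries=()):
    """Build a query string with repeated facet.field / facet.query params."""
    params = [("q", q), ("facet", "true"), ("wt", "json")]
    params += [("facet.field", field) for field in fields]
    params += [("facet.query", query) for query in queries]
    return urlencode(params)


def pair_up(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


query_string = facet_query("*:*", fields=["cat", "inStock"])
price_query = facet_query(
    "ipod", queries=["price:[0 TO 100]", "price:[100 TO *]"]
)

# Invented stand-in for the facet_fields section of a response.
sample_facet_fields = {
    "cat": ["currency", 4, "electronics", 14],
    "inStock": ["true", 17, "false", 4],
}
counts = {field: pair_up(flat) for field, flat in sample_facet_fields.items()}
print(query_string)
print(counts)
```
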
Apache Solr – Querying via UI

Apache SolrCloud
•  Integration of Solr + ZooKeeper
•  Provides for shard failover

Cloudera Search
•  Based on Apache Solr (incl Lucene and SolrCloud)
•  Fault-tolerance: collections backed by HDFS or HBase
•  Integration galore:
    •  HBase/Flume/MapReduce w/ Lucene
    •  Hue w/ Solr
    •  Avro w/ Tika
    •  HDFS w/ Solr/Lucene
    •  Sentry w/ Solr

Cloudera Search + Hue

Apologies, I swiped some pretty slides from marketing…
Why Search?

Search Design Strategy
One pool of data
One security framework
One set of system resources
One management interface
An Integrated Part of the Hadoop System
[Diagram. Layers: Storage (HDFS: TEXT, RCFILE, PARQUET, AVRO, ETC.; HBase: RECORDS), Integration, Resource Management, Metadata. Engines: Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (CLOUDERA IMPALA), Interactive Search (CLOUDERA SEARCH), Machine Learning (MAHOUT), Math & Statistics (SAS, R), …]

Benefits of Search Integration
Improved Big Data ROI
§  An interactive experience without technical knowledge
§  Single data set for multiple computing frameworks
Faster Time to Insight
§  Exploratory analysis, esp. unstructured data
§  Broad range of indexing options to accommodate needs
Cost Efficiency
§  Single scalable platform; no incremental investment
§  No need for separate systems, storage
Solid Foundations & Reliability
§  Solr in production environments for years
§  Hadoop-powered reliability and scalability

Some quick examples.
Search Use Cases

Search Use Cases
Powerful, proven search capabilities that let organizations:
•  Offer easy access to non-technical resources
•  Explore data prior to processing and modeling
•  Gain immediate access and find correlations in mission-critical data

Monsanto
Scalable, efficient image search for analysis and research
Track plant characteristics throughout their lifecycle
Before: Manual attribute extraction and search queries within database
Now: Parse and index images at acquisition and on demand, index archived images in batch

Cloudera: Internal Field Portal
Custom Aggregated Search

Cloudera – Internal Field Portal
•  Single stop for field engineers
    •  Mailing lists: public, private
    •  Tickets: support, development, public ASF
    •  Customer data: accounts, clusters, KB articles
    •  Customer Clusters: configs, audits, logs, events
    •  Books and papers
    •  Discussion forums
•  Dogfooding, yes
•  Makes my life easier

Cloudera – Internal Field Portal
•  Varied fetchers/observers for web/API content
•  Content is retrieved via Flume, Sqoop
•  Search indexes and replicates into HBase
•  Each collection has collection-specific filters/fields
•  Provides title, content snippet, link to original
•  Morphlines extracts books and papers using Tika
•  Impala for analytics
•  Future: Use MapReduce to ingest logs

Parting thoughts… in no particular order.
Summary

Search Simplifies Interaction
Explore
Navigate
Correlate
Experts know MapReduce. Savvy people know SQL. Everyone knows Search.

Summary
•  With Hadoop, it depends.
•  The tools are out there.
    •  Open source software, hooray!
    •  Many interconnected pieces
    •  Many unexplored opportunities
    •  A thriving community awaits you…
•  Data can make a difference.
•  Search allows everyone to interact with data.
•  This is a Big Deal.

What’s Next?
•  Search examples
    •  http://blog.cloudera.com/blog/category/search/
•  Cloudera provides pre-loaded VMs
    •  http://tiny.cloudera.com/quickstartvm
•  Clone our repos!
    •  https://github.com/cloudera

Preferably related to the talk…
Questions?

Thank You!
Alex Moundalexis
@technmsg
We’re hiring, kids! Well, not kids.

More Related Content

What's hot

What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to YouAmazon Web Services
 
GIDS2014: SolrCloud: Searching Big Data
GIDS2014: SolrCloud: Searching Big DataGIDS2014: SolrCloud: Searching Big Data
GIDS2014: SolrCloud: Searching Big DataShalin Shekhar Mangar
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLCloudera, Inc.
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...thelabdude
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksShalin Shekhar Mangar
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Lucidworks
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.gutierrezga00
 
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLeveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLucidworks
 
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...Lucidworks
 
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...Lucidworks
 
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Lucidworks
 
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...Lucidworks
 
Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015polo li
 
Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Lucidworks
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
 

What's hot (20)

Scaling search with SolrCloud
Scaling search with SolrCloudScaling search with SolrCloud
Scaling search with SolrCloud
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to You
 
GIDS2014: SolrCloud: Searching Big Data
GIDS2014: SolrCloud: Searching Big DataGIDS2014: SolrCloud: Searching Big Data
GIDS2014: SolrCloud: Searching Big Data
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networks
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.
 
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLeveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
 
Scaling Solr with Solr Cloud
Scaling Solr with Solr CloudScaling Solr with Solr Cloud
Scaling Solr with Solr Cloud
 
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
 
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
 
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
 
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
 
Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015
 
Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Webinar: What's New in Solr 6
Webinar: What's New in Solr 6
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
 

Similar to SolrCloud on Hadoop

Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldSearch in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldAlex Moundalexis
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413gregchanan
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetupgregchanan
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Tips for Tuning Solr Search: No Coding Required
Tips for Tuning Solr Search: No Coding RequiredTips for Tuning Solr Search: No Coding Required
Tips for Tuning Solr Search: No Coding RequiredAcquia
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted CloudColin Charles
 
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Kim Hammar
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit
 

Similar to SolrCloud on Hadoop (20)

Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldSearch in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the Field
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Tips for Tuning Solr Search: No Coding Required
Tips for Tuning Solr Search: No Coding RequiredTips for Tuning Solr Search: No Coding Required
Tips for Tuning Solr Search: No Coding Required
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
 

More from Alex Moundalexis

More from Alex Moundalexis (8)

Powered by the Sun
Powered by the SunPowered by the Sun
Powered by the Sun
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
YARN
YARNYARN
YARN
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
Introduction to Cloudera Impala
Introduction to Cloudera ImpalaIntroduction to Cloudera Impala
Introduction to Cloudera Impala
 
Many Hats at Cloudera
Many Hats at ClouderaMany Hats at Cloudera
Many Hats at Cloudera
 
Hue Visual Tour
Hue Visual TourHue Visual Tour
Hue Visual Tour
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
SolrCloud on Hadoop

  • 1. SolrCloud on Hadoop. Cleveland Big Data and Hadoop User Group, January 2014. Alex Moundalexis, @technmsg
  • 2. Disclaimer • Technologies, not products • Cloudera builds software • most donated to Apache • some closed-source • I will likely mention “Cloudera Something” • Cloudera “products” I reference are open source • Apache Licensed • Source code is on GitHub • https://github.com/cloudera
  • 3. What This Talk Isn’t About • Deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor • Sizing & Tuning • Depends heavily on data and workload • Coding • Unless you count XML or CSV • Algorithms
  • 4. Quick and dirty, more time for use cases. The Apache Hadoop Ecosystem
  • 5. Why “Ecosystem?” • In the beginning, just Hadoop • HDFS • MapReduce • Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
  • 6. Partial Ecosystem (diagram labels) • Hadoop • external system • RDBMS / DWH • web server • device logs • API access • log collection • DB table import • batch processing • machine learning • external system • API access • user • RDBMS / DWH • DB table export • BI tool + JDBC/ODBC • Search • SQL
  • 7. HDFS • Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System • http://research.google.com/archive/gfs.html
  • 8. Lots of Commodity Machines • Image: Yahoo! Hadoop cluster [ OSCON ’07 ]
  • 9. MapReduce (MR) • Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper • http://research.google.com/archive/mapreduce.html
  • 11. You specify map() and reduce() functions. The framework does the rest.
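
To make the map()/reduce() split concrete, here is a minimal single-process sketch of a word count. It is not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_job` are invented for illustration, and `run_job` stands in for the shuffle-and-group work that the framework would do across machines.

```python
from collections import defaultdict


def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Combine all values collected for one key into a single result."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["big data", "big search"])
print(result)  # {'big': 2, 'data': 1, 'search': 1}
```

Only `map_fn` and `reduce_fn` contain job-specific logic, which is the point of the slide: everything inside `run_job` is what the framework takes off your hands.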
  • 12. Apache HBase • Random, realtime read/write access • Key/value columnar store • (b|tr)illions of rows/columns • Based on Google BigTable • http://research.google.com/archive/bigtable.html
  • 13. Cloudera Hue • Hadoop User Experience • Hadoop is largely command line • Hue provides a UI for end-users • SDK to build your own apps on top
  • 14. Apache Tika • Content analysis toolkit • Simply put, a lot of parsers • Detect/extract metadata/text from documents • HTML • XML • Office • PDF • mbox • More…
  • 15. Apache ZooKeeper • Distributed systems are HARD • Everyone was trying to implement the same subsystems • Bugs lead to race conditions, other bad things • ZK: Highly reliable distributed coordination services • Configuration • Naming • Synchronization • Group Services
  • 16. Cloudera Morphlines • In-memory transformations • Load, parse, transform, process • Records as name-value pairs w/ optional blob/pojo objects • Java library, embedded in your codebase • Used to ETL data from Flume and MR into Solr • Was part of CDK, now part of Kite • http://kitesdk.org
  • 17. Apache Lucene • Java-based index and search • ranked or sorted results • hits streamed through QP • mem(results) mem(collection) • rich/extensible query operators • bool, phrase, range, span, spatial • Features • spellchecking • hit highlighting • tokenization
  • 18. Apache Solr • Enterprise search platform • Based on Apache Lucene • Full-text search • Faceting • NRT indexing • UI
  • 19. Apache Solr – Simple Indexing via CLI
      $ java -jar post.jar solr.xml money.xml
      SimplePostTool: version 1.4
      SimplePostTool: POSTing files to http://localhost:8983/solr/update..
      SimplePostTool: POSTing file solr.xml
      SimplePostTool: POSTing file money.xml
      SimplePostTool: COMMITting Solr index changes..
      $ post.sh *.xml
  • 20. Apache Solr – Document money.xml
      <add>
      <doc>
        <field name="id">USD</field>
        <field name="name">One Dollar</field>
        <field name="manu">Bank of America</field>
        <field name="manu_id_s">boa</field>
        <field name="cat">currency</field>
        <field name="features">Coins and notes</field>
        <field name="price_c">1,USD</field>
        <field name="inStock">true</field>
      </doc>
      <doc>
        <field name="id">EUR</field>
        <field name="name">One Euro</field>
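
The document on the money.xml slide uses Solr's XML update format: an `add` element wrapping `doc` elements, each holding named `field` elements. As a rough sketch, the same payload can be generated with Python's standard library. The helper name `build_add_xml` is made up for this example, and only a few of the slide's fields are included.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field name: value} dicts as an <add> payload."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str with no XML declaration
    return ET.tostring(add, encoding="unicode")


xml_payload = build_add_xml(
    [{"id": "USD", "name": "One Dollar", "inStock": "true"}]
)
print(xml_payload)
```

The resulting string is what a tool such as post.jar would send to the update handler; ElementTree also escapes characters like `&` and `<` in field values, which hand-written XML tends to get wrong.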
  • 21. Apache Solr – More Advanced Indexing • From DB, using Data Import Handler (DIH) • Load a CSV file • POST JSON documents • Index binary documents (uses Tika) • SolrJ for programmatic document creation
  • 22. Apache Solr – Querying • HTTP GET • http://solr:8983/solr/collection1/select/ • Examples • ?q=timestamp:[* TO NOW] • ?q=-instock:false • ?q={!lucene q.op=AND df=text}myfield:foo +bar -bat
  • 23. Apache Solr – Querying • HTTP GET • http://solr:8983/solr/collection1/select/?q=video • Examples • fl=name,id (return only name and id fields) • fl=name,id,score (return relevancy score as well) • fl=*,score (return all fields + relevancy score) • sort=price desc&fl=name,id,price (sort by price desc) • wt=json (return response in JSON format)
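
Because a Solr query is just an HTTP GET with parameters, the URLs on the two querying slides can be assembled with a standard URL encoder. The sketch below only builds the URL and does not contact a server. The host `solr` and collection `collection1` come from the slides; the helper `select_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Endpoint as written on the slides; adjust host and collection for a real cluster.
BASE = "http://solr:8983/solr/collection1/select/"


def select_url(q, **params):
    """Return a select URL with q plus any extra parameters (fl, sort, wt...)."""
    query = {"q": q}
    query.update(params)
    # urlencode percent-encodes commas and turns spaces into '+'
    return BASE + "?" + urlencode(query)


url = select_url("video", fl="name,id,price", sort="price desc", wt="json")
print(url)
```

Letting the encoder handle spaces, brackets, and braces matters for the more complex examples, such as range queries or `{!lucene ...}` local parameters, which are not valid in a URL as typed on the slide.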
  • 24. What the Heck is Faceting? • Generate counts for properties or categories • Links allow drill-down or refine search results • What?
  • 26. Apache Solr – Facets at Query Time • HTTP GET • http://solr:8983/solr/collection1/select/?q=video • All docs, count by category: q=*:*&facet=true&facet.field=cat • All docs, count by category and in-stock status: q=*:*&facet=true&facet.field=cat&facet.field=inStock • Docs matching “ipod”, count by price (above/below $100): q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]
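
To show what those facet parameters ask Solr to compute, here is a small in-memory sketch. The documents in `DOCS` are invented for illustration, and the helpers `facet_field`, `facet_query`, and `facet_params` are hypothetical names, not part of any Solr client. Solr computes the same kind of counts server-side over the matching result set.

```python
from collections import Counter
from urllib.parse import urlencode

# Hypothetical documents, loosely modelled on the fields in money.xml.
DOCS = [
    {"id": "USD", "cat": "currency", "inStock": True, "price": 1.0},
    {"id": "EUR", "cat": "currency", "inStock": True, "price": 1.4},
    {"id": "IPOD", "cat": "electronics", "inStock": False, "price": 399.0},
]


def facet_field(docs, field):
    """Count documents per distinct value of a field (like facet.field)."""
    return Counter(doc[field] for doc in docs if field in doc)


def facet_query(docs, predicate):
    """Count documents matching an arbitrary condition (like facet.query)."""
    return sum(1 for doc in docs if predicate(doc))


def facet_params(q, fields):
    """Build the query string; facet.field is repeated once per field."""
    pairs = [("q", q), ("facet", "true")]
    pairs += [("facet.field", name) for name in fields]
    return urlencode(pairs)


by_category = facet_field(DOCS, "cat")
up_to_100 = facet_query(DOCS, lambda d: 0 <= d["price"] <= 100)
from_100 = facet_query(DOCS, lambda d: d["price"] >= 100)
print(by_category, up_to_100, from_100)
print(facet_params("*:*", ["cat", "inStock"]))
```

Note that both price ranges on the slide use square brackets, which are inclusive, so a document priced at exactly 100 would be counted in both buckets.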
  • 27. Apache Solr – Querying via UI
  • 28. Apache SolrCloud • Integration of Solr + ZooKeeper • Provides for shard failover
  • 29. Cloudera Search • Based on Apache Solr (incl Lucene and SolrCloud) • Fault-tolerance: collections backed by HDFS or HBase • Integration galore: • HBase/Flume/MapReduce w/ Lucene • Hue w/ Solr • Avro w/ Tika • HDFS w/ Solr/Lucene • Sentry w/ Solr
  • 30. Cloudera Search + Hue
  • 31. Cloudera Search + Hue
  • 32. Apologies, I swiped some pretty slides from marketing… Why Search?
  • 33. Search Design Strategy: An Integrated Part of the Hadoop System • One pool of data • One security framework • One set of system resources • One management interface • Diagram labels: Storage (HDFS: TEXT, RCFILE, PARQUET, AVRO, ETC.; HBase: RECORDS) • Integration • Resource Management • Metadata • Engines: Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (CLOUDERA IMPALA), Interactive Search (CLOUDERA SEARCH), Machine Learning (MAHOUT), Math & Statistics (SAS, R)
  • 34. Benefits of Search Integration • Improved Big Data ROI: an interactive experience without technical knowledge; single data set for multiple computing frameworks • Faster Time to Insight: exploratory analysis, esp. unstructured data; broad range of indexing options to accommodate needs • Cost Efficiency: single scalable platform, no incremental investment; no need for separate systems, storage • Solid Foundations & Reliability: Solr in production environments for years; Hadoop-powered reliability and scalability
  • 35. Some quick examples. Search Use Cases
  • 36. Search Use Cases • Powerful, proven search capabilities that let organizations: • Offer easy access to non-technical resources • Explore data prior to processing and modeling • Gain immediate access and find correlations in mission-critical data
  • 37. Monsanto • Scalable, efficient image search for analysis and research • Track plant characteristics throughout their lifecycle • Before: Manual attribute extraction and search queries within database • Now: Parse and index images at acquisition and on demand, index archived images in batch
  • 38. Cloudera: Internal Field Portal • Custom Aggregated Search
  • 39. Cloudera – Internal Field Portal • Single stop for field engineers • Mailing lists: public, private • Tickets: support, development, public ASF • Customer data: accounts, clusters, KB articles • Customer Clusters: configs, audits, logs, events • Books and papers • Discussion forums • Dogfooding, yes • Makes my life easier
  • 40. Cloudera – Internal Field Portal
  • 41. Cloudera – Internal Field Portal • Varied fetchers/observers for web/API content • Content is retrieved via Flume, Sqoop • Search indexes and replicates into HBase • Each collection has collection-specific filters/fields • Provides title, content snippet, link to original • Morphlines extracts books and papers using Tika • Impala for analytics • Future: Use MapReduce to ingest logs
  • 42. Parting thoughts… in no particular order. Summary
  • 43. Search Simplifies Interaction • Explore • Navigate • Correlate • Experts know MapReduce. Savvy people know SQL. Everyone knows Search.
  • 44. Summary • With Hadoop, it depends. • The tools are out there. • Open source software, hooray! • Many interconnected pieces • Many unexplored opportunities • A thriving community awaits you… • Data can make a difference. • Search allows everyone to interact with data. • This is a Big Deal.
  • 45. What’s Next? • Search examples • http://blog.cloudera.com/blog/category/search/ • Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm • Clone our repos! • https://github.com/cloudera
  • 46. Preferably related to the talk… Questions?
  • 47. Thank You! Alex Moundalexis, @technmsg. We’re hiring, kids! Well, not kids.