Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem
Patrick Hunt (@phunt)
Big Data Gurus Meetup July 2013
  
Agenda
•  Big Data and Search – setting the stage
•  Cloudera Search’s Architecture
•  Component deep dive
•  Early performance insights
•  What’s next?
Feel free to ask questions as we go!
  
Why Search?

An Integrated Part of the Hadoop System
One pool of data
One security framework
One set of system resources
One management interface
  
Search Simplifies Interaction
•  User Goals
   •  Explore
   •  Navigate
   •  Correlate
•  Experts know MapReduce
•  Savvy people know SQL
•  Everyone knows Search!
  
Benefits of Search
•  Improved Big Data ROI
   •  An interactive experience without technical knowledge
   •  Single data set for multiple computing frameworks
•  Faster time to insight
   •  Exploratory analysis, esp. unstructured data
   •  Broad range of indexing options to accommodate needs
•  Cost efficiency
   •  Single scalable platform; no incremental investment
   •  No need for separate systems, storage
•  Solid foundations and reliability
   •  Apache Solr in production environments for years
   •  Hadoop-powered reliability and scalability
  
What is Cloudera Search?
•  Full-text, interactive search and faceted navigation
•  Batch, near real-time, and on-demand indexing
•  Apache Solr integrated with CDH
   •  Established, mature search with vibrant community
   •  Separate runtime like MapReduce, Impala
   •  Incorporated as part of the Hadoop ecosystem
•  Open Source
   •  100% Apache, 100% Solr
   •  Standard Solr APIs
  
Cloudera Search Components
•  Refresher – HDFS/MR/Lucene/Solr/SolrCloud
•  HDFSDirectoryFactory/HDFSDirectory
•  BlockDirectory/BlockDirectoryCache
•  Near Real Time (NRT) indexing
   •  Apache Flume MorphlineSolrSink
   •  Lily HBase Indexer
•  Batch – MapReduce Indexer
•  ETL – Cloudera Morphlines
•  Hue Search Application
  
Apache Hadoop
•  Apache HDFS
   •  Distributed file system
   •  High reliability
   •  High throughput
•  Apache MapReduce
   •  Parallel, distributed programming model
   •  Allows processing of large datasets
   •  Fault tolerant
  
Apache Lucene
•  Full text search
   •  Indexing
   •  Query
•  Traditional inverted index
•  Batch and Incremental indexing
•  We are using version 4 (4.3 currently)
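To make the "traditional inverted index" bullet concrete, here is a minimal sketch of the idea: a map from each term to the sorted set of document IDs that contain it. The class and method names are hypothetical and the tokenizer is deliberately naive; this illustrates the data structure only and is not how Lucene stores or analyzes text.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted set of IDs of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Splits on non-alphanumeric characters and records each lowercased term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the IDs of documents containing the term (empty if none). */
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase());
        return hits == null ? Collections.emptySortedSet() : hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Search on Hadoop");
        index.add(2, "Hadoop stores the index in HDFS");
        System.out.println(index.search("hadoop"));
    }
}
```

Adding a document only appends to the postings of its terms, which is why the same structure can serve both batch and incremental indexing.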
  
Apache Solr
•  Search service built using Lucene
•  Ships with Lucene (same TLP at Apache)
•  Provides XML/HTTP/JSON/Python/Ruby/… APIs
   •  Indexing
   •  Query
   •  Administrative interface
•  Also rich web admin GUI via HTTP
  
Apache SolrCloud
•  Provides distributed Search capability
•  Part of Solr (not a separate library/codebase)
•  Shards - both vertically and horizontally scalable
   •  Horizontally – partition index for size
   •  Vertically – replicate for query performance
•  Uses ZooKeeper for coordination
   •  No split-brain issues
   •  Simplifies operations
  
Distributed Search on Hadoop
[Diagram labels: Flume, Hue UI, Custom UI, Custom App; query, index; Solr servers in SolrCloud; Hadoop Cluster: MR, HDFS, HBase]
  
High Level View
[Diagram labels: HDFS, Lucene, Solr, ZooKeeper, SolrCloud, Querying API, Indexing API]
  
Solr on HDFS
•  Scalable, cost-efficient index storage
•  Higher availability
•  Search and process data in one platform
  
Cloudera Upstream Contributions
•  SOLR-3911 - Directory/DirectoryFactory now first class
   •  Solr Replication now uses Directory abstraction
   •  Solr Admin UI no longer assumes local directory access
•  SOLR-4916 – support for reading/writing Solr index files and transaction log files to/from HDFS
   •  HDFSDirectoryFactory/HDFSDirectory implementation
•  SOLR-4655 - The Overseer should assign node names by default.
•  SOLR-3706 - Ship setup to log with log4j
•  SOLR-4494 - Clean up and polish Collections API
•  SOLR-4718 - Improvements to configurability
   •  Configuration now entirely through ZooKeeper (optional)
•  Many more improvements/cleanup/hardening/…
  
Lucene Directory abstraction
•  It’s how Lucene interacts with index files
•  Solr uses it too, but spotty prior to 4.x
	
  
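// Simplified view of org.apache.lucene.store.Directory (Lucene 4.x)
public abstract class Directory {
    public abstract String[] listAll() throws IOException;
    public abstract IndexOutput createOutput(String name, IOContext context) throws IOException;
    public abstract IndexInput openInput(String name, IOContext context) throws IOException;
    public abstract void deleteFile(String name) throws IOException;
    public abstract Lock makeLock(String name);
    public abstract void clearLock(String name) throws IOException;
    …
}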
HDFSDirectory
•  Originally implemented against Lucene 3 by Blur
•  Cloudera ported to Lucene 4 and now upstream
   •  Solr trunk and version 4.4 (upcoming)
•  Uses the HDFS Client API
  
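import org.apache.hadoop.fs.FileSystem;

// Excerpt: index files are opened as streams through the HDFS client
public IndexInput openInput(String name, IOContext context) throws IOException {
    …
    _inputStream = fileSystem.open(path, bufferSize);
    …
}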
  
HDFSDirectoryFactory
•  Enables plugin of HDFSDirectory into Solr
•  Configurable through solrconfig.xml
•  Also handles
   •  Directory configuration
   •  Compositing of Directory(s)
      •  NRTCachingDirectory
      •  BlockDirectory/BlockDirectoryCache
  
BlockDirectory/BlockDirectoryCache
•  In memory cache of index file blocks
•  Caches on read, in some cases on write
•  Compensate for less effective file system cache
•  Uses DirectByteBuffer, not JVM heap (default)
•  Size configurable by user
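The caching idea above can be sketched as a small least-recently-used cache of fixed-size blocks held in direct (off-heap) buffers. This is an illustrative stand-in under stated assumptions, not the BlockDirectoryCache code: the class name, the string key format, and the LRU eviction policy are all hypothetical choices made for the example.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

class BlockCacheSketch {
    private final int blockSize;
    // Access-ordered map: the eldest entry is the least recently used block.
    private final Map<String, ByteBuffer> blocks;

    BlockCacheSketch(int blockSize, final int maxBlocks) {
        this.blockSize = blockSize;
        this.blocks = new LinkedHashMap<String, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, ByteBuffer> eldest) {
                return size() > maxBlocks; // evict once the configured size is exceeded
            }
        };
    }

    /** Copies up to blockSize bytes of a file block into an off-heap buffer. */
    void put(String file, long blockId, byte[] data) {
        ByteBuffer buf = ByteBuffer.allocateDirect(blockSize);
        buf.put(data, 0, Math.min(data.length, blockSize));
        buf.flip();
        blocks.put(file + "#" + blockId, buf);
    }

    /** Returns a copy of the cached block, or null on a cache miss. */
    byte[] get(String file, long blockId) {
        ByteBuffer buf = blocks.get(file + "#" + blockId);
        if (buf == null) {
            return null;
        }
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // read through a duplicate so the cached buffer's position is untouched
        return out;
    }

    int size() {
        return blocks.size();
    }
}
```

Keeping the blocks in direct buffers means the cached bytes do not add to the garbage-collected heap, which is the motivation the slide gives for using DirectByteBuffer by default.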
  
Near Real Time Indexing with Flume
Solr and Flume
•  Data ingest at scale
•  Flexible extraction and mapping
•  Indexing at data ingest
[Diagram labels: Log File, Flume Agent, Indexer; Other Log File, Flume Agent, Indexer; HDFS]
  
Apache Flume - MorphlineSolrSink
•  A Flume Source…
   •  Receives/gathers events
•  A Flume Channel…
   •  Carries the event – MemoryChannel or reliable FileChannel
•  A Flume Sink…
   •  Sends the events on to the next location
•  Flume MorphlineSolrSink
   •  Integrates Cloudera Morphlines library
      •  ETL, more on that in a bit
   •  Does batching
   •  Results sent to Solr for indexing
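The source, channel, and sink roles and the "does batching" point can be sketched with a bounded queue standing in for the channel. None of this is the Flume API: the class and method names are hypothetical, and a real channel adds transactions and (for FileChannel) durability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BatchingSinkSketch {
    private final BlockingQueue<String> channel; // stand-in for an in-memory channel
    private final int batchSize;

    BatchingSinkSketch(int channelCapacity, int batchSize) {
        this.channel = new ArrayBlockingQueue<>(channelCapacity);
        this.batchSize = batchSize;
    }

    /** Source side: hands an event to the channel; returns false if the channel is full. */
    boolean offer(String event) {
        return channel.offer(event);
    }

    /** Sink side: removes at most batchSize events so they can be sent on as one request. */
    List<String> takeBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        channel.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) {
        BatchingSinkSketch sink = new BatchingSinkSketch(100, 2);
        for (int i = 0; i < 5; i++) {
            sink.offer("event-" + i);
        }
        List<String> batch;
        while (!(batch = sink.takeBatch()).isEmpty()) {
            System.out.println(batch); // each batch would be one indexing request
        }
    }
}
```

Batching trades a little latency for fewer, larger requests to the indexing backend.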
  
Near Real Time indexing of Apache HBase
[Diagram labels: interactive load, HBase, HDFS, Indexer(s), Triggers on updates, Solr servers, Search]
planet-sized tabular data, immediate access & updates + fast & flexible information discovery = BIG DATA DATA MANAGEMENT
  
Lily HBase Indexer
•  Collaboration between NGData & Cloudera
   •  NGData are creators of the Lily data management platform
•  Lily HBase Indexer
   •  Service which acts as an HBase replication listener
      •  HBase replication features, such as filtering, supported
   •  Replication updates trigger indexing of updates (rows)
   •  Integrates Cloudera Morphlines library for ETL of rows
   •  AL2 licensed on github https://github.com/ngdata
  
Scalable Batch Indexing
Solr and MapReduce
•  Flexible, scalable batch indexing
•  Start serving new indices with no downtime
•  On-demand indexing, cost-efficient re-indexing
[Diagram labels: HDFS, Files, Indexer, Index shard, Solr server (two of each)]
  
Scalable Batch Indexing
[Diagram: three Mappers (Parse input into indexable document); arbitrary reducing steps of indexing and merging; End-Reducer (shard 1): Index document, producing Index shard 1; End-Reducer (shard 2): Index document, producing Index shard 2]
  
MapReduce Indexer
MapReduce Job with two parts

1) Scan HDFS for files to be indexed
   •  Much like Unix “find” – see HADOOP-8989
   •  Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
   •  Mapper extracts content via Cloudera Morphlines
   •  Reducer indexes documents via embedded Solr server
   •  Originally based on SOLR-1301
      •  Many modifications to enable linear scalability
  
MapReduce Indexer “golive”
•  Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
•  Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
   •  No downtime for users
   •  No NRT expense
   •  Linear scale out to the size of your MR cluster
  
Cloudera Morphlines
•  Open Source framework for simple ETL
•  Ships as part of the Cloudera Developer Kit (CDK)
   •  It’s a Java library
   •  AL2 licensed on github https://github.com/cloudera/cdk
•  Similar to Unix pipelines
   •  Configuration over coding
•  Supports common Hadoop formats
   •  Avro
   •  Sequence file
   •  Text
   •  Etc…
  
Cloudera Morphlines Architecture
[Diagram: Logs, tweets, social media, html, images, pdf, text…. Anything you want to index; Flume, MR Indexer, HBase indexer, etc... Or your application!; Morphline Library; Solr servers in SolrCloud]
Morphlines can be embedded in any application…
  
Extraction and Mapping
•  Simple and flexible data transformation
•  Reusable across multiple index workloads
•  Over time, extend and re-use across platform workloads
[Diagram: syslog, Flume Agent, Solr sink, Solr. Inside the sink, the Morphline Library takes an Event, passes Records through Command: readLine, Command: grok, and Command: loadSolr, and emits a Document.]
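The "chain of commands" shape of a morphline can be sketched as a list of functions applied to a record in order. This is a simplified illustration and not the Morphlines Java API: the record is modeled as a plain map, and the three commands in `main` are hypothetical stand-ins that only loosely mirror readLine, grok, and loadSolr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class CommandChainSketch {
    /** Runs a record through each command in order; a null result drops the record. */
    static Map<String, String> run(List<UnaryOperator<Map<String, String>>> commands,
                                   Map<String, String> record) {
        for (UnaryOperator<Map<String, String>> command : commands) {
            if (record == null) {
                return null;
            }
            record = command.apply(record);
        }
        return record;
    }

    public static void main(String[] args) {
        List<Map<String, String>> loaded = new ArrayList<>(); // stand-in for Solr

        // Hypothetical command: turn the raw event body into a "message" field.
        UnaryOperator<Map<String, String>> readLine = r -> {
            r.put("message", r.remove("raw").trim());
            return r;
        };
        // Hypothetical command: split "program: text"; drop records that do not match.
        UnaryOperator<Map<String, String>> extract = r -> {
            String[] parts = r.get("message").split(": ", 2);
            if (parts.length < 2) {
                return null;
            }
            r.put("program", parts[0]);
            r.put("text", parts[1]);
            return r;
        };
        // Hypothetical command: hand the finished record to the index.
        UnaryOperator<Map<String, String>> load = r -> {
            loaded.add(r);
            return r;
        };

        Map<String, String> event = new LinkedHashMap<>();
        event.put("raw", "sshd: listening on 0.0.0.0 port 22\n");
        run(Arrays.asList(readLine, extract, load), event);
        System.out.println(loaded);
    }
}
```

Because each command only sees a record and returns a record, the same chain can be reused from Flume, a MapReduce job, or any other host application.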
  
Current Command Library
•  Integrate with and load into Apache Solr
•  Flexible log file analysis
•  Single-line record, multi-line records, CSV files
•  Regex based pattern matching and extraction
•  Integration with Avro
•  Integration with Apache Hadoop Sequence Files
•  Integration with SolrCell and all Apache Tika parsers
•  Auto-detection of MIME types from binary data using Apache Tika
  
Current Command Library (cont)
•  Scripting support for dynamic java code
•  Operations on fields for assignment and comparison
•  Operations on fields with list and set semantics
•  if-then-else conditionals
•  A small rules engine (tryRules)
•  String and timestamp conversions
•  slf4j logging
•  Yammer metrics and counters
•  Decompression and unpacking of arbitrarily nested container file formats
•  Etc…
  
Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
  
Example Input
<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
Output Record
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
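To show what the grok expression extracts from the example input, here is a rough equivalent written with plain java.util.regex. The pattern is a hand-expanded approximation of the named grok patterns (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA), not the definitions from the grok dictionaries, and the class name is hypothetical. Java group names cannot contain underscores, so the `syslog_` prefix is added when the record is built.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyslogGrokSketch {
    // Hand-expanded approximation of the grok expression on the slide.
    private static final Pattern SYSLOG = Pattern.compile(
        "<(?<pri>\\d+)>"
        + "(?<timestamp>[A-Z][a-z]{2} +\\d{1,2} \\d{2}:\\d{2}:\\d{2}) "
        + "(?<hostname>\\S+) "
        + "(?<program>.*?)(?:\\[(?<pid>\\d+)\\])?: "
        + "(?<message>.*)");

    private static final String[] GROUPS =
        {"pri", "timestamp", "hostname", "program", "pid", "message"};

    /** Returns the extracted fields in order, or an empty map if the line does not match. */
    static Map<String, String> parse(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        Matcher m = SYSLOG.matcher(line);
        if (m.matches()) {
            for (String group : GROUPS) {
                String value = m.group(group);
                if (value != null) { // the pid group is optional
                    record.put("syslog_" + group, value);
                }
            }
        }
        return record;
    }

    public static void main(String[] args) {
        String input = "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22";
        parse(input).forEach((field, value) -> System.out.println(field + ":" + value));
    }
}
```

The escaped brackets around the pid group matter: left unescaped, `[...]` would be read as a character class rather than literal brackets.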
  
	
  
	
  
Simple, Customizable Search Interface
Hue
•  Simple UI
•  Navigated, faceted drill down
•  Customizable display
•  Full text search, standard Solr API and query language
  
Performance
•  Cloudera internal testing results
•  Cisco WebEx results from Hadoop Summit 2013
  
Cloudera Internal Testing
•  We’ve looked at
   •  NRT and Batch indexing
   •  Query performance
•  Performance has been similar to Solr on local disk
   •  Indexing/query operations are typically CPU bound
   •  Caching obviously plays a big factor for queries
•  Limited use cases explored – public beta helping here!
  
Details shared by WebEx at 2013 Summit
•  Cisco presented on their use of Flume, Cloudera Search, and Cloudera Morphlines
•  Indexing log events in Near Real Time via Flume
•  Cisco UCS C240 M3 servers
   •  2 quad cores @ 2.3 GHz
   •  16 GB RAM
   •  12 x 3 TB storage
•  Ingest rate
   •  70k events/sec, 1.2 TB/day inbound
  
What’s next
•  Usability – “solrctl”
•  Security
   •  Index, Document and (eventually) Field level security
•  Lots of scalability/performance work to be done
   •  What are the best Solr/Lucene settings for HDFS?
   •  Investigate short circuit HDFS reads
   •  BlockDirectoryCache tuning
   •  HDFS block affinity
•  More sophisticated index management
   •  Take advantage of collection alias support (SOLR-4497)
  
Conclusion
•  Cloudera Search now in public beta
   •  Free Download
   •  Extensive documentation
   •  Send your questions and feedback to search-user@cloudera.org
   •  Take the Search online training
•  Cloudera Manager Standard (i.e. the free version)
   •  Simple management of Search
   •  Free Download
•  QuickStart VM also available!
  

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 

Recently uploaded (20)

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 

Search On Hadoop

  • 1. Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem
      Patrick Hunt (@phunt)
      Big Data Gurus Meetup, July 2013
  • 2. Agenda
      • Big Data and Search: setting the stage
      • Cloudera Search's Architecture
      • Component deep dive
      • Early performance insights
      • What's next?
      Feel free to ask questions as we go!
  • 3. Why Search?
      An Integrated Part of the Hadoop System
      One pool of data
      One security framework
      One set of system resources
      One management interface
  • 4. Search Simplifies Interaction
      • User Goals
      • Explore
      • Navigate
      • Correlate
      • Experts know MapReduce
      • Savvy people know SQL
      • Everyone knows Search!
  • 5. Benefits of Search
      • Improved Big Data ROI
      • An interactive experience without technical knowledge
      • Single data set for multiple computing frameworks
      • Faster time to insight
      • Exploratory analysis, esp. unstructured data
      • Broad range of indexing options to accommodate needs
      • Cost efficiency
      • Single scalable platform; no incremental investment
      • No need for separate systems, storage
      • Solid foundations and reliability
      • Apache Solr in production environments for years
      • Hadoop-powered reliability and scalability
  • 6. What is Cloudera Search?
      • Full-text, interactive search and faceted navigation
      • Batch, near real-time, and on-demand indexing
      • Apache Solr integrated with CDH
      • Established, mature search with vibrant community
      • Separate runtime like MapReduce, Impala
      • Incorporated as part of the Hadoop ecosystem
      • Open Source
      • 100% Apache, 100% Solr
      • Standard Solr APIs
  • 7. Cloudera Search Components
      • Refresher: HDFS/MR/Lucene/Solr/SolrCloud
      • HDFSDirectoryFactory/HDFSDirectory
      • BlockDirectory/BlockDirectoryCache
      • Near Real Time (NRT) indexing
      • Apache Flume MorphlineSolrSink
      • Lily HBase Indexer
      • Batch: MapReduce Indexer
      • ETL: Cloudera Morphlines
      • Hue Search Application
  • 8. Apache Hadoop
      • Apache HDFS
      • Distributed file system
      • High reliability
      • High throughput
      • Apache MapReduce
      • Parallel, distributed programming model
      • Allows processing of large datasets
      • Fault tolerant
  • 9. Apache Lucene
      • Full text search
      • Indexing
      • Query
      • Traditional inverted index
      • Batch and Incremental indexing
      • We are using version 4 (4.3 currently)
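To make the "traditional inverted index" bullet concrete, here is a minimal Java sketch of the idea: each term maps to the set of documents that contain it, and a query intersects those sets. The class name ToyInvertedIndex and its methods are invented for illustration; this is not Lucene's API or its on-disk format.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document ids containing it.
// Illustration only; it is not Lucene's implementation or API.
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Indexing: lowercase, split on non-alphanumerics, record the doc id under each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Query: ids of documents that contain every given term (boolean AND).
    SortedSet<Integer> search(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Search on Hadoop");
        index.add(2, "Hadoop stores the index in HDFS");
        index.add(3, "Incremental indexing adds documents later");
        System.out.println(index.search("hadoop"));          // [1, 2]
        System.out.println(index.search("hadoop", "index")); // [2]
    }
}
```

Because add() can be called at any time, the same structure covers both batch and incremental indexing in this toy form.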
  • 10. Apache Solr
      • Search service built using Lucene
      • Ships with Lucene (same TLP at Apache)
      • Provides XML/HTTP/JSON/Python/Ruby/… APIs
      • Indexing
      • Query
      • Administrative interface
      • Also rich web admin GUI via HTTP
  • 11. Apache SolrCloud
      • Provides distributed Search capability
      • Part of Solr (not a separate library/codebase)
      • Shards - both vertically and horizontally scalable
      • Horizontally: partition index for size
      • Vertically: replicate for query performance
      • Uses ZooKeeper for coordination
      • No split-brain issues
      • Simplifies operations
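The two scaling directions on this slide can be sketched in a few lines of Java. The hash-modulo routing and the round-robin replica choice below are assumptions made for the example, and the class name ToyShardRouter is hypothetical; the sketch does not describe how SolrCloud itself routes documents or balances queries.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: partitions documents across shards by hashing the id,
// and spreads queries across the replicas of each shard in round-robin order.
class ToyShardRouter {
    private final int numShards;
    private final int replicasPerShard;
    private final int[] nextReplica; // per-shard round-robin cursor

    ToyShardRouter(int numShards, int replicasPerShard) {
        this.numShards = numShards;
        this.replicasPerShard = replicasPerShard;
        this.nextReplica = new int[numShards];
    }

    // Partitioning: every document id maps to exactly one shard.
    int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // Replication: a query visits every shard, but only one replica of each.
    List<String> replicasForQuery() {
        List<String> targets = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            int replica = nextReplica[shard];
            nextReplica[shard] = (replica + 1) % replicasPerShard;
            targets.add("shard" + shard + "_replica" + replica);
        }
        return targets;
    }

    public static void main(String[] args) {
        ToyShardRouter router = new ToyShardRouter(2, 2);
        System.out.println("doc-42 -> shard " + router.shardFor("doc-42"));
        System.out.println(router.replicasForQuery()); // [shard0_replica0, shard1_replica0]
        System.out.println(router.replicasForQuery()); // [shard0_replica1, shard1_replica1]
    }
}
```

Adding shards shrinks each partition of the index; adding replicas lets consecutive queries land on different servers.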
  • 12. Distributed Search on Hadoop
      [Architecture diagram. Components shown: Flume, Hue UI, Custom UI, Custom App, three Solr servers forming a SolrCloud, and a Hadoop cluster with MR, HDFS, and HBase. Arrows are labeled "query" and "index".]
  • 13. High Level View
      [Stack diagram. Labels: HDFS, Lucene, Solr, ZooKeeper, SolrCloud, Querying API, Indexing API]
      Solr on HDFS
      • Scalable, cost-efficient index storage
      • Higher availability
      • Search and process data in one platform
  • 14. Cloudera Upstream Contributions
      • SOLR-3911 - Directory/DirectoryFactory now first class
      • Solr Replication now uses Directory abstraction
      • Solr Admin UI no longer assumes local directory access
      • SOLR-4916 - support for reading/writing Solr index files and transaction log files to/from HDFS
      • HDFSDirectoryFactory/HDFSDirectory implementation
      • SOLR-4655 - The Overseer should assign node names by default.
      • SOLR-3706 - Ship setup to log with log4j
      • SOLR-4494 - Clean up and polish Collections API
      • SOLR-4718 - Improvements to configurability
      • Configuration now entirely through ZooKeeper (optional)
      • Many more improvements/cleanup/hardening/…
  • 15. Lucene Directory abstraction
      • It's how Lucene interacts with index files
      • Solr uses it too, but usage was spotty prior to 4.x
      class Directory {
        listAll();
        createOutput(file, context);
        openInput(file, context);
        deleteFile(file);
        makeLock(file);
        clearLock(file);
        …
      }
  • 16. HDFSDirectory
      • Originally implemented against Lucene 3 by Blur
      • Cloudera ported to Lucene 4 and now upstream
      • Solr trunk and version 4.4 (upcoming)
      • Uses the HDFS Client API
      import org.apache.hadoop.fs.FileSystem;
      public IndexInput openInput(file, context) {
        …
        _inputStream = fileSystem.open(path, bufferSize);
        …
      }
  • 17. HDFSDirectoryFactory
      • Enables plugin of HDFSDirectory into Solr
      • Configurable through solrconfig.xml
      • Also handles
      • Directory configuration
      • Compositing of Directory(s)
      • NRTCachingDirectory
      • BlockDirectory/BlockDirectoryCache
  • 18. BlockDirectory/BlockDirectoryCache
      • In memory cache of index file blocks
      • Caches on read, in some cases on write
      • Compensate for less effective file system cache
      • Uses DirectByteBuffer, not JVM heap (default)
      • Size configurable by user
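As a rough picture of what an in-memory block cache backed by direct buffers looks like, the sketch below keeps a bounded number of fixed-size blocks off the JVM heap and evicts the least recently used one. The class ToyBlockCache, the 8 KB block size, and the LRU policy are all assumptions made for the example; the real BlockDirectoryCache may differ in each of these.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache of file blocks held in direct (off-heap) buffers.
// Not the BlockDirectoryCache implementation; names and sizes are assumptions.
class ToyBlockCache {
    static final int BLOCK_SIZE = 8192; // assumed block size in bytes

    private final LinkedHashMap<String, ByteBuffer> blocks;
    int hits = 0;
    int misses = 0;

    ToyBlockCache(final int maxBlocks) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.blocks = new LinkedHashMap<String, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, ByteBuffer> eldest) {
                return size() > maxBlocks; // user-configurable cache size
            }
        };
    }

    private static String key(String file, long blockId) {
        return file + "@" + blockId;
    }

    // Cache on read: copy the block's bytes into a direct buffer.
    void put(String file, long blockId, byte[] data) {
        ByteBuffer buf = ByteBuffer.allocateDirect(BLOCK_SIZE);
        buf.put(data, 0, Math.min(data.length, BLOCK_SIZE));
        buf.flip();
        blocks.put(key(file, blockId), buf);
    }

    // Returns a copy of the cached bytes, or null on a miss
    // (a real caller would then read the block from HDFS and put() it).
    byte[] get(String file, long blockId) {
        ByteBuffer buf = blocks.get(key(file, blockId));
        if (buf == null) {
            misses++;
            return null;
        }
        hits++;
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() leaves the cached buffer's position untouched
        return out;
    }

    public static void main(String[] args) {
        ToyBlockCache cache = new ToyBlockCache(2);
        cache.put("segment_1.dat", 0, new byte[] {1, 2, 3});
        cache.put("segment_1.dat", 1, new byte[] {4});
        cache.get("segment_1.dat", 0);                 // hit, block 0 becomes most recent
        cache.put("segment_1.dat", 2, new byte[] {5}); // evicts block 1
        System.out.println(cache.get("segment_1.dat", 1) == null); // true
        System.out.println("hits=" + cache.hits + " misses=" + cache.misses);
    }
}
```

Keeping blocks in direct buffers means the cached index data does not add to garbage-collected heap, which is the point of the "not JVM heap" bullet.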
  • 19. Near Real Time Indexing with Flume
      Solr and Flume
      • Data ingest at scale
      • Flexible extraction and mapping
      • Indexing at data ingest
      [Diagram. Labels: Log File, Other, Flume Agent, Indexer, HDFS]
  • 20. Apache Flume - MorphlineSolrSink
      • A Flume Source…
      • Receives/gathers events
      • A Flume Channel…
      • Carries the event: MemoryChannel or reliable FileChannel
      • A Flume Sink…
      • Sends the events on to the next location
      • Flume MorphlineSolrSink
      • Integrates Cloudera Morphlines library
      • ETL, more on that in a bit
      • Does batching
      • Results sent to Solr for indexing
  • 21. Near Real Time indexing of Apache HBase
      [Diagram. Labels: HDFS, HBase, interactive load, Indexer(s), Triggers on updates, Solr servers, Search]
      planet-sized tabular data, immediate access & updates + fast & flexible information discovery = BIG DATA DATA MANAGEMENT
  • 22. Lily HBase Indexer
      • Collaboration between NGData & Cloudera
      • NGData are creators of the Lily data management platform
      • Lily HBase Indexer
      • Service which acts as an HBase replication listener
      • HBase replication features, such as filtering, supported
      • Replication updates trigger indexing of updates (rows)
      • Integrates Cloudera Morphlines library for ETL of rows
      • AL2 licensed on GitHub: https://github.com/ngdata
  • 23. Scalable Batch Indexing
      Solr and MapReduce
      • Flexible, scalable batch indexing
      • Start serving new indices with no downtime
      • On-demand indexing, cost-efficient re-indexing
      [Diagram. Labels: HDFS, Files, Indexer, Index shard, Solr server]
  • 24. Scalable Batch Indexing
      [Data-flow diagram]
      Mapper (three shown): Parse input into indexable document
      Arbitrary reducing steps of indexing and merging
      End-Reducer (shard 1): Index document (Index shard 1)
      End-Reducer (shard 2): Index document (Index shard 2)
  • 25. MapReduce Indexer
      MapReduce Job with two parts
      1) Scan HDFS for files to be indexed
      • Much like Unix "find" - see HADOOP-8989
      • Output is NLineInputFormat'ed file
      2) Mapper/Reducer indexing step
      • Mapper extracts content via Cloudera Morphlines
      • Reducer indexes documents via embedded Solr server
      • Originally based on SOLR-1301
      • Many modifications to enable linear scalability
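The mapper/reducer split described here can be imitated in memory to show the data flow: mappers turn raw input into documents, a partitioner assigns each document to a shard, and one reducer per shard builds that shard's index. Everything below (the class ToyBatchIndexer, the tab-separated input format, hash-modulo partitioning) is an assumption made for illustration; it uses neither Hadoop nor Solr and is not the MapReduce Indexer's code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory imitation of the map and reduce phases; no Hadoop or Solr involved.
class ToyBatchIndexer {
    // "Mapper": parse one raw input line ("id<TAB>text") into an (id, text) document.
    static Map.Entry<String, String> map(String line) {
        String[] parts = line.split("\t", 2);
        return new AbstractMap.SimpleEntry<>(parts[0], parts.length > 1 ? parts[1] : "");
    }

    // Partitioner: choose the reducer (one per shard) for a document id.
    static int partition(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // "Reducer": build one shard's index (term -> ids of documents containing it).
    static Map<String, Set<String>> reduce(List<Map.Entry<String, String>> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs) {
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    // Driver: map every line, shuffle by shard, then reduce each shard independently.
    static List<Map<String, Set<String>>> run(List<String> lines, int numShards) {
        List<List<Map.Entry<String, String>>> shuffled = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shuffled.add(new ArrayList<>());
        }
        for (String line : lines) {
            Map.Entry<String, String> doc = map(line);
            shuffled.get(partition(doc.getKey(), numShards)).add(doc);
        }
        List<Map<String, Set<String>>> shards = new ArrayList<>();
        for (List<Map.Entry<String, String>> docs : shuffled) {
            shards.add(reduce(docs));
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "a\thello world", "b\thello hadoop", "c\tsearch on hadoop");
        List<Map<String, Set<String>>> shards = run(lines, 2);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + ": " + shards.get(i));
        }
    }
}
```

Since each shard is reduced independently, adding reducers (and shards) is what lets this pattern scale out with the size of the cluster.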
  • 26. MapReduce Indexer "golive"
      • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
      • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
      • No downtime for users
      • No NRT expense
      • Linear scale out to the size of your MR cluster
  • 27. Cloudera Morphlines
      • Open Source framework for simple ETL
      • Ships as part of the Cloudera Developer Kit (CDK)
      • It's a Java library
      • AL2 licensed on GitHub: https://github.com/cloudera/cdk
      • Similar to Unix pipelines
      • Configuration over coding
      • Supports common Hadoop formats
      • Avro
      • Sequence file
      • Text
      • Etc…
  • 28. Cloudera Morphlines Architecture
      [Diagram. Labels: Solr servers in SolrCloud, Morphline Library]
      Logs, tweets, social media, html, images, pdf, text… Anything you want to index
      Flume, MR Indexer, HBase indexer, etc... Or your application!
      Morphlines can be embedded in any application…
  • 29. Extraction and Mapping
      • Simple and flexible data transformation
      • Reusable across multiple index workloads
      • Over time, extend and re-use across platform workloads
      [Diagram. Labels: syslog, Flume Agent, Solr sink, Morphline Library, Command: readLine, Command: grok, Command: loadSolr, Solr, Event, Record, Document]
  • 30. Current Command Library
      • Integrate with and load into Apache Solr
      • Flexible log file analysis
      • Single-line record, multi-line records, CSV files
      • Regex based pattern matching and extraction
      • Integration with Avro
      • Integration with Apache Hadoop Sequence Files
      • Integration with SolrCell and all Apache Tika parsers
      • Auto-detection of MIME types from binary data using Apache Tika
  • 31. Current Command Library (cont)
      • Scripting support for dynamic Java code
      • Operations on fields for assignment and comparison
      • Operations on fields with list and set semantics
      • if-then-else conditionals
      • A small rules engine (tryRules)
      • String and timestamp conversions
      • slf4j logging
      • Yammer metrics and counters
      • Decompression and unpacking of arbitrarily nested container file formats
      • Etc…
  • 32. Morphline Example - syslog with grok
      morphlines : [
        {
          id : morphline1
          importCommands : ["com.cloudera.**", "org.apache.solr.**"]
          commands : [
            { readLine {} }
            {
              grok {
                dictionaryFiles : [/tmp/grok-dictionaries]
                expressions : {
                  message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
                }
              }
            }
            { loadSolr {} }
          ]
        }
      ]
      Example Input
      <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
      Output Record
      syslog_pri:164
      syslog_timestamp:Feb  4 10:46:14
      syslog_hostname:syslog
      syslog_program:sshd
      syslog_pid:607
      syslog_message:listening on 0.0.0.0 port 22
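For readers unfamiliar with grok, the extraction above amounts to a regular expression with named capture groups. The sketch below reproduces the example's input and output fields with java.util.regex alone. The pattern is a hand-written approximation, not the definitions from the grok dictionary files, and the class ToySyslogGrok is hypothetical. Java group names cannot contain underscores, so the syslog_ prefix is added when the record is built.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written approximation of the slide's grok expression using plain Java regex.
class ToySyslogGrok {
    private static final Pattern SYSLOG = Pattern.compile(
        "<(?<pri>\\d+)>"
        + "(?<timestamp>[A-Z][a-z]{2} +\\d{1,2} \\d{2}:\\d{2}:\\d{2}) "
        + "(?<hostname>\\S+) "
        + "(?<program>[^\\[:\\s]+)(?:\\[(?<pid>\\d+)\\])?: "
        + "(?<message>.*)");

    private static final String[] FIELDS =
        {"pri", "timestamp", "hostname", "program", "pid", "message"};

    // Returns the extracted fields, or an empty map if the line does not match.
    static Map<String, String> parse(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        Matcher m = SYSLOG.matcher(line);
        if (m.matches()) {
            for (String field : FIELDS) {
                String value = m.group(field);
                if (value != null) { // the pid group is optional
                    record.put("syslog_" + field, value);
                }
            }
        }
        return record;
    }

    public static void main(String[] args) {
        parse("<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22")
            .forEach((k, v) -> System.out.println(k + ":" + v));
    }
}
```

Running main prints the same six field:value pairs as the Output Record on the slide. In a morphline the equivalent work is declared in configuration, which is the "configuration over coding" trade-off the deck describes.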
  • 33. Simple, Customizable Search Interface
      Hue
      • Simple UI
      • Navigated, faceted drill down
      • Customizable display
      • Full text search, standard Solr API and query language
  • 34. Performance
      • Cloudera internal testing results
      • Cisco WebEx results from Hadoop Summit 2013
  • 35. Cloudera Internal Testing
      • We've looked at
      • NRT and Batch indexing
      • Query performance
      • Performance has been similar to Solr on local disk
      • Indexing/query operations are typically CPU bound
      • Caching is obviously a big factor for queries
      • Limited use cases explored - public beta helping here!
  • 36. Details shared by WebEx at 2013 Summit
      • Cisco presented on their use of Flume, Cloudera Search, and Cloudera Morphlines
      • Indexing log events in Near Real Time via Flume
      • Cisco UCS C240 M3 servers
      • 2 quad cores @ 2.3 GHz
      • 16 GB RAM
      • 12 x 3 TB storage
      • Ingest rate
      • 70k events/sec, 1.2 TB/day inbound
  • 37. What's next
      • Usability: "solrctl"
      • Security
      • Index, Document and (eventually) Field level security
      • Lots of scalability/performance work to be done
      • What are the best Solr/Lucene settings for HDFS?
      • Investigate short circuit HDFS reads
      • BlockDirectoryCache tuning
      • HDFS block affinity
      • More sophisticated index management
      • Take advantage of collection alias support (SOLR-4497)
  • 38. Conclusion
      • Cloudera Search now in public beta
      • Free Download
      • Extensive documentation
      • Send your questions and feedback to search-user@cloudera.org
      • Take the Search online training
      • Cloudera Manager Standard (i.e. the free version)
      • Simple management of Search
      • Free Download
      • QuickStart VM also available!