Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Open Source Search Conference, November 2013
Alex Moundalexis
@technmsg

Thoughts of a Former SA Field Guy

Disclaimer
•  Technologies, not products
•  Cloudera builds software
    •  most donated to Apache
    •  some closed-source
•  I will likely mention “Cloudera Something”
•  Cloudera “products” I reference are open source
    •  Apache Licensed
    •  Source code is on GitHub
    •  https://github.com/cloudera

What This Talk Isn’t About
•  Deploying
    •  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  Sizing & Tuning
    •  Depends heavily on data and workload
•  Coding
•  Algorithms

“The answer to most Hadoop questions is it depends.”

The Apache Hadoop Ecosystem
Quick and dirty, more time for use cases.

Why “Ecosystem?”
•  In the beginning, just Hadoop
    •  HDFS
    •  MapReduce
•  Today, dozens of interrelated components
    •  I/O
    •  Processing
    •  Specialty Applications
    •  Configuration
    •  Workflow

Partial Ecosystem
[Diagram labels: Hadoop; external system; RDBMS / DWH; web server; device logs; API access; log collection; DB table import; batch processing; machine learning; external system; API access; user; RDBMS / DWH; DB table export; BI tool + JDBC/ODBC; Search; SQL]

HDFS
•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
    •  http://research.google.com/archive/gfs.html

Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)
•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
    •  http://research.google.com/archive/mapreduce.html

Under the Covers
You specify map() and reduce() functions.
The framework does the rest.

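To make that division of labor concrete, here is a minimal single-process sketch in Python. It only imitates the model: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, not Hadoop APIs, and a real cluster would spread the map, shuffle, and reduce phases across many machines.

```python
from collections import defaultdict


def map_fn(line):
    """User-supplied map(): emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-supplied reduce(): combine all values seen for one key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Stand-in for "the framework": map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog"], map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` carry the word-count logic; `run_job` would stay the same for any other job, which is the point of the paradigm.
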
Apache HBase
•  Random, realtime read/write access
•  Key/value columnar store
•  (b|tr)illions of rows/columns
•  Based on Google BigTable
    •  http://research.google.com/archive/bigtable.html

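As a rough illustration of the data model, the sketch below keeps row keys sorted and stores each row as a small map of columns. `TinyTable` is a hypothetical in-memory toy, not the HBase client API; it only shows why random reads and key-range scans fit naturally on a sorted key/value layout.

```python
import bisect


class TinyTable:
    """Toy in-memory table: sorted row keys, each mapping columns to values."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted so range scans are cheap
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row, column, value):
        """Write one cell, creating the row on first use."""
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row, column):
        """Random read of a single cell; None if the cell is absent."""
        return self._rows.get(row, {}).get(column)

    def scan(self, start, stop):
        """Yield (row, columns) for start <= row < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])
```

Rows do not need to share the same columns, so the table can stay sparse; the key prefixes used to group related rows (for example `user#`) are an assumption of this example.
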
Apache Accumulo
•  Random, realtime read/write access
•  Key/value columnar store
•  (b|tr)illions of rows/columns
•  Based on Google BigTable
    •  http://research.google.com/archive/bigtable.html
•  Adds cell-level security
•  Implemented by National Security Agency
•  Donated to ASF

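Cell-level security means every stored cell carries its own visibility label, and a read returns only the cells the reader is authorized to see. The sketch below is a deliberately simplified model: `visible_cells` and the sample data are hypothetical, and a label here is just a set of tokens the reader must all hold, with no OR or nesting.

```python
def visible_cells(cells, authorizations):
    """Return only the cells whose label is satisfied by the reader.

    cells: iterable of (row, column, required_labels, value) tuples.
    required_labels: set of tokens the reader must ALL hold (simplified).
    """
    auths = set(authorizations)
    return [
        (row, column, value)
        for row, column, required, value in cells
        if set(required) <= auths  # every required token is held
    ]


# Hypothetical sample data: three cells in one row, each labeled differently.
CELLS = [
    ("patient1", "name", set(), "public name"),
    ("patient1", "diagnosis", {"medical"}, "restricted"),
    ("patient1", "billing", {"medical", "finance"}, "more restricted"),
]
```

Two readers scanning the same row can therefore get different results without the application filtering anything itself.
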
Apache Hive & Pig
•  Abstraction of Hadoop’s Java API
•  Hive is SQL-based
•  Pig is more data-flow oriented
•  Eases analysis using MapReduce

Cloudera Impala
•  SQL-based, but interactive response
•  Backed by HDFS or HBase
•  Allows for fast iteration/discovery
•  Not as fault-tolerant as MapReduce

Apache Sqoop & Flume
•  Get your data in and out of HDFS
•  Sqoop focuses on relational databases
•  Flume focuses on log files

Cloudera Hue
•  Hadoop User Experience
•  Hadoop is largely command line
•  Hue provides a UI for end-users
•  SDK to build your own apps on top

Apache Mahout
•  Machine learning algorithms that run on MapReduce
    •  Clustering
    •  Classification
    •  Filtering
•  I didn’t study these algorithms in school
    •  Data science people are excited
    •  Math people are excited
    •  I’m excited for them

Apache Tika
•  Content analysis toolkit
•  Simply put, a lot of parsers
•  Detect/extract metadata/text from documents
    •  HTML
    •  XML
    •  Office
    •  PDF
    •  mbox
    •  More…

Apache ZooKeeper
•  Distributed systems are HARD
•  Everyone was trying to implement the same subsystems
•  Bugs lead to race conditions, other bad things
•  ZK: Highly reliable distributed coordination services
    •  Configuration
    •  Naming
    •  Synchronization
    •  Group Services

Apache Oozie
•  Workflow scheduling for Hadoop
•  Like cron, but in directed graph fashion
•  Out of box hooks:
    •  MR
    •  Pig
    •  Hive
    •  Sqoop
    •  Impala

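The "directed graph" part is what separates a workflow scheduler from plain cron: an action runs only after the actions it depends on have finished. The sketch below is not Oozie's workflow format; the action names and the `execution_order` helper are hypothetical, and it uses Python's standard `graphlib` only to show the ordering idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it waits for.
WORKFLOW = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "mr_build_index": {"pig_clean"},
    "impala_report": {"hive_aggregate", "mr_build_index"},
}


def execution_order(workflow):
    """Return one valid run order in which every action follows its dependencies."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

`hive_aggregate` and `mr_build_index` do not depend on each other, so a scheduler would be free to run them in parallel once `pig_clean` completes.
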
Sentry (incubating)
•  Role-based access control for Hive/Impala/Solr
•  Regulatory/compliance assurance

Cloudera Morphlines
•  In-memory transformations
•  Load, parse, transform, process
•  Records as name-value pairs w/ optional blob/pojo objects
•  Java library, embedded in your codebase
•  Used to ETL data from Flume and MR into Solr

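The idea of records as name-value pairs flowing through a chain of in-memory commands can be sketched in a few lines. This is a Python analogy, not the Morphlines Java library or its configuration syntax; every function name below is hypothetical.

```python
def parse_line(record):
    """Split a raw 'LEVEL message' log line into separate fields."""
    level, _, message = record["raw"].partition(" ")
    return {**record, "level": level, "message": message}


def drop_debug(record):
    """Returning None drops the record from the pipeline."""
    return None if record["level"] == "DEBUG" else record


def lowercase_message(record):
    """Normalize the message field before indexing."""
    return {**record, "message": record["message"].lower()}


def run_pipeline(records, commands):
    """Pass each record through the commands in order, stopping at a drop."""
    for record in records:
        for command in commands:
            record = command(record)
            if record is None:
                break
        else:
            yield record  # reached only if no command dropped the record
```

Usage: `list(run_pipeline([{"raw": "ERROR Disk FULL"}], [parse_line, drop_debug, lowercase_message]))` produces one record with `level` and `message` fields added, ready to hand to an indexer.
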
Apache Lucene
•  Java-based index and search
•  Spellchecking
•  Hit highlighting
•  Tokenization

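Index-and-search boils down to tokenizing text and keeping an inverted index from terms to documents. The sketch below is a toy in Python, not Lucene's Java API: `tokenize`, `build_index`, and `search` are hypothetical names, and real analyzers do far more (stemming, stop words, scoring).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))
```

Because a lookup touches only the entries for the query terms, search time depends on the terms rather than on scanning every document.
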
Apache Solr
•  Enterprise search platform
•  Based on Apache Lucene
•  Full-text search
•  Faceting
•  NRT indexing

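Faceting is the feature behind "filter by category" sidebars: alongside the hits, the engine reports how many matching documents fall under each value of a field. A minimal sketch in Python, assuming hypothetical documents and a hypothetical `facet_counts` helper rather than Solr's actual query API:

```python
from collections import Counter


def facet_counts(docs, field, matches=lambda doc: True):
    """Count the values of `field` across the documents that match."""
    return Counter(doc[field] for doc in docs if matches(doc) and field in doc)


# Hypothetical documents, each with a couple of facetable fields.
DOCS = [
    {"title": "Intro to HDFS", "type": "kb", "product": "hdfs"},
    {"title": "HBase tuning", "type": "ticket", "product": "hbase"},
    {"title": "HBase compaction", "type": "kb", "product": "hbase"},
]

if __name__ == "__main__":
    print(facet_counts(DOCS, "type"))
    print(facet_counts(DOCS, "type", lambda d: d["product"] == "hbase"))
```

Narrowing the query changes the counts, which is what lets a user drill down one facet at a time.
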
Apache SolrCloud
•  Integration of Solr + ZooKeeper
•  Provides for shard failover

Cloudera Search
•  Based on Apache Solr (incl Lucene and SolrCloud)
•  Fault-tolerance: collections backed by HDFS or HBase
•  Integration galore:
    •  HBase/Flume/MapReduce w/ Lucene
    •  Hue w/ Solr
    •  Avro w/ Tika
    •  HDFS w/ Solr/Lucene
    •  Sentry w/ Solr

Cloudera Search + Hue

Why Search?
Apologies, I swiped some pretty slides from marketing…

Search Design Strategy
An Integrated Part of the Hadoop System
•  One pool of data
•  One security framework
•  One set of system resources
•  One management interface
[Diagram labels: Storage, Integration, Resource Management, Metadata; HDFS (TEXT, RCFILE, PARQUET, AVRO, ETC.), HBase (RECORDS); Engines: Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (CLOUDERA IMPALA), Interactive Search (CLOUDERA SEARCH), Machine Learning (MAHOUT), Math & Statistics (SAS, R), …]

Benefits of Search Integration
Improved Big Data ROI
§  An interactive experience without technical knowledge
§  Single data set for multiple computing frameworks
Faster Time to Insight
§  Exploratory analysis, esp. unstructured data
§  Broad range of indexing options to accommodate needs
Cost Efficiency
§  Single scalable platform; no incremental investment
§  No need for separate systems, storage
Solid Foundations & Reliability
§  Solr in production environments for years
§  Hadoop-powered reliability and scalability

Making Decisions
So much software…

That’s a Lot of Software
•  21 packages, depending on how you count
•  And there’s plenty more…
•  How to decide what to use?

“The answer to most Hadoop questions is it depends.”

Some of the Big Issues
•  Response time
•  User interfaces
•  Programming paradigm
•  Input/output formats
•  Use cases

Response Time
•  MapReduce is batch oriented
    •  Resilient to hardware failures
    •  Robust scheduling options
•  Impala is near-realtime
•  HBase is realtime
    •  Key/values are cached in memory
•  Search can be (near-)realtime.
•  Hybrid systems are common!

User Interfaces
•  Java
    •  MapReduce, HBase
•  SQL
    •  Hive, Impala
•  Shell
    •  Pig
•  Natural Language / Free Text
    •  Search

Data Constraints
•  MapReduce
    •  Paradigm takes some getting used to
    •  Processing must accommodate format
•  HBase
    •  Columnar key/value store
    •  Hue makes this easier
•  Search
    •  Indexing and display
    •  Hue makes this easier

Input/Output Formats
•  Know what they are… optional.
•  Don’t know? That’s okay.
•  Schema on read.
•  Be able to extract what you need

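"Schema on read" means the data is stored as-is and structure is applied only when somebody reads it, so different readers can pull different fields from the same bytes. A minimal Python sketch, assuming a hypothetical comma-separated log and hypothetical schema lists:

```python
import csv
import io

# Hypothetical raw data, stored exactly as it arrived.
RAW = (
    "2013-11-05,web01,GET /index.html,200\n"
    "2013-11-05,web02,GET /missing,404\n"
)

# Two "schemas" over the same stored text: (field name, converter) pairs.
ACCESS_SCHEMA = [("date", str), ("host", str), ("request", str), ("status", int)]
HOSTS_ONLY = [("date", str), ("host", str)]


def read_with_schema(raw_text, schema):
    """Apply (name, converter) pairs to raw CSV rows at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter input, so a short schema reads fewer fields
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}


if __name__ == "__main__":
    print(list(read_with_schema(RAW, ACCESS_SCHEMA)))
    print(list(read_with_schema(RAW, HOSTS_ONLY)))
```

Nothing about the stored text changes between the two reads; only the interpretation does, which is why not knowing the format up front can be okay.
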
Lack of Use Case
•  “Big Data” and Hadoop
    •  They ENABLE you to solve problems
    •  Won’t solve problems for you
    •  Doesn’t know about your business logic
•  “Big” is bigger than you’re accustomed to…
•  Have a plan
    •  Bring your use cases
    •  Bring your business questions

Index Generation/Serving
One typical Hadoop use case.

eBay – Cassini Project
•  June 2012
    •  2B page views/day
    •  250M searches/day
    •  9 PB online
•  Custom search indexes
    •  Limited by field or time period

eBay – Cassini Project
•  MapReduce to generate indexes
    •  Customer history
    •  Item fields: name, price, descriptions, etc
•  Bulk import indexes into HBase, served
    •  15 TB in HBase, 1.2 TB daily import into HBase
•  Ranking algorithms can take into account
    •  More history
    •  More fields
    •  More customer-specific details

Search Use Cases
Some quick examples.

Search Use Cases
Powerful, proven search capabilities that let organizations:
•  Offer easy access to non-technical resources
•  Explore data prior to processing and modeling
•  Gain immediate access and find correlations in mission-critical data

Monsanto
Scalable, efficient image search for analysis and research
Track plant characteristics throughout their lifecycle
Before: Manual attribute extraction and search queries within database
Now: Parse and index images at acquisition and on demand, index archived images in batch

Cloudera: Internal Field Portal
Custom Aggregated Search

Cloudera – Internal Field Portal
•  Single stop for field engineers
    •  Mailing lists: public, private
    •  Tickets: support, development, public ASF
    •  Customer data: accounts, clusters, KB articles
    •  Customer Clusters: configs, audits, logs, events
    •  Books and papers
    •  Discussion forums
•  Dogfooding, yes
•  Makes my life easier

Cloudera – Internal Field Portal
•  Varied fetchers/observers for web/API content
•  Content is retrieved via Flume, Sqoop
•  Search indexes and replicates into HBase
•  Each collection has collection-specific filters/fields
•  Provides title, content snippet, link to original
•  Morphlines extracts books and papers using Tika
•  Impala for analytics
•  Future: Use MapReduce to ingest logs

Patterns & Predictions: Durkheim Project
Risk Classification & Predictive Analysis

2012
US Combat Deaths AFG: 301
US Military Suicides: 349
349 > 301
Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/

Patterns & Predictions – Durkheim Project
•  Assessment of mental health risks
•  Correlate veterans’ communications with suicide risk

Patterns & Predictions – Durkheim Project
•  Build machine learning algorithms on MapReduce
•  Train using expert knowledge
    •  Keywords
    •  Patterns
•  Algorithm detects and assigns risk scores
•  In what medium?

Patterns & Predictions – Durkheim Project
Unstructured Clinical Notes
Image: http://www.flickr.com/photos/42586873@N00/3770782889/

Patterns & Predictions – Durkheim Project
•  Phase 1
    •  3 cohorts: non-psychiatric, psychiatric, suicide-positive
    •  100 clinical profiles per cohort
    •  65% accurate in predicting suicide risk in control group
•  Phase 2
    •  Text analytics of clinical records, opt-in social media
    •  Goal of 100,000 veteran participants
    •  Represents a huge increase of data
    •  Traditional enterprise search couldn’t scale

Patterns & Predictions – Durkheim Project
•  Technologies
    •  Hadoop
    •  Search
        •  Indexing of machine learning, backed by HBase for performance
        •  Hue interface for non-technical users
        •  Discovery of terms, keywords, risk factors in numerous facets
    •  Impala
        •  Deep SQL queries if/when interesting deviations are found
        •  e.g. if the word “Molly” appeared in top 10 facets
        •  Write some SQL to dig in, perhaps revise indexing scheme

Patterns & Predictions – Durkheim Project
•  Currently
    •  Monitoring
    •  Analysis
•  Future
    •  Interventional study
    •  Back our hopes with data…
•  More detailed Case Study
    •  http://goo.gl/3ZJMwS
    •  http://durkheimproject.org/

Summary
Parting thoughts… in no particular order.

Search Simplifies Interaction
Explore, Navigate, Correlate
Experts know MapReduce. Savvy people know SQL.
Everyone knows Search.

Summary
•  With Hadoop, it depends.
•  The tools are out there.
    •  Open source software
    •  Many interconnected pieces
    •  Many unexplored opportunities
    •  A thriving community awaits you…
•  Data can make a difference.
•  Search allows everyone to interact with data.
    •  This is a Big Deal.

What’s Next?
•  Download Hadoop!
    •  Already done that? Contribute…
•  CDH available at www.cloudera.com
•  Cloudera provides pre-loaded VMs
    •  http://tiny.cloudera.com/quickstartvm
•  Clone our repos!
    •  https://github.com/cloudera

Questions?
Preferably related to the talk…

Thank You!
Alex Moundalexis
@technmsg
We’re hiring, kids! Well, not kids.

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Search in the Apache Hadoop Ecosystem: Thoughts from the Field

  • 1. Search in the Apache Hadoop Ecosystem: Thoughts from the Field. Open Source Search Conference, November 2013. Alex Moundalexis, @technmsg
  • 2. Thoughts of a Former SA
  • 3. Thoughts of a Former SA Field Guy
  • 4. Disclaimer • Technologies, not products • Cloudera builds software • most donated to Apache • some closed-source • I will likely mention “Cloudera Something” • Cloudera “products” I reference are open source • Apache Licensed • Source code is on GitHub • https://github.com/cloudera
  • 5. What This Talk Isn’t About • Deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor • Sizing & Tuning • Depends heavily on data and workload • Coding • Algorithms
  • 6. “The answer to most Hadoop questions is it depends.”
  • 7. Quick and dirty, more time for use cases. The Apache Hadoop Ecosystem
  • 8. Why “Ecosystem?” • In the beginning, just Hadoop • HDFS • MapReduce • Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
  • 9. Partial Ecosystem (diagram labels: Hadoop; external system; RDBMS / DWH; web server; device logs; API access; log collection; DB table import; batch processing; machine learning; external system; API access; user; RDBMS / DWH; DB table export; BI tool + JDBC/ODBC; Search; SQL)
  • 10. HDFS • Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System • http://research.google.com/archive/gfs.html
  • 11. Lots of Commodity Machines • Image: Yahoo! Hadoop cluster [OSCON ’07]
  • 12. MapReduce (MR) • Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper • http://research.google.com/archive/mapreduce.html
  • 14. You specify map() and reduce() functions. The framework does the rest.
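To make the map()/reduce() split on this slide concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop code: `run_job` is a hypothetical stand-in for the framework that only imitates the grouping step between map and reduce on a single machine.

```python
from collections import defaultdict


def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["it depends", "it really depends"])
print(result)
```

Only `map_fn` and `reduce_fn` carry the job-specific logic; everything in `run_job` is what a real cluster would handle (and distribute) on the user's behalf.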
  • 15. Apache HBase • Random, realtime read/write access • Key/value columnar store • (b|tr)illions of rows/columns • Based on Google BigTable • http://research.google.com/archive/bigtable.html
  • 16. Apache Accumulo • Random, realtime read/write access • Key/value columnar store • (b|tr)illions of rows/columns • Based on Google BigTable • http://research.google.com/archive/bigtable.html • Adds cell-level security • Implemented by National Security Agency • Donated to ASF
  • 17. Apache Hive & Pig • Abstraction of Hadoop’s Java API • Hive is SQL-based • Pig is more data-flow oriented • Eases analysis using MapReduce
  • 18. Cloudera Impala • SQL-based, but interactive response • Backed by HDFS or HBase • Allows for fast iteration/discovery • Not as fault-tolerant as MapReduce
  • 19. Apache Sqoop & Flume • Get your data in and out of HDFS • Sqoop focuses on relational databases • Flume focuses on log files
  • 20. Cloudera Hue • Hadoop User Experience • Hadoop is largely command line • Hue provides a UI for end-users • SDK to build your own apps on top
  • 21. Apache Mahout • Machine learning algorithms that run on MapReduce • Clustering • Classification • Filtering • I didn’t study these algorithms in school • Data science people are excited • Math people are excited • I’m excited for them
  • 22. Apache Tika • Content analysis toolkit • Simply put, a lot of parsers • Detect/extract metadata/text from documents • HTML • XML • Office • PDF • mbox • More…
  • 23. Apache ZooKeeper • Distributed systems are HARD • Everyone was trying to implement the same subsystems • Bugs lead to race conditions, other bad things • ZK: Highly reliable distributed coordination services • Configuration • Naming • Synchronization • Group Services
  • 24. Apache Oozie • Workflow scheduling for Hadoop • Like cron, but in directed graph fashion • Out of box hooks: • MR • Pig • Hive • Sqoop • Impala
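The "like cron, but in directed graph fashion" idea can be sketched as a dependency graph whose actions run only after their prerequisites finish. This is an illustration of the concept using Python's standard `graphlib` module (Python 3.9+), not Oozie's own workflow definition format; the action names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions
# that must complete before it may start.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "mr_index": {"pig_clean"},
    "impala_report": {"hive_aggregate", "mr_index"},
}

# static_order() yields one valid execution order that respects every edge.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

A plain cron schedule would have to guess how long each step takes; a graph lets the scheduler start `hive_aggregate` and `mr_index` as soon as `pig_clean` finishes, and hold `impala_report` until both are done.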
  • 25. Sentry (incubating) • Role-based access control for Hive/Impala/Solr • Regulatory/compliance assurance
  • 26. Cloudera Morphlines • In-memory transformations • Load, parse, transform, process • Records as name-value pairs w/ optional blob/pojo objects • Java library, embedded in your codebase • Used to ETL data from Flume and MR into Solr
  • 27. Apache Lucene • Java-based index and search • Spellchecking • Hit highlighting • Tokenization
  • 28. Apache Solr • Enterprise search platform • Based on Apache Lucene • Full-text search • Faceting • NRT indexing
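Faceting means returning, alongside the hits, a count of matching documents per value of some field. The sketch below shows the idea on an in-memory list; it does not use Solr or its API, and the documents, field names, and the `search_with_facets` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus; a real deployment would index these in Solr.
docs = [
    {"id": 1, "type": "ticket", "text": "hbase region server timeout"},
    {"id": 2, "type": "mailing_list", "text": "hbase compaction tuning"},
    {"id": 3, "type": "ticket", "text": "hbase solr indexing"},
    {"id": 4, "type": "kb_article", "text": "solr shard failover"},
]


def search_with_facets(documents, term, facet_field):
    """Return documents containing the term, plus hit counts per facet value."""
    hits = [d for d in documents if term in d["text"].split()]
    facets = Counter(d[facet_field] for d in hits)
    return hits, facets


hits, facets = search_with_facets(docs, "hbase", "type")
print([d["id"] for d in hits], dict(facets))
```

The facet counts are what let a user narrow a result set by clicking a category instead of rewriting the query.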
  • 29. Apache SolrCloud • Integration of Solr + ZooKeeper • Provides for shard failover
  • 30. Cloudera Search • Based on Apache Solr (incl Lucene and SolrCloud) • Fault-tolerance: collections backed by HDFS or HBase • Integration galore: • HBase/Flume/MapReduce w/ Lucene • Hue w/ Solr • Avro w/ Tika • HDFS w/ Solr/Lucene • Sentry w/ Solr
  • 31. Cloudera Search + Hue
  • 32. Cloudera Search + Hue
  • 33. Apologies, I swiped some pretty slides from marketing… Why Search?
  • 34. Search Design Strategy • An Integrated Part of the Hadoop System • One pool of data • One security framework • One set of system resources • One management interface (diagram labels: Storage; Integration; Resource Management; Metadata; Batch Processing: MapReduce, Hive & Pig; HDFS; HBase; Text, RCFile, Parquet, Avro, etc.; Records; Engines; Interactive SQL: Cloudera Impala; Interactive Search: Cloudera Search; Machine Learning: Mahout; Math & Statistics: SAS, R)
  • 35. Benefits of Search Integration • Improved Big Data ROI • An interactive experience without technical knowledge • Single data set for multiple computing frameworks • Faster Time to Insight • Exploratory analysis, esp. unstructured data • Broad range of indexing options to accommodate needs • Cost Efficiency • Single scalable platform; no incremental investment • No need for separate systems, storage • Solid Foundations & Reliability • Solr in production environments for years • Hadoop-powered reliability and scalability
  • 36. So much software… Making Decisions
  • 37. That’s a Lot of Software • 21 packages, depending on how you count • And there’s plenty more… • How to decide what to use?
  • 38. “The answer to most Hadoop questions is it depends.”
  • 39. Some of the Big Issues • Response time • User interfaces • Programming paradigm • Input/output formats • Use cases
  • 40. Response Time • MapReduce is batch oriented • Resilient to hardware failures • Robust scheduling options • Impala is near-realtime • HBase is realtime • Key/values are cached in memory • Search can be (near-)realtime. • Hybrid systems are common!
  • 41. User Interfaces • Java • MapReduce, HBase • SQL • Hive, Impala • Shell • Pig • Natural Language / Free Text • Search
  • 42. Data Constraints • MapReduce • Paradigm takes some getting used to • Processing must accommodate format • HBase • Columnar key/value store • Hue makes this easier • Search • Indexing and display • Hue makes this easier
  • 43. Input/Output Formats • Know what they are… optional. • Don’t know? That’s okay. • Schema on read. • Be able to extract what you need
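"Schema on read" can be illustrated as follows: the raw records are stored untouched, and a schema (here just a list of field names) is applied only when the data is read. This is a minimal sketch with hypothetical JSON records and a hypothetical `read_with_schema` helper, not any particular Hadoop input format.

```python
import json

# Hypothetical raw records, stored as-is with no schema enforced at write time.
raw_records = [
    '{"user": "alice", "bytes": 512, "agent": "curl"}',
    '{"user": "bob", "bytes": 2048}',
]


def read_with_schema(lines, fields):
    """Parse each raw line at read time and keep only the requested fields."""
    for line in lines:
        record = json.loads(line)
        # Fields missing from a record come back as None instead of failing.
        yield {name: record.get(name) for name in fields}


rows = list(read_with_schema(raw_records, ["user", "bytes"]))
print(rows)
```

A different reader can apply a different field list to the same stored lines, which is the "extract what you need" point on the slide.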
  • 44. Lack of Use Case • “Big Data” and Hadoop • They ENABLE you to solve problems • Won’t solve problems for you • Doesn’t know about your business logic • “Big” is bigger than you’re accustomed to… • Have a plan • Bring your use cases • Bring your business questions
  • 45. One typical Hadoop use case. Index Generation/Serving
  • 46. eBay – Cassini Project • June 2012 • 2B page views/day • 250M searches/day • 9 PB online • Custom search indexes • Limited by field or time period
  • 47. eBay – Cassini Project • MapReduce to generate indexes • Customer history • Item fields: name, price, descriptions, etc. • Bulk import indexes into HBase, served • 15 TB in HBase, 1.2 TB daily import into HBase • Ranking algorithms can take into account • More history • More fields • More customer-specific details
  • 48. Some quick examples. Search Use Cases
  • 49. Search Use Cases • Powerful, proven search capabilities that let organizations: • Offer easy access to non-technical resources • Explore data prior to processing and modeling • Gain immediate access and find correlations in mission-critical data
  • 50. Monsanto • Scalable, efficient image search for analysis and research • Track plant characteristics throughout their lifecycle • Before: Manual attribute extraction and search queries within database • Now: Parse and index images at acquisition and on demand, index archived images in batch
  • 51. Cloudera: Internal Field Portal. Custom Aggregated Search
  • 52. Cloudera – Internal Field Portal • Single stop for field engineers • Mailing lists: public, private • Tickets: support, development, public ASF • Customer data: accounts, clusters, KB articles • Customer Clusters: configs, audits, logs, events • Books and papers • Discussion forums • Dogfooding, yes • Makes my life easier
  • 53. Cloudera – Internal Field Portal
  • 54. Cloudera – Internal Field Portal • Varied fetchers/observers for web/API content • Content is retrieved via Flume, Sqoop • Search indexes and replicates into HBase • Each collection has collection-specific filters/fields • Provides title, content snippet, link to original • Morphlines extracts books and papers using Tika • Impala for analytics • Future: Use MapReduce to ingest logs
  • 55. Patterns & Predictions: Durkheim Project. Risk Classification & Predictive Analysis
  • 57. 2012 • US Combat Deaths AFG: 301 • US Military Suicides: 349 • Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/
  • 58. 2012 • US Combat Deaths AFG: 301 • US Military Suicides: 349 • 349 > 301 • Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/
  • 59. Patterns & Predictions – Durkheim Project • Assessment of mental health risks • Correlate veterans’ communications with suicide risk
  • 60. Patterns & Predictions – Durkheim Project • Build machine learning algorithms on MapReduce • Train using expert knowledge • Keywords • Patterns • Algorithm detects and assigns risk scores • In what medium?
  • 61. Patterns & Predictions – Durkheim Project • Unstructured Clinical Notes • Image: http://www.flickr.com/photos/42586873@N00/3770782889/
  • 62. Patterns & Predictions – Durkheim Project • Phase 1 • 3 cohorts: non-psychiatric, psychiatric, suicide-positive • 100 clinical profiles per cohort • 65% accurate in predicting suicide risk in control group • Phase 2 • Text analytics of clinical records, opt-in social media • Goal of 100,000 veteran participants • Represents a huge increase of data • Traditional enterprise search couldn’t scale
  • 63. Patterns & Predictions – Durkheim Project • Technologies • Hadoop • Search • Indexing of machine learning, backed by HBase for performance • Hue interface for non-technical users • Discovery of terms, keywords, risk factors in numerous facets • Impala • Deep SQL queries if/when interesting deviations are found • e.g. if the word “Molly” appeared in top 10 facets • Write some SQL to dig in, perhaps revise indexing scheme
  • 64. Patterns & Predictions – Durkheim Project • Currently • Monitoring • Analysis • Future • Interventional study • Back our hopes with data… • More detailed Case Study • http://goo.gl/3ZJMwS • http://durkheimproject.org/
  • 65. Parting thoughts… in no particular order. Summary
  • 66. Search Simplifies Interaction • Explore • Navigate • Correlate • Experts know MapReduce. Savvy people know SQL. Everyone knows Search.
  • 67. Summary • With Hadoop, it depends. • The tools are out there. • Open source software • Many interconnected pieces • Many unexplored opportunities • A thriving community awaits you… • Data can make a difference. • Search allows everyone to interact with data. • This is a Big Deal.
  • 68. What’s Next? • Download Hadoop! • Already done that? Contribute… • CDH available at www.cloudera.com • Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm • Clone our repos! • https://github.com/cloudera
  • 69. Preferably related to the talk… Questions?
  • 70. Thank You! Alex Moundalexis, @technmsg. We’re hiring, kids! Well, not kids.