Applications on Hadoop

Applications on Hadoop talk by Mark Grover at JFokus 2014 on February 4, 2014 in Stockholm, Sweden.


Transcript

  • 1. Building Applications on Hadoop
      Mark Grover
      Software Engineer, Cloudera
      @mark_grover
      Jfokus 2014 (February 4th, 2014)
      ©2014 Cloudera, Inc. All Rights Reserved.
  • 2. Agenda
      • Brief intro to Hadoop and the ecosystem
      • Developing apps on Hadoop
      • What’s the current problem?
      • How are we fixing it?
  • 3. What is Apache Hadoop?
      Apache Hadoop is an open source platform for data storage and processing that is:
      • Scalable
      • Fault tolerant
      • Distributed
      Core Hadoop system components
      • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
      • MapReduce: distributed computing framework
      Has the flexibility to store and mine any type of data
      • Ask questions across structured and unstructured data that were previously impossible to ask or solve
      • Not bound by a single schema
      Excels at processing complex data
      • Scale-out architecture divides workloads across multiple nodes
      • Flexible file system eliminates ETL bottlenecks
      Scales economically
      • Can be deployed on commodity hardware
      • Open source platform guards against vendor lock
  • 4. Developing apps on Hadoop
      Kite SDK
  • 5. “[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.”
      http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/
  • 6. Hadoop is incredibly powerful
  • 7. Hadoop is incredibly flexible
  • 8. Hadoop is incredibly low-level
  • 9. Hadoop is incredibly complex
  • 10. A typical system (zoom 100:1)
  • 11. A typical system (zoom 10:1)
  • 12. A typical system (zoom 5:1)
  • 13. What you actually care about
      • Getting data from A to B
      • Using it later
  • 14. Infrastructure details
      • Serialization, file formats, and compression
      • Metadata capture and maintenance
      • Dataset organization and partitioning
      • Durability and delivery guarantees
      • Well-defined failure semantics
      • Performance and health instrumentation
  • 15. Kite SDK
      • Make Hadoop accessible to the enterprise developer
      • Address the most common cases
      • Codify expert patterns and practices for building data-oriented systems and applications.
      • Let developers focus on business logic, not plumbing or infrastructure.
      • Provide smart defaults for platform choices.
      • Support piecemeal adoption via loosely-coupled modules
  • 16. Kite SDK
      • An open source set of libraries, guides, and examples for building data-oriented systems and applications
      • Provides higher level APIs atop existing components of CDH
      • Supports piecemeal adoption via loosely coupled modules
  • 17. Kite SDK
      • Data: logical abstractions of records, datasets and repositories with implementations for HDFS and HBase (upcoming)
          • APIs to drastically simplify working with datasets in Hadoop filesystems. The Data module:
          • Handles automatic serialization and deserialization of Java POJOs as well as Avro Records.
          • Automatic compression.
          • File and directory layout and management.
          • Automatic partitioning based on configurable functions.
          • A metadata provider plugin interface to integrate with centralized metadata management systems.
      • Morphlines: declarative ETL stream processing library
      • Maven Plugin: tools for working with datasets and running jobs
      • TODO: Add more!!!
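To give a sense of the plumbing the Data module bullets above describe, here is a minimal hand-written round trip that serializes and compresses one record using only the JDK. It is a sketch under stated assumptions: Java serialization and GZIP stand in for Avro and whatever codecs Kite uses, and the `Event` class and `ManualPlumbingSketch` names are hypothetical, not part of Kite.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: JDK serialization + GZIP as stand-ins for Avro + a codec. */
final class ManualPlumbingSketch {

    /** Hypothetical POJO with the two fields used in the talk's example. */
    static final class Event implements Serializable {
        private static final long serialVersionUID = 1L;
        final long userId;
        final long timeStamp;

        Event(long userId, long timeStamp) {
            this.userId = userId;
            this.timeStamp = timeStamp;
        }
    }

    /** Serializes the event and compresses the bytes. */
    static byte[] write(Event event) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the outer stream also finishes the GZIP stream.
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(event);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses the bytes and deserializes the event. */
    static Event read(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return (Event) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Event back = read(write(new Event(1L, System.currentTimeMillis())));
        System.out.println("userId=" + back.userId + " timeStamp=" + back.timeStamp);
    }
}
```

Even this small example has to choose a format, a codec, and an error-handling policy by hand, which is the kind of decision the slides say the Data module makes by default.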
  • 18. Co-authoring O’Reilly book
      • Titled ‘Hadoop Application Architectures’
      • How to build end-to-end solutions using Apache Hadoop and related tools
      • Updates on Twitter: @hadooparchbook
      • http://www.hadooparchitecturebook.com/
  • 19. Code
          // Open a dataset repository rooted at /data on the configured filesystem.
          DatasetRepository repo = new FileSystemDatasetRepository.Builder()
              .fileSystem(FileSystem.get(new Configuration()))
              .directory(new Path("/data"))
              .get();

          // Create the "events" dataset from an Avro schema file,
          // hash-partitioned on userId into 53 buckets.
          Dataset events = repo.create("events",
              new DatasetDescriptor.Builder()
                  .schema(new File("event.avsc"))
                  .partitionStrategy(
                      new PartitionStrategy.Builder().hash("userId", 53).get()
                  ).get()
          );

          // Write one record. `schema` is the Avro Schema for event.avsc;
          // the slide does not show where it is defined.
          DatasetWriter<GenericRecord> writer = events.getWriter();
          writer.open();
          writer.write(
              new GenericRecordBuilder(schema)
                  .set("userId", 1)
                  .set("timeStamp", System.currentTimeMillis())
                  .build()
          );
          writer.close();
      Data
          /data
            /events
              /.metadata
                /schema.avsc
                /descriptor.properties
              /userId=0
                /10000000.avro
                /10000001.avro
              /userId=1
                /20000000.avro
              /userId=2
                /30000000.avro
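The `hash("userId", 53)` partition strategy and the `userId=N` directories on this slide can be illustrated with a small, self-contained sketch. It assumes a plain `hashCode()` modulo the bucket count; Kite's actual hash function and directory naming are not shown in the slides and may differ, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: not Kite's actual partitioner or hash function. */
final class HashPartitionSketch {

    /** Maps a field value to a bucket in [0, buckets). */
    static int bucketFor(Object value, int buckets) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(value.hashCode(), buckets);
    }

    /** Builds a directory name in the "field=bucket" style shown on the slide. */
    static String partitionDir(String field, Object value, int buckets) {
        return field + "=" + bucketFor(value, buckets);
    }

    public static void main(String[] args) {
        int buckets = 53;
        // Group a few sample user ids by the partition directory they land in.
        Map<String, List<Integer>> layout = new TreeMap<>();
        for (int userId : new int[] {1, 2, 54, 107}) {
            String dir = partitionDir("userId", userId, buckets);
            layout.computeIfAbsent(dir, k -> new ArrayList<>()).add(userId);
        }
        layout.forEach((dir, ids) ->
            System.out.println("/data/events/" + dir + " <- " + ids));
    }
}
```

With this assumed hash, user ids 1, 54, and 107 share one directory because they are congruent modulo 53, which is the point of hash partitioning: a bounded number of directories regardless of how many distinct users write events.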
  • 20. Kite SDK Morphlines Module
      • Pluggable, configuration-driven data transform library
      • Born out of Cloudera Search, but general purpose
      • Configure record transform stages in a container library
      • Use the library in Flume, MapReduce jobs, Storm, and other Java applications
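The idea of chaining record transform stages can be sketched in plain Java. This is not the Morphlines API or its configuration syntax: the `Stage` interface, the `splitCsv` and `toLong` stages, and the `line` field name are all hypothetical, and the chain is assembled in code here rather than from a configuration file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Illustrative only: a chain of record-transform stages, not the Morphlines API. */
final class TransformChainSketch {

    /** A stage returns the transformed record, or empty to drop the record. */
    interface Stage extends Function<Map<String, Object>, Optional<Map<String, Object>>> {}

    /** Hypothetical stage: splits the "line" field on commas into named fields. */
    static Stage splitCsv(String... names) {
        return record -> {
            String[] parts = String.valueOf(record.get("line")).split(",");
            if (parts.length != names.length) {
                return Optional.empty(); // drop malformed records
            }
            Map<String, Object> out = new HashMap<>(record);
            for (int i = 0; i < names.length; i++) {
                out.put(names[i], parts[i].trim());
            }
            return Optional.of(out);
        };
    }

    /** Hypothetical stage: parses a string field into a long, dropping bad values. */
    static Stage toLong(String field) {
        return record -> {
            Map<String, Object> out = new HashMap<>(record);
            try {
                out.put(field, Long.parseLong(String.valueOf(record.get(field))));
            } catch (NumberFormatException e) {
                return Optional.empty();
            }
            return Optional.of(out);
        };
    }

    /** Runs a record through each stage in order, stopping once a stage drops it. */
    static Optional<Map<String, Object>> run(List<Stage> stages, Map<String, Object> record) {
        Optional<Map<String, Object>> current = Optional.of(record);
        for (Stage stage : stages) {
            if (!current.isPresent()) {
                break;
            }
            current = stage.apply(current.get());
        }
        return current;
    }

    /** Example chain: split a CSV line, then convert both fields to longs. */
    static List<Stage> exampleStages() {
        return Arrays.asList(
            splitCsv("userId", "timeStamp"), toLong("userId"), toLong("timeStamp"));
    }

    /** Wraps a raw text line in a single-field record. */
    static Map<String, Object> lineRecord(String line) {
        Map<String, Object> record = new HashMap<>();
        record.put("line", line);
        return record;
    }

    public static void main(String[] args) {
        System.out.println(run(exampleStages(), lineRecord("1, 1391500000000")).isPresent());
        System.out.println(run(exampleStages(), lineRecord("not a record")).isPresent());
    }
}
```

Because every stage has the same shape, the same chain could in principle be reused from any host that can hand it a record, which is the property the slide highlights for Flume, MapReduce, and Storm.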
  • 21. Other Modules
      • Maven plugin
          • Package, deploy, and execute “apps”
          • Execute dataset operations
      • Examples
          • POJO, generic, and generated entity ingest
          • Dataset administrative operations
          • Crunch and MR integration
          • ...
  • 22. Future
      • HBase
          • Extending data APIs to support random access
          • Same automatic serialization, schema management, etc.
      • Higher-order data management
          • Common tasks
          • Think background compaction, conversion, etc.
      • Integration with existing middleware frameworks
      • Give us all your good ideas (and code)!
  • 23. Kite SDK Resources
      • Docs
          • http://kitesdk.org/docs/current/
      • Examples
          • https://github.com/kite-sdk/kite-examples
      • Source code
          • https://github.com/kite-sdk/
      Binary artifacts available from Cloudera’s Maven repository
      • Twitter: @mark_grover
      • Slides at
      • LinkedIn: linkedin.com/in/grovermark