Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation

Transcript

  • 1. Hadoop is dead, long live Hadoop! Lars George | EMEA Chief Architect | @larsgeorge. A Eulogy and Proclamation.
  • 2. What the Press Says… Source: http://blogs.the451group.com/information_management/2012/07/09/hadoop-is-dead-long-live-hadoop/
  • 3. Big Data… WTH? A brief reasoning for Hadoop’s existence.
  • 4. 4 —  Bubble  Buddy,  Head  of  IT  
  • 5. Big Data – A Misnomer
    • Misleading to quick assumptions
    • Current challenges are driven by many things, not just the size of data
    • ANY company can use the Big Data principles to improve specific business metrics
      • Increased data retention
      • Access to all the data
      • Machine learning for pattern detection, recommendations
    • But what has happened to cause this all?
  • 6. Explosive Data Growth: 1.8 trillion gigabytes of data was created in 2011…
    • More than 90% is unstructured data
    • Approx. 500 quadrillion files
    • Quantity doubles every 2 years
    [Chart: gigabytes of data created (in billions), structured vs. unstructured, 2005–2015] Source: IDC 2011
  • 7. The ‘Big Data’ Phenomenon. Big Data Drivers:
    • The proliferation of data capture and creation technologies
    • Increased “interconnectedness” drives consumption (creating more data)
    • Inexpensive storage makes it possible to keep more, longer
    • Innovative software and analysis tools turn data into information
    Big Data encompasses not only the content itself, but how it’s consumed: more devices, more consumption, more content, new and better information.
    • Every gigabyte of stored content can generate a petabyte or more of transient data*
    • The information about you is much greater than the information you create
    *Source: IDC 2011
  • 8. The Current Solutions: current database solutions are designed for structured data.
    • Optimized to answer known questions quickly
    • Schemas dictate form/context
    • Difficult to adapt to new data types and new questions
    • Expensive at petabyte scale
    [Chart: gigabytes of data created (in billions), structured vs. unstructured, 2005–2015]
  • 9. Data Management Strategies Have Stayed the Same
    • Raw data on SAN, NAS and tape
    • Data moved from storage to compute
    • Relational models with predesigned schemas
  • 10. Too Much Data, Too Many Sources
    • Can’t ingest fast enough
  • 11. Too Much Data, Too Many Sources
    • Can’t ingest fast enough
    • Costs too much to store
  • 12. Too Much Data, Too Many Sources
    • Can’t ingest fast enough
    • Costs too much to store
    • Exists in different places
  • 13. Too Much Data, Too Many Sources
    • Can’t ingest fast enough
    • Costs too much to store
    • Exists in different places
    • Archived data is lost
  • 14. Can’t Use It The Way You Want To
    • Analysis and processing takes too long
  • 15. Can’t Use It The Way You Want To
    • Analysis and processing takes too long
    • Data exists in silos
  • 16. Can’t Use It The Way You Want To
    • Analysis and processing takes too long
    • Data exists in silos
    • Can’t ask new questions
  • 17. Can’t Use It The Way You Want To
    • Analysis and processing takes too long
    • Data exists in silos
    • Can’t ask new questions
    • Can’t analyze unstructured data
  • 18. The Big Data Challenge: Big Data contains limitless insights… BUT its VOLUME, VARIETY and VELOCITY demand a new approach. Sources: web logs, social media, transactional data, smart grids, operational data, digital content, R&D data, ad impressions, files.
  • 19. Big Data Challenges
    • Cost-effectively managing the volume, velocity and variety of data
    • Deriving value across structured and unstructured data
    • Adapting to context changes and integrating new data sources and types
  • 20. Big Data Solution Requirements
    • Cost-effectively manage the volume, variety and velocity of data
    • Process and analyze large, complex data sets… quickly
    • Flexibly adapt to context changes and new data types
  • 21. Google’s Approach to Big Data: Hadoop’s Pedigree
  • 22. A Timeline View #1
  • 23. Google File System (Storage)
    • Foundation of scalable, fail-safe, self-healing storage
    • One central place of truth
    • Cost-effective hardware finally available: 19” rack servers with a decent amount of disk space
    • Handling of failures built in, for components or entire servers; at scale there are always hardware faults
    • Simple file system interface
    • Finally no need for expensive, proprietary systems
  • 24. MapReduce (Processing)
    • First take on a distributed data processing framework
    • Same concepts as Google File System, i.e. fail-safe and scalable
    • Handles a wide range of data processing problems, BUT not all of them (more later)
    • Simple API reading and writing Key/Value pairs
    • Framework handles the heavy task of data movement
    • Core concept is data locality, heavy I/O: brings code to data, not the opposite (i.e. no HPC)
    • Accessible in many programming languages
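  The Key/Value API described on this slide can be sketched, outside Hadoop, as a toy single-process MapReduce; word count is the canonical example. This is an illustrative Python sketch of the programming model only, not Hadoop’s actual Java API, and all function names here are made up:

  ```python
  from collections import defaultdict

  def map_phase(records):
      """Mapper: read (key, value) records, emit (word, 1) pairs."""
      for _, line in records:
          for word in line.split():
              yield (word.lower(), 1)

  def shuffle(pairs):
      """Group values by key -- the data movement the framework handles."""
      groups = defaultdict(list)
      for key, value in pairs:
          groups[key].append(value)
      return groups

  def reduce_phase(groups):
      """Reducer: combine each key's values, here by summing counts."""
      return {key: sum(values) for key, values in groups.items()}

  lines = [(0, "hadoop is dead"), (1, "long live hadoop")]
  counts = reduce_phase(shuffle(map_phase(lines)))
  print(counts["hadoop"])  # 2
  ```

  In real MapReduce the mappers run on the nodes that already hold the input blocks (data locality), and only the shuffle moves data across the network.
  
  
  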
  • 25. BigTable (Random Access)
    • Adds database-like random access to data
    • Effectively a Key/Value store with table semantics
    • Used for small data points, usually less than a megabyte per Key/Value
    • Forfeits advanced concepts for ease of scalability: no transactions, no query language
    • Powers many applications at Google
    • Uses Google File System as storage layer
    • Tight integration with MapReduce for batch processing
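  The “Key/Value store with table semantics” idea can be sketched as a toy in-memory table: cells addressed by (row key, column), with lexicographic range scans as the main access pattern. `ToyTable` is a made-up name for illustration; the real BigTable (and its open-source cousin HBase) adds column families, versions, and distribution across servers:

  ```python
  class ToyTable:
      """Toy in-memory 'table': cells addressed by (row key, column),
      no transactions and no query language -- just put, get, and
      sorted range scans, the access pattern BigTable optimizes for."""

      def __init__(self):
          self.data = {}  # row key -> {column: value}

      def put(self, row, col, value):
          self.data.setdefault(row, {})[col] = value

      def get(self, row, col):
          return self.data.get(row, {}).get(col)

      def scan(self, start, stop):
          """Yield (row, cells) for start <= row key < stop, in sorted order."""
          for row in sorted(self.data):
              if start <= row < stop:
                  yield row, self.data[row]

  t = ToyTable()
  t.put("row-b", "cf:qual", "2")
  t.put("row-a", "cf:qual", "1")
  t.put("row-c", "cf:qual", "3")
  print([row for row, _ in t.scan("row-a", "row-c")])  # ['row-a', 'row-b']
  ```

  Keeping rows sorted by key is what makes range scans cheap, and it is also why row-key design matters so much in BigTable-style systems.
  
  
  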
  • 26. Dremel, Tenzing, Pregel (Query API)
    • Dremel adds a specific file format and query language
      • Used for highly selective queries, data exploration
      • File layout is optimized for very effective scanning
      • Runs alongside MapReduce and the file system
    • Tenzing adds SQL over various data sources
      • Can query raw files, Dremel files, or BigTable data, etc.
      • Brings a “known” paradigm to stored data
    • Pregel adds a graph processing API
  • 27. Percolator, Megastore (Transactions)
    • Additions to BigTable to add “missing” features
    • Percolator uses BigTable to update the search index incrementally, which needs transactions
      • Distributes updates with multi-phase commits
    • Megastore drives Google App Engine to also add transactions for the user API
      • Uses ranges of rows as entity groups
      • Reduces locking to small subsets
      • Optimistic, roll-forward-only transactions
      • Java layer over the BigTable API
  • 28. Spanner, F1 (World-Wide Data)
    • Future of Google’s distributed storage and processing system
    • Spanner is a scalable, multi-version, globally-distributed, and synchronously-replicated database
      • Replicates across datacenters
      • Uses TrueTime (atomic clocks) for synchronization
      • Uses Colossus for storage (a GFS successor)
    • F1 replaced MySQL for the AdWords service
      • SQL over data stored in Spanner
      • Colocated with Spanner processes
  • 29. The Hadoop Story: A Eulogy
  • 30. What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed.
    Core Hadoop system components:
    • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
    • MapReduce/YARN: distributed computing framework
    Has the flexibility to store and mine any type of data:
    • Ask questions across structured and unstructured data that were previously impossible to ask or solve
    • Not bound by a single schema
    Excels at processing complex data:
    • Scale-out architecture divides workloads across multiple nodes
    • Flexible file system eliminates ETL bottlenecks
    Scales economically:
    • Can be deployed on commodity hardware
    • Open source platform guards against vendor lock-in
  • 31. Core Hadoop: HDFS. Self-healing, high bandwidth. HDFS breaks incoming files into blocks and stores them redundantly across the cluster. [Diagram: a five-block file, each block replicated on several cluster nodes]
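  The block-and-replica idea from this slide can be sketched as follows. This is a toy, with illustrative function names: real HDFS uses large blocks (64 MB by default in Hadoop 1, 128 MB later) and rack-aware placement rather than simple round-robin:

  ```python
  def split_into_blocks(data: bytes, block_size: int):
      """Split a file into fixed-size blocks, as HDFS does on ingest
      (a tiny block size here, just for illustration)."""
      return [data[i:i + block_size] for i in range(0, len(data), block_size)]

  def place_replicas(num_blocks, nodes, replication=3):
      """Toy round-robin replica placement across nodes. Real HDFS is
      rack-aware; this only shows how redundancy survives a node loss."""
      return {b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
              for b in range(num_blocks)}

  blocks = split_into_blocks(b"x" * 1000, block_size=256)
  layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
  print(len(blocks), layout[0])  # 4 ['node1', 'node2', 'node3']
  ```

  Because every block lives on several nodes, losing one machine loses no data, and the NameNode can re-replicate the affected blocks from the surviving copies (the “self-healing” on the slide).
  
  
  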
  • 32. Core Hadoop: MapReduce. Processes large jobs in parallel across many nodes and combines the results. [Diagram: the same five blocks, processed on the nodes that store them]
  • 33. Why Hadoop Was Created. Exploding data volumes and types are driving the need for a flexible, scalable solution: it is difficult to handle data this diverse, at this scale, and traditional platforms can’t keep pace.
    • Big Data: any kind, from any source, structured and unstructured, at scale
    • Hard problems: deep analysis, exhaustive and detailed, sophisticated algorithms, generate results quickly
    • New opportunities: extract more value, from more data, more cost-effectively, with greater flexibility; new opportunities to derive value from all your data
  • 34. The Core Values of Hadoop. A platform that:
    • Is designed to store and process data at petabyte scale
    • Uses a scale-out architecture to increase capacity and processing power linearly
    • Performs operations in parallel across the entire cluster
    • Stores data in any format – free from rigid schemas
    • Defines context at the time you ask the question
    • Processes and analyzes data using virtually any programming language
    • Lets you build out your cluster on your hardware of choice
    • Guards against vendor lock-in through open source software
    • Ensures investment protection through wide integration
  • 35. Hadoop in Practice
  • 36. Cloudera Software Stack: a turnkey solution for Big Data and advanced analytics use-cases.
    • CDH: 100% open source Hadoop distribution
      • Core projects: HDFS, MapReduce, Flume, HCatalog, Hive, Hue, Mahout, Oozie, Pig, Sqoop, Whirr, ZooKeeper
      • Premium projects: HBase, Impala, Search (beta)
      • Connectors: MicroStrategy, Netezza, Oracle, QlikView, Tableau, Teradata
    • Cloudera Manager: end-to-end system management (deployment, monitoring, API, SNMP, config rollbacks, phone home, service management, diagnostics, rolling upgrades, LDAP, reporting, backup/DR)
    • Cloudera Navigator: end-to-end data management (access management, data audit)
    • Cloudera Support: best-in-class technical support, community advocacy and indemnification
  • 37. Spin some YARN! Reborn again!
  • 38. Back to the Press again… Source: http://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it/
  • 39. A Timeline View #2
  • 40. First: What is MapReduce 1?
  • 41. Motivations to Change MR1
    • Scaling >4,000 nodes
    • Fewer, larger clusters
    • No single source of truth, data in “silos” again
    • HA of the Job Tracker difficult: large, complex state
    • Poor resource utilization: slots in MR1 are for either map or reduce
  • 42. YARN: Yet Another Resource Negotiator
  • 43. Split of Responsibilities. The Job Tracker is split into:
    • Resource Manager: one per cluster, long-lived
    • Application Master: app-level, one per app instance, short-lived, handles task-level scheduling and monitoring
  • 44. Fine-grained Resource Control
    • Node Manager is a generalized Task Tracker
    • Task Tracker: fixed number of map and reduce slots
    • Node Manager: containers with variable resource limits
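  The contrast with MR1’s fixed slots can be sketched with a toy allocator: instead of reserving a fixed number of map and reduce slots, a node grants variable-sized memory containers until its capacity runs out. The numbers and names below are illustrative, not the YARN API:

  ```python
  def allocate(container_requests, node_capacity_mb):
      """Toy container allocation: grant variable-sized memory containers
      on one node until it is full -- unlike MR1's fixed map/reduce slots,
      any mix of applications can share the capacity."""
      granted, free = [], node_capacity_mb
      for app, mem in container_requests:
          if mem <= free:
              granted.append((app, mem))
              free -= mem
      return granted, free

  requests = [("mr-job", 1024), ("spark-app", 4096), ("mr-job", 2048)]
  granted, free = allocate(requests, node_capacity_mb=8192)
  print(granted, free)  # all three fit, 1024 MB left over
  ```

  With fixed slots, a node configured for, say, 6 map and 2 reduce slots leaves map slots idle during the reduce phase; variable containers let whatever work is pending use the freed capacity.
  
  
  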
  • 45. Node Manager: Containers
  • 46. YARN + MapReduce 2
    • YARN “runs” MapReduce as an application
    • MR is user space, YARN is kernel
  • 47. YARN Applications
    • Distributed shell
    • Open MPI
    • Master-worker
    • Apache Giraph, Hama
    • Spark
  • 48. Summary: What the future may hold
  • 49. Enterprise Data Evolution. As the amount of data grows, the business impact shifts from improving operational efficiency to creating competitive advantage:
    • 1980s, RDBMS/EDW: data collection and reporting
    • 2000s, Hadoop-optimized infrastructure: process data faster, store data more cost-effectively, simplify infrastructure
    • 2010s, next-gen data computing platform and the data-driven organization: combine data from across the business, ask new questions immediately, enable new real-time applications
  • 50. Playing Catchup
    • Improve overall performance
      • Google’s code is a kernel module, C++, as low-level as possible
      • Hadoop is Java, for ease of development in open source
      • Maybe rewrite parts of the stack?
      • Overall goal: saturate machine specs (I/O, CPU, RAM)
    • Add missing features
      • Everything is based on “hearsay”, aka research papers and presentations
      • Add what is necessary, or for the sake of it?
  • 51. Further Extend or Invent?
    • YARN is a good example of what can be done
    • Look at every component and evaluate
    • Work with research, universities, and companies to drive new development
    • What else can be done with all that data?
  • 52. — Jim Gray, Computer Scientist
  • 53. From Framework to Platform to Commodity
    • Hadoop distributions are already a commodity
    • Move up the stack to reach the commercial space
    • Simplify data processing: Continuuity, WibiData (Kiji), Cloudera CDK
    • Pure Hadoop solutions: Datameer, Platfora
  • 54. Hadoop… live long and prosper!
  • 55. Thank you! Lars George, EMEA Chief Architect, Cloudera | @larsgeorge