Hadoop - Past, Present and Future - v2.0

  • 1.
    © 2014 Trace3, All rights reserved.
    BIG DATA INTELLIGENCE PRACTICE
    HADOOP: PAST, PRESENT AND FUTURE
  • 2.
    Roadmap (~1 hour)
    1- What Makes Up Hadoop 1.x?
    2- What's New In Hadoop 2.x?
    3- The Future Of Hadoop …
  • 3.
    WHAT MAKES UP HADOOP 1.0?
  • 4.
    What's a "Node"?
    Node aka Server: Compute, Storage, Operating System, Memory
  • 5.
    Hadoop 1.0: HDFS + MapReduce
    [Diagram: a Client, a NameNode, a JobTracker and four DataNode / TaskTracker nodes; items labeled 1-1, 1-2 and 1-3]
  • 6.
    Hadoop 1.0: HDFS + MapReduce
    [Diagram: a Client, a NameNode, a JobTracker and four DataNode / TaskTracker nodes running Map and Reduce tasks; items labeled 1-1 to 1-3, 2-1 to 2-3, 3-1 to 3-3 and 4-1 to 4-3 spread across the nodes]
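
The Map and Reduce steps in the diagram above can be illustrated with a minimal, single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample blocks are assumptions made for illustration, and a real cluster would run each map call on the node that stores the block.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(blocks):
    """Run map over every block, then shuffle and reduce the results."""
    pairs = []
    for block in blocks:  # hypothetical stand-in for blocks stored on DataNodes
        pairs.extend(map_phase(block))
    return reduce_phase(shuffle(pairs))


blocks = ["big data big cluster", "data node data"]
counts = word_count(blocks)
print(counts)
```

Each block is mapped independently, which is what lets the work be spread across many TaskTrackers; only the shuffle step needs to bring values for the same key together.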
  • 7.
    MapReduce v1 Limitations
    Scalability: Maximum cluster size is 4,000 nodes, with a maximum of 40,000 concurrent tasks
    Availability: JobTracker failure kills all queued and running jobs
    Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization
    No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
  • 8.
    Hadoop 1.0: Single Use System (Batch Apps)
    Pig, Hive
    MapReduce (cluster resource management and data processing)
    HDFS (redundant, reliable storage)
  • 9.
    WHAT'S NEW IN HADOOP 2.0?
  • 10.
    YARN: Yet Another Resource Negotiator
    YARN replaces MapReduce as the cluster resource manager
    YARN will be the de facto distributed operating system for Big Data
  • 11.
    YARN: No Longer Just Batch Apps
    Store DATA in one place
    Interact with that data in MULTIPLE WAYS, with Predictable Performance and Quality of Service
    Applications Run Natively IN Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (DataTorrent), GRAPH (Giraph)
    YARN (cluster resource management)
    HDFS2 (redundant, reliable storage)
  • 12.
    YARN: Applications
    All running on the same Hadoop cluster, giving every application access to the same source data!
    MapReduce v2, Stream Processing, Master-Worker, Online, In-Memory, Apache Storm
  • 13.
    YARN: Quickly Maturing
    Timeline, 2010 to 2014 (Today):
    Conceived at Yahoo!
    Alpha Releases – 2.0
    Beta Releases – 2.1
    GA Released – 2.2
    Version 2.3
    Version 2.4
    100,000+ nodes, 400,000+ jobs daily
    10 million+ hours of compute daily
  • 14.
    YARN: Dr. Evil Approved
  • 15.
    YARN: What Has Changed?
    YARN                                                MRv1
    ResourceManager (RM) + ApplicationMaster (AM)       JobTracker (JT)
    Scheduler                                           Scheduler
    NodeManager (NM)                                    TaskTracker (TT)
    Container                                           Map & Reduce Slot
    [Diagram: a ResourceManager with its Scheduler above NodeManagers hosting an ApplicationMaster and Containers, beside a JobTracker with its Scheduler above TaskTrackers hosting Map and Reduce slots]
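
The move from fixed Map & Reduce slots to generic Containers ties back to the "low resource utilization" limitation listed for MapReduce v1. A rough arithmetic sketch follows. The slot counts, task counts and function names are hypothetical, and the model simplifies by treating every slot and container as the same size.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1-style: map tasks may only fill map slots, reduce tasks only reduce slots."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def container_utilization(containers, map_tasks, reduce_tasks):
    """YARN-style: any pending task may run in any free container."""
    busy = min(containers, map_tasks + reduce_tasks)
    return busy / containers


# Hypothetical node: 6 map slots + 4 reduce slots versus 10 generic containers,
# during a phase of the job where only map tasks are pending.
fixed = slot_utilization(6, 4, map_tasks=10, reduce_tasks=0)
pooled = container_utilization(10, map_tasks=10, reduce_tasks=0)
print(fixed, pooled)
```

In this made-up scenario the four reduce slots sit idle under hard partitioning, while the pooled containers can all be put to work.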
  • 16.
    The 6 Benefits Of YARN
    • Scale
    • New programming models and services
    • Improved cluster utilization
    • Agility
    • Backwards compatible with MapReduce v1
    • Mixed workloads on the same source of data
  • 17.
    THE FUTURE OF HADOOP
  • 18.
    SQL on Hadoop
    Speed: Deliver interactive query performance.
    SQL: Support an array of SQL semantics for analytic applications running against Hadoop.
    Scale: SQL interface to Hadoop designed for queries that scale from terabytes to petabytes.
  • 19.
    SQL on Hadoop
    Hive on Apache Tez (Hortonworks HDP2)
    Hive on Apache Spark (Cloudera CDH5)
    Apache Drill (MapR M7)
    Cloudera Impala (Cloudera CDH5)
    Pivotal HAWQ (Pivotal Big Data Suite)
  • 20.
    HOYA: HBase (NoSQL) on YARN
    Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.
    Easier Deployment: APIs to create, start, stop and delete HBase clusters.
    Availability: Recover from Region Server loss with a new container.
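
As a rough illustration of "increase and decrease the size with load", the sketch below computes how many Region Servers a cluster might ask for. It is not HOYA's API: the function name, the per-server capacity figure and the min/max bounds are all assumptions chosen for the example.

```python
import math


def desired_region_servers(requests_per_sec, capacity_per_server,
                           min_servers=1, max_servers=10):
    """Hypothetical scaling rule: enough servers for the load, clamped to bounds."""
    needed = math.ceil(requests_per_sec / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


# Assumed capacity of 1,000 requests per second per Region Server.
for load in (0, 4500, 50000):
    print(load, desired_region_servers(load, 1000))
```

A controller applying a rule like this would request more containers when the result exceeds the current count and release containers when it falls below it.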
  • 21.
    Microsoft REEF (Retainable Evaluator Execution Framework)
    Machine Learning: Framework well suited for building machine learning jobs.
    Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
    Maintain State: Users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.
  • 22.
    Heterogeneous Storage
    THEN: NameNode, Storage
    NOW: NameNode, SATA, SSD, Fusion IO
  • 23.
    Hadoop Roadmap
    • Apache Hadoop 2.5 (Q3 2014)
      – NodeManager Restart w/o disruption
      – Dynamic Resource Configuration
    • Apache Hadoop 2.6 (Q4 2014)
      – Memory As Storage Tier
      – Support For Docker Containers
  • 24.
    HADOOP: PAST, PRESENT & FUTURE
    I KNOW YOU HAVE QUESTIONS
    NO SUCH THING AS A STUPID QUESTION.
  • 25.
    ONE LAST THING …
    SD Big Data Meetup
    meetup.com/sdbigdata
    2nd Wednesday Of The Month
    Next: August 13th @ 5:45P