Hadoop pycon2011uk


  1. Respect for the elephant – Hadoop
     Aditya Sakhuja
     aditya@sakhuja.us
  2. Whoami
     • Software Engineer @ Yahoo Inc.
     • Web Search -> Cloud Platforms -> Display Ads Serving
     • http://linkedin.com/in/adityasakhuja
     9/24/11, PyCon UK 2011
  3. Agenda
     • Motivation
     • History
     • Ecosystem
     • Daemon processes / High Level View
     • Map Reduce Data Flow
     • HDFS Architecture / Replication
     • Can / Cannot
     • Getting started yourself
     • Demo
     • Companies Involved
     • Q&A
  4. Motivation
     • Problems with 'traditional' large-scale computing systems
     • Desired features in an improved system
     • How Hadoop addresses them
  5. 'Traditional' large-scale computing systems: problems
     • CPU-intensive rather than data-intensive
     • Parallel computation frameworks: MPI, PVM, RPCs
     • Programming for traditional distributed systems is complex
         – Data exchange requires synchronization
         – Temporal dependencies are complicated
         – It is difficult to deal with partial failures of the system
     • Data is typically stored on a SAN
     • Data is brought to the compute nodes at runtime
  6. Desired Features in a Large-Scale Data System
     • Data driven
         – A new, improved system should avoid data bottlenecks
     • Scalable
     • Consistent
     • Recoverable (data / processor)
     • Partial failure support
  7. What Hadoop offers
     • Provides a high-level programming model
         – No need to worry about locking, temporal dependencies, sockets, and so on
     • Plus the features from the desired list (previous slide)
  8. History
     • Hadoop is based on work done by Google in the late 1990s/early 2000s
     • Specifically, on papers describing the Google File System (GFS), published in 2003, and Map/Reduce, published in 2004
     • Hadoop MapReduce NextGeneration – 2011
         – http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
  9. Apache Hadoop Ecosystem
     • Hadoop Common: The common utilities that support the other Hadoop subprojects.
     • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
     • Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
     Other Hadoop-related projects at Apache include:
     • Cassandra™: A scalable multi-master database with no single points of failure.
     • HBase™: A scalable, distributed database that supports structured data storage for large tables.
     • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
     • Mahout™: A scalable machine learning and data mining library.
     • Pig™: A high-level data-flow language and execution framework for parallel computation.
     Source: http://hadoop.apache.org/
  10. Hadoop Key Daemon Processes
     • NameNode
     • Secondary NameNode
     • DataNode
     • JobTracker
     • TaskTracker
  11. High level Hadoop cluster view
  12. MapReduce Data Flow
  13. HDFS Architecture
  14. HDFS Replication
  15. Map Reduce Program Components
     • MapReduce programs generally consist of three portions
         – The Mapper
         – The Reducer
         – The driver code
     • Additional components:
         – Combiner (often the same code as the Reducer)
         – Custom Partitioner
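
The components on this slide can be sketched in plain Python. This is only an in-memory illustration of how the pieces relate, not the Hadoop API: the function names (`mapper`, `combiner`, `partitioner`, `reducer`, `driver`), the word-count task, and the byte-sum partitioning rule are all assumptions made for the example.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum every count received for a single key."""
    return word, sum(counts)


# For word count the combiner can reuse the reducer logic unchanged,
# because addition is associative and commutative.
combiner = reducer


def partitioner(key, num_reducers):
    """Choose a reducer for a key (hypothetical rule: byte sum modulo N)."""
    return sum(key.encode("utf-8")) % num_reducers


def driver(splits, num_reducers=2):
    """Wire the pieces together: map, combine, partition, then reduce."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # one simulated map task per input split
        local = defaultdict(list)
        for line in split:
            for word, one in mapper(line):
                local[word].append(one)
        # The combiner pre-aggregates map output before it is shuffled.
        for word, counts in local.items():
            key, partial = combiner(word, counts)
            partitions[partitioner(key, num_reducers)][key].append(partial)
    result = {}
    for part in partitions:  # one simulated reduce task per partition
        for word in sorted(part):  # reducers see their keys in sorted order
            key, total = reducer(word, part[word])
            result[key] = total
    return result
```

Calling `driver([["the elephant", "the hadoop"], ["hadoop the"]])` runs two simulated map tasks and merges their partial counts per word. In a real job the framework, not the driver, performs the shuffle and sort between the two phases.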
  16. Hadoop Is / Is Not
     • A high-bandwidth, high-latency system
     • Not a substitute for a DBMS, at least not on its own
     • HDFS is not yet a highly available file system: the NameNode is a single point of failure (SPOF)
     • A "share nothing" architecture
         – Mappers do not talk to each other, and neither do Reducers
  17. Getting started yourself
     Requirements:
     • Java SE SDK (download JDK 6 or higher)
     • Download and install
         – Hadoop Common: 0.20.203.X – current stable version
         – Hadoop HDFS: 0.21 – stable version
         – Hadoop MapReduce: 0.21 – stable version
     • Subscribe to the mailing lists for the Hadoop subprojects, depending on your role
     • Additionally/alternatively, one can set up VMs from Cloudera / Yahoo
     • Details:
         – http://wiki.apache.org/hadoop/GettingStartedWithHadoop
         – http://developer.yahoo.com/hadoop/tutorial/module7.html#basic
  18. Simple Demo
     • Using
         – Pig
         – Map/Reduce
  19. Streaming Jobs
     • Any language that can read from stdin and write to stdout
     • hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
           -input myInputDirs \
           -output myOutputDir \
           -mapper myMapScript.py \
           -reducer myReduceScript.py \
           -file myMapScript.py \
           -file myReduceScript.py
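
What the two scripts named in the command might contain can be sketched as follows, assuming a word-count job. The slides do not show the contents of `myMapScript.py` or `myReduceScript.py`, so the function names and the task are illustrative. The sketch relies on the streaming convention that each record is a line of text with key and value separated by a tab, and that the reducer receives its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper side: turn each input line into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer side: sum counts per word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run(step, infile=sys.stdin, outfile=sys.stdout):
    """Stream records from infile through one step and write to outfile."""
    for record in step(infile):
        outfile.write(record + "\n")
```

Under these assumptions, a mapper script would end with `run(map_lines)` and a reducer script with `run(reduce_lines)`. The same pair can be tried locally without a cluster by piping the mapper output through `sort` into the reducer.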
  20. Companies involved
     • Yahoo: 4,500-node cluster (2*4 cores, 4*1 TB disk, 16 GB RAM per node), used for AdServer and Search
     • HortonWorks, Cloudera
     • Facebook
     • A9 (Amazon Product Search)
     • eBay: 532-node cluster (8 * 532 cores, 5.3 PB)
     • Last.fm, Twitter …
     • Many more are listed at http://wiki.apache.org/hadoop/PoweredBy
  21. Useful Links
     • http://wiki.apache.org/hadoop/GettingStartedWithHadoop - Getting Started
     • http://hadoop.apache.org/common/docs/current/cluster_setup.html - Cluster Setup
     • http://developer.yahoo.com/hadoop/tutorial/module4.html - MapReduce
     • http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html - Pig
     • http://hadoop.apache.org/common/docs/current/api/index.html - APIs
     • http://developer.yahoo.com/hadoop/tutorial/ - YDN resource on Hadoop
  22. Q&C
     Contact Information:
     Aditya Sakhuja
     aditya@sakhuja.us
     http://twitter.com/sakhuja
     http://linkedin.com/in/adityasakhuja
