Hadoop pycon2011uk
 

    Hadoop pycon2011uk Presentation Transcript

    • Respect for the elephant – Hadoop
        Aditya Sakhuja, aditya@sakhuja.us
    • Whoami
        • Software Engineer @ Yahoo Inc.
        • Web Search -> Cloud Platforms -> Display Ads Serving
        • http://linkedin.com/in/adityasakhuja
        9/24/11   PyCon UK 2011
    • Agenda
        • Motivation
        • History
        • Ecosystem
        • Daemon processes / High Level View
        • Map Reduce Data Flow
        • HDFS Architecture / Replication
        • Can / Cannot
        • Getting started yourself
        • Demo
        • Companies Involved
        • Q&A
    • Motivation
        • ‘Traditional’ large-scale computing systems - problems
        • Desired features in an improved system
        • How Hadoop addresses them
    • ‘Traditional’ large-scale computing systems - problems
        • CPU intensive over Data intensive
        • MPI, PVM, RPCs – Parallel Computation Frameworks
        • Programming for traditional distributed systems is complex
            – Data exchange requires synchronization
            – Temporal dependencies are complicated
            – It is difficult to deal with partial failures of the system
        • Data typically stored on SAN
        • Data brought to compute nodes @ runtime
    • Desired Features in a Large-Scale Data System
        • Data Driven
            – A new, improved system should avoid data bottlenecks
        • Scalable
        • Consistent
        • Recoverable (Data / Processor)
        • Partial Failure Support
    • What Hadoop offers
        • Provides a high-level programming model
            – No need to worry about locking, temporal dependencies, sockets ..
        • Plus the desired features listed on the previous slide :)
    • History
        • Hadoop is based on work done by Google in the late 1990s/early 2000s
        • Specifically, on papers describing the Google File System (GFS), published in 2003, and Map/Reduce, published in 2004
        • Hadoop MapReduce NextGeneration – 2011
            – http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
    • Apache Hadoop Ecosystem
        • Hadoop Common: The common utilities that support the other Hadoop subprojects.
        • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
        • Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
        Other Hadoop-related projects at Apache include:
        • Cassandra™: A scalable multi-master database with no single points of failure.
        • HBase™: A scalable, distributed database that supports structured data storage for large tables.
        • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
        • Mahout™: A scalable machine learning and data mining library.
        • Pig™: A high-level data-flow language and execution framework for parallel computation.
        Source: http://hadoop.apache.org/
    • Hadoop Key Daemon Processes
        • NameNode
        • Secondary NameNode
        • DataNode
        • JobTracker
        • TaskTracker
    • High level Hadoop cluster view
    • MapReduce Data Flow
    • HDFS Architecture
    • HDFS Replication
    • Map Reduce Program Components
        • MapReduce programs generally consist of three portions
            – The Mapper
            – The Reducer
            – The driver code
        • Additional components:
            – Combiner (often the same code as the Reducer)
            – Custom Partitioner
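
To make the roles of these components concrete, here is a minimal in-process sketch of a word count job in plain Python. It does not use Hadoop at all; the names `mapper`, `reducer`, `combiner`, `partitioner` and `run_job`, and the first-character partitioning rule, are assumptions made for illustration only.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: sum every count collected for a single key."""
    return word, sum(counts)


# Summing is associative, so the reducer can double as the combiner.
combiner = reducer


def partitioner(key, num_reducers):
    """Hypothetical custom rule: route a key by its first character."""
    return ord(key[0]) % num_reducers


def run_job(splits, num_reducers=2):
    """Driver: wire mapper, combiner, partitioner and reducer together."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        # One map task per input split, collecting its local output.
        local = defaultdict(list)
        for line in split:
            for key, value in mapper(line):
                local[key].append(value)
        # Combine on the map side, then send each key to its reducer.
        for key, values in local.items():
            ckey, cvalue = combiner(key, values)
            partitions[partitioner(ckey, num_reducers)][ckey].append(cvalue)
    result = {}
    for part in partitions:
        # Each reducer sees its own keys in sorted order.
        for key in sorted(part):
            rkey, rvalue = reducer(key, part[key])
            result[rkey] = rvalue
    return result
```

Calling `run_job([["the elephant", "the cluster"], ["the elephant"]])` treats each inner list as one input split. The combiner shrinks each split's output before it is partitioned, which is the reason a real job would use one.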
    • Hadoop Is / Is Not
        • High Bandwidth, High Latency System
        • Not a substitute for a DBMS, at least not on its own
        • HDFS is not yet a Highly Available FS. NameNode is a SPOF
        • Is a “Shared nothing” Architecture
            – Mappers do not talk to each other, and neither do Reducers
    • Getting started yourself
        Requirements:
        • Java SE SDK (download JDK 6 or higher)
        • Download and install
            – Hadoop Common: 0.20.203.X - current stable version
            – Hadoop HDFS: 0.21 – stable version
            – Hadoop MapReduce: 0.21 – stable version
        • Subscribe to mailing lists for Hadoop subprojects, depending on your role
        • Additionally/alternatively, one can set up VMs from Cloudera / Yahoo
        • Details:
            – http://wiki.apache.org/hadoop/GettingStartedWithHadoop
            – http://developer.yahoo.com/hadoop/tutorial/module7.html#basic
    • Simple Demo
        • Using
            – Pig
            – Map/Reduce
    • Streaming Jobs
        • Any language that can read from stdin and write to stdout
        • hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper myMapScript.py -reducer myReduceScript.py -file myMapScript.py -file myReduceScript.py
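
As an illustration of what scripts such as `myMapScript.py` and `myReduceScript.py` might contain, here is a minimal word count sketch. The function names `map_lines`, `reduce_lines` and `run` are hypothetical, and the sketch assumes the usual streaming convention of tab-separated `key<TAB>value` lines, with the reducer receiving its input already sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: turn each input line into one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Relies on the input arriving sorted by key, so equal keys are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run(step, stdin=sys.stdin, stdout=sys.stdout):
    """Stream stdin through one step to stdout.

    A mapper script would end with run(map_lines),
    a reducer script with run(reduce_lines).
    """
    for out in step(stdin):
        stdout.write(out + "\n")
```

Because both steps only read lines and write lines, they can be checked locally without a cluster by feeding the mapper output through a sort and into the reducer, for example `list(reduce_lines(sorted(map_lines(["the elephant", "the cluster"]))))`.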
    • Companies involved
        • Yahoo - 4500-node cluster (2*4 cores, 4*1 TB disk, 16 GB RAM) – (AdServer, Search)
        • Hortonworks, Cloudera
        • Facebook
        • A9 (Amazon Product Search)
        • eBay - 532-node cluster – (8 * 532 cores, 5.3 PB)
        • Last.fm, Twitter …
        • …… a lot more can be found on the link below:
            http://wiki.apache.org/hadoop/PoweredBy
    • Useful Links
        • http://wiki.apache.org/hadoop/GettingStartedWithHadoop - Getting Started
        • http://hadoop.apache.org/common/docs/current/cluster_setup.html - Cluster Setup
        • http://developer.yahoo.com/hadoop/tutorial/module4.html - MapReduce
        • http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html - PIG
        • http://hadoop.apache.org/common/docs/current/api/index.html - APIs
        • http://developer.yahoo.com/hadoop/tutorial/ - YDN resource on Hadoop
    • Q&C
        Contact Information:
        Aditya Sakhuja
        aditya@sakhuja.us
        http://twitter.com/sakhuja
        http://linkedin.com/in/adityasakhuja