Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Introduction to Hadoop presentation at Carnegie Mellon University, Silicon Valley Campus.


Presentation Transcript

  • © 2010 – 2015 Cloudera, Inc. All Rights Reserved
    Introduction to Apache Hadoop and its Ecosystem
    Mark Grover | Intro to Cloud Computing, Carnegie Mellon SV
    github.com/markgrover/hadoop-intro-fast
  • About Me
    • Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating)
    • Contributor to Apache Hadoop, Hive, Spark, Sqoop, Flume
    • Software developer at Cloudera
    • @mark_grover
    • www.linkedin.com/in/grovermark
  • Co-author, O'Reilly book
    • @hadooparchbook
    • hadooparchitecturebook.com
    • To be released early 2015
  • About the Presentation…
    • What's ahead
      • Fundamental Concepts
      • HDFS: The Hadoop Distributed File System
      • Data Processing with MapReduce
      • Demo
      • Conclusion + Q&A
  • Fundamental Concepts: Why the World Needs Hadoop
  • What's the craze about Hadoop?
    • Volume
      • More and more data being generated
      • Machine-generated data increasing
    • Velocity
      • Data coming in at higher speed
    • Variety
      • Audio, video, images, log files, web pages, social network connections, etc.
  • We Need a System that Scales
    • Too much data for traditional tools
    • Two key problems
      • How to reliably store this data at a reasonable cost
      • How to process all the data we've stored
  • What is Apache Hadoop?
    • Scalable data storage and processing
    • Distributed and fault-tolerant
    • Runs on standard hardware
    • Two main components
      • Storage: Hadoop Distributed File System (HDFS)
      • Processing: MapReduce
    • Hadoop clusters are composed of computers called nodes
      • Clusters range from a single node up to several thousand nodes
  • How Did Apache Hadoop Originate?
    • Heavily influenced by Google's architecture
      • Notably, the Google Filesystem and MapReduce papers
    • Other Web companies quickly saw the benefits
      • Early adoption by Yahoo, Facebook and others
    [Timeline, 2002–2006: Google publishes GFS paper (2003); Google publishes MapReduce paper (2004); Nutch rewritten for MapReduce; Nutch spun off from Lucene; Hadoop becomes Lucene subproject]
  • Comparing Hadoop to Other Systems
    • Monolithic systems don't scale
    • Modern high-performance computing (HPC) systems are distributed
      • They spread computations across many machines in parallel
      • Widely used for scientific applications
    • Let's examine how a typical HPC system works
  • Architecture of a Typical HPC System
    [Diagram: compute nodes connected to a storage system over a fast network]
    • Step 1: Copy input data from the storage system to the compute nodes
    • Step 2: Process the data on the compute nodes
    • Step 3: Copy output data back to the storage system
  • You Don't Just Need Speed…
    • The problem is that we have way more data than code

      $ du -ks code/
      1,087
      $ du -ks data/
      854,632,947,314
  • You Need Speed At Scale
    [Diagram: compute nodes and storage system, with the network between them marked as the bottleneck]
  • Hadoop Design Fundamental: Data Locality
    • This is a hallmark of Hadoop's design
      • Don't bring the data to the computation
      • Bring the computation to the data
    • Hadoop uses the same machines for storage and processing
      • Significantly reduces the need to transfer data across the network
  • Other Hadoop Design Fundamentals
    • Machine failure is unavoidable – embrace it
      • Build reliability into the system
    • "More" is usually better than "faster"
      • Throughput matters more than latency
  • The Hadoop Distributed Filesystem (HDFS)
  • HDFS: Hadoop Distributed File System
    • Inspired by the Google File System
    • Reliable, low-cost storage for massive amounts of data
    • Similar to a UNIX filesystem in some ways
      • Hierarchical
      • UNIX-style paths (e.g., /sales/alice.txt)
      • UNIX-style file ownership and permissions
  • HDFS: Hadoop Distributed File System
    • There are also some major deviations from UNIX filesystems
      • Highly optimized for processing data with MapReduce
      • Designed for sequential access to large files
      • Cannot modify file content once written
      • It's actually a user-space Java process
      • Accessed using special commands or APIs
      • No concept of a current working directory
  • Copying Local Data To and From HDFS
    • Remember that HDFS is distinct from your local filesystem
    • hadoop fs -put copies local files to HDFS
    • hadoop fs -get fetches a local copy of a file from HDFS

      $ hadoop fs -put sales.txt /reports    (client machine → Hadoop cluster)
      $ hadoop fs -get /reports/sales.txt    (Hadoop cluster → client machine)
  • HDFS Demo
    • I will now demonstrate the following
      1. How to list the contents of a directory
      2. How to create a directory in HDFS
      3. How to copy a local file to HDFS
      4. How to display the contents of a file in HDFS
      5. How to remove a file from HDFS
  • Data Processing with MapReduce: A Scalable Data Processing Framework
  • What is MapReduce?
    • MapReduce is a programming model
    • It's a way of processing data
    • You can implement MapReduce in any language
  • Understanding Map and Reduce
    • You supply two functions to process data: Map and Reduce
      • Map: typically used to transform, parse, or filter data
      • Reduce: typically used to summarize results
    • The Map function always runs first
    • The Reduce function runs afterwards, but is optional
    • Each piece is simple, but can be powerful when combined
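The model described above can be sketched in a few lines of plain Python, with no Hadoop involved. This is only an illustration of the programming model; the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are made up for this sketch, not Hadoop APIs.

```python
# A minimal, Hadoop-free sketch of the MapReduce programming model:
# Map turns each record into (key, value) pairs, Reduce summarizes
# all values observed for a key.
from collections import defaultdict

def map_fn(record):
    # Transform one input record into zero or more (key, value) pairs.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Summarize all values for one key.
    return (key, sum(values))

def run_mapreduce(records):
    # Group map output by key (locally, what Hadoop's shuffle does),
    # then run reduce once per key.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))

print(run_mapreduce(["the cat", "the dog"]))  # {'cat': 1, 'dog': 1, 'the': 2}
```

In real Hadoop the grouping step runs across many machines, but the division of labor between the two functions is exactly this.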
  • MapReduce Benefits
    • Scalability
      • Hadoop divides the processing job into individual tasks
      • Tasks execute in parallel (independently) across the cluster
    • Simplicity
      • Processes one record at a time
    • Ease of use
      • Hadoop provides job scheduling and other infrastructure
      • Far simpler for developers than typical distributed computing
  • MapReduce in Hadoop
    • MapReduce processing in Hadoop is batch-oriented
      • A MapReduce job is broken down into smaller tasks
      • Tasks run concurrently
      • Each processes a small amount of the overall input
    • MapReduce code for Hadoop is usually written in Java
      • This uses Hadoop's API directly
    • You can do basic MapReduce in other languages
      • Using the Hadoop Streaming wrapper program
      • Some advanced features require Java code
  • MapReduce Example in Python
    • The following example uses Python, via Hadoop Streaming
    • It processes log files and summarizes events by type
    • I'll explain both the data flow and the code
  • Job Input
    • Here's the job input
    • Each map task gets a chunk of this data to process
      • Typically corresponds to a single block in HDFS

      2013-06-29 22:16:49.391 CDT INFO "This can wait"
      2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
      2013-06-29 22:16:54.276 CDT WARN "This seems bad"
      2013-06-29 22:16:57.471 CDT INFO "More blather"
      2013-06-29 22:17:01.290 CDT WARN "Not looking good"
      2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
      2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
  • Python Code for Map Function

      #!/usr/bin/env python
      import sys

      # Define list of known log levels
      levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

      # Read records from standard input; use whitespace to split into fields
      for line in sys.stdin:
          fields = line.split()
          # Extract "level" field and convert to uppercase for consistency
          level = fields[3].upper()
          # If it matches a known level, print it, a tab separator, and the
          # literal value 1 (since the level can only occur once per line)
          if level in levels:
              print "%s\t1" % level
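The slide's script is Python 2 and reads from standard input, which makes it awkward to try out directly. Here is a Python 3 adaptation of the same mapper logic, rewritten as a function so it can be tested locally; `map_lines` is a name introduced for this sketch, and the sample lines follow the job-input format above.

```python
# Local, Python 3 test of the mapper logic: emit "LEVEL\t1" for each
# log line whose fourth field is a known log level.
import io

LEVELS = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

def map_lines(stream):
    """Map phase: one 'LEVEL\t1' record per matching input line."""
    out = []
    for line in stream:
        fields = line.split()
        if len(fields) > 3 and fields[3].upper() in LEVELS:
            out.append("%s\t1" % fields[3].upper())
    return out

sample = io.StringIO(
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"\n'
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"\n'
)
print(map_lines(sample))  # ['INFO\t1', 'WARN\t1']
```

Under Hadoop Streaming the same logic simply prints each record to standard output instead of collecting a list.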
  • Output of Map Function
    • The map function produces key/value pairs as output

      INFO  1
      INFO  1
      WARN  1
      INFO  1
      WARN  1
      INFO  1
      ERROR 1
  • The "Shuffle and Sort"
    • Hadoop automatically merges, sorts, and groups map output
    • The result is passed as input to the reduce function
    • More on this later…

      Map output:   INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1
                    → (shuffle and sort) →
      Reduce input: ERROR 1, INFO 1, INFO 1, INFO 1, INFO 1, WARN 1, WARN 1
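Locally, the effect of the shuffle and sort can be reproduced with a plain sort, since the key comes first in each record. This is only a single-machine sketch of what Hadoop does across the cluster.

```python
# Sketch of the shuffle-and-sort step: sorting the map output by key
# makes each key's records contiguous, which is exactly what the
# reducer relies on.
map_output = ["INFO\t1", "INFO\t1", "WARN\t1", "INFO\t1",
              "WARN\t1", "INFO\t1", "ERROR\t1"]

reduce_input = sorted(map_output)  # key is the leading field, so this sorts by key
print(reduce_input)
# ['ERROR\t1', 'INFO\t1', 'INFO\t1', 'INFO\t1', 'INFO\t1', 'WARN\t1', 'WARN\t1']
```

In a streaming pipeline the Unix `sort` command plays this role between the mapper and the reducer.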
  • Input to Reduce Function
    • The reduce function receives a key and all values for that key
    • Keys are always passed to reducers in sorted order
    • Although not obvious here, values are unordered

      ERROR 1
      INFO  1
      INFO  1
      INFO  1
      INFO  1
      WARN  1
      WARN  1
  • Python Code for Reduce Function

      #!/usr/bin/env python
      import sys

      # Initialize loop variables
      previous_key = None
      sum = 0

      # Extract the key and value passed via standard input
      for line in sys.stdin:
          key, value = line.split()
          # If key unchanged, increment the count
          if key == previous_key:
              sum = sum + int(value)
      # continued on next slide
  • Python Code for Reduce Function

      # continued from previous slide
          # If key changed, print data for the old level
          else:
              if previous_key:
                  print '%s\t%i' % (previous_key, sum)
              # Start tracking data for the new record
              previous_key = key
              sum = 1

      # Print data for the final key
      print '%s\t%i' % (previous_key, sum)
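As with the mapper, the reducer is easier to try out as a Python 3 function fed from a string instead of standard input. The logic mirrors the two slides above: accumulate a count while the key is unchanged, emit when it changes, and emit once more at the end. `reduce_lines` is a name introduced for this sketch.

```python
# Local, Python 3 test of the reducer logic: read sorted "key\t1"
# records and emit one "key\tcount" record per distinct key.
import io

def reduce_lines(stream):
    out = []
    previous_key, count = None, 0
    for line in stream:
        key, value = line.split()
        if key == previous_key:
            count += int(value)          # key unchanged: increment the count
        else:
            if previous_key:             # key changed: emit the old key's total
                out.append('%s\t%i' % (previous_key, count))
            previous_key, count = key, 1 # start tracking the new key
    if previous_key:                     # emit data for the final key
        out.append('%s\t%i' % (previous_key, count))
    return out

sorted_input = io.StringIO(
    'ERROR\t1\nINFO\t1\nINFO\t1\nINFO\t1\nINFO\t1\nWARN\t1\nWARN\t1\n'
)
print(reduce_lines(sorted_input))  # ['ERROR\t1', 'INFO\t4', 'WARN\t2']
```

Note that this only works because the input is sorted by key; that is the guarantee the shuffle-and-sort step provides.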
  • Output of Reduce Function
    • Its output is a sum for each level

      ERROR 1
      INFO  4
      WARN  2
  • Recap of Data Flow

      Map input:
      2013-06-29 22:16:49.391 CDT INFO "This can wait"
      2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
      2013-06-29 22:16:54.276 CDT WARN "This seems bad"
      2013-06-29 22:16:57.471 CDT INFO "More blather"
      2013-06-29 22:17:01.290 CDT WARN "Not looking good"
      2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
      2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

      Map output:    INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1
                     → (shuffle and sort) →
      Reduce input:  ERROR 1, INFO 1, INFO 1, INFO 1, INFO 1, WARN 1, WARN 1
      Reduce output: ERROR 1, INFO 4, WARN 2
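The whole flow above can be simulated end to end on one machine, the local equivalent of piping the log through the mapper, `sort`, and the reducer. This is a sketch under the assumption that the input matches the log format shown; it uses `collections.Counter` as a shortcut for the reduce step.

```python
# End-to-end local simulation of the streaming job on the recap data,
# roughly: cat log.txt | map.py | sort | reduce.py
from collections import Counter

log = """\
2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
"""

LEVELS = {'TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'}

# Map: one "level\t1" record per line whose fourth field is a known level
mapped = ["%s\t1" % line.split()[3] for line in log.splitlines()
          if line.split()[3] in LEVELS]

# Shuffle and sort, then reduce: sum the 1s for each level
reduce_input = sorted(mapped)
counts = Counter(rec.split('\t')[0] for rec in reduce_input)
for level in sorted(counts):
    print('%s\t%i' % (level, counts[level]))
```

Running this prints one line per level (ERROR 1, INFO 4, WARN 2), matching the reduce output above.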
  • How to Run a Hadoop Streaming Job
    • I'll demonstrate this now…
  • The Hadoop Ecosystem: Open Source Tools that Complement Hadoop
  • The Hadoop Ecosystem
    • "Core Hadoop" consists of HDFS and MapReduce
      • These are the kernel of a much broader platform
    • Hadoop has many related projects
      • Some help you integrate Hadoop with other systems
      • Others help you analyze your data
    • These are not considered "core Hadoop"
      • Rather, they're part of the Hadoop ecosystem
      • Many are also open source Apache projects
  • Visual Overview of a Complete Workflow
    [Diagram: a Hadoop cluster with Impala, surrounded by workflow steps: import transaction data from an RDBMS; sessionize web log data with Pig; sentiment analysis on social media with Hive; analyst uses Impala for business intelligence; generate nightly reports using Pig, Hive, or Impala; build product recommendations for the Web site]
  • Key Points
    • We're generating massive volumes of data
      • This data can be extremely valuable
      • Companies can now analyze what they previously discarded
    • Hadoop supports large-scale data storage and processing
      • Heavily influenced by Google's architecture
      • Already in production by thousands of organizations
    • HDFS is Hadoop's storage layer
    • MapReduce is Hadoop's processing framework
    • Many ecosystem projects complement Hadoop
      • Some help you to integrate Hadoop with existing systems
      • Others help you analyze the data you've stored
  • Highly Recommended Books
    • Author: Tom White, ISBN: 1-449-31152-0
    • Author: Eric Sammer, ISBN: 1-449-32705-2
  • Questions?
    • Thank you for attending!
    • I'll be happy to answer any additional questions now…
    • Demo and slides at github.com/markgrover/hadoop-intro-fast
    • Twitter: mark_grover
    • Survey page: tiny.cloudera.com/mark