Processing Big Data (Chapter 3, SC 11 Tutorial)
This is Chapter 3 of a tutorial that I gave at SC 11 on November 14, 2011.

Presentation Transcript

    • An Introduction to Data Intensive Computing. Chapter 3: Processing Big Data. Robert Grossman, University of Chicago and Open Data Group; Collin Bennett, Open Data Group. November 14, 2011.
    • 1. Introduction (0830-0900)
        a. Data clouds (e.g. Hadoop)
        b. Utility clouds (e.g. Amazon)
      2. Managing Big Data (0900-0945)
        a. Databases
        b. Distributed file systems (e.g. Hadoop)
        c. NoSQL databases (e.g. HBase)
      3. Processing Big Data (0945-1000 and 1030-1100)
        a. Multiple virtual machines & message queues
        b. MapReduce
        c. Streams over distributed file systems
      4. Lab using Amazon's Elastic MapReduce (1100-1200)
    • Section 3.1: Processing Big Data Using Utility and Data Clouds. [Photo: a Google production rack of servers from about 1999.]
    • How do you do analytics over commodity disks and processors? How do you improve the efficiency of programmers?
    • Serial & SMP Algorithms. [Diagram: a serial algorithm runs one task at a time against local disk and memory; a symmetric multiprocessing (SMP) algorithm runs several tasks sharing local disk and memory.]
    • Pleasantly (= Embarrassingly) Parallel. [Diagram: independent chains of tasks, each with its own local disk, coordinated with MPI.] Need to partition data, start tasks, collect results. Often the tasks are organized into a DAG. A minimal sketch of the pattern follows.
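    A minimal sketch of this pattern in Python, using the standard multiprocessing module; the partitioning scheme and the per-task word count are stand-ins for whatever the real tasks would be:

        from multiprocessing import Pool

        def count_words(partition):
            # the per-partition task: runs independently, no communication
            counts = {}
            for line in partition:
                for word in line.split():
                    counts[word] = counts.get(word, 0) + 1
            return counts

        if __name__ == "__main__":
            data = ["it was the best of times", "it was the worst of times"]
            partitions = [data[0::2], data[1::2]]                     # 1. partition the data
            results = Pool(processes=2).map(count_words, partitions)  # 2. start tasks
            totals = {}                                               # 3. collect results
            for counts in results:
                for word, n in counts.items():
                    totals[word] = totals.get(word, 0) + n
            print(totals)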
    • How Do You Program a Data Center?
    • The Google Data Stack: The Google File System (2003); MapReduce: Simplified Data Processing… (2004); BigTable: A Distributed Storage System… (2006)
    • Google's Large Data Cloud (Google's early data stack, circa 2000): Applications; Compute Services (Google's MapReduce); Data Services (Google's BigTable); Storage Services (Google File System, GFS)
    • Hadoop's Large Data Cloud (Open Source). Hadoop's stack: Applications; Compute Services (Hadoop's MapReduce); Data Services (NoSQL, e.g. HBase); Storage Services (Hadoop Distributed File System, HDFS)
    • A very nice recent book by Barroso and Hölzle.
    • The Amazon Data Stack: "Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados." (SOSP '07)
    • Amazon Style Data Cloud. [Diagram: a load balancer in front of many EC2 instances, with the Simple Queue Service, SimpleDB (SDB), and S3 storage services alongside.]
    • Open Source Versions:
      – Eucalyptus: ability to launch VMs; S3-like storage
      – OpenStack: ability to launch VMs; S3-like storage (Swift)
      – Cassandra: key-value store like S3; columns like BigTable
      Many other open source Amazon-style services are available.
    • Some Programming Models for Data Centers.
      Operations over a data center of disks:
      – MapReduce ("string-based" scans of data)
      – User-Defined Functions (UDFs) over the data center
      – Launch VMs that all have access to highly scalable and available disk-based data
      – SQL and NoSQL over the data center
      Operations over a data center of memory:
      – Grep over distributed memory
      – UDFs over distributed memory
      – Launch VMs that all have access to highly scalable and available memory-based data
      – SQL and NoSQL over distributed memory
    • Section 3.2: Processing Data by Scaling Out Virtual Machines
    • Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service
    • Task with Messaging Service & Use S3 (Variant 1). [Diagram: a control VM launches and tasks worker VMs through messaging services (AWS SQS, an AMQP service, etc.); the workers read from and write to S3.]
    • Task with Messaging Service & Use NoSQL DB (Variant 2). [Diagram: the same control VM, messaging services, and worker VMs, but the workers use AWS SimpleDB for storage.]
    • Task with Messaging Service & Use Clustered FS (Variant 3). [Diagram: the same layout, with the workers using GlusterFS for storage.] A sketch of a worker in this pattern follows.
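    A hedged sketch of a worker VM in this pattern, written with the boto3 library; the queue URL, bucket name, and process function are placeholders, not part of the original tutorial:

        import boto3

        QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tasks"  # placeholder
        BUCKET = "example-results-bucket"                                     # placeholder

        def process(task_description):
            # stand-in for whatever work the control VM assigns
            return task_description.upper()

        sqs = boto3.client("sqs")
        s3 = boto3.client("s3")

        while True:
            # poll the messaging service for a task from the control VM
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=1,
                                       WaitTimeSeconds=10)
            for msg in resp.get("Messages", []):
                result = process(msg["Body"])
                # Variant 1: write the result to S3
                s3.put_object(Bucket=BUCKET,
                              Key="results/" + msg["MessageId"],
                              Body=result)
                # delete the message so no other worker repeats the task
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])

    The same worker works unchanged for Variants 2 and 3; only the storage calls change (SimpleDB writes, or writes to a mounted GlusterFS path).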
    • Section 3.3: MapReduce. [Google 2004 technical report.]
    • Core Concepts:
      – Data are (key, value) pairs, and that's it
      – Partition data over commodity nodes filling racks in a data center
      – Software handles failures, restarts, etc. This is the hard part.
      – Basic examples: word count, inverted index
    • Processing Big Data Pattern 2: MapReduce
    • [Diagram: on each node a Task Tracker runs Map tasks that read from HDFS and write intermediate results to local disk; the shuffle & sort phase moves those results to Reduce tasks, which write their output back to HDFS.]
    • Example: Word Count & Inverted Index. How do you count the words in a million books? For example, (best, 7). Inverted index: (best; page 1, page 82, …), (worst; page 1, page 12, …). [Image: cover of the serial, Vol. V, 1859, London.]
    • Assume you have a cluster of 50 computers, each with an attached local disk and half full of web pages. What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
    • Basic Pattern: Strings. 1. Extract words from web pages in parallel. 2. Hash and sort words. 3. Count (or construct an inverted index) in parallel.
    • What about data records? The same three steps apply (a sketch follows).
      Strings: 1. Extract words from web pages in parallel. 2. Hash and sort words. 3. Count (or construct an inverted index) in parallel.
      Records: 1. Extract binned field values from data records in parallel. 2. Hash and sort binned field values. 3. Count (or construct an inverted index) in parallel.
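    To make the record variant concrete, here is a small hedged sketch in Python; the record format and the binning rule are invented for illustration:

        def bin_value(field, value):
            # hypothetical binning rule: bucket numeric values by decade
            return "%s:%d" % (field, int(value) // 10 * 10)

        def map_record(record):
            # emit one (binned field value, 1) pair per field, exactly
            # where the string pattern emits (word, 1)
            for field, value in record.items():
                yield (bin_value(field, value), 1)

        print(list(map_record({"age": 37, "visits": 12})))
        # e.g. [('age:30', 1), ('visits:10', 1)]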
    • Map-Reduce Example. Input is files with one document per record. The user specifies the map function: key = document URL, value = document contents.
      Input of map: ("doc cdickens two cities", "it was the best of times")
      Output of map: ("it", 1), ("was", 1), ("the", 1), ("best", 1)
    • Example (cont'd). The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase). The user-defined reduce function combines all the values associated with the same key.
      Input of reduce: key = "it", values = 1, 1; key = "was", values = 1, 1; key = "best", values = 1; key = "worst", values = 1
      Output of reduce: ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
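    The whole pipeline can be imitated in a few lines of ordinary Python. This is only a sketch of how map, shuffle/sort, and reduce fit together, not how Hadoop or Google's library implements them:

        from itertools import groupby
        from operator import itemgetter

        def map_fn(url, contents):
            for word in contents.split():
                yield (word, 1)       # output of map: (word, 1) pairs

        docs = {"doc cdickens two cities": "it was the best of times"}

        # Map phase: apply map_fn to every (key, value) input record
        pairs = [kv for url, text in docs.items() for kv in map_fn(url, text)]

        # Shuffle/sort phase: gather all pairs with the same key together
        pairs.sort(key=itemgetter(0))

        # Reduce phase: combine all values associated with the same key
        for word, group in groupby(pairs, key=itemgetter(0)):
            print((word, sum(count for _, count in group)))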
    • Why Is Word Count Important? It is one of the most important examples for the type of text processing often done with MapReduce. There is an important mapping (inversion):
      document <-----> data record
      words <-----> (field, value)
    • Pleasantly parallel vs. MapReduce:

                        Pleasantly Parallel         MapReduce
      Data structure    Arbitrary                   (key, value) pairs
      Functions         Arbitrary                   Map & Reduce
      Middleware        MPI (message passing)       Hadoop
      Ease of use       Difficult                   Medium
      Scope             Wide                        Narrow
      Challenge         Getting something working   Moving to MapReduce
    • Common MapReduce Design Patterns: word count; inversion (inverted index); computing simple statistics; computing windowed statistics; sparse matrices (document-term, data record-FieldBinValue, …); site-entity statistics; PageRank; partitioned and ensemble models; EM.
    • Section 3.4: User Defined Functions over DFS. sector.sf.net
    • Processing Big Data Pattern 3: User Defined Functions over Distributed File Systems
    • Sector/Sphere. Sector/Sphere is a platform for data intensive computing.
    • Idea 1: Apply User Defined Functions (UDFs) to Files in a Distributed File System. [Diagram: the map/shuffle stage and the reduce stage are each expressed as a UDF.] This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System. A rough sketch of the idea follows.
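    A rough sketch of Idea 1 in Python rather than Sphere's actual C++ API: apply an arbitrary user-defined function to every file of a directory tree in parallel, with a local directory standing in for the distributed file system:

        import os
        from multiprocessing import Pool

        def udf(path):
            # a placeholder user-defined function: a grep-style line filter
            with open(path) as f:
                return [line for line in f if "error" in line]

        def apply_udf_to_files(root):
            # one independent task per file segment
            paths = [os.path.join(d, name)
                     for d, _, names in os.walk(root)
                     for name in names]
            return Pool().map(udf, paths)

    Because the UDF is arbitrary, map-style scans, filters, and transforms all fit the same interface; MapReduce becomes the special case where the UDFs are a map and a reduce.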
    • Idea 2: Add Security From the Start. [Diagram: a client connects to the master server and a security server over SSL; data flows between the client and the slaves.]
      – The security server maintains information about users and slaves.
      – User access control: password and client IP address.
      – File level access control.
      – Messages are encrypted over SSL; a certificate is used for authentication.
      – Sector is a good basis for HIPAA compliant applications.
    • Idea 3: Extend the Stack to Include Network Transport Services. [Diagram: the Google and Hadoop stacks consist of Compute Services, Data Services, and Storage Services; Sector adds a Routing & Transport Services layer beneath them.]
    • Section 3.5: Computing with Streams: Warming Up with Means and Variances
    • Warm Up: Partitioned Means. Means and variances cannot be computed naively when the data is in distributed partitions. Step 1: compute the local tuple (Σ xi, Σ xi², ni) in parallel for each partition. Step 2: compute the global mean and variance from these tuples.
    • Trivial Observation 1. If si = Σ xi is the i-th local sum, then the global mean = Σ si / Σ ni. If only the local means for each partition are passed (without the corresponding counts), then there is not enough information to compute the global mean. The same trick works for variance, but you need to pass the triples (Σ xi, Σ xi², ni).
    • Trivial Observation 2. To reduce the data passed over the network, combine appropriate statistics as early as possible. Consider the average. Recall that with MapReduce there are four steps (Map, Shuffle, Sort, and Reduce), and Reduce pulls data from the local disk of the node that performs the Map. A Combine step in MapReduce combines local data before it is pulled for the Reduce step. There are built-in combiners for counts, means, etc. Both observations are sketched below.
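    A sketch of both observations in Python: each partition first combines its data down to a (Σ xi, Σ xi², ni) triple, and the triples are then merged into the global mean and variance:

        def local_stats(xs):
            # combine step: reduce a partition to (sum, sum of squares, count)
            return (sum(xs), sum(x * x for x in xs), len(xs))

        def global_mean_and_variance(triples):
            s = sum(t[0] for t in triples)    # global sum
            s2 = sum(t[1] for t in triples)   # global sum of squares
            n = sum(t[2] for t in triples)    # global count
            mean = s / float(n)
            variance = s2 / float(n) - mean * mean   # population variance
            return mean, variance

        partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]
        triples = [local_stats(p) for p in partitions]
        print(global_mean_and_variance(triples))
        # matches the mean and variance of the concatenated data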
    • Section 3.6: Hadoop Streams
    • Processing Big Data Pattern 4: Streams over Distributed File Systems
    • Hadoop Streams. In addition to the Java API, Hadoop offers:
      – a streaming interface for any language that supports reading from and writing to standard in and standard out
      – Pipes for C++
      Why would you want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to C++ libraries like Boost and the GNU Scientific Library (GSL), and to R modules.
    • Pros and Cons.
      Java: + best documented; + largest community; - more LOC per MR job.
      Python: + efficient memory handling; + programmers can be very efficient; - limited logging/debugging.
      R: + vast collection of statistical algorithms; - poor error handling and memory handling; - less familiar to developers.
    • Word Count Python Mapper

        #!/usr/bin/env python
        import sys

        def read_input(file):
            for line in file:
                yield line.split()

        def main(separator='\t'):
            data = read_input(sys.stdin)
            for words in data:
                for word in words:
                    # emit one tab-separated (word, 1) pair per line
                    print '%s%s%d' % (word, separator, 1)

        if __name__ == '__main__':
            main()
    • Word Count Python Reducer

        #!/usr/bin/env python
        import sys
        from itertools import groupby
        from operator import itemgetter

        def read_mapper_output(file, separator='\t'):
            for line in file:
                yield line.rstrip().split(separator, 1)

        def main(sep='\t'):
            data = read_mapper_output(sys.stdin, separator=sep)
            # the shuffle/sort phase delivers pairs grouped by word
            for word, group in groupby(data, itemgetter(0)):
                total_count = sum(int(count) for word, count in group)
                print "%s%s%d" % (word, sep, total_count)

        if __name__ == '__main__':
            main()
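    Because Hadoop Streaming only assumes that the mapper and reducer read standard in and write standard out, the two scripts above can be sanity-checked without a cluster: a shell pipeline such as "cat input.txt | python mapper.py | sort | python reducer.py" mimics the map, shuffle/sort, and reduce phases locally. On a cluster, the same scripts would be passed to the Hadoop streaming jar via its -mapper and -reducer options.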
    • MalStone Benchmark

                                  MalStone A    MalStone B
      Hadoop MapReduce            455m 13s      840m 50s
      Hadoop Streams (Python)     87m 29s       142m 32s
      C++ implemented UDFs        33m 40s       43m 44s

      Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records per node.
    • Word Count R Mapper

        trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
        # splitIntoWords was used but not shown on the slide;
        # a standard definition:
        splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

        con <- file("stdin", open = "r")
        while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
            line <- trimWhiteSpace(line)
            words <- splitIntoWords(line)
            # emit one tab-separated (word, 1) pair per line
            cat(paste(words, "\t1\n", sep = ""), sep = "")
        }
        close(con)
    • Word Count R Reducer

        trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
        splitLine <- function(line) {
            val <- unlist(strsplit(line, "\t"))
            list(word = val[1], count = as.integer(val[2]))
        }

        env <- new.env(hash = TRUE)
        con <- file("stdin", open = "r")
        while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
            line <- trimWhiteSpace(line)
            split <- splitLine(line)
            word <- split$word
            count <- split$count
    • Word Count R Reducer (cont'd)

            # accumulate the running count for each word in the environment
            if (exists(word, envir = env, inherits = FALSE)) {
                oldcount <- get(word, envir = env)
                assign(word, oldcount + count, envir = env)
            }
            else assign(word, count, envir = env)
        }
        close(con)

        for (w in ls(env, all = TRUE))
            cat(w, "\t", get(w, envir = env), "\n", sep = "")
    • Word Count Java Mapper

        public static class Map
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);   // emit (word, 1)
                }
            }
        }
    • Word Count Java Reducer

        public static class Reduce
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();   // sum the counts for this word
                }
                context.write(key, new IntWritable(sum));
            }
        }
    • Code Comparison – Word Count Mapper. [This slide shows the Python, Java, and R mappers from the preceding slides side by side for comparison.]
    • Code Comparison – Word Count Reducer. [This slide shows the Python, Java, and R reducers from the preceding slides side by side for comparison.]
    • Questions? For the most current version of these notes, see rgrossman.com