Introduction to Scalding and Monoids
This is a quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.

Transcript

  • 1. Introduction to Scalding and Monoids. Hugo Gävert, @hgavert
  • 2. Map Reduce
    – Programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
    – Inspired by the map and reduce functions commonly found in functional programming languages:
      • map() performs translations and filtering on given values
      • reduce() performs a summary operation on given values
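The two functions the slide refers to can be illustrated with plain Scala collections. This is a hypothetical sketch of the idea only, not Hadoop code: map transforms each element independently (hence parallelizable), reduce summarizes the results.

```scala
object MapReduceIdea {
  def main(args: Array[String]): Unit = {
    // map(): translate each value independently -> trivially parallel
    val lengths: List[Int] = List("to", "be", "or", "not").map(_.length)

    // reduce(): summary operation over the mapped values
    val total: Int = lengths.reduce(_ + _)

    println(lengths) // List(2, 2, 2, 3)
    println(total)   // 9
  }
}
```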
  • 3. How does it work? (Diagram found on the Internet, source forgotten.)
  • 4. The scene
    – Hadoop: open source implementation of Google's MapReduce and Google File System papers
      • Java…
    – Higher-level frameworks/platforms:
      • Hive ≈ SQL
      • Pig (procedural, ≈ "more programming than SQL")
      • Cascading: Java MR application framework for enterprise data flows. If you must do Java, do this!
      • Scalding: Scala DSL for Cascading, easy to pick up yet very powerful
      • Cascalog: Clojure DSL for Cascading, declarative, logic programming
  • 5. The scene (*)
    (* Borrowed from the excellent presentation by Vitaly Gordon and Christopher Severs.)
  • 6. "Hadoop is a distributed system for counting words"

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }
  • 7. What do we actually want to do?
    Documents (lines) → Tokenize → GroupBy (token) → Count → Word count
  • 8. Word Count in Scalding

    package com.sanoma.cda.examples
    import com.twitter.scalding._

    class WordCount1(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }

    There is scald.rb to get you started (get it from the GitHub project).
    Building and running a fat jar (for local mode include Hadoop; for the cluster mark it "provided"):

    > sbt assembly
    > java -jar target/scala-2.10/scalding_talk-assembly-0.1.jar com.sanoma.cda.examples.WordCount1 --local --input data/11.txt.utf-8 --output wc.txt
    > hadoop jar job-jars/scalding_talk-assembly-0.1.jar -Dmapred.reduce.tasks=70 com.sanoma.cda.examples.WordCount1 --hdfs --input /data/AliceInWonderland --output /user/Alice_wc

    Results (words, then counts — note the punctuated variants of "Alice"):
    the and to a of she said in it was you I as that Alice … Alice, Alice. Alice; Alice's Alice: (Alice Alice! Alice,)
    1664 1172 780 773 662 596 484 416 401 356 329 301 260 246 226 221 76 54 16 11 7 4 3 2
  • 9. Word Count in Scalding

    package com.sanoma.cda.examples
    import com.twitter.scalding._

    class WordCount2(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => tokenize(line) }
        .filter('word) { word: String => word != "" }
        .groupBy('word) { _.size }
        .groupAll { _.sortBy(('size, 'word)).reverse } // this is just for easy results
        .write(Tsv(args("output")))

      def tokenize(text: String): Array[String] = {
        text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
      }
    }

    the and to a of it she said you in i alice was that as her with at on all
    1804 912 801 684 625 541 538 462 429 428 400 385 358 291 272 248 228 224 204 197
  • 10. Word count in Scalding

    Almost a 1-to-1 relation between the process and the Scalding code!
    UDFs directly in Scala, and Java libraries can be used.

    Documents (lines) → Tokenize → GroupBy (token) → Count → Word count

    package com.sanoma.cda.examples
    import com.twitter.scalding._

    class WordCount2(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { tokenize }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))

      def tokenize(text: String): Array[String] = {
        text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
      }
    }
  • 11. About Scalding
    – Started at Twitter; years of production use
    – Well tested and optimized by different teams, including Twitter, Concurrent Inc., Etsy, …
    – Has a very fast local mode (no need to install Hadoop locally)
    – Flow planner is designed to be portable → in the future, the same jobs might run on a Storm cluster, for example
    – Scala… a very nice programming language (YMMV): functional & object oriented, has a REPL
  • 12. Scalding Functions
    – 3 APIs:
      • Fields-based API (easy to start from here)
      • Type-safe API
      • Matrix API
    – Fields-based API:
      • Map-like functions: map, flatMap, project, insert, filter, limit, …
      • Grouping/reducing functions: groupBy, groupAll; .size, .sum, .average, .sizeAveStdev, .toList, .max, .sortBy, .reduce, .foldLeft, .pivot, …
      • Join operations: joinWithSmaller, joinWithLarger, joinWithTiny, crossWithTiny; InnerJoin, LeftJoin, RightJoin, OuterJoin
  • 13. Scalding matrix API

    package com.twitter.scalding.examples
    import com.twitter.scalding._
    import com.twitter.scalding.mathematics.Matrix

    /**
     * Loads a directed graph adjacency matrix where a[i,j] = 1 if there is an edge from a[i] to b[j]
     * and computes the cosine of the angle between every two pairs of vectors
     */
    class ComputeCosineJob(args: Args) extends Job(args) {
      import Matrix._

      val adjacencyMatrix = Tsv(args("input"), ('user1, 'user2, 'rel))
        .read
        .toMatrix[Long, Long, Double]('user1, 'user2, 'rel)

      // we compute the L2 normalized adjacency graph
      val matL2Norm = adjacencyMatrix.rowL2Normalize

      // we compute the inner product of the normalized matrix with itself,
      // which is equivalent to computing the cosine: AA^T / (||A|| * ||A||)
      val cosDist = matL2Norm * matL2Norm.transpose

      cosDist.write(Tsv(args("output")))
    }
  • 14. What is a monoid?
    – Closure: ∀ a, b ∈ T : a • b ∈ T
    – Associativity: ∀ a, b, c ∈ T : (a • b) • c = a • (b • c)
    – Identity element: ∃ I ∈ T : ∀ a ∈ T : I • a = a • I = a

    As a Scala trait:

    trait Monoid[T] {
      def zero: T
      def plus(left: T, right: T): T
    }
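The trait from the slide can be instantiated for ordinary integer addition. The names below (IntAddition, MonoidLaws) are ours for illustration, not Algebird's; the assertions check the identity and associativity laws for this instance.

```scala
// The Monoid trait as on the slide
trait Monoid[T] {
  def zero: T
  def plus(left: T, right: T): T
}

// Integers under addition form a monoid: zero is 0, plus is +
object IntAddition extends Monoid[Int] {
  def zero: Int = 0
  def plus(left: Int, right: Int): Int = left + right
}

object MonoidLaws {
  def main(args: Array[String]): Unit = {
    val m = IntAddition
    // identity element: I + a == a + I == a
    assert(m.plus(m.zero, 42) == 42 && m.plus(42, m.zero) == 42)
    // associativity: (a + b) + c == a + (b + c)
    assert(m.plus(m.plus(1, 2), 3) == m.plus(1, m.plus(2, 3)))
    println("laws hold")
  }
}
```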
  • 15. Examples of monoids
    – Numbers, strings, lists, sets, maps
    – Algorithms:
      • Min, Max
      • Moments (count, mean, std, …)
      • Approximate histograms, quantiles
      • Approximate data structures: Bloom filter, CountMinSketch, HyperLogLog
      • Stochastic gradient descent
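Maps over a monoid form a monoid themselves: zero is the empty map, and plus merges two maps, combining the values of shared keys. This is a hand-rolled sketch of the Map[String, Int] monoid (Algebird provides the real one), and it is exactly what powers the word count with a Map monoid shown later in the talk.

```scala
// A Map[String, Int] monoid sketch: merge maps, summing shared keys.
// (Our own illustration; Algebird ships this out of the box.)
object MapMonoid {
  def zero: Map[String, Int] = Map.empty

  def plus(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  def main(args: Array[String]): Unit = {
    val left  = Map("alice" -> 2, "said" -> 1)
    val right = Map("alice" -> 1, "queen" -> 1)
    // alice -> 3, said -> 1, queen -> 1
    println(plus(left, right))
  }
}
```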
  • 16. What's the point?

    a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7
    (a0 + a1) + (a2 + a3) + (a4 + a5) + (a6 + a7)
    ( (a0 + a1) + (a2 + a3) ) + ( (a4 + a5) + (a6 + a7) )
    ( ( (a0 + a1) + (a2 + a3) ) + ( (a4 + a5) + (a6 + a7) ) )

    → Parallelism
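The regrouping above can be checked directly: because plus is associative, the input can be split into chunks, each chunk summed independently (on different machines in the MapReduce setting), and the partial sums combined, with the same result as a purely sequential fold.

```scala
object ParallelSum {
  def main(args: Array[String]): Unit = {
    val as: Vector[Int] = Vector(0, 1, 2, 3, 4, 5, 6, 7)

    // sequential: a0 + a1 + ... + a7
    val sequential: Int = as.reduce(_ + _)

    // chunked: (a0+a1) + (a2+a3) + (a4+a5) + (a6+a7),
    // each pair could be summed on a different worker
    val chunked: Int = as.grouped(2).map(_.reduce(_ + _)).reduce(_ + _)

    assert(sequential == chunked) // associativity: any split agrees
    println(chunked) // 28
  }
}
```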
  • 17. What's the point?

    a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7
    (a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7) + a8

    → Incremental aggregation
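Concretely: a stored aggregate can absorb new data without recomputing from scratch, again thanks to associativity. A minimal sketch:

```scala
object IncrementalSum {
  def main(args: Array[String]): Unit = {
    val stored  = (0 to 7).sum  // yesterday's aggregate: a0 + ... + a7
    val updated = stored + 8    // fold in the new element a8 only

    // same result as recomputing the whole sum from raw data
    assert(updated == (0 to 8).sum)
    println(updated) // 36
  }
}
```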
  • 18. What's the point?
    – Easily unit-testable operations
    – Simple aggregation code
    → Better quality
  • 19. Word Count with Map Monoid

    package com.sanoma.cda.examples
    import com.twitter.scalding._
    import com.twitter.algebird.Operators._

    class WordCount3(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { tokenize }
        .map('word -> 'word) { w: String => Map[String, Int](w -> 1) }
        .groupAll { _.sum[Map[String, Int]]('word) }
        // We could save the map here, but if we want similar output as in previous...
        .flatMap('word -> ('word, 'size)) { words: Map[String, Int] => words.toList }
        .groupAll { _.sortBy(('size, 'word)).reverse } // this is just for easy results
        .write(Tsv(args("output")))

      def tokenize(text: String): Array[String] = {
        text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+").filter(_ != "")
      }
    }

    the and to a of it she said you in i alice was that as her with at on all
    1804 912 801 684 625 541 538 462 429 428 400 385 358 291 272 248 228 224 204 197
  • 20. Top Words with CMS

    package com.sanoma.cda.examples
    import com.twitter.scalding._
    import com.twitter.algebird._

    class WordCount5(args: Args) extends Job(args) {
      implicit def utf8(s: String): Array[Byte] = com.twitter.bijection.Injection.utf8(s)
      implicit val cmsm = new SketchMapMonoid[String, Long](128, 6, 0, 20) // top 20
      type ApproxMap = SketchMap[String, Long]

      TextLine(args("input"))
        .flatMap('line -> 'word) { tokenize }
        .map('word -> 'word) { w: String => cmsm.create((w, 1L)) }
        .groupAll { _.sum[ApproxMap]('word) }
        .flatMap('word -> ('word, 'size)) { words: ApproxMap => words.heavyHitters }
        .write(Tsv(args("output")))

      def tokenize(text: String): Array[String] = {
        text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+").filter(_ != "")
      }
    }

    the and to a of she it said you in i alice at was that her with as not be
    1859 972 867 748 711 636 619 579 504 495 456 431 407 394 342 341 338 337 290 286
  • 21. Start using Scalding now! :-)

    GitHub: https://github.com/twitter/scalding
    Tutorials: https://github.com/twitter/scalding/tree/develop/tutorial
  • 22. Thanks! Hugo Gävert, Sanoma