Scoobi

Ben Lever
@bmlever
Me

• Machine learning, software systems, computer vision, optimisation, networks, control and signal processing
• Haskell DSL for development of computer vision algorithms targeting GPUs
• Predictive analytics for the enterprise
Hadoop app development – wish list

• Quick dev cycles
• Expressive
• Reusability
• Type safety
• Reliability
Bridging the “tooling” gap

Scoobi

• Implementation: DList and DObject abstractions for building MapReduce pipelines
• Testing: ScalaCheck integration
• Layered on the Hadoop MapReduce Java APIs
At a glance
•   Scoobi = Scala for Hadoop
•   Inspired by Google’s FlumeJava
•   Developed at NICTA
•   Open-sourced Oct 2011
•   Apache V2
Hadoop MapReduce – word count

Input splits (blocks 323–326): “... cat ...”, “... hello ... cat”,
“... hello ... fire ...”, “... fire ... cat ...”

(k1, v1) → [(k2, v2)]
    Map: each Mapper emits (word, 1) for every word in its split,
    e.g. (cat, 1); (hello, 1), (cat, 1); (hello, 1), (fire, 1); (fire, 1), (cat, 1)

[(k2, v2)] → [(k2, [v2])]
    Sort and shuffle: aggregate values by key, giving
    (hello, [1, 1]), (cat, [1, 1, 1]), (fire, [1, 1])

(k2, [v2]) → [(k3, v3)]
    Reduce: each Reducer sums the counts for its keys, producing
    (hello, 2), (cat, 3), (fire, 2)
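These phases can be traced end-to-end with ordinary Scala collections. The following is a minimal in-memory analogy of the diagram, not Hadoop code; the object name WordCountPhases is illustrative:

```scala
// In-memory analogy of the three MapReduce phases for word count.
object WordCountPhases {
  // One string per input split
  val splits = List("cat", "hello cat", "hello fire", "fire cat")

  // Map: (k1, v1) -> [(k2, v2)] -- emit (word, 1) for every word
  val mapped: List[(String, Int)] =
    splits.flatMap(_.split(" ").map(w => (w, 1)))

  // Sort and shuffle: [(k2, v2)] -> [(k2, [v2])] -- aggregate values by key
  val shuffled: Map[String, List[Int]] =
    mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }

  // Reduce: (k2, [v2]) -> [(k3, v3)] -- sum the counts for each key
  val reduced: Map[String, Int] =
    shuffled.map { case (w, counts) => (w, counts.sum) }
}
```

Running it yields (hello, 2), (cat, 3), (fire, 2), matching the diagram.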
Java style

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Source: http://wiki.apache.org/hadoop/WordCount
DList abstraction

Data on HDFS → Distributed List (DList) → Transform

DList type                            Abstraction for
DList[String]                         Lines of text files
DList[(Int, String, Boolean)]         CSV files of the form “37,Joe,M”
DList[(Float, Map[String, Int])]      Avro files with schema: {record {float, map}}
Scoobi style

import com.nicta.scoobi.Scoobi._

// Count the frequency of words from a corpus of documents
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val freqs: DList[(String, Int)] =
      lines.flatMap(_.split(" "))   // DList[String]
           .map(w => (w, 1))        // DList[(String, Int)]
           .groupByKey              // DList[(String, Iterable[Int])]
           .combine(_+_)            // DList[(String, Int)]

    persist(toTextFile(freqs, args(1)))
  }
}
DList trait

trait DList[A] {
  /* Abstract methods */
  def parallelDo[B](dofn: DoFn[A, B]): DList[B]

  def ++(that: DList[A]): DList[A]

  def groupByKey[K, V]
      (implicit ev: A <:< (K, V)): DList[(K, Iterable[V])]

  def combine[K, V]
      (f: (V, V) => V)
      (implicit ev: A <:< (K, Iterable[V])): DList[(K, V)]

  /* All other methods are derived, e.g. ‘map’ */
}
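To illustrate how the derived methods fall out of this abstract core, here is a toy in-memory model; LocalDList and this simplified DoFn are stand-ins invented for the sketch, not Scoobi's real implementation:

```scala
// Toy model: deriving `map` and `flatMap` from `parallelDo` alone.
// DoFn here is a simplified stand-in for Scoobi's actual interface.
trait DoFn[A, B] { def process(input: A, emit: B => Unit): Unit }

case class LocalDList[A](elems: List[A]) {
  def parallelDo[B](dofn: DoFn[A, B]): LocalDList[B] = {
    val out = scala.collection.mutable.ListBuffer[B]()
    elems.foreach(a => dofn.process(a, b => out += b))
    LocalDList(out.toList)
  }

  // Derived map: emit exactly one output per input
  def map[B](f: A => B): LocalDList[B] =
    parallelDo(new DoFn[A, B] {
      def process(input: A, emit: B => Unit) = emit(f(input))
    })

  // Derived flatMap: emit zero or more outputs per input
  def flatMap[B](f: A => Iterable[B]): LocalDList[B] =
    parallelDo(new DoFn[A, B] {
      def process(input: A, emit: B => Unit) = f(input).foreach(emit)
    })
}
```

The same pattern covers the other derived combinators: each is a parallelDo with a particular emit discipline.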
Under the hood

persist compiles the word-count pipeline into a graph of nodes:

  fromTextFile → LD  (lines, read from HDFS)
  flatMap      → PD  (words)
  map          → PD  (word1)
  groupByKey   → GBK (wordG)
  combine      → CV  (freq, written to HDFS)
  persist

The parallelDo (PD), groupByKey (GBK) and combine (CV) nodes are fused into a single MapReduce job.
Removing less than the average

import com.nicta.scoobi.Scoobi._

// Remove all integers that are less than the average integer
object BetterThanAverage extends ScoobiApp {
  def run() {
    val ints: DList[Int] =
      fromTextFile(args(0)) collect { case AnInt(i) => i }

    val total: DObject[Int] = ints.sum
    val count: DObject[Int] = ints.size

    val average: DObject[Int] =
      (total, count) map { case (t, c) => t / c }

    val bigger: DList[Int] =
      (average join ints) filter { case (a, i) => i > a }

    persist(toTextFile(bigger, args(1)))
  }
}
Under the hood

The BetterThanAverage plan compiles into multiple stages:

• MapReduce job: ints is loaded from HDFS (LD → PD); two branches
  (PD → GBK → CV → M) compute total and count
• Client computation: average is derived from (total, count) (OP) and
  shipped to the cluster via the Distributed Cache (DCache)
• MapReduce job: average is joined with ints and filtered (PD → PD),
  writing bigger to HDFS
DObject abstraction

• A DObject[A] is a single distributed value, stored on HDFS and
  shipped via the Distributed Cache
• map chains client-side computations over DObjects
• join pairs a DObject[A] with a DList[B], yielding a DList[(A, B)]
  (HDFS + Distributed Cache)

trait DObject[A] {
  def map[B](f: A => B): DObject[B]
  def join[B](list: DList[B]): DList[(A, B)]
}
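The semantics of these two methods can be modelled on in-memory data; LocalDObject is an invented stand-in for the sketch, with a plain List standing in for DList:

```scala
// Toy semantics for DObject: a single distributed value that can be
// transformed client-side (map) or paired with every element of a
// distributed list (join).
case class LocalDObject[A](value: A) {
  def map[B](f: A => B): LocalDObject[B] = LocalDObject(f(value))
  def join[B](list: List[B]): List[(A, B)] = list.map(b => (value, b))
}
```

For example, LocalDObject(10).join(List(4, 12)) pairs the value with each element, which is exactly the (average, element) shape the filter in BetterThanAverage consumes.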
Mirroring the Scala Collection API
     DList => DList   DList => DObject

       flatMap           reduce
         map             product
        filter             sum
      filterNot           length
      groupBy              size
      partition           count
       flatten             max
       distinct           maxBy
          ++               min
     keys, values         minBy
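Because the API mirrors the standard collections, a pipeline can be prototyped on a plain List first. The word-count chain from earlier looks almost identical, with groupBy and reduce standing in for groupByKey and combine (ListWordCount is an illustrative name):

```scala
// The Scoobi word-count chain, prototyped on an ordinary List.
object ListWordCount {
  val lines = List("hello cat", "hello fire", "fire cat")

  val freqs: Map[String, Int] =
    lines.flatMap(_.split(" "))   // List[String]
         .map(w => (w, 1))        // List[(String, Int)]
         .groupBy(_._1)           // Map[String, List[(String, Int)]]
         .map { case (w, ps) => (w, ps.map(_._2).reduce(_ + _)) }
}
```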
Building abstractions

Functional programming

Functions as procedures + functions as parameters

⇒ Composability + Reusability
Composing

// Compute the average of a DList of “numbers”
def average[A : Numeric](in: DList[A]): DObject[A] =
  (in.sum, in.size) map { case (sum, size) => sum / size }

// Compute histogram
def histogram[A](in: DList[A]): DList[(A, Int)] =
  in.map(x => (x, 1)).groupByKey.combine(_+_)

// Throw away words with less-than-average frequency
def betterThanAvgWords(lines: DList[String]): DList[String] = {
  val words = lines.flatMap(_.split(" "))
  val wordCnts = histogram(words)
  val avgFreq = average(wordCnts.values)
  (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
}
Unit-testing ‘histogram’

// Specification for histogram function (ScalaCheck properties)
class HistogramSpec extends HadoopSpecification {

  "Histogram from DList" >> {

    "Sum of bins must equal size of DList" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val hist = histogram(list.toDList)
        val binSum = persist(hist.values.sum)
        binSum == list.size
      }
    }

    "Number of bins must equal number of unique values" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val input = list.toDList
        val bins = histogram(input).keys.size
        val uniques = input.distinct.size
        val (b, u) = persist(bins, uniques)
        b == u
      }
    }
  }
}
sbt integration

> test-only *Histogram* -- exclude cluster
[info] HistogramSpec
[info]
[info] Histogram from DList
[info] + Sum of bins must equal size of DList
[info] No cluster execution time
[info] + Number of bins must equal number of unique values
[info] No cluster execution time
[info]
[info] Total for specification HistogramSpec
[info] Finished in 12 seconds, 600 ms
[info] 2 examples, 4 expectations, 0 failure, 0 error
[info]
[info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0
>
> test-only *Histogram*
> test-only *Histogram* -- scoobi verbose
> test-only *Histogram* -- scoobi verbose.warning

When running against a cluster, dependent JARs are copied (once) to a directory on the cluster (~/libjars by default).
Other features
• Grouping:
  – API for controlling Hadoop’s sort-and-shuffle
  – Useful for implementing secondary sorting
• Join and Co-group helper methods
• Matrix multiplication utilities
• I/O:
  – Text, sequence, Avro
  – Roll your own
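The effect of secondary sorting can be seen on in-memory data: records are grouped by key and, within each group, values arrive in sorted order. This sketches only the semantics; the names are illustrative, and Scoobi's actual Grouping API achieves the ordering during the shuffle rather than by sorting each group in memory:

```scala
// Secondary sort, modelled in memory: group events by user, then
// deliver each user's values sorted (here, by timestamp).
object SecondarySortSketch {
  val events = List(("alice", 3), ("bob", 1), ("alice", 1), ("bob", 2), ("alice", 2))

  val grouped: Map[String, List[Int]] =
    events.groupBy(_._1)
          .map { case (user, evs) => (user, evs.map(_._2).sorted) }
}
```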
Want to know more?
• http://nicta.github.com/scoobi
• Mailing lists:
  – http://groups.google.com/group/scoobi-users
  – http://groups.google.com/group/scoobi-dev
• Twitter:
  – @bmlever
  – @etorreborre
• Meet me:
  – Will also be at Hadoop Summit (June 13-14)
  – Keen to get feedback

DevOps and Testing slides at DASA Connect
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 

Scoobi - Scala for Startups

  • 2. Me: Haskell DSL for development of computer vision algorithms targeting GPUs; machine learning, software systems, computer vision, optimisation, networks, control and signal processing; predictive analytics for the enterprise
  • 3. Hadoop app development – wish list: quick dev cycles, expressiveness, reusability, type safety, reliability
  • 4. Bridging the “tooling” gap: Scoobi provides MapReduce-pipeline abstractions (DList and DObject) for implementation and ScalaCheck integration for testing, sitting on top of the Java APIs of Hadoop MapReduce
  • 5. At a glance • Scoobi = Scala for Hadoop • Inspired by Google’s FlumeJava • Developed at NICTA • Open-sourced Oct 2011 • Apache V2
  • 6. Hadoop MapReduce – word count. Input splits of text (e.g. “... hello ... cat ...”, “... fire ... cat ...”) feed Mappers ((k1, v1) → [(k2, v2)]), which emit (word, 1) pairs. Sort and shuffle aggregates values by key ([(k2, v2)] → [(k2, [v2])]): hello → [1, 1], cat → [1, 1, 1], fire → [1, 1]. Reducers ((k2, [v2]) → [(k3, v3)]) sum each list: hello 2, cat 3, fire 2.
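The dataflow above can be emulated with ordinary Scala collections. This is a local sketch only, not Hadoop code, and the input strings are invented for illustration:

```scala
// Local emulation of the word-count dataflow: map, shuffle, reduce.
object LocalWordCount {
  def wordCount(blocks: List[String]): Map[String, Int] = {
    // Map phase: (k1, v1) -> [(k2, v2)], emitting (word, 1) pairs
    val mapped: List[(String, Int)] =
      blocks.flatMap(_.split(" ")).map(w => (w, 1))

    // Sort and shuffle: aggregate values by key, [(k2, v2)] -> [(k2, [v2])]
    val shuffled: Map[String, List[Int]] =
      mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }

    // Reduce phase: (k2, [v2]) -> [(k3, v3)], summing each key's ones
    shuffled.map { case (w, ones) => (w, ones.sum) }
  }

  def main(args: Array[String]): Unit =
    println(wordCount(List("hello cat", "cat", "hello fire", "fire cat")))
}
```

On a cluster the shuffle is performed by the framework between the map and reduce phases; here `groupBy` stands in for it.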
  • 7. Java style

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Source: http://wiki.apache.org/hadoop/WordCount
  • 8. DList abstraction: a Distributed List (DList) is a transformable abstraction for data on HDFS.

DList[String] – lines of text files
DList[(Int, String, Boolean)] – CSV files of the form “37,Joe,M”
DList[(Float, Map[String, Int])] – Avro files with schema {record {int, map}}
  • 9. Scoobi style

import com.nicta.scoobi.Scoobi._

// Count the frequency of words from a corpus of documents
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val freqs: DList[(String, Int)] =
      lines.flatMap(_.split(" "))  // DList[String]
           .map(w => (w, 1))       // DList[(String, Int)]
           .groupByKey             // DList[(String, Iterable[Int])]
           .combine(_+_)           // DList[(String, Int)]

    persist(toTextFile(freqs, args(1)))
  }
}
  • 10. DList trait

trait DList[A] {

  /* Abstract methods */
  def parallelDo[B](dofn: DoFn[A, B]): DList[B]

  def ++(that: DList[A]): DList[A]

  def groupByKey[K, V](implicit ev: A <:< (K, V)): DList[(K, Iterable[V])]

  def combine[K, V](f: (V, V) => V)(implicit ev: A <:< (K, Iterable[V])): DList[(K, V)]

  /* All other methods are derived, e.g. ‘map’ */
}
  • 11. Under the hood: fromTextFile (LD) loads lines from HDFS; flatMap (PD) produces words; map (PD) produces word1; groupByKey (GBK) produces wordG; combine (CV) produces freq, which persist writes back to HDFS. The parallelDo, groupByKey and combine stages compile into a single MapReduce job.
  • 12. Removing less than the average

import com.nicta.scoobi.Scoobi._

// Remove all integers that are less than the average integer
object BetterThanAverage extends ScoobiApp {
  def run() {
    val ints: DList[Int] = fromTextFile(args(0)) collect { case AnInt(i) => i }

    val total: DObject[Int] = ints.sum
    val count: DObject[Int] = ints.size
    val average: DObject[Int] = (total, count) map { case (t, c) => t / c }

    val bigger: DList[Int] = (average join ints) filter { case (a, i) => i > a }

    persist(toTextFile(bigger, args(1)))
  }
}
  • 13. Under the hood: a first MapReduce job reads ints from HDFS and computes total and count (parallelDo, groupByKey and combine stages); a client-side computation derives average, which is placed in the Distributed Cache; a second MapReduce job joins average against ints to produce bigger, written back to HDFS.
  • 14. DObject abstraction: a DObject[A] lives in the Distributed Cache; it supports client-side computations, can be mapped over, and can be joined back against a DList[B] to yield a DList[(A, B)].

trait DObject[A] {
  def map[B](f: A => B): DObject[B]
  def join[B](list: DList[B]): DList[(A, B)]
}
  • 15. Mirroring the Scala Collection API

DList => DList: flatMap, map, filter, filterNot, groupBy, partition, flatten, distinct, ++, keys, values
DList => DObject: reduce, product, sum, length, size, count, max, maxBy, min, minBy
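Because the DList API mirrors the standard collections, a pipeline reads the same in both worlds. A sketch on a plain Scala List (the `stats` helper is invented for illustration; on a DList the chain would look identical, except that the collapsing operations return DObjects instead of plain values):

```scala
// Collection-style operations from the table, applied to a plain List.
object CollectionParity {
  def stats(xs: List[Int]): (List[Int], Int, Int) = {
    val evens = xs.filter(_ % 2 == 0).distinct // stays a List (DList => DList)
    val total = xs.sum                         // collapses    (DList => DObject)
    val top   = xs.max                         // collapses    (DList => DObject)
    (evens, total, top)
  }
}
```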
  • 16. Building abstractions: functional programming treats functions as procedures and functions as parameters, which gives composability + reusability.
  • 17. Composing

// Compute the average of a DList of “numbers”
def average[A : Numeric](in: DList[A]): DObject[A] =
  (in.sum, in.size) map { case (sum, size) => sum / size }

// Compute histogram
def histogram[A](in: DList[A]): DList[(A, Int)] =
  in.map(x => (x, 1)).groupByKey.combine(_+_)

// Throw away words with less-than-average frequency
def betterThanAvgWords(lines: DList[String]): DList[String] = {
  val words = lines.flatMap(_.split(" "))
  val wordCnts = histogram(words)
  val avgFreq = average(wordCnts.values)
  (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
}
  • 18. Unit-testing ‘histogram’ (ScalaCheck properties)

// Specification for histogram function
class HistogramSpec extends HadoopSpecification {

  "Histogram from DList" >> {

    "Sum of bins must equal size of DList" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val hist = histogram(list.toDList)
        val binSum = persist(hist.values.sum)
        binSum == list.size
      }
    }

    "Number of bins must equal number of unique values" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val input = list.toDList
        val bins = histogram(input).keys.size
        val uniques = input.distinct.size
        val (b, u) = persist(bins, uniques)
        b == u
      }
    }
  }
}
  • 19. sbt integration

> test-only *Histogram* -- exclude cluster
[info] HistogramSpec
[info]
[info] Histogram from DList
[info] + Sum of bins must equal size of DList
[info]   No cluster execution time
[info] + Number of bins must equal number of unique values
[info]   No cluster execution time
[info]
[info] Total for specification BoundedFilterSpec
[info] Finished in 12 seconds, 600 ms
[info] 2 examples, 4 expectations, 0 failure, 0 error
[info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0

Dependent JARs are copied (once) to a directory on the cluster (~/libjars by default).

> test-only *Histogram*
> test-only *Histogram* -- scoobi verbose
> test-only *Histogram* -- scoobi verbose.warning
  • 20. Other features
    • Grouping: API for controlling Hadoop’s sort-and-shuffle; useful for implementing secondary sorting
    • Join and Co-group helper methods
    • Matrix multiplication utilities
    • I/O: text, sequence, Avro – or roll your own
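The Grouping API itself is not shown on the slides; as an illustration of the result secondary sorting produces, here is the same outcome computed locally with ordinary collections (on Hadoop the per-key ordering comes from the shuffle itself, not an in-memory sort, and `groupedSorted` is a name invented here):

```scala
// What secondary sorting achieves: each key's values arrive at the
// reducer already ordered. Locally, groupBy + sorted gives the same result.
object SecondarySortSketch {
  def groupedSorted(events: List[(String, Int)]): Map[String, List[Int]] =
    events.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sorted) }
}
```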
  • 21. Want to know more? • http://nicta.github.com/scoobi • Mailing lists: – http://groups.google.com/group/scoobi-users – http://groups.google.com/group/scoobi-dev • Twitter: – @bmlever – @etorreborre • Meet me: – Will also be at Hadoop Summit (June 13-14) – Keen to get feedback
