Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 3
                  September 8, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
      Course design and slides derived from
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of the following
excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Today’s Agenda
• Review
• Toward MapReduce “design patterns”
  – Building block: preserving state across calls
  – In-Map & In-Mapper combining (vs. combiners)
  – Secondary sorting (via value-to-key conversion)
  – Pairs and Stripes
  – Order Inversion
• Group Work (examples)
  – Interlude: scaling counts, TF-IDF
Review
MapReduce: Recap
Required:
   map ( K1, V1 ) → list ( K2, V2 )
   reduce ( K2, list(V2) ) → list ( K3, V3)
All values with the same key are reduced together
Optional:
   partition (K2, N) → Rj      maps K2 to some reducer Rj in [1..N]
       Often a simple hash of the key, e.g., hash(k') mod N
      Divides up key space for parallel reduce operations


   combine ( K2, list(V2) ) → list ( K2, V2 )
      Mini-reducers that run in memory after the map phase
      Used as an optimization to reduce network traffic


The execution framework handles everything else…
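As a concrete illustration of the partition contract above, here is a minimal Hadoop sketch mirroring the framework's default HashPartitioner (the Text/IntWritable key and value types are illustrative assumptions, not from the slides):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assigns each intermediate key K2 to one of N reducers via hash(key) mod N
    public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReducers) {
        // Mask the sign bit so the result is always non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
      }
    }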
“Everything Else”
    The execution framework handles everything else…
        Scheduling: assigns workers to map and reduce tasks
        "Data distribution": moves processes to data
        Synchronization: gathers, sorts, and shuffles intermediate data
        Errors and faults: detects worker failures and restarts
    Limited control over data and execution flow
        All algorithms must be expressed in m, r, c, p
    You don’t know:
        Where mappers and reducers run
        When a mapper or reducer begins or finishes
        Which input a particular mapper is processing
        Which intermediate key a particular reducer is processing
[Figure: end-to-end MapReduce dataflow. Four mappers consume input pairs
(k1,v1) … (k6,v6) and emit intermediate pairs: a→1, b→2 | c→3, c→6 |
a→5, c→2 | b→7, c→8. Combiners aggregate locally (c→3 and c→6 become c→9);
partitioners assign each key to a reducer. Shuffle and sort aggregates
values by key: a→(1,5), b→(2,7), c→(2,9,8). Three reducers then emit the
final pairs r1→s1, r2→s2, r3→s3.]
Shuffle and Sort
[Figure: shuffle-and-sort internals. Map output fills a circular buffer
(in memory); spills (on disk) are written with the Combiner applied;
spills are merged into intermediate files (on disk), where the Combiner
may run again; each Reducer then fetches its partition from this mapper
and from other mappers, while other reducers fetch theirs.]
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178


    Shuffle and 2 Sorts




   As map emits values, local sorting
    runs in tandem (1st sort)
   Combine is optionally called
    0..N times for local aggregation
    on sorted (K2, list(V2)) tuples (more sorting of output)
   Partition determines which (logical) reducer Rj each key will go to
   Node’s TaskTracker tells JobTracker it has keys for Rj
   JobTracker determines node to run Rj based on data locality
   When local map/combine/sort finishes, sends data to Rj’s node
   Rj's node iteratively merges inputs from map nodes as they arrive (2nd sort)
   For each (K, list(V)) tuple in merged output, call reduce(…)
Scalable Hadoop Algorithms: Themes
   Avoid object creation
       Inherently costly operation
       Garbage collection
   Avoid buffering
       Limited heap size
       Works for small datasets, but won’t scale!
         • Yet… we’ll talk about patterns involving buffering…
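To make the object-creation theme concrete, here is a sketch of a tokenizing mapper that reuses one output key and one output value across all map() calls, rather than allocating fresh objects per token (the types and whitespace tokenization are assumptions):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final Text word = new Text();                      // reused output key
      private static final IntWritable ONE = new IntWritable(1); // reused output value

      @Override
      public void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          word.set(token);          // overwrite in place: no new Text per token
          context.write(word, ONE); // the framework serializes immediately
        }
      }
    }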
Importance of Local Aggregation
   Ideal scaling characteristics:
       Twice the data, twice the running time
       Twice the resources, half the running time
   Why can’t we achieve this?
       Synchronization requires communication
       Communication kills performance
   Thus… avoid communication!
       Reduce intermediate data via local aggregation
       Combiners can help
Tools for Synchronization
    Cleverly-constructed data structures
        Bring partial results together
    Sort order of intermediate keys
        Control order in which reducers process keys
    Partitioner
        Control which reducer processes which keys
    Preserving state in mappers and reducers
        Capture dependencies across multiple keys and values
Secondary Sorting
   MapReduce sorts input to reducers by key
       Values may be arbitrarily ordered
   What if we want to sort the values too?
       E.g., k → (k1,v1), (k1,v3), (k1,v4), (k1,v8)…
   Solutions?
       Swap key and value to sort by value?
       What if we use (k,v) as a joint key (and change nothing else)?
Secondary Sorting: Solutions
   Solution 1: Buffer values in memory, then sort
       Tradeoffs?
    Solution 2: "Value-to-key conversion" design pattern
       Form composite intermediate key: (k, v1)
       Let execution framework do the sorting
       Preserve state across multiple key-value pairs
       …how do we make this happen?
Secondary Sorting (Lin 57, White 241)
    Create composite key: (k,v)
    Define a Key Comparator to sort via both
        Possibly not needed in some cases (e.g., string keys formed by concatenation)
    Define a partition function based only on the (original) key
        All pairs with same key should go to same reducer
        Multiple keys may still go to the same reduce node; how do you
         know when the key changes across invocations of reduce()?
          • i.e. assume you want to do something with all values associated with
            a given key (e.g. print all on the same line, with no other keys)
    Preserve state in the reducer across invocations
        reduce() will be called separately for each pair, but we need to
         track the current key so we can detect when it changes


 Hadoop also provides Group Comparator
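A minimal sketch of the partitioning ingredient, assuming composite keys encoded as tab-separated Text of the form "naturalKey<TAB>value" (the key comparator and group comparator would be configured separately):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Partition on the natural key only, so all composite keys (k, v)
    // sharing the same k reach the same reducer, while the framework's
    // sort orders the values within each k.
    public class NaturalKeyPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text compositeKey, IntWritable value, int numReducers) {
        String naturalKey = compositeKey.toString().split("\t", 2)[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
      }
    }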
Preserving State in Hadoop
[Figure: task lifecycle. One Mapper object and one Reducer object are
created per task, and each can hold state across calls:
    configure   API initialization hook
    map         one call per input key-value pair
    reduce      one call per intermediate key
    close       API cleanup hook]
Combiner Design
   Combiners and reducers share same method signature
       Sometimes, reducers can serve as combiners
       Often, not…
   Remember: combiners are optional optimizations
       Should not affect algorithm correctness
       May be run 0, 1, or multiple times
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
“Hello World”: Word Count
(same Word Count pseudocode as above)

    Combiner?
MapReduce Algorithm Design
Design Pattern for Local Aggregation
   "In-mapper combining"
       Fold the functionality of the combiner into the mapper,
        including preserving state across multiple map calls
   Advantages
       Speed
        Why is this faster than actual combiners?
          • Avoids object construction/deconstruction and serialization/deserialization
          • Execution is guaranteed and under your control (combiners may not run)
    Disadvantages
        Buffering! Explicit memory management required
          • Can use a disk-backed buffer, based on # of items or bytes in memory
          • What if multiple mappers are running on the same node? Do we know?
        Potential for order-dependent bugs
“Hello World”: Word Count
(same Word Count pseudocode as above)

    Combine = reduce
Word Count: in-map combining




Are combiners still needed?
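The slide's code appears as an image in the original deck; a sketch of in-map combining, where aggregation happens inside a single map() call (one document) and each distinct word is emitted once per document:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapCombiningWordCount
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final Text word = new Text();
      private final IntWritable total = new IntWritable();

      @Override
      public void map(LongWritable docid, Text doc, Context context)
          throws IOException, InterruptedException {
        Map<String, Integer> counts = new HashMap<>(); // local to this call
        for (String token : doc.toString().split("\\s+")) {
          counts.merge(token, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          word.set(e.getKey());
          total.set(e.getValue());
          context.write(word, total);
        }
      }
    }

Since aggregation happens only within a document, combining across documents (a combiner, or the in-mapper variant on the next slide) can still reduce intermediate data.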
Word Count: in-mapper combining




Are combiners still needed?
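Again the slide's code is an image; a sketch of in-mapper combining, which buffers counts across all map() calls for the task and emits once in cleanup(). This also exercises the state-preservation hooks from the lifecycle slide above (setup/cleanup are the new-API analogues of configure/close):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperCombiningWordCount
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final Map<String, Integer> counts = new HashMap<>(); // task-level state

      @Override
      public void map(LongWritable docid, Text doc, Context context) {
        for (String token : doc.toString().split("\\s+")) {
          counts.merge(token, 1, Integer::sum); // buffer, emit nothing yet
        }
      }

      @Override
      public void cleanup(Context context) throws IOException, InterruptedException {
        Text word = new Text();
        IntWritable sum = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          word.set(e.getKey());
          sum.set(e.getValue());
          context.write(word, sum); // one pair per distinct word per task
        }
      }
    }

With counts fully aggregated per task before anything is emitted, a separate combiner has little left to do.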
Example 2: Compute the Mean (v1)




Why can’t we use reducer as combiner?
Example 2: Compute the Mean (v2)




Why doesn’t this work?
Example 2: Compute the Mean (v3)
Computing the Mean:
                 in-mapper combining




Are combiners still needed?
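The code for versions 1-3 appears as images in the original deck. The crux: the averaging reducer cannot double as a combiner, because a mean of partial means is wrong when the partial counts differ (mean(1,2) = 1.5 and mean(3) = 3 average to 2.25, but mean(1,2,3) = 2). Instead, partial (sum, count) pairs must be propagated. A sketch of the in-mapper-combining mapper, with the pair encoded as a "sum,count" string for brevity (an assumption; a custom Writable would be more typical):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MeanMapper extends Mapper<Text, LongWritable, Text, Text> {
      private final Map<String, long[]> partial = new HashMap<>(); // key -> {sum, count}

      @Override
      public void map(Text key, LongWritable value, Context context) {
        long[] sc = partial.computeIfAbsent(key.toString(), k -> new long[2]);
        sc[0] += value.get(); // running sum
        sc[1] += 1;           // running count
      }

      @Override
      public void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, long[]> e : partial.entrySet()) {
          long[] sc = e.getValue();
          context.write(new Text(e.getKey()), new Text(sc[0] + "," + sc[1]));
        }
      }
    }

The matching reducer parses each "sum,count" pair, adds sums and counts separately, and emits sum/count as the mean.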
Example 3: Term Co-occurrence
   Term co-occurrence matrix for a text collection
       M = N x N matrix (N = vocabulary size)
        M_ij: number of times i and j co-occur in some context
        (for concreteness, let’s say context = sentence)
   Why?
       Distributional profiles as a way of measuring semantic distance
       Semantic distance useful for many language processing tasks
MapReduce: Large Counting Problems
   Term co-occurrence matrix for a text collection
    = specific instance of a large counting problem
       A large event space (number of terms)
       A large number of observations (the collection itself)
       Goal: keep track of interesting statistics about the events
   Basic approach
       Mappers generate partial counts
       Reducers aggregate partial counts



        How do we aggregate partial counts efficiently?
Approach 1: “Pairs”
    Each mapper takes a sentence:
        Generate all co-occurring term pairs
        For all pairs, emit (a, b) → count
    Reducers sum up counts associated with these pairs
    Use combiners!
Pairs: Pseudo-Code
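This slide's pseudo-code is an image in the original deck; a compact Hadoop sketch of the pairs approach, encoding each co-occurring pair as a tab-separated Text key (an assumption for brevity):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text pair = new Text();

      @Override
      public void map(LongWritable offset, Text sentence, Context context)
          throws IOException, InterruptedException {
        String[] terms = sentence.toString().split("\\s+");
        for (String a : terms) {
          for (String b : terms) {
            if (a.equals(b)) continue; // skip self co-occurrence
            pair.set(a + "\t" + b);
            context.write(pair, ONE);  // emit ((a, b), 1)
          }
        }
      }
    }

    class PairsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      public void reduce(Text pair, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(pair, new IntWritable(sum));
      }
    }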
“Pairs” Analysis
    Advantages
        Easy to implement, easy to understand
    Disadvantages
        Lots of pairs to sort and shuffle around (upper bound?)
        Not many opportunities for combiners to work
Another Try: “Stripes”
    Idea: group together pairs into an associative array
          (a, b) → 1
          (a, c) → 2
          (a, d) → 5                   a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
          (a, e) → 3
          (a, f) → 2

    Each mapper takes a sentence:
        Generate all co-occurring term pairs
        For each term, emit a → { b: count_b, c: count_c, d: count_d, … }
    Reducers perform element-wise sum of associative arrays
                a → { b: 1,       d: 5, e: 3 }
           +    a → { b: 1, c: 2, d: 2,       f: 2 }
                a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
Stripes: Pseudo-Code
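Likewise a sketch of stripes, using Hadoop's MapWritable as the associative array; the reducer performs the element-wise sum:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
      @Override
      public void map(LongWritable offset, Text sentence, Context context)
          throws IOException, InterruptedException {
        String[] terms = sentence.toString().split("\\s+");
        for (String a : terms) {
          MapWritable stripe = new MapWritable(); // a -> { b: count_b, ... }
          for (String b : terms) {
            if (a.equals(b)) continue;
            Text neighbor = new Text(b);
            IntWritable prev = (IntWritable) stripe.get(neighbor);
            stripe.put(neighbor, new IntWritable(prev == null ? 1 : prev.get() + 1));
          }
          context.write(new Text(a), stripe);
        }
      }
    }

    class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
      @Override
      public void reduce(Text term, Iterable<MapWritable> stripes, Context context)
          throws IOException, InterruptedException {
        MapWritable sum = new MapWritable();
        for (MapWritable stripe : stripes) {
          for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
            IntWritable prev = (IntWritable) sum.get(e.getKey());
            int add = ((IntWritable) e.getValue()).get();
            sum.put(e.getKey(), new IntWritable(prev == null ? add : prev.get() + add));
          }
        }
        context.write(term, sum); // element-wise sum of associative arrays
      }
    }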
“Stripes” Analysis
    Advantages
        Far less sorting and shuffling of key-value pairs
        Can make better use of combiners
    Disadvantages
        More difficult to implement
        Underlying object more heavyweight
        Fundamental limitation in terms of size of event space
          • Buffering!
[Figure: running time of the "pairs" vs. "stripes" algorithms.
Cluster size: 38 cores.
Data source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3),
which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed).]
Relative Frequencies
   How do we estimate relative frequencies from counts?

           f(B|A) = count(A,B) / count(A) = count(A,B) / Σ_B' count(A,B')



   Why do we want to do this?
   How do we do this with MapReduce?
f(B|A): “Stripes”

      a → { b1: 3, b2: 12, b3: 7, b4: 1, … }


    Easy!
        One pass to compute (a, *)
        Another pass to directly compute f(B|A)
f(B|A): “Pairs”

         (a, *) → 32    Reducer holds this value in memory

         (a, b1) → 3                        (a, b1) → 3 / 32
         (a, b2) → 12                       (a, b2) → 12 / 32
         (a, b3) → 7                        (a, b3) → 7 / 32
         (a, b4) → 1                        (a, b4) → 1 / 32
         …                                  …


    For this to work:
        Must emit an extra (a, *) for every (a, b_n) in the mapper
        Must make sure all a’s get sent to same reducer (use partitioner)
        Must make sure (a, *) comes first (define sort order)
        Must hold state in reducer across different key-value pairs
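A sketch of the reducer side of this recipe, assuming keys of the form "a<TAB>b" with the marginal emitted as "a<TAB>*" (since '*' sorts before letters, the marginal arrives first), and a partitioner like the natural-key one sketched earlier:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class RelativeFrequencyReducer
        extends Reducer<Text, IntWritable, Text, DoubleWritable> {
      private double marginal = 0.0; // state preserved across reduce() calls

      @Override
      public void reduce(Text key, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        if (key.toString().endsWith("\t*")) {
          marginal = sum; // (a, *) arrives before any (a, b)
        } else {
          context.write(key, new DoubleWritable(sum / marginal)); // f(b|a)
        }
      }
    }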
“Order Inversion”
    Common design pattern
        Computing relative frequencies requires marginal counts
        But the marginal cannot be computed until you see all counts
        Buffering is a bad idea!
        Trick: getting the marginal counts to arrive at the reducer before
         the joint counts
    Optimizations
        Apply in-memory combining pattern to accumulate marginal counts
        Should we apply combiners?
Synchronization: Pairs vs. Stripes
    Approach 1: turn synchronization into an ordering problem
        Sort keys into correct order of computation
        Partition key space so that each reducer gets the appropriate set
         of partial results
        Hold state in reducer across multiple key-value pairs to perform
         computation
        Illustrated by the "pairs" approach
    Approach 2: construct data structures that bring partial
     results together
        Each reducer receives all the data it needs to complete the
         computation
        Illustrated by the "stripes" approach
Recap: Tools for Synchronization
   Cleverly-constructed data structures
       Bring data together
   Sort order of intermediate keys
       Control order in which reducers process keys
   Partitioner
       Control which reducer processes which keys
   Preserving state in mappers and reducers
       Capture dependencies across multiple keys and values
Issues and Tradeoffs
   Number of key-value pairs
       Object creation overhead
       Time for sorting and shuffling pairs across the network
   Size of each key-value pair
       De/serialization overhead
   Local aggregation
        Opportunities to perform local aggregation vary
       Combiners make a big difference
       Combiners vs. in-mapper combining
       RAM vs. disk vs. network
Group Work (Examples)
Task 5
   How many distinct words in the document collection start
    with each letter?
        Note: "types" vs. "tokens"
Task 5
   How many distinct words in the document collection start
    with each letter?
         Note: "types" vs. "tokens"

     Mapper<String,String → String,String>
    Map(String docID, String document)
        for each word in document
             emit (first character, word)




   Ways to make more efficient?
Task 5
   How many distinct words in the document collection start
    with each letter?
         Note: "types" vs. "tokens"

     Mapper<String,String → String,String>
    Map(String docID, String document)
        for each word in document
             emit (first character, word)

     Reducer<String,String → String,Integer>
     Reduce(String letter, Iterator<String> words):
         set of words = empty set;
         for each word
            add word to set
         emit(letter, size of word set)



   Ways to make more efficient?
Task 5b
   How many distinct words in the document collection start
    with each letter?
        How to use in-mapper combining and a separate combiner
        Tradeoffs

     Mapper<String,String → String,String>
    Map(String docID, String document)
        for each word in document
             emit (first character, word)
Task 5b
   How many distinct words in the document collection start
    with each letter?
       How to use in-mapper combining and a separate combiner
       Tradeoffs?

  Mapper<String,String → String,String>
 Map(String docID, String document)
     for each word in document
          emit (first character, word)

  Combiner<String,String → String,String>
  Combine(String letter, Iterator<String> words):
      set of words = empty set;
      for each word
         add word to set
      for each word in set
         emit(letter, word)
Task 6: find median document length
     Mapper<K1,V1 → Integer,Integer>
     Map(K1 xx, V1 xx)
       10,000 / N times
          emit( length(generateRandomDocument()), 1)

     Reducer<Integer,Integer → Integer,V3>
     Reduce(Integer length, Iterator<Integer> values):
        static list lengths = empty list;
        for each value
           append length to list

    Close() { output median }




   conf.setNumReduceTasks(1)
   Problems with this solution?
Interlude: Scaling counts
    Many applications require counts of words in some
     context.
        E.g. information retrieval, vector-based semantics
     Counts from frequent words like "the" can overwhelm the
      signal from content words such as "stocks" and "football"
     Two strategies for combating high-frequency words:
         Use a stop list that excludes them
         Scale the counts so that high-frequency words are downweighted.
Interlude: Scaling counts, TF-IDF
     TF-IDF, or term frequency-inverse document frequency,
      is a standard way of scaling.
    Inverse document frequency for a term t is the ratio of the
     number of documents in the collection to the number of
     documents containing t:




    TF-IDF is just the term frequency times the idf:
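The formulas on this slide are images in the original deck; in standard notation consistent with the slide text (N = number of documents in the collection, df_t = number of documents containing term t):

    idf_t = N / df_t          (commonly log-scaled: idf_t = log(N / df_t))

    tf-idf_t,d = tf_t,d × idf_t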
Interlude: Scaling counts using DF
    Recall the word co-occurrence counts task from the earlier
     slides.
        m_ij represents the number of times word j has occurred in the
         neighborhood of word i.
        The row m_i gives a vector profile of word i that we can use for
         tasks like determining word similarity (e.g. using cosine distance)
        Words like "the" will tend to have high counts that we want to scale
         down so they don't dominate this computation.
     The counts in m_ij can be scaled down using df_j. Let's
      create a transformed matrix S where:
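The definition of S is an image in the original deck; the natural completion of the slide text (an assumption) is

    s_ij = m_ij / df_j

i.e., each co-occurrence count is divided by the document frequency of word j, shrinking the contribution of ubiquitous words like "the".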
Task 7
     Compute S, the co-occurrence counts scaled by document
      frequency.
       • First: do the simplest mapper
       • Then: simplify things for the reducer

More Related Content

What's hot

Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaFernando Rodriguez
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Sparksamthemonad
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to sparkJavier Arrieta
 
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...kcitp
 
High Dynamic Range color grading and display in Frostbite
High Dynamic Range color grading and display in FrostbiteHigh Dynamic Range color grading and display in Frostbite
High Dynamic Range color grading and display in FrostbiteElectronic Arts / DICE
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHanborq Inc.
 
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver OverheadOpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver OverheadTristan Lorach
 
Modern OpenGL Usage: Using Vertex Buffer Objects Well
Modern OpenGL Usage: Using Vertex Buffer Objects Well Modern OpenGL Usage: Using Vertex Buffer Objects Well
Modern OpenGL Usage: Using Vertex Buffer Objects Well Mark Kilgard
 
Assignment of Different-Sized Inputs in MapReduce
Assignment of Different-Sized Inputs in MapReduceAssignment of Different-Sized Inputs in MapReduce
Assignment of Different-Sized Inputs in MapReduceShantanu Sharma
 

What's hot (20)

Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and Scala
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Lec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptxLec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptx
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
Scalding
ScaldingScalding
Scalding
 
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologie...
 
High Dynamic Range color grading and display in Frostbite
High Dynamic Range color grading and display in FrostbiteHigh Dynamic Range color grading and display in Frostbite
High Dynamic Range color grading and display in Frostbite
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep Insight
 
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver OverheadOpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
 
Modern OpenGL Usage: Using Vertex Buffer Objects Well
Modern OpenGL Usage: Using Vertex Buffer Objects Well Modern OpenGL Usage: Using Vertex Buffer Objects Well
Modern OpenGL Usage: Using Vertex Buffer Objects Well
 
OpenGL 4 for 2010
OpenGL 4 for 2010OpenGL 4 for 2010
OpenGL 4 for 2010
 
Intro to Map Reduce
Intro to Map ReduceIntro to Map Reduce
Intro to Map Reduce
 
Assignment of Different-Sized Inputs in MapReduce
Assignment of Different-Sized Inputs in MapReduceAssignment of Different-Sized Inputs in MapReduce
Assignment of Different-Sized Inputs in MapReduce
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 

Viewers also liked

Curso desarrollo y comercialización de aplicaciones SaaS
Curso desarrollo y comercialización de aplicaciones SaaSCurso desarrollo y comercialización de aplicaciones SaaS
Curso desarrollo y comercialización de aplicaciones SaaSAsimov Consultores
 
Desarrollo de Soluciones Escalables de Software como Servicio (SaaS)
Desarrollo de Soluciones Escalables de Software como Servicio (SaaS)Desarrollo de Soluciones Escalables de Software como Servicio (SaaS)
Desarrollo de Soluciones Escalables de Software como Servicio (SaaS)Mario Jose Villamizar Cano
 
Frameworks y herramientas de desarrollo ágil para emprendedores y startups
Frameworks y herramientas de desarrollo ágil para emprendedores y startupsFrameworks y herramientas de desarrollo ágil para emprendedores y startups
Frameworks y herramientas de desarrollo ágil para emprendedores y startupsMario Jose Villamizar Cano
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduceShrihari Rathod
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveSpark Summit
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
ERP-System eEvolution - ein Ausblick auf kommende Entwicklungen 2011
ERP-System eEvolution - ein Ausblick auf kommende Entwicklungen 2011ERP-System eEvolution - ein Ausblick auf kommende Entwicklungen 2011
ERP-System eEvolution - ein Ausblick auf kommende Entwicklungen 2011eEvolution GmbH &amp; Co. KG
 
Guía SaaS en la empresa. La externalización de servicios en TI como mejora de...
Guía SaaS en la empresa. La externalización de servicios en TI como mejora de...Guía SaaS en la empresa. La externalización de servicios en TI como mejora de...
Guía SaaS en la empresa. La externalización de servicios en TI como mejora de...Néstor González
 
Modelos de Negocio con Software Libre 4/6 SaaS
Modelos de Negocio con Software Libre 4/6 SaaSModelos de Negocio con Software Libre 4/6 SaaS
Modelos de Negocio con Software Libre 4/6 SaaSSergio Montoro Ten
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteSpark Summit
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 

Viewers also liked (14)

Curso desarrollo y comercialización de aplicaciones SaaS
Curso desarrollo y comercialización de aplicaciones SaaSCurso desarrollo y comercialización de aplicaciones SaaS
Curso desarrollo y comercialización de aplicaciones SaaS
 
Desarrollo de Soluciones Escalables de Software como Servicio (SaaS)
Desarrollo de Soluciones Escalables de Software como Servicio (SaaS)Desarrollo de Soluciones Escalables de Software como Servicio (SaaS)
Desarrollo de Soluciones Escalables de Software como Servicio (SaaS)
 
Frameworks y herramientas de desarrollo ágil para emprendedores y startups
Frameworks y herramientas de desarrollo ágil para emprendedores y startupsFrameworks y herramientas de desarrollo ágil para emprendedores y startups
Frameworks y herramientas de desarrollo ágil para emprendedores y startups
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
ERP-System eEvolution - ein Ausblick auf kommende Entwicklungen 2011
ERP-System eEvolution - ein Ausblick auf kommende Entwicklungen 2011ERP-System eEvolution - ein Ausblick auf kommende Entwicklungen 2011
ERP-System eEvolution - ein Ausblick auf kommende Entwicklungen 2011
 
Guía SaaS en la empresa. La externalización de servicios en TI como mejora de...
Guía SaaS en la empresa. La externalización de servicios en TI como mejora de...Guía SaaS en la empresa. La externalización de servicios en TI como mejora de...
Guía SaaS en la empresa. La externalización de servicios en TI como mejora de...
 
Modelos de Negocio con Software Libre 4/6 SaaS
Modelos de Negocio con Software Libre 4/6 SaaSModelos de Negocio con Software Libre 4/6 SaaS
Modelos de Negocio con Software Libre 4/6 SaaS
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 

Similar to Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012Steven Francia
 
MongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataMongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataSteven Francia
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startupsbmlever
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindEMC
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 

Similar to Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011) (20)

Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
ch02-mapreduce.pptx
ch02-mapreduce.pptxch02-mapreduce.pptx
ch02-mapreduce.pptx
 
MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012
 
MongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataMongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous Data
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startups
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
Hadoop
HadoopHadoop
Hadoop
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
MapReduce DesignPatterns
MapReduce DesignPatternsMapReduce DesignPatterns
MapReduce DesignPatterns
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 

More from Matthew Lease

Automated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey ResponsesAutomated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey ResponsesMatthew Lease
 
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Matthew Lease
 
Explainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loopExplainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loopMatthew Lease
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Matthew Lease
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd Matthew Lease
 
Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Matthew Lease
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Matthew Lease
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?Matthew Lease
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Matthew Lease
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Matthew Lease
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information RetrievalMatthew Lease
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Matthew Lease
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...Matthew Lease
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingMatthew Lease
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)Matthew Lease
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016Matthew Lease
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)Matthew Lease
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing ScienceMatthew Lease
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsMatthew Lease
 

More from Matthew Lease (20)

Automated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey ResponsesAutomated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey Responses
 
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
 
Explainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loopExplainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loop
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd
 
Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
 

Recently uploaded

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 

Recently uploaded (20)

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

  • 9. Shuffle and 2 Sorts (figure courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178)
    – As the map emits values, local sorting runs in tandem (1st sort)
    – Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (with more sorting of its output)
    – Partition determines which (logical) reducer Rj each key will go to
    – The node’s TaskTracker tells the JobTracker it has keys for Rj
    – The JobTracker chooses a node to run Rj based on data locality
    – When the local map/combine/sort finishes, the node sends its data to Rj’s node
    – Rj’s node iteratively merges inputs from the map nodes as they arrive (2nd sort)
    – For each (K, list(V)) tuple in the merged output, reduce(…) is called
  • 10. Scalable Hadoop Algorithms: Themes
    – Avoid object creation
      • An inherently costly operation
      • Garbage collection
    – Avoid buffering
      • Limited heap size
      • Works for small datasets, but won’t scale!
      • Yet… we’ll talk about patterns involving buffering…
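To make the first theme concrete, here is a minimal sketch of object reuse in a tokenizing mapper (the class name and whitespace tokenization are our illustration, not the deck’s):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      // Reused across every call to map(): no per-record allocation,
      // so the garbage collector has far less work to do.
      private final Text word = new Text();
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          if (token.isEmpty()) continue;
          word.set(token);           // overwrite the same Text object
          context.write(word, ONE);  // safe: the framework serializes immediately
        }
      }
    }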
  • 11. Importance of Local Aggregation
    – Ideal scaling characteristics:
      • Twice the data, twice the running time
      • Twice the resources, half the running time
    – Why can’t we achieve this?
      • Synchronization requires communication
      • Communication kills performance
    – Thus… avoid communication!
      • Reduce intermediate data via local aggregation
      • Combiners can help
  • 12. Tools for Synchronization
    – Cleverly-constructed data structures
      • Bring partial results together
    – Sort order of intermediate keys
      • Control the order in which reducers process keys
    – Partitioner
      • Control which reducer processes which keys
    – Preserving state in mappers and reducers
      • Capture dependencies across multiple keys and values
  • 13. Secondary Sorting
    – MapReduce sorts input to reducers by key
      • Values may be arbitrarily ordered
    – What if we want to sort the values as well?
      • E.g., k → (k1,v1), (k1,v3), (k1,v4), (k1,v8)…
    – Solutions?
      • Swap key and value to sort by value?
      • What if we use (k,v) as a joint key (and change nothing else)?
  • 14. Secondary Sorting: Solutions
    – Solution 1: buffer values in memory, then sort
      • Tradeoffs?
    – Solution 2: the “value-to-key conversion” design pattern
      • Form a composite intermediate key: (k, v1)
      • Let the execution framework do the sorting
      • Preserve state across multiple key-value pairs
      • …how do we make this happen?
  • 15. Secondary Sorting (Lin 57, White 241)
    – Create a composite key: (k, v)
    – Define a key comparator to sort on both components
      • Possibly not needed in some cases (e.g., strings & concatenation)
    – Define a partition function based only on the (original) key
      • All pairs with the same key should go to the same reducer
    – Multiple keys may still go to the same reduce node; how do you know when the key changes across invocations of reduce()?
      • I.e., assume you want to do something with all values associated with a given key (e.g., print them all on the same line, with no other keys)
    – Preserve state in the reducer across invocations
      • reduce() will be called separately for each pair, but we need to track the current key so we can detect when it changes
    – Hadoop also provides a Group Comparator
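A sketch of the partition-function bullet, with the composite key encoded as a Text "k<TAB>v" (our simplification of the strings-and-concatenation case):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text compositeKey, Text value, int numReducers) {
        // Hash only the original key k, so every (k, *) composite key
        // lands on the same reducer regardless of the value component.
        String naturalKey = compositeKey.toString().split("\t", 2)[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
      }
    }

The driver would register this with job.setPartitionerClass(NaturalKeyPartitioner.class); the reducer-side key-change tracking is exactly the preserved-state pattern shown on the next slide.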
  • 16. Preserving State in Hadoop
    [Figure: one Mapper object and one Reducer object live for the whole task, each holding state; the configure API initialization hook runs once per task, map()/reduce() run once per input key-value pair / per intermediate key, and the close API cleanup hook runs once per task]
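In the newer org.apache.hadoop.mapreduce API the figure’s configure/close hooks are named setup() and cleanup(); a minimal skeleton of the pattern (the state map is our illustration):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class StatefulMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private Map<String, Integer> state;        // lives for the whole task

      @Override
      protected void setup(Context context) {    // one call per task, before any input
        state = new HashMap<String, Integer>();
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // one call per input key-value pair; 'state' persists between calls
      }

      @Override
      protected void cleanup(Context context)    // one call per task, after all input
          throws IOException, InterruptedException {
        // flush anything accumulated in 'state'
      }
    }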
  • 17. Combiner Design
    – Combiners and reducers share the same method signature
      • Sometimes, reducers can serve as combiners
      • Often, not…
    – Remember: combiners are optional optimizations
      • They should not affect algorithm correctness
      • They may be run 0, 1, or multiple times
  • 18. “Hello World”: Word Count
    map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
    reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
    Map(String docid, String text):
      for each word w in text:
        Emit(w, 1);
    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
        sum += v;
      Emit(term, sum);
  • 19. “Hello World”: Word Count (same code as slide 18) Combiner?
  • 21. Design Pattern for Local Aggregation
    – “In-mapper combining”
      • Fold the functionality of the combiner into the mapper, including preserving state across multiple map calls
    – Advantages
      • Speed
      • Why is this faster than actual combiners? It saves construction/deconstruction and serialization/deserialization of intermediate pairs, and its use is guaranteed and under your control
    – Disadvantages
      • Buffering! Explicit memory management is required; a disk-backed buffer can cap the number of items or bytes held in memory. And what if multiple mappers are running on the same node? Do we know?
      • Potential for order-dependent bugs
  • 22. “Hello World”: Word Count
    map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
    reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
    Map(String docid, String text):
      for each word w in text:
        Emit(w, 1);
    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
        sum += v;
      Emit(term, sum);
    Combine = reduce
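Because summing is associative and commutative, the reducer class can be registered as the combiner. A hedged driver sketch (WordCountMapper and WordCountReducer are hypothetical class names standing in for the pseudocode above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // hypothetical mapper class
        job.setCombinerClass(WordCountReducer.class);  // combine = reduce
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }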
  • 23. Word Count: in-map combining
    [Figure: a word-count mapper that aggregates counts in a local associative array within each call to map(), emitting one (word, count) pair per distinct word per document]
    Are combiners still needed?
  • 24. Word Count: in-mapper combining
    [Figure: a word-count mapper that preserves its count buffer across calls to map() and emits the totals in the cleanup hook]
    Are combiners still needed?
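A sketch of the in-mapper combining version in Java (the buffer layout and names are ours); note it buffers one entry per distinct word seen by the task, the memory caveat from slide 21:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private Map<String, Integer> counts;   // partial counts, preserved across map() calls

      @Override
      protected void setup(Context context) {
        counts = new HashMap<String, Integer>();
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context) {
        for (String token : line.toString().split("\\s+")) {
          if (token.isEmpty()) continue;
          Integer c = counts.get(token);
          counts.put(token, c == null ? 1 : c + 1);   // aggregate locally; emit nothing yet
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        Text word = new Text();
        IntWritable sum = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          word.set(e.getKey());
          sum.set(e.getValue());
          context.write(word, sum);   // one pair per distinct word per task
        }
      }
    }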
  • 25. Example 2: Compute the Mean (v1) Why can’t we use the reducer as a combiner?
  • 26. Example 2: Compute the Mean (v2) Why doesn’t this work?
  • 27. Example 2: Compute the Mean (v3)
  • 28. Computing the Mean: in-mapper combining Are combiners still needed?
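The fix behind the working versions is to make the partial results associative: ship (sum, count) pairs instead of means, since the mean of means is not the overall mean. A sketch under that reading (the comma-separated Text encoding of the pair is our shortcut; a custom pair Writable would be the real choice, and we assume (Text, IntWritable) input records, e.g. from a SequenceFile):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    class MeanMapper extends Mapper<Text, IntWritable, Text, Text> {
      private Map<String, long[]> partial;   // key -> {sum, count}, kept across calls

      @Override
      protected void setup(Context context) {
        partial = new HashMap<String, long[]>();
      }

      @Override
      protected void map(Text key, IntWritable value, Context context) {
        long[] sc = partial.get(key.toString());
        if (sc == null) partial.put(key.toString(), sc = new long[] { 0, 0 });
        sc[0] += value.get();   // running sum
        sc[1] += 1;             // running count
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, long[]> e : partial.entrySet())
          context.write(new Text(e.getKey()),
                        new Text(e.getValue()[0] + "," + e.getValue()[1]));
      }
    }

    class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text v : values) {              // sums and counts add; means do not
          String[] sc = v.toString().split(",");
          sum += Long.parseLong(sc[0]);
          count += Long.parseLong(sc[1]);
        }
        context.write(key, new DoubleWritable((double) sum / count));
      }
    }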
  • 29. Example 3: Term Co-occurrence
    – Term co-occurrence matrix for a text collection
      • M = N x N matrix (N = vocabulary size)
      • Mij: number of times i and j co-occur in some context (for concreteness, let’s say context = sentence)
    – Why?
      • Distributional profiles as a way of measuring semantic distance
      • Semantic distance is useful for many language processing tasks
  • 30. MapReduce: Large Counting Problems
    – Term co-occurrence matrix for a text collection = a specific instance of a large counting problem
      • A large event space (number of terms)
      • A large number of observations (the collection itself)
      • Goal: keep track of interesting statistics about the events
    – Basic approach
      • Mappers generate partial counts
      • Reducers aggregate partial counts
    – How do we aggregate partial counts efficiently?
  • 31. Approach 1: “Pairs”
    – Each mapper takes a sentence:
      • Generate all co-occurring term pairs
      • For all pairs, emit (a, b) → count
    – Reducers sum up counts associated with these pairs
    – Use combiners!
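A minimal pairs mapper, assuming one sentence per input line and encoding the pair as a Text "a,b" (a custom pair Writable is the more usual choice); the reducer is then the same summing shape as word count:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text pair = new Text();   // reused across emissions

      @Override
      protected void map(LongWritable offset, Text sentence, Context context)
          throws IOException, InterruptedException {
        String[] terms = sentence.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          for (int j = 0; j < terms.length; j++) {
            if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) continue;
            pair.set(terms[i] + "," + terms[j]);   // one pair per co-occurrence
            context.write(pair, ONE);
          }
        }
      }
    }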
  • 33. “Pairs” Analysis
    – Advantages
      • Easy to implement, easy to understand
    – Disadvantages
      • Lots of pairs to sort and shuffle around (upper bound?)
      • Not many opportunities for combiners to work
  • 34. Another Try: “Stripes”
    – Idea: group together pairs into an associative array
        (a, b) → 1
        (a, c) → 2
        (a, d) → 5        a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
        (a, e) → 3
        (a, f) → 2
    – Each mapper takes a sentence:
      • Generate all co-occurring term pairs
      • For each term, emit a → { b: countb, c: countc, d: countd … }
    – Reducers perform an element-wise sum of associative arrays
          a → { b: 1,       d: 5, e: 3 }
        + a → { b: 1, c: 2, d: 2,       f: 2 }
        = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
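One way to sketch stripes is with Hadoop’s MapWritable as the associative array (our choice; a dedicated string-to-int map writable would be lighter weight). The reducer’s element-wise sum is associative, so it can double as the combiner:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
      @Override
      protected void map(LongWritable offset, Text sentence, Context context)
          throws IOException, InterruptedException {
        String[] terms = sentence.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          if (terms[i].isEmpty()) continue;
          MapWritable stripe = new MapWritable();  // neighbors of terms[i] in this sentence
          for (int j = 0; j < terms.length; j++) {
            if (i == j || terms[j].isEmpty()) continue;
            Text neighbor = new Text(terms[j]);
            IntWritable old = (IntWritable) stripe.get(neighbor);
            stripe.put(neighbor, new IntWritable(old == null ? 1 : old.get() + 1));
          }
          context.write(new Text(terms[i]), stripe);
        }
      }
    }

    class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
      @Override
      protected void reduce(Text term, Iterable<MapWritable> stripes, Context context)
          throws IOException, InterruptedException {
        MapWritable sum = new MapWritable();       // element-wise sum of all stripes
        for (MapWritable stripe : stripes) {
          for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
            Text neighbor = new Text(e.getKey().toString());  // defensive copy
            IntWritable old = (IntWritable) sum.get(neighbor);
            int add = ((IntWritable) e.getValue()).get();
            sum.put(neighbor, new IntWritable(old == null ? add : old.get() + add));
          }
        }
        context.write(term, sum);
      }
    }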
  • 36. “Stripes” Analysis
    – Advantages
      • Far less sorting and shuffling of key-value pairs
      • Can make better use of combiners
    – Disadvantages
      • More difficult to implement
      • The underlying object is more heavyweight
      • Fundamental limitation in terms of the size of the event space: buffering!
  • 37. Cluster size: 38 cores. Data source: the Associated Press Worldstream (APW) portion of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
  • 39. Relative Frequencies
    – How do we estimate relative frequencies from counts?

        f(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B′ count(A, B′)

    – Why do we want to do this?
    – How do we do this with MapReduce?
  • 40. f(B|A): “Stripes”
    a → { b1: 3, b2: 12, b3: 7, b4: 1, … }
    – Easy!
      • One pass to compute (a, *)
      • Another pass to directly compute f(B|A)
  • 41. f(B|A): “Pairs”
    (a, *)  → 32        the reducer holds this value in memory
    (a, b1) → 3         becomes  (a, b1) → 3 / 32
    (a, b2) → 12        becomes  (a, b2) → 12 / 32
    (a, b3) → 7         becomes  (a, b3) → 7 / 32
    (a, b4) → 1         becomes  (a, b4) → 1 / 32
    …
    – For this to work:
      • Must emit an extra (a, *) for every bn in the mapper
      • Must make sure all a’s get sent to the same reducer (use a partitioner)
      • Must make sure (a, *) comes first (define the sort order)
      • Must hold state in the reducer across different key-value pairs
  • 42. “Order Inversion”
    – A common design pattern
      • Computing relative frequencies requires marginal counts
      • But the marginal cannot be computed until you see all the counts
      • Buffering is a bad idea!
      • Trick: get the marginal counts to arrive at the reducer before the joint counts
    – Optimizations
      • Apply the in-memory combining pattern to accumulate marginal counts
      • Should we apply combiners?
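A reducer-side sketch of this pattern, assuming pairs encoded as the Text "a<TAB>b" with "*" as the marginal marker. Under Text’s default byte-lexicographic ordering, "*" sorts before letters and digits, so (a, *) reaches the reducer first without a custom comparator; the mapper must also emit (a, *) → 1 alongside every (a, b) → 1, and the NaturalKeyPartitioner sketched earlier keeps all of a’s pairs on one reducer:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    class RelFreqReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
      private long marginal = 0;   // count(a), preserved across reduce() calls

      @Override
      protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (IntWritable c : counts) sum += c.get();
        if (pair.toString().endsWith("\t*")) {
          marginal = sum;          // (a, *) arrives first: remember count(a)
        } else {
          context.write(pair, new DoubleWritable((double) sum / marginal));
        }
      }
    }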
  • 43. Synchronization: Pairs vs. Stripes
    – Approach 1: turn synchronization into an ordering problem
      • Sort keys into the correct order of computation
      • Partition the key space so that each reducer gets the appropriate set of partial results
      • Hold state in the reducer across multiple key-value pairs to perform the computation
      • Illustrated by the “pairs” approach
    – Approach 2: construct data structures that bring partial results together
      • Each reducer receives all the data it needs to complete the computation
      • Illustrated by the “stripes” approach
  • 44. Recap: Tools for Synchronization
    – Cleverly-constructed data structures
      • Bring data together
    – Sort order of intermediate keys
      • Control the order in which reducers process keys
    – Partitioner
      • Control which reducer processes which keys
    – Preserving state in mappers and reducers
      • Capture dependencies across multiple keys and values
  • 45. Issues and Tradeoffs
    – Number of key-value pairs
      • Object creation overhead
      • Time for sorting and shuffling pairs across the network
    – Size of each key-value pair
      • De/serialization overhead
    – Local aggregation
      • Opportunities to perform local aggregation vary
      • Combiners make a big difference
      • Combiners vs. in-mapper combining
      • RAM vs. disk vs. network
  • 47. Task 5
    – How many distinct words in the document collection start with each letter?
    – Note: “types” vs. “tokens”
  • 48. Task 5
    – How many distinct words in the document collection start with each letter?
    – Note: “types” vs. “tokens”
    Mapper<String,String → String,String>
    Map(String docID, String document)
      for each word in document
        emit (first character, word)
    – Ways to make this more efficient?
  • 49. Task 5
    – How many distinct words in the document collection start with each letter?
    – Note: “types” vs. “tokens”
    Mapper<String,String → String,String>
    Map(String docID, String document)
      for each word in document
        emit (first character, word)
    Reducer<String,String → String,Integer>
    Reduce(String letter, Iterator<String> words):
      set of words = empty set;
      for each word
        add word to set
      emit(letter, size of word set)
    – Ways to make this more efficient?
  • 50. Task 5b
    – How many distinct words in the document collection start with each letter?
    – How can we use in-mapper combining and a separate combiner?
    – Tradeoffs?
    Mapper<String,String → String,String>
    Map(String docID, String document)
      for each word in document
        emit (first character, word)
  • 51. Task 5b
    – How many distinct words in the document collection start with each letter?
    – How can we use in-mapper combining and a separate combiner?
    – Tradeoffs?
    Mapper<String,String → String,String>
    Map(String docID, String document)
      for each word in document
        emit (first character, word)
    Combiner<String,String → String,String>
    Combine(String letter, Iterator<String> words):
      set of words = empty set;
      for each word
        add word to set
      for each word in set
        emit(letter, word)
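For the in-mapper combining half of 5b, one sketch (the names are ours) keeps a set of words per first letter and emits each distinct (letter, word) pair at most once per task; the tradeoff is that the buffer must hold every distinct word the task sees:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class DistinctWordsMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Map<Character, Set<String>> seen;   // per-letter word sets, kept across calls

      @Override
      protected void setup(Context context) {
        seen = new HashMap<Character, Set<String>>();
      }

      @Override
      protected void map(LongWritable docID, Text document, Context context) {
        for (String word : document.toString().split("\\s+")) {
          if (word.isEmpty()) continue;
          char first = word.charAt(0);
          Set<String> words = seen.get(first);
          if (words == null) seen.put(first, words = new HashSet<String>());
          words.add(word);   // set membership deduplicates within this task
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<Character, Set<String>> e : seen.entrySet())
          for (String w : e.getValue())
            context.write(new Text(e.getKey().toString()), new Text(w));
      }
    }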
  • 52. Task 6: find median document length
  • 53. Task 6: find median document length
    Mapper<K1,V1 → Integer,Integer>
    Map(K1 xx, V1 xx)
      10,000 / N times:
        emit( length(generateRandomDocument()), 1 )
  • 54. Task 6: find median document length
    Mapper<K1,V1 → Integer,Integer>
    Map(K1 xx, V1 xx)
      10,000 / N times:
        emit( length(generateRandomDocument()), 1 )
    Reducer<Integer,Integer → Integer,V3>
    Reduce(Integer length, Iterator<Integer> values):
      static list lengths = empty list;
      for each value
        append length to list
      Close() { output median }
    – conf.setNumReduceTasks(1)
    – Problems with this solution?
  • 55. Interlude: Scaling counts
    – Many applications require counts of words in some context.
      • E.g., information retrieval, vector-based semantics
    – Counts from frequent words like “the” can overwhelm the signal from content words such as “stocks” and “football”
    – Two strategies for combating high-frequency words:
      • Use a stop list that excludes them
      • Scale the counts so that high-frequency words are downweighted
  • 56. Interlude: Scaling counts, TF-IDF
    – TF-IDF, or term frequency–inverse document frequency, is a standard way of scaling.
    – The inverse document frequency for a term t is the ratio of the number of documents in the collection (N) to the number of documents containing t (its document frequency, df_t):

        idf_t = N / df_t

    – TF-IDF is just the term frequency times the idf:

        tfidf_t,d = tf_t,d × idf_t
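A quick worked example under the ratio definition above (the corpus numbers are invented for illustration): with N = 1,000,000 documents, a function word with df = 900,000 gets idf = 1,000,000 / 900,000 ≈ 1.1, while a content word with df = 1,000 gets idf = 1,000; equal raw term frequencies thus end up roughly three orders of magnitude apart after scaling. Many formulations take the log of this ratio; the plain ratio follows the slide’s wording.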
  • 58. Interlude: Scaling counts using DF
    – Recall the word co-occurrence counts task from the earlier slides.
      • m_ij represents the number of times word j has occurred in the neighborhood of word i.
      • The row m_i gives a vector profile of word i that we can use for tasks like determining word similarity (e.g., using cosine distance)
    – Words like “the” will tend to have high counts that we want to scale down so they don’t dominate this computation.
    – The counts in m_ij can be scaled down using df_j. Let’s create a transformed matrix S where:

        s_ij = m_ij / df_j
  • 59. Task 7
    – Compute S, the co-occurrence counts scaled by document frequency.
      • First: do the simplest mapper
      • Then: simplify things for the reducer
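One possible starting point (our reading of the exercise, with deliberately naive sentence splitting): key each co-occurrence count by the context word j, and emit a (j, *) marker once per document in which j appears, so that the order-inversion machinery sketched after slide 42 delivers df_j to the reducer before the m_ij counts it must scale:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class ScaledCooccurrenceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable docID, Text document, Context context)
          throws IOException, InterruptedException {
        Set<String> distinct = new HashSet<String>();      // for the df_j contribution
        for (String sentence : document.toString().split("[.!?]")) {
          String[] terms = sentence.trim().split("\\s+");
          for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < terms.length; j++) {
              if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) continue;
              // key by the context word j so its marginal can arrive first
              context.write(new Text(terms[j] + "\t" + terms[i]), ONE);
            }
          }
          for (String t : terms)
            if (!t.isEmpty()) distinct.add(t);
        }
        for (String t : distinct)
          context.write(new Text(t + "\t*"), ONE);          // df: once per document
      }
    }

Paired with a partitioner on the natural key and a reducer in the RelFreqReducer mold (dividing each summed count by the df_j marginal instead of count(a)), this computes s_ij = m_ij / df_j directly.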