Mapreduce Algorithms
 O'Reilly Strata Conference,
  London UK, October 1st 2012

             Amund Tveit
   amund@atbrox.com - twitter.com/atveit

 http://atbrox.com/about/ - twitter.com/atbrox
Background

● Been blogging about Mapreduce Algorithms in Academic
  Papers since Oct 2009 (1st Hadoop World)
   1. http://atbrox.com/2009/10/01/mapreduce-and-hadoop-academic-
      papers/
   2. http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-
      academic-papers-updated/
   3. http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-
      academic-papers-may-2010-update/
   4. http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-
      academic-papers-4th-update-may-2011/
   5. http://atbrox.com/2011/11/09/mapreduce-hadoop-algorithms-in-
      academic-papers-5th-update-%E2%80%93-nov-2011/
● Atbrox works on IR-related Hadoop and cloud projects
● My prior experience: Google (software infrastructure and
  mobile news), PhD in Computer Science
TOC

1. Brief introduction to Mapreduce Algorithms

2. Overview of a few Recent Mapreduce Algorithms in Papers

3. In-Depth look at a Mapreduce Algorithm

4. Recommendations for Designing Mapreduce Algorithms

5. Appendix - 6th (partial) list of Mapreduce and Hadoop
Algorithms in Academic papers
1. Brief Introduction
           to
Mapreduce Algorithms
1.1 So What is Mapreduce?

Mapreduce is a concept, method and software for (typically batch-based)
large-scale parallelization. It is inspired by functional programming's
map() and reduce() functions.

Nice features of mapreduce systems include:
  ● reliably processing jobs even when machines die (vs MPI, BSP)
  ● parallelization, e.g. thousands of machines for terasort and
    petasort

Mapreduce was invented by the Google fellows:
        Jeff Dean       Sanjay Ghemawat
1.2 Mapper function

Processes one key and value pair at a time, e.g.

 ● word count
    ○ map(key: uri, value: text):
       ■ for word in tokenize(value)
       ■ emit(word, 1) # found 1 occurrence of word

 ● inverted index
     ○ map(key: uri, value: text):
        ■ for word in tokenize(value)
        ■ emit(word, key) # word and uri pair
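
The two mappers above can be sketched in plain Python as generators, where "yield" plays the role of emit. This is a minimal sketch, not tied to any Mapreduce framework; the tokenize() helper and the function names are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Assumed tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def map_word_count(uri, text):
    """Emit (word, 1) for every token in the document."""
    for word in tokenize(text):
        yield word, 1

def map_inverted_index(uri, text):
    """Emit (word, uri) for every token in the document."""
    for word in tokenize(text):
        yield word, uri

if __name__ == "__main__":
    print(list(map_word_count("http://example.org/a", "to be or not to be")))
    print(list(map_inverted_index("http://example.org/a", "to be")))
```

Each call handles exactly one (key, value) pair, so a framework is free to run many such calls in parallel on different documents.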
1.3 Reducer function

A reducer processes one key and all values that belong to it (as
received and aggregated from the map function), e.g.

 ● word count
    ○ reduce(key: word type, value: list of 1s):
        ■ emit(key, sum(value))

 ● inverted index
     ○ reduce(key: word type, value: list of URIs):
         ■ # perhaps transformation of value, e.g. encoding
         ■ emit(key, value) // e.g. to a distr. hash table
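
The reducers above can be tried locally with a small in-memory stand-in for the shuffle step, which groups map output by key before reduce is called. This is only a sketch: shuffle() and the sample pairs are assumptions, and a real framework performs the grouping across machines.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group map output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_word_count(word, ones):
    """Sum the 1s emitted for this word."""
    yield word, sum(ones)

def reduce_inverted_index(word, uris):
    """Deduplicate and sort the URIs this word occurs in."""
    yield word, sorted(set(uris))

if __name__ == "__main__":
    count_pairs = [("to", 1), ("be", 1), ("to", 1)]
    index_pairs = [("to", "doc2"), ("be", "doc1"), ("to", "doc1"), ("to", "doc2")]
    for word, ones in sorted(shuffle(count_pairs).items()):
        print(list(reduce_word_count(word, ones)))
    for word, uris in sorted(shuffle(index_pairs).items()):
        print(list(reduce_inverted_index(word, uris)))
```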
1.4 Mapreduce Pan Patterns
1.4 Pattern 1 - Data Reduction




  ● Word Count
  ● Machine Learning (e.g. training models)
  ● Probably the most common way of using
    mapreduce
1.5 Pattern 2 - Transformation




  ● Sorting (e.g. Terasort and Petasort)
1.6 Pattern 3 - Data Increase




  ● Decompression
  ● Annotation, e.g. traditional indexing pipeline
2. Examples of recently published use
   and development of Mapreduce
             Algorithms
2.1 Machine Learning - ILP

 ● Problem: Automatically find (induce) rules from examples
   and knowledge base

 ● Paper:
    ○ Data and Task Parallelism in ILP using
      Mapreduce (IBM Research India et al.)


This follows Pan Pattern 1 - Data Reduction - output is a set of
rules from a (typically larger) set of examples and knowledge
base
2.1 Machine Learning - ILP - II

  Example Input:




  Example Result:
2.2 Finance - Trading

Problem: Optimize Algorithmic Trading

Paper:
    ○ Optimizing Parameters of Algorithm Trading Strategies
      using Mapreduce (EMC-Greenplum Research China et al.)


This follows Pan Pattern 1 - Data Reduction - output is the set
of best parameter sets for algorithmic trading. Note that during
map phase there is increase in data, i.e. creation of
permutations of possible parameters
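
To illustrate only the shape of this pattern (data increase in the map phase, reduction to the best parameter sets afterwards), here is a toy sketch. It is not the paper's method: the parameter names, the grid and the score() function are hypothetical stand-ins for a real backtest.

```python
import heapq
from itertools import product

def expand(strategy, grid):
    """Map step: emit one (strategy, params) record per parameter combination."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield strategy, dict(zip(names, values))

def score(params):
    """Hypothetical stand-in for a backtest result; higher is better."""
    return -abs(params["fast"] - 5) - abs(params["slow"] - 20)

def best(strategy, candidates, k=2):
    """Reduce step: keep the k best-scoring parameter sets."""
    return strategy, heapq.nlargest(k, candidates, key=score)

if __name__ == "__main__":
    grid = {"fast": [3, 5, 8], "slow": [10, 20, 40]}
    candidates = [params for _, params in expand("crossover", grid)]
    print(len(candidates))  # one input record became 9 candidates
    print(best("crossover", candidates))
```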
2.3 Software Engineering

Problem: Automatically generate unit test code to increase test
coverage and offload developers

Paper:
    ○ A Parallel Genetic Algorithm Based on Hadoop
       Mapreduce for the Automatic Generation of JUnit Test
       Suites (University of Salerno, Italy)

This (probably) follows Pan Patterns 1, 2 and 3: presumably the
number of chromosomes is fixed (i.e. transformation), a collection
of unit tests is being evolved, and the combined length of the
evolved unit tests might increase or decrease compared to the
original input.
2.3 Software Engineering - II
  Figure from "EvoTest: Test Case Generation using
  Genetic Programming and Software Analysis"
3. In-Depth look at a
Mapreduce Algorithm
3.1 The Challenge

● Task:
   ○ Build a low-latency key-value store for disk or SSD

● Features:
   ○ Low startup time
      ■ i.e. no/little pre-loading of (large) caches to memory
   ○ Prefix-search
      ■ i.e. support searching for both all prefixes of a key as
        well as the entire key
   ○ Low-latency
      ■ i.e. reduce the number of disk/SSD seeks, e.g. by
        increasing the probability of disk cache hits
   ○ Static/Immutable data - write once, read many
3.2 A few Possible Ways

 1. Binary Search or
    Interpolation Search
    within a file of sorted keys
    and then look up value
    ~ lg(N) or lg(lg(N))

 2. Prefix-algorithms mapped
    to file, e.g.
      1. Trie,
      2. Ternary search tree
      3. Patricia Tree
     ~ O(k)
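
Option 1 can be sketched with Python's bisect module. An in-memory sorted list stands in for the file of sorted keys here (an assumption for brevity; on disk each probe would be a seek), and with_prefix() shows one reading of prefix search, namely finding all keys that share a given prefix.

```python
from bisect import bisect_left

def find(sorted_keys, key):
    """Binary search: index of key, or -1 if absent (~ lg(N) probes)."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    return -1

def with_prefix(sorted_keys, prefix):
    """Keys starting with prefix form one contiguous run in sorted order."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]

if __name__ == "__main__":
    keys = sorted(["romane", "romanus", "romulus", "rubens",
                   "ruber", "rubicon", "rubicundus"])
    print(find(keys, "ruber"))
    print(with_prefix(keys, "rub"))
```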
3.3 Overall Approach

1. Scale - divide key,value data into shards

2. Build patricia tree per shard and store all key, values for
   later

3. Prepare trees to have placeholder (short) value for each key

4. Flatten each patricia tree to a disk-friendly and byte-aligned
   format fit for random access

5. Recalculate file addresses in each patricia tree to be able to
   store the actual values

6. Create final patricia tree with values on disk
3.4 Split data with mapper

 1. Scale - divide key,value data into shards

map(key, value):
 # e.g. simple - hash(first char), or use a classifier
 # personalization etc.
 shard_key = shard_function(key, value)
 out_value = (key,value)
 emit shard_key, out_value
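
A runnable sketch of the simple variant mentioned in the comment above (hash of the first character). NUM_SHARDS and the use of zlib.crc32 are assumptions; crc32 is chosen because Python's built-in hash() of a string can differ between processes, which would send the same key to different shards on different mappers.

```python
import zlib

NUM_SHARDS = 4  # assumed number of reduce shards

def shard_function(key, value):
    """Stable shard id derived from the first character of the key."""
    return zlib.crc32(key[:1].encode("utf-8")) % NUM_SHARDS

def map_to_shard(key, value):
    """Emit the original pair under its shard key."""
    yield shard_function(key, value), (key, value)

if __name__ == "__main__":
    for key, value in [("romane", "one"), ("rubens", "four"), ("alpha", "x")]:
        print(list(map_to_shard(key, value)))
```

Hashing only the first character keeps keys with a common prefix in the same shard, at the cost of possible skew when many keys start with the same letter.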
3.5 Init and run reduce()

2. Build one patricia tree per (reduce) shard


reduce_init():        # called once per reducer before it starts
  self.patricia = Patricia()
  self.tempkeyvaluestore = TempKeyValueStore()

reducer(shard_key, key_value_pairs):
  for (key, value) in key_value_pairs:
    self.tempkeyvaluestore[key] = value
3.6 Reducer cont.

3. Prepare trees to have placeholder values (=key) for each key


reduce_final():       # called once per reducer after all reduce() calls
  for key, value in self.tempkeyvaluestore:
    self.patricia.add(key, key) # key == value for now
3.7 Flatten patricia tree for disk

4. Flatten each patricia tree to a disk-friendly and
byte-aligned format fit for random access

reduce_final():         # continued from 3.
  # num 0s below constrains addressable size of shard file
  self.firstblockaddress = "00000000000000"
  # create mapping from dict of dicts to a linear file
  self.flatten_patricia(self.patricia, parent=self.firstblockaddress)

  # steps 5 and 6: recalculate file addresses, then write the
  # tree with the actual values
  self.recalculate_patricia_tree_for_actual_values()
  self.output_patricia_tree_with_actual_values()
3.8 Mapreduce Approach - 5
Generated file format below, and
corresponding patricia tree to the
right



00000000038["", {"r": "00000000038"}]
00000000060["", {"om": "00000000098", "ub": "00000000290"}]
00000000062["", {"ulus": "00000000265", "an": "00000000160"}]
00000000059["", {"e": "00000000219", "us": "00000000242"}]
00000000023["one", ""]
00000000023["two", ""]
00000000025["three", ""]
00000000059["", {"ic": "00000000456", "e": "00000000349"}]
00000000059["", {"r": "00000000432", "ns": "00000000408"}]
00000000024["four", ""]
00000000024["five", ""]
00000000063["", {"on": "00000000519", "undus": "00000000542"}]
00000000023["six", ""]
00000000025["seven", ""]
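
Reading the listing above, each record appears to start with an 11-digit record length (including the trailing newline), followed by a JSON pair [value, children] in which children maps an edge label to the absolute offset of the child record. The slides do not spell this out, so the sketch below rests on that inference; record(), flatten() and lookup() are illustrative names, not the author's flatten_patricia(), and the tree is built by hand rather than by a Patricia class.

```python
import json

ADDR_WIDTH = 11  # digits per address and per length prefix, as in the listing

def fmt(n):
    return str(n).zfill(ADDR_WIDTH)

def record(value, child_addrs):
    """One node: length prefix + JSON [value, children] + newline."""
    body = json.dumps([value, child_addrs or ""])
    return fmt(ADDR_WIDTH + len(body) + 1) + body + "\n"

def flatten(node, address, out):
    """Append node and its subtree to out in pre-order; return the end offset."""
    value, children = node
    slot = len(out)
    out.append(None)  # reserve this node's position ahead of its children
    # Addresses have fixed width, so the record size is known before they are.
    next_addr = address + len(record(value, {k: fmt(0) for k in children}))
    child_addrs = {}
    for label, child in children.items():
        child_addrs[label] = fmt(next_addr)
        next_addr = flatten(child, next_addr, out)
    out[slot] = record(value, child_addrs)
    return next_addr

def lookup(blob, key):
    """Follow edge labels from offset 0; return the stored value or None."""
    address = 0
    while True:
        length = int(blob[address:address + ADDR_WIDTH])
        value, children = json.loads(blob[address + ADDR_WIDTH:address + length])
        if not key:
            return value or None
        for label, child_addr in (children or {}).items():
            if key.startswith(label):
                key, address = key[len(label):], int(child_addr)
                break
        else:
            return None

def leaf(value):
    return (value, {})

# The seven keys from the listing; ASCII only, so characters equal bytes.
TREE = ("", {"r": ("", {
    "om": ("", {"an": ("", {"e": leaf("one"), "us": leaf("two")}),
                "ulus": leaf("three")}),
    "ub": ("", {"e": ("", {"ns": leaf("four"), "r": leaf("five")}),
                "ic": ("", {"on": leaf("six"), "undus": leaf("seven")})}),
})})

if __name__ == "__main__":
    records = []
    flatten(TREE, 0, records)
    blob = "".join(records)
    print(records[1], end="")
    print(lookup(blob, "rubicon"))
```

A lookup touches one record per edge on the path, so the number of reads grows with the key length rather than with the number of keys, and the pre-order layout places a node's first child directly after it.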
4. Recommendations for Designing
      Mapreduce Algorithms
Mapreduce Patterns

Map() and Reduce() methods typically follow patterns. A
recommended way of representing such patterns is to

extract and generalize code skeleton fingerprints based on:
 1. loops: e.g. "do-while", "while", "for", "repeat-until" => "loop"
 2. conditions: e.g. "if", "exception" and "switch" => "condition"
 3. emits: e.g. outputs from map() => reduce() or IO => "emit"
 4. emit data types: e.g. string, number, list (if known)

map(key, value):                   reduce(key, values):
 loop # over tokenized value          emit # key = word,
   emit # key=word, val=1 or uri           # value = sum(values) or
                                          # list of URIs
General Mapreduce Advice

Performance
 1. IO/moving data is expensive - use compression and aggr.
 2. Use combiners, i.e. "reducer afterburners" for mappers
 3. Look out for skewness in key distribution, e.g. Zipf's law
 4. Use the right programming language for the task
 5. Balance work between mappers and reducers -
    http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
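
As an illustration of point 2, a combiner can be sketched as a function that aggregates a mapper's output locally before it is sent over the network. The names below are assumptions and no framework is involved; the point is only that fewer pairs leave the mapper.

```python
from collections import defaultdict

def map_word_count(uri, text):
    """Emit (word, 1) per token."""
    for word in text.lower().split():
        yield word, 1

def combine(pairs):
    """Sum counts per word on the mapper side, before the shuffle."""
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return list(totals.items())

if __name__ == "__main__":
    raw = list(map_word_count("doc1", "to be or not to be"))
    combined = combine(raw)
    print(len(raw), len(combined))  # 6 pairs shrink to 4
    print(combined)
```

This works for word count because addition is associative and commutative, so the reducer can sum partial counts exactly as it would sum 1s.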

Cost, Readability & Maintainability
 6. Mapreduce = right tool? (seq./parallel/iterative/realtime)
 7. E.g. Crunch, Pig, Hive instead of full Mapreduce code?
 8. Split job into sequence of mapreduce jobs, e.g. with
    cascading, mrjob etc.
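
Point 8 can be illustrated without any library: the output of one job becomes the input of the next. The run_job() helper below is a hypothetical in-memory stand-in for what tools such as Cascading or mrjob orchestrate; it does not use their APIs.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one map, shuffle and reduce pass over (key, value) records in memory."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output

# Job 1: word count.
def count_map(uri, text):
    for word in text.split():
        yield word, 1

def count_reduce(word, ones):
    yield word, sum(ones)

# Job 2: group words by their count.
def invert_map(word, count):
    yield count, word

def invert_reduce(count, words):
    yield count, sorted(words)

if __name__ == "__main__":
    docs = [("d1", "a b a"), ("d2", "b c a")]
    counts = run_job(docs, count_map, count_reduce)
    print(counts)
    print(run_job(counts, invert_map, invert_reduce))
```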
The End


● Mapreduce Paper Trends (from 2009 => 2012), roughly:
   ○ Increased use of mapreduce jobflows, i.e. more than one
     mapreduce in a sequence and also in various types of
     iterations
       ■ e.g. the Algorithmic Trading earlier
   ○ Increased number of papers published related to
     semantic web (e.g. RDF) and AI reasoning/inference
   ○ Decreased (relative) amount of IR and Ads papers
APPENDIX
   List of Mapreduce and Hadoop
Algorithms in Academic Papers - 6th
      version (partial subset of
        forthcoming blogpost)
AI: Reasoning & Semantic Web

1. Reasoning with Fuzzy-cL+Ontologies Using Mapreduce
2. WebPIE: A Web-scale parallel inference engine using
   Mapreduce
3. Towards Scalable Reasoning over Annotated RDF Data
   Using Mapreduce
4. Reasoning with Large Scale Ontologies in Fuzzy pD* Using
   Mapreduce
5. Scalable RDF Compression with Mapreduce
6. Towards Parallel Nonmonotonic Reasoning with Billions of
   Facts
Biology & Medicine

1. A Mapreduce-based Algorithm for Motif Search
2. A MapReduce Approach for Ridge Regression in
   Neuroimaging Genetic Studies
3. Fractal Mapreduce decomposition of sequence alignment
4. Cloud-enabling Sequence Alignment with Hadoop
   Mapreduce: A Performance Analysis

AI Misc.

A MapReduce based Ant Colony Optimization approach to
combinatorial optimization problems
Machine Learning

1. An efficient Mapreduce Algorithm for Parallelizing Large-
   Scale Graph Clustering
2. Accelerating Bayesian Network Parameter Learning Using
   Hadoop and Mapreduce
3. The Performance Improvements of SPRINT Algorithm
   Based on the Hadoop Platform

Graphs & Graph Theory
4. Large-Scale Graph Biconnectivity in MapReduce
5. Parallel Tree Reduction on MapReduce
Datacubes & Joins
1. Data Cube Materialization and Mining Over Mapreduce
2. Fuzzy joins using Mapreduce
3. Efficient Distributed Parallel Top-Down Computation of
   ROLAP Data Cube Using Mapreduce
4. V-smart-join: A scalable MapReduce Framework for all-pair
   similarity joins of multisets and vectors
5. Data Cube Materialization and Mining over MapReduce

Finance & Business
6. Optimizing Parameters of Algorithm Trading Strategies
   using Mapreduce
7. Using Mapreduce to scale events correlation discovery for
   business processes mining
8. Computational Finance with Map-Reduce in Scala
Mathematics & Statistics

1. GigaTensor: scaling tensor analysis up by 100 times -
   algorithms and discoveries
2. Fast Parallel Algorithms for Blocked Dense Matrix
   Multiplication on Shared Memory Architectures
3. Mr. LDA: A Flexible Large Scale Topic Modelling Package
   using Variational Inference in MapReduce
4. Matrix chain multiplication via multi-way algorithms in
   MapReduce

More Related Content

What's hot

distributed Computing system model
distributed Computing system modeldistributed Computing system model
distributed Computing system model
Harshad Umredkar
 
Dendral
DendralDendral
Dendral
gupta8741
 
PAC Learning
PAC LearningPAC Learning
PAC Learning
Sanghyuk Chun
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
Leila panahi
 
Genetic Algorithms
Genetic AlgorithmsGenetic Algorithms
Genetic Algorithms
Alaa Khamis, PhD, SMIEEE
 
Heuristic Search Techniques {Artificial Intelligence}
Heuristic Search Techniques {Artificial Intelligence}Heuristic Search Techniques {Artificial Intelligence}
Heuristic Search Techniques {Artificial Intelligence}
FellowBuddy.com
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
ZHAO Sam
 
Open Cloud Consortium Overview (01-10-10 V6)
Open Cloud Consortium Overview (01-10-10 V6)Open Cloud Consortium Overview (01-10-10 V6)
Open Cloud Consortium Overview (01-10-10 V6)
Robert Grossman
 
Task programming
Task programmingTask programming
Task programming
Yogendra Tamang
 
predicate logic example
predicate logic examplepredicate logic example
predicate logic example
SHUBHAM KUMAR GUPTA
 
2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts
Prajakta Rane
 
Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron ClassifiersArtificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Mohammed Bennamoun
 
Planning in AI(Partial order planning)
Planning in AI(Partial order planning)Planning in AI(Partial order planning)
Planning in AI(Partial order planning)
Vicky Tyagi
 
Message and Stream Oriented Communication
Message and Stream Oriented CommunicationMessage and Stream Oriented Communication
Message and Stream Oriented Communication
Dilum Bandara
 
I. AO* SEARCH ALGORITHM
I. AO* SEARCH ALGORITHMI. AO* SEARCH ALGORITHM
I. AO* SEARCH ALGORITHM
vikas dhakane
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
sunera pathan
 
Chapter 4 (final)
Chapter 4 (final)Chapter 4 (final)
Chapter 4 (final)
Nateshwar Kamlesh
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
Dabbal Singh Mahara
 

What's hot (20)

distributed Computing system model
distributed Computing system modeldistributed Computing system model
distributed Computing system model
 
Dendral
DendralDendral
Dendral
 
PAC Learning
PAC LearningPAC Learning
PAC Learning
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
Genetic Algorithms
Genetic AlgorithmsGenetic Algorithms
Genetic Algorithms
 
Heuristic Search Techniques {Artificial Intelligence}
Heuristic Search Techniques {Artificial Intelligence}Heuristic Search Techniques {Artificial Intelligence}
Heuristic Search Techniques {Artificial Intelligence}
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
 
Open Cloud Consortium Overview (01-10-10 V6)
Open Cloud Consortium Overview (01-10-10 V6)Open Cloud Consortium Overview (01-10-10 V6)
Open Cloud Consortium Overview (01-10-10 V6)
 
Task programming
Task programmingTask programming
Task programming
 
predicate logic example
predicate logic examplepredicate logic example
predicate logic example
 
2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts
 
Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron ClassifiersArtificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
 
Planning in AI(Partial order planning)
Planning in AI(Partial order planning)Planning in AI(Partial order planning)
Planning in AI(Partial order planning)
 
Message and Stream Oriented Communication
Message and Stream Oriented CommunicationMessage and Stream Oriented Communication
Message and Stream Oriented Communication
 
I. AO* SEARCH ALGORITHM
I. AO* SEARCH ALGORITHMI. AO* SEARCH ALGORITHM
I. AO* SEARCH ALGORITHM
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Chapter 4 (final)
Chapter 4 (final)Chapter 4 (final)
Chapter 4 (final)
 
Mobile hci
Mobile hciMobile hci
Mobile hci
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
 

Viewers also liked

Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
inside-BigData.com
 
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Yahoo Developer Network
 
正直過ぎる家入さん2「そもそも今の知事選は政策が無意味化してない?」20140205e
正直過ぎる家入さん2「そもそも今の知事選は政策が無意味化してない?」20140205e正直過ぎる家入さん2「そもそも今の知事選は政策が無意味化してない?」20140205e
正直過ぎる家入さん2「そもそも今の知事選は政策が無意味化してない?」20140205e
眞之介 shinnosuke 広瀬 hirose
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
Jeff Patti
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
Anju Singh
 
MapReduce in Simple Terms
MapReduce in Simple TermsMapReduce in Simple Terms
MapReduce in Simple Terms
Saliya Ekanayake
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
Donald Miner
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 

Viewers also liked (8)

Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
Design Patterns for Efficient Graph Algorithms in MapReduce__HadoopSummit2010
 
正直過ぎる家入さん2「そもそも今の知事選は政策が無意味化してない?」20140205e
正直過ぎる家入さん2「そもそも今の知事選は政策が無意味化してない?」20140205e正直過ぎる家入さん2「そもそも今の知事選は政策が無意味化してない?」20140205e
正直過ぎる家入さん2「そもそも今の知事選は政策が無意味化してない?」20140205e
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
MapReduce in Simple Terms
MapReduce in Simple TermsMapReduce in Simple Terms
MapReduce in Simple Terms
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 

Similar to Mapreduce Algorithms

Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
Martin Dvorak
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
Amund Tveit
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 
Easy R
Easy REasy R
Easy R
Ajay Ohri
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
Siddharth Mathur
 
R programming slides
R  programming slidesR  programming slides
R programming slides
Pankaj Saini
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
Siddharth Mathur
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
Subhas Kumar Ghosh
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
PoguttuezhiniVP
 
04 pig data operations
04 pig data operations04 pig data operations
04 pig data operations
Subhas Kumar Ghosh
 
Article link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docxArticle link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docx
fredharris32
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
Sovello Hildebrand
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
Intel® Software
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
Mao Geng
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti Patterns
Gleicon Moraes
 
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
Piotr Przymus
 
e_lumley.pdf
e_lumley.pdfe_lumley.pdf
e_lumley.pdf
betsegaw123
 
Gráficas en python
Gráficas en python Gráficas en python
Gráficas en python
Jhon Valle
 

Similar to Mapreduce Algorithms (20)

Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
Easy R
Easy REasy R
Easy R
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
R programming slides
R  programming slidesR  programming slides
R programming slides
 
Apache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathurApache pig presentation_siddharth_mathur
Apache pig presentation_siddharth_mathur
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
 
04 pig data operations
04 pig data operations04 pig data operations
04 pig data operations
 
Article link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docxArticle link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docx
 
Lect1.pptx
Lect1.pptxLect1.pptx
Lect1.pptx
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti Patterns
 
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
Everything You Always Wanted to Know About Memory in Python But Were Afraid t...
 
e_lumley.pdf
e_lumley.pdfe_lumley.pdf
e_lumley.pdf
 
Gráficas en python
Gráficas en python Gráficas en python
Gráficas en python
 

Recently uploaded

How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 

Mapreduce Algorithms

  • 1. Mapreduce Algorithms O'Reilly Strata Conference, London UK, October 1st 2012 Amund Tveit amund@atbrox.com - twitter.com/atveit http://atbrox.com/about/ - twitter.com/atbrox
  • 2. Background
    ● Been blogging about Mapreduce Algorithms in Academic Papers since Oct 2009 (1st Hadoop World)
      1. http://atbrox.com/2009/10/01/mapreduce-and-hadoop-academic-papers/
      2. http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/
      3. http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/
      4. http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/
      5. http://atbrox.com/2011/11/09/mapreduce-hadoop-algorithms-in-academic-papers-5th-update-%E2%80%93-nov-2011/
    ● Atbrox works on IR-related Hadoop and cloud projects
    ● My prior experience: Google (software infrastructure and mobile news), PhD in Computer Science
  • 3. TOC
    1. Brief introduction to Mapreduce Algorithms
    2. Overview of a few Recent Mapreduce Algorithms in Papers
    3. In-Depth look at a Mapreduce Algorithm
    4. Recommendations for Designing Mapreduce Algorithms
    5. Appendix - 6th (partial) list of Mapreduce and Hadoop Algorithms in Academic papers
  • 4. 1. Brief Introduction to Mapreduce Algorithms
  • 5. 1.1 So What is Mapreduce?
    Mapreduce is a concept, method and software for (typically batch-based) large-scale parallelization. It is inspired by functional programming's map() and reduce() functions.
    Nice features of mapreduce systems include:
    ● reliably processing jobs even when machines die (vs MPI, BSP)
    ● parallelization, e.g. thousands of machines for terasort and petasort
    Mapreduce was invented by the Google fellows Jeff Dean and Sanjay Ghemawat
  • 6. 1.2 Mapper function
    Processes one key and value pair at a time, e.g.
    ● word count
      ○ map(key: uri, value: text):
        ■ for word in tokenize(value)
        ■ emit(word, 1) # found 1 occurrence of word
    ● inverted index
      ○ map(key: uri, value: text):
        ■ for word in tokenize(value)
        ■ emit(word, key) # word and uri pair
  • 7. 1.3 Reducer function
    A reducer processes one key and all values that belong to it (as received and aggregated from the map function), e.g.
    ● word count
      ○ reduce(key: word type, value: list of 1s):
        ■ emit(key, sum(value))
    ● inverted index
      ○ reduce(key: word type, value: list of URIs):
        ■ # perhaps transformation of value, e.g. encoding
        ■ emit(key, value) # e.g. to a distr. hash table
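The word count and inverted index pseudocode on slides 6 and 7 can be tried out without a cluster. The following is a minimal single-process sketch, not a real Mapreduce framework: `run_mapreduce`, the whitespace `tokenize` and the example URIs are assumptions made for illustration only.

```python
from collections import defaultdict

def tokenize(text):
    # Naive whitespace tokenizer (assumption: the slides do not define tokenize()).
    return text.lower().split()

def wordcount_map(uri, text):
    for word in tokenize(text):
        yield word, 1  # found 1 occurrence of word

def wordcount_reduce(word, counts):
    yield word, sum(counts)

def index_map(uri, text):
    for word in tokenize(text):
        yield word, uri  # word and uri pair

def index_reduce(word, uris):
    # Deduplicate and sort the URIs; a real job might also encode or compress them.
    yield word, sorted(set(uris))

def run_mapreduce(records, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

docs = [
    ("http://a.example/1", "to be or not to be"),
    ("http://a.example/2", "to map and to reduce"),
]
word_counts = run_mapreduce(docs, wordcount_map, wordcount_reduce)
inverted_index = run_mapreduce(docs, index_map, index_reduce)
```

The grouping step inside `run_mapreduce` plays the role of the shuffle between mappers and reducers; in a real system it is distributed and sorted by the framework.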
  • 8. 1.4 Mapreduce Pan Patterns
  • 9. 1.4 Pattern 1 - Data Reduction ● Word Count ● Machine Learning (e.g. training models) ● Probably the most common way of using mapreduce
  • 10. 1.5 Pattern 2 - Transformation ● Sorting (e.g. Terasort and Petasort)
  • 11. 1.6 Pattern 3 - Data Increase ● Decompression ● Annotation, e.g. traditional indexing pipeline
  • 12. 2. Examples of recently published use and development of Mapreduce Algorithms
  • 13. 2.1 Machine Learning - ILP
    ● Problem: Automatically find (induce) rules from examples and a knowledge base
    ● Paper:
      ○ Data and Task Parallelism in ILP using Mapreduce (IBM Research India et al.)
    This follows Pan Pattern 1 - Data Reduction - the output is a set of rules from a (typically larger) set of examples and a knowledge base
  • 14. 2.1 Machine Learning - ILP - II [Figures not extracted: example input and example result]
  • 15. 2.2 Finance - Trading
    Problem: Optimize Algorithmic Trading
    Paper:
      ○ Optimizing Parameters of Algorithm Trading Strategies using Mapreduce (EMC-Greenplum Research China et al.)
    This follows Pan Pattern 1 - Data Reduction - the output is the set of best parameter sets for algorithmic trading. Note that during the map phase there is an increase in data, i.e. creation of permutations of possible parameters
  • 16. 2.3 Software Engineering
    Problem: Automatically generate unit test code to increase test coverage and offload developers
    Paper:
      ○ A Parallel Genetic Algorithm Based on Hadoop Mapreduce for the Automatic Generation of JUnit Test Suites (University of Salerno, Italy)
    This (probably) follows Pan Patterns 1, 2 and 3: presumably there is a fixed number of chromosomes (i.e. transformation), a collection of unit tests is being evolved, and the combined length of the evolved unit tests might increase or decrease compared to the original input.
  • 17. 2.3 Software Engineering - II Figure from "EvoTest: Test Case Generation using Genetic Programming and Software Analysis"
  • 18. 3. In-Depth look at a Mapreduce Algorithm
  • 19. 3.1 The Challenge
    ● Task:
      ○ Build a low-latency key-value store for disk or SSD
    ● Features:
      ○ Low startup time
        ■ i.e. no/little pre-loading of (large) caches to memory
      ○ Prefix-search
        ■ i.e. support searching for all prefixes of a key as well as the entire key
      ○ Low-latency
        ■ i.e. reduce the number of disk/SSD seeks, e.g. by increasing the probability of disk cache hits
      ○ Static/Immutable data - write once, read many
  • 20. 3.2 A few Possible Ways
    1. Binary Search or Interpolation Search within a file of sorted keys and then look up value ~ lg(N) or lg(lg(N))
    2. Prefix-algorithms mapped to file, e.g.
       1. Trie
       2. Ternary search tree
       3. Patricia Tree ~ O(k)
  • 21. 3.3 Overall Approach
    1. Scale - divide key,value data into shards
    2. Build patricia tree per shard and store all key, values for later
    3. Prepare trees to have placeholder (short) value for each key
    4. Flatten each patricia tree to a disk-friendly and byte-aligned format fit for random access
    5. Recalculate file addresses in each patricia tree to be able to store the actual values
    6. Create final patricia tree with values on disk
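For comparison with the tree-based options, the first option on slide 20 (binary search over sorted keys) can be sketched with Python's standard `bisect` module. This is an illustration under assumptions: an in-memory list stands in for the sorted key file, "prefix search" is read as "find all keys that start with a given prefix", and the keys and values are borrowed from the example file format shown later in the deck.

```python
from bisect import bisect_left

def exact_lookup(sorted_keys, values, key):
    """Binary search for key; returns its value or None. About lg(N) comparisons."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return values[i]
    return None

def prefix_lookup(sorted_keys, prefix):
    """All keys starting with prefix: binary search to the first candidate, then scan."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches

keys = ["romane", "romanus", "romulus", "rubens", "ruber", "rubicon", "rubicundus"]
values = ["one", "two", "three", "four", "five", "six", "seven"]

ruber_value = exact_lookup(keys, values, "ruber")
rub_matches = prefix_lookup(keys, "rub")
```

On disk, each comparison in the binary search can cost a seek, which is one motivation for the prefix-tree layouts that the rest of this section builds.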
  • 22. 3.4 Split data with mapper
    1. Scale - divide key,value data into shards
    map(key, value):
      # e.g. simple - hash(first char), or use a classifier
      # personalization etc.
      shard_key = shard_function(key, value)
      out_value = (key,value)
      emit shard_key, out_value
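A runnable sketch of the sharding mapper on slide 22, using the "hash(first char)" suggestion from its comment. `NUM_SHARDS` and the choice of `zlib.crc32` as the hash are assumptions for illustration; the slides do not specify either.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_function(key, value, num_shards=NUM_SHARDS):
    """Shard on the first character so keys sharing a prefix land in the same shard.

    crc32 is used instead of the built-in hash() because hash() of a str
    varies between Python processes, which would break routing across machines.
    """
    return zlib.crc32(key[:1].encode("utf-8")) % num_shards

def shard_map(key, value):
    # Emit the shard id as the new key and the original pair as the value.
    yield shard_function(key, value), (key, value)

pairs = [("romane", "one"), ("rubens", "four"), ("six", "6")]
routed = [record for key, value in pairs for record in shard_map(key, value)]
```

Sharding on the first character keeps prefix queries local to one shard, at the cost of uneven shard sizes when the first characters are skewed.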
  • 23. 3.5 Init and run reduce()
    2. Build one patricia tree per (reduce) shard
    reduce_init(): # called once per reducer before it starts
      self.patricia = Patricia()
      self.tempkeyvaluestore = TempKeyValueStore()
    reduce(shard_key, list of key_value pairs):
      for (key, value) in list of key_value pairs:
        self.tempkeyvaluestore[key] = value
  • 24. 3.6 Reducer cont.
    3. Prepare trees to have placeholder values (=key) for each key
    reduce_final(): # called once per reducer after all reduce()
      for key, value in self.tempkeyvaluestore:
        self.patricia.add(key, key) # key == value for now
  • 25. 3.7 Flatten patricia tree for disk
    4. Flatten each patricia tree to a disk-friendly and byte-aligned format fit for random access
    reduce_final(): # continued from 3.
      # num 0s below constrains addressable size of shard file
      self.firstblockaddress = "00000000000000"
      # create mapping from dict of dicts to a linear file
      self.flatten_patricia(self.patricia, parent=self.firstblockaddress)
      # self.recalculate_patricia_tree_for_actual_values()
      self.output_patricia_tree_with_actual_values()
  • 26. 3.8 Mapreduce Approach - 5
    Generated file format below, and corresponding patricia tree to the right
    00000000038["", {"r": "00000000038"}]
    00000000060["", {"om": "00000000098", "ub": "00000000290"}]
    00000000062["", {"ulus": "00000000265", "an": "00000000160"}]
    00000000059["", {"e": "00000000219", "us": "00000000242"}]
    00000000023["one", ""]
    00000000023["two", ""]
    00000000025["three", ""]
    00000000059["", {"ic": "00000000456", "e": "00000000349"}]
    00000000059["", {"r": "00000000432", "ns": "00000000408"}]
    00000000024["four", ""]
    00000000024["five", ""]
    00000000063["", {"on": "00000000519", "undus": "00000000542"}]
    00000000023["six", ""]
    00000000025["seven", ""]
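The records on slide 26 appear to consist of an 11-digit total record length (including the trailing newline), followed by a JSON pair of value and a map from edge label to the child's file offset. Under that reading, the flattening step can be sketched as below. This is an illustration, not the author's implementation: the `node` helper, the hand-built tree and the depth-first layout order are assumptions, and the tree is written out directly instead of being built by a Patricia insert routine.

```python
import json

ADDRESS_WIDTH = 11  # digits per address, as in the records above; caps the shard file size

def node(value="", **children):
    """Hypothetical in-memory tree node: a value plus labelled edges to child nodes."""
    return {"value": value, "children": children}

def make_record(value, child_addresses):
    """One line: total record length, then JSON [value, {edge label: child address}]."""
    body = json.dumps([value, child_addresses or ""]) + "\n"
    return f"{ADDRESS_WIDTH + len(body):0{ADDRESS_WIDTH}d}{body}"

def record_size(n):
    # Addresses have a fixed width, so sizes are known before addresses are assigned.
    placeholders = {label: "0" * ADDRESS_WIDTH for label in n["children"]}
    return len(make_record(n["value"], placeholders))

def layout(n, address, addresses):
    """Assign file offsets in depth-first pre-order; return the next free offset."""
    addresses[id(n)] = address
    next_free = address + record_size(n)
    for child in n["children"].values():
        next_free = layout(child, next_free, addresses)
    return next_free

def flatten(n, addresses, out):
    """Append records in the same pre-order that layout() used for the offsets."""
    child_addresses = {label: f"{addresses[id(child)]:0{ADDRESS_WIDTH}d}"
                       for label, child in n["children"].items()}
    out.append(make_record(n["value"], child_addresses))
    for child in n["children"].values():
        flatten(child, addresses, out)

def lookup(flat, key, address=0):
    """Follow edge labels from the root record; one record read per tree level."""
    while True:
        length = int(flat[address:address + ADDRESS_WIDTH])
        value, children = json.loads(flat[address + ADDRESS_WIDTH:address + length])
        if not key:
            return value or None
        for label, child_address in (children or {}).items():
            if key.startswith(label):
                key, address = key[len(label):], int(child_address)
                break
        else:
            return None  # no edge matches the rest of the key

tree = node(r=node(
    om=node(an=node(e=node("one"), us=node("two")), ulus=node("three")),
    ub=node(e=node(ns=node("four"), r=node("five")),
            ic=node(on=node("six"), undus=node("seven")))))

addresses, records = {}, []
layout(tree, 0, addresses)
flatten(tree, addresses, records)
flat = "".join(records)
```

Because `json.dumps` escapes non-ASCII characters by default, character offsets in `flat` equal byte offsets, so the same addresses would hold if the string were written to a file and read back with seeks.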
  • 27. 4. Recommendations for Designing Mapreduce Algorithms
  • 28. Mapreduce Patterns
    Map() and Reduce() methods typically follow patterns. A recommended way of representing such patterns is to extract and generalize code skeleton fingerprints based on:
    1. loops: e.g. "do-while", "while", "for", "repeat-until" => "loop"
    2. conditions: e.g. "if", "exception" and "switch" => "condition"
    3. emits: e.g. outputs from map() => reduce() or IO => "emit"
    4. emit data types: e.g. string, number, list (if known)
    map(key, value):
      loop # over tokenized value
        emit # key=word, val=1 or uri
    reduce(key, values):
      emit # key = word,
           # value = sum(values) or
           # list of URIs
  • 29. General Mapreduce Advice
    Performance
    1. IO/moving data is expensive - use compression and aggregation
    2. Use combiners, i.e. "reducer afterburners" for mappers
    3. Look out for skewness in key distribution, e.g. Zipf's law
    4. Use the right programming language for the task
    5. Balance work between mappers and reducers - http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
    Cost, Readability & Maintainability
    6. Mapreduce = right tool? (seq./parallel/iterative/realtime)
    7. E.g. Crunch, Pig, Hive instead of full Mapreduce code?
    8. Split job into sequence of mapreduce jobs, e.g. with cascading, mrjob etc.
  • 30. The End
    ● Mapreduce Paper Trends (from 2009 => 2012), roughly:
      ○ Increased use of mapreduce jobflows, i.e. more than one mapreduce in a sequence and also in various types of iterations
        ■ e.g. the Algorithmic Trading paper earlier
      ○ Increased number of papers published related to semantic web (e.g. RDF) and AI reasoning/inference
      ○ Decreased (relative) number of IR and Ads papers
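One possible way to compute the skeleton fingerprints described on slide 28, sketched for mappers and reducers written in Python using the standard `ast` module. The class name, the assumption that output goes through a function literally called `emit`, and the choice to ignore emit data types are all illustrative; the slides do not prescribe a tool.

```python
import ast
import textwrap

class SkeletonFingerprint(ast.NodeVisitor):
    """Collects "loop", "condition" and "emit" tokens in source order."""

    def __init__(self):
        self.tokens = []

    def visit_For(self, node):
        self.tokens.append("loop")
        self.generic_visit(node)

    visit_While = visit_For  # every loop kind generalizes to "loop"

    def visit_If(self, node):
        self.tokens.append("condition")
        self.generic_visit(node)

    visit_Try = visit_If  # exception handling counts as a condition

    def visit_Call(self, node):
        # Assumption: output is produced by calling a function named emit().
        if isinstance(node.func, ast.Name) and node.func.id == "emit":
            self.tokens.append("emit")
        self.generic_visit(node)

def fingerprint(source):
    visitor = SkeletonFingerprint()
    visitor.visit(ast.parse(textwrap.dedent(source)))
    return visitor.tokens

MAP_SOURCE = """
def map(key, value):
    for word in tokenize(value):
        emit(word, 1)
"""

REDUCE_SOURCE = """
def reduce(key, values):
    emit(key, sum(values))
"""

map_fingerprint = fingerprint(MAP_SOURCE)
reduce_fingerprint = fingerprint(REDUCE_SOURCE)
```

Word count and inverted index share the same fingerprints here, which is the point of the generalization: the skeleton captures the pattern, not the payload.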
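To illustrate the combiner advice on slide 29, the sketch below counts how many key-value pairs would cross the shuffle with and without local pre-aggregation for word count. It is a single-process illustration under assumptions: each document is treated as one mapper's input, and `combine` and `shuffle_and_reduce` are hypothetical helper names, not a framework API.

```python
from collections import Counter, defaultdict

def map_wordcount(uri, text):
    for word in text.lower().split():
        yield word, 1

def combine(pairs):
    """Pre-aggregate one mapper's output locally.

    This is safe for word count because addition is associative and commutative,
    so summing partial sums gives the same total as summing the individual 1s.
    """
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def shuffle_and_reduce(mapper_outputs):
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for word, count in pairs:
            groups[word].append(count)
    return {word: sum(counts) for word, counts in groups.items()}

docs = [
    ("http://a.example/1", "to be or not to be"),
    ("http://a.example/2", "to map and to reduce"),
]
plain = [list(map_wordcount(uri, text)) for uri, text in docs]
combined = [combine(pairs) for pairs in plain]

shuffled_plain = sum(len(pairs) for pairs in plain)
shuffled_combined = sum(len(pairs) for pairs in combined)
```

The saving grows with repetition inside each mapper's input, so skewed key distributions (the Zipf's law point on the same slide) are where combiners tend to help most.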
  • 31. APPENDIX List of Mapreduce and Hadoop Algorithms in Academic Papers - 6th version (partial subset of forthcoming blogpost)
  • 32. AI: Reasoning & Semantic Web
    1. Reasoning with Fuzzy-cL+Ontologies Using Mapreduce
    2. WebPIE: A Web-scale parallel inference engine using Mapreduce
    3. Towards Scalable Reasoning over Annotated RDF Data Using Mapreduce
    4. Reasoning with Large Scale Ontologies in Fuzzy pD* Using Mapreduce
    5. Scalable RDF Compression with Mapreduce
    6. Towards Parallel Nonmonotonic Reasoning with Billions of Facts
  • 33. Biology & Medicine
    1. A Mapreduce-based Algorithm for Motif Search
    2. A MapReduce Approach for Ridge Regression in Neuroimaging Genetic Studies
    3. Fractal Mapreduce decomposition of sequence alignment
    4. Cloud-enabling Sequence Alignment with Hadoop Mapreduce: A Performance Analysis
    AI Misc.
    A MapReduce based Ant Colony Optimization approach to combinatorial optimization problems
  • 34. Machine Learning
    1. An efficient Mapreduce Algorithm for Parallelizing Large-Scale Graph Clustering
    2. Accelerating Bayesian Network Parameter Learning Using Hadoop and Mapreduce
    3. The Performance Improvements of SPRINT Algorithm Based on the Hadoop Platform
    Graphs & Graph Theory
    4. Large-Scale Graph Biconnectivity in MapReduce
    5. Parallel Tree Reduction on MapReduce
  • 35. Datacubes & Joins
    1. Data Cube Materialization and Mining Over Mapreduce
    2. Fuzzy joins using Mapreduce
    3. Efficient Distributed Parallel Top-Down Computation of ROLAP Data Cube Using Mapreduce
    4. V-smart-join: A scalable MapReduce Framework for all-pair similarity joins of multisets and vectors
    5. Data Cube Materialization and Mining over MapReduce
    Finance & Business
    6. Optimizing Parameters of Algorithm Trading Strategies using Mapreduce
    7. Using Mapreduce to scale events correlation discovery for business processes mining
    8. Computational Finance with Map-Reduce in Scala
  • 36. Mathematics & Statistics
    1. GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries
    2. Fast Parallel Algorithms for Blocked Dense Matrix Multiplication on Shared Memory Architectures
    3. Mr. LDA: A Flexible Large Scale Topic Modelling Package using Variational Inference in MapReduce
    4. Matrix chain multiplication via multi-way algorithms in MapReduce