This document provides a summary of MapReduce algorithms. It begins with background on the author's experience blogging about MapReduce algorithms in academic papers. It then provides an overview of MapReduce concepts including the mapper and reducer functions. Several examples of recently published MapReduce algorithms are described for tasks like machine learning, finance, and software engineering. One algorithm is examined in depth for building a low-latency key-value store. Finally, recommendations are provided for designing MapReduce algorithms including patterns, performance, and cost/maintainability considerations. An appendix lists additional MapReduce algorithms from academic papers in areas such as AI, biology, machine learning, and mathematics.
Mapreduce Algorithms
1. Mapreduce Algorithms
O'Reilly Strata Conference,
London UK, October 1st 2012
Amund Tveit
amund@atbrox.com - twitter.com/atveit
http://atbrox.com/about/ - twitter.com/atbrox
2. Background
● Been blogging about Mapreduce Algorithms in Academic Papers since Oct 2009 (1st Hadoop World)
1. http://atbrox.com/2009/10/01/mapreduce-and-hadoop-academic-papers/
2. http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/
3. http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/
4. http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/
5. http://atbrox.com/2011/11/09/mapreduce-hadoop-algorithms-in-academic-papers-5th-update-%E2%80%93-nov-2011/
● Atbrox works on IR-related Hadoop and cloud projects
● My prior experience: Google (software infrastructure and
mobile news), PhD in Computer Science
3. TOC
1. Brief introduction to Mapreduce Algorithms
2. Overview of a few Recent Mapreduce Algorithms in Papers
3. In-Depth look at a Mapreduce Algorithm
4. Recommendations for Designing Mapreduce Algorithms
5. Appendix - 6th (partial) list of Mapreduce and Hadoop Algorithms in Academic papers
5. 1.1 So What is Mapreduce?
Mapreduce is a concept, method and software for typically batch-based large-scale parallelization. It is inspired by functional programming's map() and reduce() functions.
Nice features of mapreduce systems include:
● reliably processing jobs even when machines die (vs MPI, BSP)
● parallelization, e.g. thousands of machines for terasort and petasort
Mapreduce was invented by Google Fellows Jeff Dean and Sanjay Ghemawat.
6. 1.2 Mapper function
Processes one key and value pair at a time, e.g.
● word count
○ map(key: uri, value: text):
■ for word in tokenize(value)
■ emit(word, 1) # found 1 occurrence of word
● inverted index
○ map(key: uri, value: text):
■ for word in tokenize(value)
■ emit(word, key) # word and uri pair
7. 1.3 Reducer function
A reducer processes one key and all values that belong to it (as received and aggregated from the map function), e.g.
● word count
○ reduce(key: word type, value: list of 1s):
■ emit(key, sum(value))
● inverted index
○ reduce(key: word type, value: list of URIs):
■ # perhaps transformation of value, e.g. encoding
■ emit(key, value) # e.g. to a distr. hash table
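The word-count mapper and reducer above can be run end to end in plain Python by simulating the shuffle step the framework performs between map() and reduce(). This is an illustrative sketch, not Hadoop code; function names and the toy documents are made up for the example.

```python
from collections import defaultdict

def map_word_count(uri, text):
    # emit (word, 1) for each token in the document
    for word in text.split():
        yield word, 1

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_word_count(word, counts):
    yield word, sum(counts)

docs = {"doc1": "to be or not to be", "doc2": "to do"}
pairs = [kv for uri, text in docs.items() for kv in map_word_count(uri, text)]
result = dict(kv for key, values in shuffle(pairs)
              for kv in reduce_word_count(key, values))
print(result)  # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```

The same shuffle-then-reduce shape also gives the inverted index: replace `yield word, 1` with `yield word, uri` and skip the sum in the reducer.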
11. 1.6 Pattern 3 - Data Increase
● Decompression
● Annotation, e.g. traditional indexing pipeline
12. 2. Examples of recently published use and development of Mapreduce Algorithms
13. 2.1 Machine Learning - ILP
● Problem: Automatically find (induce) rules from examples and a knowledge base
● Paper:
○ Data and Task Parallelism in ILP using Mapreduce (IBM Research India et al.)
This follows Pattern 1 - Data Reduction: output is a set of rules from a (typically larger) set of examples and knowledge base.
15. 2.2 Finance - Trading
Problem: Optimize Algorithmic Trading
Paper:
○ Optimizing Parameters of Algorithm Trading Strategies using Mapreduce (EMC-Greenplum Research China et al.)
This follows Pattern 1 - Data Reduction: output is the set of best parameter sets for algorithmic trading. Note that during the map phase there is an increase in data, i.e. creation of permutations of possible parameters.
16. 2.3 Software Engineering
Problem: Automatically generate unit test code to increase test
coverage and offload developers
Paper:
○ A Parallel Genetic Algorithm Based on Hadoop Mapreduce for the Automatic Generation of JUnit Test Suites (University of Salerno, Italy)
This (probably) follows Patterns 1, 2 and 3, i.e. - presumably - a fixed amount of chromosomes (i.e. transformation), a collection of unit tests being evolved, and a combined length of the evolved unit tests that might increase or decrease compared to the original input.
17. 2.3 Software Engineering - II
Figure from "EvoTest: Test Case Generation using
Genetic Programming and Software Analysis"
19. 3.1 The Challenge
● Task:
○ Build a low-latency key-value store for disk or SSD
● Features:
○ Low startup time
■ i.e. no/little pre-loading of (large) caches to memory
○ Prefix-search
■ i.e. support searching for both all prefixes of a key as
well as the entire key
○ Low-latency
■ i.e. reduce number of disk/SSD seeks, e.g. by increasing the probability of disk cache hits
○ Static/Immutable data - write once, read many
20. 3.2 A few Possible Ways
1. Binary Search or Interpolation Search within a file of sorted keys and then look up value
~ lg(N) or lg(lg(N))
2. Prefix-algorithms mapped to file, e.g.
1. Trie
2. Ternary search tree
3. Patricia Tree
~ O(k)
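Option 1 above (binary search within a file of sorted keys) can be sketched with Python's bisect module. The in-memory key and value lists here are hypothetical stand-ins for the sorted on-disk file.

```python
import bisect

# hypothetical stand-in for a sorted key file plus its values
keys = ["apple", "banana", "cherry", "grape", "melon"]
values = ["red", "yellow", "dark red", "purple", "green"]

def lookup(key):
    # binary search: ~lg(N) comparisons over the sorted keys
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

print(lookup("cherry"))  # dark red
print(lookup("kiwi"))    # None
```

Note this gives exact-match lookup only; the prefix-search feature from slide 19 is what motivates the trie/Patricia-tree alternatives, whose lookup cost is O(k) in the key length rather than lg(N) in the number of keys.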
21. 3.3 Overall Approach
1. Scale - divide key,value data into shards
2. Build a patricia tree per shard and store all key,values for later
3. Prepare trees to have a placeholder (short) value for each key
4. Flatten each patricia tree to a disk-friendly and byte-aligned format fit for random access
5. Recalculate file addresses in each patricia tree to be able to store the actual values
6. Create final patricia tree with values on disk
22. 3.4 Split data with mapper
1. Scale - divide key,value data into shards
map(key, value):
# e.g. simple - hash(first char), or use a classifier
# personalization etc.
shard_key = shard_function(key, value)
out_value = (key,value)
emit shard_key, out_value
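A minimal runnable version of this mapper, using the "hash(first char)" option the pseudocode's comment mentions; shard_function's signature and the shard count are assumptions for the sketch.

```python
def shard_function(key, value, num_shards=4):
    # simple option from the slide: hash the first character of the key
    # (a classifier or personalization signal could be swapped in here)
    return hash(key[0]) % num_shards

def map_shard(key, value):
    shard_key = shard_function(key, value)
    yield shard_key, (key, value)

pairs = list(map_shard("apple", "red")) + list(map_shard("avocado", "green"))
# both keys start with 'a', so they land in the same shard
assert pairs[0][0] == pairs[1][0]
```

Hashing only the first character keeps keys with a shared prefix in the same shard, which matters here because each shard later becomes one prefix-searchable Patricia tree; the cost is potentially skewed shard sizes (see the skewness advice in the general recommendations).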
23. 3.5 Init and run reduce()
2. Build one patricia tree per (reduce) shard
reduce_init(): # called once per reducer before it starts
self.patricia = Patricia()
self.tempkeyvaluestore = TempKeyValueStore()
reducer(shard_key, list of key_value pairs):
for (key, value) in list of key_value pairs:
self.tempkeyvaluestore[key] = value
24. 3.6 Reducer cont.
3. Prepare trees to have placeholder values (=key) for each key
reduce_final(): # called once per reducer after all reduce()
for key, value in self.tempkeyvaluestore:
self.patricia.add(key, key) # key == value for now
25. 3.7 Flatten patricia tree for disk
4. Flatten each patricia tree to a disk-friendly and byte-aligned format fit for random access
reduce_final(): # continued from 3.
# num 0s below constrains addressable size of shard file
self.firstblockaddress = "00000000000000"
# create mapping from dict of dicts to a linear file
self.flatten_patricia(self.patricia, parent=self.firstblockaddress)
#
self.recalculate_patricia_tree_for_actual_values()
self.output_patricia_tree_with_actual_values()
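The two-pass idea in steps 3-5 (reserve placeholder addresses first, fill in real addresses once child positions are known) can be illustrated on a plain trie of nested dicts. This is a toy that flattens to a Python list rather than a byte-aligned file; the structure and all names ("$value" marker, record layout) are assumptions for the sketch, not the deck's actual Patricia-tree format.

```python
def build_trie(items):
    # plain character trie as nested dicts; "$value" is a hypothetical leaf marker
    root = {}
    for key, value in items:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node["$value"] = value
    return root

def flatten(node, records):
    # pass 1: reserve this node's slot with a placeholder
    # (like the "00000000000000" first block address on the slide)
    address = len(records)
    records.append(None)
    # recurse into children, collecting their final addresses
    children = {}
    for ch, child in node.items():
        if ch != "$value":
            children[ch] = flatten(child, records)
    # pass 2: overwrite the placeholder with the resolved record
    records[address] = {"children": children, "value": node.get("$value")}
    return address

records = []
root_addr = flatten(build_trie([("to", 1), ("tea", 2)]), records)
# walking the flat records reproduces a lookup without the nested dicts
node = records[root_addr]
for ch in "tea":
    node = records[node["children"][ch]]
print(node["value"])  # 2
```

In the real layout the list indices would be byte offsets into the shard file, which is why the slide constrains the address width and recalculates addresses once actual (variable-length) values replace the placeholders.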
28. Mapreduce Patterns
Map() and Reduce() methods typically follow patterns. A recommended way of representing such patterns is to extract and generalize code skeleton fingerprints based on:
1. loops: e.g. "do-while", "while", "for", "repeat-until" => "loop"
2. conditions: e.g. "if", "exception" and "switch" => "condition"
3. emits: e.g. outputs from map() => reduce() or IO => "emit"
4. emit data types: e.g. string, number, list (if known)
map(key, value):
    loop # over tokenized value
        emit # key=word, val=1 or uri

reduce(key, values):
    emit # key=word, value=sum(values) or list of URIs
29. General Mapreduce Advice
Performance
1. IO/moving data is expensive - use compression and aggregation
2. Use combiners, i.e. "reducer afterburners" for mappers
3. Look out for skewness in key distribution, e.g. Zipf's law
4. Use the right programming language for the task
5. Balance work between mappers and reducers - http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
Cost, Readability & Maintainability
6. Mapreduce = right tool? (seq./parallel/iterative/realtime)
7. E.g. Crunch, Pig, Hive instead of full Mapreduce code?
8. Split job into sequence of mapreduce jobs, e.g. with cascading, mrjob etc.
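Performance advice 1 and 2 above can be combined in an "in-mapper combining" sketch: aggregate locally inside the mapper so far fewer pairs cross the network. This is illustrative plain Python, not the Hadoop or mrjob combiner API.

```python
from collections import Counter

def map_with_combiner(uri, text):
    # local aggregation inside the mapper, acting as the combiner
    counts = Counter(text.split())
    for word, count in counts.items():
        yield word, count  # emit partial sums, not one pair per token

emitted = list(map_with_combiner("doc1", "to be or not to be"))
# 6 input tokens collapse to 4 emitted pairs
print(sorted(emitted))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The reducer is unchanged (it still sums its values); the combiner only reduces shuffle volume. With a skewed, Zipf-like key distribution this helps most, since the most frequent words shrink the most.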
30. The End
● Mapreduce Paper Trends (from 2009 => 2012), roughly:
○ Increased use of mapreduce jobflows, i.e. more than one mapreduce in a sequence and also in various types of iterations
■ e.g. the Algorithmic Trading earlier
○ Increased amount of papers published related to semantic web (e.g. RDF) and AI reasoning/inference
○ Decreased (relative) amount of IR and Ads papers
31. APPENDIX
List of Mapreduce and Hadoop Algorithms in Academic Papers - 6th version (partial subset of forthcoming blogpost)
32. AI: Reasoning & Semantic Web
1. Reasoning with Fuzzy EL+ Ontologies Using Mapreduce
2. WebPIE: A Web-scale parallel inference engine using Mapreduce
3. Towards Scalable Reasoning over Annotated RDF Data Using Mapreduce
4. Reasoning with Large Scale Ontologies in Fuzzy pD* Using Mapreduce
5. Scalable RDF Compression with Mapreduce
6. Towards Parallel Nonmonotonic Reasoning with Billions of Facts
33. Biology & Medicine
1. A Mapreduce-based Algorithm for Motif Search
2. A MapReduce Approach for Ridge Regression in Neuroimaging Genetic Studies
3. Fractal Mapreduce decomposition of sequence alignment
4. Cloud-enabling Sequence Alignment with Hadoop Mapreduce: A Performance Analysis
AI Misc.
A MapReduce based Ant Colony Optimization approach to combinatorial optimization problems
34. Machine Learning
1. An efficient Mapreduce Algorithm for Parallelizing Large-Scale Graph Clustering
2. Accelerating Bayesian Network Parameter Learning Using Hadoop and Mapreduce
3. The Performance Improvements of SPRINT Algorithm Based on the Hadoop Platform
Graphs & Graph Theory
4. Large-Scale Graph Biconnectivity in MapReduce
5. Parallel Tree Reduction on MapReduce
35. Datacubes & Joins
1. Data Cube Materialization and Mining Over Mapreduce
2. Fuzzy joins using Mapreduce
3. Efficient Distributed Parallel Top-Down Computation of ROLAP Data Cube Using Mapreduce
4. V-smart-join: A scalable MapReduce Framework for all-pair similarity joins of multisets and vectors
Finance & Business
5. Optimizing Parameters of Algorithm Trading Strategies using Mapreduce
6. Using Mapreduce to scale events correlation discovery for business processes mining
7. Computational Finance with Map-Reduce in Scala
36. Mathematics & Statistics
1. GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries
2. Fast Parallel Algorithms for Blocked Dense Matrix Multiplication on Shared Memory Architectures
3. Mr. LDA: A Flexible Large Scale Topic Modelling Package using Variational Inference in MapReduce
4. Matrix chain multiplication via multi-way algorithms in MapReduce