This document introduces MapReduce, a programming model for processing large datasets in parallel. MapReduce uses two functions, Map and Reduce, to distribute work across clusters of computers. It allows processing of very large datasets, like performing a word count on 100 TB of text data, using thousands of machines. MapReduce provides automatic parallelization, failure recovery, and load balancing.
2. What is MapReduce
● A programming model introduced by Google at OSDI '04 for
processing large datasets efficiently.
● Features:
– Automatic parallelization; no parallel programming experience required.
– Data and process redundancy for failure recovery.
– Automatic scheduling and load balancing.
– Easy to program, based on two simple functions:
● Map
● Reduce.
CS245 - 2012 Introduction to MapReduce 2
4. Why MapReduce?
● For a cluster of:
– 2000 machines.
– 16 TB of RAM in total (≈ 8 GB each).
– 2 PB of disk space in total (≈ 1 TB each).
● Use the maximum capacity of the cluster to:
– Implement a parallel word count for input size 100 TB.
– Implement a parallel sort for the same input file.
● Can you use the same code for both applications?
5. How Fast is MapReduce (Hadoop)
● Sort Benchmark competition (http://sortbenchmark.org/):
– 2009: 100 TB in 173 minutes using 3452 nodes:
● 2 x Quad core Xeons @ 2.5 GHz.
● 8 GB RAM.
– 2008: 1 TB in 3.48 minutes using 910 nodes:
● 4 x Dual core Xeons @ 2.0 GHz.
● 8 GB RAM.
7. Map & Reduce functions
● The Mapper (Pick a key):
– Input: Read input from disk.
– Output: Create pairs of <key, value>, known as
intermediate pairs.
– More input partitions == More parallel Mappers.
● The Reducer (Process values):
– Input: a list of <key, value> pairs that all share one key.
– Output: one or more <key, value> pairs.
– More unique keys == More parallel Reducers.
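As a minimal illustration of the two roles, the sketch below writes a mapper and a reducer for word count as plain Python functions. The names `map_words` and `reduce_counts` are hypothetical, and no MapReduce framework is involved; it only shows the shape of the inputs and outputs.

```python
from typing import Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: pick each word as the key and emit an intermediate <word, 1> pair."""
    for word in line.split():
        yield (word, 1)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: receive all values emitted for one key and combine them."""
    return (key, sum(values))


# Intermediate pairs produced from one input line.
print(list(map_words("snow cat cat")))
# One reducer call for the key "cat" with the values collected for it.
print(reduce_counts("cat", [1, 1]))
```

In a real deployment the framework, not the user, would collect the values for each key between the two calls.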
8. How MapReduce Works
1) Partition the input file into M partitions.
2) Create M Map tasks that read the M partitions in parallel and emit
intermediate <key, value> pairs, storing them on local storage.
3) Wait for all Map workers to finish, then sort and partition the
intermediate <key, value> pairs into R regions.
4) Start R Reduce workers; each reads from remote disks the list of
intermediate pairs for a unique key.
5) Write the output of the Reduce workers to file(s).
9. Example – Word count
● Assume the following input:
cat flower picture
snow cat cat
prince flower sun
king queen AC
10. Example – Word count
● Step1: Partition input file into M partitions.
Partition 1: cat flower picture
Partition 2: snow cat cat
Partition 3: prince flower sun
Partition 4: king queen AC
11. Example – Word count
● Step2: Create M Map tasks that read the M partitions in parallel and
emit intermediate <key, value> pairs, storing them on local
storage.
Mapper 1: cat flower picture → <cat,1> <flower,1> <picture,1>
Mapper 2: snow cat cat → <snow,1> <cat,1> <cat,1>
Mapper 3: prince flower sun → <prince,1> <flower,1> <sun,1>
Mapper 4: king queen AC → <king,1> <queen,1> <AC,1>
12. Example – Word count
● Step3: Wait for all Map workers to finish, then sort and partition the
intermediate <key, value> pairs into R regions.
Sorted intermediate pairs, one region per key:
<AC,1>
<cat,1> <cat,1> <cat,1>
<flower,1> <flower,1>
<king,1>
<picture,1>
<prince,1>
<queen,1>
<snow,1>
<sun,1>
13. Example – Word count
● Step4: Start R Reduce workers; each reads from remote disks the list of
intermediate pairs for a unique key.
Reducer 1: <AC,1> → <AC,1>
Reducer 2: <cat,1> <cat,1> <cat,1> → <cat,3>
Reducer 3: <flower,1> <flower,1> → <flower,2>
...
(<king,1>, <picture,1>, <prince,1>, <queen,1> and <snow,1> each go to their own reducer)
...
Reducer 9: <sun,1> → <sun,1>
14. Example – Word count
● Step5: Write the output of reduce workers to file(s).
<AC,1>
<cat,3>
<flower,2>
<king,1>
<picture,1>
<prince,1>
<queen,1>
<snow,1>
<sun,1>
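The five steps of this example can be simulated in a single Python process. This is a rough sketch rather than how Hadoop or Google's implementation works: the function names are hypothetical, each input line is treated as one partition, and a CRC32 hash assigns keys to regions (an assumption; the slides show one key per reducer instead).

```python
import zlib
from collections import defaultdict

LINES = ["cat flower picture", "snow cat cat", "prince flower sun", "king queen AC"]


def mapper(line):
    for word in line.split():
        yield word, 1


def reducer(key, values):
    return key, sum(values)


def region_of(key, num_regions):
    # Deterministic hash partitioning (an assumption, not taken from the slides).
    return zlib.crc32(key.encode("utf-8")) % num_regions


def run_word_count(lines, num_regions):
    # Step 1: partition the input; here each line is one partition (M = len(lines)).
    partitions = [[line] for line in lines]
    # Step 2: one map task per partition emits intermediate <key, value> pairs.
    map_outputs = []
    for partition in partitions:
        output = []
        for line in partition:
            output.extend(mapper(line))
        map_outputs.append(output)
    # Step 3: once all map tasks are done, split the pairs into R regions,
    # grouping the values of each key inside its region.
    regions = [defaultdict(list) for _ in range(num_regions)]
    for output in map_outputs:
        for key, value in output:
            regions[region_of(key, num_regions)][key].append(value)
    # Step 4: one reduce worker per region processes its keys in sorted order.
    results = []
    for region in regions:
        for key in sorted(region):
            results.append(reducer(key, region[key]))
    # Step 5: "write" the output; here it is simply returned, sorted by key.
    return sorted(results)


print(run_word_count(LINES, num_regions=3))
```

The result does not depend on the number of regions, because every pair for a given key always lands in the same region.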
16. MapReduce Failure Recovery
● The framework follows the master-worker paradigm.
● The master keeps a record of the work done on each worker.
● If a worker fails, the master assigns the same work to another
worker.
● If a worker is slow, another copy of the same work is assigned
to another worker.
● If the master fails, a backup copy of the master can pick
up and continue execution from the last checkpoint.
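The reassignment rule for failed workers can be sketched as a small master loop. Everything here is hypothetical (the names `run_tasks` and `flaky_execute`, the use of `RuntimeError` to signal a dead worker), and the sketch leaves out slow workers and master checkpoints.

```python
def run_tasks(tasks, workers, execute):
    """Master loop: try each task on the workers in turn until one succeeds."""
    completed = {}  # the master's record: task -> (worker, result)
    for task in tasks:
        for worker in workers:
            try:
                completed[task] = (worker, execute(worker, task))
                break  # task finished, move on to the next one
            except RuntimeError:
                continue  # this worker failed; hand the same task to another
        else:
            raise RuntimeError(f"no worker could run {task}")
    return completed


def flaky_execute(worker, task):
    # Simulated cluster in which worker "w1" is permanently down.
    if worker == "w1":
        raise RuntimeError("worker w1 is down")
    return f"{task} done"


log = run_tasks(["map-0", "map-1"], ["w1", "w2"], flaky_execute)
print(log)
```

Re-running a task on another worker is safe here because the task is a pure function of its input partition, which is the property the recovery scheme relies on.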
17. Advantages of MapReduce
● Parallel IO: hides disk latency.
● Parallel Processing:
– Map functions work independently in parallel, each
processing one unique partition.
– Reduce functions work independently in parallel, each
on a unique intermediate key.
● Using large clusters of commodity machines gives better
results than small expensive clusters.
18. Advantages of MapReduce
● Parallel IO: hides disk latency.
● Parallel Processing:
– Map functions work independently in parallel, each
processing one unique partition.
– Reduce functions work independently in parallel, each
on a unique intermediate key.
● Using large clusters of commodity machines gives
results comparable to those of small expensive clusters.
20. MapReduce weak points
● Overhead of MapReduce is huge.
● Data-dependent applications may need multiple iterations of
MapReduce, for example:
– K-means.
– PageRank.
● Complex algorithms can be very hard to implement, for example:
– Range Queries.
● Sensitive to skewed distributions of <key,value> pairs.
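The skew problem is visible even in the small word count example used earlier. The sketch below assumes one reducer per key, as in that example, and counts how many intermediate pairs each reducer would receive; the variable names are illustrative only.

```python
from collections import Counter

# The twelve words of the word count example; each becomes one <word, 1> pair.
words = "cat flower picture snow cat cat prince flower sun king queen AC".split()

# With one reducer per key, a reducer's load is the number of pairs for its key.
pairs_per_key = Counter(words)
heaviest_key, heaviest_load = pairs_per_key.most_common(1)[0]
share = heaviest_load / len(words)

print(heaviest_key, heaviest_load, f"{share:.0%}")
```

One of the nine reducers receives a quarter of all pairs, so the job finishes only when that reducer does; on real data with heavy-tailed key frequencies the imbalance can be far larger.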
21. Implementations of MapReduce
● Hadoop in Java.
● Mars in C++ & CUDA.
● Skynet in Ruby.
● Phoenix in C++.
● Microsoft Dryad:
– Schedules multiple levels of MapReduce-like
operations.
23. MapReduce in Database - Ex1
● Select Name from Students where age = 23;
Students:
Name ID Age
Ahmed 1177 23
Bob 1131 20
Sara 1197 22
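One possible way to express this selection, offered as a sketch rather than the intended answer, is a map-only job: the mapper applies the filter and emits the projected column, and no reducer is needed. The name `select_mapper` is hypothetical.

```python
# The Students table from the slide as (Name, ID, Age) rows.
STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]


def select_mapper(row):
    """Emit the name of every student whose age is 23; other rows emit nothing."""
    name, _student_id, age = row
    if age == 23:
        yield name, None  # the value is unused because there is no reduce phase


names = [key for row in STUDENTS for key, _value in select_mapper(row)]
print(names)
```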
24. MapReduce in Database - Ex2
● Select COUNT(Name) from Students where age > 20 group by Name;
Students:
Name ID Age
Ahmed 1177 23
Bob 1131 20
Sara 1197 22
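A group-by aggregate maps naturally onto both phases. In the sketch below, which is one possible formulation with hypothetical function names, the mapper applies the `age > 20` filter and uses the grouping column as the key, and the reducer computes the count for each group.

```python
from collections import defaultdict

# The Students table from the slide as (Name, ID, Age) rows.
STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]


def count_mapper(row):
    """Filter on age and emit <Name, 1> so that rows are grouped by name."""
    name, _student_id, age = row
    if age > 20:
        yield name, 1


def count_reducer(name, ones):
    """COUNT(Name) for one group."""
    return name, sum(ones)


# Stand-in for the framework's shuffle: collect the values of each key.
groups = defaultdict(list)
for row in STUDENTS:
    for key, value in count_mapper(row):
        groups[key].append(value)

counts = sorted(count_reducer(key, values) for key, values in groups.items())
print(counts)
```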
25. MapReduce in Database - Ex3
● Select Name, Term from Students, Enrolment where ID = SID and age != 20;
Students:
Name   ID    Age
Ahmed  1177  23
Bob    1131  20
Sara   1197  22
Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051
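An equality join can be sketched as a reduce-side join: both tables are mapped to the join key, each value is tagged with its table of origin, and the reducer pairs up the two sides. This is one common formulation, not necessarily the one intended by the exercise, and all function names are hypothetical.

```python
from collections import defaultdict

STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
ENROLMENT = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]


def map_student(row):
    """Key students by ID, applying the age != 20 filter on the map side."""
    name, student_id, age = row
    if age != 20:
        yield student_id, ("S", name)


def map_enrolment(row):
    """Key enrolments by SID so they meet the matching student in one reducer."""
    _cid, sid, term = row
    yield sid, ("E", term)


def reduce_join(_key, tagged_values):
    """Pair every student name with every term that shares the same key."""
    names = [value for tag, value in tagged_values if tag == "S"]
    terms = [value for tag, value in tagged_values if tag == "E"]
    for name in names:
        for term in terms:
            yield name, term


# Stand-in for the shuffle: group tagged values from both tables by key.
groups = defaultdict(list)
for row in STUDENTS:
    for key, value in map_student(row):
        groups[key].append(value)
for row in ENROLMENT:
    for key, value in map_enrolment(row):
        groups[key].append(value)

result = sorted(pair for key in groups for pair in reduce_join(key, groups[key]))
print(result)
```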
26. MapReduce in Database - Ex4
● Select Name, Term from Students, Enrolment where ID != SID;
Students:
Name   ID    Age
Ahmed  1177  23
Bob    1131  20
Sara   1197  22
Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051
● What if the condition ID > SID?
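With `!=` or `>` there is no shared key to group on, so the reduce-side approach used for equality joins does not apply directly. One simple workaround, shown here only as a sketch of a technique the slides do not name, is a replicated (broadcast) join: every mapper holds a full copy of the small Students table and scans its own partition of Enrolment. The helper `make_join_mapper` is hypothetical.

```python
import operator

STUDENTS = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
ENROLMENT = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]


def make_join_mapper(students, predicate):
    """Build a mapper that joins one Enrolment row against all students."""
    def mapper(enrolment_row):
        _cid, sid, term = enrolment_row
        for name, student_id, _age in students:
            if predicate(student_id, sid):  # e.g. ID != SID or ID > SID
                yield name, term
    return mapper


not_equal_mapper = make_join_mapper(STUDENTS, operator.ne)
greater_mapper = make_join_mapper(STUDENTS, operator.gt)

ne_rows = [pair for row in ENROLMENT for pair in not_equal_mapper(row)]
gt_rows = sorted(pair for row in ENROLMENT for pair in greater_mapper(row))

print(len(ne_rows))
print(gt_rows)
```

This only works while one table fits in each mapper's memory; when both tables are large, the inequality join needs a more careful partitioning scheme.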
27. MapReduce in Database - Ex5
● Select Name, Term from Students, Enrolment where ID = SID and Admission != Term;
Students:
Name   ID    Age  Admission
Ahmed  1177  23   042
Bob    1131  20   051
Sara   1197  22   042
Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051
28. MapReduce in Database - Ex6
● Select y from R, S, T where R.x = S.x and T.a = S.a;
R: x y z
S: a b x
T: m n a
29. MapReduce in Academic Papers
● NIPS '07: Map-Reduce for Machine Learning on Multicore.
● Escience '08: CloudBLAST: Combining MapReduce and Virtualization on
Distributed Resources for Bioinformatics Applications.
● KDD '09: Large-scale behavioral targeting.
● GCC '09: Spatial Queries Evaluation with MapReduce.
● SIGIR '09: On single-pass indexing with MapReduce.
● MDAC '10: A novel approach to multiple sequence alignment using
hadoop data grids.
● VLDB Endowment '11: Social Content Matching in MapReduce.
● VLDB '12: Building Wavelet Histograms on Large Data in MapReduce.