Simplified Data Processing
On Large Clusters
1
Presented by
Dipen Shah
110420107064
Harsh Kevadia
110420107049
Nancy Sukhadia
110420107025
2
What Is a Cluster?
• A computer cluster consists of a set of loosely or tightly connected
computers that work together so that in many respects they can be
viewed as a single system.
• The components of a cluster are usually connected to each other
through fast local area networks.
• Clusters are usually deployed to improve performance and
availability over that of a single computer.
3
Introduction
• On the web, large amounts of data (Big Data) are stored,
processed, and retrieved within a few milliseconds.
• Big Data cannot be stored, processed, and retrieved on a single
machine.
4
Contd.
• How do large IT companies store their data? And how is that data
processed and retrieved?
• Big Data requires a lot of processing power for both computation
and storage.
5
How To Divide a Large Input Set Into
Smaller Input Sets?
• The master node takes the input, divides it into smaller sub-problems,
and distributes them to worker nodes.
• Each worker node processes its smaller problem and passes the
answer back to the master node.
• This sometimes creates problems for data that arrives in a
sequence: the output of one record can be the input of another.
• The approach is only suitable for data items that are independent of
each other, so that each can be processed without waiting for the
output of the previous one.
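The splitting step above can be sketched in a few lines of Python. This is a toy illustration, not any real MapReduce API; the function name and chunk size are assumptions:

```python
def split_input(records, chunk_size):
    """Divide a large input list into smaller sub-problems for worker nodes."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

# The master would hand each chunk to a worker node.
chunks = split_input(list(range(10)), 3)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```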
6
How To Divide Work Among the
Worker Nodes In the Same Cluster?
• The master node estimates the time required for a normal
computation and also considers the priority of the particular
data-processing task.
• It checks every worker node's schedule and processing speed.
• After analysing this data, the work is assigned to a worker node.
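A minimal sketch of such a scheduling decision, assuming each worker is described by its queued work and processing speed (the names and the cost model are invented for illustration, not part of the MapReduce paper):

```python
def pick_worker(workers, task_cost):
    """Choose the worker expected to finish the new task soonest.
    `workers` maps worker id -> (queued_work, speed); illustrative only."""
    def finish_time(w):
        queued, speed = workers[w]
        return (queued + task_cost) / speed
    return min(workers, key=finish_time)

workers = {"w1": (10.0, 1.0), "w2": (2.0, 0.5), "w3": (1.0, 2.0)}
print(pick_worker(workers, 4.0))  # w3: (1 + 4) / 2 = 2.5, vs 14.0 and 12.0
```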
7
Dividing the Input Creates Problems
And Affects the Output
• When a large set of inputs is interrelated, or the sequence of inputs
matters, we must process the input in the given order.
• We need an algorithm that takes care of all of these problems.
8
Dividing the Input So That Optimized
Performance Can Be Achieved
• How do we divide a problem into sub-problems so that we get
optimized performance? Optimized here means minimum time
required, minimum resources allocated to each process, and good
coordination between the worker nodes in the cluster.
9
What If a Worker Node Fails?
• The master node divides work among the workers and pings each
worker node periodically.
• What if a worker node does not respond, or fails outright?
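The periodic ping can drive a simple failure detector like the following sketch; the timeout value and data layout are assumptions for illustration. Tasks held by a worker flagged here would be rescheduled onto other workers:

```python
def find_failed_workers(last_heartbeat, now, timeout=10.0):
    """Return workers whose last ping response is older than `timeout` seconds."""
    return [w for w, t in last_heartbeat.items() if now - t > timeout]

# w3 has been silent for 12 seconds and is presumed failed.
heartbeats = {"w1": 100.0, "w2": 95.0, "w3": 88.0}
print(find_failed_workers(heartbeats, now=100.0))  # ['w3']
```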
10
What Happens When the Master Node
Fails?
• There is only a single master.
• The whole computation is aborted if the master node fails.
11
Programming Model
• The computation takes a set of input key/value pairs and produces a set of
output key/value pairs.
• The user of the MapReduce library expresses the computation as two
functions: Map and Reduce.
• Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs.
• The MapReduce library groups together all intermediate values associated
with the same intermediate key I and passes them to the Reduce function.
• The Reduce function, also written by the user, accepts an intermediate key I
and a set of values for that key. It merges these values together to form a
possibly smaller set of values.
• Typically, just zero or one output value is produced per Reduce invocation.
• The intermediate values are supplied to the user's Reduce function via an
iterator, which allows the library to handle lists of values that are too large
to fit in memory.
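The grouping step the library performs between Map and Reduce can be sketched in memory like this (the real implementation works on sorted, partitioned data on disk; this only illustrates the idea):

```python
from collections import defaultdict

def group_intermediate(pairs):
    """Group all intermediate values that share the same intermediate key,
    as the MapReduce library does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [("a", 1), ("b", 1), ("a", 1)]
print(group_intermediate(pairs))  # {'a': [1, 1], 'b': [1]}
```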
12
Example:
• Consider the problem of counting the number of occurrences of each word in a large
collection of documents.
• The user would write code similar to the following pseudo-code:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
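A direct, runnable Python translation of the pseudo-code above (a plain list stands in for the iterator the real library supplies, and a dict stands in for the shuffle phase):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    return [(w, 1) for w in value.split()]

def reduce_fn(key, values):
    # key: a word, values: a list of counts
    return sum(values)

def word_count(documents):
    """Run map, group by intermediate key, then reduce."""
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

docs = {"d1": "the quick fox", "d2": "the fox"}
print(word_count(docs))  # {'the': 2, 'quick': 1, 'fox': 2}
```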
13
Cont.
• The map function emits each word plus an associated count of occurrences.
• The reduce function sums together all counts emitted for a particular word.
• In addition, the user writes code to fill in a MapReduce specification object with the
names of the input and output files, and optional tuning parameters.
• The user then invokes the MapReduce function, passing it the specification object.
• The user's code is linked together with the MapReduce library (implemented in
C++).
14
Types:
• Even though the previous pseudo-code is written in terms of string
inputs and outputs, conceptually the map and reduce functions
supplied by the user have associated types:

    map    (k1, v1)       -> list(k2, v2)
    reduce (k2, list(v2)) -> list(v2)

• That is, the input keys and values are drawn from a different domain
than the output keys and values.
• Furthermore, the intermediate keys and values are from the same
domain as the output keys and values.
• The C++ implementation passes strings to and from the user-defined
functions and leaves it to the user code to convert between strings
and appropriate types.
15
MapReduce: Examples
• Distributed grep
• Count of URL access frequency
• Reverse web-link graph
• Term vector per host
• Inverted index
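As a sketch of the inverted-index example from the list above, assuming whitespace tokenization and a tiny in-memory driver standing in for the real library's shuffle step:

```python
def map_fn(doc_id, contents):
    """Emit (word, document id) for each distinct word in the document."""
    return [(w, doc_id) for w in set(contents.split())]

def reduce_fn(word, doc_ids):
    """Merge the document ids for a word into a sorted posting list."""
    return sorted(set(doc_ids))

docs = {"d1": "fox jumps", "d2": "fox sleeps"}
intermediate = {}
for doc_id, text in docs.items():
    for k, v in map_fn(doc_id, text):
        intermediate.setdefault(k, []).append(v)
index = {w: reduce_fn(w, ids) for w, ids in intermediate.items()}
print(index["fox"])  # ['d1', 'd2']
```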
16
Implementation
• Assumptions
• Execution Overview
• Master Data Structures
• Fault Tolerance
• Implementation Issues
17
Assumptions
• Cluster of commodity PCs
• Commodity networking hardware
• Machine failures are common
• Storage on inexpensive local disks
• Jobs are submitted to a scheduling system
18
Execution Overview
1. The input set is split into pieces.
2. Copies of the program start on the cluster; the master assigns
work to idle workers.
3. Each map worker reads its input split and produces intermediate
output.
4. Map workers save their intermediate output to local disk.
5. Reduce workers collect the intermediate data from those local disks.
6. Reduce workers sort the data by key (an external sort, because the
data is too large for memory).
7. The master creates the output files and wakes up the user program,
returning the results.
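In step 4 the intermediate output is partitioned for the reduce workers; the paper's default partitioning function is hash(key) mod R. A sketch, using a stable checksum because Python's built-in string hash is randomized between runs:

```python
import zlib

def partition(key, R):
    """Assign an intermediate key to one of R reduce tasks
    (hash(key) mod R, with crc32 as a reproducible stand-in hash)."""
    return zlib.crc32(key.encode()) % R

keys = ["apple", "banana", "apple"]
buckets = [partition(k, R=4) for k in keys]
# The same key always lands in the same reduce task:
print(buckets[0] == buckets[2])  # True
```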
19
Function:
[Figure: MapReduce execution flow]
20
Master Data Structures
• The state and identity of each worker machine
• The locations of intermediate files
• Updates of those locations as map tasks complete
• The sizes of the intermediate files
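The master's bookkeeping can be pictured as a small data structure; the field names below are illustrative assumptions based on the description above, not Google's actual code:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TaskState:
    """Per-task bookkeeping kept by the master (illustrative fields)."""
    state: str = "idle"               # idle, in-progress, or completed
    worker: Optional[str] = None      # identity of the worker machine
    # Locations and sizes of intermediate files produced by the task;
    # the master pushes location updates to the reduce workers.
    intermediate_files: List[Tuple[str, int]] = field(default_factory=list)

t = TaskState()
t.state, t.worker = "in-progress", "worker-7"
t.intermediate_files.append(("/local/map-0-part-1", 64_000))
print(t.state, len(t.intermediate_files))  # in-progress 1
```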
21
Fault Tolerance
• Worker failure
• Master failure
• Master election:
1. Manually
2. Highest IP address
3. Highest MAC address
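The "highest IP address" election rule can be sketched in a few lines. This is a toy illustration only; a real election protocol also needs the surviving nodes to reach agreement:

```python
import ipaddress

def elect_master(node_ips):
    """Pick the node with the numerically highest IP address."""
    return max(node_ips, key=lambda ip: int(ipaddress.ip_address(ip)))

print(elect_master(["10.0.0.5", "10.0.1.2", "10.0.0.40"]))  # 10.0.1.2
```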
22
Implementation Issues
• Backup tasks
• Network bandwidth
• Locality
23
Conclusion
• Google attributes this success to several reasons. First, the model is
easy to use, even for programmers without experience with parallel
and distributed systems, since it hides the details of parallelization,
fault tolerance, locality optimization, and load balancing. Second, a
large variety of problems are easily expressible.
• Google uses MapReduce for its web search service, for sorting, for
data mining, for machine learning, and for many other systems.
24
References
1. Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David
A. Patterson. High-performance sorting on networks of workstations. In Proceedings of the 1997 ACM
SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997.
2. Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David
Patterson, and Kathy Yelick. Cluster I/O with River: Making the fast case common. In Proceedings of the
Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS '99), pages 10–22, Atlanta,
Georgia, May 1999.
3. Arash Baratloo, Mehmet Karaul, Zvi Kedem, and Peter Wyckoff. Charlotte: Metacomputing on the web.
In Proceedings of the 9th International Conference on Parallel and Distributed Computing Systems,
1996.
4. Luiz A. Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The Google cluster architecture.
IEEE Micro, 23(2):22–28, April 2003.
5. John Bent, Douglas Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Explicit
control in a batch-aware distributed file system. In Proceedings of the 1st USENIX Symposium on
Networked Systems Design and Implementation (NSDI), March 2004.
6. Guy E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, C-38(11),
November 1989.
7. Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-based
scalable network services. In Proceedings of the 16th ACM Symposium on Operating System Principles,
pages 78–91, Saint-Malo, France, 1997.
25
Thank You
Q/A!
26
