Chapter 8
Data Intensive Computing:
Map-Reduce Programming
INTRODUCTION
 Data intensive computing focuses on a
class of applications
 Deal with a large amount of data
 Several application fields produce large volumes
of data that need to be efficiently stored, made
accessible, indexed, and analyzed
 Ranging from computational science to social
networking
 Become challenging as the quantity of information
accumulates and increases over time at higher rates
 Distributed computing is definitely of help in
addressing these challenges
 By providing more scalable and efficient storage
architectures
INTRODUCTION (cont.)
 And better performance in terms of data
computation and processing
 The use of parallel and distributed
techniques as a support of data intensive
computing is not straightforward
 Several challenges in form of data
representation, efficient algorithms, and scalable
infrastructures need to be faced
 This chapter characterizes the nature of data-intensive
computing
 Presents an overview of the challenges introduced
by the production of large volumes of data
INTRODUCTION (cont.)
 And how they are handled by storage systems
and computing models
 MapReduce is a popular programming
model
 For creating data intensive applications and their
deployment on Clouds
 Practical examples of MapReduce applications for
data intensive computing are demonstrated
 By using Aneka MapReduce Programming Model
What is Data-Intensive Computing
 Data-intensive computing
 Concerned with production, manipulation, and
analysis of large-scale data
 In the range of hundreds of megabytes (MB) to
petabytes (PB) and beyond
 The term dataset is commonly used to
identify a collection of information elements
 Relevant to one or more applications
 Datasets are often maintained in repositories
 Infrastructures supporting the storage, retrieval, and
indexing of large amount of information
 Metadata are attached to datasets to facilitate the
classification and the search of relevant bits of
information
What is Data-Intensive Computing
(cont.)
 Data-intensive computations occur in many
application domains
 Computational science is one of the most popular
 Scientific simulations and experiments routinely
produce, analyze, and process huge volumes of data
 e.g., Hundreds of gigabytes of data are produced
every second by telescopes mapping the sky
 The collection of images of the sky easily reaches the
scale of petabytes over a year
 Bioinformatics applications mine databases
 May end up containing terabytes of data
 Earthquakes simulators process a massive
amount of data
What is Data-Intensive Computing
(cont.)
 Produced as a result of recording the vibrations of the
Earth across the entire globe
 Several IT industry sectors also require
support for data-intensive computations
 The customer data of any telecom company easily
falls in the range of 10-100 terabytes
 This volume of information is not only processed to
generate bill statements
 Also mined to identify scenarios, trends, and patterns
helping these companies to provide a better service
 Currently the US handset mobile traffic has
reached 8 petabytes per month
 Expected to grow up to 327 petabytes per month by
2015
What is Data-Intensive Computing
(cont.)
 The scale of petabytes is even more
common when considering IT giants
 e.g., Google
 Reported to process about 24 petabytes of information
per day and to sort petabytes of data in hours
 Social networking and gaming are two
other sectors
 Data intensive computing is now a reality
 Facebook inbox search operations involve
crawling about 150 terabytes of data
 The whole uncompressed data stored by the
distributed infrastructure reaches 36 petabytes
What is Data-Intensive Computing
(cont.)
 Zynga, a social gaming platform, moves 1 petabyte
of data daily
 Reported to add 1,000 servers every week to store
the data generated by its games
Characterizing Data-Intensive
Computations
 Data-intensive applications do not only deal
with huge volumes of data
 Also exhibit compute-intensive properties
 The domain of data-intensive computing
 Data intensive applications handle datasets in the
scale of multiple terabytes and petabytes
 Datasets are commonly persisted in several formats
and distributed across different locations
 Process data in multistep analytical pipelines including
transformation and fusion stages
 The processing requirements scale almost
linearly with the data size
 Can be easily processed in parallel
Characterizing Data-Intensive
Computations (cont.)
 Also need efficient mechanisms for data management,
filtering and fusion, and efficient querying and
distribution
Challenges Ahead
 The huge amount of data produced,
analyzed, or stored imposes requirements
on the supporting infrastructures and
middleware
 Hardly found in the traditional solutions for
distributed computing
 e.g., The location of data is crucial
 The need for moving terabytes of data becomes an
obstacle to high-performance computations
 Data partitioning helps improve the
performance of data-intensive applications
 As do content replication and scalable algorithms
Challenges Ahead (cont.)
 Open challenges in data-intensive
computing
 Scalable algorithms
 Can search and process massive datasets
 New metadata management technologies
 Can scale to handle complex, heterogeneous, and
distributed data sources
 Advances in high-performance computing
platforms
 Aimed at providing a better support for accessing in-
memory multi-terabyte data structures
 High-performance, highly reliable, petascale
distributed file systems
Challenges Ahead (cont.)
 Data signature generation techniques
 For data reduction and rapid processing
 New approaches to software mobility
 For delivering algorithms able to move the
computation where the data is located
 Specialized hybrid interconnection architectures
 Providing a better support for filtering multi-gigabyte
data streams coming from high speed networks and
scientific instruments
 Flexible and high-performance software
integration techniques
 Facilitating the combination of software modules
running on different platforms to quickly form
analytical pipelines
Data Clouds and Big Data
 Large datasets have mostly been the
domain of scientific computing
 Starting to change as massive amount of data
are being produced, mined, and crunched by
companies providing Internet services
 e.g., Searching, on-line advertisement, and social
media
 Critical for such companies to efficiently analyze
these huge datasets
 They constitute a precious source of information about
their customers
 Logs analysis is an example of data-intensive
operation commonly performed in this context
Data Clouds and Big Data (cont.)
 Companies like Google have a massive amount of data
in the form of logs that are daily processed by using
their distributed infrastructure
 They have settled on analytic infrastructures that differ
from the Grid-based infrastructure used by the scientific
community
 Cloud computing technologies support
data-intensive computations
 The term Big Data has become popular
 Characterizes the nature of data-intensive
computations nowadays
 Currently identifies datasets that grow so large that
they become complex to work with using on-hand
database management tools
Data Clouds and Big Data (cont.)
 Relational databases and desktop
statistics/visualization packages become
ineffective for that amount of information
 Requiring instead massively parallel software running
on tens, hundreds, or even thousands of servers
 Big Data problems are found in non-
scientific application domains
 e.g., Web logs, RFID, sensor networks, social
networks, Internet text and documents
 Internet search indexing, call detail records,
military surveillance, medical records,
photography archives, video archives
 Large scale e-Commerce
Data Clouds and Big Data (cont.)
 The massive size of data
 New data accumulates over time rather than
replacing old data
 Big Data applies to datasets whose size is
beyond the ability of commonly used
software tools
 To capture, manage, and process the data within
a tolerable elapsed time
 A constantly moving target currently ranging
from a few dozen terabytes to many petabytes of
data in a single dataset
 Cloud technologies support data-intensive
computing in several ways
Data Clouds and Big Data (cont.)
 By providing a large amount of compute
instances on demand
 Can be used to process and analyze large datasets in
parallel
 By providing a storage system and other
distributed data store architectures
 Optimized for keeping large blobs of data
 By providing frameworks and programming APIs
 Optimized for the processing and management of
large amount of data
 Mostly coupled with a specific storage infrastructure to
optimize the overall performance of the system
 A Data Cloud is a combination of these
components
Data Clouds and Big Data (cont.)
 e.g., the MapReduce framework
 Provides the best performance when leveraging the
Google File System on top of Google’s large computing
infrastructure
 The Hadoop system
 The most mature, large, and open source Data Cloud
 Consists of the Hadoop Distributed File System
(HDFS) and Hadoop’s implementation of MapReduce
Databases and Data-Intensive
Computing
 Distributed databases have been
considered the natural evolution of
database management systems
 As the scale of the datasets becomes
unmanageable with a single system
 A collection of data stored at different sites of a
computer network
 Each site may expose a degree of autonomy providing
services for the execution of local applications
 Also participates in the execution of a global
application
 Can be created by splitting and scattering the
data of an existing database over different sites
Databases and Data-Intensive
Computing (cont.)
 Or by federating together multiple existing databases
 These systems are very robust
 Provide distributed transaction processing, distributed
query optimization, and efficient management of
resources
 Mostly concerned with datasets that can be
expressed by using the relational model
 The need to enforce ACID properties on data
limits their abilities to scale as Data Clouds and
Grids do
 Atomicity, Consistency, Isolation, Durability
Technologies for Data-Intensive
Computing
 Data-intensive computing concerns the
development of applications
 Mainly focused on processing large quantities of
data
 Storage systems and programming models
constitute a natural classification of the
technologies supporting data-intensive computing
Storage Systems
 Database management systems constitute
the de-facto storage support for several
types of applications
 The relational model in its original formulation
does not seem to be the preferred solution for
supporting data analytics at a large scale
 Due to the explosion of unstructured data in blogs,
web pages, software logs, and sensor readings
 New opportunities arise
 Some factors contributing to this change
 Growing popularity of Big Data
 The management of large quantities of data is no
longer a rare case
Storage Systems (cont.)
 It has become common in several fields: scientific
computing, enterprise applications, media
entertainment, natural language processing, and
social network analysis
 The large volume of data imposes new and more
efficient techniques for their management
 Growing importance of data analytics in the
business chain
 The management of data is no longer considered a
cost but a key element of the business profit
 This situation arises in popular social networks
 e.g., Facebook concentrates the focus on the
management of user profiles, interests, and
connections among people
 This massive amount of data, being constantly mined,
requires new technologies and strategies supporting
data analytics
Storage Systems (cont.)
 Presence of data in several forms, not only
structured
 What constitutes relevant information today exhibits a
heterogeneous nature and appears in several forms
and formats
 Structured data is constantly growing as a result of
the continuous use of traditional enterprise
applications and systems
 The advances in technology and the democratization
of the Internet as a platform where everyone can pull
information have created a massive amount of
unstructured information
 Does not naturally fit into the relational model
 New approaches and technologies for computing
 Cloud computing promises access to a massive
amount of computing capacity on demand
Storage Systems (cont.)
 Allows engineers to design software systems that
incrementally scale to arbitrary degrees of parallelism
 It is no longer rare to build software applications
and services dynamically deployed on hundreds or
thousands of nodes, belonging to the system for a few
hours or days
 Classical database infrastructures are not designed to
provide support to such a volatile environment
 All these factors identify the need for new
data management technologies
 The major directions towards support for data-
intensive computing are constituted by
 Advances in distributed file systems for the
management of raw data in the form of files
 Distributed object stores
 The spread of the NoSQL movement
High Performance Distributed File
Systems and Storage Clouds
 Distributed file systems constitute the
primary support for the management of
data
 Provide an interface in which to store information in
the form of files and later access them for read
and write operations
 Few file systems specifically address the
management of huge quantities of data on
a large number of nodes
 Mostly these file systems constitute the data
storage support for large computing clusters,
supercomputers, massively parallel architectures,
and lately storage/computing Clouds
Google File System (GFS)
 GFS is the storage infrastructure
 Supporting the execution of distributed
applications in the Google’s computing Cloud
 Has been designed to be a fault-tolerant, highly
available, distributed file system
 Built on commodity hardware and standard Linux
operating systems
 Not a generic implementation of a distributed file
system
 GFS specifically addresses the needs of
Google in terms of distributed storage
 Designed with the following assumptions
 The system is built on top of commodity hardware
that often fails
Google File System (GFS) (cont.)
 The system stores a modest number of large files;
multi-GB files are common and should be treated
efficiently; small files must be supported but there is
no need to optimize for that
 The workloads primarily consist of two kinds of reads:
large streaming reads and small random reads
 The workloads also have many large, sequential writes
that append data to files
 High-sustained bandwidth is more important than low
latency
 The architecture of the file system is
organized into
 A single master containing the metadata of the
entire file system
Google File System (GFS) (cont.)
 A collection of chunk servers providing storage
space
 The system is composed of a collection of
software daemons
 Implement either the master server or the chunk
server
 A file is a collection of chunks
 The size can be configured at file system level
 Chunks are replicated on multiple nodes in order
to tolerate failures
 Clients look up the master server
 Identify the specific chunk of a file they want to
access
Google File System (GFS) (cont.)
 Once the chunk is identified, the interaction happens
between the client and the chunk server
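 The read path can be made concrete with a minimal conceptual sketch; the types
and data below are hypothetical and do not reflect the real GFS interfaces
using System;
using System.Collections.Generic;

// Conceptual sketch only: metadata comes from the single master, while the chunk
// payload is read directly from one of the replica chunk servers.
class MasterModel
{
    // "file#chunkIndex" -> chunk servers holding a replica of that chunk
    private readonly Dictionary<string, string[]> locations = new Dictionary<string, string[]>
    {
        ["/logs/app.log#0"] = new[] { "chunkserver-12", "chunkserver-31", "chunkserver-07" }
    };

    public string[] LookupChunk(string file, int chunkIndex) => locations[$"{file}#{chunkIndex}"];
}

class ClientModel
{
    static void Main()
    {
        var master = new MasterModel();
        // 1. Metadata lookup on the master.
        string[] replicas = master.LookupChunk("/logs/app.log", 0);
        // 2. Data transfer happens directly with a chunk server; the master is not involved.
        Console.WriteLine($"Reading chunk 0 of /logs/app.log from {replicas[0]} " +
                          $"(replicated on {replicas.Length} nodes)");
    }
}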
 Applications interact through the file
system with specific interfaces
 Supporting the usual operations for file creation,
deletion, read, and write
 Also supporting snapshots and record append
operations frequently performed by applications
 GFS has been conceived by considering
 Failures in a large distributed infrastructure are
common rather than a rarity
 Specific attention has been put into implementing a
highly available, lightweight, and fault-tolerant
infrastructure
Google File System (GFS) (cont.)
 The potential single point of failure of the single
master architecture
 Addressed by giving the possibility of replicating the
master node on any other node belonging to the
infrastructure
 A stateless daemon and extensive logging
capabilities facilitate system recovery from
failures
Not Only SQL (NoSQL) Systems
 The term NoSQL was originally used to identify a
relational database
 Does not expose a SQL interface to manipulate
and query data
 SQL: Structured Query Language
 Relies on a set of UNIX shell scripts and
commands to operate on text files containing the
actual data
 Strictly speaking, NoSQL cannot be
considered a relational database
 Not a monolithic piece of software organizing the
information according to the relational model
 Instead a collection of scripts
Not Only SQL (NoSQL) Systems
(cont.)
 Allow users to manage most of the simplest and most
common database tasks by using text files as the
information store
 Now the term NoSQL labels all those
database management systems
 That do not use a relational model
 And provide simpler and faster alternatives for data
manipulation
 A big umbrella encompassing all the storage and
database management systems
 Differ in some way from the relational model
 The general philosophy is to overcome the
restrictions imposed by the relational model
 To provide more efficient systems
Not Only SQL (NoSQL) Systems
(cont.)
 Implies the use of tables without fixed schemas to
accommodate a larger range of data types
 Or avoiding joins to increase the performance and
scale horizontally
 Two main reasons have determined the
growth of the NoSQL movement
 Simple data models are often enough to
represent the information used by applications
 The quantity of information contained in
unstructured formats has considerably grown in
the last decade
 This makes software engineers look for alternatives
more suitable to their specific application domain
 Several different initiatives explored the use of non-
relational storage systems
Not Only SQL (NoSQL) Systems
(cont.)
 A broad classification distinguishes NoSQL
implementations into
 Document stores
 Apache Jackrabbit, Apache CouchDB, SimpleDB,
Terrastore
 Graphs
 AllegroGraph, Neo4j, FlockDB, Cerebrum
 Key-value stores
 A macro classification that is further categorized into
 Key-value store on disk, key-value caches in RAM
 Hierarchical key-value stores, eventually consistent
key-value stores, and ordered key-value stores
 Multi-value databases
 OpenQM, Rocket U2, OpenInsight
Not Only SQL (NoSQL) Systems
(cont.)
 Object databases
 ObjectStore, JADE, ZODB
 Tabular stores
 Google BigTable, Hadoop HBase, Hypertable
 Tuple stores
 Apache River
Google Bigtable
 Bigtable is the distributed storage system
 Designed to scale up to petabytes of data across
thousands of servers
 Provides storage support for several Google
applications exposing different types of workload
 From throughput-oriented batch-processing jobs to
latency-sensitive serving of data to end users
 The key design goals are wide applicability,
scalability, high performance, and high
availability
 Organizes the data storage in tables whose rows are
distributed over the distributed file system supporting
the middleware, i.e., the Google File System
 A table is a multidimensional sorted map
Google Bigtable (cont.)
 Indexed by a key represented by a string of
arbitrary length
 Organized in rows and columns
 Columns can be grouped in column families
 Allow for specific optimizations for better access
control and for the storage and indexing of data
 Client applications are provided with a very simple
data access model
 Allow them to address data at column level
 Each column value is stored in multiple versions
 Can be automatically time-stamped by Bigtable or by
the client applications
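 This data model can be pictured as a sorted, multidimensional map addressed by
(row key, column, timestamp); the following is a conceptual sketch only, with
hypothetical types rather than the Bigtable API
using System;
using System.Collections.Generic;

// Conceptual sketch of the data model: (row key, "family:qualifier", timestamp) -> value,
// kept in sorted order; the real system orders versions by decreasing timestamp.
class BigtableModel
{
    private readonly SortedDictionary<(string, string, long), string> cells =
        new SortedDictionary<(string, string, long), string>();

    public void Put(string row, string column, string value, long? timestamp = null)
    {
        // Each column value is stored in multiple versions; the timestamp can be
        // assigned automatically or supplied by the client application.
        long ts = timestamp ?? DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        cells[(row, column, ts)] = value;
    }

    public int VersionCount => cells.Count;

    static void Main()
    {
        var table = new BigtableModel();
        // Two versions of the same cell, distinguished only by their timestamps.
        table.Put("com.example/index.html", "anchor:home", "Example", timestamp: 1);
        table.Put("com.example/index.html", "anchor:home", "Example Site", timestamp: 2);
        Console.WriteLine(table.VersionCount);   // prints 2
    }
}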
 Besides the basic data access
Google Bigtable (cont.)
 Bigtable APIs also allow more complex operations
 e.g., Single row transactions and advanced data
manipulation by means of the Sawzall scripting
language or the MapReduce APIs
 Sawzall is an interpreted procedural programming
language developed at Google for the manipulation of
large quantities of tabular data
 Includes specific capabilities for supporting statistical
aggregation of values read or computed from the
input and other features that simplify the parallel
processing of petabytes of data
 An overview of the Bigtable infrastructure
 A collection of processes that co-exist with other
processes in a cluster-based environment
 Bigtable identifies two kinds of processes
Google Bigtable (cont.)
 Master processes and tablet server processes
 A tablet server is responsible for serving the
requests for a given tablet
 A contiguous partition of rows of a table
 Each tablet server can manage multiple tablets
 Commonly from ten to a thousand
 The master server is responsible
 For keeping track of the status of the tablet servers
 And for the allocation of tablets to tablet servers
 The master server constantly monitors the tablet
servers to check whether they are alive
 In case they are not reachable, the allocated tablets
are reassigned and eventually partitioned to other
servers
Google Bigtable (cont.)
 Chubby supports the activity of the master and
tablet servers
 A distributed highly available and persistent lock
service
 System monitoring and data access is filtered
through Chubby
 It is also responsible for managing replicas and
providing consistency among them
 At the very bottom layer, the data is stored into
the Google File System in the form of files
 All the update operations are logged to files for
easy recovery of data in case of failures
 Or when tablets need to be reassigned to other servers
 Bigtable uses a specific file format for storing the
data of a tablet
Google Bigtable (cont.)
 Can be compressed for optimizing the access and the
storage of data
 Bigtable is the result of a study of the
requirements of several distributed
applications in Google
 Serves as a storage backend for 60 applications
and manages petabytes of data
 e.g., Google Personalized Search, Google Analytics,
Google Finance, and Google Earth
Hadoop HBase
 HBase is the distributed database
 Supporting the storage needs of the Hadoop
distributed programming platform
 Inspired by Google Bigtable
 Its main goal is to offer real time read/write
operations for tables with billions of rows and
millions of columns
 By leveraging clusters of commodity hardware
 Its internal architecture and logical model are very
similar to those of Google Bigtable
 The entire system is backed by the Hadoop
Distributed File System (HDFS)
 Mimics the structure and the services of GFS
Programming Platforms
 Platforms for programming data intensive
applications provide abstractions
 Helping to express the computation over a large
quantity of information
 And runtime systems able to efficiently manage huge
volumes of data
 Database management systems based on
the relational model have been used
 To express the structure and the connections
between the entities of a data model
 Proven to be unsuccessful in the case of Big
Data
Programming Platforms (cont.)
 Information is mostly found unstructured or semi-
structured
 Data is most likely to be organized in files of large size
or a huge number of medium size files rather than
rows in a database
 Distributed workflows have often been used
 To analyze and process large amounts of data
 Introduce a plethora of frameworks for workflow
management systems
 Eventually incorporated capabilities to leverage the
elastic features offered by Cloud computing
 These systems are fundamentally based on the
abstraction of task
 Puts a lot of burden on the developer
Programming Platforms (cont.)
 Needs to deal with data management and, often, data
transfer issues
 Programming platforms for data intensive
computing provide higher level abstractions
 Focus on the processing of data
 Move into the runtime system the management
of transfers
 Making the data always available where needed
 The approach followed by the MapReduce
programming platform
 Expresses the computation in the form of two simple
functions: map and reduce
 Hides away the complexities of managing large and
numerous data files into the distributed file system
supporting the platform
MapReduce Programming Model
 MapReduce is a programming platform
introduced by Google
 For processing large quantities of data
 Expresses the computation logic of an application
into two simple functions: map and reduce
 Data transfer and management is completely
handled by the distributed storage infrastructure
 i.e. The Google File System
 In charge of providing access to data, replicate files,
and eventually move them where needed
 Developers do not have to handle these issues
 Provided with an interface presenting data at a higher
level: as a collection of key-value pairs
MapReduce Programming Model
(cont.)
 The computation of the applications is organized
in a workflow of map and reduce operations
 Entirely controlled by the runtime system
 Developers have only to specify how the map and
reduce functions operate on the key value pairs
 The model is expressed in the form of the
two functions defined as follows
 map(k1,v1)  list(k2,v2)
 reduce(k2, list(v2))  list(v2)
 The map function reads a key-value pair
 Produces a list of key-value pairs of different types
 The reduce function reads a pair composed by a
key and a list of values
MapReduce Programming Model
(cont.)
 Produces a list of values of the same type
 The types (k1,v1,k2,v2) used in the expression
of the two functions provide hints on
 How these two functions are connected and are
executed to carry out the computation of a MapReduce
job
 The output of map tasks is aggregated together by
grouping the values according to their associated keys
 Constitute the input of reduce tasks
 For each of the keys found, the reduce task reduces the
list of attached values to a single value
 The input of a computation is expressed as a
collection of key-value pairs <k1,v1>
 The final output is represented by a list of values: list(v2)
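 The two functions can be made concrete with a minimal, purely local sketch
(illustrative code with no distribution or framework involved): map is applied to
every input pair, the intermediate pairs are grouped by key, and reduce is applied
to each group
using System;
using System.Collections.Generic;
using System.Linq;

static class MapReduceModel
{
    // map(k1,v1) -> list(k2,v2);  reduce(k2, list(v2)) -> list(v2)
    public static IEnumerable<V2> Run<K1, V1, K2, V2>(
        IEnumerable<KeyValuePair<K1, V1>> input,
        Func<K1, V1, IEnumerable<KeyValuePair<K2, V2>>> map,
        Func<K2, IEnumerable<V2>, IEnumerable<V2>> reduce)
    {
        return input
            .SelectMany(pair => map(pair.Key, pair.Value))   // map phase
            .GroupBy(pair => pair.Key, pair => pair.Value)   // group values by key k2
            .SelectMany(group => reduce(group.Key, group));  // reduce phase
    }

    static void Main()
    {
        // Word count expressed with the two functions: k1 = line offset, v1 = line,
        // k2 = word, v2 = partial count.
        var input = new[]
        {
            new KeyValuePair<long, string>(0, "the quick brown fox"),
            new KeyValuePair<long, string>(20, "the lazy dog")
        };
        var output = Run<long, string, string, int>(
            input,
            (offset, line) => line.Split(' ').Select(w => new KeyValuePair<string, int>(w, 1)),
            (word, counts) => new[] { counts.Sum() });
        Console.WriteLine(string.Join(",", output));   // one total per distinct word
    }
}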
MapReduce Programming Model
(cont.)
 A reference workflow characterizing
MapReduce computations
 The user submits a collection of files that are
expressed in the form of a list of <k1,v1> pairs
 Specifies the map and reduce functions
 These files are entered into the distributed file
system supporting MapReduce
 If necessary, partitioned to be the input of map tasks
 Map tasks generate intermediate files that store
collections of <k2, list(v2)> pairs
 These files are saved into the distributed file system
 The MapReduce runtime may finally aggregate
the values corresponding to the same keys
MapReduce Programming Model
(cont.)
 These files constitute the input of reduce tasks
 Eventually produce output files in the form of list(v2)
 The operation performed by reduce tasks is
generally expressed as an aggregation of all the
values mapped by a specific key
 Responsibilities of the MapReduce runtime
 The number of map and reduce tasks to create, the
way in which files are partitioned with respect to these
tasks, and how many map tasks are connected to a
single reduce task
 Responsibilities of the distributed file system
supporting MapReduce
 The way in which files are stored and moved
MapReduce Programming Model
(cont.)
MapReduce Programming Model
(cont.)
 The computation model expressed by
MapReduce is very straightforward
 Allows a greater productivity for those who have
to code the algorithms for processing huge
quantities of data
 Proven to be successful in the case of Google
 The majority of the information that needs to be
processed is stored in textual form and represented by
web pages or log files
 Some of the examples showing the
flexibility of MapReduce
 Distributed Grep
MapReduce Programming Model
(cont.)
 The grep operation performs the recognition of
patterns within text streams
 Performed across a wide set of files
 MapReduce is leveraged to provide a parallel and
faster execution of this operation
 The input file is a plain text file
 The map function emits a line into the output each
time it recognizes the given pattern
 The reduce task aggregates all the lines emitted by
the map tasks into a single file
 Count of URL-Access Frequency
 MapReduce is used to distribute the execution of web-
server log parsing
 The map function takes as input the log of a web
server
MapReduce Programming Model
(cont.)
 Emits into the output file a key-value pair <URL,1>
for each page access recorded in the log
 The reduce function aggregates all these lines by the
corresponding URL
 Summing the single accesses and outputting a <URL,
total-count> pair (a code sketch of this example follows
the list)
 Reverse Web-Link Graph
 The Reverse web-link graph keeps track of all the
possible web pages that lead to a given link
 Input files are simple HTML pages that are scanned by
map tasks
 Emitting <target, source> pairs for each of the links
found in the source web page
 The reduce task will collate all the pairs that have the
same target into a <target, list(source)> pair
MapReduce Programming Model
(cont.)
 The final result is one or more files containing
these mappings
 Term-Vector per Host
 A term vector recaps the most important words
occurring in a set of documents in the form of
list(<word, frequency>)
 The number of occurrences of a word is taken as a
measure of its importance
 MapReduce is used to provide a mapping between the
origin of a set of documents, obtained as the host
component of the URL of a document, and the
corresponding term vector
 The map task creates a pair <host, term-vector> for
each text document retrieved
 The reduce task aggregates the term vectors of
documents retrieved from the same host
MapReduce Programming Model
(cont.)
 Inverted Index
 The inverted index contains information about the
presence of words in documents
 Useful to allow fast full text searches if compared to
direct document scans
 In this case, the map task takes as input a document
 For each document it emits a collection of <word,
document-id>
 The reduce function aggregates the occurrences of the
same word, producing a pair <word, list(document-
id)>
 Distributed Sort
 MapReduce is used to parallelize the execution of a
sort operation over a large number of records
MapReduce Programming Model
(cont.)
 This application mostly relies on the properties of the
MapReduce runtime rather than on the operations
performed by the map and reduce tasks
 The runtime sorts and creates partitions of the
intermediate files
 The map task extracts the key from a record and
emits a <key, record> pair for each record
 The reduce task will simply copy through all the pairs
 The actual sorting process is performed by the
MapReduce runtime
 Which emits and partitions the key-value pairs by
ordering them according to their keys
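 A self-contained sketch of the count-of-URL-access-frequency example above
(illustrative code, not tied to any specific MapReduce framework); the log line
layout assumed here is hypothetical
using System;
using System.Collections.Generic;
using System.Linq;

static class UrlAccessFrequency
{
    // map: one <URL, 1> pair for each access recorded in the web server log
    public static IEnumerable<KeyValuePair<string, int>> Map(string logLine)
    {
        string url = logLine.Split(' ')[0];   // assume the URL is the first field of the line
        yield return new KeyValuePair<string, int>(url, 1);
    }

    // reduce: <URL, list(1,1,...)> -> <URL, total-count>
    public static KeyValuePair<string, int> Reduce(string url, IEnumerable<int> counts)
        => new KeyValuePair<string, int>(url, counts.Sum());

    static void Main()
    {
        var log = new[] { "/index.html 200", "/about.html 200", "/index.html 304" };
        var totals = log.SelectMany(Map)                     // map phase
                        .GroupBy(p => p.Key, p => p.Value)   // shuffle: group by URL
                        .Select(g => Reduce(g.Key, g));      // reduce phase
        foreach (var pair in totals)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}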
 The examples above are mostly concerned with
text-based processing
MapReduce Programming Model
(cont.)
 MapReduce can also be used to solve a
wider range of problems
 With some adaptation
 An interesting example is its application in the
field of machine learning
 Statistical algorithms such as Support Vector Machines
(SVM), Linear Regression (LR), Naïve Bayes (NB), and
Neural Network (NN)
 Expressed in the form of map and reduce functions
 Other interesting applications can be found in the
field of compute intensive applications
 e.g., The computation of Pi with high degree of
precision
MapReduce Programming Model
(cont.)
 The Yahoo Hadoop cluster has been used to compute
the (10^15 + 1)st bit of Pi
 Hadoop is an open source implementation of the
MapReduce platform
 Any computation that can be expressed in the
form of the two major stages below can be
represented as a MapReduce computation
 Analysis
 This phase operates directly on the input data files
 Corresponds to the operation performed by the map
task
 The computation at this stage is expected to be
embarrassingly parallel since map tasks are executed
without any sequencing or ordering
MapReduce Programming Model
(cont.)
 Aggregation
 This phase operates on the intermediate results
 Characterized by operations aimed at aggregating,
summing, and elaborating the data obtained at the
previous stage to present it into its final form
 The task performed by the reduce function
 Adaptations to this model are mostly
concerned with
 Identifying the appropriate keys
 How to create reasonable keys when the original
problem does not have such a model
 How to partition the computation between map
and reduce functions
MapReduce Programming Model
(cont.)
 More complex algorithms can be
decomposed into multiple MapReduce
programs
 The output of one program constitutes the input
of the following program
 The abstraction proposed by MapReduce
provides developers with a very minimal
interface
 Strongly focused on the algorithm to implement
 Rather than the infrastructure on which it is executed
 A very effective approach that at the same time
delegates a lot of common tasks
MapReduce Programming Model
(cont.)
 That are important in the management of a distributed
application to the MapReduce runtime
 Allowing the user only to specify configuration
parameters to control the behavior of applications
 These tasks are the management of data transfer
 The scheduling of map and reduce tasks over a
distributed infrastructure
 A more complete overview of a MapReduce
infrastructure
 According to the implementation proposed by
Google
 The user submits the execution of MapReduce
jobs by using the client libraries
MapReduce Programming Model
(cont.)
 In charge of submitting the input data files
 Registering the map and reduce functions
 Returning the control to the user once the job is
completed
 A generic distributed infrastructure, i.e. a cluster,
can be used to run MapReduce applications
 Equipped with job scheduling capabilities and
distributed storage
 Two different kinds of processes are run on the
distributed infrastructure
 A master process and a worker process
 The master process is in charge of controlling the
execution of map and reduce tasks
 Partition and reorganize the intermediate output
produced by the map task to feed the reduce tasks
MapReduce Programming Model
(cont.)
 The worker processes are used to host the
execution of map and reduce tasks
 Provide basic I/O facilities that are used to interface
the map and reduce tasks with input and output files
 In a MapReduce computation, input files are
initially divided into splits
 Generally 16 to 64 MB
 Stored in the distributed file system
 The master process generates the map tasks
 Assigns input splits to each of them by balancing the
load
 Worker processes have input and output buffers
 Used to optimize the performance of map and reduce
tasks
MapReduce Programming Model
(cont.)
 Output buffers for map tasks are periodically
dumped to disk to create intermediate files
 Intermediate files are partitioned by using a user-
defined function to split evenly the output of map
tasks
 The locations of these pairs are notified to the
master process
 Which forwards this information to the reduce tasks,
which collect the required input via remote procedure
calls that read from the map tasks’ local storage
 The key range is sorted and all the same keys
are grouped together
 The reduce task is executed to produce the final
output, stored into the global file system
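 The user-defined partition function mentioned above can be sketched as follows,
assuming the common hash-based default, i.e., hash(key) mod R with R the number of
reduce tasks
using System;

static class IntermediatePartitioner
{
    // Maps an intermediate key to one of the R partitions / reduce tasks.
    public static int PartitionFor(string key, int reduceTasks)
    {
        // Mask the sign bit so the index always falls in [0, reduceTasks).
        return (key.GetHashCode() & int.MaxValue) % reduceTasks;
    }

    static void Main()
    {
        // All pairs sharing the same key end up in the same partition, and hence
        // are processed by the same reduce task.
        Console.WriteLine(PartitionFor("http://example.org/index.html", 4));
    }
}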
MapReduce Programming Model
(cont.)
 The above process is completely automatic
 Users may control it through configuration
parameters
 These allow specifying the number of map tasks, the
number of partitions into which to separate the final
output, and the partition function for the intermediate
key range
 As well as the map and reduce functions
 Besides orchestrating the execution of map
and reduce tasks
 The MapReduce runtime ensures a reliable
execution of applications by providing a fault-
tolerant infrastructure
 Failures of both master and worker processes are
handled
MapReduce Programming Model
(cont.)
 As well as machine failures that make intermediate
outputs not accessible
 Worker failures are handled by rescheduling map
tasks somewhere else
 Also the technique that is used to address
machine failures
 Since the valid intermediate output of map tasks has
become inaccessible
 Master process failure is instead addressed by
using check-pointing
 Allow restarting the MapReduce job with a minimum
loss of data and computation
Variations and Extensions of
MapReduce
 MapReduce constitutes a simplified model
for processing large quantities of data
 Imposes constraints on how distributed
algorithms should be organized to run over a
MapReduce infrastructure
 The model can be applied to several different
problem scenarios
 The abstractions provided to process data are simple
 Complex problems may require considerable effort to
be represented in terms of map and reduce functions
 A series of extensions and variations to the
original MapReduce model have been proposed
 Extending MapReduce application space
 Providing an easier interface to developers for
designing distributed algorithms
Hadoop
 Apache Hadoop is a collection of software
projects for reliable and scalable distributed
computing
 The entire collection is an open source
implementation of the MapReduce framework
 Supported by a GFS-like distributed file system
 The initiative consists of mostly two projects
 Hadoop Distributed File System (HDFS)
 Hadoop MapReduce
 The former is an implementation of the Google
File System
 The latter provides the same features and
abstractions of Google MapReduce
Hadoop (cont.)
 Initially developed and supported by Yahoo
 Hadoop constitutes now the most mature and
large data Cloud application
 Has a very robust community of developers and users
supporting it
 Yahoo now runs the world's largest Hadoop
cluster
 Composed of 40,000 machines and more than
300,000 cores
 Made available to global academic institutions
 There is a collection of other projects
somehow related to it
 Providing services for distributed computing
Map-Reduce-Merge
 Map-Reduce-Merge is an extension to the
MapReduce model
 Introducing a third phase to the standard
MapReduce pipeline
 The Merge phase allows efficiently merging data
already partitioned and sorted (or hashed) by map
and reduce modules
 Simplifies the management of heterogeneous
related datasets
 Provides an abstraction able to express the
common relational algebra operators as well as
several join algorithms
Aneka MapReduce Programming
 Aneka provides an implementation of the
MapReduce abstractions
 By following the reference model introduced by
Google and implemented by Hadoop
 MapReduce is supported as one of the
available programming models
 Can be used to develop distributed applications
Introducing the MapReduce
Programming Model
 The MapReduce Programming Model
defines the abstractions and the runtime
support
 For developing MapReduce applications on top of
Aneka
 An overview of the infrastructure
supporting MapReduce in Aneka
 A MapReduce job in Google MapReduce or
Hadoop corresponds to the execution of a
MapReduce application in Aneka
 The application instance is specialized with
components that identify the map and reduce
functions to use
Introducing the MapReduce
Programming Model (cont.)
 These functions are expressed in terms of
Mapper and Reducer classes
 Extended from the Aneka MapReduce APIs
 The runtime support is constituted by three
main elements
 MapReduce Scheduling Service
 Plays the role of the master process in the Google and
Hadoop implementation
 MapReduce Execution Service
 Plays the role of the worker process in the Google and
Hadoop implementation
 A specialized distributed file system
 Used to move data files
Introducing the MapReduce
Programming Model (cont.)
 Client components are used to submit the
execution of a MapReduce job, upload data
files, and monitor it
 Namely, the MapReduceApplication class
 The management of data files is
transparent
 Local data files are automatically uploaded to
Aneka
 Output files are automatically downloaded to the
client machine if requested
Programming Abstractions
 Aneka executes any piece of user code
within the context of a distributed
application
 Maintained even in the MapReduce model
 There is a natural mapping between the concept of
MapReduce job used in Google MapReduce and
Hadoop and the Aneka application concept
 Task creation is not the responsibility of the user
but of the infrastructure
 Once the user has defined the map and reduce
functions
 The Aneka MapReduce APIs provide developers
with base classes for developing Mapper and
Reducer types
Programming Abstractions (cont.)
 Use a specialized type of application class,
MapReduceApplication, that better supports the
needs of this programming model
 An overview of the client components
defining the MapReduce programming model
 Three classes are of interest for application
development
 Mapper<K,V>, Reducer<K,V>, and
MapReduceApplication<M,R>
 The other classes are internally used to implement
all the functionalities required by the model
 Expose simple interfaces that require a minimum
amount of coding for implementing the map and reduce
functions and controlling the job submission
Programming Abstractions (cont.)
 Mapper<K,V> and Reducer<K,V> constitute the
starting point of the application design and
implementation
 Template specialization is used to keep track of keys
and values types on which the two functions operate
 Generic types provide a more natural approach in
terms of object manipulation from within the map and
reduce methods
 Simplify the programming by removing the necessity
of casting and other type check operations
 The submission and execution of a MapReduce
job is performed through the class
MapReduceApplication<M,R>
 Provides the interface to the Aneka Cloud to support
MapReduce programming model
Programming Abstractions (cont.)
 The above class exposes two generic class types
 M and R
 These two placeholders identify the specific class types of
Mapper<K,V> and Reducer<K,V> that will be used by
the application
 The definition of the Mapper<K,V> class and
of the related types for implementing the
map function
 To implement a specific mapper, it is necessary to
inherit this class and provide actual types for the key K
and the value V
 The map operation is implemented by overriding
the abstract method void Map(IMapInput<K,V>
input)
Programming Abstractions (cont.)
 The other methods are internally used by the
framework
 IMapInput<K,V> provides access to the input key-
value pair on which the map operation is performed
 The implementation of the Mapper<K,V>
component for the Word Counter sample
 Counts the frequency of words in a set of large
text files
 The text files are divided into lines
 Each of the lines will become the value component of
a key-value pair
 The key will be represented by the offset in the file
where the line begins
 The mapper is specialized by using a long integer
as key type and a string for the value
Programming Abstractions (cont.)
 To count the frequency of words, the map
function will emit a new key-value pair for each
of the words contained in the line
 By using the word as the key and the number 1 as
value
 This implementation will emit two pairs for the same
word if the word occurs twice in the line
 The reducer appropriately sums all these
occurrences
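 A minimal sketch of what the WordCounterMapper might look like; the Emit method
used to produce the intermediate <word,1> pairs, the protected visibility of Map,
and the namespace are assumptions, since the text only names the Mapper<K,V> base
class and the Map(IMapInput<K,V>) method
using System;
// using Aneka.MapReduce;   // assumed namespace for Mapper<K,V> and IMapInput<K,V>

public class WordCounterMapper : Mapper<long, string>
{
    // Key: offset in the file where the line begins; value: the line of text itself.
    protected override void Map(IMapInput<long, string> input)
    {
        string line = input.Value;
        string[] words = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            // One <word, 1> pair per occurrence; duplicated words in the same line
            // produce duplicated pairs that the reducer will sum up.
            this.Emit(word, 1);   // Emit is assumed to be provided by the base class
        }
    }
}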
 The definition of the Reducer<K,V> class
 The implementation of a specific reducer requires
specializing the generic class
 Overriding the abstract method: Reduce
(IReduceInputEnumerator<V> input)
Programming Abstractions (cont.)
 The reduce operation is applied to a collection of
values that are mapped to the same key
 The IReduceInputEnumerator<V> allows developers to
iterate over such collection
 How to implement the reducer function for
the word counter sample
 The Reducer<K,V> class is specialized using a
string as a key type and an integer as a value
 The reducer simply iterates over all the values
 Accessible through the enumerator, and sums them
 As the iteration is completed, the sum is dumped to file
 There is a link between the types used to
specialize the mapper and those used to specialize
the reducer
Programming Abstractions (cont.)
 The key and value types used in the reducer are those
defining the key-value pair emitted by the mapper
 In this case, the mapper generates a key-value
pair (string,int)
 The reducer is of type Reducer<string,int>
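 A matching sketch of the WordCounterReducer; the MoveNext()/Current enumerator
access and the Emit call for the aggregated value are assumptions based on the
description above
using System;
// using Aneka.MapReduce;   // assumed namespace for Reducer<K,V> and IReduceInputEnumerator<V>

public class WordCounterReducer : Reducer<string, int>
{
    // Receives all the values emitted by the mappers for the same word and sums them.
    protected override void Reduce(IReduceInputEnumerator<int> input)
    {
        int sum = 0;
        while (input.MoveNext())     // iterate over the values attached to the key
        {
            sum += input.Current;
        }
        this.Emit(sum);              // the aggregated count for the word (assumed API)
    }
}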
 The Mapper<K,V> and Reducer<K,V>
classes provide facilities
 For defining the computation performed by a
MapReduce job
 Aneka provides the MapReduceApplication
<M,R> class
 To submit, execute, and monitor the progress
 Represents the local view of distributed
applications on top of Aneka
Programming Abstractions (cont.)
 Such class provides limited facilities mostly
concerned with starting the MapReduce job and
waiting for its completion
 Due to the simplicity of the MapReduce model
 The interface of MapReduceApplication
<M,R>
 The interface of the class exhibits only the
MapReduce specific settings
 The control logic is encapsulated in the
ApplicationBase<M> class
 Possible to set the behavior of MapReduce for the
current execution
 The parameters that can be controlled
Programming Abstractions (cont.)
 Partitions
 Stores an integer containing the number of
partitions into which to divide the final results
 Determines also the number of reducer tasks that will
be created by the runtime infrastructure
 The default value is 10
 Attempts
 Contains the number of times that the runtime will
retry to execute a task before declaring it failed.
 The default value is 3
 UseCombiner
 Stores a boolean value indicating if the MapReduce
runtime should add a combiner phase to the map task
execution to reduce the number of intermediate files
that are passed to the reduce task
Programming Abstractions (cont.)
 The default value is set to true
 SynchReduce
 Stores a boolean value indicating whether to
synchronize the reducers or not
 The default value is set to true
 Not used to determine the behavior of MapReduce
 IsInputReady
 A boolean property indicating whether the input files
are already stored into the distributed file system
 Or have to be uploaded by the client manager before
executing the job
 The default value is set to false
 FetchResults
Programming Abstractions (cont.)
 A boolean property indicating whether the client
manager needs to download to the local computer the
result files produced by the execution of the job
 The default value is set to true
 LogFile
 Contains a string defining the name of the log file
used to store the performance statistics recorded
during the execution of the MapReduce job
 The default value is mapreduce.log
 The core control logic governing the
execution of a MapReduce job
 Resides within the MapReduceApplicationManager
<M,R> class
 Interacts with the MapReduce runtime
Programming Abstractions (cont.)
 Developers can control the application by
using the methods and the properties
 Exposed by the ApplicationBase<M> class
 The collection of methods of interest in the
above class for executing MapReduce jobs
 The constructors and the common properties that
are of interest of all the applications
 Two different overloads of the InvokeAndWait
method
 The first one simply starts the execution of the
MapReduce job and returns upon its completion
 The second one executes a client supplied callback at
the end of the execution
Programming Abstractions (cont.)
 Most commonly used to execute MapReduce jobs
 The use of InvokeAndWait is blocking
 Not possible to stop the application by calling
StopExecution within the same thread
 If necessary to implement a more sophisticated
management of the MapReduce job
 Possible to use the SubmitExecution method
 Submits the execution of the application and returns
without waiting for its completion
 The MapReduce implementation will
automatically upload all the files
 Found in the Configuration.Workspace directory
 Ignore the files added by the AddSharedFile methods
Programming Abstractions (cont.)
 How to create a MapReduce application for
running the Word Counter sample
 Defined by the previous WordCounterMapper and
WordCounterReducer classes
 The lines of interest are those highlighted in
the try { … } catch { … } finally { … } block
 The execution of a MapReduce Job requires only
three lines of code
 The user reads the configuration file
 Creates a MapReduceApplication<M,R> instance and
configures it
 Starts the execution
 All the rest of the code is mostly concerned with
setting up the logging and handling exceptions
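 A hedged sketch of such a program; the Configuration.GetConfiguration factory,
the constructor arguments of MapReduceApplication<M,R>, the property names, and the
callback signature are assumptions inferred from the description, not the exact SDK
listing
using System;
// using Aneka;             // assumed namespace for Configuration and ApplicationEventArgs
// using Aneka.MapReduce;   // assumed namespace for MapReduceApplication<M,R>

public class WordCounterDriver
{
    public static void Main(string[] args)
    {
        // 1. Read the configuration file (file name is illustrative).
        Configuration conf = Configuration.GetConfiguration("wordcounter.conf.xml");

        // 2. Create the application instance, specialized with the mapper and the
        //    reducer defined before, and configure its MapReduce-specific settings.
        var app = new MapReduceApplication<WordCounterMapper, WordCounterReducer>("WordCounter", conf);
        app.Partitions = 4;        // number of partitions / reduce tasks (default: 10)
        app.FetchResults = true;   // download the result files once the job completes

        // 3. Start the execution; this overload blocks until the job is completed and
        //    then invokes the client-supplied callback.
        app.InvokeAndWait(OnDone);
    }

    private static void OnDone(object sender, ApplicationEventArgs e)
    {
        Console.WriteLine("MapReduce job completed.");
    }
}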
Runtime Support
 The runtime support for the execution of
MapReduce jobs is constituted by the
collection of services
 Deal with the scheduling and the execution of
MapReduce tasks
 MapReduce Scheduling Service and the MapReduce
Execution Service
 Integrated with the existing services of the framework
to provide persistence and application accounting
 Also with the features available for the applications
developed with other programming models
 Job and Task Scheduling
 The scheduling of jobs and tasks is the responsibility
of the MapReduce Scheduling Service
Runtime Support (cont.)
 Covers the same role as the master process in the
Google MapReduce implementation
 The architecture of the scheduling service is
organized into two major components
 The MapReduceSchedulerService and the
MapReduceScheduler
 The former is a wrapper around the scheduler
implementing the interfaces required by Aneka to
expose a software component as a service
 The latter controls the execution of jobs and schedules
tasks
 The main role of the service wrapper is to
translate messages coming from the Aneka
runtime or the client applications
 Into calls or events directed to the scheduler
component and vice versa
Runtime Support (cont.)
 The relationship between the two components is as follows
 The core functionalities for job and task
scheduling are implemented in the
MapReduceScheduler class
 The scheduler manages multiple queues for
several operations
 Uploading input files into the distributed file system
 Initializing jobs before scheduling
 Scheduling map and reduce tasks
 Keeping track of unreachable nodes
 Re-submitting failed tasks
 Reporting execution statistics
 All are performed asynchronously and triggered by
events happening in the Aneka middleware
Runtime Support (cont.)
 Task Execution
 The execution of tasks is controlled by the
MapReduce Execution Service
 Plays the role of the worker process in the Google
MapReduce implementation.
 The service manages the execution of map and
reduce tasks
 Also performs other operations such as sorting and
merging of intermediate files
 The internal organization of the above
service
 Three major components that coordinate
together for executing tasks
Runtime Support (cont.)
 MapReduceExecutorService, ExecutorManager, and
MapReduceExecutor
 The MapReduceExecutorService interfaces the
ExecutorManager with the Aneka middleware
 The ExecutorManager is in charge of keeping
track of the tasks being executed
 By delegating the execution of each task to the
MapReduceExecutor
 Also of sending the statistics about the execution back
to the scheduler service
 It is possible to configure more than one
MapReduceExecutor instance
 Helpful in case of multi-core nodes where more than
one task can be executed at the same time
Distributed File System Support
 Differently from the other programming
models supported by Aneka
 The MapReduce model does not leverage the
default Storage Service for storage and data
transfer
 But uses a distributed file system implementation
 The requirements in terms of file management
are significantly different from those of the other models
 MapReduce has been designed to process
large quantities of data
 Stored in files of large dimensions
 The support provided by a distributed file system
is more appropriate
Distributed File System Support
(cont.)
 Can leverage multiple nodes for storing data
 Guarantee high availability and a better efficiency by
means of replication and distribution
 The original MapReduce implementation
assumes the existence of a distributed and
reliable storage
 The use of a distributed file system for
implementing the storage layer is natural
 Aneka provides the capability of interfacing
with different storage implementations
 Maintains the same flexibility for the integration
of a distributed file system
Distributed File System Support
(cont.)
 The level of integration required by
MapReduce requires the ability to perform
 Retrieving the location of files and file chunks
 Useful to the scheduler for optimizing the scheduling
of map and reduce tasks according to the location of
data
 Accessing a file by means of a stream
 Required for the usual I/O operations to and from data
files
 The stream might also access the network
 If the file chunk is not stored on the local node
 Aneka provides interfaces that allow performing
such operations
Distributed File System Support
(cont.)
 Also the capability to plug different file systems behind
these operations by providing the appropriate
implementation
 The current implementation provides bindings to
HDFS
 On top of these low level interfaces
 The MapReduce programming model offers
classes to read from and write to files in a
sequential manner
 The classes SeqReader and SeqWriter
 Provide sequential access for reading and writing key-
value pairs
 The above two classes expect a specific file
format
Distributed File System Support
(cont.)
 An Aneka MapReduce file is composed of a
header used to identify the file and a sequence of
record blocks
 Each of them storing a key-value pair
 The header is composed of 4 bytes
 The first three bytes represent the character sequence
SEQ
 The fourth byte identifies the version of the file
 The record block is composed as follows
 The first 8 bytes are used to store two integers
representing the length of the rest of the block and
the length of the key section
 The remaining part of the block stores the data of the
value component of the pair
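 The record layout can be illustrated with a small writer sketch; the real SeqWriter
handles this transparently, and the byte order, string encoding, and exact boundary
of the length fields below are assumptions
using System;
using System.IO;
using System.Text;

// Illustrative only: writes the header and a sequence of record blocks as described above.
static class AnekaSeqFormatSketch
{
    public static void Write(string path, (string key, string value)[] pairs)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            // Header: the three characters 'S', 'E', 'Q' followed by a version byte.
            writer.Write(new byte[] { (byte)'S', (byte)'E', (byte)'Q', 1 });

            foreach (var (key, value) in pairs)
            {
                byte[] keyBytes = Encoding.UTF8.GetBytes(key);
                byte[] valueBytes = Encoding.UTF8.GetBytes(value);
                // First 8 bytes of the block: length of the rest of the block and
                // length of the key section (the exact boundary is an interpretation).
                writer.Write(keyBytes.Length + valueBytes.Length);
                writer.Write(keyBytes.Length);
                writer.Write(keyBytes);
                writer.Write(valueBytes);   // value component of the pair
            }
        }
    }

    static void Main()
    {
        Write("sample.seq", new[] { ("the", "2"), ("fox", "1") });
    }
}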
Distributed File System Support
(cont.)
 The SeqReader and SeqWriter classes are
designed to read and write files in this format
 By transparently handling the file format information
and translating key and value instances to and from
their binary representation
 All the .NET built-in types are supported
 MapReduce jobs very often operate on
data represented in common text files
 A specific version of the SeqReader and
SeqWriter classes has been designed
 To read and write text files as a sequence of key-value
pairs
 In the case of the read operation
Distributed File System Support
(cont.)
 Each value of the pair is represented by a line in the
text file
 The key is automatically generated and assigned to
the position in bytes where the line starts in the file
 In case of the write operation
 The writing of the key is skipped and the values are
saved as single lines
 The interface of the SeqReader and
SeqWriter classes
 The SeqReader class provides an enumerator
based approach
 Possible to access to the key and the value
sequentially by calling the NextKey() and the
NextValue() methods respectively
Distributed File System Support
(cont.)
 It is also possible to access the raw byte data of keys
and values by using the NextRawKey() and
NextRawValue() methods
 The HasNext() returns a Boolean indicating whether
there are more pairs to read or not
 The SeqWriter class exposes different versions of
the Append method
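A short writer-side usage sketch: the Append overload shown below, taking a key and a value instance, is an assumption; only the existence of several Append overloads is stated above.

// Usage sketch only; the (key, value) Append overload is assumed.
using System.Collections.Generic;

static class SeqWriterUsageSketch
{
    public static void SaveCounts(SeqWriter writer, IDictionary<string, long> counts)
    {
        foreach (KeyValuePair<string, long> pair in counts)
        {
            // Each pair is serialized into the sequence-file format described earlier.
            writer.Append(pair.Key, pair.Value);
        }
    }
}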
 A practical use of the SeqReader class is the
implementation of the callback used in the Word
Counter sample
 To visualize the results of the application, the
SeqReader class is used to read the content of the
output files
 And dump it into a proper textual form
Distributed File System Support
(cont.)
 Can be visualized with any text editor, such as the
Notepad application
 The OnDone callback checks whether the
application has terminated successfully
 In case there are no errors, iterates over the result
files downloaded in the workspace output directory
 By default, the files are saved in the output
subdirectory of the workspace directory
 For each of the result files
 Opens a SeqReader instance on it
 Dumps the content of the key-value pairs into a text
file, which can be opened by any text editor
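A condensed sketch of this callback logic follows; the method signature, the SeqReader constructor, and the wordcounts.txt file name are simplifications and assumptions, while the HasNext()/NextKey()/NextValue() loop is the one described above.

// Simplified sketch of the Word Counter OnDone logic; signatures are assumptions.
using System.IO;

static class WordCounterCallbackSketch
{
    public static void OnDone(bool succeeded, string workspace)
    {
        if (!succeeded)
        {
            return; // error reporting is discussed later, in the driver program
        }

        string outputDir = Path.Combine(workspace, "output");
        string resultFile = Path.Combine(workspace, "wordcounts.txt"); // hypothetical file name

        using (StreamWriter text = new StreamWriter(resultFile))
        {
            foreach (string file in Directory.GetFiles(outputDir))
            {
                SeqReader reader = new SeqReader(file); // constructor signature is an assumption
                while (reader.HasNext())
                {
                    text.WriteLine("{0}: {1}", reader.NextKey(), reader.NextValue());
                }
            }
        }
    }
}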
Example Application
 MapReduce is a very useful model for
processing large quantities of data
 Which in many cases are maintained in a semi-
structured form, such as logs or web pages
 To demonstrate how to program real
applications with Aneka MapReduce
 Consider a very common task: log parsing
 Design a MapReduce application that processes
the logs produced by the Aneka container
 To extract some summary information about the
behavior of the Cloud
 Design the Mapper and Reducer classes
 Used to execute log parsing and data extraction
operations
Parsing Aneka Logs
 Aneka components produce a lot of
information
 e.g., Daemons, container instances, and services
 This information is stored in the form of log files
 The most relevant information is stored in
the container instance logs
 Store the information about the applications that
are executed on the Cloud
 Parse these logs to extract useful information about
the execution of applications and the usage of services
in the Cloud
 The entire framework leverages the log4net
library
Parsing Aneka Logs (cont.)
 For collecting and storing the log information
 The system is configured to produce a log
file
 Partitioned in chunks every time the container
instance restarts
 The appearance of the information contained in the
log file can be customized
 Currently the default layout is
DD MMM YY hh:mm:ss level - message
 Some examples of formatted log messages
 The message part of almost all the log lines
exhibits a similar structure
Parsing Aneka Logs (cont.)
 They start with the information about the component that
entered the log line
 This can be easily extracted by looking for the first
occurrence of the ‘:’ character
 That follows a sequence of characters containing no
spaces
15 Mar 2011 10:30:07 DEBUG - SchedulerService:HandleSubmitApplication - SchedulerService: …
15 Mar 2011 10:30:07 INFO - SchedulerService: Scanning candidate storage …
15 Mar 2011 10:30:10 INFO - Added [WU: 51d55819-b211-490f-b185-8a25734ba705, 4e86fd02…
15 Mar 2011 10:30:10 DEBUG - StorageService:NotifyScheduler - Sending FileTransferMessage…
15 Mar 2011 10:30:10 DEBUG - IndependentSchedulingService:QueueWorkUnit–Queueing…
15 Mar 2011 10:30:10 INFO - AlgorithmBase::AddTasks[64] Adding 1 Tasks
15 Mar 2011 10:30:10 DEBUG - AlgorithmBase:FireProvisionResources - Provision Resource: 1
Parsing Aneka Logs (cont.)
 Possible information to extract
from such logs
 The distribution of log messages according to the
level
 The distribution of log messages according to the
components
 This information can be easily extracted and
composed into a single view
 By creating Mapper tasks that count the occurrences
of log levels and component names
 Emit one simple key-value pair in the form (level-
name, 1) or (component-name, 1) for each of the
occurrences
Parsing Aneka Logs (cont.)
 The Reducer task will simply sum up all the key-value
pairs that have the same key
 The structure of the map and reduce
functions will be the following
map: (long, string) => (string, long)
reduce: (string, long) => (string, long)
 The Mapper class receives a key-value pair
 Containing the position of the line inside the file as the
key and the log message as the value component
 It produces a key-value pair containing a string
representing the name of the log level or the
component name, and 1 as the value
 The Reducer class will sum up all the key-value
pairs that have the same key
Parsing Aneka Logs (cont.)
 Both analyses can be performed simultaneously
 Without developing two different MapReduce jobs
 The operation performed by the Reducer class is
the same in both cases
 The operation of the Mapper class changes
 The type of the key-value pair that is generated is the
same for the two jobs
 Possible to combine the two tasks performed by the
map function into one single Mapper class to produce
two key-value pairs for each input line
 By differentiating the name of Aneka components from
the log level names by using an initial underscore
character
 This makes it very easy to post-process the output of the reduce
function in order to present and organize the data
Mapper Design and Implementation
 The operation performed by the map
function is a very simple text extraction
 It identifies the logging level and the name
of the component that entered the information in the
log
 Once this information is extracted, a key-value pair
(string, long) is emitted by the function
 To combine the two MapReduce jobs into one
single job, each map task will emit at most two
key-value pairs
 Some of the log lines do not record the name of
the component that entered the line
 For these lines, only the key-value pair corresponding
to the log level will be emitted
Mapper Design and Implementation
(cont.)
 The implementation of the mapper class for
the log parsing task
 The Map method simply locates the position of the
log level label in the line and extracts it
 Emits a corresponding key-value pair (label, 1)
 It then tries to locate the name of the
Aneka component that entered the log line
 By looking for a sequence of characters delimited by a colon
 If such a sequence does not contain spaces, it
represents the name of the Aneka component
 Another key-value pair (component-name, 1) is emitted
 To differentiate it from the log level labels, an underscore is
prefixed to the key in this second case (see the sketch below)
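The extraction logic can be illustrated with the plain C# helper below. It is not the Aneka Mapper base class, only a sketch of what the Map method does with each line; the set of level labels and the search heuristics are assumptions consistent with the sample log lines shown earlier.

// Illustrative extraction logic only; not the Aneka Mapper API.
using System.Collections.Generic;

static class LogLineParserSketch
{
    static readonly string[] Levels = { "DEBUG", "INFO", "WARN", "ERROR" };

    // Returns the pairs a map task would emit for one log line:
    // (level, 1) and, when a component name can be found, ("_" + component, 1).
    public static IEnumerable<KeyValuePair<string, long>> Extract(string line)
    {
        foreach (string level in Levels)
        {
            int pos = line.IndexOf(" " + level + " ");
            if (pos < 0) continue;

            yield return new KeyValuePair<string, long>(level, 1L);

            // The candidate component name follows the dash after the level and is
            // terminated by the first ':'; it must not contain spaces.
            int dash = line.IndexOf('-', pos);
            int colon = dash < 0 ? -1 : line.IndexOf(':', dash + 1);
            if (dash >= 0 && colon > dash)
            {
                string candidate = line.Substring(dash + 1, colon - dash - 1).Trim();
                if (candidate.Length > 0 && !candidate.Contains(" "))
                {
                    yield return new KeyValuePair<string, long>("_" + candidate, 1L);
                }
            }
            yield break;
        }
    }
}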
Reducer Design and Implementation
 The implementation of the reduce function
is even more straightforward
 The only operation that needs to be performed is
to add all the values associated with the same key
 Emit a key-value pair with the total sum
 The infrastructure will already aggregate all the
values that have been emitted for a given key
 We simply need to iterate over the collection of values
and sum them up
 The operation is actually the same for both types of key-
value pairs extracted from the log lines
 It is the responsibility of the driver program to
differentiate among the types of
information assembled in the output files (a sketch of the aggregation follows)
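The aggregation can be illustrated locally, independently of the Aneka Reducer base class: the runtime groups the emitted pairs by key, and the reduce step only sums the values of each group. The sketch below reproduces that behavior with LINQ.

// Local illustration of grouping and summing; not the Aneka Reducer API.
using System.Collections.Generic;
using System.Linq;

static class ReduceSketch
{
    public static IEnumerable<KeyValuePair<string, long>> SumByKey(
        IEnumerable<KeyValuePair<string, long>> emitted)
    {
        // The MapReduce runtime performs the grouping; the reduce function only
        // has to iterate over the values of one key and sum them up.
        return emitted
            .GroupBy(pair => pair.Key)
            .Select(group => new KeyValuePair<string, long>(
                group.Key, group.Sum(pair => pair.Value)));
    }
}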
Driver Program
 The LogParsingMapper and
LogParsingReducer constitute the core
functionality of the MapReduce job
 The job only needs to be properly configured in the
main program to process and produce text files
 The mapper component extracts two different
types of information
 Another task performed in the driver application is the
separation of these two statistics into two different
files for further analysis
 The implementation of the driver program
 Three things to be noticed
 The configuration of the MapReduce job
Driver Program (cont.)
 The post processing of the result files
 The management of errors
 The configuration of the job is performed in the
Initialize method
 Reads the configuration file from the local file system
 Ensures that the input and output formats of files are
set to text
 MapReduce jobs can be configured by using a
specific section of the configuration file
 Named MapReduce
 Two sub-sections control the properties of input
and output files
 Named Input and Output respectively
Driver Program (cont.)
 The input and output sections may contain
the following properties
 Format (string)
 Defines the format of the input (or output) file
 If this property is set, the only supported value is text
 Filter (string)
 Defines the search pattern to be used to filter the
input files to process in the workspace directory
 Only applies to the Input properties group
 NewLine (string)
 Defines the sequence of characters used to detect (or
write) a new line in the text stream
 This value is meaningful when the input/output format
is set to text
Driver Program (cont.)
 The default value is selected from the execution
environment if not set
 Separator (character)
 Only present in the Output section
 Defines the character that needs to be used to separate
the key from the value in the output file
 This value is meaningful when the input/output format
is set to text
 Possible to control other parameters of a
MapReduce job
 These parameters are defined in the main
MapReduce configuration section
 Instead of a programmatic approach for the
initialization of the configuration
Driver Program (cont.)
 Also possible to embed these settings into the
standard Aneka configuration file
 Possible to open a <Group name="MapReduce"> … </Group> tag
 And enter all the properties required for the execution
 The Aneka configuration file is based on a flexible
framework that allows groups of
name-value properties to be entered simply
 The Aneka.Property and Aneka.PropertyGroup
classes also provide facilities
 For converting the strings representing the value of a
property into a corresponding built-in type if possible
 Simplifies the task of reading from and writing to
configuration objects
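A hypothetical sketch of such a group is shown below. The Input and Output sub-groups and the property names follow the list given earlier, but the exact XML element and attribute names used by the Aneka configuration file are assumptions.

<Group name="MapReduce">
  <Group name="Input">
    <Property name="Format" value="text" />
    <Property name="Filter" value="*.log" />
  </Group>
  <Group name="Output">
    <Property name="Format" value="text" />
    <Property name="Separator" value="," />
  </Group>
</Group>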
Driver Program (cont.)
 The second element is the post-processing of the output files
 Performed in the OnDone method
 Its invocation is triggered either if an error occurs
during the execution of the MapReduce job
 Or if it completes successfully
 Separates the occurrences of log level labels from
the occurrences of Aneka component names
 By saving them into two different files: loglevels.txt
and components.txt under the workspace directory
 Deletes the output directory where the output
files of the reduce phase have been downloaded
 These two files contain the aggregated
results of the analysis
Driver Program (cont.)
 They can be used to extract statistical information about the
content of the log files and display it graphically
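A condensed C# sketch of this splitting step is shown below; it assumes the reduce output has already been read back as key-value pairs, uses the file names given above, and is otherwise illustrative rather than the actual driver code.

// Illustrative splitting of the aggregated results; not the actual driver code.
using System.Collections.Generic;
using System.IO;

static class ResultSplitterSketch
{
    public static void Split(IEnumerable<KeyValuePair<string, long>> results, string workspace)
    {
        using (StreamWriter levels = new StreamWriter(Path.Combine(workspace, "loglevels.txt")))
        using (StreamWriter components = new StreamWriter(Path.Combine(workspace, "components.txt")))
        {
            foreach (KeyValuePair<string, long> pair in results)
            {
                // Component names were emitted by the mapper with a leading underscore.
                if (pair.Key.StartsWith("_"))
                {
                    components.WriteLine("{0},{1}", pair.Key.Substring(1), pair.Value);
                }
                else
                {
                    levels.WriteLine("{0},{1}", pair.Key, pair.Value);
                }
            }
        }
    }
}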
 A final aspect is the management of errors
 Aneka provides a collection of APIs contained in
the Aneka.Util library
 These represent utility classes for automating tedious tasks
 e.g., Collecting the stack trace
information associated with an exception, or the
information about the types of the exceptions thrown
 The reporting features triggered upon exceptions
are implemented in the ReportError method
 It utilizes the facilities offered by the IOUtil class to
dump a simple error report to both the console and a
log file named using the pattern: error.YYYY-MM-DD_hh-mm-ss.log
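A generic sketch of what such a reporting helper does is given below; it does not reproduce the actual IOUtil facilities, it only dumps the exception to the console and to a log file following the naming pattern above.

// Generic sketch in the spirit of ReportError; the actual IOUtil class is not shown here.
using System;
using System.IO;

static class ErrorReportSketch
{
    public static void ReportError(Exception error)
    {
        string report = error.ToString(); // includes type, message, and stack trace
        string fileName = "error." + DateTime.Now.ToString("yyyy-MM-dd_HH-mm-ss") + ".log";

        Console.Error.WriteLine(report);
        File.WriteAllText(fileName, report);
    }
}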
Running the Application
 Aneka produces a considerable amount of
logging information
 The default configuration of the logging
infrastructure creates a new log file for each
activation of the Container process
 Or as soon as the size of the log file exceeds 10 MB
 By simply keeping an Aneka Cloud running for a few
days, it is quite easy to collect enough data to mine
with the above sample application
 Also constitutes a real case study for MapReduce
 One of its most common practical applications is
extracting semi-structured information from logs and
traces of execution
Running the Application (cont.)
 In the execution of the test
 Uses a distributed infrastructure consisting of 7
worker nodes and one master node
 Interconnected through a LAN
 Processes 18 log files of various sizes, for an
aggregate size of 122 MB
 The execution of the MapReduce job over the
collected data produces the results
 Stored in the loglevels.txt and components.txt files
 Represented graphically
 There is a considerable amount of unstructured
information in the log files produced by the
Container processes
Running the Application (cont.)
 About 60% of the log content is skipped during
the classification
 This content is most likely the result of stack
trace dumps into the log file
 ERROR and WARN entries produce a sequence of
follow-up lines that are not recognized by the parser
 The distribution among the components
that make use of the logging APIs
 This distribution is computed over the data
recognized as a valid log entry
 Only about 20% of these entries were not
recognized by the parser implemented in the map
function
Running the Application (cont.)
Distribution of log message levels: DEBUG 12%, ERROR 4%, INFO 22%, WARN 0%, SKIPPED 62%
Running the Application (cont.)
 The meaningful information extracted from the
log analysis constitutes about 32% of the entire
log data
 i.e., 80% of 40% of the total lines parsed
 Despite the simplicity of the parsing
function implemented in the map task
 The Aneka MapReduce programming model can
be used to easily perform massive data analysis
tasks
 We have logically and programmatically approached a realistic
data analysis case study with MapReduce
 And implemented it on top of the Aneka APIs
Running the Application (cont.)
3
189439
54878
10 180
55523
6 984 4385
5 5
76895
31 10 610 610

More Related Content

Similar to 12575474.ppt

High level view of cloud security
High level view of cloud securityHigh level view of cloud security
High level view of cloud securitycsandit
 
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONSHIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONScscpconf
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyRohit Dubey
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewIJERA Editor
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challengesijcisjournal
 
An Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataAn Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataIJSTA
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Prof.Balakrishnan S
 
kambatla2014.pdf
kambatla2014.pdfkambatla2014.pdf
kambatla2014.pdfAkuhuruf
 
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET Journal
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET Journal
 
Cloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfCloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfkalai75
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL TechnologiesAmit Singh
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringIRJET Journal
 
Aginity "Big Data" Research Lab
Aginity "Big Data" Research LabAginity "Big Data" Research Lab
Aginity "Big Data" Research Labkevinflorian
 

Similar to 12575474.ppt (20)

High level view of cloud security
High level view of cloud securityHigh level view of cloud security
High level view of cloud security
 
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONSHIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
HIGH LEVEL VIEW OF CLOUD SECURITY: ISSUES AND SOLUTIONS
 
Big Data
Big DataBig Data
Big Data
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –Review
 
A Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
 
Big Data
Big DataBig Data
Big Data
 
An Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataAn Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional Data
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
kambatla2014.pdf
kambatla2014.pdfkambatla2014.pdf
kambatla2014.pdf
 
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
 
IRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articlesIRJET- Systematic Review: Progression Study on BIG DATA articles
IRJET- Systematic Review: Progression Study on BIG DATA articles
 
Cloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfCloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdf
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL Technologies
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
Big data and oracle
Big data and oracleBig data and oracle
Big data and oracle
 
Aginity "Big Data" Research Lab
Aginity "Big Data" Research LabAginity "Big Data" Research Lab
Aginity "Big Data" Research Lab
 

More from Dr. Thippeswamy S.

More from Dr. Thippeswamy S. (19)

Bacterial Examination of Water of different sources.ppt
Bacterial Examination of Water of different sources.pptBacterial Examination of Water of different sources.ppt
Bacterial Examination of Water of different sources.ppt
 
Seven QC Tools New approach.pptx
Seven QC Tools New approach.pptxSeven QC Tools New approach.pptx
Seven QC Tools New approach.pptx
 
Soil Erosion.pptx
Soil Erosion.pptxSoil Erosion.pptx
Soil Erosion.pptx
 
Database Normalization.pptx
Database Normalization.pptxDatabase Normalization.pptx
Database Normalization.pptx
 
cloudapplications.pptx
cloudapplications.pptxcloudapplications.pptx
cloudapplications.pptx
 
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
 
deploymentmodelsofcloudcomputing-230211123637-08174981.pptx
deploymentmodelsofcloudcomputing-230211123637-08174981.pptxdeploymentmodelsofcloudcomputing-230211123637-08174981.pptx
deploymentmodelsofcloudcomputing-230211123637-08174981.pptx
 
DBMS.pptx
DBMS.pptxDBMS.pptx
DBMS.pptx
 
Normalization DBMS.ppt
Normalization DBMS.pptNormalization DBMS.ppt
Normalization DBMS.ppt
 
Normalization Alg.ppt
Normalization Alg.pptNormalization Alg.ppt
Normalization Alg.ppt
 
introduction to matlab.pptx
introduction to matlab.pptxintroduction to matlab.pptx
introduction to matlab.pptx
 
Conceptual Data Modeling
Conceptual Data ModelingConceptual Data Modeling
Conceptual Data Modeling
 
Introduction to SQL
Introduction to SQLIntroduction to SQL
Introduction to SQL
 
Chp-1.pptx
Chp-1.pptxChp-1.pptx
Chp-1.pptx
 
Mod-2.pptx
Mod-2.pptxMod-2.pptx
Mod-2.pptx
 
module 5.1.pptx
module 5.1.pptxmodule 5.1.pptx
module 5.1.pptx
 
module 5.pptx
module 5.pptxmodule 5.pptx
module 5.pptx
 
Module 2 (1).pptx
Module 2 (1).pptxModule 2 (1).pptx
Module 2 (1).pptx
 
23. Journal of Mycology and Plant pathology.pdf
23. Journal of Mycology and Plant pathology.pdf23. Journal of Mycology and Plant pathology.pdf
23. Journal of Mycology and Plant pathology.pdf
 

Recently uploaded

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 

Recently uploaded (20)

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 

12575474.ppt

  • 1. Chapter 8 Data Intensive Computing: Map-Reduce Programming
  • 2. INTRODUCTION  Data intensive computing focuses on a class of applications  Deal with a large amount of data  Several application fields produce large volumes of data that need to be efficiently stored, made accessible, indexed, and analyzed  Ranging from computational science to social networking  Become challenging as the quantity of information accumulates and increases over time at higher rates  Distributed computing is definitely of help in addressing these challenges  By providing more scalable and efficient storage architectures
  • 3. INTRODUCTION (cont.)  And a better performance in terms of data computation and processing  The use of parallel and distributed techniques as a support of data intensive computing is not straightforward  Several challenges in form of data representation, efficient algorithms, and scalable infrastructures need to be faced  Characterizes the nature of data intensive computing  Presents an overview of the challenges introduced by production of large volumes of data
  • 4. INTRODUCTION (cont.)  And how they are handled by storage systems and computing models  MapReduce is a popular programming model  For creating data intensive applications and their deployment on Clouds  Practical examples of MapReduce applications for data intensive computing are demonstrated  By using Aneka MapReduce Programming Model
  • 5. What is Data-Intensive Computing  Data-intensive computing  Concerned with production, manipulation, and analysis of large-scale data  In the range of hundreds of megabytes (MB) to petabytes (PB) and beyond  The term dataset is commonly used to identify a collection of information elements  Relevant to one or more applications  Datasets are often maintained in repositories  Infrastructures supporting the storage, retrieval, and indexing of large amount of information  Metadata are attached to datasets to facilitate the classification and the search of relevant bits of information
  • 6. What is Data-Intensive Computing (cont.)  Data-intensive computations occur in many application domains  Computational science is one of the most popular  Scientific simulations and experiments are keen to produce, analyze, and process huge volumes of data  e.g., Hundreds of gigabytes of data are produced every second by telescopes mapping the sky  The collection of images of the sky easily reaches the scale of petabytes over a year  Bioinformatics applications mine databases  May end up containing terabytes of data  Earthquakes simulators process a massive amount of data
  • 7. What is Data-Intensive Computing (cont.)  Produced as a result of recording the vibrations of the Earth across the entire globe  Several IT industry sectors also require support for data-intensive computations  A customer data of any telecom company will easily be in the range of 10-100 terabytes  This volume of information is not only processed to generate bill statements  Also mined to identify scenarios, trends, and patterns helping these companies to provide a better service  Currently the US handset mobile traffic has reached 8 petabytes per month  Expected to grow up to 327 petabytes per month by 2015
  • 8. What is Data-Intensive Computing (cont.)  The scale of petabytes is even more common when considering IT giants  e.g., Google  Reported to process about 24 petabytes of information per day and to sort petabytes of data in hours  Social networking and gaming are two other sectors  Data intensive computing is now a reality  Facebook inbox search operations involve crawling about 150 terabytes of data  The whole uncompressed data stored by the distributed infrastructure reaches to 36 petabytes
  • 9. What is Data-Intensive Computing (cont.)  Zynga, social gaming platform, moves 1 petabyte of data daily  Reported to add 1000 servers every week to sustain to store the data generated by games
  • 10. Characterizing Data-Intensive Computations  Data-intensive applications do not only deal with huge volumes of data  Also exhibit compute-intensive properties  The domain of data-intensive computing  Data intensive applications handle datasets in the scale of multiple terabytes and petabytes  Datasets are commonly persisted in several formats and distributed across different locations  Process data in multistep analytical pipelines including transformation and fusion stages  The processing requirements scale almost linearly with the data size  Can be easily processed in parallel
  • 11. Characterizing Data-Intensive Computations (cont.)  Also need efficient mechanisms for data management, filtering and fusion, and efficient querying and distribution
  • 12. Challenges Ahead  The huge amount of data produced, analyzed, or stored imposes requirements on the supporting infrastructures and middleware  Hardly found in the traditional solutions for distributed computing  e.g., The location of data is crucial  The need for moving terabytes of data becomes an obstacle for high performing computations  Data partitioning help in improving the performance of data-intensive applications  As well as content replication and scalable algorithms
  • 13. Challenges Ahead (cont.)  Open challenges in data-intensive computing  Scalable algorithms  Can search and process massive datasets  New metadata management technologies  Can scale to handle complex, heterogeneous, and distributed data sources  Advances in high-performance computing platform  Aimed at providing a better support for accessing in- memory multi-terabyte data structures  High-performance, high-reliable, petascale distributed file systems
  • 14. Challenges Ahead (cont.)  Data signature generation techniques  For data reduction and rapid processing  New approaches to software mobility  For delivering algorithms able to move the computation where the data is located  Specialized hybrid interconnection architectures  Providing a better support for filtering multi-gigabyte data streams coming from high speed networks and scientific instruments  Flexible and high-performance software integration techniques  Facilitating the combination of software modules running on different platforms to quickly form analytical pipelines
  • 15. Data Clouds and Big Data  Large datasets have mostly been the domain of scientific computing  Starting to change as massive amount of data are being produced, mined, and crunched by companies providing Internet services  e.g., Searching, on-line advertisement, and social media  Critical for such companies to efficiently analyze these huge datasets  They constitute a precious source of information about their customers  Logs analysis is an example of data-intensive operation commonly performed in this context
  • 16. Data Clouds and Big Data (cont.)  Companies like Google have a massive amount of data in the form of logs that are daily processed by using their distributed infrastructure  They settle on analytic infrastructure that differs from the Grid-based infrastructure used by the scientific community  Cloud computing technologies support data-intensive computations  The term Big Data has become popular  Characterizes the nature of data-intensive computations nowadays  Currently identifies datasets that grow so large that they become complex to work with using on-hand database management tools
  • 17. Data Clouds and Big Data (cont.)  Relational databases and desktop statistics/visualization packages become ineffective for that amount of information  Requiring instead massively parallel software running on tens, hundreds, or even thousands of servers  Big Data problems are found in non- scientific application domains  e.g., Web logs, RFID, sensor networks, social networks, Internet text and documents  Internet search indexing, call detail records, military surveillance, medical records, photography archives, video archives  Large scale e-Commerce
  • 18. Data Clouds and Big Data (cont.)  The massive size of data  New data is accumulated with time rather than replacing the old ones  Big Data applies to datasets whose size is beyond the ability of commonly used software tools  To capture, manage, and process the data within a tolerable elapsed time  A constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single dataset  Cloud technologies support data-intensive computing in several ways
  • 19. Data Clouds and Big Data (cont.)  By providing a large amount of compute instances on demand  Can be used to process and analyze large datasets in parallel  By providing a storage system and other distributed data store architectures  Optimized for keeping large blobs of data  By providing frameworks and programming APIs  Optimized for the processing and management of large amount of data  Mostly coupled with a specific storage infrastructure to optimize the overall performance of the system  A Data Cloud is a combination of these components
  • 20. Data Clouds and Big Data (cont.)  e.g., the MapReduce framework  Provides the best performance when leveraging the Google File System on top of Google’s large computing infrastructure  The Hadoop system  The most mature, large, and open source Data Cloud  Consists of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce
  • 21. Databases and Data-Intensive Computing  Distributed databases have been considered the natural evolution of database management systems  As the scale of the datasets becomes unmanageable with a single system  A collection of data stored at different sites of a computer network  Each site may expose a degree of autonomy providing services for the execution of local applications  Also participates in the execution of a global application  Can be created by splitting and scattering the data of an existing database over different sites
  • 22. Databases and Data-Intensive Computing (cont.)  Or by federating together multiple existing databases  These systems are very robust  Provide distributed transaction processing, distributed query optimization, and efficient management of resources  Mostly concerned with datasets that can be expressed by using the relational model  The need to enforce ACID properties on data limits their abilities to scale as Data Clouds and Grids do  Atomicity, Consistency, Isolation, Durability
  • 23. Technologies for Data-Intensive Computing  Data-intensive computing concerns the development of applications  Mainly focused on processing large quantities of data  Storage systems and programming models constitute a natural classification of the technologies supporting data-intensive computing
  • 24. Storage Systems  Database management systems constitute the de-facto storage support for several types of applications  The relational model in its original formulation does not seem to be the preferred solution for supporting data analytics at a large scale  Due to the explosion of unstructured data in blogs, web pages, software logs, and sensor readings  New opportunities arise  Some factors contributing to this change  Growing of popularity of Big Data  The management of large quantities of data is not anymore a rare case
  • 25. Storage Systems (cont.)  Become common in several fields: scientific computing, enterprise applications, media entertainment, natural language processing, and social network analysis  The large volume of data imposes new and more efficient techniques for their management  Growing importance of data analytics in the business chain  The management of data is not considered anymore a cost but a key element to the business profit  This situation arises in popular social networks  e.g., Facebook concentrates the focus on the management of user profiles, interests, and connections among people  This massive amount of data, being constantly mined, requires new technologies and strategies supporting data analytics
  • 26. Storage Systems (cont.)  Presence of data in several forms, not only structured  What constitutes relevant information today exhibits a heterogeneous nature and appears in several forms and formats  Structured data is constantly growing as a result of the continuous use of traditional enterprise applications and system  The advances in technology and the democratization of the Internet as a platform where everyone can pull information has created a massive amount of unstructured information  Does not naturally fit into the relational model  New approaches and technologies for computing  Cloud computing promises access to a massive amount of computing capacity on demand
  • 27. Storage Systems (cont.)  Allows engineers to design software systems that incrementally scale to arbitrary degrees of parallelism  Not rare anymore to build software applications and services dynamically deployed on hundreds or thousands of nodes, belonging to the system for few hours or days  Classical database infrastructures are not designed to provide support to such a volatile environment  All these factors identify the need of new data management technologies  The major directions towards support for data- intensive computing are constituted by  Advances in distributed file systems for the management of raw data in the form of files  Distributed object stores  The spread of the NoSQL movement
  • 28. High Performance Distributed File Systems and Storage Clouds  Distributed file systems constitute the primary support for the management of data  Provide an interface where to store information in the form of files and later access them for read and write operations  Few of file systems specifically address the management of huge quantities of data on a large number of nodes  Mostly these file systems constitute the data storage support for large computing clusters, supercomputers, massively parallel architectures, and lately storage/computing Clouds
  • 29. Google File System (GFS)  GFS is the storage infrastructure  Supporting the execution of distributed applications in the Google’s computing Cloud  Has been designed to be a fault tolerant, high available, distributed file system  Built on commodity hardware and standard Linux operating systems  Not a generic implementation of a distributed file system  GFS specifically addresses the needs of Google in terms of distributed storage  Designed with the following assumptions  The system is built on top of commodity hardware that often fails
  • 30. Google File System (GFS) (cont.)  The system stores a modest number of large files, multi-GB files are common and should be treated efficiently, small files must be supported but there is no need to optimize for that  The workloads primarily consist of two kinds of reads: large streaming reads and small random reads  The workloads also have many large, sequential writes that append data to files  High-sustained bandwidth is more important than low latency  The architecture of the file system is organized into  A single master containing the metadata of the entire file system
  • 31. Google File System (GFS) (cont.)  A collection of chunk servers providing storage space  The system is composed by a collection of software daemons  Implement either the master server or the chunk server  A file is a collection of chunks  The size can be configured at file system level  Chunks are replicated on multiple nodes in order to tolerate failures  Clients look up the master server  Identify the specific chunk of a file they want to access
  • 32. Google File System (GFS) (cont.)  Once the chunk is identified, the interaction happens between the client and the chunk server  Applications interact through the file system with specific interfaces  Supporting the usual operations for file creation, deletion, read, and write  Also supporting snapshots and record append operations frequently performed by applications  GFS has been conceived by considering  Failures in a large distributed infrastructure are common rather than a rarity  A specific attention has been put in implementing a highly available, lightweight, and fault tolerant infrastructure
  • 33. Google File System (GFS) (cont.)  The potential single point of failure of the single master architecture  Addressed by giving the possibility of replicating the master node on any other node belonging to the infrastructure  A stateless daemon and extensive logging capabilities facilitate the system recovery from failures
  • 34. Not Only SQL (NoSQL) Systems  The term NoSQL is originally to identify a relational database  Does not expose a SQL interface to manipulate and query data  SQL: Structured Query Language  Relies on a set of UNIX shell scripts and commands to operate on text files containing the actual data  Strictly speaking, NoSQL cannot be considered a relational database  Not a monolithic piece of software organizing the information according to the relational model  Instead a collection of scripts
  • 35. Not Only SQL (NoSQL) Systems (cont.)  Allow users to manage most of the simplest and more common database tasks by using text files as information store  Now the term NoSQL label all those database management systems  Does not use a relational model  Provides simpler and faster alternatives for data manipulation  A big umbrella encompassing all the storage and database management systems  Differ in some way from the relational model  The general philosophy is to overcome the restrictions imposed by the relational model  To provide more efficient systems
  • 36. Not Only SQL (NoSQL) Systems (cont.)  Implies the use of tables without fixed schemas to accommodate a larger range of data types  Or avoiding joins to increase the performance and scale horizontally  Two main reasons have determined the growth of the NoSQL movement  Simple data models are often enough to represent the information used by applications  The quantity of information contained in unstructured formats has considerably grown in the last decade  Make software engineers look to alternatives more suitable to their specific application domain  Several different initiatives explored the use of non- relational storage systems
  • 37. Not Only SQL (NoSQL) Systems (cont.)  A broad classification distinguishes NoSQL implementations into  Document stores  Apache Jackrabbit, Apache CouchDB, SimpleDB, Terrastore  Graphs  AllegroGraph, Neo4j, FlockDB, Cerebrum  Key-value stores  A macro classification that is further categorized into  Key-value store on disk, key-value caches in RAM  Hierarchically key-value stores, eventually consistent key-value stores, and ordered key-value store  Multi-value databases  OpenQM, Rocket U2, OpenInsight
  • 38. Not Only SQL (NoSQL) Systems (cont.)  Object databases  ObjectStore, JADE, ZODB  Tabular stores  Google BigTable, Hadoop HBase, Hypertable  Tuple stores  Apache River
  • 39. Google Bigtable  Bigtable is the distributed storage system  Designed to scale up to petabytes of data across thousands of servers  Provides storage support for several Google applications exposing different types of workload  From throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users  The key design goals are wide applicability, scalability, high performance, and high availability  Organizes the data storage in tables whose rows are distributed over the distributed file system supporting the middleware, i.e., the Google File System  A table is a multidimensional sorted map
  • 40. Google Bigtable (cont.)  Indexed by a key represented by a string of arbitrary length  Organized in rows and columns  Columns can be grouped in column family  Allow for specific optimization for better access control, the storage and the indexing of data  Client applications are provided with very simple data access model  Allow them to address data at column level  Each column value is stored in multiple versions  Can be automatically time-stamped by Bigtable or by the client applications  Besides the basic data access
  • 41. Google Bigtable (cont.)  Bigtable APIs also allow more complex operations  e.g., Single row transactions and advanced data manipulation by means of the Sazwall scripting language or the MapReduce APIs  Sazwall is an interpreted procedural programming language developed at Google for the manipulation of large quantities of tabular data  Includes specific capabilities for supporting statistical aggregation of values read or computed from the input and other features that simplify the parallel processing of petabytes of data  An overview of the Bigtable infrastructure  A collection of processes that co-exist with other processes in a cluster-based environment  Bigtable identifies two kinds of processes
  • 42. Google Bigtable (cont.)  Master processes and tablet server processes  A tablet server is responsible for serving the requests for a given tablet  A contiguous partition of rows of a table  Each tablet server can manage multiple tablets  From ten to a thousand commonly  The master server is responsible  Of keeping track of the status of the tablet servers  Of the allocation of tablets to tablets servers  The master server constantly monitors the tablet servers to check whether they are alive  In case they are not reachable, the allocated tablets are reassigned and eventually partitioned to other servers
  • 43. Google Bigtable (cont.)  Chubby supports the activity of the master and tablet servers  A distributed highly available and persistent lock service  System monitoring and data access is filtered through Chubby  Also responsible of managing replicas and providing consistency among them  At the very bottom layer, the data is stored into the Google File System in the form of files  All the update operations are logged into the file for the easy recovery of data in case of failures  Or when tablets need to be reassigned to other servers  Bigtable uses a specific file format for storing the data of a tablet
  • 45. Google Bigtable (cont.)  Can be compressed for optimizing the access and the storage of data  Bigtable is the result of a study of the requirements of several distributed applications in Google  Serves as a storage backend for 60 applications and manages petabytes of data  e.g., Google Personalized Search, Google Analytics, Google Finance, and Google Earth
  • 46. Hadoop HBase  HBase is the distributed database  Supporting the storage needs of the Hadoop distributed programming platform  Inspired by Google Bigtable  Its main goal is to offer real time read/write operations for tables with billions of rows and millions of columns  By leveraging clusters of commodity hardware  Its internal architecture and logical model are very similar to Google Bigtable  The entire system is backed by the Hadoop Distributed File System (HDFS)  Mimics the structure and the services of GFS
  • 47. Programming Platforms  Platforms for programming data intensive applications provide abstractions  Helping to express the computation over a large quantity of information  And runtime systems able to efficiently manage huge volumes of data  Database management systems based on the relational model have been used  To express the structure and the connections between the entities of a data model  Have proven to be unsuccessful in the case of Big Data
  • 48. Programming Platforms (cont.)  Information is mostly found unstructured or semi-structured  Data is most likely to be organized in files of large size or a huge number of medium size files rather than rows in a database  Distributed workflows have often been used  To analyze and process large amounts of data  Introducing a plethora of frameworks for workflow management systems  Eventually incorporating capabilities to leverage the elastic features offered by Cloud computing  These systems are fundamentally based on the abstraction of task  Puts a lot of burden on the developer
  • 49. Programming Platforms (cont.)  Needs to deal with data management and, often, data transfer issues  Programming platforms for data intensive computing provide higher level abstractions  Focus on the processing of data  Move into the runtime system the management of transfers  Making the data always available where needed  The approach followed by the MapReduce programming platform  Expresses the computation in the form of two simple functions: map and reduce  Hides away the complexities of managing large and numerous data files into the distributed file system supporting the platform
  • 50. MapReduce Programming Model  MapReduce is a programming platform introduced by Google  For processing large quantities of data  Expresses the computation logic of an application into two simple functions: map and reduce  Data transfer and management is completely handled by the distributed storage infrastructure  i.e. The Google File System  In charge of providing access to data, replicating files, and moving them where needed  Developers do not have to handle these issues  Provided with an interface presenting data at a higher level: as a collection of key-value pairs
  • 51. MapReduce Programming Model (cont.)  The computation of the applications is organized in a workflow of map and reduce operations  Entirely controlled by the runtime system  Developers have only to specify how the map and reduce functions operate on the key-value pairs  The model is expressed in the form of the two functions defined as follows  map(k1,v1) → list(k2,v2)  reduce(k2, list(v2)) → list(v2)  The map function reads a key-value pair  Produces a list of key-value pairs of different types  The reduce function reads a pair composed of a key and a list of values
  • 52. MapReduce Programming Model (cont.)  Produces a list of values of the same type  The types (k1,v1,k2,v2) used in the expression of the two functions provide hints on  How these two functions are connected and are executed to carry out the computation of a MapReduce job  The output of map tasks is aggregated together by grouping the values according to their associated keys  Constitutes the input of reduce tasks  For each of the keys found, the reduce function reduces the list of attached values to a single value  The input of a computation is expressed as a collection of key-value pairs <k1,v1>  The final output is represented by a list of values: list(v2)
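To make the abstract signatures above concrete, the following is a minimal, self-contained C# sketch (no MapReduce runtime involved) that mimics the data flow for a word count: a map function turns (offset, line) pairs into (word, 1) pairs, a LINQ grouping stands in for the aggregation by key performed by the runtime, and a reduce function collapses each list of values into a single count. All names here are illustrative and not part of any MapReduce API.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class MapReduceDataFlowSketch
    {
        // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line.
        static IEnumerable<KeyValuePair<string, int>> Map(long offset, string line)
        {
            foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                yield return new KeyValuePair<string, int>(word, 1);
        }

        // reduce(k2, list(v2)) -> list(v2): collapse the values of one key into a single sum.
        static IEnumerable<int> Reduce(string word, IEnumerable<int> counts)
        {
            yield return counts.Sum();
        }

        static void Main()
        {
            string[] input = { "the quick brown fox", "the lazy dog" };

            // The GroupBy below stands in for the aggregation by key done by the runtime.
            var output = input
                .SelectMany((line, offset) => Map(offset, line))
                .GroupBy(pair => pair.Key, pair => pair.Value)
                .SelectMany(group => Reduce(group.Key, group),
                            (group, total) => new { Word = group.Key, Total = total });

            foreach (var pair in output)
                Console.WriteLine(pair.Word + ": " + pair.Total);   // e.g., "the: 2"
        }
    }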
  • 53. MapReduce Programming Model (cont.)  A reference workflow characterizing MapReduce computations  The user submits a collection of files that are expressed in the form of a list of <k1,v1> pairs  Specifies the map and reduce functions  These files are entered into the distributed file system supporting MapReduce  If necessary, partitioned to be the input of map tasks  Map tasks generate intermediate files that store collections of <k2, list(v2)> pairs  These files are saved into the distributed file system  The MapReduce runtime may finally aggregate the values corresponding to the same keys
  • 54. MapReduce Programming Model (cont.)  These files constitute the input of reduce tasks  Eventually produce output files in the form of list(v2)  The operation performed by reduce tasks is generally expressed as an aggregation of all the values mapped by a specific key  Responsibilities of the MapReduce runtime  The number of map and reduce tasks to create, the way in which files are partitioned with respect to these tasks, and how many map tasks are connected to a single reduce task  Responsibilities of the distributed file system supporting MapReduce  The way in which files are stored and moved
  • 56. MapReduce Programming Model (cont.)  The computation model expressed by MapReduce is very straightforward  Allows a greater productivity for those who have to code the algorithms for processing huge quantities of data  Proven to be successful in the case of Google  The majority of the information that needs to be processed is stored in textual form and represented by web pages or log files  Some of the examples showing the flexibility of MapReduce  Distributed Grep
  • 57. MapReduce Programming Model (cont.)  The grep operation performs the recognition of patterns within text streams  Performed across a wide set of files  MapReduce is leveraged to provide a parallel and faster execution of this operation  The input file is a plain text file  The map function emits a line into the output each time it recognizes the given pattern  The reduce task aggregates all the lines emitted by the map tasks into a single file  Count of URL-Access Frequency  MapReduce is used to distribute the execution of web- server log parsing  The map function takes as input the log of a web server
  • 58. MapReduce Programming Model (cont.)  Emits into the output file a key-value pair <URL,1> for each page access recorded in the log  The reduce function aggregates all these lines by the corresponding URL  Summing the single accesses and outputs a <URL, total-count> pair  Reverse Web-Link Graph  The Reverse web-link graph keeps track of all the possible web pages that lead to a given link  Input files are simple HTML pages that are scanned by map tasks  Emitting <target, source> pairs for each of the links found in the web page source  The reduce task will collate all the pairs that have the same target into a <target, list(source)> pair
  • 59. MapReduce Programming Model (cont.)  The final result is given by one or more files containing these mappings  Term-Vector per Host  A term vector recaps the most important words occurring in a set of documents in the form of list(<word, frequency>)  The number of occurrences of a word is taken as a measure of its importance  MapReduce is used to provide a mapping between the origin of a set of documents, obtained as the host component of the URL of a document, and the corresponding term vector  The map task creates a pair <host, term-vector> for each text document retrieved  The reduce task aggregates the term vectors of documents retrieved from the same host
  • 60. MapReduce Programming Model (cont.)  Inverted Index  The inverted index contains information about the presence of words in documents  Useful to allow fast full-text searches compared to direct document scans  In this case, the map task takes as input a document  For each document it emits a collection of <word, document-id> pairs  The reduce function aggregates the occurrences of the same word, producing a pair <word, list(document-id)>  Distributed Sort  MapReduce is used to parallelize the execution of a sort operation over a large number of records
  • 61. MapReduce Programming Model (cont.)  This application mostly relies on the properties of the MapReduce runtime rather than on the operations performed in the map and reduce tasks  The runtime sorts and creates partitions of the intermediate files  The map task extracts the key from a record and emits a <key, record> pair for each record  The reduce task will simply copy through all the pairs  The actual sorting process is performed by the MapReduce runtime  Emits and partitions the key-value pairs by ordering them according to the key  The above examples are mostly concerned with text-based processing
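Going back to the inverted index example above, the following fragment sketches, in the same plain C# style and outside of any MapReduce runtime, what its map and reduce functions amount to; the method names and types are illustrative only.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class InvertedIndexSketch
    {
        // map: (document-id, document text) -> list of <word, document-id> pairs.
        static IEnumerable<KeyValuePair<string, string>> Map(string documentId, string text)
        {
            foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                yield return new KeyValuePair<string, string>(word, documentId);
        }

        // reduce: (word, list of document-ids) -> <word, list(document-id)>.
        static KeyValuePair<string, List<string>> Reduce(string word, IEnumerable<string> documentIds)
        {
            return new KeyValuePair<string, List<string>>(word, documentIds.Distinct().ToList());
        }
    }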
  • 62. MapReduce Programming Model (cont.)  MapReduce can also be used to solve a wider range of problems  With some adaptation  An interesting example is its application in the field of machine learning  Statistical algorithms such as Support Vector Machines (SVM), Linear Regression (LR), Naïve Bayes (NB), and Neural Network (NN)  Expressed in the form of map and reduce functions  Other interesting applications can be found in the field of compute intensive applications  e.g., The computation of Pi with high degree of precision
  • 63. MapReduce Programming Model (cont.)  The Yahoo Hadoop cluster has been used to compute the (10^15 + 1)th bit of Pi  Hadoop is an open source implementation of the MapReduce platform  Any computation that can be expressed in the form of two major stages can be represented in terms of a MapReduce computation  Analysis  This phase operates directly on the input data files  Corresponds to the operation performed by the map task  The computation at this stage is expected to be embarrassingly parallel since map tasks are executed without any sequencing or ordering
  • 64. MapReduce Programming Model (cont.)  Aggregation  This phase operates on the intermediate results  Characterized by operations aimed at aggregating, summing, and elaborating the data obtained at the previous stage to present it in its final form  The task performed by the reduce function  Adaptations to this model are mostly concerned with  Identifying the appropriate keys  How to create reasonable keys when the original problem does not have such a model  How to partition the computation between map and reduce functions
  • 65. MapReduce Programming Model (cont.)  More complex algorithms can be decomposed into multiple MapReduce programs  The output of one program constitutes the input of the following program  The abstraction proposed by MapReduce provides developers with a very minimal interface  Strongly focused on the algorithm to implement  Rather than the infrastructure on which it is executed  A very effective approach, but at the same time it delegates to the MapReduce runtime a lot of common tasks
  • 66. MapReduce Programming Model (cont.)  Tasks that are important in the management of a distributed application  Allowing the user only to specify configuration parameters to control the behavior of applications  These tasks are the management of data transfer  And the scheduling of map and reduce tasks over a distributed infrastructure  A more complete overview of a MapReduce infrastructure  According to the implementation proposed by Google  The user submits the execution of MapReduce jobs by using the client libraries
  • 67. MapReduce Programming Model (cont.)  In charge of submitting the input data files  Registering the map and reduce functions  Returning the control to the user once the job is completed  A generic distributed infrastructure, i.e. a cluster, can be used to run MapReduce applications  Equipped with job scheduling capabilities and distributed storage  Two different kinds of processes are run on the distributed infrastructure  A master process and a worker process  The master process is in charge of controlling the execution of map and reduce tasks  Partition and reorganize the intermediate output produced by the map task to feed the reduce tasks
  • 68. MapReduce Programming Model (cont.)  The worker processes are used to host the execution of map and reduce tasks  Provide basic I/O facilities that are used to interface the map and reduce tasks with input and output files  In a MapReduce computation, input files are initially divided into splits  Generally 16 to 64 MB  Stored in the distributed file system  The master process generates the map tasks  Assigns input splits to each of them by balancing the load  Worker processes have input and output buffers  Used to optimize the performance of map and reduce tasks
  • 69. MapReduce Programming Model (cont.)  Output buffers for map tasks are periodically dumped to disk to create intermediate files  Intermediate files are partitioned by using a user-defined function to evenly split the output of map tasks  The locations of these pairs are notified to the master process  Forwards this information to the reduce tasks, which collect the required input via a remote procedure call to read from the map tasks’ local storage  The key range is sorted and all the occurrences of the same key are grouped together  The reduce task is executed to produce the final output, stored into the global file system
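The user-defined partition function mentioned above typically reduces to hashing the intermediate key modulo the number of reduce partitions (the default used by Google MapReduce is essentially hash(key) mod R). A minimal C# sketch, with an illustrative signature, follows.

    // Maps an intermediate key to one of the reduce partitions.
    // The signature is illustrative; runtimes let users plug in their own partitioner.
    static int Partition(string key, int reducePartitions)
    {
        // Mask the sign bit so that negative hash codes still yield a valid partition index.
        return (key.GetHashCode() & int.MaxValue) % reducePartitions;
    }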
  • 71. MapReduce Programming Model (cont.)  The above process is completely automatic  Users may control it through configuration parameters  Allow specifying the number of map tasks, the number of partitions into which to separate the final output, and the partition function for the intermediate key range  As well as the map and reduce functions  Besides orchestrating the execution of map and reduce tasks  The MapReduce runtime ensures a reliable execution of applications by providing a fault-tolerant infrastructure  Failures of both master and worker processes are handled
  • 72. MapReduce Programming Model (cont.)  As well as machine failures that make intermediate outputs not accessible  Worker failures are handled by rescheduling map tasks somewhere else  Also the technique that is used to address machine failures  Since the valid intermediate output of map tasks has become inaccessible  Master process failure is instead addressed by using check-pointing  Allow restarting the MapReduce job with a minimum loss of data and computation
  • 73. Variations and Extensions of MapReduce  MapReduce constitutes a simplified model for processing large quantities of data  Imposes constraints on how distributed algorithms should be organized to run over a MapReduce infrastructure  The model can be applied to several different problem scenarios  The abstractions provided to process data are simple  Complex problems may require considerable effort to be represented in terms of map and reduce functions  A series of extensions and variations to the original MapReduce model have been proposed  Extending the MapReduce application space  Providing an easier interface to developers for designing distributed algorithms
  • 74. Hadoop  Apache Hadoop is a collection of software projects for reliable and scalable distributed computing  The entire collection is an open source implementation of the MapReduce framework  Supported by a GFS-like distributed file system  The initiative consists of mostly two projects  Hadoop Distributed File System (HDFS)  Hadoop MapReduce  The former is an implementation of the Google File System  The latter provides the same features and abstractions of Google MapReduce
  • 75. Hadoop (cont.)  Initially developed and supported by Yahoo  Hadoop now constitutes the most mature and large data Cloud application  Has a very robust community of developers and users supporting it  Yahoo now runs the world's largest Hadoop cluster  Composed of 40,000 machines and more than 300,000 cores  Made available to global academic institutions  There is a collection of other projects somehow related to it  Providing services for distributed computing
  • 76. Map-Reduce-Merge  Map-Reduce-Merge is an extension to the MapReduce model  Introducing a third phase to the standard MapReduce pipeline  The Merge phase allows efficiently merging data already partitioned and sorted (or hashed) by map and reduce modules  Simplifies the management of heterogeneous related datasets  Provides an abstraction able to express the common relational algebra operators as well as several join algorithms
  • 77. Aneka MapReduce Programming  Aneka provides an implementation of the MapReduce abstractions  By following the reference model introduced by Google and implemented by Hadoop  MapReduce is supported as one of the available programming models  Can be used to develop distributed applications
  • 78. Introducing the MapReduce Programming Model  The MapReduce Programming Model defines the abstractions and the runtime support  For developing MapReduce applications on top of Aneka  An overview of the infrastructure supporting MapReduce in Aneka  A MapReduce job in Google MapReduce or Hadoop corresponds to the execution of a MapReduce application in Aneka  The application instance is specialized with components that identify the map and reduce functions to use
  • 79. Introducing the MapReduce Programming Model (cont.)  These functions are expressed in terms of a Mapper and Reducer class  Extended from the Aneka MapReduce APIs  The runtime support is constituted by three main elements  MapReduce Scheduling Service  Plays the role of the master process in the Google and Hadoop implementation  MapReduce Execution Service  Plays the role of the worker process in the Google and Hadoop implementation  A specialized distributed file system  Used to move data files
  • 80. Introducing the MapReduce Programming Model (cont.)  Client components are used to submit the execution of a MapReduce job, upload data files, and monitor it  Namely, the MapReduceApplication class  The management of data files is transparent  Local data files are automatically uploaded to Aneka  Output files are automatically downloaded to the client machine if requested
  • 82. Programming Abstractions  Aneka executes any piece of user code within the context of a distributed application  Maintained even in the MapReduce model  There is a natural mapping between the concept of MapReduce job used in Google MapReduce and Hadoop and the Aneka application concept  The task creation is not the responsibility of the user but of the infrastructure  Once the user has defined the map and reduce functions  The Aneka MapReduce APIs provide developers with base classes for developing Mapper and Reducer types
  • 83. Programming Abstractions (cont.)  Use a specialized type of application class, MapReduceApplication, that better supports the needs of this programming model  An overview of the client components defining the MapReduce programming model  Three classes are of interest for application development  Mapper<K,V>, Reducer<K,V>, and MapReduceApplication<M,R>  The other classes are internally used to implement all the functionalities required by the model  Expose simple interfaces that require minimum amount of coding for implementing the map and reduce functions and controlling the job submission
  • 84. Programming Abstractions (cont.)  Mapper<K,V> and Reducer<K,V> constitute the starting point of the application design and implementation  Template specialization is used to keep track of keys and values types on which the two functions operate  Generic types provide a more natural approach in terms of object manipulation from within the map and reduce methods  Simplify the programming by removing the necessity of casting and other type check operations  The submission and execution of a MapReduce job is performed through the class MapReduceApplication<M,R>  Provides the interface to the Aneka Cloud to support MapReduce programming model
  • 86. Programming Abstractions (cont.)  The above class exposes two generic class types  M and R  These two placeholders identify the specific class types of Mapper<K,V> and Reducer<K,V> that will be used by the application  The definition of the Mapper<K,V> class and of the related types for implementing the map function  To implement a specific mapper, necessary to inherit this class and provide actual types for key K and the value V  The map operation is implemented by overriding the abstract method void Map(IMapInput<K,V> input)
  • 88. Programming Abstractions (cont.)  The other methods are internally used by the framework  IMapInput<K,V> provides access to the input key-value pair on which the map operation is performed  The implementation of the Mapper<K,V> component for the Word Counter sample  Counts the frequency of words in a set of large text files  The text files are divided into lines  Each of the lines will become the value component of a key-value pair  The key will be represented by the offset in the file where the line begins  The mapper is specialized by using a long integer as key type and a string for the value
  • 92. Programming Abstractions (cont.)  To count the frequency of words, the map function will emit a new key-value pair for each of the words contained in the line  By using the word as the key and the number 1 as value  This implementation will emit two pairs for the same word if the word occurs twice in the line  The reducer appropriately sums all these occurrences  The definition of the Reducer<K,V> class  The implementation of a specific reducer requires specializing the generic class  Overriding the abstract method: Reduce(IReduceInputEnumerator<V> input)
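Before turning to the reducer, the following is a hedged sketch of what the WordCounterMapper described above might look like. The Mapper<K,V> base class, the IMapInput<K,V> parameter, and the (long, string) specialization follow the text; the Aneka.MapReduce namespace, the protected access modifier, the input.Value accessor, and the Emit(key, value) call are assumptions about the Aneka API.

    using System;
    using Aneka.MapReduce;   // assumed namespace for Mapper<K,V> and IMapInput<K,V>

    // Mapper specialized with a long key (offset of the line in the file)
    // and a string value (the line of text), as described in the text.
    public class WordCounterMapper : Mapper<long, string>
    {
        // The access modifier and the Emit call are assumptions about the API.
        protected override void Map(IMapInput<long, string> input)
        {
            string line = input.Value;   // assumed accessor for the value component
            foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                this.Emit(word, 1);      // emit a (word, 1) pair for every word in the line
            }
        }
    }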
  • 95. Programming Abstractions (cont.)  The reduce operation is applied to a collection of values that are mapped to the same key  The IReduceInputEnumerator<V> allows developers to iterate over such a collection  How to implement the reducer function for the word counter sample  The Reducer<K,V> class is specialized using a string as a key type and an integer as a value  The reducer simply iterates over all the values  Accessible through the enumerator, and sums them  As the iteration is completed, the sum is dumped to file  There is a link between the types used to specialize the mapper and those used to specialize the reducer
  • 99. Programming Abstractions (cont.)  The key and value types used in the reducer are those defining the key-value pair emitted by the mapper  In this case, the mapper generates a key-value pair (string,int)  The reducer is of type Reducer<string,int>  The Mapper<K,V> and Reducer<K,V> classes provide facilities  For defining the computation performed by a MapReduce job  Aneka provides the MapReduceApplication<M,R> class  To submit, execute, and monitor the progress  Represents the local view of distributed applications on top of Aneka
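A corresponding hedged sketch of the WordCounterReducer is shown below, specialized with a string key and an integer value as stated above. The MoveNext()/Current enumerator members and the Emit call used to output the total are assumptions about the Aneka API; the summing logic follows the text.

    using Aneka.MapReduce;   // assumed namespace for Reducer<K,V> and IReduceInputEnumerator<V>

    // Reducer specialized with a string key (the word) and an integer value (a partial count).
    public class WordCounterReducer : Reducer<string, int>
    {
        protected override void Reduce(IReduceInputEnumerator<int> input)
        {
            int sum = 0;
            while (input.MoveNext())   // assumed enumerator pattern
            {
                sum += input.Current;
            }
            this.Emit(sum);            // assumed call for emitting the aggregated value
        }
    }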
  • 101. Programming Abstractions (cont.)  Such class provides limited facilities mostly concerned with starting the MapReduce job and waiting for its completion  Due to the simplicity of the MapReduce model  The interface of MapReduceApplication <M,R>  The interface of the class exhibits only the MapReduce specific settings  The control logic is encapsulated in the ApplicationBase<M> class  Possible to set the behavior of MapReduce for the current execution  The parameters that can be controlled
  • 106. Programming Abstractions (cont.)  Partitions  Stores an integer number containing the number of partitions into which to divide the final results  Also determines the number of reducer tasks that will be created by the runtime infrastructure  The default value is 10  Attempts  Contains the number of times that the runtime will retry to execute a task before declaring it failed  The default value is 3  UseCombiner  Stores a boolean value indicating if the MapReduce runtime should add a combiner phase to the map task execution to reduce the number of intermediate files that are passed to the reduce task
  • 107. Programming Abstractions (cont.)  The default value is set to true  SynchReduce  Stores a boolean value indicating whether to synchronize the reducers or not  The default value is set to true  Not used to determine the behavior of MapReduce  IsInputReady  A boolean property indicating whether the input files are already stored into the distributed file system  Or have to be uploaded by the client manager before executing the job  The default value is set to false  FetchResults
  • 108. Programming Abstractions (cont.)  A boolean property indicating whether the client manager needs to download to the local computer the result files produced by the execution of the job  The default value is set to true  LogFile  Contains a string defining the name of the log file used to store the performance statistics recorded during the execution of the MapReduce job  The default value is mapreduce.log  The core control logic governing the execution of a MapReduce job  Resides within the MapReduceApplicationManager <M,R> class  Interacts with the MapReduce runtime
  • 109. Programming Abstractions (cont.)  Developers can control the application by using the methods and the properties  Exposed by the ApplicationBase<M> class  The collection of methods of interest in the above class for executing MapReduce jobs  The constructors and the common properties that are of interest of all the applications  Two different overloads of the InvokeAndWait method  The first one simply starts the execution of the MapReduce job and returns upon its completion  The second one executes a client supplied callback at the end of the execution
  • 110. Programming Abstractions (cont.)  Most commonly used to execute MapReduce jobs  The use of InvokeAndWait is blocking  Not possible to stop the application by calling StopExecution within the same thread  If necessary to implement a more sophisticated management of the MapReduce job  Possible to use the SubmitExecution method  Submits the execution of the application and returns without waiting for its completion  The MapReduce implementation will automatically upload all the files  Found in the Configuration.Workspace directory  Ignore the files added by the AddSharedFile methods
  • 116. Programming Abstractions (cont.)  How to create a MapReduce application for running the Word Counter sample  Defined by the previous WordCounterMapper and WordCounterReducer classes  The lines of interest are those put in evidence in the try { … } catch { … } finally { … } block  The execution of a MapReduce job requires only three lines of code  The user reads the configuration file  Creates a MapReduceApplication<M,R> instance and configures it  Starts the execution  All the rest of the code is mostly concerned with setting up the logging and handling exceptions
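A hedged sketch of this submission logic is given below. The overall pattern (read the configuration, create and configure a MapReduceApplication<M,R> instance, start the execution with InvokeAndWait) follows the text; the Configuration.GetConfiguration call, the constructor signature, and the configuration file name are assumptions used only for illustration.

    using System;
    using Aneka;               // assumed namespace for Configuration
    using Aneka.MapReduce;     // assumed namespace for MapReduceApplication<M,R>

    public class WordCounterDriver
    {
        public static void Main(string[] args)
        {
            try
            {
                // 1. Read the configuration file (the loading call is an assumption).
                Configuration conf = Configuration.GetConfiguration("wordcounter.conf");

                // 2. Create the application bound to the mapper and reducer types
                //    (the constructor signature is an assumption).
                var application =
                    new MapReduceApplication<WordCounterMapper, WordCounterReducer>("WordCounter", conf);

                // 3. Start the execution and block until the job completes.
                application.InvokeAndWait();
            }
            catch (Exception ex)
            {
                Console.WriteLine("MapReduce job failed: " + ex.Message);
            }
        }
    }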
  • 120. Runtime Support  The runtime support for the execution of MapReduce jobs is constituted by the collection of services  Dealing with the scheduling and the execution of MapReduce tasks  MapReduce Scheduling Service and the MapReduce Execution Service  Integrated with the existing services of the framework to provide persistence and application accounting  Also with the features available for the applications developed with other programming models  Job and Task Scheduling  The scheduling of jobs and tasks is the responsibility of the MapReduce Scheduling Service
  • 121. Runtime Support (cont.)  Covers the same role of the master process in the Google MapReduce implementation  The architecture of the scheduling service is organized into two major components  The MapReduceSchedulerService and the MapReduceScheduler  The former is a wrapper around the scheduler implementing the interfaces required by Aneka to expose a software component as a service  The latter controls the execution of jobs and schedules tasks  The main role of the service wrapper is to translate messages coming from the Aneka runtime or the client applications  Into calls or events directed to the scheduler component and vice versa
  • 122. Runtime Support (cont.)  The relationship between the two components  The core functionalities for job and task scheduling are implemented in the MapReduceScheduler class  The scheduler manages multiple queues for several operations  Uploading input files into the distributed file system  Initializing jobs before scheduling  Scheduling map and reduce tasks  Keeping track of unreachable nodes  Re-submitting failed tasks  Reporting execution statistics  All are performed asynchronously and triggered by events happening in the Aneka middleware
  • 124. Runtime Support (cont.)  Task Execution  The execution of tasks is controlled by the MapReduce Execution Service  Plays the role of the worker process in the Google MapReduce implementation.  The service manages the execution of map and reduce tasks  Also performs other operations such as sorting and merging of intermediate files  The internal organization of the above service  Three major components that coordinate together for executing tasks
  • 125. Runtime Support (cont.)  MapReduceExecutorService, ExecutorManager, and MapReduceExecutor  The MapReduceExecutorService interfaces the ExecutorManager with the Aneka middleware  The ExecutorManager is in charge of keeping track of the tasks being executed  By demanding the specific execution of a task to the MapReduceExecutor  Also of sending the statistics about the execution back to the scheduler service  Possible to configure more than one MapReduceExecutor instance  Helpful in case of multi-core nodes where more than one task can be executed at the same time
  • 127. Distributed File System Support  Unlike the other programming models supported by Aneka  The MapReduce model does not leverage the default Storage Service for storage and data transfer  But uses a distributed file system implementation  The requirements in terms of file management are significantly different from those of the other models  MapReduce has been designed to process large quantities of data  Stored in files of large dimensions  The support provided by a distributed file system is more appropriate
  • 128. Distributed File System Support (cont.)  Can leverage multiple nodes for storing data  Guarantee high availability and a better efficiency by means of replication and distribution  The original MapReduce implementation assumes the existence of a distributed and reliable storage  The use of a distributed file system for implementing the storage layer is natural  Aneka provides the capability of interfacing with different storage implementations  Maintains the same flexibility for the integration of a distributed file system
  • 129. Distributed File System Support (cont.)  The level of integration required by MapReduce requires the ability to perform  Retrieving the location of files and file chunks  Useful to the scheduler for optimizing the scheduling of map and reduce tasks according to the location of data  Accessing a file by means of a stream  Required for the usual I/O operations to and from data files  The stream might also access the network  If the file chunk is not stored on the local node  Aneka provides interfaces that allow performing such operations
  • 130. Distributed File System Support (cont.)  Also the capability to plug different file systems behind the operations by providing the appropriate implementation  The current implementation provides bindings to HDFS  On top of these low level interfaces  The MapReduce programming model offers classes to read from and write to files in a sequential manner  The classes SeqReader and SeqWriter  Provide sequential access for reading and writing key-value pairs  The above two classes expect a specific file format
  • 131. Distributed File System Support (cont.)  An Aneka MapReduce file is composed of a header used to identify the file and a sequence of record blocks  Each of them storing a key-value pair  The header is composed of 4 bytes  The first three bytes represent the character sequence SEQ  The fourth byte identifies the version of the file  The record block is composed as follows  The first 8 bytes are used to store two integers representing the length of the rest of the block and the length of the key section  The remaining part of the block stores the data of the value component of the pair
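The following sketch writes data in the layout just described: a 4-byte header (the SEQ character sequence plus a version byte) followed by record blocks, each starting with two integers holding the length of the rest of the block and the length of the key section. Placing the key bytes before the value bytes is an inference from the presence of the key-length field, and the exact accounting of the block-length field is not detailed in the text; the routine is purely illustrative and is not the actual SeqWriter implementation.

    using System.IO;
    using System.Text;

    static class AnekaFileFormatSketch
    {
        // Writes the 4-byte header: the character sequence SEQ plus a version byte.
        static void WriteHeader(BinaryWriter writer, byte version)
        {
            writer.Write(Encoding.ASCII.GetBytes("SEQ"));
            writer.Write(version);
        }

        // Writes one record block: block length, key length, then key and value bytes.
        // Whether the block length includes the key-length field is an assumption.
        static void WriteRecord(BinaryWriter writer, byte[] key, byte[] value)
        {
            int keyLength = key.Length;
            int blockLength = key.Length + value.Length;   // "length of the rest of the block"
            writer.Write(blockLength);
            writer.Write(keyLength);
            writer.Write(key);     // key section (assumed to precede the value data)
            writer.Write(value);   // value data
        }
    }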
  • 132. Distributed File System Support (cont.)  The SeqReader and SeqWriter classes are designed to read and write files in this format  By transparently handling the file format information and translating key and value instances to and from their binary representation  All the .NET built-in types are supported  MapReduce jobs very often operate on data represented in common text files  Specific versions of the SeqReader and SeqWriter classes have been designed  To read and write text files as a sequence of key-value pairs  In the case of the read operation
  • 134. Distributed File System Support (cont.)  Each value of the pair is represented by a line in the text file  The key is automatically generated and set to the position in bytes where the line starts in the file  In case of the write operation  The writing of the key is skipped and the values are saved as single lines  The interface of the SeqReader and SeqWriter classes  The SeqReader class provides an enumerator based approach  Possible to access the key and the value sequentially by calling the NextKey() and the NextValue() methods respectively
  • 135. Distributed File System Support (cont.)  Also possible to access the raw byte data of keys and values by using the NextRawKey() and NextRawValue()  The HasNext() returns a Boolean indicating whether there are more pairs to read or not  The SeqWriter class exposes different versions of the Append method  A practical use of the SeqReader class  By implementing the callback used in the Word Counter sample  To visualize the results of the application, use the SeqReader class to read the content of the output files  Dump it into a proper textual form
  • 142. Distributed File System Support (cont.)  Can be visualized with any text editor, such as the Notepad application  The OnDone callback checks whether the application has terminated successfully  In case there are no errors, iterates over the result files downloaded in the workspace output directory  By default, the files are saved in the output subdirectory of the workspace directory  For each of the result files  Opens a SeqReader instance on it  Dumps the content of the key-value pair into a text file, which can be opened by any text editor
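A hedged sketch of this kind of post-processing is shown below: it walks the result files downloaded in the output subdirectory of the workspace, opens a SeqReader on each of them, and dumps the key-value pairs to a text file. The HasNext(), NextKey(), and NextValue() methods are those listed in the text; the constructor taking a file path, the object return types, and the callback signature are assumptions.

    using System;
    using System.IO;
    using Aneka.MapReduce;   // assumed namespace for SeqReader

    public static class WordCounterCallbacks
    {
        // Illustrative stand-in for the OnDone callback described in the text.
        public static void OnDone(string workspace, bool success)
        {
            if (!success)
            {
                Console.WriteLine("The MapReduce job did not complete successfully.");
                return;
            }

            string outputDirectory = Path.Combine(workspace, "output");   // default output subdirectory
            using (StreamWriter text = new StreamWriter(Path.Combine(workspace, "wordcount.txt")))
            {
                foreach (string file in Directory.GetFiles(outputDirectory))
                {
                    SeqReader reader = new SeqReader(file);   // constructor signature assumed
                    while (reader.HasNext())
                    {
                        object key = reader.NextKey();
                        object value = reader.NextValue();
                        text.WriteLine("{0}\t{1}", key, value);
                    }
                }
            }
        }
    }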
  • 148. Example Application  MapReduce is a very useful model for processing large quantities of data  In many cases, data is maintained in a semi-structured form, like logs or web pages  To demonstrate how to program real applications with Aneka MapReduce  Consider a very common task: log parsing  Design a MapReduce application that processes the logs produced by the Aneka container  To extract some summary information about the behavior of the Cloud  Design the Mapper and Reducer classes  Used to execute log parsing and data extraction operations
  • 149. Parsing Aneka Logs  Aneka components produce a lot of information  Stored in the form of log files  e.g., daemons, container instances, and services  The most relevant information is stored into the container instances logs  Store the information about the applications that are executed on the Cloud  Parse these logs to extract useful information about the execution of applications and the usage of services in the Cloud  The entire framework leverages the log4net library
  • 150. Parsing Aneka Logs (cont.)  For collecting and storing the log information  The system is configured to produce a log file  Partitioned in chunks every time the container instance restarts  The information contained in the log file can be customized in its appearance  Currently the default layout is DD MMM YY hh:mm:ss level – message  Some examples of formatted log messages  The message part of almost all the log lines exhibits a similar structure
  • 151. Parsing Aneka Logs (cont.)  Start with the information about the component that enters the log line  Can be easily extracted by looking at the first occurrence of the ':' character following a sequence of characters that does not contain spaces
15 Mar 2011 10:30:07 DEBUG - SchedulerService:HandleSubmitApplication - SchedulerService: …
15 Mar 2011 10:30:07 INFO - SchedulerService: Scanning candidate storage …
15 Mar 2011 10:30:10 INFO - Added [WU: 51d55819-b211-490f-b185-8a25734ba705, 4e86fd02…
15 Mar 2011 10:30:10 DEBUG - StorageService:NotifyScheduler - Sending FileTransferMessage…
15 Mar 2011 10:30:10 DEBUG - IndependentSchedulingService:QueueWorkUnit - Queueing…
15 Mar 2011 10:30:10 INFO - AlgorithmBase::AddTasks[64] Adding 1 Tasks
15 Mar 2011 10:30:10 DEBUG - AlgorithmBase:FireProvisionResources - Provision Resource: 1
  • 152. Parsing Aneka Logs (cont.)  Possible information to extract from such logs  The distribution of log messages according to the level  The distribution of log messages according to the components  This information can be easily extracted and composed into a single view  By creating Mapper tasks that count the occurrences of log levels and component names  Emit one simple key-value pair in the form (level-name, 1) or (component-name, 1) for each of the occurrences
  • 153. Parsing Aneka Logs (cont.)  The Reducer task will simply sum up all the key-value pairs that have the same key  The structure of the map and reduce functions will be the following
map: (long, string) => (string, long)
reduce: (string, long) => (string, long)
 The Mapper class receives a key-value pair  Containing the position of the line inside the file as a key and the log message as the value component  Produces a key-value pair containing a string representing the name of the log level or the component name and 1 as value  The Reducer class will sum up all the key-value pairs that have the same name
  • 154. Parsing Aneka Logs (cont.)  Both analyses are performed simultaneously  Rather than developing two different MapReduce jobs  The operation performed by the Reducer class is the same in both cases  The operation of the Mapper class changes  The type of the key-value pair that is generated is the same for the two jobs  Possible to combine the two tasks performed by the map function into one single Mapper class to produce two key-value pairs for each input line  By differentiating the name of Aneka components from the log level names by using an initial underscore character  Very easy to post-process the output of the reduce function in order to present and organize data
  • 155. Mapper Design and Implementation  The operation performed by the map function is a very simple text extraction  Identifies the level of the logging and the name of the component entering the information in the log  Once this information is extracted, a key-value pair (string, long) is emitted by the function  To combine the two MapReduce jobs into one single job, each map task will emit at most two key-value pairs  Some of the log lines do not record the name of the component that entered the line  For these lines, only the key-value pair corresponding to the log level will be emitted
  • 156. Mapper Design and Implementation (cont.)  The implementation of the mapper class for the log parsing task  The Map method simply locates the position of the log level label in the line and extracts it  Emits a corresponding key-value pair (label, 1)  Then tries to locate the position of the name of the Aneka component that enters the log line  By looking for a sequence of characters limited by a colon  If such a sequence does not contain spaces, it represents the name of the Aneka component  Another key-value pair (component-name, 1) is emitted  To differentiate from the log level labels, an underscore is prefixed to the name of the key in this second case
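A hedged sketch of the LogParsingMapper logic described above follows. It reuses the same assumed API surface as the word counter sketch (the Aneka.MapReduce namespace, the input.Value accessor, and the Emit call); only the parsing logic, extracting the log level and the component name and prefixing the latter with an underscore, follows the text.

    using System;
    using Aneka.MapReduce;   // assumed namespace, as in the word counter sketch

    // Mapper specialized as map: (long, string) => (string, long).
    public class LogParsingMapper : Mapper<long, string>
    {
        private static readonly string[] Levels = { "DEBUG", "INFO", "WARN", "ERROR" };

        protected override void Map(IMapInput<long, string> input)
        {
            string line = input.Value;   // assumed accessor for the log line

            // Locate the log level label in the line and emit a (level, 1) pair.
            foreach (string level in Levels)
            {
                if (line.Contains(" " + level + " "))
                {
                    this.Emit(level, 1L);   // Emit(key, value) is an assumed API call
                    break;
                }
            }

            // The component name is the character sequence between the start of the
            // message part (after "- ") and the first ':'; it is only valid if it
            // contains no spaces. Emit ("_" + name, 1) to distinguish it from levels.
            int messageStart = line.IndexOf("- ");
            if (messageStart >= 0)
            {
                messageStart += 2;
                int colon = line.IndexOf(':', messageStart);
                if (colon > messageStart)
                {
                    string component = line.Substring(messageStart, colon - messageStart);
                    if (!component.Contains(" "))
                    {
                        this.Emit("_" + component, 1L);
                    }
                }
            }
        }
    }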
  • 160. Reducer Design and Implementation  The implementation of the reduce function is even more straightforward  The only operation that needs to be performed is to add all the values associated to the same key  Emit a key-value pair with the total sum  The infrastructure will already aggregate all the values that have been emitted for a given key  Simply need to iterate over the collection of values and sum them up  The operation is actually the same for the two different types of key-value pairs extracted from the log lines  Responsibility of the driver program to differentiate among the different types of information assembled in the output files
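The corresponding reducer is sketched below. As in the word counter sketch, the enumerator members and the Emit call are assumptions about the Aneka API; the logic, summing all the values attached to a key, follows the text and is the same for both log levels and component names.

    using Aneka.MapReduce;   // assumed namespace, as in the previous sketches

    // Reducer specialized as reduce: (string, long) => (string, long).
    public class LogParsingReducer : Reducer<string, long>
    {
        protected override void Reduce(IReduceInputEnumerator<long> input)
        {
            long total = 0;
            while (input.MoveNext())   // assumed enumerator pattern
            {
                total += input.Current;
            }
            this.Emit(total);          // assumed call for emitting the aggregated value
        }
    }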
  • 163. Driver Program  The LogParsingMapper and LogParsingReducer constitute the core functionality of the MapReduce job  They only need to be properly configured in the main program to process and produce text files  The mapper component extracts two different types of information  Another task performed in the driver application is the separation of these two statistics into two different files for further analysis  The implementation of the driver program  Three things to be noticed  The configuration of the MapReduce job
  • 164. Driver Program (cont.)  The post processing of the result files  The management of errors  The configuration of the job is performed in the Initialize method  Reads the configuration file from the local file system  Ensures that the input and output formats of files are set to text  MapReduce jobs can be configured by using a specific section of the configuration file  Named MapReduce  Two sub-sections control the properties of input and output files  Named Input and Output respectively
  • 165. Driver Program (cont.)  The input and output sections may contain the following properties  Format (string)  Defines the format of the input file  If this property is set, the only supported value is text  Filter (string)  Defines the search pattern to be used to filter the input files to process in the workspace directory  Only applies for the Input Properties group  NewLine (string)  Defines the sequence of characters used to detect (or write) a new line in the text stream  This value is meaningful when the input/output format is set to text
  • 166. Driver Program (cont.)  The default value is selected from the execution environment if not set  Separator (character)  Only present in the Output section  Defines the character that needs to be used to separate the key from the value in the output file  This value is meaningful when the input/output format is set to text  Possible to control other parameters of a MapReduce job  These parameters are defined in the main MapReduce configuration section  Instead of a programming approach for the initialization of the configuration
  • 172. Driver Program (cont.)  Also possible to embed these settings into the standard Aneka configuration file  Possible to open a <Group name="MapReduce"> … </Group> tag  Enter all the properties required for the execution  The Aneka configuration file is based on a flexible framework that allows one to simply enter groups of name-value properties  The Aneka.Property and Aneka.PropertyGroup classes also provide facilities  For converting the strings representing the value of a property into a corresponding built-in type if possible  Simplifies the task of reading from and writing to configuration objects
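Under these conventions, the MapReduce settings could be embedded in the Aneka configuration file along the following lines. The <Group name="MapReduce"> tag and the Input/Output sub-sections are those described in the text; the element syntax used below for the individual name-value properties is only a hypothetical rendering and may differ from the actual configuration schema.

    <Group name="MapReduce">
      <Property name="LogFile" value="logparsing.log" />
      <Property name="Attempts" value="3" />
      <Group name="Input">
        <Property name="Format" value="text" />
        <Property name="Filter" value="*.log" />
      </Group>
      <Group name="Output">
        <Property name="Format" value="text" />
      </Group>
    </Group>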
  • 174. Driver Program (cont.)  The second element is the post-processing of the output files  Performed in the OnDone method  Its invocation is triggered either if an error occurs during the execution of the MapReduce job  Or if it completes successfully  Separates the occurrences of log level labels from the occurrences of Aneka component names  By saving them into two different files: loglevels.txt and components.txt under the workspace directory  Deletes the output directory where the output files of the reduce phase have been downloaded  These two files contain the aggregated result of the analysis
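A minimal sketch of the splitting step performed in OnDone is given below, assuming the reduce output has been downloaded as plain text files with one key/value pair per line (as configured for this job): keys starting with an underscore are routed to components.txt, the others to loglevels.txt, and the output directory is finally deleted. The file names and the directory layout follow the text; everything else is illustrative.

    using System.IO;

    static class LogParsingPostProcessing
    {
        // Splits the reduce output into loglevels.txt and components.txt, using the
        // leading underscore convention to recognize Aneka component names.
        static void SplitResults(string workspace)
        {
            string outputDirectory = Path.Combine(workspace, "output");
            using (StreamWriter levels = new StreamWriter(Path.Combine(workspace, "loglevels.txt")))
            using (StreamWriter components = new StreamWriter(Path.Combine(workspace, "components.txt")))
            {
                foreach (string file in Directory.GetFiles(outputDirectory))
                {
                    foreach (string line in File.ReadLines(file))
                    {
                        if (line.StartsWith("_"))
                            components.WriteLine(line.Substring(1));   // drop the underscore prefix
                        else
                            levels.WriteLine(line);
                    }
                }
            }
            // The text states that the downloaded output directory is deleted afterwards.
            Directory.Delete(outputDirectory, true);
        }
    }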
  • 175. Driver Program (cont.)  Can be used to extract statistical information about the content of the log files and display it graphically  A final aspect is the management of errors  Aneka provides a collection of APIs contained in the Aneka.Util library  Represent utility classes for automating tedious tasks  e.g., The appropriate collection of stack trace information associated with an exception or the information about the types of the exception thrown  The reporting features triggered upon exceptions are implemented in the ReportError method  Utilizes the facilities offered by the IOUtil class to dump a simple error report to both the console and a log file named using the pattern: error.YYYY-MM-DD_hh-mm-ss.log
  • 180. Running the Application  Aneka produces a considerable amount of logging information  The default configuration of the logging infrastructure creates a new log file for each activation of the Container process  Or as soon as the dimension of the log file goes beyond 10 MB  By simply keeping an Aneka Cloud running for a few days, it is quite easy to collect enough data to mine for the above sample application  Also constitutes a real case study for MapReduce  One of its most common practical applications is extracting semi-structured information from logs and traces of execution
  • 181. Running the Application (cont.)  In the execution of the test  Uses a distributed infrastructure consisting of 7 worker nodes and one master node  Interconnected through a LAN  Processes 18 log files of various sizes, for a total aggregate size of 122 MB  The execution of the MapReduce job over the collected data produces the results  Stored in the loglevels.txt and components.txt files  Represented graphically  There is a considerable amount of unstructured information in the log files produced by the Container processes
  • 182. Running the Application (cont.)  About 60% of the log content is skipped during the classification  This content is most likely the result of stack trace dumps into the log file  ERROR and WARN entries produce sequences of lines that are not recognized  The distribution among the components that make use of the logging APIs  This distribution is computed over the data recognized as a valid log entry  Just about 20% of these entries have not been recognized by the parser implemented in the map function
  • 183. Running the Application (cont.)  Distribution of log messages by level: DEBUG 12%, INFO 22%, ERROR 4%, WARN 0%, SKIPPED 62%
  • 184. Running the Application (cont.)  The meaningful information extracted from the log analysis constitutes about 32% of the entire log data  i.e., 80% of 40% of the total lines parsed  Despite the simplicity of the parsing function implemented in the map task  The Aneka MapReduce programming model can be used to easily perform massive data analysis tasks  The example shows how to logically and programmatically approach a realistic data analysis case study with MapReduce  And implement it on top of the Aneka APIs
  • 185. Running the Application (cont.)  Distribution of recognized log entries among the Aneka components (from the data saved in components.txt, shown graphically)
