COLLEGE OF COMPUTING AND INFORMATICS
MSc. in Information Technology
Processing Big Data
Nov, 2021
Assosa, Ethiopia
Contents
• Introduction
• Integrating disparate data stores
• Employing Hadoop MapReduce
• Building blocks of Hadoop MapReduce
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts of big data processing
–Explain the building blocks of Hadoop MapReduce
–Understand the concepts of YARN
–Compare Big Data processing with traditional data processing
Introduction
• Data processing is the manipulation of data by a computer.
– It includes the conversion of raw data to machine-readable form, the flow of
data through the CPU and memory to output devices, and the formatting or
transformation of output.
– Any use of computers to perform defined operations on data can be
included under data processing.
• Big data processing is a set of techniques or programming models
to access large-scale data to extract useful information for
supporting and providing decisions.
• This is because Big Data helps companies to generate valuable insights.
– Companies use Big Data to refine their marketing campaigns and techniques.
– Companies use it in machine learning projects to train machines, predictive modeling, and
other advanced analytics applications
Algorithms used to analyse big data:
• Association
• Classification
• Integration
Integrating disparate data stores
Types of Data Integration Tools include
Types of Data Integration Tools include..
Types of Data Integration Tools include…
Examples of Data Integration
Common Data Integration Approaches
Common Data Integration Approaches..
Hadoop MapReduce
• Big data processing is not handled by a single machine, so
MapReduce is used to process big data in a parallel and
distributed manner.
• A framework with which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a
reliable manner.
• A processing technique and a programming model for distributed computing
based on Java.
Hadoop MapReduce..
Hadoop MapReduce…
Hadoop MapReduce….
• The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
– Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
– The Reduce task takes the output from a map as input and combines those
data tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce
task is always performed after the map job.
• Easy to scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called
mappers and reducers, as the sketch below shows.
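To make mappers and reducers concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names and the whitespace tokenization are illustrative assumptions, not a prescribed implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: converts each input line into (word, 1) key/value tuples.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }
}

// Reducer: combines the tuples for each word into a smaller set (one total per word).
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    context.write(word, new IntWritable(sum)); // emit (word, total)
  }
}
```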
MapReduce-Advantages
Parallel Processing
MapReduce-Advantages..
Data Locality- Processing to Storage
MapReduce- Traditional Vs MapReduce Way
Election Vote Counting
MapReduce- Traditional Vs MapReduce Way
Election Vote Counting: Traditional Way
MapReduce- Traditional Vs MapReduce Way
Election Vote Counting: MapReduce Way
• A MapReduce program executes in three stages, namely the map
stage, the shuffle stage, and the reduce stage.
–Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of a file or directory
and is stored in the Hadoop file system (HDFS).
–The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of
data.
–Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS.
MapReduce-Execution Stages
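As a small worked trace of the three stages, assume a hypothetical two-line input file:

```
Input (HDFS):        "deer bear river"
                     "car car river"

Map stage:           (deer,1) (bear,1) (river,1)
                     (car,1) (car,1) (river,1)

Shuffle/sort stage:  (bear,[1]) (car,[1,1]) (deer,[1]) (river,[1,1])

Reduce stage:        (bear,1) (car,2) (deer,1) (river,2)   -> written to HDFS
```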
MapReduce Process
Anatomy of MapReduce Program
MapReduce: Map-Shuffle-Reduce
MapReduce-Example word count Process
• How Hadoop runs MapReduce Jobs?
Introduction to MapReduce…..
• Input reader
– The input reader reads the incoming data and splits it into data
blocks of an appropriate size (64 MB to 128 MB). Each data block is
associated with a Map function.
– Once the input reader reads the data, it generates the corresponding key-value pairs.
– The input files reside in HDFS. The input data can be in any form.
• Map function
– The map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs. The map input and output types may
differ from each other.
Data Flow in MapReduce (Phases)
• Partition function
– The partition function assigns the output of each Map function to the
appropriate reducer: given a key and value, it returns the index of the
reducer that should receive them (a sketch follows below).
• Shuffling and Sorting
– The data is shuffled between/within nodes so that it moves out of
the map phase and is ready for processing by the reduce function. Sometimes
the shuffling of data can take considerable computation time.
– A sorting operation is performed on the input data for the Reduce function:
the data is compared using a comparison function and arranged in
sorted order.
Data Flow in MapReduce..
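To illustrate the partition function, here is a minimal custom Partitioner sketch. The default in Hadoop is a hash partitioner; this hypothetical example instead routes keys by their first letter:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: given a key and value, returns the index of the
// reducer that should receive this (key, value) pair.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReducers) {
    String k = key.toString();
    if (k.isEmpty()) return 0;                                // guard against empty keys
    return Character.toLowerCase(k.charAt(0)) % numReducers;  // index in [0, numReducers)
  }
}
```

It would be enabled on a job with job.setPartitionerClass(FirstLetterPartitioner.class).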
• Reduce function
– The Reduce function is applied to each unique key. These keys are
already arranged in sorted order. The Reduce function iterates over the values
associated with each key and generates the corresponding output.
• Output writer
– Once the data has flowed through all the above phases, the output writer executes.
Its role is to write the Reduce output to stable storage.
Data Flow in MapReduce…
MapReduce-Characteristics
• MapReduce Mapper Class
– In MapReduce, the role of the Mapper class is to map the input key-value
pairs to a set of intermediate key-value pairs.
– It transforms the input records into intermediate records.
– These intermediate records are associated with a given output key and passed to
the Reducer for the final output.
• MapReduce Reducer Class
– Its role is to reduce the set of intermediate values that share a key.
– Implementations can access the Configuration for the job via the
JobContext.getConfiguration() method.
• MapReduce Job Class
– The Job class is used to configure the job, submit it, control its
execution, and query its state. Once the job has been submitted, the set
methods throw IllegalStateException (see the driver sketch below).
MapReduce API
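A minimal driver sketch tying the Mapper, Reducer, and Job classes together; the class names and path arguments are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");   // configure the job...
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);       // classes from the earlier sketch
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not exist)
    // ...then submit it; set methods called after this point throw IllegalStateException.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, this would typically be launched with hadoop jar wordcount.jar WordCountDriver <input> <output>.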
• YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java-only
MapReduce programming and lets other applications such as HBase and Spark work on
the cluster.
• Different YARN applications can co-exist on the same cluster, so MapReduce,
HBase, and Spark can all run at the same time, bringing great benefits for
manageability and cluster utilization.
• JobTracker & TaskTracker were used in previous versions of Hadoop,
where they were responsible for handling resources and tracking progress.
• Hadoop 2.0 introduced the ResourceManager and NodeManager to
overcome the shortfalls of JobTracker & TaskTracker (a configuration sketch follows).
Overview of YARN-Component of Hadoop 2.0
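As a minimal illustration of the Hadoop 2.0 setup, a yarn-site.xml sketch wiring NodeManagers to the ResourceManager; the hostname is a hypothetical placeholder:

```xml
<configuration>
  <!-- Where NodeManagers find the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>  <!-- hypothetical host -->
  </property>
  <!-- Auxiliary service that serves map outputs to reducers during the shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```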
Limitations of Hadoop 1.0 (MR 1)
Needs of YARN
YARN as a Solution
• Client: For submitting MapReduce jobs.
• Resource Manager: To manage the use of resources across the
cluster
• Node Manager: For launching and monitoring the compute
containers on machines in the cluster.
• MapReduce Application Master: Coordinates the tasks running the
MapReduce job.
– The application master and the MapReduce tasks run in containers that
are scheduled by the resource manager, and managed by the node
managers.
Components of YARN
YARN Application Workflow in MapReduce
• There are mainly 3 types of Schedulers in Hadoop:
–FIFO (First In First Out) Scheduler
–Capacity Scheduler
–Fair Scheduler
Types of Scheduling
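As a hedged illustration of queue-based scheduling, a minimal capacity-scheduler.xml sketch defining two hypothetical queues; the queue names and percentages are assumptions, not defaults:

```xml
<configuration>
  <!-- Two hypothetical queues splitting cluster capacity 70/30 -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

A job can then be routed to a queue, e.g. conf.set("mapreduce.job.queuename", "dev") in the driver.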
• Scalability: MapReduce 1 hits a scalability bottleneck at about 4,000 nodes
and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000
tasks.
• Utilization: The Node Manager manages a pool of resources, rather
than a fixed number of designated slots, thus increasing
utilization.
• Multitenancy: Different versions of MapReduce can run on YARN,
which makes the process of upgrading MapReduce more manageable.
Benefits of YARN
Tools & Techniques to Analyze Big Data
Dec, 2021
Assosa, Ethiopia
Contents
• Introduction
• Abstracting Hadoop MapReduce jobs with Pig
• Performing ad hoc Big Data querying with Hive
• Creating business value from extracted data
Objectives
• At the end of this chapter, you will be able to:
–Identify different tools and techniques for big data
–Understand the concepts of Pig
–Understand the concepts of Hive
Introduction
• Abstracting Hadoop MapReduce jobs with Pig
–Communicating with Hadoop in Pig Latin
–Executing commands using the Grunt Shell
–Streamlining high-level processing
• Performing ad hoc Big Data querying with Hive
–Persisting data in the Hive Metastore
–Performing queries with HiveQL
–Investigating Hive file formats
• Creating business value from extracted data
–Mining data with Mahout
–Visualizing processed results with reporting tools, BI
–Querying in real time with Impala
Big Data Hadoop Projects
Abstracting Hadoop MapReduce jobs with Pig
• Pig was initially developed by Yahoo to make programming easier.
– Apache Pig can process extensive datasets as it works on
top of Hadoop. It is used for analyzing massive datasets by
representing them as dataflows.
– Apache Pig also raises the level of abstraction for processing enormous
datasets.
– Pig Latin is the scripting language that developers use for working on
the Pig framework; it runs on the Pig runtime (a word-count sketch follows below).
• Features of Pig:
– Easy to program
– Rich set of operators
– Ability to handle various kinds of data
– Extensibility
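For example, the classic word count expressed in Pig Latin, runnable statement by statement in the Grunt shell; the file names are illustrative:

```
-- Load each line of a text file as a single chararray field
lines   = LOAD 'input.txt' AS (line:chararray);
-- Split lines into words, one word per record
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words, then count each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
```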
Performing ad hoc Big Data querying with Hive
Performing ad hoc Big Data querying with Hive..
Why??
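Why? Largely because Hive lets analysts pose ad hoc, SQL-like queries (HiveQL) without writing Java MapReduce code; Hive compiles the queries into MapReduce jobs. A brief, hedged sketch (the table, columns, and file path are hypothetical):

```sql
-- Define a table over delimited text files in HDFS
CREATE TABLE page_views (user_id STRING, url STRING, ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load raw data into the table
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- Ad hoc query: top pages by view count
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```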
Tools for big data
The tools fall into the following categories:
• The first row is NoSQL storage
• The second row is common big data storage and management
Tools for big data..
Tools for big data…
Tools for big data….
Tools for big data-Advantages
Developing a Big Data Strategy
Dec, 2021
Assosa, Ethiopia
Contents
• Introduction
• Overview of Big Data Strategy
• Defining Big Data Strategy
• Enabling analytical innovation
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts of big data Strategy
–Explain a Big Data strategy and considerations
Introduction
• Strategy is a plan of action or policy designed to achieve an overall aim.
• Big Data is Worthless Without a Big Data Strategy.
• However, it cannot be seen as something separate from the
organizational strategy; it should be firmly embedded in it.
• When we say a Big Data strategy, this effectively means a business
strategy that includes Big Data.
• It defines and lays out a comprehensive vision across the enterprise and
sets a foundation for the organization to employ data-related or data-
dependent capabilities.
• A well-defined and comprehensive Big Data strategy makes the benefits
of Big Data actionable for the organization.
• It sets out the steps that an organization should execute in order to
become a “Data Driven Enterprise”.
Introduction..
• The Big Data strategy incorporates guiding principles to
accomplish the data-driven vision;
– it directs the organization to select specific business goals and is the starting
point for data-driven planning across the enterprise.
• Big data holds many promises, such as
– gaining valuable customer insights
– predicting the future
– generating new revenue streams, etc.
• An effective big data strategy is therefore essential.
Big Data Considerations
• You can’t process the amount of data that you want to because of the
limitations of your current platform.
• You can’t include new/contemporary data sources (example, social media,
RFID, Sensory, Web, GPS, textual data) because it does not comply with
the data storage schema.
• You need to (or want to) integrate data as quickly as possible to be
current on your analysis.
• You want to work with a schema-on-demand data storage paradigm
because of the variety of data types involved.
• The data is arriving so fast at your organization’s doorstep that your
traditional analytics platform cannot handle it.
Critical Success Factors for Big Data Analytics
• A clear business need (alignment with the vision and the strategy).
• Strong, committed sponsorship (executive champion)
• Alignment between the business and IT strategy.
• A fact-based decision-making culture.
• A strong data infrastructure.
• The right analytics tools.
• Right people with right skills
Business Problems Addressed by Big Data Analytics
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling/up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities
What are Big Data Objectives?
• The technologies and concepts behind big data allow
organizations to achieve a variety of objectives.
• Like many new information technologies,
–big data can bring about
• dramatic cost reductions,
• substantial improvements in the time required to perform a
computing task,
• or new product and service offerings
Defining a Big Data Strategy
• A good Big Data Strategy will explore the following subject
domains and align them to the organization's objectives:
1. Identify an Opportunity & Economic Value of Data
2. Defining Big Data Architecture
3. Selecting Big Data Technologies
4. Understanding Big Data Science
5. Developing Big Data Analytics
6. Institutionalize Big Data
Defining a Big Data Strategy
1. Identify an Opportunity & Economic Value of Data
–Catalog existing data sources available inside the organization, tapped
or untapped.
–Invent new ways of capturing data, integrate your data sources with
external communities. Develop semantics and metadata for
association, clustering, classification and trending.
–Identify and create opportunities to integrate and fuse data with
partners' datasets in industries like Telecom, Travel, Financial, Healthcare,
and Entertainment, etc.
–Conceptualize the data insights and possible data sciences to extract
valuable data, e.g. associations, simulation, regression, correlation,
segmentation, trending, and prediction, etc.
Defining a Big Data Strategy
Identify an Opportunity & Economic Value of Data…
–Identify the scope of data access, i.e. who can explore data and who gets
access to data insights.
–Identify possibilities for monetizing data to generate revenue from the data
insights gained, such as generating leads, campaigns, upsell/cross-sell
opportunities, data streaming, data APIs, and improving staff productivity &
customer service.
–Identify the ethical and legal codes associated with data under exploration
with respect to industry standards, organizational culture, data
policies, data privacy, and regulatory and legal requirements.
–Data requirements: What type of data do you need? Is it diverse
enough? How will you source it and store it?
Defining a Big Data Strategy
2. Defining Big Data Architecture
– Defining Business Problems & Classification of associated data, such as
Market Sentiment Analysis, Churn Prediction, or Fraud Detection.
– Defining a Data Acquisition Strategy.
– Selecting a Hadoop Framework.
– A Big Data Life Cycle Management Framework.
– Choosing Big Data stores: traditional or NoSQL, and polyglot
persistence.
– Defining Big Data Infrastructure & Platform Taxonomy.
– Identifying Big Data Analytics Frameworks and associated Machine
Learning Sciences.
– Developing a Data Monetization Strategy to exploit data's value internally
within the enterprise, or externally.
Defining a Big Data Strategy
3. Selecting Big Data Technologies
• Having the appropriate infrastructure in place to support the data you need is
essential.
• Be sure to consider the four layers of data: collecting, storing,
processing/analysing, and communicating insights from the data.
– Internet Technologies
– Machine learning
– Commodity Hardware
– Distributed processing
– Leverage a cloud-based approach to reduce time to market, reduce risk,
and gain better SLAs out of the box
Defining a Big Data Strategy
4. Understanding Big Data Science
– Data Science is the ongoing process of discovering information from data. It is a
process that never stops, and often one question leads to another new question.
It focuses on real-world problems and tries to explain them.
– Machine Learning
• Supervised
• Unsupervised, hybrid
– Common Algorithms
• Classification
• Clustering
• Associations & Correlations
• Text Mining
• Linear Regression
Defining a Big Data Strategy
5. Developing Big Data Analytics
• Big Data applications vary by industry. Businesses are trying to find
value in monetizing data, or to use it to improve efficiency and customer
experience.
• Considering the different types of big data analytics is important
– Descriptive
– Diagnostic
– Predictive
– Prescriptive
Defining a Big Data Strategy
6. Institutionalize Big Data
• Each enterprise will tailor Big Data to meet the objectives of its
particular vision.
– Discovery (Opportunity, Requirements, Best Fit)
– Proof of Concept (to Evaluate Business Value)
– Provision Infra (here Big Data Elasticity comes into play)
– Ingest (Source the data)
– Process (Transform, Analyze, Data Science)
– Publish (Share the learnings)
• Governance: the current state of data quality, security, access,
ownership, ethics and data privacy within the organization.
• Considering skills and capacity is also required.
Defining a Big Data Strategy
Generally, the following are all important when defining a big data strategy:
–Establishing your Big Data needs
–Meeting business goals with timely data
–Evaluating commercial Big Data tools
–Managing organizational expectations
Enabling analytical innovation with Big Data
• Data can drive innovation in two ways.
– Data can motivate ideation, development, execution and
evaluation of new innovations.
– And it can underpin, or be a central component of new products,
services, operations or business models.
• Recent advances in machine learning have been powered by the vast amount of
digitized data.
– For the first time, a machine powered by analytics was able to win against
the best human player in the world in the game “Go.”
– Self-driving cars rely on the large number of digitized images that have
improved vision recognition systems dramatically.
Enabling analytical innovation with Big Data..
How does big data fuel innovation?
• “Analytics is really great at finding linkages or hidden patterns we
may not easily observe by mining through a ton of data.”
• “Analytics can really drive the creation of ‘recombinations’, or
combining a diverse set of existing technologies in a new way.”
• “We can use lessons learned from past generations of IT and
analytics technologies to inform us about what the future could look
like.”
– Focusing on business importance
– Framing the problem
– Selecting the correct tools
– Achieving timely results
Implementing a Big Data Solution
Jan, 2022
Assosa, Ethiopia
Contents
• Introduction
• Selecting suitable vendors and hosting options
• Balancing costs against business value
• Keeping ahead of the curve
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts and criteria for selecting suitable
vendors and hosting options
– Explain balancing costs against business value
Introduction
• To be sure, big data solutions are in great demand.
• Today, enterprise leaders know that their big data is one of
their most valuable resources and one they can’t afford to
ignore.
• As a result, they are looking for hardware and software that
can
–help them store, manage and analyze their big data.
• Experts suggest that a good way to start the process of
selecting a big data solution is
–to determine exactly what kind of solution you need.
Big Data Market Share
Big Data Market Share..
Big Data Software Market Shares
Top Big Data Software Provider Companies
• SAP
• Splunk
• Oracle
• IBM
• Microsoft
Big Data Professional Service Market Share
Top Big Data Professional Service Provider Companies
• IBM
• Accenture
• Palantir
• Teradata
Big Data centered Industry Landscape: Big Data sits at the center, surrounded by IoT, Cloud, Mobile, and Bio.
Reading
• What are the criteria to select suitable vendors
and hosting options?
• Hardware
• Software
• Professional service
• How to balance costs against the business
value generated from big data
• The knowledge and skills required to keep ahead of
the curve
Common Types of Big Data Solution
• Enterprise vendors offer a wide array of different types of big
data solutions.
• The kind of big data application that is right for an organization
will depend on its goals.
• The best approach is to define the goals clearly at the outset
and then go looking for products that will help to reach those
goals.
Big Data Solution Hosting Options
• On-Premise vs. Cloud-Based Big Data Applications
–Do you want to host big data software in the organization's data center or use a
cloud-based solution?
• Proprietary vs. Open Source Big Data Applications
–Does the organization have skilled professionals to get open source solutions up
and running and configured for its needs?
–Will it need to purchase support or consulting services? (Consider those
expenses when figuring out the total cost of ownership.)
• Batch vs. Streaming Big Data Applications
– Does the organization want to analyze data in real time or in batches?
–A Lambda architecture supports both real-time and batch data processing.
Selection Criteria or Success Factors
• Integration with Legacy Technology
• Performance
• Scalability
• Usability
• Visualization
• Flexibility
• Security
• Support
–Even experienced IT professionals sometimes find it
difficult to deploy, maintain and use complex big data
applications.
• Ecosystem: a big data platform that integrates with a lot of
other popular tools and a vendor with strong partnerships with
other providers
• Self-Service Capabilities
• Total Cost of Ownership
• Estimated time to value
• Artificial Intelligence and Machine Learning
– How innovative are the various big data solution vendors?
–AI and machine learning research is advancing at an incredible rate
and becoming a mainstream part of big data analytics solutions.
Selection Criteria or Success Factors..
https://www.datamation.com/big-data/how-to-select-a-big-data-application/
