COLLEGE OF COMPUTING AND INFORMATICS
MSc. in Information Technology
Processing Big Data
Nov, 2021
Assosa, Ethiopia
Contents
• Introduction
• Integrating disparate data stores
• Employing Hadoop MapReduce
• Building blocks of Hadoop MapReduce
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts of big data processing
–Explain the building blocks of Hadoop MapReduce
–Understand the concepts of YARN
–Compare Big Data processing with traditional data processing
Introduction
• Data processing is the manipulation of data by a computer.
– It includes the conversion of raw data to machine-readable form, the flow of
data through the CPU and memory to output devices, and the formatting or
transformation of output.
– Any use of computers to perform defined operations on data can be
included under data processing.
• Big data processing is a set of techniques or programming models
to access large-scale data to extract useful information for
supporting and providing decisions.
• This is because Big Data helps companies to generate valuable insights.
– Companies use Big Data to refine their marketing campaigns and techniques.
– Companies use it in machine learning projects to train machines, predictive modeling, and
other advanced analytics applications
Algorithms used to analyse big data:
• Association
• Classification
• Integration
Integrating disparate data stores
Types of Data Integration Tools include
Types of Data Integration Tools include..
Types of Data Integration Tools include…
Examples of Data Integration
Common Data Integration Approaches
Common Data Integration Approaches..
Hadoop MapReduce
• Big data processing is not handled by a single machine, so
MapReduce is used to process big data in a parallel and
distributed manner.
• A framework with which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a
reliable manner.
• A processing technique and a programming model for distributed computing
based on Java.
Hadoop MapReduce..
Hadoop MapReduce…
Hadoop MapReduce….
• The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
– Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
– The Reduce task takes the output from a map as input and combines those
data tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce
task is always performed after the map job.
• Easy to scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called
mappers and reducers, as the sketch below shows.
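To make mappers and reducers concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names and the whitespace tokenization are illustrative assumptions, not a prescribed implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: converts each input line into (word, 1) key/value tuples.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }
}

// Reducer: combines the tuples for each word into a smaller set (one total per word).
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    context.write(word, new IntWritable(sum)); // emit (word, total)
  }
}
```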
MapReduce-Advantages
Parallel Processing
MapReduce-Advantages..
Data Locality- Processing to Storage
MapReduce- Traditional Vs MapReduce Way
Election Vote Counting
MapReduce- Traditional Vs MapReduce Way
Election Vote Counting: Traditional Way
MapReduce- Traditional Vs MapReduce Way
Election Vote Counting: MapReduce Way
• A MapReduce program executes in three stages, namely the map
stage, the shuffle stage, and the reduce stage.
–Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of a file or directory
and is stored in the Hadoop file system (HDFS).
–The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of
data.
–Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS.
MapReduce-Execution Stages
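As a small worked trace of the three stages, assume a hypothetical two-line input file:

```
Input (HDFS):        "deer bear river"
                     "car car river"

Map stage:           (deer,1) (bear,1) (river,1)
                     (car,1) (car,1) (river,1)

Shuffle/sort stage:  (bear,[1]) (car,[1,1]) (deer,[1]) (river,[1,1])

Reduce stage:        (bear,1) (car,2) (deer,1) (river,2)   -> written to HDFS
```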
MapReduce Process
Anatomy of MapReduce Program
MapReduce: Map-Shuffle-Reduce
MapReduce-Example word count Process
• How Hadoop runs MapReduce Jobs?
Introduction to MapReduce…..
• Input reader
– The input reader reads the incoming data and splits it into data
blocks of an appropriate size (64 MB to 128 MB). Each data block is
associated with a Map function.
– Once the input reader reads the data, it generates the corresponding key-value pairs.
– The input files reside in HDFS. The input data can be in any form.
• Map function
– The map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs. The map input and output types may
differ from each other.
Data Flow in MapReduce (Phases)
• Partition function
– The partition function assigns the output of each Map function to the
appropriate reducer: given a key and value, it returns the index of the
reducer that should receive them (a sketch follows below).
• Shuffling and Sorting
– The data is shuffled between/within nodes so that it moves out of
the map phase and is ready for processing by the reduce function. Sometimes
the shuffling of data can take considerable computation time.
– A sorting operation is performed on the input data for the Reduce function:
the data is compared using a comparison function and arranged in
sorted order.
Data Flow in MapReduce..
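To illustrate the partition function, here is a minimal custom Partitioner sketch. The default in Hadoop is a hash partitioner; this hypothetical example instead routes keys by their first letter:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: given a key and value, returns the index of the
// reducer that should receive this (key, value) pair.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReducers) {
    String k = key.toString();
    if (k.isEmpty()) return 0;                                // guard against empty keys
    return Character.toLowerCase(k.charAt(0)) % numReducers;  // index in [0, numReducers)
  }
}
```

It would be enabled on a job with job.setPartitionerClass(FirstLetterPartitioner.class).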
• Reduce function
– The Reduce function is applied to each unique key. These keys are
already arranged in sorted order. The Reduce function iterates over the values
associated with each key and generates the corresponding output.
• Output writer
– Once the data has flowed through all the above phases, the output writer executes.
Its role is to write the Reduce output to stable storage.
Data Flow in MapReduce…
MapReduce-Characteristics
• MapReduce Mapper Class
– In MapReduce, the role of the Mapper class is to map the input key-value
pairs to a set of intermediate key-value pairs.
– It transforms the input records into intermediate records.
– These intermediate records are associated with a given output key and passed to
the Reducer for the final output.
• MapReduce Reducer Class
– Its role is to reduce the set of intermediate values that share a key.
– Implementations can access the Configuration for the job via the
JobContext.getConfiguration() method.
• MapReduce Job Class
– The Job class is used to configure the job, submit it, control its
execution, and query its state. Once the job has been submitted, the set
methods throw IllegalStateException (see the driver sketch below).
MapReduce API
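A minimal driver sketch tying the Mapper, Reducer, and Job classes together; the class names and path arguments are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");   // configure the job...
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);       // classes from the earlier sketch
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not exist)
    // ...then submit it; set methods called after this point throw IllegalStateException.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, this would typically be launched with hadoop jar wordcount.jar WordCountDriver <input> <output>.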
• YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java-only
MapReduce programming and lets other applications such as HBase and Spark work on
the cluster.
• Different YARN applications can co-exist on the same cluster, so MapReduce,
HBase, and Spark can all run at the same time, bringing great benefits for
manageability and cluster utilization.
• JobTracker & TaskTracker were used in previous versions of Hadoop,
where they were responsible for handling resources and tracking progress.
• Hadoop 2.0 introduced the ResourceManager and NodeManager to
overcome the shortfalls of JobTracker & TaskTracker (a configuration sketch follows).
Overview of YARN-Component of Hadoop 2.0
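As a minimal illustration of the Hadoop 2.0 setup, a yarn-site.xml sketch wiring NodeManagers to the ResourceManager; the hostname is a hypothetical placeholder:

```xml
<configuration>
  <!-- Where NodeManagers find the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>  <!-- hypothetical host -->
  </property>
  <!-- Auxiliary service that serves map outputs to reducers during the shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```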
Limitations of Hadoop 1.0 (MR 1)
Needs of YARN
YARN as a Solution
• Client: For submitting MapReduce jobs.
• Resource Manager: To manage the use of resources across the
cluster
• Node Manager: For launching and monitoring the compute
containers on machines in the cluster.
• MapReduce Application Master: Coordinates the tasks running the
MapReduce job.
– The application master and the MapReduce tasks run in containers that
are scheduled by the resource manager, and managed by the node
managers.
Components of YARN
YARN Application Workflow in MapReduce
• There are mainly 3 types of Schedulers in Hadoop:
–FIFO (First In First Out) Scheduler
–Capacity Scheduler
–Fair Scheduler
Types of Scheduling
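As a hedged illustration of queue-based scheduling, a minimal capacity-scheduler.xml sketch defining two hypothetical queues; the queue names and percentages are assumptions, not defaults:

```xml
<configuration>
  <!-- Two hypothetical queues splitting cluster capacity 70/30 -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

A job can then be routed to a queue, e.g. conf.set("mapreduce.job.queuename", "dev") in the driver.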
• Scalability: MapReduce 1 hits a scalability bottleneck at about 4,000 nodes
and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000
tasks.
• Utilization: The Node Manager manages a pool of resources, rather
than a fixed number of designated slots, thus increasing
utilization.
• Multitenancy: Different versions of MapReduce can run on YARN,
which makes the process of upgrading MapReduce more manageable.
Benefits of YARN
Tools & Techniques to Analyze Big Data
Dec, 2021
Assosa, Ethiopia
Contents
• Introduction
• Abstracting Hadoop MapReduce jobs with Pig
• Performing ad hoc Big Data querying with Hive
• Creating business value from extracted data
Objectives
• At the end of this chapter, you will be able to:
–Identify different tools and techniques for big data
–Understand the concepts of Pig
–Understand the concepts of Hive
Introduction
• Abstracting Hadoop MapReduce jobs with Pig
–Communicating with Hadoop in Pig Latin
–Executing commands using the Grunt Shell
–Streamlining high-level processing
• Performing ad hoc Big Data querying with Hive
–Persisting data in the Hive Metastore
–Performing queries with HiveQL
–Investigating Hive file formats
• Creating business value from extracted data
–Mining data with Mahout
–Visualizing processed results with reporting tools, BI
–Querying in real time with Impala
Big Data Hadoop Projects
Abstracting Hadoop MapReduce jobs with Pig
• Pig was initially developed by Yahoo to make programming easier.
– Apache Pig can process extensive datasets as it works on
top of Hadoop. It is used for analyzing massive datasets by
representing them as dataflows.
– Apache Pig also raises the level of abstraction for processing enormous
datasets.
– Pig Latin is the scripting language that developers use for working on
the Pig framework; it runs on the Pig runtime (a word-count sketch follows below).
• Features of Pig:
– Easy to program
– Rich set of operators
– Ability to handle various kinds of data
– Extensibility
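For example, the classic word count expressed in Pig Latin, runnable statement by statement in the Grunt shell; the file names are illustrative:

```
-- Load each line of a text file as a single chararray field
lines   = LOAD 'input.txt' AS (line:chararray);
-- Split lines into words, one word per record
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words, then count each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
```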
Performing ad hoc Big Data querying with Hive
Performing ad hoc Big Data querying with Hive..
Why??
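Why? Largely because Hive lets analysts pose ad hoc, SQL-like queries (HiveQL) without writing Java MapReduce code; Hive compiles the queries into MapReduce jobs. A brief, hedged sketch (the table, columns, and file path are hypothetical):

```sql
-- Define a table over delimited text files in HDFS
CREATE TABLE page_views (user_id STRING, url STRING, ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load raw data into the table
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- Ad hoc query: top pages by view count
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```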
Tools for big data
The tools fall into the following categories:
• The first row is NoSQL storage
• The second row is common big data storage and management
Tools for big data..
Tools for big data…
Tools for big data….
Tools for big data-Advantages
Developing a Big Data Strategy
Dec, 2021
Assosa, Ethiopia
Contents
• Introduction
• Overview of Big Data Strategy
• Defining Big Data Strategy
• Enabling analytical innovation
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts of big data Strategy
–Explain a Big Data strategy and considerations
Introduction
• Strategy is a plan of action or policy designed to achieve an overall aim.
• Big Data is Worthless Without a Big Data Strategy.
• However, it cannot be seen as something separate from the
organizational strategy; it should be firmly embedded in it.
• When we say a Big Data strategy, this effectively means a business
strategy that includes Big Data.
• It defines and lays out a comprehensive vision across the enterprise and
sets a foundation for the organization to employ data-related or data-
dependent capabilities.
• A well-defined and comprehensive Big Data strategy makes the benefits
of Big Data actionable for the organization.
• It sets out the steps that an organization should execute in order to
become a “Data Driven Enterprise”.
Introduction..
• The Big Data strategy incorporates guiding principles to
accomplish the data-driven vision;
– it directs the organization to select specific business goals and is the starting
point for data-driven planning across the enterprise.
• Big data holds many promises, such as
– gaining valuable customer insights
– predicting the future
– generating new revenue streams, etc.
• An effective big data strategy is therefore essential.
Big Data Considerations
• You can’t process the amount of data that you want to because of the
limitations of your current platform.
• You can’t include new/contemporary data sources (example, social media,
RFID, Sensory, Web, GPS, textual data) because it does not comply with
the data storage schema.
• You need to (or want to) integrate data as quickly as possible to be
current on your analysis.
• You want to work with a schema-on-demand data storage paradigm
because of the variety of data types involved.
• The data is arriving so fast at your organization’s doorstep that your
traditional analytics platform cannot handle it.
Critical Success Factors for Big Data Analytics
• A clear business need (alignment with the vision and the strategy).
• Strong, committed sponsorship (executive champion)
• Alignment between the business and IT strategy.
• A fact-based decision-making culture.
• A strong data infrastructure.
• The right analytics tools.
• Right people with right skills
Business Problems Addressed by Big Data Analytics
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling/up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities
What are Big Data Objectives?
• The technologies and concepts behind big data allow
organizations to achieve a variety of objectives.
• Like many new information technologies,
–big data can bring about
• dramatic cost reductions,
• substantial improvements in the time required to perform a
computing task,
• or new product and service offerings
Defining a Big Data Strategy
• A good Big Data Strategy will explore the following subject
domains and align them to the organization's objectives:
1. Identify an Opportunity & Economic Value of Data
2. Defining Big Data Architecture
3. Selecting Big Data Technologies
4. Understanding Big Data Science
5. Developing Big Data Analytics
6. Institutionalize Big Data
Defining a Big Data Strategy
1. Identify an Opportunity & Economic Value of Data
–Catalog existing data sources available inside the organization, tapped
or untapped.
–Invent new ways of capturing data, integrate your data sources with
external communities. Develop semantics and metadata for
association, clustering, classification and trending.
–Identify and create opportunities to integrate and fuse data with
partners' datasets in industries like Telecom, Travel, Financial, Healthcare,
and Entertainment, etc.
–Conceptualize the data insights and possible data sciences to extract
valuable data, e.g. associations, simulation, regression, correlation,
segmentation, trending, and prediction, etc.
Defining a Big Data Strategy
Identify an Opportunity & Economic Value of Data…
–Identify the scope of data access, i.e. who can explore data and who gets
access to data insights.
–Identify possibilities for monetizing data to generate revenue from the data
insights gained, such as generating leads, campaigns, upsell/cross-sell
opportunities, data streaming, data APIs, and improving staff productivity &
customer service.
–Identify the ethical and legal codes associated with data under exploration
with respect to industry standards, organizational culture, data
policies, data privacy, and regulatory and legal requirements.
–Data requirements: What type of data do you need? Is it diverse
enough? How will you source it and store it?
Defining a Big Data Strategy
2. Defining Big Data Architecture
– Defining Business Problems & Classification of associated data, such as
Market Sentiment Analysis, Churn Prediction, or Fraud Detection.
– Defining a Data Acquisition Strategy.
– Selecting a Hadoop Framework.
– A Big Data Life Cycle Management Framework.
– Choosing Big Data stores: traditional or NoSQL, and polyglot
persistence.
– Defining Big Data Infrastructure & Platform Taxonomy.
– Identifying Big Data Analytics Frameworks and associated Machine
Learning Sciences.
– Developing a Data Monetization Strategy to exploit data's value internally
within the enterprise, or externally.
Defining a Big Data Strategy
3. Selecting Big Data Technologies
• Having the appropriate infrastructure in place to support the data you need is
essential.
• Be sure to consider the four layers of data: collecting, storing,
processing/analysing, and communicating insights from the data.
– Internet Technologies
– Machine learning
– Commodity Hardware
– Distributed processing
– Leverage a cloud-based approach to reduce time to market, reduce risk,
and gain better SLAs out of the box
Defining a Big Data Strategy
4. Understanding Big Data Science
– Data Science is the ongoing process of discovering information from data. It is a
process that never stops, and often one question leads to another new question.
It focuses on real-world problems and tries to explain them.
– Machine Learning
• Supervised
• Unsupervised, hybrid
– Common Algorithms
• Classification
• Clustering
• Associations & Correlations
• Text Mining
• Linear Regression
Defining a Big Data Strategy
5. Developing Big Data Analytics
• Big Data applications vary by industry. Businesses are trying to find
value in monetizing data, or to use it to improve efficiency and customer
experience.
• Considering the different types of big data analytics is important
– Descriptive
– Diagnostic
– Predictive
– Prescriptive
Defining a Big Data Strategy
6. Institutionalize Big Data
• Each enterprise will tailor Big Data to meet the objectives of its
particular vision.
– Discovery (Opportunity, Requirements, Best Fit)
– Proof of Concept (to Evaluate Business Value)
– Provision Infra (here Big Data Elasticity comes into play)
– Ingest (Source the data)
– Process (Transform, Analyze, Data Science)
– Publish (Share the learnings)
• Governance: the current state of data quality, security, access,
ownership, ethics and data privacy within the organization.
• Considering skills and capacity is also required.
Defining a Big Data Strategy
Generally, the following are all important when defining a big data strategy:
–Establishing your Big Data needs
–Meeting business goals with timely data
–Evaluating commercial Big Data tools
–Managing organizational expectations
Enabling analytical innovation with Big Data
• Data can drive innovation in two ways.
– Data can motivate ideation, development, execution and
evaluation of new innovations.
– And it can underpin, or be a central component of new products,
services, operations or business models.
• Recent advances in machine learning have been powered by the vast amount of
digitized data.
– For the first time, a machine powered by analytics was able to win against
the best human player in the world in the game “Go.”
– Self-driving cars rely on the large number of digitized images that have
improved vision recognition systems dramatically.
Enabling analytical innovation with Big Data..
How does big data fuel innovation?
• “Analytics is really great at finding linkages or hidden patterns we
may not easily observe by mining through a ton of data.”
• “Analytics can really drive the creation of ‘recombinations’, or
combining a diverse set of existing technologies in a new way.”
• “We can use lessons learned from past generations of IT and
analytics technologies to inform us about what the future could look
like.”
– Focusing on business importance
– Framing the problem
– Selecting the correct tools
– Achieving timely results
Implementing a Big Data Solution
Jan, 2022
Assosa, Ethiopia
Contents
• Introduction
• Selecting suitable vendors and hosting options
• Balancing costs against business value
• Keeping ahead of the curve
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts and criteria for selecting suitable
vendors and hosting options
– Explain balancing costs against business value
Introduction
• To be sure, big data solutions are in great demand.
• Today, enterprise leaders know that their big data is one of
their most valuable resources and one they can’t afford to
ignore.
• As a result, they are looking for hardware and software that
can
–help them store, manage and analyze their big data.
• Experts suggest that a good way to start the process of
selecting a big data solution is
–to determine exactly what kind of solution you need.
Big Data Market Share
Big Data Market Share..
Big Data Software Market Shares
Top Big Data Software Provider Companies
• SAP
• Splunk
• Oracle
• IBM
• Microsoft
Big Data Professional Service Market Share
Top Big Data Professional Service Provider Companies
• IBM
• Accenture
• Palantir
• Teradata
Big Data centered Industry Landscape: Big Data sits at the center, surrounded by IoT, Cloud, Mobile, and Bio.
Reading
• What are the criteria to select suitable vendors
and hosting options?
• Hardware
• Software
• Professional service
• How to balance costs against the business
value generated from big data
• The knowledge and skills required to keep ahead of
the curve
Common Types of Big Data Solution
• Enterprise vendors offer a wide array of different types of big
data solutions.
• The kind of big data application that is right for an organization
will depend on its goals.
• The best approach is to define the goals clearly at the outset
and then go looking for products that will help to reach those
goals.
Big Data Solution Hosting Options
• On-Premise vs. Cloud-Based Big Data Applications
–Do you want to host big data software in the organization's data center or use a
cloud-based solution?
• Proprietary vs. Open Source Big Data Applications
–Does the organization have skilled professionals to get open source solutions up
and running and configured for its needs?
–Will it need to purchase support or consulting services? (Consider those
expenses when figuring out the total cost of ownership.)
• Batch vs. Streaming Big Data Applications
– Does the organization want to analyze data in real time or in batches?
–A Lambda architecture supports both real-time and batch data processing.
Selection Criteria or Success Factors
• Integration with Legacy Technology
• Performance
• Scalability
• Usability
• Visualization
• Flexibility
• Security
• Support
–Even experienced IT professionals sometimes find it
difficult to deploy, maintain and use complex big data
applications.
• Ecosystem: a big data platform that integrates with a lot of
other popular tools and a vendor with strong partnerships with
other providers
• Self-Service Capabilities
• Total Cost of Ownership
• Estimated time to value
• Artificial Intelligence and Machine Learning
– How innovative are the various big data solution vendors?
–AI and machine learning research is advancing at an incredible rate
and becoming a mainstream part of big data analytics solutions.
Selection Criteria or Success Factors..
https://www.datamation.com/big-data/how-to-select-a-big-data-application/
