3. In this session you will learn about:
• Hadoop Ecosystem
• Data discovery
• Open source technology for Big Data Analytics
• Cloud and Big Data
Learning Objectives
4. Apache Hadoop is one of the most powerful open-source tools for Big Data processing.
The Hadoop ecosystem revolves around three main components:
• HDFS
• MapReduce
• YARN
Apart from these core components, there are several other Hadoop
ecosystem components that extend and boost Hadoop's functionality.
Hadoop
6. Hadoop Components
1.1 HDFS
The Hadoop Distributed File System (HDFS) is the primary storage system
of Hadoop.
HDFS stores very large files on a cluster of commodity hardware.
It follows the principle of storing a small number of large files rather
than a huge number of small files.
HDFS stores data reliably even in the case of hardware failure.
It provides high-throughput access to applications through parallel
data access.
8. 1.11 NameNode –
It works as the Master in a Hadoop cluster.
The NameNode stores metadata, i.e. the number of blocks, their replicas,
and other details.
This metadata is kept in memory on the master.
The NameNode assigns tasks to the slave nodes.
It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
Components of HDFS
10. 1.12 DataNode –
It works as a Slave in a Hadoop cluster.
In Hadoop HDFS, the DataNode is responsible for storing the actual data.
The DataNode also performs read and write operations as requested by
clients.
DataNodes can be deployed on commodity hardware.
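To make the NameNode/DataNode split concrete, here is a minimal, hypothetical sketch in plain Python (not Hadoop's actual API): the NameNode keeps an in-memory map from each block to the DataNodes holding its replicas, while the DataNodes store the bytes. The 128 MB block size and replication factor 3 are assumed typical HDFS defaults.

```python
# Hypothetical sketch of NameNode-style metadata: block -> replica locations.
# Block size and replication factor mirror common HDFS defaults.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB
REPLICATION = 3

def split_into_blocks(file_size):
    """Return how many blocks a file of file_size bytes occupies."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE   # ceiling division

def place_replicas(block_id, datanodes):
    """Pick REPLICATION distinct DataNodes for a block (round-robin sketch)."""
    start = block_id % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(REPLICATION)]

datanodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical slave nodes
file_size = 300 * 1024 * 1024                      # a 300 MB file
blocks = split_into_blocks(file_size)              # 3 blocks (128 + 128 + 44 MB)

# The NameNode holds this map in memory; DataNodes hold the actual bytes.
block_map = {b: place_replicas(b, datanodes) for b in range(blocks)}
print(blocks)         # 3
print(block_map[0])   # ['dn1', 'dn2', 'dn3']
```

Because the metadata lives only on the NameNode, losing it loses the file system, which is why the slides stress deploying it on reliable hardware.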
12. 1.2 MapReduce
Hadoop MapReduce is the data processing layer of Hadoop.
It processes large structured and unstructured data stored in HDFS.
MapReduce processes huge amounts of data in parallel.
It does this by dividing the job into a set of independent tasks
(sub-jobs).
In Hadoop, MapReduce works by breaking the processing into two phases:
the map phase and the reduce phase.
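The phase-based model can be illustrated with the classic word-count example. This is a minimal in-process sketch in Python (production Hadoop jobs are normally written in Java against the MapReduce API): the map phase emits (word, 1) pairs, a shuffle step groups pairs by key, and the reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key; here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insight", "big cluster"]   # stand-ins for HDFS splits
pairs = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(pairs))
print(result["big"])   # 3
```

In a real cluster, each input split is mapped on a different node in parallel, which is what makes the independent sub-jobs scale.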
14. 1.3 YARN
Hadoop YARN provides resource management; it is often described as the
operating system of Hadoop.
It is responsible for managing and monitoring workloads and implementing
security controls.
It is a central platform to deliver data governance tools across Hadoop
clusters.
YARN allows multiple data processing engines, such as real-time
streaming and batch processing, to run on the same cluster.
16. Resource Manager –
It is a cluster-level component and runs on the Master machine.
It manages resources and schedules applications running on top of
YARN.
It has two components: the Scheduler and the Application Manager.
Node Manager –
It is a node-level component.
It runs on each slave machine.
It continuously communicates with the Resource Manager to stay up to date.
Components of YARN
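As a rough illustration of the division of labor above, here is a toy sketch (hypothetical names and units, not YARN's real Scheduler API) in which the Resource Manager matches container requests against the free capacity each Node Manager reports:

```python
# Toy ResourceManager-style scheduling sketch. Real YARN tracks memory
# and vcores per NodeManager; here each node just reports free memory in GB.

nodes = {"node1": 8, "node2": 4}   # free memory reported by NodeManagers

def allocate(nodes, request_gb):
    """Place a container on the first node with enough free memory."""
    for name, free in nodes.items():
        if free >= request_gb:
            nodes[name] = free - request_gb   # reserve the capacity
            return name
    return None   # no node can satisfy the request right now

print(allocate(nodes, 6))   # 'node1' (8 GB free)
print(allocate(nodes, 3))   # 'node2' (node1 has only 2 GB left)
```

The continuous Node Manager heartbeats mentioned above are what keep the `nodes` picture accurate in a real cluster.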
17. Data discovery is the collection and analysis of data from various sources
to gain insight from hidden patterns and trends.
It is the first step in fully harnessing an organization’s data to inform
critical business decisions.
Through the data discovery process, data is gathered, combined, and
analyzed in a sequence of steps.
The goal is to make messy and scattered data clean, understandable, and
user-friendly.
Data discovery
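The gather/combine/analyze sequence described above can be sketched in miniature. The sources and fields below are hypothetical; the point is only the shape of the process: gather records from multiple sources, combine and deduplicate them, then analyze the cleaned result.

```python
# Toy data-discovery pipeline over two hypothetical sources.

# Gather: records collected from two systems, with one overlap.
crm = [{"customer": "acme", "spend": 100}, {"customer": "beta", "spend": 50}]
web = [{"customer": "acme", "spend": 100}, {"customer": "gamma", "spend": 75}]

# Combine: merge both sources and drop exact duplicates.
seen, combined = set(), []
for rec in crm + web:
    key = (rec["customer"], rec["spend"])
    if key not in seen:
        seen.add(key)
        combined.append(rec)

# Analyze: a simple aggregate over the cleaned, combined data.
total = sum(r["spend"] for r in combined)
print(len(combined), total)   # 3 225
```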
19. According to Gartner, “Big Data Discovery” is the next big trend in
analytics.
Hottest trends of the last few years in analytics:
Big Data
Data Discovery
Data Science
21. What are the Benefits of Data Discovery?
Data discovery provides a framework for firms to unlock and act upon the
insights contained within their data.
It transforms messy and unstructured data to facilitate and enhance its
analysis. Data discovery allows firms to:
• Gather Actionable Insights
• Save Time
• Scale Data Across Teams
• Clean and Reuse Data
24. We know we want to collect, store, organize,
analyze, and share data.
But we have limited resources.
25. What is Cloud Computing?
Cloud computing is a fast-growing technology that has established itself
in the next generation of the IT industry and business.
28. Case Study
Application
• Call Center surveillance
Background
• Previously – voice data
Goal for a new system
• Monitor data & voice
• Multiple data sources
• Advanced correlations
35. • Consistent Management
• Automation Through the Entire Stack
Reducing the operational complexity
Big Data in the cloud
36. Predictive analytics is the practice of extracting insights from an
existing data set with the help of data mining, statistical modeling, and
machine learning techniques, and using them to predict unobserved/unknown
events. It involves:
Identifying cause-effect relationships across variables in the
historical data.
Discovering hidden insights and patterns with the help of data mining
techniques.
Applying observed patterns to unknowns in the past, present, or future.
Predictive Analytics
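As a tiny illustration of statistical modeling in its simplest form (the data points below are made up for the example), ordinary least squares can fit a trend to historical observations and extrapolate it to an unobserved future period:

```python
# Minimal predictive-analytics sketch: fit y = a + b*x by least squares
# to historical (period, value) data, then predict an unobserved period.
# The observations are illustrative, not from any real data set.

history = [(1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0)]

n = len(history)
sx = sum(x for x, _ in history)
sy = sum(y for _, y in history)
sxx = sum(x * x for x, _ in history)
sxy = sum(x * y for x, y in history)

b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope of the trend
a = (sy - b * sx) / n                           # intercept

prediction = a + b * 5                          # extrapolate to period 5
print(prediction)   # 18.0
```

Real predictive models add many variables, validation against held-out data, and richer techniques (the data mining and machine learning methods named above), but the pattern-from-history idea is the same.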