3. In this session you will learn about:
• Hadoop Ecosystem
• Data discovery
• Open source technology for Big Data Analytics
• Cloud and Big Data
Learning Objectives
4. Apache Hadoop is one of the most powerful open-source tools for Big Data processing.
The Hadoop ecosystem revolves around three main components:
• HDFS
• MapReduce
• YARN
Apart from these core components, there are several other Hadoop
ecosystem components that extend and boost Hadoop's functionality.
Hadoop
6. Hadoop Components
1.1 HDFS
The Hadoop Distributed File System (HDFS) is the primary storage system
of Hadoop.
HDFS stores very large files on a cluster of commodity hardware.
It follows the principle of storing a small number of large files rather
than a huge number of small files.
HDFS stores data reliably even in the case of hardware failure.
It provides high-throughput access to applications through parallel
data access.
8. 1.11 NameNode –
It works as the Master in a Hadoop cluster.
The NameNode stores metadata, i.e. the number of blocks, their replicas,
and other details.
This metadata is kept in memory on the master.
The NameNode assigns tasks to the slave nodes.
It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
Components of HDFS
10. 1.12 DataNode –
It works as a Slave in a Hadoop cluster.
In Hadoop HDFS, the DataNode is responsible for storing the actual data.
The DataNode also performs read and write operations as requested by
clients.
DataNodes can be deployed on commodity hardware.
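To make the NameNode/DataNode split concrete, here is a minimal, hypothetical sketch in plain Python (not Hadoop's actual API): the NameNode keeps an in-memory map from each block to the DataNodes holding its replicas, while the DataNodes store the bytes. The 128 MB block size and replication factor 3 are assumed typical HDFS defaults.

```python
# Hypothetical sketch of NameNode-style metadata: block -> replica locations.
# Block size and replication factor mirror common HDFS defaults.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB
REPLICATION = 3

def split_into_blocks(file_size):
    """Return how many blocks a file of file_size bytes occupies."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE   # ceiling division

def place_replicas(block_id, datanodes):
    """Pick REPLICATION distinct DataNodes for a block (round-robin sketch)."""
    start = block_id % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(REPLICATION)]

datanodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical slave nodes
file_size = 300 * 1024 * 1024                      # a 300 MB file
blocks = split_into_blocks(file_size)              # 3 blocks (128 + 128 + 44 MB)

# The NameNode holds this map in memory; DataNodes hold the actual bytes.
block_map = {b: place_replicas(b, datanodes) for b in range(blocks)}
print(blocks)         # 3
print(block_map[0])   # ['dn1', 'dn2', 'dn3']
```

Because the metadata lives only on the NameNode, losing it loses the file system, which is why the slides stress deploying it on reliable hardware.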
12. 1.2 MapReduce
Hadoop MapReduce is the data processing layer of Hadoop.
It processes large structured and unstructured data stored in HDFS.
MapReduce processes huge amounts of data in parallel.
It does this by dividing the job into a set of independent tasks
(sub-jobs).
In Hadoop, MapReduce works by breaking the processing into two phases:
the map phase and the reduce phase.
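The phase-based model can be illustrated with the classic word-count example. This is a minimal in-process sketch in Python (production Hadoop jobs are normally written in Java against the MapReduce API): the map phase emits (word, 1) pairs, a shuffle step groups pairs by key, and the reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key; here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insight", "big cluster"]   # stand-ins for HDFS splits
pairs = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(pairs))
print(result["big"])   # 3
```

In a real cluster, each input split is mapped on a different node in parallel, which is what makes the independent sub-jobs scale.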
14. 1.3 YARN
Hadoop YARN provides resource management; it is often described as the
operating system of Hadoop.
It is responsible for managing and monitoring workloads and implementing
security controls.
It is a central platform to deliver data governance tools across Hadoop
clusters.
YARN allows multiple data processing engines, such as real-time
streaming and batch processing, to run on the same cluster.
16. Resource Manager –
It is a cluster-level component and runs on the Master machine.
It manages resources and schedules applications running on top of
YARN.
It has two components: the Scheduler and the Application Manager.
Node Manager –
It is a node-level component.
It runs on each slave machine.
It continuously communicates with the Resource Manager to stay up to date.
Components of YARN
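As a rough illustration of the division of labor above, here is a toy sketch (hypothetical names and units, not YARN's real Scheduler API) in which the Resource Manager matches container requests against the free capacity each Node Manager reports:

```python
# Toy ResourceManager-style scheduling sketch. Real YARN tracks memory
# and vcores per NodeManager; here each node just reports free memory in GB.

nodes = {"node1": 8, "node2": 4}   # free memory reported by NodeManagers

def allocate(nodes, request_gb):
    """Place a container on the first node with enough free memory."""
    for name, free in nodes.items():
        if free >= request_gb:
            nodes[name] = free - request_gb   # reserve the capacity
            return name
    return None   # no node can satisfy the request right now

print(allocate(nodes, 6))   # 'node1' (8 GB free)
print(allocate(nodes, 3))   # 'node2' (node1 has only 2 GB left)
```

The continuous Node Manager heartbeats mentioned above are what keep the `nodes` picture accurate in a real cluster.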
17. Data discovery is the collection and analysis of data from various sources
to gain insight from hidden patterns and trends.
It is the first step in fully harnessing an organization’s data to inform
critical business decisions.
Through the data discovery process, data is gathered, combined, and
analyzed in a sequence of steps.
The goal is to make messy and scattered data clean, understandable, and
user-friendly.
Data discovery
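The gather/combine/analyze sequence described above can be sketched in miniature. The sources and fields below are hypothetical; the point is only the shape of the process: gather records from multiple sources, combine and deduplicate them, then analyze the cleaned result.

```python
# Toy data-discovery pipeline over two hypothetical sources.

# Gather: records collected from two systems, with one overlap.
crm = [{"customer": "acme", "spend": 100}, {"customer": "beta", "spend": 50}]
web = [{"customer": "acme", "spend": 100}, {"customer": "gamma", "spend": 75}]

# Combine: merge both sources and drop exact duplicates.
seen, combined = set(), []
for rec in crm + web:
    key = (rec["customer"], rec["spend"])
    if key not in seen:
        seen.add(key)
        combined.append(rec)

# Analyze: a simple aggregate over the cleaned, combined data.
total = sum(r["spend"] for r in combined)
print(len(combined), total)   # 3 225
```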
19. According to Gartner, “Big Data Discovery” is the next big trend in
analytics.
Hottest trends of the last few years in analytics:
Big Data
Data Discovery
Data Science
21. What are the Benefits of Data Discovery?
Data discovery provides a framework for firms to unlock and act upon the
insights contained within their data.
It transforms messy and unstructured data to facilitate and enhance its
analysis. Data discovery allows firms to:
• Gather Actionable Insights
• Save Time
• Scale Data Across Teams
• Clean and Reuse Data
24. We know we want to collect, store, organize,
analyze, and share data.
But we have limited resources.
25. What is Cloud Computing?
Cloud computing is a fast-growing technology that has established itself
in the next generation of the IT industry and business.
28. Case Study
Application
• Call Center surveillance
Background
• Previously – voice data
Goal for a new system
• Monitor data & voice
• Multiple data sources
• Advanced correlations
35. • Consistent Management
• Automation Through the Entire Stack
Reducing the operational complexity
Big Data in the cloud
36. Predictive analytics is the practice of extracting insights from an
existing data set with the help of data mining, statistical modeling, and
machine learning techniques, and using them to predict unobserved/unknown
events. It involves:
Identifying cause-effect relationships across variables in the
historical data.
Discovering hidden insights and patterns with the help of data mining
techniques.
Applying observed patterns to unknowns in the past, present, or future.
Predictive Analytics
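As a tiny illustration of statistical modeling in its simplest form (the data points below are made up for the example), ordinary least squares can fit a trend to historical observations and extrapolate it to an unobserved future period:

```python
# Minimal predictive-analytics sketch: fit y = a + b*x by least squares
# to historical (period, value) data, then predict an unobserved period.
# The observations are illustrative, not from any real data set.

history = [(1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0)]

n = len(history)
sx = sum(x for x, _ in history)
sy = sum(y for _, y in history)
sxx = sum(x * x for x, _ in history)
sxy = sum(x * y for x, y in history)

b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope of the trend
a = (sy - b * sx) / n                           # intercept

prediction = a + b * 5                          # extrapolate to period 5
print(prediction)   # 18.0
```

Real predictive models add many variables, validation against held-out data, and richer techniques (the data mining and machine learning methods named above), but the pattern-from-history idea is the same.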