What is Big Data?
Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-hand database management tools or
traditional data processing applications.
Who is a Data Scientist?
Data science incorporates varying
elements and builds on techniques and
theories from many fields, including
mathematics, statistics, data
engineering, pattern recognition and
learning, advanced computing,
visualization, uncertainty modeling,
data warehousing, and high
performance computing with the goal
of extracting meaning from data and
creating data products. A practitioner of
data science is called a data scientist.
MapReduce & HDFS
• MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster (a group of
connected computers that work together so that in many respects
they can be viewed as a single system)
• A MapReduce program comprises a Map() procedure that performs
filtering and sorting, and a Reduce() procedure that performs a
summary operation, such as counting or merging the intermediate
results (a minimal word-count sketch appears after this list)
• The "MapReduce System" orchestrates by marshaling the
distributed servers, running the various tasks in parallel, managing
all communications and data transfers between the various parts of
the system, providing for redundancy and fault tolerance, and
overall management of the whole process.
• HDFS is a distributed, scalable, and portable file system written in
Java for the Hadoop framework (a short client-API sketch also
follows this list).
• Apache Hadoop is an open-source software framework that
supports data-intensive distributed applications, licensed under the
Apache v2 license.
• Effectively, it implements the MapReduce programming model and
provides a distributed file system (HDFS).
• It supports the running of applications on large clusters of
commodity hardware.
• Hadoop makes it possible to run applications on systems with
thousands of nodes involving thousands of terabytes of data. Its distributed
file system facilitates rapid data transfer rates among nodes and
allows the system to continue operating uninterrupted in case of a
node failure. This approach lowers the risk of catastrophic system
failure, even if a significant number of nodes become inoperative.
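As a concrete illustration of the Map()/Reduce() split described in the
bullets above, here is the canonical word-count job, following the
standard Apache Hadoop tutorial pattern. The class names (WordCount,
TokenizerMapper, IntSumReducer) are conventional choices and the
input/output paths come from the command line; treat this as a minimal
sketch rather than production code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): emit (word, 1) for every word in this mapper's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce(): sum the counts for each word (the "summary operation").
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The combiner reuses the reducer logic on each mapper's local output,
which cuts down the data shuffled across the cluster before the real
reduce phase runs.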
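To make the HDFS bullet concrete, the following minimal sketch writes a
small file into HDFS and reads it back through the
org.apache.hadoop.fs.FileSystem client API. The NameNode address
hdfs://namenode:9000 and the path /tmp/hello.txt are illustrative
assumptions; in a real deployment fs.defaultFS would come from
core-site.xml rather than being set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder address; normally read from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hello.txt");

    // Write: the client streams the data to DataNodes, and HDFS
    // replicates each block for fault tolerance.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back from the cluster.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}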