Before moving to Hadoop, one must understand why we need it even though all sorts of RDBMS are available in the market. This presentation gives a good understanding of big data and enables you to properly analyze use cases for big data problems.
3. No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
4. “Big Data” is similar to ‘small data’ but bigger in size.
Having bigger data, however, requires a different approach
- techniques, tools and architecture.
The aim is to solve new problems, or old problems in a
better way.
Big Data generates value from the storage and
processing of very large quantities of digital
information that can’t be analyzed with traditional
computing techniques.
7. A typical PC might have had 10 GB of storage in 2000.
There are around 6,000 tweets every second, which works out to
over 350,000 tweets per minute and around 500 million tweets per day.
Facebook has over 1.55 billion active users per month and around
1.39 billion mobile active users. Every minute on Facebook, 510,000
comments are posted, 293,000 statuses are updated and 136,000
photos are uploaded.
Smartphones and the data they create and consume, along with
sensors embedded in everyday objects, will soon result in billions of
new, constantly updated data feeds containing environment, location
and other information, including video.
8. Clickstreams and ad impressions capture user behavior at
millions of events per second.
High-frequency stock trading algorithms reflect market changes
within microseconds.
Machine-to-machine processes exchange data between billions of
devices.
Infrastructure and sensors generate massive log data in real time.
Online gaming systems support millions of concurrent users,
each producing multiple inputs per second.
9. Big Data isn't just numbers, dates and strings. Big data is
also geospatial data, audio and video, and unstructured
text, including log files and social media.
Traditional database systems were designed to address
smaller volumes of structured data, fewer updates and a
predictable, consistent data structure.
Big data analysis includes different types of data.
14. • Automatically generated by a machine
(e.g., a sensor embedded in an engine)
• Typically an entirely new source of data
(e.g., use of the internet)
• Not designed to be friendly
(e.g., text streams)
• May not have much value
(need to focus on the important parts)
15. Analysis of data is a process of
inspecting, cleaning, transforming,
and modeling data with the goal of
discovering useful information,
suggesting conclusions, and
supporting decision-making.
Data analysis has multiple facets
and approaches, encompassing
diverse techniques under a variety
of names, in different business,
science, and social science
domains.
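The inspect → clean → transform → model steps named above can be sketched with a toy example. The sensor readings and the Celsius-to-Fahrenheit transform here are hypothetical, purely for illustration; they are not from the presentation:

```python
# Hypothetical raw sensor readings, some of them malformed.
raw_readings = ["21.5", "19.8", "", "bad", "22.1", "20.4"]

# Inspect: count how many raw records arrived.
total = len(raw_readings)

# Clean: drop records that are empty or not numeric.
cleaned = []
for r in raw_readings:
    try:
        cleaned.append(float(r))
    except ValueError:
        pass  # discard malformed records

# Transform: convert Celsius to Fahrenheit.
fahrenheit = [c * 9 / 5 + 32 for c in cleaned]

# Model (trivially): summarize with a mean to support a decision.
mean_f = sum(fahrenheit) / len(fahrenheit)
print(total, len(cleaned), round(mean_f, 1))  # 6 4 69.7
```

Real analyses replace each step with far richer tooling, but the pipeline shape stays the same.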
29. • Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting,
who was working at Yahoo! at the time, named it after his son’s toy
elephant.
• It was originally developed to support distribution for the Nutch search
engine project.
• After six years of gestation, Hadoop reached 1.0.0, which included
support for security:
HBase (append/hsynch/hflush, and security)
webhdfs (with full support for security)
performance-enhanced access to local files for HBase
other performance enhancements, bug fixes, and features
30. Apache Hadoop is an open-source software framework for the
distributed storage and distributed processing of very large data sets,
written in Java with some native code in C and command-line utilities
written as shell scripts.
It runs on computer clusters built from commodity hardware.
The core of Apache Hadoop consists of a storage part, known as the
Hadoop Distributed File System (HDFS), and a processing part
called MapReduce.
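The MapReduce model just mentioned can be illustrated with a single-process sketch (this is a conceptual simulation in plain Python, not the Hadoop API): map emits key/value pairs, a shuffle step groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return key, sum(values)

lines = ["big data needs new tools", "big data is big"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # 3 2
```

In Hadoop the map and reduce functions run in parallel across many nodes, with HDFS supplying the input splits and the framework performing the shuffle over the network.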
31. • On July 6, 2015, Apache Hadoop came out with its latest stable release,
2.7.1.
• Hadoop 2.7.1 comes after 131 bug fixes and patches since the
previous release, 2.7.0. Please look at the 2.7.0 section below for the
list of enhancements enabled by that first stable release of the 2.7.x
line, along with the bug fixes.
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-
common/releasenotes.html
35. A cluster is a set of commodity hardware, or nodes. Multiple nodes form a rack. This is the
hardware part of the infrastructure.
HDFS (Hadoop Distributed File System) provides the space for data storage with some
replication factor. It is the file system of Hadoop.
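What a replication factor means can be sketched with a toy placement routine (this is not HDFS code; the round-robin policy below is a stand-in for HDFS's actual rack-aware placement): each block of a file is stored on several different nodes, so losing one node loses no data.

```python
# Toy model: place each block on REPLICATION_FACTOR distinct nodes.
REPLICATION_FACTOR = 3
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1"]

placement = {}
for i, block in enumerate(blocks):
    # Round-robin stand-in for HDFS's rack-aware placement policy.
    placement[block] = [nodes[(i + r) % len(nodes)]
                        for r in range(REPLICATION_FACTOR)]

print(placement)
# {'blk_0': ['node1', 'node2', 'node3'], 'blk_1': ['node2', 'node3', 'node4']}
```

With the default replication factor of 3, HDFS tolerates the failure of any single node (and, with rack awareness, an entire rack) without losing a block.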
MapReduce is a programming model to process large sets of data. It has a
Map phase
Partitioner phase
Sort phase
Shuffle phase
Combiner phase
Reducer phase
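Two of the less obvious phases above, the partitioner and the combiner, can be sketched in a toy walk-through (an assumed simplification, not Hadoop's implementation; the character-sum hash stands in for Hadoop's default hash partitioner):

```python
from collections import defaultdict

NUM_REDUCERS = 2

def partition(key):
    # Partitioner: route a key to one reducer. A stand-in for
    # Hadoop's default hash partitioner.
    return sum(map(ord, key)) % NUM_REDUCERS

def combine(pairs):
    # Combiner: pre-aggregate map output locally on the mapper
    # to cut shuffle traffic across the network.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

# Map-phase output from one mapper: (word, 1) pairs.
mapped = [("big", 1), ("hadoop", 1), ("big", 1)]
combined = combine(mapped)  # [("big", 2), ("hadoop", 1)]

# Partition/shuffle: each reducer receives only its share of the keys.
buckets = defaultdict(list)
for key, value in combined:
    buckets[partition(key)].append((key, value))

# Reduce phase: sum the values per key inside each reducer.
totals = {}
for bucket in buckets.values():
    for key, value in bucket:
        totals[key] = totals.get(key, 0) + value
print(totals)  # {'big': 2, 'hadoop': 1}
```

The combiner is an optimization and must be safe to apply zero or more times; the partitioner is what guarantees all values for one key reach the same reducer.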
YARN (Yet Another Resource Negotiator) is the framework responsible for
providing the computational resources (e.g., CPUs, memory) needed for application
execution. Two important elements are: