Intro to Big Data Hadoop

Dr. Sandeep G. Deshmukh
Introduction to

Contents
❑ Big Data
❑ Distributed Systems
❑ Hadoop
➢ Hadoop Distributed File System (HDFS)
➢ MapReduce

Definition
Big data is data that exceeds the processing capacity of conventional database systems.
The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.
To gain value from this data, you must choose an alternative way to process it.
https://www.oreilly.com/ideas/what-is-big-data

Volume
Quantity of data
Data sets too large to store and analyze using traditional databases

Velocity
Speed at which data is generated
Speed at which data is moving around
and analyzed
Analyze data while it is being generated
without even putting it into databases
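
A minimal sketch of analyzing data while it is being generated: a running mean that is updated per reading and keeps no history, so nothing has to be stored first. The `sensor_stream` source and its values are hypothetical stand-ins for a live feed, not part of the original slides.

```python
from typing import Iterable, Iterator, Tuple


def running_mean(readings: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean so far) after each reading, storing no history."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        # Incremental update: only the count and the current mean are kept.
        mean += (value - mean) / count
        yield count, mean


def sensor_stream() -> Iterator[float]:
    """Hypothetical stand-in for a live source such as a sensor or a log feed."""
    for value in [10.0, 20.0, 30.0, 40.0]:
        yield value


if __name__ == "__main__":
    for count, mean in running_mean(sensor_stream()):
        print(f"after {count} readings: mean = {mean}")
```

Because the function consumes any iterable lazily, the same code works whether the readings come from a list or from an unbounded generator.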

Variety
Different types of data that we can use

Veracity
Messiness or trustworthiness of the data
Volume often makes up for a lack of quality
E.g. tweets with spelling mistakes and shortened words (u -> you, thr -> there)
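
One way to handle messy text of this kind is a simple lookup that expands known short forms before analysis. This is only an illustrative sketch: the `SHORT_FORMS` table is hypothetical and contains just the two examples from the slide; a real table would be much larger.

```python
import re

# Hypothetical lookup table built from the two examples on the slide.
SHORT_FORMS = {"u": "you", "thr": "there"}


def normalize(text: str) -> str:
    """Replace known short forms with full words; leave other tokens unchanged."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: SHORT_FORMS.get(m.group(0).lower(), m.group(0)),
        text,
    )


if __name__ == "__main__":
    print(normalize("r u thr?"))  # "r" is not in the table, so it stays as-is
```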

Value
Getting value out of big data

Definition
“Big data” is
high-volume, -velocity and -variety information assets
that demand cost-effective, innovative forms of information processing
for enhanced insight and decision making
By Gartner

Definition
Big data is a term for
data sets that are so large or complex that traditional data processing applications
are inadequate
Challenges include analysis, capture, data curation, search, sharing, storage,
transfer, visualization, querying, updating and information privacy.
The term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of
data set.
Accuracy in big data may lead to more confident decision making, and better
decisions can result in greater operational efficiency, cost reduction and reduced
risk.
Wikipedia

Use Case: Big Data in Oil & Gas Drilling
http://analytics-magazine.org/images/stories/novdec12/big-data.jpg

Use Case: Uber - Pay Surge Pricing if Battery is Low

Further Reading
● A Brief History of Big Data Everyone Should Read
● Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity
● What is big data? - OpenSource.com
● What is big data? - O’Reilly
● 5 Big Data Use Cases To Watch
● Best Big Data Analytics Use Cases
● The 5 game changing big data use cases
● Big Data - The 5 Vs Everyone Must Know
● Top SlideShare Presentations on Big Data

Definition
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
Distributed Systems: Principles and Paradigms, 2nd Edition, Andrew S. Tanenbaum, Maarten Van Steen, 2006
http://www.mypearsonstore.com/bookstore/distributed-systems-principles-and-paradigms-9780132392273?xid=PSED

Forms of Transparency in Distributed Systems

Transparency   Description
Access         Hide differences in data representation and how a resource is accessed
Location       Hide where a resource is located
Migration      Hide that a resource may move to another location
Relocation     Hide that a resource may be moved to another location while in use
Replication    Hide that a resource is replicated
Concurrency    Hide that a resource may be shared by several competitive users
Failure        Hide the failure and recovery of a resource

Source: Distributed Systems: Principles and Paradigms, 2nd Edition, Andrew S. Tanenbaum, Maarten Van Steen, 2006
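
As a rough illustration of two rows of the table (replication and failure transparency), the sketch below puts a single `read` call in front of several copies of the same data. The `Replica` and `TransparentStore` classes are hypothetical in-memory stand-ins invented for this example; they do not correspond to any real library or to HDFS internals.

```python
from typing import Dict, List, Optional


class Replica:
    """Hypothetical in-memory stand-in for one machine holding a copy of the data."""

    def __init__(self, name: str, data: Dict[str, str], alive: bool = True) -> None:
        self.name = name
        self.data = data
        self.alive = alive

    def read(self, key: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.data[key]


class TransparentStore:
    """Client-facing facade: callers see one store, not the replicas behind it."""

    def __init__(self, replicas: List[Replica]) -> None:
        self._replicas = replicas

    def read(self, key: str) -> str:
        last_error: Optional[Exception] = None
        for replica in self._replicas:
            try:
                return replica.read(key)
            except ConnectionError as error:
                last_error = error  # this copy failed; try the next one
        raise RuntimeError("all replicas failed") from last_error


if __name__ == "__main__":
    data = {"block-1": "hello"}
    store = TransparentStore([
        Replica("node-a", dict(data), alive=False),  # simulated failed machine
        Replica("node-b", dict(data)),
    ])
    print(store.read("block-1"))  # the caller does not know node-a was down
```

The caller only ever names the key, never a machine, which is the point of the Location, Replication and Failure rows above.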

● A distributed system consists of components (i.e., computers) that are autonomous
● Users (be they people or programs) think they are dealing with a single system. This means that one way or
the other the autonomous components need to collaborate. How to establish this collaboration lies at the
heart of developing distributed systems.

Definition
A distributed system is a model in which components located on networked
computers communicate and coordinate their actions by passing messages.
The components interact with each other in order to achieve a common goal.
Three significant characteristics of distributed systems are: concurrency of
components, lack of a global clock, and independent failure of components.
Wikipedia
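
The idea of components that coordinate only by passing messages can be sketched on a single machine. In the example below, threads stand in for networked computers and `queue.Queue` stands in for the network; both are simplifying assumptions for illustration, and the function names are made up for this sketch.

```python
import queue
import threading
from typing import List


def worker(inbox: "queue.Queue", outbox: "queue.Queue") -> None:
    """A component: receives a chunk as a message, replies with its partial sum."""
    chunk = inbox.get()
    outbox.put(sum(chunk))


def distributed_sum(chunks: List[List[int]]) -> int:
    """Coordinator: sends one chunk to each worker and combines the replies."""
    results: "queue.Queue" = queue.Queue()
    threads = []
    for chunk in chunks:
        inbox: "queue.Queue" = queue.Queue()
        thread = threading.Thread(target=worker, args=(inbox, results))
        thread.start()
        inbox.put(chunk)  # data reaches a worker only as a message
        threads.append(thread)
    # Workers run concurrently, so replies may arrive in any order.
    total = sum(results.get() for _ in chunks)
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(distributed_sum([[1, 2, 3], [4, 5], [6]]))
```

The workers share no variables and no clock; the common goal (the total) is reached purely by exchanging messages. Independent failure is not modeled here.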

Further Reading
● Distributed Computing - Wikipedia
● Distributed computing
● Characteristics of distributed system

Big Data Primers: Size does matter

Big Data Primers: Vertical Vs Horizontal Scaling
Vertical Scaling Horizontal Scaling

Big Data Primers: The scale of infrastructure
