Big Data and Hadoop

A Glimpse of Big Data and Hadoop
Outline:
• Introduction to Big Data
• Big Data Architecture: Tools and Technologies
• What is Hadoop?
• Key Distinctions of Hadoop
• Core Hadoop components

What is Big Data?
• Big Data is a term for a collection of data sets so large and complex that
they become difficult to process using on-hand database management
tools or other traditional approaches.
• Very large volumes of data (terabytes to petabytes)
• A combination of structured and unstructured data

Big Data in four words (the four Vs):
• Data Volume
• Data Velocity
• Data Variety
• Data Veracity

Challenges:
• Data capture
• Storage
• Search
• Sharing
• Analytics
• Visualization

Big Data Architecture: Tools and Technologies
Hadoop
• Low-cost, reliable scale-out architecture
• Distributed computing
• Proven success in Fortune 500 companies
• Exploding interest
NoSQL Databases
• Huge horizontal scaling and high availability
• Highly optimized for retrieval and appending
• Types
  • Document stores
  • Key-value stores
  • Graph databases
Analytic RDBMS
• Optimized for bulk-load and fast aggregate query workloads
• Types
  • Column-oriented
  • MPP (massively parallel processing)
  • In-memory

What is Hadoop?
• Apache Hadoop is an open source framework for distributed storage and
processing of large sets of data on commodity hardware. Hadoop enables
businesses to quickly gain insight from massive amounts of structured and
unstructured data.
• Hadoop was created by Doug Cutting and Mike Cafarella
• It is designed to scale up from a single server to thousands of machines
• Hadoop provides a reliable shared storage and analysis system

Hadoop History

Why move to Hadoop?
Hadoop is in high demand because it:
• Allows distributed processing of large data sets across clusters of
computers using a simple programming model.
• Is cheaper to use than traditional proprietary technologies from
vendors such as Oracle and IBM, because it runs on low-cost
commodity hardware.
• Has become a de facto standard for storing, processing, and
analyzing hundreds of terabytes to petabytes of data.
• Can handle all types of data from disparate systems, such as
server logs, emails, sensor readings, and images.

Hadoop core components:
• Hadoop is a system for large-scale data processing
• It has two main components:
  • Hadoop Distributed File System (HDFS):
      Distributed across “nodes”
      Natively redundant
      NameNode tracks block locations
  • MapReduce:
      Splits a task across processors
      Shuffle and sort
      Clustered storage
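
To make the NameNode's role concrete, here is a minimal Python sketch, not Hadoop code. A file is split into fixed-size blocks, each block is assigned to several DataNodes, and a metadata table records where the replicas live. The class name `ToyNameNode`, the round-robin placement, and the 4-byte block size are assumptions chosen for illustration; real HDFS uses much larger blocks (128 MB by default in recent releases) and a rack-aware placement policy.

```python
BLOCK_SIZE = 4    # bytes per block in this toy (assumption); HDFS blocks are far larger
REPLICATION = 3   # matches the HDFS default replication factor


class ToyNameNode:
    """Hypothetical stand-in for a NameNode: stores metadata only, never file data."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}   # (filename, block index) -> list of DataNode names
        self._next = 0        # round-robin cursor (illustrative placement only)

    def add_file(self, filename, data):
        """Split data into blocks, record replica locations, return the block count."""
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        copies = min(REPLICATION, len(self.datanodes))
        for index in range(len(blocks)):
            replicas = [
                self.datanodes[(self._next + r) % len(self.datanodes)]
                for r in range(copies)
            ]
            self._next += 1
            self.block_map[(filename, index)] = replicas
        return len(blocks)

    def locate(self, filename, index, dead=()):
        """Return the DataNodes that still hold a block, skipping failed nodes."""
        return [n for n in self.block_map[(filename, index)] if n not in dead]


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    namenode.add_file("log.txt", b"0123456789")
    # Losing one DataNode still leaves two replicas of every block.
    print(namenode.locate("log.txt", 2, dead={"dn3"}))
```

Because every block has replicas on more than one node, the loss of a single DataNode does not make the file unreadable, which is what "natively redundant" refers to.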

Hadoop Distributed File System
• HDFS is the primary distributed storage used by Hadoop applications.
• HDFS was designed to be a scalable, fault-tolerant, distributed storage
system that works closely with MapReduce.
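• HDFS supports shell-like commands for interacting with the file system directly.
• Features of HDFS are:
  • Rack awareness (replica placement takes a node's physical rack into account)
  • Minimal data motion (computation is moved to the data where possible)
  • Utilities (for checking file system health and rebalancing data)
  • Highly operable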
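
Rack awareness can be illustrated with a small Python sketch of the replica placement commonly described for HDFS with a replication factor of 3: the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The function name `place_replicas` and the `topology` dictionary are hypothetical, and this is a simplification rather than the actual HDFS implementation.

```python
import random


def place_replicas(topology, writer, rng=None):
    """Pick three nodes for one block.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    writer:   the node that is writing the block.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    # Simplifying assumption: only consider remote racks with at least two nodes,
    # so the second and third replicas can share a rack.
    candidates = sorted(
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    )
    if not candidates:
        raise ValueError("need a second rack with at least two nodes")

    remote_rack = rng.choice(candidates)
    second, third = rng.sample(sorted(topology[remote_rack]), 2)
    return [writer, second, third]


if __name__ == "__main__":
    racks = {"rack1": ["a", "b"], "rack2": ["c", "d"], "rack3": ["e"]}
    print(place_replicas(racks, "a"))
```

Spreading replicas over two racks means a whole-rack failure cannot destroy every copy of a block, while keeping two replicas in one rack limits cross-rack traffic during writes.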

MapReduce:
• MapReduce is a framework for processing parallelizable problems
across huge datasets
• Runs on a cluster or a grid of machines
• MapReduce’s key benefits are:
  • Simplicity
  • Scalability
  • Speed
  • Built-in recovery
  • Minimal data motion
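
The programming model can be sketched in plain Python with the classic word-count example. This runs in a single process and only mimics the three stages (map, shuffle and sort, reduce); it does not use the Hadoop API, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
    # prints {'big': 2, 'data': 2, 'hadoop': 1}
```

In a real cluster, many map tasks run in parallel on the nodes that hold the input blocks, and the framework performs the shuffle over the network before the reduce tasks start.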

Open Discussion


Thank you
Presented by:
Prashanth Yennampelli
pyennamp@gmail.com
