The document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop addresses the growing volume, variety and velocity of big data through its core components: HDFS for storage, and MapReduce for distributed processing. Key features of Hadoop include scalability, flexibility, reliability and economic viability for large-scale data analytics.
Have you ever heard the buzzword "big data"? Briefly described, big data means collecting massive amounts of data, extracting all the small details and larger trends available in it, summarizing the output, and generating important insights about customers and competitors.
Enterprises seem to have sensed that something is in the air and have started shopping for technology. So what does the world have to offer enterprises that have an unknown number of petabytes flowing through their systems on a daily basis? There are a few options, but very few can match the popularity of Hadoop. Hadoop can store and process large amounts of data. It has a large and diverse toolset for integration, operations and processing, and it is open source!
Big Data Day LA 2016 / Big Data Track - Apply R in Enterprise Applications, Lo... - Data Con LA
Prototypes are typically re-implemented in another language due to compatibility issues with R in the enterprise, but TIBCO Enterprise Runtime for R (TERR) allows the language to be run on several platforms. Enterprise-level scalability has been brought to the R language, enabling rapid iteration without the need to recode, re-implement and test. This presentation will delve further into these topics, highlighting specific use cases and the true value that can be gained from utilizing R. The session will be followed by a lively, open Q&A discussion.
Hadoop is an open source software framework that allows for the distributed storage and processing of extremely large datasets across clusters of commodity hardware. It uses a scalable distributed file system called HDFS to store data reliably, and its MapReduce programming model enables parallel processing of huge datasets across large clusters of servers. The Hadoop ecosystem includes additional popular tools like Pig, Hive, HBase, and Zookeeper that provide SQL-like querying, real-time database access, and coordination services to make the Hadoop platform more full-featured and user-friendly.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Apache Hadoop is an open source software framework created in 2005 for distributed storage and processing of large datasets across clusters of computers. It allows for reliable and scalable computation across large amounts of data and nodes. The core modules include Hadoop Common, HDFS for distributed file storage, YARN for job scheduling and cluster resource management, and MapReduce for distributed processing of large datasets. The Hadoop ecosystem has grown significantly and includes additional popular components such as Pig, Hive, HBase, Zookeeper, Oozie, Flume, Spark and Impala.
Introduction to Hadoop - The Essentials - Fadi Yousuf
This document provides an introduction to Hadoop, including:
- A brief history of Hadoop and how it was created to address limitations of relational databases for big data.
- An overview of core Hadoop concepts like its shared-nothing architecture and using computation near storage.
- Descriptions of HDFS for distributed storage and MapReduce as the original programming framework.
- How the Hadoop ecosystem has grown to include additional frameworks like Hive, Pig, HBase and tools like Sqoop and Zookeeper.
- A discussion of YARN which separates resource management from job scheduling in Hadoop.
Introduction to Hadoop Ecosystem was presented to Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and AWS cloud.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for scalable, fault-tolerant storage and MapReduce for parallel processing. The core components of Hadoop - HDFS and MapReduce - allow for distributed processing of large datasets across commodity hardware, providing capabilities for scalability, cost-effectiveness, and efficient distributed computing.
This document provides an overview and introduction to Hadoop, HDFS, and MapReduce. It covers the basic concepts of HDFS, including how files are stored in blocks across data nodes, and the role of the name node and data nodes. It also explains the MapReduce programming model, including the mapper, reducer, and how jobs are split into parallel tasks. The document discusses using Hadoop from the command line and writing MapReduce jobs in Java. It also mentions some other projects in the Hadoop ecosystem like Pig, Hive, HBase and Zookeeper.
This document provides an overview of Hadoop and Big Data. It begins with introducing key concepts like structured, semi-structured, and unstructured data. It then discusses the growth of data and need for Big Data solutions. The core components of Hadoop like HDFS and MapReduce are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.
Hadoop Basics - Apache Hadoop Bigdata training by Design Pathshala
Learn Hadoop and Big Data analytics: join Design Pathshala training programs on big data and analytics.
This slide covers the basics of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Having trouble distinguishing Big Data, Hadoop and NoSQL, or finding the connections among them? This slide deck from the Savvycom team can definitely help you.
Enjoy reading!
This document discusses how the Dachis Group uses Cassandra and Hadoop for social business intelligence. They collect raw social media data and normalize it for analysis in Cassandra. Hadoop is used to calculate foundational metrics. The data is enriched and analyzed using Pig and Oozie workflows. Metrics are stored in Postgres. They launched products like the Social Business Index and Social Performance Monitor to measure social media effectiveness for companies. Lessons learned include dealing with big data bugs and involvement in open source communities.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... - DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads in modern data centers.
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Large datasets: terabytes or petabytes of data.
Large clusters: hundreds or thousands of nodes.
This presentation provides an overview of Hadoop, including:
- A brief history of data and the rise of big data from various sources.
- An introduction to Hadoop as an open source framework used for distributed processing and storage of large datasets across clusters of computers.
- Descriptions of the key components of Hadoop - HDFS for storage, and MapReduce for processing - and how they work together in the Hadoop architecture.
- An explanation of how Hadoop can be installed and configured in standalone, pseudo-distributed and fully distributed modes.
- Examples of major companies that use Hadoop like Amazon, Facebook, Google and Yahoo to handle their large-scale data and analytics needs.
The document provides information about Hadoop, its core components, and MapReduce programming model. It defines Hadoop as an open source software framework used for distributed storage and processing of large datasets. It describes the main Hadoop components like HDFS, NameNode, DataNode, JobTracker and Secondary NameNode. It also explains MapReduce as a programming model used for distributed processing of big data across clusters.
This document summarizes the history and evolution of Apache Hadoop over the past 10 years. It discusses how Hadoop originated from Doug Cutting's work on Nutch in 2002. It grew to include HDFS for storage and MapReduce for processing. Yahoo was an early large-scale user. The community has expanded Hadoop to include over 25 components like Hive, HBase, Spark and more. The open source model and ability to adapt have helped Hadoop succeed and it will continue to evolve to handle new data sources and cloud deployments in the next 10 years.
This document provides an overview of Apache Hadoop, a framework for distributed storage and processing of very large datasets across clusters of commodity hardware. It discusses how Hadoop addresses challenges of large-scale computation by distributing data across nodes and moving computation to the data. The key components of Hadoop are HDFS for distributed file storage, MapReduce for distributed processing, and HBase for distributed database access. Hadoop allows scaling to very large datasets using inexpensive, commodity hardware.
Introduction to Big Data & Hadoop Architecture - Module 1 - Rohit Agrawal
Learning Objectives - In this module, you will understand what Big Data is, the limitations of existing solutions to the Big Data problem, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of file writes and reads.
The document provides an introduction to big data and Apache Hadoop. It discusses big data concepts like the 3Vs of volume, variety and velocity. It then describes Apache Hadoop including its core architecture, HDFS, MapReduce and running jobs. Examples of using Hadoop for a retail system and with SQL Server are presented. Real world applications at Microsoft and case studies are reviewed. References for further reading are included at the end.
The document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis of web logs. It discusses challenges of Hadoop implementation and various architectural considerations for data storage, modeling, ingestion, processing and what specific processing needs to happen for the case study. These include sessionization, filtering, and business intelligence/discovery. Storage options, file formats, schema design, and processing engines like MapReduce, Spark and Impala are also covered.
- Hadoop is a framework for managing and processing big data distributed across clusters of computers. It allows for parallel processing of large datasets.
- Big data comes from various sources like customer behavior, machine data from sensors, etc. It is used by companies to better understand customers and target ads.
- Hadoop uses a master-slave architecture with a NameNode master and DataNode slaves. Files are divided into blocks and replicated across DataNodes for reliability. The NameNode tracks where data blocks are stored.
Eric Baldeschwieler Keynote from Storage Developers Conference - Hortonworks
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable storage of petabytes of data and large-scale computations across commodity hardware.
- Apache Hadoop is used widely by internet companies to analyze web server logs, power search engines, and gain insights from large amounts of social and user data. It is also used for machine learning, data mining, and processing audio, video, and text data.
- The future of Apache Hadoop includes making it more accessible and easy to use for enterprises, addressing gaps like high availability and management, and enabling partners and the community to build on it through open APIs and a modular architecture.
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op... - Alex Gorbachev
Modern big data solutions often incorporate Hadoop as one of the components and require the integration of Hadoop with other components including Oracle Database. This presentation explains how Hadoop integrates with Oracle products focusing specifically on the Oracle Database products. Explore various methods and tools available to move data between Oracle Database and Hadoop, how to transparently access data in Hadoop from Oracle Database, and review how other products, such as Oracle Business Intelligence Enterprise Edition and Oracle Data Integrator integrate with Hadoop.
Content presented at a talk on Aug. 29th. The purpose is to inform a fairly technical audience about the primary tenets of Big Data and the Hadoop stack. Also includes a walk-through of Hadoop and some of the Hadoop stack, i.e. Pig, Hive, HBase.
Architecting the Future of Big Data and Search - Hortonworks
The document discusses the potential for integrating Apache Lucene and Apache Hadoop technologies. It covers their histories and current uses, as well as opportunities and challenges around making them work better together through tighter integration or code sharing. Developers and businesses are interested in ways to improve searching large amounts of data stored using Hadoop technologies.
Enough talking about Big Data and Hadoop; let's see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations on it, save our result, and show it via a BI tool.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses limitations in traditional RDBMS for big data by allowing scaling to large clusters of commodity servers, high fault tolerance, and distributed processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing. Hadoop has an ecosystem of additional tools like Pig, Hive, HBase and more. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
Data Warehouse on Hadoop Based System In Action - Frank Y
Generally describes the Hadoop-based data warehouse building approach and the issues people commonly have, as well as the tools available in the Hadoop-based big data ecosystem.
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2 - tcloudcomputing-tw
The presentation is designed for those interested in Hadoop technology and can enhance your knowledge of Hadoop, covering community history, current development status, service features, the distributed computing framework, and scenarios of big data development in the enterprise.
This document provides an overview of Big Data and Hadoop. It discusses how companies are generating large amounts of data daily. Hadoop was created to handle such large volumes of data across clusters of commodity hardware. Key aspects of Hadoop covered include its history, architecture, ability to scale out across clusters, and ability to process data in parallel across nodes. Hadoop also aims to abstract complexity and handle failures which are common given the large number of machines in clusters. The document compares Hadoop to relational databases and explains how Hadoop is better suited to semi-structured and unstructured data.
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014) - Eric Baldeschwieler
A summary of the History of Hadoop, some observations about the current state of Hadoop for new users and some predictions about its future (Hint, it's gonna be huge).
Presented at:
http://www.meetup.com/Pasadena-Big-Data-Users-Group/events/203961192/
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos - Lester Martin
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of large data sets across clusters of commodity hardware. The core of Hadoop includes a storage part called HDFS for reliable data storage, and a processing part called MapReduce that processes data in parallel on a large cluster. Hadoop also includes additional projects like Hive, Pig, HBase, Zookeeper, Oozie, and Sqoop that together form a powerful data processing ecosystem.
This document provides an overview of Hadoop infrastructure and related technologies:
- Hadoop is based on Apache's implementation of Google's BigTable and uses Java VMs to parse instructions. It allows reading, writing, and manipulating very large datasets using sequential writes and column-based file structures in HDFS.
- HDFS is the backend file system for Hadoop that allows for easy node management and operability. Technologies like HBase can augment or replace HDFS.
- Middleware like Hive, Pig, and Cassandra help connect to and utilize Hadoop. Each has different uses - Hive is a data warehouse, Pig uses its own query language, and Sqoop connects databases and datasets.
The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and will reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and an ecosystem of related projects like Hive, HBase, Pig and Zookeeper that provide additional functions. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and additional tools like Hive, Pig, HBase, Zookeeper, Flume, Sqoop and Oozie that make up its ecosystem. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
The document discusses big data solutions for an enterprise. It analyzes Cloudera and Hortonworks as potential big data distributors. Cloudera can be deployed on Windows but may not support integrating existing data warehouses long-term. Hortonworks better supports integration with existing infrastructure and sees data warehouses as integral. Both have pros and cons around costs, licensing, and proprietary software.
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
BIG DATA ANALYTICS WITH HADOOP
2. AGENDA
INTRODUCTION
BIG DATA REQUIREMENTS?
DATA GROWTH
HADOOP
HADOOP HISTORY
HADOOP DEVELOPMENT HISTORY
HADOOP TOOLS
REFERENCES
3. INTRODUCTION
Big Data... What does it mean?
Volume
- Big data comes at large scale; it is in TB, even PB.
- Records, transactions, tables, files.
Velocity
- Data flows continuously; time-sensitive, streaming flow.
- Batch, real time, streams, historic.
Variety
- Big data includes structured, semi-structured, unstructured and every other variety.
- Text, files, logs, XML, audio, video, streams, flat files, etc.
Veracity
- Quality, consistency, reliability, and provenance of data.
- Good, bad, incomplete, undefined.
Ratio?
- 20% structured
- 80% unstructured
4. Big Data Requirements
Technology Requirements?
- No technology stack required
- Fresher or experienced, it doesn't matter
- Better to know (but not required):
- Java & Linux
Hardware Requirements?
- No need to purchase anything
- Better to have: a 64-bit machine
7. Hadoop
A large-scale distributed processing Apache framework
Creator of Hadoop: Doug Cutting
No high-end server/machine required, only commodity servers
Open-source framework and implementation of Google MapReduce
Efficient, reliable, easy to use
Stores & processes large amounts of data
Performance, storage and processing scale linearly
Simple core, modular and extensible
Manageable and self-healing
Cost-effective hardware
8. Hadoop History
2002-2004 – Doug Cutting started working on Nutch
2003-2004 – Google publishes the GFS and MapReduce white papers
2004 – Doug Cutting adds DFS and MapReduce support to Nutch
Yahoo! hires Doug Cutting and builds a team to develop Hadoop
2007 – The NY Times converts 4 TB of archives using Hadoop on Amazon EC2
Web-scale deployments at Yahoo!, Facebook, Twitter
May 2009 – Yahoo! does the fastest sort of 1 TB, in 62 seconds over 146 nodes
11. APACHE HIVE:SQL ON HADOOP
OSS data warehouse built on top of Hadoop
First Apache Hive released in 2009
Initial goal was to make it possible to write MapReduce jobs in SQL
- Most queries ran from minutes to hours
- Primarily used for batch processing
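To make "SQL on Hadoop" concrete, below is a minimal sketch of running a Hive query from Java over JDBC. It assumes a reachable HiveServer2 instance and the Hive JDBC driver on the classpath; the connection URL, credentials, and the page_views table are hypothetical placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (needed on older JDBC versions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; adjust host/port/database for your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Plain SQL; Hive compiles it into batch jobs on the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}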
13. HDFS-master and slaves
Slaves (DataNodes) store blocks of data and serve them to the client; the master (NameNode) manages all metadata information.
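As a minimal sketch of how a client interacts with this master/slave split, the Java snippet below writes and reads a file through the HDFS FileSystem API: the NameNode is consulted for metadata and block locations, while the bytes themselves stream to and from DataNodes. The NameNode address and file path here are hypothetical; in practice the address is picked up from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write: the NameNode allocates blocks; the client streams bytes to DataNodes.
        Path file = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block locations come from NameNode metadata, data from DataNodes.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
    }
}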
14. SQOOP
Couldn’t I just do this with a shell script
– What year you live 2001? No there is a better way
Structured data already captured in databases should be used with
unstructured data in Hadoop
Tedious “glue” code necessary to wrap database records for
consumption in Hadoop
Large amount of log data to process
Apache open source software
Bulk data transfer tool
– Import/Export from/to relational databases,
– enterprise data warehouses, and NoSQL systems
– Populate tables in HDFS, Hive, and HBase
– Integrate with Oozie
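Sqoop is normally driven from the command line, but it also exposes a Java entry point that accepts the same arguments. Below is a minimal sketch, assuming Sqoop 1.x on the classpath; the JDBC connection string, credentials, table, and target directory are hypothetical placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Same arguments as the sqoop CLI; values here are hypothetical.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",
            "--username", "etl",
            "--table", "orders",            // relational table to pull into Hadoop
            "--target-dir", "/data/orders", // HDFS directory to populate
            "--num-mappers", "4"            // parallel map tasks doing the transfer
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}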
15. MAP REDUCE
MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was submitted in 2004)
Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner
- Petabytes of data
- Thousands of nodes
- Computational processing occurs on both:
  Unstructured data: file system
  Structured data: database
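To ground the model, here is the canonical word-count job written against the Hadoop MapReduce Java API: the mapper emits a (word, 1) pair per token, and the reducer sums the counts for each word, with the framework handling input splitting, shuffling, and fault tolerance. A minimal sketch; input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));  // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would typically be launched with something like: hadoop jar wordcount.jar WordCount /input /output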