Big Data and Hadoop
Big Data and Hadoop Introduction
The Solution (Hadoop Evolution)
Processing data at this scale with a traditional RDBMS is impractical
Challenges in Big Data
• Storage – petabyte (PB) scale
• Processing – in a timely manner
• Variety of data – structured / semi-structured / unstructured
To Overcome Big Data Challenges
• Cost-effective – commodity hardware
• Big cluster (e.g. 1,000 nodes) – provides both storage and processing
• Parallel processing – MapReduce
• Big storage – (disk per node × number of nodes) / replication factor (RF)
• Failover mechanism – automatic failover
• Data distribution
• MapReduce framework
• Moving code to the data
• Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration)
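The "Big storage" formula above can be sketched as a quick calculation. The node count, disk size, and replication factor below are illustrative assumptions, not figures from the notes:

```python
# Usable HDFS capacity = (disk per node * number of nodes) / replication factor.
# Replication means each block is stored RF times, so raw capacity is divided by RF.
def usable_capacity_tb(disk_per_node_tb, num_nodes, replication_factor):
    raw_tb = disk_per_node_tb * num_nodes
    return raw_tb / replication_factor

# Illustrative example: a 1,000-node cluster, 10 TB of disk per node,
# and the default HDFS replication factor of 3:
print(usable_capacity_tb(10, 1000, 3))  # ~3333.33 TB usable
```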
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• ZooKeeper™: A high-performance coordination service for distributed applications.
1 TB File
HDFS is based on Google's GFS (Google File System)
HDFS: Use Cases
HDFS works well for:
• Very large files.
• Streaming data access – read data in large volumes; write once, read frequently.
• Commodity hardware – no need for expensive, highly reliable machines.
HDFS is not a good fit for:
• Low-latency access.
• Lots of small files.
• Parallel writes / arbitrary file modifications.
HDFS Building Blocks
Default block size: 128 MB
1 GB file = 1024 MB / 128 MB = 8 blocks
For a file smaller than the block size:
100 MB file < block size (128 MB): stored as a single HDFS block of 100 MB (a block occupies only as much disk space as the data it actually holds)
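The block arithmetic above can be sketched as a small helper. The 128 MB default matches the notes; the function name is illustrative:

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def hdfs_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the number of HDFS blocks for a file and the size of the
    last (possibly partial) block, which uses only the space it needs."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    return num_blocks, last_block_mb

print(hdfs_blocks(1024))  # (8, 128) -- a 1 GB file fills 8 full blocks
print(hdfs_blocks(100))   # (1, 100) -- a small file occupies only 100 MB on disk
```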
HDFS Daemon Services
• NameNode – stores filesystem metadata (namespace, block locations)
• Secondary NameNode – periodically checkpoints the NameNode's metadata (not a hot standby)
• DataNode – stores the actual data blocks
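The NameNode's role of mapping each block to a set of DataNodes can be illustrated with a minimal sketch. Real HDFS placement is rack-aware; this round-robin version is a deliberate simplification, and the DataNode names are illustrative:

```python
# Simplified sketch of NameNode-style block-to-DataNode mapping.
# It only illustrates that every block ends up on RF distinct DataNodes;
# real HDFS uses a rack-aware placement policy, not round-robin.
def place_replicas(num_blocks, data_nodes, replication_factor=3):
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication_factor)
        ]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
for block, replicas in place_replicas(3, nodes).items():
    print(block, replicas)
```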
HDFS follows GFS's master/slave architecture (NameNode as master, DataNodes as slaves)
Copying Data from one Cluster to another
Parallel copying using distcp
hadoop distcp hdfs://uat:54311/user/rajkrrsingh/input hdfs://prod:54311/user/rajkrrsingh/input