Hadoop presentation



A short compilation about Hadoop from various books and other resources. This is just for learning.





Usage Rights

© All Rights Reserved


Hadoop presentation: Presentation Transcript

  • BIG DATA and Hadoop, by Chandra Sekhar
  • Contents
  • Introduction to Big Data
    What is Hadoop?
    What Hadoop is used for, and what it is not
    Top-level Hadoop projects
    Differences between RDBMS and HBase
    Facebook server model
  • Big Data: The Data Age
    Big data is a collection of datasets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
    The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
    The data generated by different companies has inherent value and can be used for different use cases in their analytics and predictions.
  • A new approach
    Moore's Law has held for roughly the past 40 years:
    1) Processing power doubles about every two years.
    2) Processing speed is no longer the problem.
    Getting the data to the processors becomes the bottleneck.
    Transferring 100 GB of data takes about 22 minutes at a disk transfer rate of 75 MB/sec.
    So the new approach is to move the processing to where the data is stored, in a distributed way, while satisfying requirements such as data recoverability, component recovery, consistency, reliability, and scalability.
    The answer came from the Google File System (GFS) and MapReduce, which Hadoop implements as HDFS and MapReduce.
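
The 22-minute figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1000 MB) and, for the parallel case, an idealized split across 100 disks with no coordination overhead; the function name and the 100-disk scenario are illustrative and not from the slides.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to read size_gb from one disk at rate_mb_per_s."""
    size_mb = size_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return size_mb / rate_mb_per_s / 60


# One disk reading the whole 100 GB, as on the slide.
single_disk = transfer_minutes(100, 75)

# Assumption: 100 disks each read an equal 1 GB share in parallel,
# ignoring scheduling and network overhead.
hundred_disks = transfer_minutes(100 / 100, 75)

print(f"one disk:  {single_disk:.1f} min")
print(f"100 disks: {hundred_disks * 60:.1f} s")
```

Spreading the same read over many disks is the motivation for moving the processing to the data rather than the data to a single processor.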
  • What Hadoop is used for
    Hadoop is recommended to coexist with your RDBMS, serving as a data warehouse. It is not a replacement for an RDBMS.
    Processing terabytes or petabytes of data can take hours with traditional methods; with Hadoop and its ecosystem, the power of distribution can bring that down to a few minutes.
    Many related tools integrate with Hadoop:
    • Data analysis
    • Data visualization
    • Database integration
    • Workflow management
    • Cluster management
  • Hadoop and Ecosystem
    ➲ Distributed file system and parallel processing for large-scale data operations, using HDFS and MapReduce.
    ➲ Plus the infrastructure needed to make them work, including:
    • File system utilities
    • Job scheduling and monitoring
    • Web UI
    Many other projects have grown up around the core components of Hadoop: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. Together they are called the ecosystem.
    A set of machines running HDFS and MapReduce is known as a Hadoop cluster.
    Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand, and it scales horizontally.
    More nodes = better performance!
  • Hadoop Components
    HDFS and MapReduce: core
    ZooKeeper: administration
    Hive, Pig: SQL and scripts based on MapReduce
    HBase: NoSQL datastore
    Sqoop: imports data from and exports data to an RDBMS
    Avro: serialization with schemas defined in JSON; used for storing metadata
  • Hadoop Components: HDFS
    HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. It sits on top of a native file system such as ext3, ext4, or XFS.
    HDFS is a file system designed for storing very large files with streaming data access (write once, read many times), running on clusters of commodity hardware.
    Data is split into blocks and distributed across multiple nodes in the cluster.
    Each block is typically 64 MB or 128 MB in size.
    Each block is replicated multiple times; the default is three replicas.
    Replicas are stored on different nodes.
    This ensures both reliability and availability.
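
The block and replica rules on this slide can be illustrated with a small planning function. This is only a sketch: the round-robin placement, the node names, and the 1000 MB file are assumptions made for the example, and real HDFS applies its own placement policy.

```python
import math

BLOCK_MB = 128     # block size quoted on the slide (64 MB is the other common value)
REPLICATION = 3    # default replication factor quoted on the slide


def plan_blocks(file_mb, nodes, block_mb=BLOCK_MB, replication=REPLICATION):
    """Return {block_index: [node, ...]} with each block's replicas on distinct nodes.

    Round-robin placement is an assumption of this sketch, not HDFS's policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_mb / block_mb)
    placement = {}
    for b in range(n_blocks):
        # Consecutive offsets modulo the node count are distinct
        # as long as replication <= len(nodes).
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
layout = plan_blocks(1000, nodes)               # hypothetical 1000 MB file

for block, holders in layout.items():
    print(block, holders)
```

Because every block lives on three different nodes, losing any single node leaves at least two copies of each block available.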
  • HDFS and MapReduce
    HDFS: NameNode (master), SecondaryNameNode, master failover node, DataNodes (slave nodes).
    MapReduce: JobTracker (jobs), TaskTracker (tasks), Mapper, Reducer, Combiner, Partitioner.
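
To show how the Mapper, Combiner, Partitioner, and Reducer named on this slide fit together, here is a single-process word count in plain Python. It is a teaching sketch, not Hadoop's API: all function names are hypothetical, and the character-sum partitioner merely stands in for a real hash.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combiner(pairs):
    """Pre-aggregate one mapper's output locally to reduce shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partitioner(key, n_reducers):
    """Choose the reducer for a key (a stable character sum stands in for a hash)."""
    return sum(ord(c) for c in key) % n_reducers


def reducer(key, values):
    """Sum all partial counts for one key."""
    return key, sum(values)


def run_job(lines, n_reducers=2):
    # Shuffle: route combined mapper output to a reducer, grouped by key.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for line in lines:
        for key, value in combiner(mapper(line)):
            buckets[partitioner(key, n_reducers)][key].append(value)
    # Reduce: each reducer processes its own keys in sorted order.
    result = {}
    for bucket in buckets:
        for key in sorted(bucket):
            word, total = reducer(key, bucket[key])
            result[word] = total
    return result


counts = run_job(["big data big cluster", "data moves to the cluster"])
print(counts)
```

In a real cluster each mapper and reducer would run as a separate task on a different node; here they are ordinary function calls so the data flow is easy to follow.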
  • HDFS and Nodes
  • Architecture
  • MapReduce
  • HDFS Access
    • WebHDFS: REST API
    • Fuse DFS: mounting HDFS as a normal drive
    • Direct access: direct HDFS access
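
As a rough illustration of the WebHDFS option, the helper below builds a request URL of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The host name, the path, and the helper itself are hypothetical, and port 50070 is an assumption (the NameNode HTTP port differs between Hadoop versions and deployments). The sketch only constructs the URL and does not contact a cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation name."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and directory.
list_url = webhdfs_url("namenode.example.com", 50070, "/user/demo/logs", "LISTSTATUS")
open_url = webhdfs_url("namenode.example.com", 50070, "/user/demo/logs/a.log", "OPEN", offset=0)

print(list_url)
print(open_url)
```

Any HTTP client can then issue a GET against such a URL, which is what makes WebHDFS usable from languages without a native HDFS client.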
  • Hive and Pig
    Hive provides a powerful SQL-like language. It does not support full SQL, but it can perform joins on datasets in HDFS.
    It is used for large batch processing. Behind the scenes, Hive simply runs MapReduce jobs.
    Pig is a powerful scripting language built on top of MapReduce jobs; the language is called Pig Latin.
  • HBASEThe most powerful NoSQL database on earth.Supports Master Active-Active Setup and isbased on the Googles BigTable.Supports Columns and ColumnFamilies, cansupport many billions of rows and manymillions of columns in its datamodel.An excellent Architectural master-piece, as faras the scalability is concerned.A NoSQL database, which can supporttransactions, very fast reads/writes typicallymillions of queries / second.
  • HBase, continued
    HBase Master, Region Servers, ZooKeeper, HDFS
  • ZooKeeper, Mahout
    ZooKeeper is a distributed coordinator and can be used as an independent package for managing any set of distributed servers.
    Mahout is a machine learning tool that supports various data science techniques, for example data clustering, classification, and recommender systems, using supervised and unsupervised learning.
  • FlumeFlume is a real time data access mechanismand writes to a data mart.Flume can move large capacity of streamingdata into HDFS and will be used for furtheranalysis.A part from this realtime analysis of the web-log data is also possible along with Flume.Logs of a group of webservers can be writtento HDFS using Flume.
  • Sqoop and Oozie
    Sqoop is a mechanism for importing data from an RDBMS into HDFS or Hive, and for exporting it back.
    Various vendors have prepared many free connectors for different RDBMSs; these make data transfer very fast because Sqoop supports parallel transfer.
    Oozie is a workflow mechanism for executing a large sequence of MapReduce jobs, Hive or Pig jobs, HBase jobs, and any other Java programs. Oozie also has an email job which
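
The parallel transfer described for Sqoop can be pictured as cutting a table's key range into chunks, one per worker. The sketch below is a simplified illustration under stated assumptions: an integer key with a known minimum and maximum, evenly sized chunks, and hypothetical table and column names. Sqoop's own splitting logic may differ.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first chunks
        if size == 0:
            continue                            # more mappers than rows
        ranges.append((lo, lo + size - 1))
        lo += size
    return ranges


def chunk_queries(table, column, ranges):
    """One bounded query per chunk; each could run on a separate worker."""
    return [
        f"SELECT * FROM {table} WHERE {column} BETWEEN {lo} AND {hi}"
        for lo, hi in ranges
    ]


parts = split_ranges(1, 1000, 4)
for query in chunk_queries("orders", "id", parts):   # hypothetical table and column
    print(query)
```

Because the chunks do not overlap, the workers can read from the database independently and write their results to HDFS side by side.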
  • RDBMS vs HBase
    A typical RDBMS scaling story runs this way:
    • Initial public launch.
    • Service becomes popular; too many reads hitting the database.
    • Service continues to grow in popularity; too many writes hitting the database.
    • New features increase query complexity; now we have too many joins.
    • Rising popularity swamps the server; things are too slow.
    • Some queries are still too slow.
    • Reads are OK, but writes are getting slower and slower.
  • With HBase
    Enter HBase, which has the following characteristics:
    • No real indexes
    • Automatic partitioning/sharding
    • Scales linearly and automatically with new nodes
    • Commodity hardware
    • Fault tolerance
    • Batch processing
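
Automatic partitioning can be sketched as a sorted list of split keys that divides the row-key space into regions, each served by some server. The class, the split keys, the server names, and the round-robin assignment below are all assumptions for illustration; they are not HBase's internal implementation.

```python
import bisect


class RegionMap:
    """Toy router: sorted split keys divide the row-key space into regions."""

    def __init__(self, split_keys, servers):
        self.split_keys = sorted(split_keys)
        n_regions = len(self.split_keys) + 1
        # Assumption: regions are dealt out to servers round-robin.
        self.owner = [servers[i % len(servers)] for i in range(n_regions)]

    def region_of(self, row_key):
        # Keys below the first split key fall in region 0; a key equal to a
        # split key belongs to the region that starts at that key.
        return bisect.bisect_right(self.split_keys, row_key)

    def server_for(self, row_key):
        return self.owner[self.region_of(row_key)]


regions = RegionMap(["g", "n", "t"], ["rs1", "rs2"])   # hypothetical splits and servers

for key in ["apple", "g", "monkey", "zebra"]:
    print(key, regions.region_of(key), regions.server_for(key))
```

Adding a node then amounts to reassigning some regions to the new server, which is why capacity can grow without changing how clients address rows.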
  • Facebook Server Architecture