Hadoop presentation


A short compilation about Hadoop from various books and other resources. This is just for learning.

Published in: Technology

  1. BIG DATA and Hadoop, by Chandra Sekhar
  2. Contents
  3. Topics:
     • Introduction to Big Data
     • What is Hadoop?
     • What Hadoop is and is not used for
     • Top-level Hadoop projects
     • Differences between RDBMS and HBase
     • Facebook server model
  4. Big Data: The Data Age
     Big data is a collection of datasets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
     The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
     The data generated by different companies has inherent value and can be used for different use cases in their analytics and predictions.
  5. A new approach
     Moore's Law has held true for the past 40 years:
     • Processing power doubles every two years.
     • Processing speed is no longer the problem.
     Getting the data to the processors becomes the bottleneck. Transferring 100 GB of data takes about 22 minutes at a disk transfer rate of 75 MB/s.
     So the new approach is to move the processing to where the data lives, in a distributed way, while satisfying requirements such as data recoverability, component recovery, consistency, reliability, and scalability.
     The answer was the Google File System (GFS) and MapReduce; their counterparts in Hadoop are HDFS and MapReduce.
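The transfer-time figure on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the function name, the 1024 MB-per-GB convention, and the 10-node example are assumptions for illustration, not part of the original slides.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float, mb_per_gb: int = 1024) -> float:
    """Minutes needed to stream size_gb of data through one disk."""
    seconds = size_gb * mb_per_gb / rate_mb_per_s
    return seconds / 60


# One disk reading the whole 100 GB at 75 MB/s.
single = transfer_minutes(100, 75)

# Hypothetical: the same data spread evenly over 10 nodes that read in parallel,
# so each node only streams its own 10 GB share.
parallel = transfer_minutes(100 / 10, 75)

print(f"{single:.1f} min on one disk")        # about 22.8
print(f"{parallel:.1f} min across 10 nodes")
```

The point of the second number is the slide's argument in miniature: reading in parallel where the data lives shortens the wall-clock time roughly in proportion to the number of disks.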
  6. What Hadoop is used for
     Hadoop is recommended to coexist with your RDBMS as a data warehouse. It is not a replacement for an RDBMS.
     Processing terabytes or petabytes of data can take hours with traditional methods; with Hadoop and its ecosystem, distributing the work can bring that down to minutes.
     Many related tools integrate with Hadoop:
     • Data analysis
     • Data visualization
     • Database integration
     • Workflow management
     • Cluster management
  7. Hadoop and Ecosystem
     ➲ Distributed file system and parallel processing for large-scale data operations, using HDFS and MapReduce.
     ➲ Plus the infrastructure needed to make them work, including:
     • Filesystem utilities
     • Job scheduling and monitoring
     • Web UI
     Many other projects have grown up around the core components of Hadoop: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. Together they are called the ecosystem.
     A set of machines running HDFS and MapReduce is known as a Hadoop cluster. Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand, and it scales horizontally.
     More nodes = better performance!
  8. Hadoop Components
     • HDFS and MapReduce: the core
     • ZooKeeper: administration
     • Hive, Pig: SQL and scripts based on MapReduce
     • HBase: NoSQL datastore
     • Sqoop: imports data from and exports data to an RDBMS
     • Avro: serialization with schemas defined in JSON; used for the metadata store
  9. Hadoop Components: HDFS
     HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. It runs on top of a native file system such as ext3, ext4, or XFS.
     HDFS is a file system designed for storing very large files with streaming data access (write once, read many times), running on clusters of commodity hardware.
     • Data is split into blocks and distributed across multiple nodes in the cluster.
     • Each block is typically 64 MB or 128 MB in size.
     • Each block is replicated multiple times; the default is three replicas.
     • Replicas are stored on different nodes.
     This ensures both reliability and availability.
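The block and replica numbers above translate directly into a storage footprint. The sketch below is illustrative only: the function names and node names are hypothetical, and the round-robin placement is a simplification that only honours the "replicas on different nodes" rule; it is not HDFS's actual placement policy.

```python
import math


def hdfs_footprint(file_mb: float, block_mb: int = 128, replication: int = 3):
    """Return (logical blocks, stored block replicas, raw storage in MB)."""
    blocks = math.ceil(file_mb / block_mb)
    # The last block only occupies as much disk as it holds,
    # so raw storage is the file size times the replication factor.
    return blocks, blocks * replication, file_mb * replication


def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Toy placement: each block's replicas go to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks, replicas, raw_mb = hdfs_footprint(1024)            # a 1 GB file
layout = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
print(blocks, replicas, raw_mb)
print(layout[0])
```

Losing any single node in this layout still leaves two copies of every block, which is the reliability and availability argument the slide makes.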
  10. HDFS and MapReduce
     HDFS:
     • NameNode (master)
     • Secondary NameNode
     • Master failover node
     • DataNodes (slave nodes)
     MapReduce:
     • JobTracker (jobs)
     • TaskTracker (tasks)
     • Mapper
     • Reducer
     • Combiner
     • Partitioner
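How the mapper, combiner, partitioner, and reducer fit together can be sketched as a word count that runs in a single Python process. This is a simulation of the data flow only, not the Hadoop API: all function names are illustrative, and job scheduling (JobTracker, TaskTracker) is not modelled.

```python
import zlib
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate one map task's output locally to cut shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()


def partitioner(key, num_reducers):
    """Send every occurrence of a key to the same reducer (stable hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Sum the partial counts for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    # Shuffle target: one bucket per reducer, values grouped by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # each split plays the role of one map task
        pairs = (kv for line in split for kv in mapper(line))
        for key, value in combiner(pairs):
            buckets[partitioner(key, num_reducers)][key].append(value)
    result = {}
    for bucket in buckets:  # each bucket plays the role of one reduce task
        for key in sorted(bucket):
            word, total = reducer(key, bucket[key])
            result[word] = total
    return result


print(run_job([["big data big"], ["data hadoop"]]))
```

The combiner is optional in this flow: removing it changes how many pairs cross the shuffle, not the final counts.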
  11. HDFS and Nodes
  12. Architecture
  13. MapReduce
  14. HDFS Access
     • WebHDFS: REST API
     • Fuse DFS: mounting HDFS as a normal drive
     • Direct access: direct HDFS access
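For the REST option, WebHDFS exposes file-system operations under a `/webhdfs/v1/` URL prefix, with the operation passed as an `op` query parameter. The helper below only builds such URLs and does not contact a cluster; the host name, port, path, and user are placeholders assumed for illustration.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str,
                params: Optional[dict] = None) -> str:
    """Build a WebHDFS URL for `op` on an absolute HDFS `path`."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder NameNode address; the HTTP port depends on the Hadoop version and configuration.
print(webhdfs_url("namenode.example.com", 50070, "/user/demo/log.txt", "OPEN"))
print(webhdfs_url("namenode.example.com", 50070, "/user/demo", "LISTSTATUS",
                  {"user.name": "demo"}))
```

A GET on the first URL would read the file and on the second would list the directory, using any HTTP client.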
  15. Hive and Pig
     Hive provides a powerful SQL-like language. It does not support full SQL, but it can be used to perform joins on datasets in HDFS. It is used for large batch programming; behind the scenes, Hive simply runs MapReduce jobs.
     Pig is a powerful scripting layer built on top of MapReduce jobs; its language is called Pig Latin.
  16. HBase
     • A highly scalable NoSQL database, based on Google's Bigtable.
     • Supports an active-active master setup.
     • Supports columns and column families; its data model can hold many billions of rows and many millions of columns.
     • An excellent architectural masterpiece as far as scalability is concerned.
     • Supports row-level transactions and very fast reads and writes, reaching millions of queries per second on large clusters.
  17. HBase (continued)
     • HBase Master
     • Region Servers
     • ZooKeeper
     • HDFS
  18. ZooKeeper, Mahout
     ZooKeeper is a distributed coordinator and can be used as an independent package for managing any set of distributed servers.
     Mahout is a machine learning library useful for various data science techniques, for example data clustering, classification, and recommender systems, using supervised and unsupervised learning.
  19. Flume
     Flume is a real-time data collection mechanism that writes to a data mart.
     Flume can move large volumes of streaming data into HDFS, where it is used for further analysis.
     Apart from this, real-time analysis of web-log data is also possible with Flume. For example, the logs of a group of web servers can be written to HDFS using Flume.
  20. Sqoop and Oozie
     Sqoop is a mechanism for importing data from an RDBMS into HDFS or Hive, and for exporting it back.
     Various vendors have prepared free connectors for different RDBMS products. Because Sqoop supports parallel transfer, data movement is very fast.
     Oozie is a workflow engine for executing a large sequence of MapReduce jobs, Hive or Pig jobs, HBase jobs, and any other Java programs. Oozie also has an email job which
  21. RDBMS vs HBase
     A typical RDBMS scaling story runs this way:
     • Initial public launch.
     • The service becomes popular; too many reads hit the database.
     • The service continues to grow in popularity; too many writes hit the database.
     • New features increase query complexity; now we have too many joins.
     • Rising popularity swamps the server; things are too slow.
     • Some queries are still too slow.
     • Reads are OK, but writes are getting slower and slower.
  22. With HBase
     Enter HBase, which has the following characteristics:
     • No real indexes
     • Automatic partitioning/sharding
     • Scales linearly and automatically with new nodes
     • Commodity hardware
     • Fault tolerance
     • Batch processing
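The idea behind automatic partitioning can be sketched as a toy table whose rows live in key-sorted regions that split once they grow past a threshold. This is a simplified illustration under stated assumptions: the class name, the row-count threshold, and the split-at-the-middle-key rule are hypothetical and not HBase's actual implementation.

```python
import bisect


class ToyTable:
    """Rows kept in key-sorted regions; a region splits when it grows too big."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]  # sorted start key of each region
        self.regions = [{}]     # parallel list: region -> {row_key: value}

    def _region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, value):
        i = self._region_index(row_key)
        self.regions[i][row_key] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def get(self, row_key):
        return self.regions[self._region_index(row_key)].get(row_key)

    def _split(self, i):
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]
        # Move the upper half of the key range into a new region.
        upper = {k: self.regions[i].pop(k) for k in keys if k >= mid}
        self.start_keys.insert(i + 1, mid)
        self.regions.insert(i + 1, upper)


table = ToyTable(max_rows_per_region=4)
for n in range(10):
    table.put(f"row{n:02d}", n)
print(table.start_keys)
print(table.get("row07"))
```

Because each region covers a contiguous key range, new regions could be handed to newly added nodes without the client changing how it looks up a row.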
  23. Facebook Server Architecture