Introduction to Big Data
What is Hadoop?
What Hadoop is used for, and what it is not
Top-level Hadoop projects
Differences between RDBMS and HBase
Facebook server model
BigData - The Data Age
Big data is a collection of datasets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The data generated by different companies has inherent value and can be used for different use cases in their analytics and predictions.
A new approach
As per Moore's Law, which has held true for the past 40 years:
1) Processing power doubles every two years
2) Processing speed is no longer the problem
Getting the data to the processors becomes the bottleneck. Transferring 100 GB of data takes about 22 minutes if the disk transfer rate is 75 MB/sec.
So the new approach is to move the processing of the data to the data side in a distributed way, while satisfying requirements such as data recoverability, component recovery, consistency, reliability, and scalability.
The answer is Google's File System (GFS) and MapReduce, which in Hadoop are called HDFS and MapReduce.
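The 22-minute figure above follows directly from dividing the data size by the disk transfer rate; a quick back-of-the-envelope check:

```python
# Check the transfer-time claim above: time = data size / transfer rate.
data_gb = 100          # 100 GB to move
rate_mb_per_s = 75     # disk transfer rate of 75 MB/sec

seconds = (data_gb * 1000) / rate_mb_per_s  # using 1000 MB per GB
minutes = seconds / 60

print(f"{minutes:.1f} minutes")  # 22.2 minutes
```

This is exactly why moving computation to the data, rather than data to the computation, pays off at scale.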
What Hadoop is used for
Hadoop is recommended to coexist with your RDBMS as a data warehouse. It is not a replacement for any RDBMS.
Processing TBs and PBs of data can take hours with traditional methods; with Hadoop and its ecosystem it takes a few minutes with the power of distribution.
Many related tools integrate with Hadoop:
Data analysis
Data visualization
Database integration
Workflow management
Cluster management
Hadoop and Ecosystem
➲ Distributed file system and parallel processing for large-scale data operations, using HDFS and MapReduce.
➲ Plus the infrastructure needed to make them work, including:
Filesystem utilities
Job scheduling and monitoring
Web UI
There are many other projects built around the core components of Hadoop - Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. - collectively called the ecosystem.
A set of machines running HDFS and MapReduce is known as a Hadoop cluster.
Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand; it is horizontally scalable.
More nodes = better performance!
Hadoop Components
HDFS and MapReduce - core
ZooKeeper - administration and coordination
Hive, Pig - SQL and scripting layers based on MapReduce
HBase - NoSQL datastore
Sqoop - imports data to and exports data from an RDBMS
Avro - serialization based on JSON; used for the metadata store
Hadoop Components: HDFS
HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. It sits on top of a native file system such as ext3/ext4 or XFS.
HDFS is a file system designed for storing very large files with streaming data access (write once, read many times), running on clusters of commodity hardware.
Data is split into blocks and distributed across multiple nodes in the cluster.
Each block is typically 64 MB or 128 MB in size.
Each block is replicated multiple times; the default is to replicate each block three times.
Replicas are stored on different nodes.
This ensures both reliability and availability.
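The block-and-replica arithmetic above can be sketched in a few lines; the function name and the 1 GB example file are illustrative, not part of any Hadoop API:

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=128, replication=3):
    """Number of HDFS blocks a file occupies, and the total number of
    physical block copies stored once each block is replicated."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

# A hypothetical 1 GB (1024 MB) file with the defaults described above:
blocks, copies = hdfs_block_count(1024)
print(blocks, copies)  # 8 blocks, 24 physical copies across the cluster
```

Because replicas of the same block land on different nodes, losing any single node leaves at least two copies of every block intact.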
HDFS and MapReduce
NameNode (master)
SecondaryNameNode
Master failover node
DataNodes (slave nodes)
JobTracker - jobs
TaskTracker - tasks
Mapper
Reducer
Combiner
Partitioner
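The Mapper and Reducer roles listed above can be sketched with word count, the canonical MapReduce example. This is a local simulation of the map, shuffle, and reduce phases; the function names are illustrative and not part of any Hadoop API:

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each key (word)."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Simulating the shuffle locally on two input lines:
lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [kv for line in lines for kv in mapper(line)]
print(reducer(pairs))  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster the JobTracker schedules many mapper tasks in parallel, the framework shuffles the emitted pairs by key, and a Combiner can pre-sum counts on each node before the Reducers run.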
HDFS Access
• WebHDFS - REST API
• Fuse-DFS - mounting HDFS as a normal drive
• Direct access - native HDFS client access
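WebHDFS exposes file operations over HTTP using URLs of the form `/webhdfs/v1/<path>?op=<OPERATION>`. A minimal sketch of building such a URL; the host name "namenode" and port 50070 are placeholders for a real cluster's NameNode web address:

```python
# Build a WebHDFS REST URL; "namenode" and port 50070 are placeholder
# values standing in for an actual NameNode's web address.
def webhdfs_url(host, path, op, port=50070):
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}"

url = webhdfs_url("namenode", "/user/hadoop/data.txt", "OPEN")
print(url)  # http://namenode:50070/webhdfs/v1/user/hadoop/data.txt?op=OPEN
```

Any HTTP client (curl, a browser, a script) can then read or list files without needing Hadoop libraries installed.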
Hive and Pig
Hive provides a powerful SQL-like language; though it does not fully support SQL, it can be used to perform joins on top of datasets in HDFS.
It is used for large batch programs. At the back end, Hive runs MapReduce jobs.
Pig is a powerful scripting layer built on top of MapReduce jobs; its language is called Pig Latin.
HBase
A very powerful NoSQL database. Supports a master active-active setup and is based on Google's BigTable.
Supports columns and column families; its data model can hold many billions of rows and many millions of columns.
An excellent architectural masterpiece as far as scalability is concerned.
A NoSQL database that can support transactions, with very fast reads/writes - typically millions of queries per second.
ZooKeeper, Mahout
ZooKeeper is a distributed coordinator; it can also be used as an independent package for managing any set of distributed servers.
Mahout is a machine learning tool useful for various data science techniques, e.g. data clustering, classification, and recommender systems, using supervised and unsupervised learning.
Flume
Flume is a real-time data access mechanism that writes to a data mart.
Flume can move large volumes of streaming data into HDFS, where it can be used for further analysis.
Apart from this, real-time analysis of web-log data is also possible with Flume: the logs of a group of web servers can be written to HDFS using Flume.
Sqoop and Oozie
Sqoop is a data import and export mechanism between an RDBMS and HDFS or Hive, and vice versa.
There are a lot of free connectors prepared by various vendors for different RDBMSs, which make the data transfer very fast, as Sqoop supports parallel transfers.
Oozie is a workflow mechanism for executing a large sequence of MapReduce jobs, Hive or Pig jobs, HBase jobs, and any other Java programs. Oozie also has an email action that can send notifications from a workflow.
RDBMS vs HBase
A typical RDBMS scaling story runs this way:
Initial public launch.
Service becomes popular; too many reads hitting the database.
Service continues to grow in popularity; too many writes hitting the database.
New features increase query complexity; now we have too many joins.
Rising popularity swamps the server; things are too slow.
Some queries are still too slow.
Reads are OK, but writes are getting slower and slower.
With HBase
Enter HBase, which has the following characteristics:
No real indexes
Automatic partitioning/sharding
Scales linearly and automatically with new nodes
Commodity hardware
Fault tolerance
Batch processing