Big Data Is Everywhere
• The Large Hadron Collider (LHC), a particle accelerator that will revolutionize our understanding of the workings of the Universe, will generate 60 terabytes of data per day – 15 petabytes (15 million gigabytes) annually.
• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
• 12 terabytes of Tweets are created each day.
• 100 terabytes of data are uploaded daily to Facebook.
• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.
• Utilities convert 350 billion annual meter readings to better predict power consumption.
What Is Big Data?
It's LARGE. It's COMPLEX. It's UNSTRUCTURED.
By David Kellog: "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."
O'Reilly defines big data the following way: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures."
An Obvious Question – How BIG Is BIG DATA?
A common misconception is that big data is solely related to VOLUME.
While volume or size is a part of the equation…
What about the SPEED at which data is generated?
And what about the VARIETY of big data that a variety of sources are generating?
Why the Sudden Explosion of Big Data?
• An increased number and variety of data sources that generate large quantities of data
  • Sensors (location, GPS, …)
  • Scientific computing (CERN, biological research, …)
  • Web 2.0 (Twitter, wikis, …)
• Realization that data is too valuable to delete
  • Data analytics and data warehousing
  • Business intelligence
• Dramatic decline in the cost of hardware, especially storage
  • Decline in the price of SSDs
BIG DATA Is Fuelled by CLOUD
• The properties of the cloud help us deal with big data.
• The challenges of big data drive the future design, enhancement and expansion of the cloud.
• The two are in a never-ending cycle.
The Value of Big Data – Why Is It So Important?
MANAGING BIG DATA
Traditional Enterprise Architecture vs. Cluster Architecture
Hadoop – Managing Big Data
TRADITIONAL ENTERPRISE ARCHITECTURE
Consists of
• Servers
• SAN (Storage Area Network)
• Storage arrays
• Servers – a server is a physical computer dedicated to running one or more services that serve the needs of the users of other computers on the network.
• Storage arrays – a disk array is a disk storage system that contains multiple disk drives (SATA, SSD).
• Storage Area Network – a SAN is a dedicated network that provides access to consolidated data storage. SANs are primarily used to make storage devices, such as disk arrays, accessible to servers so that the devices appear like locally attached devices to the operating system.
SOME ADVANTAGES AND DISADVANTAGES OF ENTERPRISE ARCHITECTURE
ADVANTAGES
• Loose coupling between servers and storage/disk arrays – each can be expanded, upgraded or retired independently of the other.
• The SAN enables services on any server to access any of the storage arrays, as long as they have access permission.
• ROBUST, with a MINIMUM FAILURE rate.
• Mainly designed for compute-intensive applications that operate on a subset of the data.
DISADVANTAGES
• Becomes more costly as it expands.
• But what about BIG DATA? It cannot handle data-intensive operations like sorting.
What we want is an architecture that will give –
CLUSTER ARCHITECTURE
Consists of
• Nodes – each having its own cores, memory and disks.
• Interconnection via a high-speed network (LAN).
• A cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.
• The nodes are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system.
• The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as, by and large, one cohesive computing unit.
Benefits of Using a Cluster Architecture
• Modular and scalable – easier to expand the system without bringing down the application that runs on top of the cluster.
• Data locality – data can be processed by the cores colocated in the same node or rack, minimizing any transfer over the network.
• Parallelization – a higher degree of parallelism via the simultaneous execution of separate portions of a program on different processors.
• All this at a lower cost.
But Every Coin Has Two Sides!
• Complexity – the cost of administering a cluster of N machines.
• More storage – data is replicated to protect against failures.
• Data distribution – how to distribute data evenly across the cluster?
• Requires careful management and massively parallel processing designs.
Riding the Elephant – Hadoop
SOLUTION
• Open-source Apache project initiated and led by Yahoo!.
• Apache Hadoop is an open-source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.
• Runs on
  o Linux, Mac OS/X, Windows, and Solaris
  o Commodity hardware
• Targets clusters of commodity PCs
  o Cost-effective bulk computing
• Invented by Doug Cutting, funded by Yahoo! in 2006, and reached its "web-scale capacity" in 2008.
Doug Cutting
Where Does It All Come From?
• The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users.
• Based on Google's MapReduce and the Google File System.
What Is Hadoop?
Hadoop consists of two core components:
1. Hadoop Distributed File System (HDFS)
2. Hadoop distributed processing framework – using the Map/Reduce metaphor
Hadoop Distributed File System (HDFS)
Based on simple design principles:
• To split
• To scatter
• To replicate
• To manage data across the cluster
• Files are broken into large file blocks, each usually a multiple of the storage block size – typically 64 MB or higher.
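The splitting step can be pictured in a few lines of Python. This is only a minimal sketch of fixed-size block splitting (the helper name and the 200 MB file are made up for illustration; real HDFS tracks blocks in namenode metadata, not in-memory lists):

```python
# Minimal sketch of splitting a file into fixed-size blocks, as HDFS
# does with its block size (typically 64 MB in classic Hadoop).
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file yields three full 64 MB blocks plus one 8 MB tail block.
blocks = split_into_blocks(200 * 1024 * 1024)
```

Each (offset, length) pair corresponds to one file block that can be scattered to a different datanode.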
Hadoop Distributed File System (HDFS) contd.
• File blocks are replicated to several datanodes for reliability.
• The default is 3 replicas, but this is configurable.
• Blocks are placed (writes are pipelined):
  • on the same node,
  • on the same rack,
  • on the other rack.
• Clients read from the closest replica.
• If the replication for a block drops below its target, it is automatically re-replicated.
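The placement policy can be sketched as a toy function. This follows the same-node / same-rack / other-rack order described on the slide; the rack and node names are hypothetical, and real HDFS placement also weighs datanode load and free space:

```python
# Toy sketch of the replica placement order described above: first copy
# on the writer's node, second on another node in the same rack, third
# on a node in a different rack.
def place_replicas(writer_node, topology):
    """topology maps rack name -> list of node names; returns 3 targets."""
    writer_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    replicas = [writer_node]                      # 1st replica: same node
    same_rack = [n for n in topology[writer_rack] if n != writer_node]
    if same_rack:
        replicas.append(same_rack[0])             # 2nd replica: same rack
    other_rack = [n for r, nodes in topology.items()
                  if r != writer_rack for n in nodes]
    if other_rack:
        replicas.append(other_rack[0])            # 3rd replica: other rack
    return replicas

topology = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
place_replicas("node1", topology)  # ['node1', 'node2', 'node3']
```

Losing any single node or any single rack still leaves at least one live replica of the block.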
Hadoop Distributed File System (HDFS) contd.
• A single namespace for the entire cluster, managed by a single namenode.
• The namenode is a master server that manages the file system namespace and regulates access to files by clients.
• Datanodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the namenode.
• When a datanode fails, the namenode:
  • identifies the file blocks that have been affected,
  • retrieves copies from other healthy nodes,
  • finds new nodes to store another copy of them,
  • updates the information in its tables.
Hadoop Distributed File System (HDFS) contd.
• The client talks to both the namenode and the datanodes.
• Data is not sent through the namenode.
• The namenode is contacted first; then the client can connect directly to the datanodes.
HDFS Architecture
Hadoop Distributed File System (HDFS) contd.
ADVANTAGES
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware
2 POINTS OF FAILURE
• The namenode can become a single point of failure.
• Cluster rebalancing.
SOLUTIONS
• Enterprise editions maintain a backup of the namenode.
• The architecture is compatible with data rebalancing schemes, but this is still an area of research.
Hadoop Map/Reduce
• Map/Reduce is a programming model for efficient distributed computing.
• The user submits a MapReduce job.
• The system:
  • partitions the job into lots of tasks,
  • schedules tasks on nodes close to the data,
  • monitors tasks,
  • kills and restarts them if they fail/hang/disappear.
Consists of two phases:
1. Mapper phase
2. Reduce phase
Hadoop Map/Reduce contd.
1. Mapper Phase
• The data are fed into the map function as key/value pairs to produce intermediate key/value pairs.
  • Input: a (key1, value1) pair
  • Output: (key2, value2) pairs
• All nodes perform the same computation.
• Uses data locality to increase performance.
• As all data blocks stored in HDFS are of equal size, the mapper computation can be divided equally.
Hadoop Map/Reduce contd.
2. Reduce Phase
• Once the mapping is done, all the intermediate results from the various nodes are reduced to create the final output.
• Has 3 stages:
  • shuffle,
  • sort and
  • reduce.
• Shuffle – the input to the Reducer is the sorted output of the mappers. In this stage the framework fetches the relevant partition of the output of all the mappers.
• Sort – the framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage. The shuffle and sort stages occur simultaneously; while map outputs are being fetched they are merged.
• Reduce – in this stage the reduce method is called for each <key, (list of values)> pair in the grouped inputs and produces the final outputs.
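The shuffle-and-sort step amounts to grouping all intermediate pairs by key. A simplified single-process sketch (real Hadoop does this across the network, partitioned over many reducers):

```python
from itertools import groupby
from operator import itemgetter

# Simplified sketch of shuffle/sort: intermediate (key, value) pairs
# from all mappers are sorted by key and grouped, so the reducer sees
# each key exactly once, with the list of all its values.
def shuffle_and_sort(mapper_outputs):
    pairs = sorted(
        (pair for output in mapper_outputs for pair in output),
        key=itemgetter(0),
    )
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# Two mappers emitted the key "a"; grouping merges their values.
shuffle_and_sort([[("a", 1), ("b", 1)], [("a", 1)]])  # [('a', [1, 1]), ('b', [1])]
```

Each grouped `(key, [values])` pair is exactly the `<key, (list of values)>` input that the reduce method receives.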
Understood or Not? Let's Understand It with an Example
• Suppose you want to analyze blog entries stored in BigData.txt and count the number of times the words Hadoop, Big Data and Green Plum appear in it.
• Suppose 3 nodes participate in the task. In the Mapper Phase, each node will receive the address of a file block and a pointer to the mapper function.
• The mapper function will calculate the word counts.
Let's Understand It with an Example
• The output of the mapper function will be a set of <key, value> pairs.
FINAL OUTPUT OF MAPPER PHASE
Let's Understand It with an Example
• The Reduce Phase sums and reduces the output.
• A node is selected to perform the reduce function, and the other nodes send their output to that node.
After the shuffling stage of the Reduce Phase
Let's Understand It with an Example
After the sorting stage of the Reduce Phase
And FINALLY
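The whole word-count example can be simulated end-to-end in plain Python. This is a single-machine sketch of the map, shuffle and reduce steps, not real Hadoop code; the sample blocks and helper names are invented for illustration:

```python
from collections import defaultdict

# Single-machine simulation of the word-count example: each "node" maps
# its block of text to (phrase, 1) pairs, the pairs are grouped by
# phrase (shuffle), and the reduce step sums the counts per phrase.
TARGETS = ("Hadoop", "Big Data", "Green Plum")

def mapper(block):
    """Emit (phrase, 1) for every occurrence of a target phrase."""
    return [(p, 1) for p in TARGETS for _ in range(block.count(p))]

def reducer(grouped):
    """Sum the list of values for each key."""
    return {phrase: sum(counts) for phrase, counts in grouped.items()}

def word_count(blocks):
    grouped = defaultdict(list)
    for block in blocks:                  # map phase, one call per node
        for phrase, one in mapper(block):
            grouped[phrase].append(one)   # shuffle: group values by key
    return reducer(grouped)               # reduce phase

blocks = ["Hadoop rides on Big Data",     # node 1's file block
          "Green Plum and Hadoop",        # node 2's file block
          "Big Data again"]               # node 3's file block
word_count(blocks)
```

For these three blocks the final output counts Hadoop twice, Big Data twice and Green Plum once, matching the per-phase outputs pictured on the preceding slides.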
A Bit More on Map/Reduce
• The JobTracker keeps track of all the MapReduce jobs that are running on various nodes.
• It schedules the jobs and keeps track of all the map and reduce tasks running across the nodes.
• If any one of those tasks fails, it reallocates the task to another node.
• The TaskTracker performs the map and reduce tasks that are assigned by the JobTracker.
• The TaskTracker also constantly sends a heartbeat message to the JobTracker, which helps the JobTracker decide whether or not to delegate a new task to this particular node.
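The heartbeat check can be pictured as a simple liveness filter. A toy sketch only: the 10-second threshold and tracker names are arbitrary choices for this illustration, not Hadoop's actual timeout or API:

```python
# Toy sketch of the heartbeat-based check described above: the
# JobTracker delegates new tasks only to TaskTrackers whose last
# heartbeat arrived recently enough.
HEARTBEAT_TIMEOUT = 10.0  # seconds; an arbitrary illustrative value

def live_trackers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """last_heartbeat maps tracker name -> timestamp of last heartbeat.
    Return the sorted names of trackers eligible for new tasks."""
    return sorted(t for t, ts in last_heartbeat.items() if now - ts <= timeout)

heartbeats = {"tracker1": 100.0, "tracker2": 93.0, "tracker3": 85.0}
live_trackers(heartbeats, now=101.0)  # ['tracker1', 'tracker2']
```

tracker3 last reported 16 seconds ago, so the JobTracker would treat it as dead and reallocate its tasks elsewhere.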
Accessibility and Implementation
• HDFS
  • HDFS provides a Java API for applications to use.
  • Python access is also used in many applications.
  • It provides a command-line interface called the FS shell that lets the user interact with data in HDFS.
  • The syntax of the commands is similar to bash.
  Example: to create a directory
  Usage: hadoop dfs -mkdir <paths>
  hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
• Map/Reduce
  • Java API with prebuilt classes and interfaces.
  • Python and C++ can also be used.