GlusterFS and Hadoop

Presentation on GlusterFS and Hadoop at FUDCon 2015, MIT College, Pune.

  1. GlusterFS and Hadoop ● Shubhendu Tripathi, PSE, Red Hat
  2. Agenda (06/22/15) ● What is BigData ● Hadoop and its Evolution ● Hadoop Architecture and Components ● Hadoop and GlusterFS (glusterfs-hadoop plugin) ● Advantages of using GlusterFS with Hadoop ● References
  3. What is BigData ● Software solutions mostly capture, maintain and manage data: storing data and processing data ● Data size keeps growing; big data generators include sensors, CCTV cameras, social networks, online shopping portals, airlines and hospitality
  4. Agenda ● What is BigData ● Hadoop and its Evolution ● Hadoop Architecture and Components ● Hadoop and GlusterFS (glusterfs-hadoop plugin) ● Advantages of using GlusterFS with Hadoop ● References
  5. What is BigData ● 90% of the data we have today was generated in the last 2 years ● 1990: HDD 1-20 GB, RAM 14-128 MB, speed 10 kbps ● 2014: HDD 0.5-1 TB, RAM 1-16 GB, speed 100 Mbps ● 3 factors which define BigData: volume, velocity, variety (unstructured and semi-structured data)
  6. What is BigData ● SAN (Storage Area Network) ● One option: store the data in data centers, fetch it when needed and perform the computation on it locally ● Computation is processor-bound, and processing power is limited ● As the data grows, more and more computation is needed, and it is not possible to perform it on a local machine ● Solution: sending the computation to the storage node and getting back the processed data is the better option (the computation is small compared with the data)
  7. Hadoop Evolution ● Started with Google white papers: GFS (Google File System), 2003, for storage; MapReduce, 2004, for computation ● Yahoo: HDFS (Hadoop Distributed File System), 2006-07; MapReduce (computation mechanism), 2007-08 ● Doug Cutting and Michael Cafarella from Yahoo ● Logo: elephant ● Apache foundation (donated by Yahoo in 2005)
  8. Hadoop Architecture / Components ● A framework of tools, not a single application ● Supports running applications on BigData ● Open-source set of tools distributed under the Apache license ● Traditional approach for handling huge data: a powerful computer with large storage and computation capacity, limited by its processing power as the data grows ● Hadoop approach: break the data into smaller pieces and distribute them to multiple computers; break the computation into smaller pieces as well and distribute them; return the combined results
  9. Hadoop Architecture / Components ● MapReduce: Job Tracker, Task Tracker ● HDFS: Name Node, Data Node ● Applications contact the master node; a job is formed and submitted to the Job Tracker ● The Job Tracker maintains a queue of tasks and gets them processed using the Task Trackers and Data Nodes ● It consolidates the results and sends them back to the application
  10. Hadoop Architecture / Components ● Hadoop works on a distributed model ● Numerous low-cost computers (commodity hardware) ● Hadoop components ● Slaves: Task Tracker (processes the smaller piece of the task assigned to it); Data Node (manages the piece of data distributed to this node) ● Master: Job Tracker (tracks the overall task); Name Node (maintains the index of the data blocks stored on different nodes); also runs a Task Tracker and a Data Node
  11. Hadoop Architecture / Components ● [Diagram: Applications, Queue, Master (Job Tracker, Name Node, Task Tracker, Data Node), Slaves (a Task Tracker and a Data Node on each)]
  12. Hadoop Architecture / Components ● [Diagram: Applications, Master (Job Tracker, Name Node, Task Tracker, Data Node), Slaves (a Task Tracker and a Data Node on each)]
  13. Hadoop Architecture / Components ● [Diagram: Applications, Master (Job Tracker, Name Node, Task Tracker, Data Node), Slaves (a Task Tracker and a Data Node on each)]
  14. Hadoop and GlusterFS ● GlusterFS is a general-purpose scale-out distributed file system supporting thousands of clients ● Aggregates storage exports over a network interconnect to provide a single unified namespace ● The file system runs completely in userspace, on commodity hardware ● Layered on disk file systems that support extended attributes
  15. Hadoop and GlusterFS ● Hadoop consists of a set of daemons running in the system: Name Node (centralized metadata node); Job Tracker (overall task distribution across data nodes); Task Tracker (runs on data nodes to manage tasks); Data Node (stores data) ● Hadoop = MapReduce framework + HDFS ● GlusterFS can be a replacement for HDFS ● glusterfs-hadoop plugin: a Java module which implements the Hadoop file system interface; simply a JAR file which can be placed in the Hadoop libraries; replaces HDFS with GlusterFS
  16. Hadoop and GlusterFS ● Data locality is ensured by the Job Tracker ● The glusterfs-hadoop plugin ensures data locality by having the Gluster volumes mounted as FUSE mounts ● Effectively no Name Node is involved: only the clients where the MapReduce job runs, and the data nodes that store the data ● The glusterfs-hadoop plugin talks to GlusterFS through FUSE mounts ● In the absence of a Name Node, the plugin uses the extended attributes (xattrs) mechanism to get the details from the volume and consolidates the data using them ● Reads the data directly from the bricks, bypassing the volume, for improved performance
  17. Hadoop and GlusterFS ● As simple as starting the MapReduce daemons and then submitting the Hadoop job with GlusterFS as the storage ● Analytics use: with HDFS, files have to be moved around between nodes, whereas with GlusterFS the volume only needs to be FUSE-mounted and no files are moved
  18. Advantages ● Elimination of the centralized metadata server (Name Node) ● Compatibility with MapReduce and Hadoop-based applications ● No code rewrites needed to enable Hadoop on GlusterFS ● Fault-tolerant file system ● Allows co-location of compute and data nodes, and the ability to run Hadoop jobs across multiple namespaces using multiple GlusterFS volumes ● Data access through several different mechanisms / protocols (FUSE, NFS, SMB, Swift and, of course, Hadoop)
  19. References ● https://github.com/gluster/glusterfs-hadoop ● https://forge.gluster.org/hadoop/pages/Home ● Contact: shubhendu @ #gluster on freenode
  20. Deployment Scenario
  21. THANK YOU!
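
Slide 8 describes the Hadoop approach as breaking the data and the computation into smaller pieces, distributing them, and combining the results. The following is a minimal single-process sketch of that idea using a word count; it is not Hadoop code, and the function names (`split_into_chunks`, `map_chunk`, `reduce_counts`, `word_count`) are hypothetical names chosen for illustration. In a real cluster each chunk would be processed on a different node.

```python
from collections import Counter
from typing import Iterable, List


def split_into_chunks(lines: List[str], num_chunks: int) -> List[List[str]]:
    """Break the input into pieces, one per (imaginary) worker node."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk: Iterable[str]) -> Counter:
    """Map step: a worker counts the words in its own piece only."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: merge the per-worker results into one answer."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines: List[str], num_chunks: int = 3) -> Counter:
    """Split the data, process each piece independently, combine the results."""
    chunks = split_into_chunks(lines, num_chunks)
    # Here the pieces are processed one after another in a single process;
    # a framework such as Hadoop would run them on separate nodes.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = ["big data big", "data node", "big"]
    print(word_count(sample, num_chunks=2).most_common())
```

Each `map_chunk` call only sees its own piece of the data, so the pieces can be processed where they are stored and only the small partial counts need to travel.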

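Slide 16 says that, without a Name Node, the plugin relies on extended attributes to find out where a file's data lives. The sketch below illustrates the idea from Python rather than reproducing the plugin's Java code. It assumes a Linux client with the volume FUSE-mounted, that the mount exposes the virtual `trusted.glusterfs.pathinfo` attribute, and that its value lists bricks as `<POSIX(brick-root):host:path>` entries; the sample string in the usage note and the helper names are illustrative assumptions, not captured output.

```python
import os
import re
from typing import List, Tuple

# Virtual extended attribute assumed to be available on a GlusterFS FUSE mount.
PATHINFO_XATTR = "trusted.glusterfs.pathinfo"

# Matches entries of the assumed form <POSIX(/brick/root):hostname:/brick/root/file>.
_BRICK_RE = re.compile(r"<POSIX\([^)]*\):([^:>]+):([^>]+)>")


def parse_pathinfo(pathinfo: str) -> List[Tuple[str, str]]:
    """Extract (host, path-on-brick) pairs from a pathinfo string."""
    return [(host, path) for host, path in _BRICK_RE.findall(pathinfo)]


def brick_locations(path_on_mount: str) -> List[Tuple[str, str]]:
    """Ask the mounted volume which bricks hold the given file.

    Uses os.getxattr, which is available on Linux only, and may require
    sufficient privileges to read attributes in the trusted namespace.
    """
    raw = os.getxattr(path_on_mount, PATHINFO_XATTR)
    return parse_pathinfo(raw.decode("utf-8", errors="replace"))
```

With an assumed value such as `(<DISTRIBUTE:vol-dht> <POSIX(/bricks/b1):server1:/bricks/b1/data/f.txt>)`, `parse_pathinfo` yields the host and the on-brick path, which is the kind of information a scheduler needs in order to place a task on the node that already stores the data.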