
Big Data and Hadoop Overview


This deck gives an overview of Big Data and Hadoop.


  1. 1. Big Data and Hadoop Overview Saurabh Khanna Mob: +91-8147644946
  2. 2. Agenda  Introduction to Big Data  Current market trends and challenges of Big Data  Approach to solve Big Data problems  Introduction to Hadoop  HDFS & MapReduce  Hadoop cluster introduction & creation  Hadoop ecosystems
  3. 3. Introduction to Big Data “Big data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools. The challenges include capture, storage, search, sharing, analysis, and visualization.” Or: Big data is the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies. And it has 3 V’s.
  4. 4. Some Make it 4V’s
  5. 5. Big Data Source (1/2)
  6. 6. Big Data Source (2/2) The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data
  7. 7. Big Data Growth
  8. 8. Expectation from Big Data
  9. 9. Current market trends & challenges of Big Data (1/2) We’re generating more data than ever • Financial transactions • Sensor networks • Server logs • Analytics • E-mail and text messages • Social media And we’re generating data faster than ever • Automation • Ubiquitous internet connectivity • User-generated content For example, every day • Twitter processes 340 million messages • Amazon S3 adds more than one billion objects • Facebook users generate 17.7 billion comments and “Likes”
  10. 10. Current market trends & challenges of Big Data (2/2)  Data is value, and we must process it to extract that value. This data has many valuable applications • Marketing analysis • Demand forecasting • Fraud detection • And many, many more…  Data access is the bottleneck • Although we can process data quickly, accessing it is slow, and this is true for both reads and writes. For example • Reading a single 3 TB disk takes almost four hours • We cannot process the data until we have read it • We’re limited by the speed of a single disk • We’ll see Hadoop’s solution in a few moments  Disk capacity has grown enormously in the last 15 years but, unfortunately, transfer rates haven’t kept pace:
      Year   Capacity (GB)   Cost per GB (USD)   Transfer rate (MB/s)   Full-disk read time
      1997   2.1             $157                16.6                   126 seconds
      2004   200             $1.05               56.5                   59 minutes
      2012   3,000           $0.05               210                    3 hours, 58 minutes
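The read times in the table above follow directly from capacity divided by transfer rate. A minimal sketch of that arithmetic (assuming 1 GB is treated as 1,000 MB, and a 200 GB capacity for the 2004 drive, which is what the 59-minute figure implies):

```java
// Full-disk sequential read time = capacity / transfer rate.
public class DiskReadTime {

    // Returns read time in seconds, treating 1 GB as 1,000 MB.
    static double readSeconds(double capacityGB, double rateMBperSec) {
        return capacityGB * 1000.0 / rateMBperSec;
    }

    public static void main(String[] args) {
        System.out.printf("1997: %.1f seconds%n", readSeconds(2.1, 16.6));       // about two minutes
        System.out.printf("2004: %.1f minutes%n", readSeconds(200, 56.5) / 60);  // about an hour
        System.out.printf("2012: %.2f hours%n",   readSeconds(3000, 210) / 3600); // just under 4 hours
    }
}
```

The point of the table is visible in the last column: capacity grew by roughly 1,400x while transfer rate grew by only about 12x, so reading a whole disk takes ever longer.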
  11. 11. Approach to solve Big Data problems  The pain points described above lead to two core problems • Large-scale data storage • Large-scale data analysis We have the following approaches for solving Big Data problems • Option 1 - Distributed computing • Option 2 - NoSQL • Option 3 - Hadoop (HDFS + MapReduce)
  12. 12. Distributed Computing – Option 1 Typical processing pattern Step 1: Copy input data from storage to a compute node Step 2: Perform the necessary processing Step 3: Copy the output data back to storage This works fine with relatively small amounts of data (that is, where Step 2 dominates overall runtime), but we have a few problems with this approach at scale • More time is spent copying data than actually processing it • Getting data to the processors is the bottleneck • It grows worse as more compute nodes are added • They’re competing for the same bandwidth • Compute nodes become starved for data • It is not fault tolerant
  13. 13. NoSQL - Option 2  NoSQL (commonly referred to as "Not Only SQL") represents a completely different family of databases that allows for high-performance, agile processing of information at massive scale. In other words, it is a database infrastructure that has been very well adapted to the heavy demands of big data.  NoSQL refers to non-relational (or at least non-SQL) database solutions such as HBase (also part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB, and many others.  NoSQL centers around the concept of distributed databases, where unstructured data may be stored across multiple processing nodes, and often across multiple servers.  This distributed architecture allows NoSQL databases to be horizontally scalable; as data continues to explode, just add more hardware to keep up, with no slowdown in performance.  The NoSQL distributed database infrastructure has been the solution for some of the biggest data stores on the planet, e.g. at Google, Amazon, and the CIA.
  14. 14. Hadoop - Option 3  Hadoop is a software framework for distributed processing of large datasets across large clusters of computers Large datasets  Terabytes or petabytes of data Large clusters  Hundreds or thousands of nodes  Hadoop is an open-source implementation of Google’s MapReduce  Hadoop is based on a simple programming model called MapReduce  Hadoop is based on a simple data model; any data will fit  Hadoop was started to improve the scalability of Apache Nutch • Nutch is an open-source web search engine
  15. 15. Main Big Data technologies: Hadoop, NoSQL databases, analytic databases Hadoop • Low-cost, reliable scale-out architecture • Distributed computing • Proven success in Fortune 500 companies • Exploding interest NoSQL databases • Huge horizontal scaling and high availability • Highly optimized for retrieval and appending • Types • Document stores • Key-value stores • Graph databases Analytic RDBMS • Optimized for bulk-load and fast aggregate query workloads • Types • Column-oriented • MPP • OLAP • In-memory
  16. 16. What is Hadoop? “Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users.” Two Google whitepapers had a major influence on this effort • The Google File System (storage) • MapReduce (processing)
  17. 17. Design principles of Hadoop  Created by Doug Cutting (developed at Yahoo) • Process internet-scale data (search the web, store the web) • Save costs: distribute the workload on a massively parallel system built from large numbers of inexpensive computers  A new way of storing and processing data: • Let the system handle most of the issues automatically: • Failures • Scalability • Reduced communication • Distribute data and processing power to where the data is • Make parallelism part of the operating system • Relatively inexpensive hardware ($2–4K) • Reliability provided through replication • Large files preferred over small ones  Bring processing to the data! Hadoop = HDFS + MapReduce infrastructure
  18. 18. What is Hadoop used for?  Search: Yahoo, Amazon, Zvents  Log processing: Facebook, Yahoo, ContextWeb, Joost  Recommendation systems: Facebook  Data warehousing: Facebook, AOL  Video and image analysis: New York Times, Eyealike
  19. 19. Hadoop Users  Banking and financial • JPMorgan Chase • Bank of America • Commonwealth Bank of Australia  Telecom • China Mobile  Retail • eBay • Amazon  Manufacturing • IBM • Adobe  Web & digital media • Facebook • Twitter • LinkedIn • New York Times
  20. 20. Why Hadoop?  Handle partial hardware failures without going down: • If a machine fails, we should switch over to a standby machine • If a disk fails, use RAID or a mirror disk  Be able to recover from major failures: • Regular backups • Logging • Mirror the database at a different site  Scalability: • Increase capacity without restarting the whole system (Pure Scale) • More computing power should equal faster processing  Result consistency: • Answers should be consistent (independent of something failing) and returned in a reasonable amount of time
  21. 21. Hadoop solution framework – A practical example (1/2) Consider the example of Facebook. Facebook’s data had grown to about 100 TB/day by 2013, and in the future it will produce data of a much higher magnitude. They have many web servers and huge MySQL servers (profiles, friends, etc.) to hold the user data.
  22. 22. Hadoop solution framework – A practical example (2/2) Now they need to run various reports on this huge data set, e.g.: 1) The ratio of men vs. women users for a period 2) The number of users who commented on a particular day Solution: For this requirement they had scripts written in Python using ETL processes. But as the size of the data grew to this extent, these scripts no longer worked. Their main aim at this point was to handle data warehousing, and their homegrown solutions were not working. This is when Hadoop came into the picture.
  23. 23. Hadoop Distributed File System (HDFS) Agenda HDFS Definition Architecture HDFS Components
  24. 24. HDFS Definition  The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.  HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework.  It has many similarities with existing distributed file systems.  HDFS is the primary storage system used by Hadoop applications.  HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.  HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.  HDFS consists of the following components (daemons) • Name Node • Data Node • Secondary Name Node
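As a rough illustration of how block storage and replication interact, here is a small sketch (assuming the classic Hadoop 1.x defaults of 64 MB blocks and a replication factor of 3; both are configurable, and the class here is illustrative, not part of Hadoop):

```java
// Rough storage math for an HDFS file (assumed defaults: 64 MB blocks, 3 replicas).
public class HdfsBlockMath {

    // Number of blocks a file occupies: ceiling of fileMB / blockMB.
    static long numBlocks(long fileMB, long blockMB) {
        return (fileMB + blockMB - 1) / blockMB;   // ceiling division
    }

    public static void main(String[] args) {
        long fileMB  = 1024;                        // a 1 GB file
        long blocks  = numBlocks(fileMB, 64);       // 16 blocks of 64 MB
        long rawMB   = fileMB * 3;                  // 3 replicas -> 3 GB of raw disk
        System.out.println(blocks + " blocks, ~" + rawMB + " MB stored across the cluster");
    }
}
```

The last (partial) block still occupies one block entry in the NameNode's metadata, which is one reason HDFS prefers large files over many small ones.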
  25. 25. HDFS Components (1/2)  Name Node: the Name Node, a master server, manages the file system namespace and regulates access to files by clients. It has the following properties:  Metadata in memory • The entire metadata is kept in main memory • Types of metadata • List of files • List of blocks for each file • List of Data Nodes for each block • File attributes, e.g. creation time, replication factor • A transaction log • Records file creations, file deletions, etc.  Data Node: Data Nodes are the servers where the actual data is stored; there should be one per node. A Data Node has the following properties: • A block server • Stores data in the local file system (e.g. ext3) • Stores metadata of a block (e.g. CRC) • Serves data and metadata to clients • Block report • Periodically sends a report of all existing blocks to the Name Node • Facilitates pipelining of data • Forwards data to other specified Data Nodes
  26. 26. HDFS Components (2/2)  Secondary Name Node • It is not a hot standby or mirror node; a true failover node came only in a later release. • It is used for housekeeping: it periodically checkpoints the Name Node’s metadata, and in case of Name Node failure we can recover data from this node. • Its memory requirements are the same as the Name Node’s (big) • It typically runs on a separate machine in a large cluster (> 10 nodes) • Its directory layout is the same as the Name Node’s, except that it keeps the previous checkpoint version in addition to the current one. • It can be used to restore a failed Name Node (just copy the current directory to the new Name Node)
  27. 27. MapReduce Framework Agenda Introduction Application Components Understanding the Processing Logic
  28. 28. Introduction to the MapReduce Framework “A programming model for parallel data processing. Hadoop can run MapReduce programs written in multiple languages, such as Java, Python, Ruby, and C++.”  Map function: • Operates on a set of (key, value) pairs • Map is applied in parallel to the input data set • This produces output keys and a list of values for each key, depending on the functionality • Mapper output is partitioned into one partition per reducer, i.e. as many partitions as there are reduce tasks for that job  Reduce function: • Operates on a set of (key, value) pairs • Reduce is then applied in parallel to each group, again producing a collection of (key, value) pairs • The number of reducers can be set by the user
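The map and reduce steps described above can be sketched in plain Java, with no Hadoop involved, to make the data flow of the model concrete (the class and method names here are illustrative, not the Hadoop API):

```java
import java.util.*;

// Plain-Java sketch of the MapReduce word-count data flow (no Hadoop involved).
public class WordCountModel {

    // "Map" phase: emit a (word, 1) pair for every token in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "Shuffle + Reduce" phase: group the pairs by key (sorted), summing each group.
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("I am working for TCS", "TCS is a great company");
        // "TCS" appears on both lines, so it reduces to 2; every other word to 1.
        System.out.println(reduce(map(input)));
    }
}
```

In real Hadoop the map calls run on many nodes, the framework sorts and groups the intermediate pairs, and the reduce calls run on the grouped partitions, but the logical flow is the same.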
  29. 29. How does a map-reduce algorithm work (1/2)
  30. 30. How does a map-reduce algorithm work (2/2)
  31. 31. MapReduce Components  JobTracker: the JobTracker is responsible for accepting jobs from clients, dividing those jobs into tasks, and assigning those tasks to be executed by worker nodes.  TaskTracker: the TaskTracker is a process that manages the execution of the tasks currently assigned to its node. Each TaskTracker has a fixed number of slots for executing tasks (two maps and two reduces by default).
  32. 32. MapReduce co-located with HDFS (diagram) A client submits a MapReduce job to the JobTracker. Slave nodes A, B, and C each run a TaskTracker and a DataNode. The JobTracker and NameNode need not be on the same node. TaskTrackers (compute nodes) and DataNodes are co-located, which gives high aggregate bandwidth across the cluster.
  33. 33. Understanding processing in a M/R framework (1/2)  The user runs a program on the client computer  The program submits a job to HDFS. A job contains: • Input data • A Map/Reduce program • Configuration information  Two types of daemons control job execution: • JobTracker (master node) • TaskTrackers (slave nodes)  The job is sent to the JobTracker; the JobTracker then communicates with the NameNode and assigns parts of the job to TaskTrackers (a TaskTracker runs on each DataNode)  A task is a single MAP or REDUCE operation over a piece of data  Hadoop divides the input to a MAP/REDUCE job into equal splits  The JobTracker knows (from the NameNode) which node contains the data, and which other machines are nearby  Task processes send heartbeats to the TaskTracker, and TaskTrackers send heartbeats to the JobTracker
  34. 34. Understanding processing in a M/R framework (2/2)  Any task that does not report within a certain time (the default is 10 minutes) is assumed to have failed; its JVM will be killed by the TaskTracker and the failure reported to the JobTracker  The JobTracker will reschedule any failed task (on a different TaskTracker)  If the same task fails 4 times, the whole job fails  Any TaskTracker reporting a high number of failed tasks on a particular node causes that node to be blacklisted  The JobTracker maintains and manages the status of each job. Results from failed tasks are ignored
  35. 35. Computing parallelism meets data locality  All map tasks are equivalent, so they can run in parallel  All reduce tasks can also run in parallel  Input data on HDFS can be processed independently  Therefore, running a map task on whatever data is local (or closest) to a particular node in HDFS gives good performance • For map task assignment, the JobTracker has an affinity for a node which has a replica of the input data • If lots of data does happen to pile up on the same node, nearby nodes will map instead  This also improves recovery from partial failure of servers or storage during the operation: if one map or reduce task fails, the work can be rescheduled
  36. 36. Programming using MapReduce WordCount is a simple application that counts the number of occurrences of each word in a given input file. Here we divide the entire code into 3 files: 1) (the map logic) 2) (the reduce logic) 3) (the driver)
  37. 37.
import;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

// The interface is fully qualified because this class is itself named Mapper.
public class Mapper extends MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);   // emit <word, 1> for every token
    }
  }
}
  38. 38. For the following standard input, the mapper does the following Input: I am working for TCS TCS is a great company The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <<word>, 1>. Output: <I,1> <am,1> <working,1> <for,1> <TCS,1> <TCS,1> <is,1> <a,1> <great,1> <company,1>
  39. 39. Sorted mapper output to reducer Hence, the output of each map is passed through a sorting algorithm which sorts the output of the map according to the keys. Output: <a,1> <am,1> <company,1> <for,1> <great,1> <I,1> <is,1> <TCS,1> <TCS,1> <working,1>
  40. 40.
import;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

// The interface is fully qualified because this class is itself named Reducer.
public class Reducer extends MapReduceBase
    implements org.apache.hadoop.mapred.Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum +=;   // add up the 1s emitted for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
  41. 41. Reducer output  The output of the Mapper is given to the Reducer, which sums up the values, which are the occurrence counts for each key (i.e. the words, in this example). Output: <a,1> <am,1> <company,1> <for,1> <great,1> <I,1> <is,1> <TCS,2> <working,1>
  42. 42.
import;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

// Driver class: configures and submits the word-count job.
public class Basic {

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Basic.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Mapper.class);     // our Mapper class above
    conf.setReducerClass(Reducer.class);   // our Reducer class above

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
  43. 43. Executing the MapReduce program 1) Compile all 3 Java files, which will create 3 .class files 2) Add all 3 .class files into 1 single jar file with this command: jar -cvf file_name.jar *.class 3) Now you just need to execute the single jar file with this command: bin/hadoop jar file_name.jar Basic input_file_name output_file_name
  44. 44. Hadoop Clusters Agenda Cluster Concepts Installing Hadoop Creating a pseudo cluster
  45. 45. Clustering in Hadoop  Clustering in Hadoop can be achieved in the following modes  Local (standalone) mode, used for debugging: • By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This mode is useful for debugging.  Pseudo-distributed mode, used for development: • Hadoop can also be run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process  Fully-distributed mode, used for production: • In this mode, all Hadoop daemons run on separate nodes; this is the mode used in production.
  46. 46. Pseudo-distributed mode configuration
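A typical Hadoop 1.x pseudo-distributed setup edits three files under conf/. The host names and ports below are the conventional values from the Hadoop single-node setup documentation; adjust them to your environment:

```xml
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name></name>
    <value>hdfs://localhost:9000</value>   <!-- NameNode address -->
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>   <!-- single node, so only one replica per block -->
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>   <!-- JobTracker address -->
  </property>
</configuration>
```

With these in place, all five daemons (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker) run on one machine, each in its own JVM, as described on the previous slide.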
  47. 47. Executing a pseudo cluster  Format a new distributed file system: $ bin/hadoop namenode -format  Start the Hadoop daemons: $ bin/  Copy the input files into the distributed filesystem: $ bin/hadoop fs -copyFromLocal input1 input  Run one of the examples provided: $ bin/hadoop jar hadoop-examples.jar wordcount input output  Copy the output files from the distributed file system to the local file system and examine them: $ bin/hadoop fs -copyToLocal output output $ cat output/part-00000  When you're done, stop the daemons with: $ bin/
  48. 48. Questions ?
  49. 49. Hadoop Ecosystems Agenda  Pig Concepts  Hive Concepts  HBase Concepts
  50. 50. Hadoop Ecosystems • Apache Hive: SQL-like language and metadata repository • Apache Pig: high-level language for expressing data analysis programs • Apache HBase: the Hadoop database; random, real-time read/write access • Sqoop: integrates Hadoop with RDBMSs • Oozie: server-based workflow engine for Hadoop activities • Hue: browser-based desktop interface for interacting with Hadoop • Flume: distributed service for collecting and aggregating log and event data • Apache Whirr: library for running Hadoop in the cloud • Apache Zookeeper: highly reliable distributed coordination service
  51. 51. Pig Concepts What is Pig?  It is an open-source, high-level dataflow system introduced by Yahoo  It provides a simple language for queries and data manipulation, Pig Latin, that is compiled into map-reduce jobs that are run on Hadoop  Pig Latin combines the high-level data manipulation constructs of SQL with the procedural programming of map-reduce Why is it important?  Companies and organizations like Yahoo, Google, and Microsoft are collecting enormous data sets in the form of click streams, search logs, and web crawls  Some form of ad-hoc processing and analysis of all of this information is required
  53. 53. Hive Concepts What is Hive?  It is an open-source DW solution built on top of Hadoop, introduced by Facebook  It supports a SQL-like declarative language called HiveQL; queries are compiled into map-reduce jobs executed on Hadoop  It also supports custom map-reduce scripts that can be plugged into queries  It includes a system catalog, the Hive Metastore, for query optimization and data exploration Why is it important?  It is very easy to learn because of its similarity to SQL  It has built-in user-defined functions (UDFs) to manipulate dates, strings, and other data types. Hive supports extending the UDF set to handle use cases not supported by the built-in functions.
  54. 54. Hive execution plan (diagram) Clients (via the CLI, JDBC, or ODBC) submit HiveQL to the Driver, which invokes the Compiler; the Compiler produces a DAG of map-reduce jobs, which the Execution Engine runs on Hadoop.
  55. 55. Difference between Pig & Hive  Apache Pig and Hive are two projects that layer on top of Hadoop and provide a higher-level language for using Hadoop's MapReduce library  Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data  If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop"  Apache Hive offers an even more specific, higher-level language for querying data by running Hadoop jobs, rather than directly scripting the step-by-step operation of several MapReduce jobs on Hadoop  Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and business intelligence systems  Pig lets users express the same operations in a language not unlike a Bash or Perl script
  56. 56. HBase  Apache HBase in a few words: "HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable"  HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS that supports SQL as its primary access language; there are many types of NoSQL databases.  HBase is very much a distributed database. Technically speaking, HBase is really more a "data store" than a "database", because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.  However, HBase has many features which support both linear and modular scaling.  HBase provides an easy-to-use Java API for programmatic access.
  57. 57. Why is it important?  HBase is a Bigtable clone  It is open source  It has a good community and promise for the future  It is developed on top of, and has good integration with, the Hadoop platform, if you are using Hadoop already  It has a Cascading connector  No real indexes  Automatic partitioning  It scales linearly and automatically with new nodes  It runs on commodity hardware  Fault tolerance  Batch processing
  58. 58. Difference between HBase and Hadoop/HDFS?  HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files.  HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
  59. 59. Questions ?
  60. 60. Thank You