Introduction to Hadoop and MapReduce


This talk was given at a GDG Fresno meeting. The demo used Google Compute Engine and Google Cloud Storage. The actual talk differed from the slides: the audience asked many good questions, and the discussion diverted to side topics several times.

Speaker notes
  • Big Data consists of very large volumes of heterogeneous data that is generated, often at high speed. These data sets cannot be managed and processed using traditional data management tools and applications at hand. Big Data requires a new set of tools, applications, and frameworks to process and manage the data.
  • However, it is not just about the total size of the data (volume). It is also about the velocity (how rapidly the data is arriving) and the structure: what is it, and does it have variations?
    Sources of Big Data:
    – Science: scientists are regularly challenged by large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research.
    – Sensors: data sets grow in size in part because they are increasingly gathered by ubiquitous information-sensing mobile devices, aerial sensing technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.
    – Social networks: think of Facebook, LinkedIn, Yahoo, Google.
    – Social influencers: blog comments, Yelp likes, Twitter, Facebook likes, Apple's App Store, Amazon, ZDNet, etc.
    – Log files: computer and mobile device log files, web site tracking information, application logs, and sensor data. There are also sensors in vehicles, video games, cable boxes or, soon, household appliances.
    – Public data stores: Microsoft Azure Marketplace/DataMarket, The World Bank, SEC/EDGAR, Wikipedia, IMDb.
    – Data warehouse appliances: Teradata, IBM Netezza, EMC Greenplum; these include internal, transactional data that is already prepared for analysis.
    – Network and in-stream monitoring technologies: TCP/IP packets, email, etc.
    – Legacy documents: archives of statements, insurance forms, medical records, and customer correspondence.
    Volume: Volume refers to the size of the data we are working with. With the advancement of technology and the rise of social media, the amount of data is growing very rapidly. This data is spread across different places, in different formats, in volumes ranging from gigabytes to terabytes, petabytes, and beyond. Today, data is generated not only by humans; machines generate large amounts of data as well, and machine-generated data now surpasses human-generated data. This size aspect of data is referred to as Volume in the Big Data world.
    Velocity: Velocity refers to the speed at which the data is being generated. Different applications have different latency requirements, and in today's competitive world decision makers want the necessary data/information in the least time possible, generally in near real time, or in real time in certain scenarios. In different fields and areas of technology, we see data generated at different speeds: a few examples include trading/stock exchange data, tweets on Twitter, and status updates/likes/shares on Facebook. This speed aspect of data generation is referred to as Velocity in the Big Data world.
    Variety: Variety refers to the different formats in which the data is generated and stored. Different applications generate and store data in different formats. In today's world, large volumes of unstructured data are generated in addition to the structured data produced in enterprises. Before the advancements in Big Data technologies, the industry had no powerful, reliable tools that could work with the voluminous unstructured data we see today. Organizations can no longer rely only on the structured data from enterprise databases/warehouses; to stay competitive, they are also forced to consume lots of data generated both inside and outside the enterprise, such as clickstream data and social media. Apart from the traditional flat files, spreadsheets, relational databases, etc., we have a lot of unstructured data stored in the form of images, audio files, video files, web logs, sensor data, and many others. This aspect of varied data formats is referred to as Variety in the Big Data world.
  • The Google File System: a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
  • MapReduce: Simplified Data Processing on Large Clusters: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key (the function shapes are sketched below).
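To make the paper's contract concrete, here is a minimal Java sketch of the two function shapes. The interface and type names below are invented for illustration; they are not Hadoop's actual API.

    import java.util.List;

    // Function shapes from the MapReduce paper:
    //   map:    (k1, v1)       -> list of (k2, v2)
    //   reduce: (k2, list(v2)) -> list of v2
    interface PaperMapper<K1, V1, K2, V2> {
        List<Pair<K2, V2>> map(K1 key, V1 value);
    }

    interface PaperReducer<K2, V2> {
        List<V2> reduce(K2 key, List<V2> values);
    }

    // Simple immutable key/value pair for the sketch (Java 16+ record).
    record Pair<K, V>(K key, V value) {}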

    1. Introduction to Hadoop and MapReduce
       Csaba Toth
       GDG Fresno Meeting
       Date: February 6th, 2014
       Location: The Hashtag, Fresno
    2. Agenda
       • Big Data
       • A little history
       • Hadoop
       • MapReduce
       • Demo: Hadoop with Google Compute Engine and Google Cloud Storage
    3. Big Data
       • Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
       • Examples (Wikibon: A Comprehensive List of Big Data Statistics):
         – 100 terabytes of data are uploaded to Facebook every day
         – Facebook stores, processes, and analyzes more than 30 petabytes of user-generated data
         – Twitter generates 12 terabytes of data every day
         – LinkedIn processes and mines petabytes of user data to power the "People You May Know" feature
         – YouTube users upload 48 hours of new video content every minute of the day
         – Decoding the human genome used to take 10 years; now it can be done in 7 days
    4. Big Data characteristics
       • Three Vs: Volume, Velocity, Variety
       • Sources:
         – Science, sensors, social networks, log files
         – Public data stores, data warehouse appliances
         – Network and in-stream monitoring technologies
         – Legacy documents
       • Main problems:
         – Storage problem
         – Money problem
         – Consuming and processing the data
    5. A Little History
       Two seminal papers:
       • “The Google File System” (October 2003): describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications, running on inexpensive commodity hardware and delivering high aggregate performance.
       • “MapReduce: Simplified Data Processing on Large Clusters” (April 2004): describes a programming model and an implementation for processing large data sets, built on:
         1. a map function that processes a key/value pair to generate a set of intermediate key/value pairs
         2. a reduce function that merges all intermediate values associated with the same intermediate key
    6. Hadoop
       • Hadoop is an open-source software framework that supports data-intensive distributed applications.
       • It is written in Java and utilizes JVMs.
       • Named after its creator's (Doug Cutting, Yahoo) son's toy elephant.
       • Hadoop manages a cluster of commodity hardware computers. The cluster is composed of a single master node and multiple worker nodes.
    7. Hadoop vs RDBMS

       Aspect                   Hadoop / MapReduce                       RDBMS
       Size of data             Petabytes                                Gigabytes
       Integrity of data        Low                                      High (referential, typed)
       Data schema              Dynamic                                  Static
       Access method            Batch                                    Interactive and batch
       Scaling                  Linear                                   Nonlinear (worse than linear)
       Data structure           Unstructured                             Structured
       Normalization of data    Not required                             Required
       Query response time      Has latency (due to batch processing)    Can be near immediate
    8. MapReduce
       • Hadoop leverages the MapReduce programming model, which is optimized for processing large data sets.
       • MapReduce is an essential technique for distributed computing on clusters of computers/nodes.
       • The goal of MapReduce is to break huge data sets into smaller pieces, distribute those pieces to various worker nodes, and process the data in parallel.
       • Hadoop leverages a distributed file system to store the data on the various nodes.
    9. MapReduce
       It is about two functions: map and reduce.
       1. Map step: divides the problem into smaller subproblems. A master node distributes the work to worker nodes; each worker node does one piece of work and returns the result to the master node.
       2. Reduce step: once the master has the results from the worker nodes, the reduce step combines all the work. By combining the work you form the answer and, ultimately, the output.
    10. MapReduce: Map step
       • There is a master node and many slave nodes.
       • The master node takes the input, divides it into smaller sub-problems, and distributes them to worker or slave nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
       • The worker/slave nodes process their smaller problems and pass the answers back to their master node.
       • Each mapping operation is independent of the others, so all maps can be performed in parallel.
    11. MapReduce: Reduce step
       • The master node then collects the answers from the worker or slave nodes. It aggregates the answers and creates the needed output, which is the answer to the problem it was originally trying to solve.
       • Reducers can also perform the reduction phase in parallel; that is how the system can process petabytes in a matter of hours.
    12. Map, Shuffle, and Reduce
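    This slide showed, as a figure, how map output is shuffled to the reducers. The shuffle routes every intermediate key to exactly one reducer; in Hadoop this is decided by a partitioner, and the default behaves roughly like the simplified sketch below (modeled on Hadoop's HashPartitioner):

        // Simplified sketch of Hadoop's default hash partitioning:
        // the same key always lands in the same partition, so one
        // reducer sees every value for the keys it owns.
        int getPartition(Object key, int numReduceTasks) {
            // Mask off the sign bit so the index is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }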
    13. Word count
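    The word-count slide showed the canonical MapReduce example as a figure. For reference, a minimal Java sketch of that job against Hadoop's org.apache.hadoop.mapreduce API (essentially the standard WordCount sample; input and output paths come from the command line) could look like this:

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // Map step: emit (word, 1) for every word in the input split.
            public static class TokenizerMapper
                    extends Mapper<Object, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                    StringTokenizer itr = new StringTokenizer(value.toString());
                    while (itr.hasMoreTokens()) {
                        word.set(itr.nextToken());
                        context.write(word, ONE);
                    }
                }
            }

            // Reduce step: sum the counts collected for each word.
            public static class IntSumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                private final IntWritable result = new IntWritable();

                @Override
                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    result.set(sum);
                    context.write(key, result);
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }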
    14. Hadoop architecture
       • Job Tracker
       • Task Tracker
       • Name Node
       • Data Node
    15. Figures
       • The following slides show some figures from the book Hadoop: The Definitive Guide, 3rd Edition
    16. A client reading data from HDFS
    17. A client writing data to HDFS
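    The two HDFS figures above show the read and write paths; from application code, both go through Hadoop's FileSystem API. Below is a minimal sketch; the namenode URI and file paths are placeholders:

        import java.io.InputStream;
        import java.io.OutputStream;
        import java.net.URI;
        import java.nio.charset.StandardCharsets;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class HdfsReadWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Placeholder namenode URI; in practice this comes from core-site.xml.
                FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

                // Write path: the client asks the namenode for target datanodes,
                // then streams the data through a pipeline of datanodes.
                try (OutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
                    out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
                }

                // Read path: the client gets block locations from the namenode,
                // then reads each block directly from the closest datanode.
                try (InputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }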
    18. Network distance in Hadoop
    19. MapReduce data flow with a single reduce task
    20. MapReduce data flow with multiple reduce tasks
    21. Hadoop architecture
       • Presentation layer: web browser (JS)
       • Data mining (Pegasus, Mahout); index and searches (Lucene); DB drivers (Hive driver)
       • Advanced query engine (Hive, Pig)
       • Computing layer (MapReduce)
       • Storage layer (HDFS)
       • Data integration layer: Flume (log data), Sqoop (RDBMS)
    22. Demo
       • Google Compute Engine + Google Cloud Storage
       • Using Ubuntu as a remote control host
       • Following the tutorial: Hadoop on Google Compute Engine for Processing Big Data
       • The example Hadoop job is an advanced version of word count in Perl or Python: the words are sorted by length and then alphabetically (see the sketch below)
       • Also showing the Google Developer Tools web interface
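    The demo job's sort order (by word length, then alphabetically within a length) can be expressed as a comparator. A minimal Java sketch, independent of the actual Perl/Python demo code:

        import java.util.Arrays;
        import java.util.Comparator;

        public class LengthThenAlpha {
            public static void main(String[] args) {
                String[] words = {"pear", "fig", "apple", "kiwi"};
                // Sort by length first, then alphabetically within a length.
                Arrays.sort(words, Comparator
                        .comparingInt(String::length)
                        .thenComparing(Comparator.naturalOrder()));
                System.out.println(Arrays.toString(words)); // [fig, kiwi, pear, apple]
            }
        }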
    23. References
       • Google’s tutorial (see the GitHub and YouTube links in the Demo)
       • Tom White: Hadoop: The Definitive Guide, 3rd Edition, Yahoo Press
       • Lynn Langit’s various presentations and YouTube videos
       • Dattatrey Sindol: Big Data Basics, Part 1: Introduction to Big Data
       • Bruno Terkaly’s presentations (for example, Hadoop on Azure: Introduction)
       • Daniel Jebaraj: Ignore HDInsight at Your Own Peril: Everything You Need to Know
    24. Thanks for your attention!