Introduction to Hadoop - The Essentials

Introduction to Hadoop, slides presented at Hadoop User Group UAE meetup on November 25, 2013.


Published in: Technology

  • In a nutshell, Hadoop grew out of research at Google, was adopted by the open source community, and is supported by heavyweights such as Yahoo!, Facebook and others. It has had 6 years to mature.
  • No, it’s not Charles Darwin. Hadoop was named after the creator’s son’s toy elephant.
  • So What is Hadoop?
  • Brief description of the operation of HDFS. There are 3 main components (daemons) in HDFS: NameNode, DataNode, and Secondary NameNode.
  • There are 2 main components in MapReduce (daemons): JobTracker and TaskTracker
  • MapReduce is composed of Map tasks and Reduce tasks. Tasks within the same phase run in parallel and do not depend on each other’s output; the Reduce tasks consume the output of the Map tasks.
  • The major resources to start learning more about Hadoop. Also recommended is reading the research papers from Google that spurred the whole Hadoop ecosystem (by Sanjay Ghemawat and Jeff Dean).
  • Q&A with the famous Hadoop elephant mascot
  • Transcript

    • 1. Introduction to Hadoop - The Essentials. November 25, 2013. Fadi Yousuf
    • 2. About Me
      • Founder and Managing Director of Axeldata Systems
      • 13+ years involved in designing data architectures
      • Previous life at Sun, Cisco, Oracle, Google, F5 Networks
      • Working with Hadoop since 2011
      • Certified as Cloudera Hadoop Developer, Administrator and HBase Specialist
      • Authorized Cloudera Hadoop trainer
      • Perspective - Hadoop is the foundation of scalable big data platforms
      © 2013. Axeldata Systems FZ-LLC
    • 3. Why Hadoop?
      • RDBMS technology has served us well for 30+ years
      • Excellent for low-latency, real-time transaction-oriented data processing
      • In the age of big data, RDBMS has many limitations:
        – Volume: shared-all architecture limits linear scalability and requires fork-lift upgrades of hardware infrastructure when limits are reached
        – Variety: data has to fit nicely in rows and columns, with a rigid schema, suitable for structured data but unable to handle unstructured data
        – Velocity: ingesting data at speed means you can’t afford the time to shape data into the clean structures of relational databases
    • 4. A Brief History of Hadoop
      • 2002: Nutch created
      • 2003: Google publishes GFS paper
      • 2004: Google publishes MapReduce paper
      • 2005: Nutch rearchitecture
      • 2006: Hadoop subproject
      • 2007: 1000-node Yahoo! cluster
      • 2008: Top-level Apache Project
      • 2009: First commercial distribution
      • 2010: Hive, Pig, HBase graduate
      • 2011: Further commercial distributions
      • 2012: Impala, the first real-time query engine
      • 2013: Hadoop 2.0
    • 5. The Birth of Hadoop: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such.” - Doug Cutting, Creator of Hadoop
    • 6. Hadoop: The Big Data Platform. It is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
    • 7. Core Hadoop Concepts
      • Applications are written in high level code
        – Developers don’t need to worry about network programming and dependencies
      • Minimal communication between the nodes
        – Shared nothing architecture
      • Move compute to storage, not the opposite
        – Computation happens locally on each machine
        – No need to move data around
      • Failure is accepted and tolerated
        – Data is replicated multiple times across different machines
    • 8. Hadoop Then…
      • Storage
        – Hadoop Distributed File System (HDFS)
      • Programming Framework
        – MapReduce
      [Diagram layers: Batch MR; Resource Management; Storage; Integration]
    • 9. Hadoop Now…
      • Storage
        – Hadoop Distributed File System (HDFS)
      • Programming Framework
        – MapReduce
      [Diagram layers: SQL, Search, Math & Stats, In-Memory, …, Batch MR; Security; Metadata; Resource Management; Storage; Integration. Source: Cloudera]
    • 10. What is HDFS?
      • Distributed file system
      • Breaks large files into smaller blocks that are stored on clusters of nodes
      • Master-Slave architecture
      • Processes:
        – NameNode (Master)
        – Standby NameNode (Master)
        – DataNode (Slave)
      [Diagram: a NameNode and a Standby NameNode above four DataNodes]
    • 11. HDFS Architecture
      NameNode metadata (mirrored on the Standby NameNode), block size 64MB:
        File1: blocks 1, 2, 3
        File2: blocks 4, 5
        Block   Replica locations (node, rack)
        1       n1r1, n1r2, n2r2
        2       n1r1, n1r2, n4r2
        3       n2r1, n1r3, n3r3
        4       n4r1, n2r3, n3r3
        5       n3r1, n3r2, n4r2
      DataNodes (blocks held per node):
                Rack1   Rack2   Rack3
        node1   1, 2    1, 2    3
        node2   3       1       4
        node3   5       5       3, 4
        node4   4       2, 5    (none)
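The block bookkeeping on this slide can be sketched in a few lines of plain Python. This is a toy model, not the HDFS API: the function names `split_into_blocks` and `place_replicas` are made up for illustration, and the round-robin placement is an assumption that stands in for HDFS's real rack-aware placement policy.

```python
import itertools
import math

BLOCK_SIZE_MB = 64   # block size shown on the slide
REPLICATION = 3      # replicas per block, as in the slide's location table


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; only the last block may be smaller."""
    count = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * count
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes[-1] = remainder
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy placement (assumption): hand out nodes round-robin so that the
    replicas of one block land on distinct nodes. Requires
    len(nodes) >= replication. Real HDFS placement is rack-aware."""
    ring = itertools.cycle(nodes)
    return {
        block_id: [next(ring) for _ in range(replication)]
        for block_id in range(1, num_blocks + 1)
    }


if __name__ == "__main__":
    sizes = split_into_blocks(150)          # a 150MB file
    print(sizes)                            # [64, 64, 22]
    print(place_replicas(len(sizes), ["n1r1", "n1r2", "n2r2", "n4r2"]))
```

The dictionary returned by `place_replicas` plays the role of the block-to-location table that the NameNode keeps as metadata; the DataNodes hold the blocks themselves.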
    • 12. What is MapReduce (MRv1)?
      • Programming Framework
      • Breaks processing into 2 phases:
        – Map phase
        – Reduce phase
      • Master-Slave architecture
      • Processes:
        – JobTracker (Master)
        – TaskTracker (Slave)
      [Diagram: one JobTracker above five TaskTrackers]
    • 13. MapReduce Job
      [Diagram: the JobTracker hands Tasks to the TaskTrackers running on node1 to node4 across Rack1, Rack2 and Rack3, which hold blocks 1 to 5]
    • 14. MapReduce: The Mapper
      • Is a function that performs the map phase
      • Each mapper usually operates on a single HDFS block
      • Takes a key and value as input and can generate multiple keys and values as output
      • <k1,v1> → list(<k2,v2>)
      • The output of all mappers is then sorted by key
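A minimal sketch of the mapper contract, written in plain Python rather than against the Hadoop API. The names `word_count_mapper` and `run_map_phase` are hypothetical, and using (offset, line) as the input record is an assumption that loosely mirrors how text input is commonly fed to mappers.

```python
def word_count_mapper(offset, line):
    """Map one record <k1, v1> = (offset, line of text) to a list of
    <k2, v2> = (word, 1) pairs. The input key is ignored here."""
    pairs = []
    for token in line.split():
        word = token.strip(".,;:!?\"'").lower()
        if word:
            pairs.append((word, 1))
    return pairs


def run_map_phase(lines):
    """Apply the mapper to every record, then sort all output by key."""
    output = []
    offset = 0
    for line in lines:
        output.extend(word_count_mapper(offset, line))
        offset += len(line) + 1  # +1 for the newline that separated the lines
    return sorted(output)


if __name__ == "__main__":
    print(run_map_phase(["go now", "and go"]))
    # [('and', 1), ('go', 1), ('go', 1), ('now', 1)]
```

One input record produces several output pairs, and the final sort puts equal keys next to each other, which is what the reduce phase relies on.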
    • 15. MapReduce: The Reducer
      • Is a function that performs the reduce phase
      • Each reducer operates on a portion of the output of all mappers
      • Takes a key with a list of all values as input and generates an aggregate of the values for each key
      • <k2,list(v2)> → list(<k3,v3>)
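The reducer contract can be sketched the same way. This is plain Python, not Hadoop code; `sum_reducer` and `run_reduce_phase` are hypothetical names, and summing is just one possible aggregate.

```python
from itertools import groupby
from operator import itemgetter


def sum_reducer(key, values):
    """Reduce <k2, list(v2)> to list(<k3, v3>): one (key, total) pair."""
    return [(key, sum(values))]


def run_reduce_phase(sorted_pairs):
    """Group key-sorted (key, value) pairs by key and reduce each group.
    The input must already be sorted by key, as groupby only merges
    adjacent items with equal keys."""
    results = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        results.extend(sum_reducer(key, values))
    return results


if __name__ == "__main__":
    pairs = [("and", 1), ("go", 1), ("go", 1), ("now", 1)]
    print(run_reduce_phase(pairs))
    # [('and', 1), ('go', 2), ('now', 1)]
```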
    • 16. MapReduce Data Flow
      [Diagram: Input HDFS (Split 0, Split 1, Split 2) → one Map per split → sort → copy → merge → Reduce → Output HDFS (Part 0, Part 1)]
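The sort, copy and merge steps between Map and Reduce can be imitated in memory as follows. This is a sketch under stated assumptions: `partition` and `shuffle` are hypothetical names, and routing keys by a hash modulo the number of reducers is a common partitioning scheme that is assumed here, with CRC32 chosen only to keep the result stable across runs.

```python
from collections import defaultdict
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key. CRC32 is used because it is stable
    across runs, unlike Python's built-in hash() for strings."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Copy each mapper's pairs to the reducer that owns the key, then
    merge: every reducer ends up with key-sorted (key, [values]) input."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:             # one list of pairs per map task
        for key, value in sorted(pairs):  # map-side sort
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


if __name__ == "__main__":
    map_outputs = [[("go", 1), ("and", 1)], [("go", 1), ("now", 1)]]
    for part_id, part in enumerate(shuffle(map_outputs, 2)):
        print("Part", part_id, part)
```

Because a key always maps to the same partition, all values for that key meet at a single reducer, which is why each output part can be written independently.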
    • 17. HDFS & MapReduce Example: Word Count
      Original File:
        I will arise and go now, and go to Innisfree,
        And a small cabin build there, of clay and wattles made:
        Nine bean-rows will I have there, a hive for the honey-bee;
        And live alone in the bee-loud glade.
        And I shall have some peace there, for peace comes dropping slow,
        Dropping from the veils of the morning to where the cricket sings;
        There midnight's all a glimmer, and noon a purple glow,
        And evening full of the linnet's wings.
        I will arise and go now, for always night and day
        I hear lake water lapping with low sounds by the shore;
        While I stand on the roadway, or on the pavements grey,
        I hear it in the deep heart's core.
      [Diagram: Original File → File on HDFS → Mapper → Reduce → Output. The file is stored on HDFS as three pieces, one per four-line stanza; each piece feeds one Map, and three Reduces produce the output]
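The word count flow on this slide can be imitated end to end in a single process. The sketch below is plain Python, not the code used in the demo; `map_split` and `word_count` are hypothetical names, and to keep it short each split holds only the first line of a stanza.

```python
import re
from collections import defaultdict

# First line of each stanza from the slide; one split per map task.
SPLITS = [
    "I will arise and go now, and go to Innisfree,",
    "And I shall have some peace there, for peace comes dropping slow,",
    "I will arise and go now, for always night and day",
]


def map_split(text):
    """Map task: emit (word, 1) for every word in one split."""
    return [(word, 1) for word in re.findall(r"[a-z'-]+", text.lower())]


def word_count(splits):
    """Run one map task per split, group the values by word, then sum."""
    grouped = defaultdict(list)
    for split in splits:
        for word, one in map_split(split):
            grouped[word].append(one)
    return {word: sum(ones) for word, ones in sorted(grouped.items())}


counts = word_count(SPLITS)
print(counts["and"], counts["go"], counts["peace"])  # prints: 5 3 2
```

On a real cluster the map tasks would run on the nodes that hold each block and the grouping step would happen in the shuffle between machines; here both are collapsed into one loop.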
    • 18. Demo: Word Count on Hadoop
    • 19. Querying Data in Hadoop
      • Apache Hive
        – Developed at Facebook
        – Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
        – Provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL
      • Apache Pig
        – Developed at Yahoo!
        – High-level platform for creating MapReduce programs used with Hadoop
        – Has a language called PigLatin
        – Can be extended with UDFs written in Java, Python and other languages
    • 20. Hadoop Ecosystem
      • Avro: a data serialization system
      • Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
      • HBase: a scalable, distributed database that supports structured data storage for large tables
      • Mahout: a scalable machine learning and data mining library
      • Oozie: a workflow scheduler system to manage Apache Hadoop jobs
      • Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
      • ZooKeeper: a high-performance coordination service for distributed applications
    • 21. Yet Another Resource Negotiator (YARN)
      – Also known as: YARN (MapReduce v2)
      – New framework that facilitates writing arbitrary distributed processing frameworks and applications
      – Splits up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons
      – Can run applications that do not follow the MapReduce model
    • 22. Learn Hadoop
      • Download the Cloudera QuickStart VM
        – http://bit.ly/1b00iZj
        – To make it easy for you to get started with Hadoop
        – Cloudera Distribution including Apache Hadoop (CDH)
        – With Cloudera Manager, Cloudera Impala, and Cloudera Search, this virtual machine includes everything you need
      • Formal training as Developer, Administrator, Analyst and others
      • Free Courseware on Udacity: Introduction to Hadoop and MapReduce
        – https://www.udacity.com/course/ud617
    • 23. Other Hadoop Resources
      • Apache Project Websites
        – Hadoop: http://hadoop.apache.org/
        – Hive: http://hive.apache.org/
        – Pig: http://pig.apache.org/
        – Sqoop: http://sqoop.apache.org/
        – Flume: http://flume.apache.org/
      • Original GFS and MapReduce Papers
        – GFS: http://bit.ly/VZk9VL
        – MapReduce: http://bit.ly/8VDMHO
    • 24. Community: A community of Hadoop professionals and users in the region. meetup.com/Hadoop-User-Group-UAE/
    • 25. Q&A
    • 26. fadi@axeldata.com www.axeldata.com Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks are the property of their respective owners.