Hadoop 101

Deck overview:
* Introduction to Big Data and Hadoop:
- Presenting and defining big data
- Introducing Hadoop and its history
- How Hadoop works
- HDFS


Hadoop 101

  1. Introducing: The Modern Data Operating System
  2. Hadoop is ... a scalable, fault-tolerant, distributed system for data storage and processing (open source under the Apache license). Core Hadoop has two main systems:
      ● Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
      ● MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data-programming abstraction
      (A minimal word-count sketch appears after the slide listing.)
  3. Hadoop Origins (diagram): Google's GFS, Map/Reduce, and BigTable papers map to Hadoop's HDFS and MapReduce.
  4. Hadoop Chronicles (timeline diagram: the GFS, Map/Reduce, and BigTable papers; Doug Cutting)
  5. Etymology ● Hadoop was created in 2004 by Douglass "Doug" Cutting ● It implemented the Google File System and BigTable papers ● He aimed to index the web, Google-style, for the search engine project Nutch ● He named it after his son's favourite toy, an elephant named Hadoop
  6. What is Big Data? "In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools." - Wikipedia
  7. How big is big? ● 2008: Google processes 20 PB a day ● 2012: Facebook ingests 500 TB of data a day ● 2009: eBay has 6.5 PB of user data + 50 TB a day ● 2011: Yahoo! has 180-200 PB of data
  8. Limitations of the Existing Analytics Architecture (diagram): Instrumentation (raw data sources) → Data Collection (mostly append) → Storage Grid → ETL (Extract, Transform & Load) → RDBMS (aggregated data) → BI reports + online apps. Annotations: you can't explore the original raw data, moving data from storage to compute doesn't scale, and archiving = premature death of the data.
  9. Why Hadoop? Challenge: read 1 TB of data. One machine with 4 I/O channels at 100 MB/s each reads about 400 MB/s, so the full terabyte takes roughly 45 minutes; with 10 such machines reading in parallel it drops to about 4.5 minutes.
  10. Hadoop and Friends
  11. The Key Benefit: Agility/Flexibility
      Schema-On-Write (RDBMS):
      - Schema must be created before any data can be loaded
      - An explicit load operation has to take place which transforms the data to the DB-internal structure
      - New columns must be added explicitly before new data for those columns can be loaded into the database
      - Reads are fast
      - Standards / governance
      Schema-On-Read (Hadoop):
      - Data is simply copied to the file store; no transformations are needed
      - A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
      - New data can start flowing any time and will appear retroactively once the SerDe is updated to parse it
      - Load is fast
      - Flexibility / agility
      (See the schema-on-read sketch after the slide listing.)
  12. Hadoop Components - a master/slave architecture: NameNode and DataNodes (HDFS), JobTracker and TaskTrackers (MapReduce)
  13. HDFS block replication (diagram, r=3): the NameNode holds the file metadata mapping files to blocks (/kenshoo/data1.txt → blocks 1,2,3; /kenshoo/data2.txt → blocks 4,5); each block is stored on three DataNodes, as configured by dfs.replication = 3 in hdfs-site.xml.
  14. Underlying FS options:
      ext3 - released in 2001; used by Yahoo!; bootstrap + format are slow; set noatime and use tune2fs to turn off reserved blocks.
      ext4 - released in 2008; used by Google; as fast as XFS; set delayed allocation off and noatime, and use tune2fs to turn off reserved blocks.
      XFS - released in 1993; fast; drawback: deleting a large number of files.
  15. Sample HDFS shell commands:
      bin/hadoop fs -ls
      bin/hadoop fs -mkdir
      bin/hadoop fs -copyFromLocal
      bin/hadoop fs -copyToLocal
      bin/hadoop fs -moveToLocal
      bin/hadoop fs -rm
      bin/hadoop fs -tail
      bin/hadoop fs -chmod
      bin/hadoop fs -setrep -w 4 -R /dir1/s-dir
      Mounting using FUSE: hadoop-fuse-dfs dfs://10.73.9.50 /hdfs
      (Java FileSystem equivalents are sketched after the slide listing.)
  16. Network Topology - Yahoo! installation (diagram): NameNode, JobTracker, and HBase Master, with DataNodes spread across racks; 8 core switches, 100 racks, 40 servers per rack, 1 Gbit within a rack, 10 Gbit among racks, ~11 PB total.
  17. Rack Awareness (diagram): the NameNode's metadata records which DataNodes hold each block of file.txt (Blk A on DataNodes 2, 7, 8; Blk B on DataNodes 9, 12, 14), with replicas spread across racks 1-3.
  18. HDFS Writes (diagram): the client tells the NameNode it wants to write file.txt (blocks A, B, C); the NameNode returns the target DataNodes for each block (e.g. Blk A → DataNodes 2, 7, 9) and the client writes the block data to those DataNodes across the racks. (See the write sketch after the slide listing.)
  19. Reading Files (diagram): the client asks the NameNode to read file1.txt; the NameNode returns the block locations from its metadata (Blk A on DataNodes 2, 7, 8; Blk B on DataNodes 9, 12, 14) and the client reads each block directly from one of those DataNodes. (See the read sketch after the slide listing.)
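Slide 2 describes MapReduce only in the abstract. The sketch below is the customary word-count example written against the org.apache.hadoop.mapreduce API (Hadoop 2.x style with Job.getInstance); it is illustrative rather than anything from the original deck, and the input/output HDFS paths simply come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: emits (word, 1) for every token in its input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums all counts seen for a given word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }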
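As a toy illustration of slide 11's schema-on-read / late-binding idea (plain Java I/O, not Hive's real SerDe interface): records are stored as raw tab-delimited text and a column is only extracted when the data is read, so the "schema" lives entirely in the read path. The file path and column index are made up for the example.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class SchemaOnReadSketch {
        // Hypothetical deserializer: pull one field out of a raw tab-delimited record.
        static String column(String rawLine, int index) {
            String[] fields = rawLine.split("\t", -1);
            return index < fields.length ? fields[index] : null;  // late binding: absent columns read as null
        }

        public static void main(String[] args) throws Exception {
            // Stand-in for a raw file copied, untransformed, into the file store.
            List<String> rawLines = Files.readAllLines(Paths.get("/tmp/raw_events.tsv"));
            for (String line : rawLines) {
                System.out.println(column(line, 2));  // schema is applied at read time, not at load time
            }
        }
    }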
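The shell commands on slide 15 have programmatic counterparts on org.apache.hadoop.fs.FileSystem. A short sketch, with made-up paths and a replication factor of 4 echoing the -setrep example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsShellEquivalents {
        public static void main(String[] args) throws Exception {
            // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/dir1/s-dir"));                               // ~ hadoop fs -mkdir
            fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                                 new Path("/dir1/s-dir/local.txt"));          // ~ hadoop fs -copyFromLocal
            fs.setReplication(new Path("/dir1/s-dir/local.txt"), (short) 4);  // ~ hadoop fs -setrep 4

            for (FileStatus status : fs.listStatus(new Path("/dir1/s-dir"))) { // ~ hadoop fs -ls
                System.out.println(status.getPath() + " replication=" + status.getReplication());
            }
            fs.delete(new Path("/dir1/s-dir/local.txt"), false);              // ~ hadoop fs -rm
        }
    }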
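Slide 18's write path, exercised from a client: fs.create() asks the NameNode to allocate blocks and target DataNodes, and the returned stream then sends the bytes to those DataNodes. The /kenshoo path is borrowed from slide 13 and is only an example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());   // client side; metadata calls go to the NameNode
            Path file = new Path("/kenshoo/file.txt");             // example path only
            try (FSDataOutputStream out = fs.create(file, true)) { // NameNode picks the DataNodes for each block
                out.writeUTF("hello hdfs");                        // bytes are replicated across the chosen DataNodes
            }
        }
    }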
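And slide 19's read path: fs.open() fetches the block-to-DataNode map from the NameNode, after which the client streams each block directly from a DataNode. file1.txt is the file name used on the slide; its location here is assumed.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/kenshoo/file1.txt");            // example path
            try (FSDataInputStream in = fs.open(file);             // NameNode returns the block locations here
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {       // blocks are streamed from the DataNodes
                    System.out.println(line);
                }
            }
        }
    }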
