
Hadoop by sunitha


  1. Explained (Sunitha Raghurajan)
  2. Data…Data…Data…
     • We live in a data world.
     • Total Facebook users: 835,525,280 (March 31, 2012).
     • The New York Stock Exchange generates about one terabyte of new trade data per day.
     • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
     Source: http://www.internetworldstats.com/facebook.htm
  3. Data is growing
     Figure from Gantz et al., “The Diverse and Exploding Digital Universe,” March 2008 (http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf).
  4. Problem
     • How do we store and analyze the data?
     • A one-terabyte drive has a transfer speed of around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. Writing is even slower.
     • With 100 drives, each holding one hundredth of the data and read in parallel, the same read would take under two minutes.
     • Reliability issues (hard drive failures).
     • How do we combine the data from 100 drives?
     • Existing tools are inadequate for processing large data sets.
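The read-time figures on this slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1,000,000 MB) and perfectly even, parallel reads, and the function name `read_time_seconds` is made up for illustration.

```python
# Back-of-the-envelope read times using the figures quoted on the slide.
TERABYTE_MB = 1_000_000      # 1 TB in MB, assuming decimal units
TRANSFER_MB_PER_S = 100      # sustained transfer speed of a single drive


def read_time_seconds(data_mb: float, drives: int = 1) -> float:
    """Time to read data_mb spread evenly over `drives` drives read in parallel."""
    return data_mb / drives / TRANSFER_MB_PER_S


single = read_time_seconds(TERABYTE_MB)                # one drive holds everything
parallel = read_time_seconds(TERABYTE_MB, drives=100)  # each drive holds 1/100

print(f"1 drive:    {single / 3600:.2f} hours")
print(f"100 drives: {parallel / 60:.2f} minutes")
```

The single-drive case comes out at 10,000 seconds, which is where "more than two and a half hours" comes from; spreading the data over 100 drives cuts that to 100 seconds.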
  5. Why can’t we use an RDBMS?
     • An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data.
     • Reading large amounts of data takes longer. (Figure labels: CPU, Memory, Disk)
  6. Hadoop is the answer!
     • Hadoop is an open source project licensed under the Apache v2 license: http://hadoop.apache.org/
     • It is used for processing large datasets in parallel on low-cost commodity machines.
     • Hadoop is built on two main parts: a special file system called the Hadoop Distributed File System (HDFS) and the MapReduce framework.
  7. Hadoop History
     • Hadoop was created by Doug Cutting, who named it after his son’s toy elephant.
     • 2002-2004: Nutch, an open source, web-scale, crawler-based search engine.
     • 2004-2006: Google File System and MapReduce papers published. DFS and MapReduce implementations added to Nutch.
     • 2006-2008: Yahoo hired Doug Cutting.
     • On February 19, 2008, Yahoo! Inc. launched what it claimed was the world’s largest Hadoop production application.
     • The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.
  8. Who uses Hadoop?
     Amazon        American Airlines
     AOL           Apple
     eBay          Federal Reserve Board of Governors
     foursquare    Fox Interactive Media
     Facebook      StumbleUpon
     Gemvara       Hewlett-Packard
     IBM           Microsoft
     Twitter       NYTimes
     Netflix       LinkedIn
  9. Why Hadoop?
     • Reliable: the software is fault tolerant; it expects and handles hardware and software failures.
     • Scalable: designed for massive scale of processors, memory, and locally attached storage.
     • Distributed: handles replication and offers a massively parallel programming model, MapReduce.
  10. What is MapReduce?
     • A programming model used by Google.
     • A combination of the Map and Reduce models with an associated implementation.
     • Used for processing and generating large data sets.
  11. MapReduce Explained
     • The basic idea is that you divide the job into two parts: a Map and a Reduce.
     • Map takes the problem, splits it into sub-parts, and sends the sub-parts to different machines, so that all the pieces run at the same time.
     • Reduce takes the results from the sub-parts and combines them back together to get a single answer.
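As a rough illustration of the Map and Reduce halves described on this slide, here is a minimal single-process word count. It is a sketch of the idea only, not Hadoop's API: the names `map_phase` and `reduce_phase` and the sample chunks are made up, and the "different machines" are simulated by a plain loop.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one sub-part of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(pairs):
    """Reduce: combine the pairs from all sub-parts into one total per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Stand-ins for the sub-parts that would be sent to different machines.
chunks = ["big data big", "data is big"]

# In a real cluster each chunk would be mapped on its own machine.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(mapped)
print(word_counts)
```

Each call to `map_phase` depends only on its own chunk, which is what makes it possible to run the pieces at the same time.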
  12. Distributed Grep
     [Diagram: very big data -> split data (x4) -> grep on each split -> matches from each split -> cat -> all matches]
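The distributed grep pipeline (split, grep each split, cat the matches) might be sketched as follows. This is an assumption-laden toy: a thread pool on one machine stands in for the cluster, and the helper names `split_data`, `grep`, and `distributed_grep` and the sample log lines are invented for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_data(lines, n_splits):
    """Cut the input into at most n_splits contiguous pieces of similar size."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def grep(lines, pattern):
    """Return the lines of one split that match the pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def distributed_grep(lines, pattern, n_splits=4):
    splits = split_data(lines, n_splits)
    # Each split is searched independently; threads stand in for machines.
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        per_split = list(pool.map(lambda s: grep(s, pattern), splits))
    # The "cat" step: concatenate the per-split matches in split order.
    return [line for matches in per_split for line in matches]


log = ["INFO start", "ERROR disk failed", "INFO ok", "ERROR timeout", "INFO done"]
errors = distributed_grep(log, r"ERROR")
print(errors)
```

Because `pool.map` returns results in the order of its inputs, the concatenated matches keep the order they had in the original data.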
  13. MapReduce Architecture
  14. How Map and Reduce Work Together
  15. MapReduce
     [Diagram: very big data -> Map -> partitioning function -> Reduce -> result]
     • Map:
       – Accepts an input key/value pair
       – Emits intermediate key/value pairs
     • Reduce:
       – Accepts an intermediate key/value* pair (a key together with all of its values)
       – Emits an output key/value pair
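The role of the partitioning function between Map and Reduce can be shown with a small in-memory sketch. Everything here is hypothetical: the record format ("store,amount"), the function names, and the choice of a CRC32 hash modulo the number of reducers are assumptions made for the example, not a description of Hadoop's own partitioner.

```python
import zlib
from collections import defaultdict


def map_fn(key, value):
    """Map: accept an input (key, value) pair, emit intermediate (key, value) pairs."""
    # The input key is a line number; the value is a made-up "store,amount" record.
    store, amount = value.split(",")
    yield store, float(amount)


def partition(key, num_reducers):
    """Partitioning function: pick the reducer that receives an intermediate key."""
    # A stable hash, so that the same key always goes to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Reduce: accept an intermediate key with all its values, emit an output pair."""
    return key, sum(values)


def run_job(records, num_reducers=2):
    # One bucket per reducer; each bucket groups values by intermediate key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for in_key, in_value in records:
        for key, value in map_fn(in_key, in_value):
            buckets[partition(key, num_reducers)][key].append(value)
    result = {}
    for bucket in buckets:  # each bucket would be handled by its own reducer
        for key, values in bucket.items():
            out_key, out_value = reduce_fn(key, values)
            result[out_key] = out_value
    return result


records = [(0, "north,10"), (1, "south,5"), (2, "north,2.5")]
totals = run_job(records)
print(totals)
```

The point of the partitioning step is that all values for one key end up at the same reducer, so each reducer can produce its output without talking to the others.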
  16. http://ayende.com/blog/4435/map-reduce-a-visual-explanation
  17. RDBMS compared to MapReduce
                   RDBMS                            MapReduce
     Data size     Gigabytes                        Petabytes
     Access        Interactive and batch            Batch
     Updates       Read and write many times        Write once, read many times
     Integrity     High                             Low
     Scaling       Nonlinear                        Linear
     Structure     Static schema                    Dynamic schema
  18. Hadoop Family
     Component    Category            Description
     Pig          Scripting           A platform for manipulating large data sets
     Mahout       Machine learning    Machine learning algorithms
     HBase        Non-relational      Bigtable-like structured storage for Hadoop HDFS
     Hive         Non-relational      Data warehouse system
     HDFS         Hadoop common       Distributes and replicates data among machines
     MapReduce    Hadoop common       Distributes and monitors tasks
     ZooKeeper                        Distributed coordination service
  19. When to use Hadoop?
     • Complex information processing is needed.
     • Unstructured data needs to be turned into structured data.
     • Queries can’t be reasonably expressed using SQL.
     • Heavily recursive algorithms.
     • Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
     • Machine learning.
     • Data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB).
     • Data value does not justify the expense of constant real-time availability, such as archives or special-interest information, which can be moved to Hadoop and remain available at lower cost.
     • Results are not needed in real time.
     • Fault tolerance is critical.
     • Significant custom coding would be required to handle job scheduling.
     Reference: http://timoelliott.com/blog/2011/09/hadoop-big-data-and-enterprise-business-intelligence.html
  20. Building Blocks of Hadoop
     Hadoop runs as a set of daemons on different servers on the network:
     • NameNode
     • DataNode
     • Secondary NameNode
     • JobTracker
     • TaskTracker
  21. Questions?
  22. References
     • Chuck Lam, Hadoop in Action
     • Tom White, Hadoop: The Definitive Guide
     • http://hadoop.apache.org/
