Hadoop technology


Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for the deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem, which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (akin to RAID at a server level).

What is Hadoop?

The data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10 GB. And then 100 GB. And you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. When your data grows to 10 TB, and then 100 TB, you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured, or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21.

Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Hadoop is also not suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.
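The replication behavior described above can be illustrated with a small sketch. This is plain Python, not the actual HDFS implementation; the function names and the toy placement policy are hypothetical, but with a replication factor of three, losing any single node always leaves at least two copies of every block:

```python
import random

def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy model of HDFS placement)."""
    return {block: random.sample(nodes, replication) for block in block_ids}

def surviving_copies(placement, failed_node):
    """Count the copies of each block that remain after one node fails."""
    return {block: sum(1 for n in replicas if n != failed_node)
            for block, replicas in placement.items()}

nodes = [f"node{i}" for i in range(10)]
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes)
remaining = surviving_copies(placement, "node0")

# Every block keeps at least 2 copies after any single node failure.
assert all(count >= 2 for count in remaining.values())
```

Real HDFS placement is rack-aware rather than random, but the availability argument is the same: re-replication restores the count to three before a second failure is likely.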



Abstract

Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Problem Statement

The total amount of digital data in the world has exploded in recent years, primarily due to information generated by enterprises all over the globe. In 2006, the universal data was estimated at 0.18 zettabytes, with a tenfold growth to 1.8 zettabytes forecast by 2011.

The problem is that while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. A typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so all the data on a full drive could be read in around 300 seconds. In 2010, 1 TB drives were the standard hard disk size, but with a transfer speed of around 100 MB/s it takes more than two and a half hours to read all the data off the disk.
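The arithmetic behind those read times can be checked directly (a quick sketch using the figures quoted above):

```python
def read_time_seconds(capacity_mb, speed_mb_per_s):
    """Time to read a full drive sequentially at its rated transfer speed."""
    return capacity_mb / speed_mb_per_s

# 1990: 1,370 MB drive at 4.4 MB/s -> roughly 300 seconds.
t_1990 = read_time_seconds(1370, 4.4)

# 2010: 1 TB (~1,000,000 MB) drive at 100 MB/s -> about 2.8 hours.
t_2010 = read_time_seconds(1_000_000, 100)

print(round(t_1990))            # ~311 seconds
print(round(t_2010 / 3600, 1))  # ~2.8 hours
```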
Solution Proposed: Parallelisation

An obvious solution to this problem is parallelisation. The input data is usually large, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. Reading 1 TB from a single hard drive may take a long time, but parallelizing the read over 100 machines brings it down to about two minutes. Apache Hadoop is a framework for running applications on large clusters built of commodity hardware; the framework transparently provides applications both reliability and data motion.

Hadoop solves the problem of hardware failure through replication: redundant copies of the data are kept by the system (in the Hadoop Distributed File System), so that in the event of a failure another copy is available. The second problem is solved by a simple programming model, MapReduce, which abstracts the problem from raw data reads and writes into computation over a series of keys. HDFS and MapReduce are the two most significant features of Hadoop.
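The speedup claim is easy to sanity-check under the idealized assumption that the data splits evenly and the drives read in parallel with no coordination overhead (a best-case sketch, not a measured Hadoop result):

```python
def parallel_read_time(total_mb, speed_mb_per_s, machines):
    """Ideal read time when the data is split evenly across `machines` drives."""
    return (total_mb / machines) / speed_mb_per_s

serial = parallel_read_time(1_000_000, 100, 1)      # one drive: 10,000 s (~2.8 h)
parallel = parallel_read_time(1_000_000, 100, 100)  # 100 drives: 100 s (under 2 min)

assert parallel == serial / 100
```

In practice the speedup is sublinear because of scheduling, shuffle traffic, and stragglers, which is part of what the Hadoop framework manages transparently.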
Who Uses Hadoop?
Data Everywhere
Examples of Hadoop in Action

  • In the telecommunication industry
  • In the media
  • In the technology industry
Hadoop Architecture

Hadoop has two main components:

  • Distributed File System (HDFS)
  • MapReduce Engine
Conclusion

Hadoop (MapReduce) is a very powerful framework that enables easy development of data-intensive applications. Its objective is to help build applications that scale to thousands of machines. Hadoop is well suited to data-intensive background applications and fits our project's requirements. Apart from running applications in parallel, Hadoop provides job-monitoring features similar to Azure: if any machine crashes, the data can be recovered from other machines, which take over its jobs automatically. Putting Hadoop into the cloud also makes setup convenient; with a few command lines, we can allocate any number of clusters to run Hadoop, saving a lot of time and effort. We found the combination of cloud and Hadoop to be a common way to set up large-scale applications at lower cost and with greater elasticity.
Resources

  • http://hadoop.apache.org
  • http://developer.yahoo.com/hadoop
  • http://www.cloudera.com/resources
