Hadoop
Divide and conquer gigantic data

© Matthew McCullough, Ambient Ideas, LLC

Talk Metadata
Twitter
  @matthewmccull
  #HadoopIntro
Matthew McCullough
  Ambient Ideas, LLC
  matthewm@ambientideas.com
Hadoop v0.3.1

Matthew McCullough's Hadoop presentation to the Tampa JUG.
Published in: Education, Technology
Transcript of "Hadoop v0.3.1"

  1. 1. Hadoop Divide and conquer gigantic data © Matthew McCullough, Ambient Ideas, LLC
  2. 2. Talk Metadata Twitter @matthewmccull #HadoopIntro Matthew McCullough Ambient Ideas, LLC matthewm@ambientideas.com http://ambientideas.com/blog http://speakerrate.com/matthew.mccullough
  3. 3. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, jeff@google.com, sanjay@google.com, Google, Inc. (Slide shows the first page of the OSDI 2004 paper; the abstract opens: "MapReduce is a programming model and an associated implementation for processing and generating large data sets.")
  4. 4. (Detail view of the paper's header and abstract)
  5. 5. (Detail view of the abstract, continued)
  6. 6. (Detail view of the abstract, continued)
  7. 7. MapReduce history "A programming model and an implementation for processing and generating large data sets"
  8. 8. Origins MapReduce implementation Founded by OpenSource at
  9. 9. Today 0.20.1 current version Dozens of companies contributing Hundreds of companies using
  10. 10. Why Hadoop?
  11. 11. $74.85
  12. 12. $74.85 4gb
  13. 13. 1tb $74.85 4gb
  14. 14. vs
  15. 15. $10,000 vs $1,000
  16. 16. vs
  17. 17. Buy your way out of Failure vs. Failure is inevitable: Go Cheap
  18. 18. Failure is inevitable Go Cheap
  19. 19. Sproinnnng! Bzzzt! Crrrkt!
  20. 20. server Funerals No pagers go off when machines die Report of dead machines once a week Clean out the carcasses
  21. 21. Robustness attributes prevented from bleeding into application code Data redundancy Node death Retries Data geography Parallelism Scalability
  22. 22. Hadoop for what?
  23. 23. Structured
  24. 24. Structured Unstructured
  25. 25. NOSQL
  26. 26. NOSQL Death of the RDBMS is a lie
  27. 27. NOSQL Death of the RDBMS is a lie NoJOINs
  28. 28. NOSQL Death of the RDBMS is a lie NoJOINs NoNormalization
  29. 29. NOSQL Death of the RDBMS is a lie NoJOINs NoNormalization Big-data tools are solving different issues than RDBMSes
  30. 30. Applications
  31. 31. Applications Protein folding (pharmaceuticals)
  32. 32. Applications Protein folding (pharmaceuticals) Search engines
  33. 33. Applications Protein folding (pharmaceuticals) Search engines Sorting
  34. 34. Applications Protein folding (pharmaceuticals) Search engines Sorting Classification (government intelligence)
  35. 35. Applications
  36. 36. Applications Price search
  37. 37. Applications Price search Steganography
  38. 38. Applications Price search Steganography Analytics
  39. 39. Applications Price search Steganography Analytics Primes (code breaking)
  40. 40. Particle Physics
  41. 41. Particle Physics Large Hadron Collider
  42. 42. Particle Physics Large Hadron Collider 15 petabytes of data per year
  43. 43. Financial Trends
  44. 44. Financial Trends Daily trade performance analysis
  45. 45. Financial Trends Daily trade performance analysis Market trending
  46. 46. Financial Trends Daily trade performance analysis Market trending Uses employee desktops during off hours
  47. 47. Financial Trends Daily trade performance analysis Market trending Uses employee desktops during off hours Fiscally responsible/economical
  48. 48. Contextual Ads
  49. 49. Contextual Ads
  50. 50. 30% of Amazon sales are from recommendations
  51. 51. Not right now...
  52. 52. Not right now... Do you expect to tackle a very large problem before you: change jobs change industries retire die see the heat death of the universe
  53. 53. In the next decade, the class (scale) of problems we are aiming to solve will grow exponentially.
  54. 54. MapReduce
  55. 55. MapReduce map then... um... reduce.
  56. 56. The process
  57. 57. The process Every item in dataset is parallel candidate for Map
  58. 58. The process Every item in dataset is parallel candidate for Map Map(k1,v1) -> list(k2,v2)
  59. 59. The process Every item in dataset is parallel candidate for Map Map(k1,v1) -> list(k2,v2) Collects and groups pairs from all lists by key
  60. 60. The process Every item in dataset is parallel candidate for Map Map(k1,v1) -> list(k2,v2) Collects and groups pairs from all lists by key Reduce in parallel on each group
  61. 61. The process Every item in dataset is parallel candidate for Map Map(k1,v1) -> list(k2,v2) Collects and groups pairs from all lists by key Reduce in parallel on each group Reduce(k2, list (v2)) -> list(v3)
  62. 62. FP For the Grid MapReduce Functional programming on a distributed processing platform
  63. 63. The Goal
  64. 64. The Goal Provide the occurrence count of each distinct word across all documents
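The goal above pairs with the signatures from the process slides: Map(k1,v1) -> list(k2,v2), then grouping by key, then Reduce(k2, list(v2)) -> list(v3). A minimal single-process sketch of that word count in plain Python (this is an illustration of the model, not the Hadoop API; all function and variable names here are my own):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(v3): sum the 1s for one word
    return [sum(counts)]

def map_reduce(documents):
    # Grouping step: collect pairs from all map outputs by key,
    # then reduce each group independently (in Hadoop, in parallel)
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_fn(doc_id, text):
            groups[word].append(one)
    return {word: reduce_fn(word, counts)[0] for word, counts in groups.items()}

docs = {"a.txt": "divide and conquer", "b.txt": "divide gigantic data"}
print(map_reduce(docs))  # {'divide': 2, 'and': 1, 'conquer': 1, 'gigantic': 1, 'data': 1}
```

In Hadoop, each map_fn call would run wherever its input block lives and the grouping would happen in the shuffle phase; the point here is only the shape of the two functions.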
  65. 65. Start
  66. 66. Map
  67. 67. Grouping
  68. 68. Reduce
  69. 69. MapReduce Demo
  70. 70. Have Code, Will Travel Code travels to the data Opposite of traditional systems
  71. 71. Speed Test
  72. 72. Competition TeraSort Jim Gray, MSFT 1985 paper Derived sort benchmark http://sortbenchmark.org/ 209 seconds (2007) 120 seconds (2009)
  73. 73. Nodes
  74. 74. Processing Nodes Anonymous “No identity” is good Commodity equipment
  75. 75. Master Node Master is a special machine Use high quality hardware Single point of failure But recoverable
  76. 76. Hadoop Family
  77. 77. Hadoop Components Pig Hive Core Common Chukwa HBase HDFS
  78. 78. the Players
  79. 79. the PlayAs
  80. 80. the PlayAs Chukwa ZooKeeper Common HBase Hive HDFS
  81. 81. HDFS
  82. 82. HDFS Basics
  83. 83. HDFS Basics Based on Google's GFS (Google File System)
  84. 84. HDFS Basics Based on Google's GFS Replicated data store
  85. 85. HDFS Basics Based on Google's GFS Replicated data store Stored in 64MB blocks
  86. 86. Data Overload
  87. 87. Data Overload
  88. 88. Data Overload
  89. 89. Data Overload
  90. 90. Data Overload
  91. 91. HDFS Replicating Rack-location aware Configurable redundancy factor Self-healing
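The block size and redundancy factor above lend themselves to quick capacity arithmetic: a file is split into 64MB blocks, and every block is stored redundancy-factor times across the cluster. A small sketch (the function name is mine; it assumes the 64MB default block size and a redundancy factor of 3, both of which are configurable):

```python
import math

def hdfs_footprint(file_bytes, block_bytes=64 * 1024**2, replication=3):
    # Returns (number of blocks the file splits into,
    #          total raw bytes stored once every block is replicated)
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks, file_bytes * replication

# A 1 TB file: 16,384 blocks of 64MB, 3 TB of raw disk across the cluster
blocks, raw = hdfs_footprint(1024**4)
print(blocks, raw // 1024**4)  # 16384 3
```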
  92. 92. HDFS Demo
  93. 93. Pig
  94. 94. Pig Basics Yahoo-authored add-on DSL & tool Origin: Pig Latin Analyzes large data sets High-level language for expressing data analysis programs
  95. 95. PIG Questions Ask big questions on unstructured data How many ___? Should we ____? Decide on the questions you want to ask long after you've collected the data.
  96. 96. Pig Sample A = load 'passwd' using PigStorage(':'); B = foreach A generate $0 as id; dump B; store B into 'id.out';
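The Pig sample above splits each line of 'passwd' on ':' and keeps field $0 as id, then dumps and stores the result. For readers new to Pig Latin, the same transformation expressed in plain Python (an illustration only; Pig actually compiles this into MapReduce jobs over HDFS, and the sample data here is mine):

```python
def extract_ids(lines, delimiter=":"):
    # Equivalent of: A = load 'passwd' using PigStorage(':');
    #                B = foreach A generate $0 as id;
    return [line.split(delimiter)[0] for line in lines]

passwd = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]
print(extract_ids(passwd))  # ['root', 'daemon']
```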
  97. 97. Pig demo
  98. 98. HBase
  99. 99. HBase Basics Structured data store Notice we didn’t say relational Relies on ZooKeeper and HDFS
  100. 100. NoSQL Voldemort Google BigTable MongoDB HBase
  101. 101. HBase Demo
  102. 102. Hive
  103. 103. Hive Basics Authored by SQL interface to HBase Hive is low-level Hive-specific metadata
  104. 104. Sqoop Sqoop by is higher level Importing from RDBMS to Hive sqoop --connect jdbc:mysql://database.example.com/
  105. 105. Sync, Async RDBMS SQL is realtime Hadoop is primarily asynchronous
  106. 106. on
  107. 107. Amazon Elastic MapReduce
  108. 108. Amazon Elastic MapReduce Hosted Hadoop clusters
  109. 109. Amazon Elastic MapReduce Hosted Hadoop clusters True use of cloud computing
  110. 110. Amazon Elastic MapReduce Hosted Hadoop clusters True use of cloud computing Easy to set up
  111. 111. Amazon Elastic MapReduce Hosted Hadoop clusters True use of cloud computing Easy to set up Pay per use
  112. 112. EMR Languages Supports applications in... Java PHP Perl R Ruby C++ Python
  113. 113. EMR Languages Supports applications in... Java PHP Perl R Ruby C++ Python
  114. 114. EMR Pricing
  115. 115. EMR Pricing
  116. 116. EMR Functions RunJobFlow: Creates a job flow request, starts EC2 instances, and begins processing. DescribeJobFlows: Provides status of your job flow request(s). AddJobFlowSteps: Adds additional steps to an already running job flow. TerminateJobFlows: Terminates a running job flow and shuts down all instances.
  117. 117. EMR Functions RunJobFlow: Creates a job flow request, starts EC2 instances, and begins processing. DescribeJobFlows: Provides status of your job flow request(s). AddJobFlowSteps: Adds additional steps to an already running job flow. TerminateJobFlows: Terminates a running job flow and shuts down all instances.
  118. 118. Final Thoughts
  119. 119. Ha! Your Hadoop is Shut up! slower than my I’m reducing. Hadoop!
  120. 120. The RDBMS is not dead Has new friends, helpers NoSQL is taking the world by storm No more throwing away perfectly good historical data
  121. 121. Failure is acceptable
  122. 122. Failure is acceptable ❖ Failure is inevitable
  123. 123. Failure is acceptable ❖ Failure is inevitable ❖ Go cheap
  124. 124. Failure is acceptable ❖ Failure is inevitable ❖ Go cheap Go distributed
  125. 125. Use Hadoop!
  126. 126. Hadoop Divide and conquer gigantic data Matthew McCullough Email matthewm@ambientideas.com Twitter @matthewmccull Blog http://ambientideas.com/blog
  127. 127. Credits
  128. 128. http://www.fontspace.com/david-rakowski/tribeca http://www.cern.ch/ http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre- has-serious-cooling-needs/ http://www.greenm3.com/2009/10/googles-secret-to-efficient-data- center-design-ability-to-predict-performance.html http://upload.wikimedia.org/wikipedia/commons/f/fc/ CERN_LHC_Tunnel1.jpg http://www.flickr.com/photos/mandj98/3804322095/ http://www.flickr.com/photos/8583446@N05/3304141843/ http://www.flickr.com/photos/joits/219824254/ http://www.flickr.com/photos/streetfly_jz/2312194534/ http://www.flickr.com/photos/sybrenstuvel/2811467787/ http://www.flickr.com/photos/lacklusters/2080288154/ http://www.flickr.com/photos/sybrenstuvel/2811467787/ http://www.flickr.com/photos/robryb/14826417/sizes/l/ http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/ http://www.flickr.com/photos/robryb/14826486/sizes/l/ All others, iStockPhoto.com
