Large Scale Data Analysis Tools   Brad Anderson   brad@scalingdata.com   @boorad
shameless borrowing: http://codahale.com/codeconf-2011-04-09-metrics-metrics-everywhere.pdf
I crunch data.
data
data --> business value
What the hell is business value?
Business value is anything which makes people more likely to give us money.
shopping cart analysis
mobile device tracking
Business value is anything which saves us money.
smart grid substations
healthcare
We want to generate more business value.
ever-growing sources of big data
web logs
mobile devices
sensors: rfid tags, smart meters, ocean buoys
parsing terabytes of noise to get a megabyte of signal   http://www.kaushik.net/avinash/big-data-imperative-driving-big-action/
How did we get here?
your data doesn’t fit in local memory
your data doesn’t fit on local disk
your data doesn’t fit on one machine
scale up
$
SAN
$$
big db iron
$$$
business value.
scale out
move the data to the processors
[diagram: a single function alongside many data blocks]
[diagram: a single function alongside many data blocks] ship code not data
[diagram: functions distributed among the data blocks] ship code not data
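A minimal sketch of "ship code not data", assuming a toy cluster modelled as a dict of node name to local records. `NODES`, `count_errors`, `run_on_node`, and `ship_code` are hypothetical names for illustration, not part of any framework. Each node runs the function against its own partition, and only the small per-node result travels back.

```python
# Toy model (assumption): each "node" holds its own partition of the data locally.
NODES = {
    "node-a": ["ok", "error", "ok"],
    "node-b": ["error", "error"],
    "node-c": ["ok"],
}

def count_errors(records):
    """The 'code' we ship: it runs next to the data and returns a tiny result."""
    return sum(1 for r in records if r == "error")

def run_on_node(node, func):
    # Stand-in for remote execution; a real system would send func over the network.
    return func(NODES[node])

def ship_code(func):
    # Only the per-node partial results cross the (imaginary) network.
    partials = {node: run_on_node(node, func) for node in NODES}
    return partials, sum(partials.values())

partials, total = ship_code(count_errors)
print(partials, total)
```

The records never leave their node; the driver only ever sees one integer per node.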
add more machines
shit gets interesting
clusters
load balancers
distributed systems   problems   opportunities
configuration management
What systems do I use?
data shape
query patterns
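The speaker notes for this slide list key lookup, table scan, and mostly roll-your-own secondary indices. The sketch below illustrates those three access paths on an in-memory dict standing in for a key-value store. `TinyStore` and its `city` index are hypothetical and do not reflect the API of any real data store.

```python
from collections import defaultdict

class TinyStore:
    """In-memory stand-in for a key-value store with one hand-rolled index."""

    def __init__(self):
        self.rows = {}                    # primary key -> record
        self.by_city = defaultdict(set)   # secondary index we maintain ourselves

    def put(self, key, record):
        old = self.rows.get(key)
        if old is not None:
            # Keep the index consistent when a record is overwritten.
            self.by_city[old["city"]].discard(key)
        self.rows[key] = record
        self.by_city[record["city"]].add(key)

    def get(self, key):
        # Key lookup: the cheap path.
        return self.rows.get(key)

    def scan(self, predicate):
        # Table scan: touches every record.
        return sorted(k for k, r in self.rows.items() if predicate(r))

    def find_by_city(self, city):
        # Secondary-index lookup for an oft-queried field.
        return sorted(self.by_city.get(city, ()))

store = TinyStore()
store.put("u1", {"name": "ann", "city": "atlanta"})
store.put("u2", {"name": "bob", "city": "boston"})
store.put("u3", {"name": "cat", "city": "atlanta"})
store.put("u2", {"name": "bob", "city": "atlanta"})  # bob moves
```

The cost of the index shows up in `put`: every write now has to update two structures, which is the trade-off to weigh against how often the field is queried.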
latency and throughput requirements
cassandra   riak   bigcouch
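The notes describe these stores as dynamo-based clusters and flag data distribution and rebalancing as a core distributed-systems problem. Dynamo-style systems typically place keys with consistent hashing, so the following is a rough sketch of that idea. `HashRing`, the `vnodes` count, and the node names are assumptions for illustration and are not taken from any of the three stores.

```python
import bisect
import hashlib

def _hash(value):
    # Any stable hash works; SHA-256 is used here only for convenience.
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns several points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")  # add more machines
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Adding a node moves only the keys that now belong to the new node; everything else stays where it was, which is what keeps rebalancing affordable.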
batch vs. realtime
Hadoop
hdfs   mapreduce
ecosystem
Cloudera, IBM, Amazon EMR, MapR, Hortonworks, EMC
data ingest   storage   querying / processing   output
[architecture diagram; labels: Raw Data, processes, batch, Hadoop, realtime, Storm, RDBMS, Cache, NoSQL, Apps]
data ingest: scribe
data ingest: chukwa
data ingest: flume
data ingest: homegrown?
storage: hdfs
storage: MapR
storage: hbase
storage: opentsdb
querying / processing: mapreduce
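The notes summarize MapReduce as map over local data, a shuffle and sort in between, then a reduce such as sum or count. Here is a single-process sketch of that flow as a word count. It is plain Python rather than Hadoop code, and `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (key, value) pairs from one input record.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle and sort: group all values by key, ordered by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reduce: collapse the values for one key (here, a sum).
    return key, sum(values)

def run_job(splits):
    # Each split stands in for the block of data local to one node.
    mapped = [pair for split in splits for line in split for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped))

splits = [["big data big value"], ["data data signal"]]
counts = run_job(splits)
print(counts)
```

On a real cluster the map calls run in parallel on the nodes holding each split, and the shuffle moves data across the network; this sketch keeps only the data flow.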
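querying / processing: pig
querying / processing
querying / processing: Example Pig Script
Equivalent MR Java code
querying / processing: hive
querying / processing: Example Hive Query
    FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.gender
    INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
        SELECT pv_users.age, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.age;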
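To make the result of the Hive example concrete, here is a small Python sketch of the same aggregation: distinct user counts grouped by gender and by age. The sample rows and the helper `count_distinct_users` are invented for illustration. Unlike the multi-insert Hive query, which feeds both outputs from one `FROM` clause, this sketch simply loops over the rows twice.

```python
from collections import defaultdict

def count_distinct_users(rows, column):
    # Equivalent of: SELECT column, count(DISTINCT userid) ... GROUP BY column
    seen = defaultdict(set)
    for row in rows:
        seen[row[column]].add(row["userid"])
    return {value: len(users) for value, users in seen.items()}

# Hypothetical page-view rows; user 1 appears twice but is counted once.
pv_users = [
    {"userid": 1, "gender": "f", "age": 30},
    {"userid": 1, "gender": "f", "age": 30},
    {"userid": 2, "gender": "m", "age": 30},
    {"userid": 3, "gender": "f", "age": 41},
]

by_gender = count_distinct_users(pv_users, "gender")
by_age = count_distinct_users(pv_users, "age")
print(by_gender, by_age)
```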
querying / processing: cascading
querying / processing: cascalog
querying / processing: Datameer
querying / processing: MRv2
querying / processing: MRv2 allows MRv1 (of course), Spark, Bulk Synchronous Parallel, Graphs, MPI
querying / processing: machine learning algorithms
querying / processing: mahout
output: flat files
output: rdbms
output: cache
output: hdfs
realtime
Storm
streams: Tuple Tuple Tuple Tuple Tuple Tuple Tuple (unbounded sequence of tuples)
spouts: source of streams
spout examples: read from Kestrel queue; read from Twitter streaming API
bolts: process input streams and produce new streams
bolts: functions, filters, aggregation, joins, talk to databases
topologies: network of spouts and bolts
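The stream, spout, bolt, and topology vocabulary above can be imitated with Python generators. This is only a conceptual sketch and does not use Storm's API; every function name is hypothetical, and the spout is finite so the example terminates, whereas a real stream is unbounded.

```python
from collections import Counter

def sentence_spout():
    # Spout: a source of tuples (finite here purely for demonstration).
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)

def split_bolt(stream):
    # Bolt acting as a function: one input tuple becomes many output tuples.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def filter_bolt(stream, stopwords=frozenset({"of"})):
    # Bolt acting as a filter: drops tuples that do not pass the test.
    for (word,) in stream:
        if word not in stopwords:
            yield (word,)

def count_bolt(stream):
    # Bolt acting as an aggregation: emits a running count per word.
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

def topology():
    # Topology: spouts and bolts wired together into a network.
    return count_bolt(filter_bolt(split_bolt(sentence_spout())))

emitted = list(topology())
print(emitted)
```

Each bolt consumes one stream and produces a new one, so results are emitted as tuples arrive rather than after a batch completes.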
data --> business value
The Unreasonable Effectiveness of Data   http://bit.ly/x407Ln
Start small
But definitely start!
Please start!
Thank you.
My talk on Hadoop, Storm, and other big data tools - DevNexus - 3/21/2012

  • 90 slides - coffee
    Big Data guy - Data Scientist?
    Scaling Data helps our customers tackle this new Big Data space - their whole stack
  • if you write applications that are JVM-based and you’re not using Metrics, you are doing it wrong
    instrument your running production code to get real intelligence on what’s going on AS your running production code creates business value
  • At Scaling Data, people give us money for crunching data.
  • the reason they pay us so much money is that we crunch data that generates business value.
  • I thought this was going to be about big data
  • topline
  • recommendations for other complementary products, driving overall spend higher
    customer classification and scoring - offer good customers deals for repeat business
    transactional retargeting - abandoned shopping carts are mined, and personalized ads are returned to that specific user
  • cell tower data used to track where people go for lunch - identify a new restaurant site
    what roads are used so we can target billboards - demand higher prices
    municipal planning
  • cost cutting
  • pattern recognition in the power signature can point to imminent failure for expensive equipment
  • imagine a diagnosis that was cured with 17 procedures at immense cost
    same diagnosis was cured with 5 procedures elsewhere
    analyzing patient histories across the country / world can get us here
  • because we like more money...
  • We have even more types of data, becoming ever more complex, distributed across multiple existences, and we are left with the task of parsing out terabytes of noise to get to a megabyte of signal.
  • Ever more data to try to find the business value
    Current tools are straining under the load, (banks) my talk last year
    There is significant pain while using these big data tools - Why are they so hot now?
    getting better
  • put it on disk in a database
  • SAN
  • even with the SAN... so you get a bigger machine
  • Oracle loves you for this!
    37signals approach - basecamp 1 server
  • EMC loves you for this!
  • IBM, HP, Sun love (or loved) you for this!
    more processors, more memory, more disk
  • mounting costs are not good for...
  • the new approach, starting about 5 years ago
    NoSQL?
    NewSQL?
  • so you’re sold on ‘scale out’
  • if you want your ops co-workers to be outside of their happy space, this is the ticket
  • lots of commodity hardware boxes ... racks
  • haproxy is a good one
  • things will break - fault tolerance
    distribution of data - rebalancing
    task coordination - leader election / masterless
  • reduce ops headache - Chef, Puppet
  • I still have the pain... I want to go forward with this
  • Cambrian explosion 530 million years ago
    appearance of most major animal phyla
    diversification of organisms as earth warms, forms different climates
  • small records/files
    fixed schema, semi-structured, totally unstructured
    column store, graph store
  • how will you ask for the data?
    key lookup
    table scan otherwise
    secondary indices for oft-queried fields? mostly roll-your-own
  • per-request speed - fast = column db
    amount of requests - availability of reads/writes under load becomes important
  • cassandra - read/write speed impressive
    dynamo-based clusters
    very capable data stores
  • hadoop rules the batch world for massive data sets
  • probably 40-50 satellite projects that are non-core hadoop
  • distributions - should be matched to your use-case
  • data --> business value
  • logging only, from Facebook
    kind of old and busted
    but still on every Facebook server (or was at one time), so battle-tested
  • near-realtime: minutes
    reliability: getting better with recent releases
    mgmt: complicated
    support: apache project
  • a more general data ingest tool, although it started with log files
    near-realtime: seconds
    reliability: best effort, store+retry on failure, and end-to-end mode that uses acks and a write ahead log.
    mgmt: master or masters, then smooth from there
    support: cloudera
  • if you have a realtime component, use Storm
    it’s already distributed, reliable, easily manageable.
  • big files
    recent performance improvements
    ships with hadoop
  • unique for small files
    performance over hdfs
    snapshotting
  • low-latency column store
    fast key-based access
    also have MR to do in batch/background
  • time series schema for hbase
    StumbleUpon
  • a framework for processing in parallel on large clusters
    map - nodes process local data
    reduce - reduces the ‘map output’ in some way (sum, count, etc)
    (shuffle & sort are in between M & R)
  • high-level language built on top of MR
    often favored for data movement, but can be used for querying / processing too
  • high-level language built on top of MR
    striving for SQL-like language
  • high-level language built on top of MR
    multiple MR jobs linked together
    complex query workflows
  • querying DSL written in Clojure
  • Excel-like frontend tool on top of Hadoop
    spreadsheet-like interface targets business users
    joins, data ingest too
  • released with Hadoop 0.23
    split JobTracker into:
     - ResourceManager (RM)
     - ApplicationMaster (AM), which does job scheduling/monitoring
    you can run different applications now (next slide)
  • one of the highest levels of ‘gaining insight’
    Recommendation
    Classification
    Clustering / Segmentation
    Predictive Analytics
    Similarity
  • loose federation of machine learning algorithms that run on hadoop
    Hadoop not best system for some of these, although MRv2 is now here
    some algos are better than others - you have been warned
  • output targets of Hadoop jobs
  • I’m not a hater!
    Great tool for 40 years
  • mongo, redis
  • back into the cluster for use in another MR job
  • sexy, complicated algorithms are very insightful
    BUT more data and a shittier / more basic algorithm wins
    data can overcome “known truths” and organizational inertia
  • for your organization, start small
    don’t bet the farm... maybe 10-15% of your analytics budget
    skunkworks projects, hackers, etc.
  • we need more Big Data people!