Set this “Big Data” technology zoo in order
Pavlo Baron, codecentric AG - pavlo.baron@codecentric.de - @pavlobaron
First, let’s look at speed
realtime = time bound
realtime = realtime AND latency penalty
realtime = realtime AND NOT first hitting (spinning) disk
realtime = realtime AND <= layer 5, ideally direct channel / bus
realtime = realtime AND zero-copy
realtime = realtime AND NOT moving between user and kernel space
realtime = realtime AND NOT parsing or reformatting data
realtime = realtime AND NOT explicit queueing
realtime = realtime AND NOT reestablishing connections
realtime = realtime AND padding and cache-level(s) optimization
realtime = realtime AND circular buffers and (other) non-blocking data structures/algorithms (sketch below)
realtime = realtime AND push instead of pull, especially for outside data
realtime = realtime AND NOT distributed with horizontal dependencies
realtime = realtime AND low-level programming with minimal abstraction
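The circular-buffer slide above is easy to make concrete. Below is a minimal sketch in Java, assuming a single producer and a single consumer: one pre-allocated array with power-of-two capacity, no locks and no allocation on the hot path. The names are illustrative, not taken from any particular library.

    import java.util.concurrent.atomic.AtomicLong;

    // Pre-allocated single-producer/single-consumer ring buffer: non-blocking on both sides.
    final class SpscRingBuffer<T> {
        private final Object[] slots;                       // allocated once, never resized
        private final int mask;                             // capacity must be a power of two
        private final AtomicLong head = new AtomicLong();   // next slot to read (consumer only)
        private final AtomicLong tail = new AtomicLong();   // next slot to write (producer only)

        SpscRingBuffer(int capacityPowerOfTwo) {
            slots = new Object[capacityPowerOfTwo];
            mask = capacityPowerOfTwo - 1;
        }

        // Producer side: returns false instead of blocking when the buffer is full.
        boolean offer(T value) {
            long t = tail.get();
            if (t - head.get() == slots.length) return false;   // full: drop or retry, never block
            slots[(int) (t & mask)] = value;
            tail.lazySet(t + 1);                                 // publish the slot to the consumer
            return true;
        }

        // Consumer side: returns null instead of blocking when the buffer is empty.
        @SuppressWarnings("unchecked")
        T poll() {
            long h = head.get();
            if (h == tail.get()) return null;                    // empty
            T value = (T) slots[(int) (h & mask)];
            slots[(int) (h & mask)] = null;                      // let the reference go
            head.lazySet(h + 1);
            return value;
        }
    }

Refusing to accept an event (offer returning false) instead of blocking is the same trade-off the “fast” slides below make explicit.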
near realtime = NOT realtime
near realtime = near realtime AND message or event oriented
near realtime = near realtime AND window oriented with fixed time/size (sketch below)
near realtime = near realtime AND doesn’t leave (main) memory AND keeps lookups in memory
near realtime = near realtime AND mostly CPU-less I/O
near realtime = near realtime AND no explicit, rich, self-describing model
near realtime = near realtime AND filter upfront
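Here is a minimal sketch, in Java, of the window orientation and upfront filtering just listed: events are filtered before anything else happens, buffered in main memory only, and handed downstream whenever a fixed-size window fills up. The event type, the filter predicate and the downstream consumer are assumptions made for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;
    import java.util.function.Predicate;

    // Fixed-size window over an in-memory event stream, with an upfront filter.
    final class FixedSizeWindow<E> {
        private final int windowSize;
        private final Predicate<E> keep;            // filter upfront: drop irrelevant events early
        private final Consumer<List<E>> onWindow;   // downstream consumer of a full window
        private final List<E> buffer;

        FixedSizeWindow(int windowSize, Predicate<E> keep, Consumer<List<E>> onWindow) {
            this.windowSize = windowSize;
            this.keep = keep;
            this.onWindow = onWindow;
            this.buffer = new ArrayList<>(windowSize);
        }

        // Called for every incoming event; emits a window every windowSize kept events.
        void onEvent(E event) {
            if (!keep.test(event)) return;          // filtered out before it costs anything else
            buffer.add(event);
            if (buffer.size() == windowSize) {
                onWindow.accept(new ArrayList<>(buffer));   // hand over a copy, stay in memory
                buffer.clear();
            }
        }
    }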
fast = NOT near realtime
fast = fast AND NOT query or search instead of hash-/offset-based access
fast = fast AND NOT combine data sources
fast = fast AND NOT synchronize or block
fast = fast AND NOT synchronously prevent disaster through redundancy
fast = fast AND NOT I/O wait, especially (spinning) disk
fast = fast AND NOT 100% exact instead of probabilistic or guessing (sketch below)
fast = fast AND refusing access when necessary to prevent contention or swapping
fast = fast AND generally limiting access frequency
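As a concrete example of “probabilistic instead of 100% exact”, here is a hand-rolled Bloom filter in Java: membership checks that never touch disk or an index, at the price of a tunable false-positive rate. The hash mixing is a deliberately simplistic assumption, not a production-grade choice.

    import java.util.BitSet;

    // Probabilistic set membership: may report false positives, never false negatives.
    final class TinyBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        TinyBloomFilter(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        private int index(Object key, int i) {
            int h = key.hashCode() * 31 + i * 0x9E3779B9;   // cheap per-hash-function mixing
            return Math.abs(h % size);
        }

        void add(Object key) {
            for (int i = 0; i < hashes; i++) bits.set(index(key, i));
        }

        // false means definitely never added; true means probably added.
        boolean mightContain(Object key) {
            for (int i = 0; i < hashes; i++) {
                if (!bits.get(index(key, i))) return false;
            }
            return true;
        }
    }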
batch = slow = NOT fast
oh, and what about “big” in Big Data?
“big” as in “highly frequent”: near realtime < big < batch
“big” as in “whole lotta”: fast < big <= batch
“big” as in “very chaotic”: fast < big <= batch
“big” as in “analytically complex”: fast < big <= batch
“big” as in “widely spread”: fast < big <= batch
So, it looks like it’s almost always between fast and batch, but...
dilemma: the more realtime, the closer to the bare metal and wires
dilemma: the more realtime, the fewer machines and the less distribution
dilemma: the more complex the analytical part, the less realtime
dilemma: the more chaotic the data, the less realtime
dilemma: the bigger the data, the more garbage in it
Now let me shock you
the real dilemma: business wants near realtime, but without penalties or data loss, with endless scalability, zero latency and 100% consistency
WTF???
Data that’s not immediately turned into useful information, and thus value, is only of archaeological, accounting-, compliance- or algorithm-training interest
The true market advantage of the future depends on how close to near realtime you are when gaining useful information out of your live data
HFT people will laugh at it
But Big Data people need to learn from them
Before you can use data, you first need to be able to take data in. So let’s consider an example of how to take data in real fast
data = <some optimized binary that ideally fits into one single MTU of the underlying protocol(s)>
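A minimal Java sketch of such a record: a fixed binary layout written straight into a reused, pre-allocated ByteBuffer sized to a single UDP payload (1472 bytes for a 1500-byte Ethernet MTU). The three fields are an assumption made for illustration.

    import java.nio.ByteBuffer;

    // One event = one compact binary record that fits into a single MTU-sized payload.
    final class EventCodec {
        static final int MAX_PAYLOAD = 1472;   // 1500-byte MTU minus 20 (IP) and 8 (UDP) header bytes

        // The caller allocates one direct buffer up front and reuses it for every event:
        //   ByteBuffer buf = ByteBuffer.allocateDirect(EventCodec.MAX_PAYLOAD);
        static void encode(ByteBuffer buf, long sensorId, long timestampNanos, double value) {
            buf.clear();                                                // reuse, no allocation per event
            buf.putLong(sensorId).putLong(timestampNanos).putDouble(value);
            buf.flip();                                                 // ready to hand to the channel
        }
    }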
application.onAnyChange: sendEvent(data)
balancer.onEvent(data): balanceZeroCopy(data)
asyncListener.onData(data): asyncStore(data), asyncProcess(data)
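Putting those three pseudo-code steps together, here is a minimal intake sketch in Java NIO, assuming events arrive as UDP datagrams: a non-blocking listener reads each datagram into a reused direct buffer and hands the bytes off to asynchronous store/process workers. Port, pool size and the handler bodies are assumptions; copying into a byte array trades a little of the zero-copy ideal for a simple hand-off.

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Non-blocking UDP intake that stores and processes every datagram asynchronously.
    final class AsyncIntake {
        public static void main(String[] args) throws Exception {
            ExecutorService workers = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            DatagramChannel channel = DatagramChannel.open()
                    .bind(new InetSocketAddress(9999));
            channel.configureBlocking(false);
            ByteBuffer buf = ByteBuffer.allocateDirect(1472);     // one MTU-sized payload, reused

            while (true) {                                        // busy-spin: latency over CPU thrift
                buf.clear();
                if (channel.receive(buf) == null) continue;       // nothing arrived yet
                buf.flip();
                byte[] data = new byte[buf.remaining()];          // copy out so the buffer can be reused
                buf.get(data);
                workers.submit(() -> store(data));                // asyncStore(data)
                workers.submit(() -> process(data));              // asyncProcess(data)
            }
        }

        static void store(byte[] data)   { /* append to a journal, forward to a store, ... */ }
        static void process(byte[] data) { /* feed the analysis pipeline */ }
    }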
So, now you can take in a solid amount of data. Let’s look at the processing
Massively parallel computations on incoming data move you closer to near realtime
HPC people will laugh at it
But Big Data people need to learn from them
Go with message/event orientation, VMs with native support for it, or something similar - even on platforms where you probably didn’t think it was possible
processor = <CPU core / GPU core(s) bound, active, algorithmically trained component>
processor.onData(data): result = analyze(data), queueResult(result)
result = <some optimized binary that ideally fits into one single MTU of the underlying protocol(s)>
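A minimal Java sketch of that processor: one worker per core pulls raw events from an input queue, runs the analysis and pushes the compact result onto an outbound queue. Actual pinning to CPU/GPU cores is left to the operating system (taskset, cgroups or similar), and analyze() is a placeholder for the trained algorithm.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Core-bound worker: onData(data) -> result = analyze(data) -> queueResult(result).
    final class Processor implements Runnable {
        private final BlockingQueue<byte[]> input;
        private final BlockingQueue<byte[]> results;

        Processor(BlockingQueue<byte[]> input, BlockingQueue<byte[]> results) {
            this.input = input;
            this.results = results;
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    byte[] data = input.take();          // onData(data)
                    byte[] result = analyze(data);       // result = analyze(data)
                    results.put(result);                 // queueResult(result)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();      // shut down cleanly
            }
        }

        private byte[] analyze(byte[] data) {
            return data;                                 // placeholder for the trained algorithm
        }

        public static void main(String[] args) {
            BlockingQueue<byte[]> in = new ArrayBlockingQueue<>(1 << 16);    // pre-sized, bounded
            BlockingQueue<byte[]> out = new ArrayBlockingQueue<>(1 << 16);
            int cores = Runtime.getRuntime().availableProcessors();
            for (int i = 0; i < cores; i++) new Thread(new Processor(in, out)).start();
        }
    }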
OK, you can process and queue results for whoever listens to them (a semi-time-critical, lower-level queue). Now, how do you store a lot of data like this real fast?
There is no such thing as a high-performance, high-load-capable, high-scale, multi-purpose, rich-model, absolutely reliable and 100% consistent database
database != data store
Classic databases and even NoSQL data stores sometimes tend, for different reasons, to lose their original intention / focus
The NewSQL world aims to solve the scale-up limitations of RDBMSs through distribution while still guaranteeing ACIDish transactions
New ones arrive every day
Let’s look behind the facade
You can be real fast just spilling data block-wise to the disk through DMA, but beware of caches (sketch below)
You’ll be a bit slower with an in-memory, journaling K/V store - but beware of weak storage reliability
You will be slower, but win reliability (and redundancy if you wish) when you go with a column-oriented or K/V, natively distributed and masterless store - as model-agnostic as possible
But you need to be aware that to make such a store real fast, you’ll have to turn a lot of infrastructural knobs before your data even hits the store
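For the “just spill blocks to disk” end of that spectrum, a minimal Java sketch: fixed-size blocks appended through a FileChannel and a reused direct ByteBuffer. Whether the operating system has actually flushed a block is exactly the “beware of caches” caveat above; force() is shown commented out as the explicit, and slow, way around it. File name and block size are assumptions.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Appends records block-wise to a file through a reused direct buffer.
    final class BlockSpiller implements AutoCloseable {
        static final int BLOCK_SIZE = 4096;                        // one block per write

        private final FileChannel channel;
        private final ByteBuffer block = ByteBuffer.allocateDirect(BLOCK_SIZE);

        BlockSpiller(String path) throws IOException {
            channel = FileChannel.open(Paths.get(path),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        // Buffers one record; writes the block out as soon as it is full.
        void append(byte[] record) throws IOException {
            if (block.remaining() < record.length) flush();
            block.put(record);
        }

        void flush() throws IOException {
            block.flip();
            while (block.hasRemaining()) channel.write(block);
            block.clear();
            // channel.force(false);   // only if you must survive a crash right now; it is slow
        }

        @Override
        public void close() throws IOException {
            flush();
            channel.close();
        }
    }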
OK, now it’s in the store, though you probably didn’t need to store it. But what if you run into the (slow) batch? How do you make it faster?
Go with native, machine- and system-close extensions instead of general portability
Keep it all in memory. The memory of a distributed system is also distributed
Splice your pipes or go with almost-zero-infrastructure queues if you mix technologies
Have the data where you process it, don’t move it there first
Do (in-data-store) map/reduce with 100% data locality (sketch below)
Avoid (heavyweight) abstractions
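A minimal sketch of map/reduce with full data locality, in Java, assuming a word-count job over partitions that already live in local memory: each partition is reduced where it sits, and only the small per-partition aggregates get merged at the end.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // In-memory map/reduce: reduce each partition locally, merge only tiny aggregates.
    final class LocalMapReduce {
        static Map<String, Long> wordCount(List<List<String>> localPartitions) {
            Map<String, Long> totals = new ConcurrentHashMap<>();
            localPartitions.parallelStream().forEach(partition -> {
                Map<String, Long> local = new HashMap<>();                        // stays with the data
                for (String word : partition) local.merge(word, 1L, Long::sum);   // map + local reduce
                local.forEach((word, n) -> totals.merge(word, n, Long::sum));     // merge small results
            });
            return totals;
        }
    }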
And what about Big Data appliances?
Appliances can be fast, real fast
But you’re slow if you don’t give them data as it comes
And what about Big Data Clouds or the Cloud in general?
Clouds can be fast, real fast. If you can afford it
And you’re slow if you don’t give them data as it comes
There is no single tool around that will do your Big Data
Everything that makes you faster - from hardware, through kernel tweaks and network optimization, to direct memory access and minimal-abstraction code - is your friend
When you don’t need to retrieve or search, you win
It’s all about speed. Size doesn’t matter a lot
Do we need zoos?