Harry Potter and Enormous Data
Pavlo Baron http://www.pbit.org [email_address] @pavlobaron
Why? When? How? Wherewith?
Finance/insurance dilemma the better your offer, the better you need it to get
Trading/brokerage dilemma the faster/cheaper you are, the faster/cheaper you need to get
Web dilemma the more people visit you, the more people you need to reach
Telco/public utility dilemma the more traffic you have, the better you need it to get
Medicine dilemma the more you check, the more you need to check
Law and order dilemma the more you protect, the more you need to perceive
Intelligence dilemma the more you know, the more you need to know
The price of success is data growth. The more success you have/want, the more data you have/need
Why? When? How? Wherewith?
Your data is (on) the web
Your data is for/from mobile devices
Your data comes from sensors/RFID
Your data comes over GPS
Your data is video/audio
Your data is (totally) unstructured
You get flooded by tera-/petabytes of data
You simply get bombed with data
Your data flows on streams at a very high rate from different locations
You have to read The Matrix
You need to distribute your data over the whole world
Your existence depends on (the quality of) your data
Why? When? How? Wherewith?
It’s not sufficient anymore just to throw it at your Oracle DB and to hope it works
You have no chance without science
To control enormous data means to learn/know: algorithms/ADTs, computing systems, networking, operating systems, database systems, distributed systems
You need to scale very far. And to scale with enormous data means to...
Chop into smaller pieces
Chop into bite-size, manageable pieces
Separate reading from writing
Update and mark, don’t delete physically
Minimize hard relations
Separate archive from accessible data
Trash everything that only needs to be analyzed in real-time
Parallelize and distribute
Avoid single bottlenecks
Decentralize with “equal” nodes
Minimize the distance between the data and its processors
Design with Byzantine faults in mind
Build upon consensus, agreement, voting, quorum
Don’t trust time and timestamps
Strive for O(1) for data lookups (see the consistent-hashing sketch after this list)
Utilize commodity hardware
Consider hardware fallibility
Relax new hardware startup procedure
Bring data to its users
Build upon asynchronous message passing
Consider network unreliability
Consider asynchronous message passing unreliability
Design with eventual actuality/consistency in mind
Implement redundancy and replication
Consider latency a tuning knob
Consider availability a tuning knob
Be prepared for disaster
Utilize the fog/clouds/CDNs
Utilize GPUs
Design for theoretically unlimited amount of data
Design for frequent structure changes
Design for the all-in-one mix
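Several of the points above — hash-based, near-O(1) data lookups, decentralization over “equal” nodes, and moving as little data as possible when hardware comes and goes — come together in consistent hashing, the technique behind distributed stores such as Riak and Cassandra mentioned later. A minimal, self-contained Python sketch; node names and the virtual-node count are illustrative assumptions, not part of the slides:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to storage nodes; adding or removing a node only
    moves a small fraction of the keys to a different owner."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise on the ring to the first virtual node >= hash(key)."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("customer:42"))          # deterministic owner node for this key
```

Looking up the owner of a key is a single hash plus a binary search over the ring, independent of how much data is stored — which is what the O(1)-lookup slide is after.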
Why? When? How? Wherewith?
There are storage approaches much better suited for enormous data amounts than your MySQL
There are preparation/processing mechanisms much better suited for enormous data than your shell scripts
There are technologies much better suited for spreading enormous data all over the world than your Apache cluster
There are mechanisms much better suited for statistical analysis of enormous data than your DWH
There are real-time processing approaches much better suited for enormous data than your SQL queries
There are technologies much better suited for the visualization of enormous data than your Excel
Just pick the right tools for the job
I want to optimize and customize my offers, improve business forecasts. How can I do that?
Log every single user activity and MapReduce the logs. Even in real-time
You can use a distributed MapReduce framework to analyze your textual/binary log files
Hadoop, Disco, Skynet, Mars etc.
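The slides only name the frameworks; as an illustration of the MapReduce model itself, here is a minimal Hadoop-Streaming-style mapper and reducer in Python that counts hits per URL in access logs. The log format (URL as the seventh whitespace-separated field, as in common Apache logs) is an assumption.

```python
#!/usr/bin/env python
# mapper.py - emits "url\t1" per access-log line (Hadoop Streaming feeds stdin)
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:                 # assumes Apache common log format: URL is field 7
        print(f"{fields[6]}\t1")
```

```python
#!/usr/bin/env python
# reducer.py - Hadoop Streaming delivers mapper output sorted by key
import sys

current_url, count = None, 0
for line in sys.stdin:
    url, value = line.rstrip("\n").split("\t")
    if url != current_url:
        if current_url is not None:
            print(f"{current_url}\t{count}")
        current_url, count = url, 0
    count += int(value)
if current_url is not None:
    print(f"{current_url}\t{count}")
```

With Hadoop Streaming the two scripts are wired together via the hadoop-streaming jar (`-mapper mapper.py -reducer reducer.py -input ... -output ...`); the exact jar location depends on the installation.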
Or you can log straight into a distributed data store and MapReduce it there
Riak, Cassandra, Greenplum etc.
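If the events go straight into a distributed store, the analysis can run where the data lives. A small sketch using the DataStax Python driver for Cassandra; the keyspace, table and columns are made up for illustration, since the slide only lists candidate stores:

```python
from datetime import datetime
from uuid import uuid4

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])             # contact points of the Cassandra ring
session = cluster.connect("analytics")       # assumed keyspace

# assumed table: user_events(user_id text, event_time timestamp, event_id uuid,
#                            action text, PRIMARY KEY (user_id, event_time, event_id))
insert = session.prepare(
    "INSERT INTO user_events (user_id, event_time, event_id, action) VALUES (?, ?, ?, ?)"
)
session.execute(insert, ("user-17", datetime.utcnow(), uuid4(), "add_to_cart"))
```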
Use graph DBs to store and track your customers’ relations of any kind and any depth
Neo4j, GraphBase, OrientDB etc.
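For relationship queries of arbitrary depth, a graph query is more natural than SQL joins. A hedged sketch with the official Neo4j Python driver and Cypher; the connection details and the FOLLOWS relationship model are assumptions:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# customers a given customer can reach through 1..3 FOLLOWS hops (assumed model)
query = """
MATCH (c:Customer {id: $cid})-[:FOLLOWS*1..3]->(other:Customer)
RETURN DISTINCT other.id AS reachable
"""

with driver.session() as session:
    for record in session.run(query, cid="customer-42"):
        print(record["reachable"])

driver.close()
```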
I want to monitor my market and my competitors’ prices, automatically adjust my offers, and be first to offer and trade. How can I do that?
Build upon a CEP/high-speed trading platform to process events, fire alerts and execute business logic as events arrive, and to make predictions
Esper, LMAX, StreamBase etc.
Keep it all in memory
Redis, Memcached etc.
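Keeping the hot data set in memory makes the compare-and-adjust loop cheap. A small sketch with redis-py; the key names and the pricing rule are purely illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# latest competitor quotes, kept entirely in memory as a hash
r.hset("prices:sku-1001", "competitor_a", 19.90)
r.hset("prices:sku-1001", "competitor_b", 18.50)

quotes = {k.decode(): float(v) for k, v in r.hgetall("prices:sku-1001").items()}
our_price = min(quotes.values()) - 0.01       # toy rule: undercut the cheapest by one cent
r.set("prices:sku-1001:own", our_price)
print(f"adjusted own price to {our_price:.2f}")
```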
I want to act on and react to events and anomalies just in time, and to perceive and predict risks. How can I do that?
Build upon CEP to process events and fire alerts as they come, consider their causality
Esper, StreamBase etc.
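Esper and StreamBase are CEP engines in their own right; as a language-neutral illustration of the idea — correlate events over a sliding window and fire an alert when a pattern shows up — here is a tiny pure-Python stand-in, not an Esper API:

```python
from collections import deque
from time import time

class SlidingWindowAlert:
    """Fires when more than `threshold` failed logins per user occur within `window` seconds."""

    def __init__(self, window=60.0, threshold=5):
        self.window, self.threshold = window, threshold
        self.events = {}                             # user -> deque of timestamps

    def on_event(self, user, timestamp=None):
        now = timestamp if timestamp is not None else time()
        q = self.events.setdefault(user, deque())
        q.append(now)
        while q and now - q[0] > self.window:        # evict events outside the window
            q.popleft()
        if len(q) > self.threshold:
            print(f"ALERT: {len(q)} failed logins for {user} within {self.window:.0f}s")

detector = SlidingWindowAlert()
for i in range(7):
    detector.on_event("bob", timestamp=100.0 + i)    # seven failures in seven seconds -> alert
```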
I want to analyze huge video and audio files and streams real fast. How can I do that?
Parallelize calculations on many GPUs
OpenCL, CUDA, GPGPU etc.
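The slide names the GPU programming models; as one concrete, hedged way to get there from Python, here is a trivial data-parallel kernel using Numba's CUDA backend. It assumes an NVIDIA GPU and the numba package, and the scaling operation is just a stand-in for real audio/video signal processing:

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(samples, factor, out):
    i = cuda.grid(1)                    # global thread index, one thread per sample
    if i < samples.shape[0]:
        out[i] = samples[i] * factor

# stand-in for decoded audio/video data
samples = np.random.rand(4_000_000).astype(np.float32)
result = np.empty_like(samples)

threads_per_block = 256
blocks = (samples.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](samples, np.float32(1.2), result)
```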
I want to automatically optimize my nets and infrastructure according to traffic. How can I do that?
You can build upon CEP to adjust traffic/network ad hoc, driven by SNMP traps and sensors
Esper, StreamBase etc.
You can use a distributed MapReduce framework to analyze your log files
Hadoop, Disco, Skynet, Mars etc.
I want my users all over the world to access my content and offers as fast as possible. How can I do that?
Go with a CDN to bring your content and offers as close to the consumer as possible
Akamai, Amazon CloudFront etc.
Use cloud-based stores to make your data globally accessible. Split and geographically distribute your data there
Amazon S3, Rackspace Cloud Files etc.
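A minimal boto3 sketch for pushing a data chunk into S3; the bucket name and the region-prefixed key layout are assumptions:

```python
import boto3

s3 = boto3.client("s3")

# shard objects by a region prefix so consumers read from "their" part of the keyspace
bucket = "my-global-content"                     # assumed bucket name
key = "eu-central/catalog/offers.json"

s3.put_object(Bucket=bucket, Key=key, Body=b'{"offers": []}')

# short-lived download link, e.g. to hand out behind a CDN
url = s3.generate_presigned_url("get_object",
                                Params={"Bucket": bucket, "Key": key},
                                ExpiresIn=3600)
print(url)
```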
I want to easily visualize my huge amounts of data in dashboards and reports, in almost no time. How can I do that?
Go with a general purpose platform for statistical computing and/or data visualization
R, Mondrian etc.
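The slide points at R and Mondrian; since the other sketches here use Python, here is an equivalent quick-look plot with pandas and matplotlib as a plainly labelled stand-in. The CSV file and column names are made up:

```python
import pandas as pd
import matplotlib.pyplot as plt

# assumed CSV with one row per request: timestamp, latency_ms, region
df = pd.read_csv("requests.csv", parse_dates=["timestamp"])

# 99th-percentile latency per hour, rendered as a quick dashboard image
hourly = df.set_index("timestamp").resample("1H")["latency_ms"].quantile(0.99)
hourly.plot(title="99th percentile latency per hour")
plt.ylabel("latency (ms)")
plt.tight_layout()
plt.savefig("latency_dashboard.png")
```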
Thank you
Most images originate from istockphoto.com, except for a few taken from Wikipedia and product pages or generated through public online generators
About these slides: the talk was given at the JBoss One Day Talk 2012. It explains the typical use cases, theoretical aspects and practical implementations of what is commonly called "big data"; the point is to understand how to deal with enormous amounts of data.