Harry Potter and Enormous Data (Pavlo Baron)

Slides of the talk I gave at the JBoss One Day Talk 2012. It explains the typical use cases, theoretical aspects, and practical implementations of what is commonly called "big data". The point is to understand how to deal with enormous data amounts.

Harry Potter and Enormous Data (Pavlo Baron) Presentation Transcript

  • 1. Harry Potter and Enormous Data
  • 2. Pavlo Baron http://www.pbit.org [email_address] @pavlobaron
  • 3. Why? When? How? Wherewith?
  • 4. Finance/insurance dilemma: the better your offer, the better you need it to get
  • 5. Trading/brokerage dilemma: the faster/cheaper you are, the faster/cheaper you need to get
  • 6. Web dilemma: the more people visit you, the more people you need to reach
  • 7. Telco/public utility dilemma: the more traffic you have, the better you need it to get
  • 8. Medicine dilemma: the more you check, the more you need to check
  • 9. Law and order dilemma: the more you protect, the more you need to perceive
  • 10. Intelligence dilemma: the more you know, the more you need to know
  • 11. The price of success is data growth. The more success you have/want, the more data you have/need
  • 12. Why? When? How? Wherewith?
  • 13. Your data is (on) the web
  • 14. Your data is for/from mobile devices
  • 15. Your data comes from sensors/RFID
  • 16. Your data comes over GPS
  • 17. Your data is video/audio
  • 18. Your data is (totally) unstructured
  • 19. You get flooded by tera-/petabytes of data
  • 20. You simply get bombed with data
  • 21. Your data flows on streams at a very high rate from different locations
  • 22. You have to read The Matrix
  • 23. You need to distribute your data over the whole world
  • 24. Your existence depends on (the quality of) your data
  • 25. Why? When? How? Wherewith?
  • 26. It’s no longer sufficient to just throw it at your Oracle DB and hope it works
  • 27. You have no chance without science
  • 28. To control enormous data means to learn/know: algorithms/ADTs, computing systems, networking, operating systems, database systems, distributed systems
  • 29. You need to scale very far. And to scale with enormous data means to...
  • 30. Chop in smaller pieces
  • 31. Chop in bite-size, manageable pieces
  • 32. Separate reading from writing
  • 33. Update and mark, don’t delete physically
  • 34. Minimize hard relations
  • 35. Separate archive from accessible data
  • 36. Trash everything that only needs to be analyzed in real time
  • 37. Parallelize and distribute
  • 38. Avoid single bottlenecks
  • 39. Decentralize with “equal” nodes
  • 40. Minimize the distance between the data and its processors
  • 41. Design with Byzantine faults in mind
  • 42. Build upon consensus, agreement, voting, quorum
  • 43. Don’t trust time and timestamps
  • 44. Strive for O(1) data lookups (see the consistent-hashing sketch after this list)
  • 45. Utilize commodity hardware
  • 46. Consider hardware fallibility
  • 47. Keep the startup procedure for new hardware simple
  • 48. Bring data to its users
  • 49. Build upon asynchronous message passing
  • 50. Consider network unreliability
  • 51. Consider asynchronous message passing unreliability
  • 52. Design with eventual actuality/consistency in mind
  • 53. Implement redundancy and replication
  • 54. Consider latency a tuning knob
  • 55. Consider availability a tuning knob
  • 56. Be prepared for disaster
  • 57. Utilize the fog/clouds/CDNs
  • 58. Utilize GPUs
  • 59. Design for theoretically unlimited amount of data
  • 60. Design for frequent structure changes
  • 61. Design for the all-in-one mix
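
Slide 44's O(1)-lookup goal is what consistent hashing delivers in stores such as Riak or Cassandra (both named later in this deck). A minimal sketch in Python; the node names, the number of virtual nodes and the choice of SHA-1 are illustrative assumptions:

    import bisect
    import hashlib

    class HashRing:
        """Minimal consistent-hash ring: maps any key to a node without
        a central directory, so the mapping survives data growth."""

        def __init__(self, nodes, vnodes=64):
            # vnodes: virtual nodes per physical node, smooths distribution
            self._ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes for i in range(vnodes)
            )
            self._hashes = [h for h, _ in self._ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.sha1(key.encode()).hexdigest(), 16)

        def node_for(self, key):
            # first ring position clockwise of the key's hash, wrapping around
            idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
            return self._ring[idx][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("customer:42"))  # deterministic, directory-free choice

The ring lookup is logarithmic in the small, fixed number of virtual nodes, hence effectively constant-time no matter how much data is stored, and adding a node remaps only a small fraction of the keys.
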
  • 62. Why? When? How? Wherewith?
  • 63. There are storage approaches much better suited for enormous data amounts than your MySQL
  • 64. There are preparation/processing mechanisms much better suited for enormous data than your shell scripts
  • 65. There are technologies much better suited for spreading enormous data all over the world than your Apache cluster
  • 66. There are mechanisms much better suited for statistical analysis of enormous data than your DWH
  • 67. There are real-time processing approaches much better suited for enormous data than your SQL queries
  • 68. There are technologies much better suited for the visualization of enormous data than your Excel
  • 69. Just pick the right tools for the job
  • 70. I want to optimize and customize my offers and improve my business forecasts. How can I do that?
  • 71. Log every single user activity and MapReduce the logs. Even in real time
  • 72. You can use a distributed MapReduce framework to analyze your textual/binary log files
  • 73. Hadoop, Disco, Skynet, Mars etc.
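
With Hadoop Streaming, for instance, mapper and reducer can be plain Python scripts that read stdin and write stdout. A minimal sketch counting activities per user; the log format (tab-separated fields, user id first) is an assumption:

    #!/usr/bin/env python3
    # mapper.py -- emit "user<TAB>1" for every log line
    # (assumed line format: "user<TAB>action<TAB>...")
    import sys

    for line in sys.stdin:
        user = line.split("\t", 1)[0].strip()
        if user:
            print(f"{user}\t1")

    # --- reducer.py (a separate script in a real job) ---
    # Hadoop hands the reducer its input sorted by key, so one pass suffices.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        user, n = line.rstrip("\n").split("\t", 1)
        if user != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = user, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

A job would be wired together with something like hadoop jar hadoop-streaming.jar -input logs/ -output counts/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py.
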
  • 74. Or you can log straight into a distributed data store and MapReduce it there
  • 75. Riak, Cassandra, Greenplum etc.
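
Taking the second route with Cassandra, for example, each event is written straight into a distributed table as it happens and aggregated there later. A sketch using the DataStax Python driver; the contact points, keyspace and table layout are assumptions:

    from cassandra.cluster import Cluster

    # Assumed schema:
    #   CREATE TABLE logs.user_activity (user_id text, ts timestamp,
    #       action text, PRIMARY KEY (user_id, ts));
    cluster = Cluster(["10.0.0.1", "10.0.0.2"])
    session = cluster.connect("logs")

    def log_activity(user_id, ts, action):
        # writes spread over the cluster via the partition key (user_id)
        session.execute(
            "INSERT INTO user_activity (user_id, ts, action) VALUES (%s, %s, %s)",
            (user_id, ts, action),
        )
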
  • 76. Use graph DBs to store and track your customers’ relations of any kind and any depth
  • 77. Neo4j, GraphBase, OrientDB etc.
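
In a graph store, "any depth" maps directly onto a variable-length relationship pattern. A sketch in Cypher through Neo4j's official Python driver; the URI, credentials, Customer label and KNOWS relationship type are assumptions:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    def related_customers(customer_id, max_depth=3):
        # [:KNOWS*1..N] follows relations transitively up to N hops;
        # Cypher cannot parameterize path lengths, hence the interpolation
        query = (
            "MATCH (c:Customer {id: $id})-[:KNOWS*1..%d]-(other:Customer) "
            "RETURN DISTINCT other.id AS id" % max_depth
        )
        with driver.session() as session:
            return [record["id"] for record in session.run(query, id=customer_id)]
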
  • 78. I want to monitor my market and competitors’ prices, automatically adjust my offers, and be first to offer and trade. How can I do that?
  • 79. Build upon a CEP/high-speed trading platform to process events, fire alerts and execute business logic as events arrive, and to make predictions
  • 80. Esper, LMAX, StreamBase etc.
  • 81. Keep it all in memory
  • 82. Redis, Memcached etc.
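
Keeping the working set in memory can mean, for instance, one Redis hash per instrument. A sketch with redis-py (3.5+ for the mapping argument); the key layout and field names are assumptions:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def update_quote(symbol, bid, ask):
        # one hash per instrument; reads and writes stay in memory
        r.hset(f"quote:{symbol}", mapping={"bid": bid, "ask": ask})

    def best_ask(symbol):
        value = r.hget(f"quote:{symbol}", "ask")
        return float(value) if value is not None else None

    update_quote("ACME", 99.5, 99.7)
    print(best_ask("ACME"))
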
  • 83. I want to act and react on events and anomalies just in time, perceive and predict risks. How can I do that?
  • 84. Build upon CEP to process events and fire alerts as they arrive, considering their causality
  • 85. Esper, StreamBase etc.
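
Esper expresses such rules as EPL queries over event streams. The underlying idea, a sliding window plus a condition that fires an alert, can be sketched in plain Python; the window size and threshold are made-up illustration values:

    from collections import deque
    import statistics

    class AnomalyDetector:
        """Toy CEP-style rule: alert when a reading deviates strongly
        from the recent sliding-window average."""

        def __init__(self, window=60, threshold=3.0):
            self.window = deque(maxlen=window)  # the last N readings
            self.threshold = threshold          # deviation, in standard deviations

        def on_event(self, value):
            if len(self.window) >= 10:  # wait until the window is meaningful
                mean = statistics.mean(self.window)
                stdev = statistics.pstdev(self.window) or 1e-9
                if abs(value - mean) / stdev > self.threshold:
                    self.alert(value, mean)
            self.window.append(value)

        def alert(self, value, mean):
            print(f"ALERT: reading {value} deviates from recent mean {mean:.2f}")

    detector = AnomalyDetector()
    for reading in [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 95]:
        detector.on_event(reading)  # the final reading triggers the alert
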
  • 86. I want to analyze huge video and audio files and streams real fast. How can I do that?
  • 87. Parallelize calculations on many GPUs
  • 88. OpenCL, CUDA, GPGPU etc.
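
A sketch of the pattern with Numba's CUDA backend (one option among the OpenCL/CUDA/GPGPU family the slide names; it needs an NVIDIA GPU). The kernel merely scales samples, standing in for real audio/video math:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(samples, gain, out):
        # one GPU thread per sample: massively parallel, no Python loop
        i = cuda.grid(1)
        if i < samples.size:
            out[i] = samples[i] * gain

    samples = np.random.rand(10_000_000).astype(np.float32)
    d_samples = cuda.to_device(samples)    # copy once to GPU memory
    d_out = cuda.device_array_like(d_samples)

    threads = 256
    blocks = (samples.size + threads - 1) // threads
    scale[blocks, threads](d_samples, 1.5, d_out)

    result = d_out.copy_to_host()
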
  • 89. I want to automatically optimize my networks and infrastructure according to traffic. How can I do that?
  • 90. You can build upon CEP to adjust your traffic/network ad hoc based on SNMP traps and sensor data
  • 91. Esper, StreamBase etc.
  • 92. You can use a distributed MapReduce framework to analyze your log files
  • 93. Hadoop, Disco, Skynet, Mars etc.
  • 94. I want my users all over the world to access my content and offers as fast as possible. How can I do that?
  • 95. Go with a CDN to bring your content and offers as close to the consumer as possible
  • 96. Akamai, Amazon CloudFront etc.
  • 97. Use cloud-based stores to make your data globally accessible. Split and spatially distribute your data there
  • 98. Amazon S3, Rackspace Cloud Files etc.
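
With Amazon S3, for instance, publishing an object takes a few lines of boto3; the bucket name, key layout and URL shape are assumptions, and global spread would come from cross-region replication or a CDN in front:

    import boto3

    s3 = boto3.client("s3")

    def publish(local_path, key, bucket="my-global-content"):
        # the object becomes reachable worldwide over HTTPS once uploaded
        s3.upload_file(local_path, bucket, key)
        return f"https://{bucket}.s3.amazonaws.com/{key}"

    print(publish("catalog/offers.json", "offers/2012-05.json"))
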
  • 99. I want to easily visualize my huge data amounts in dashboards and reports in almost no time. How can I do that?
  • 100. Go with a general purpose platform for statistical computing and/or data visualization
  • 101. R, Mondrian etc.
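
The deck names R; here is the same quick-dashboard idea sketched in Python with pandas and matplotlib, to stay in one language with the other examples. The input file and its columns are assumptions:

    import pandas as pd
    import matplotlib.pyplot as plt

    # assumed input: a CSV with "day" and "views" columns, pre-aggregated
    # (e.g. by a MapReduce job) so the plot stays small despite huge raw data
    df = pd.read_csv("daily_views.csv", parse_dates=["day"])

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    df.plot(x="day", y="views", ax=ax1, title="Views per day")
    df["views"].plot(kind="hist", bins=30, ax=ax2, title="Distribution")
    fig.tight_layout()
    fig.savefig("dashboard.png")
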
  • 102. Thank you
  • 103. Most images originate from istockphoto.com, except a few taken from Wikipedia and product pages or generated with public online generators