BigData & CDN - OOP2011 (Pavlo Baron)

In this talk I discussed some ideas for distributing Big Data using CDNs (Content Delivery Networks). These ideas cover not only static content, but focus primarily on content pre-computation. I also discussed some basic technical tricks of global content distribution.

  1. Big Data and CDN Pavlo Baron
  2. Pavlo Baron
  3. (image slide)
  4. www.pbit.org [email_address] @pavlobaron
  5. What is Big Data
  6. Big Data describes datasets that grow so large that they become awkward to work with using on-hand database management tools
  7. Huh?
  8. Somewhere a mosquito coughs…
  9. … and somewhere else a data center gets flooded with data
  10. Huh???
  11. More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) get shared each month on Facebook
  12. Twitter users post, in total, an average of 55 million tweets a day, including links etc.
  13. OMG!
  14. But there is much more: cameras, sensors, RFID, logs, geolocation, GIS and so on
  15. kk 
  16. There are several perspectives on Big Data
  17. Data storage and archiving
  18. Data preparation
  19. Live data provisioning
  20. Data analysis / analytics
  21. Real-time event and stream processing
  22. Data visualization
  23. Where does Big Data come from
  24. “Uncontrolled” human activities on the world wide web – or Web 2.0, if you like
  25. Every human leaves a vast number of data marks on the web every day: intentionally, accidentally and unknowingly
  26. Intentionally: we blog, tweet, upload, flattr, link, etc…
  27. And: the web has become an industry of its own. With us in the thick of it
  28. Accidentally: we are humans and we make mistakes…
  29. Unknowingly: we get tricked, misled, controlled, logged etc…
  30. The vast number of data marks we leave on the web every day gets copied, duplicated. Data explodes.
  31. Data flowing on streams at a very high rate from many actors
  32. The amount of data flying over the air has become enormous, and it’s growing unpredictably
  33. It’s no longer only nuclear reactors that have hi-tech sensors and generate tons of data
  34. And our physically huge globe…
  35. … has become a tiny electronic ball. It’s completely wired. Data needs just seconds to “circumnavigate” the world
  36. Laws and regulations force us to store and archive all sorts of data, and there is more and more of it
  37. Human knowledge grows extremely fast. It’s far too gigantic for one single brain
  38. Big Brother – Big Data. We get observed, filmed, recorded, logged, geolocated etc.
  39. Panic!
  40. Don’t panic. Get over it. Brace yourself for the battle.
  41. First of all, some major changes have happened
  42. Instead of huge expensive cabinets…
  43. … we can use lots of cheap commodity hardware
  44. Physics hit the wall…
  45. … and we need to think parallel
  46. Our physically huge globe…
  47. … has become a tiny electronic ball. It’s completely wired
  48. Spontaneous requirements…
  49. … can be covered by the fog (aka cloud)
  50. And what are my weapons
  51. Cut your data in smaller pieces
  52. Make those pieces bite-size (manageable)
  53. Bring the data closer to those who need it
  54. Bring the data closer to where it’s physically accessed
  55. Give up relations – where you don’t need them
  56. Give up actuality – where you don’t need it
  57. Find optimal and effective replication mechanisms
  58. Consider latency an adjustment screw – if you can
  59. Consider availability an adjustment screw – if you can
  60. Be prepared to deal with an unlimited amount of data – depending on the perspective
  61. Know your data
  62. Know your volumes
  63. Know your scenarios
  64. Consider it what it is: a science
  65. Right tool for the job
  66. kk 
  67. And how does this technically work
  68. Live data provisioning
  69. What’s the problem
  70. Your users are widely spread, maybe all over the world
  71. And you own Big Data, which has many facets – geographic, financial etc.
  72. And your classic silo architecture could break under the weight of such data
  73. And why would I need that
  74. You’re just starting and want to become one of those big players. Aha, ok 
  75. Or you simply grew to that level
  76. Now you need to segment your users so you can be faster and more reliable at their locations,…
  77. … to take load off your servers and thus avoid bottlenecks,…
  78. … and to cut your big data into smaller, more manageable chunks
  79. What are my weapons
  80. If your content is static in web terms, you are already well prepared
  81. In many cases, you can make your dynamic data static (pre-compute content)
  82. Huh?
  83. Let’s take a look at an online bookstore
  84. Hey, the online bookstore is completely dynamic – it’s a shop system!
  85. Really?
  86. Static media
  87. Images, videos, audio files etc. – do they really change that often? Or do they change at all?
  88. Book description
  89. Even when you change prices, you can still pre-compute the page ahead of time – you don’t need to compute the content while the page is being accessed
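
A hedged sketch of this pre-computation idea in Python (the template, the directory and the record fields are all made up for illustration):

    # Re-render a product page to static HTML whenever its price changes,
    # instead of computing it on every request. Paths/fields are hypothetical.
    import pathlib

    TEMPLATE = "<html><body><h1>{title}</h1><p>Price: {price} EUR</p></body></html>"
    STATIC_ROOT = pathlib.Path("static/books")  # directory served via the CDN

    def on_price_change(book):
        # called from the shop backend when a price is updated, not per page view
        STATIC_ROOT.mkdir(parents=True, exist_ok=True)
        html = TEMPLATE.format(title=book["title"], price=book["price"])
        (STATIC_ROOT / (book["id"] + ".html")).write_text(html, encoding="utf-8")

    on_price_change({"id": "978-3-16", "title": "Big Data", "price": "39.90"})
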
  90. Browse mode
  91. This is a classic use case for static content pre-computation. There is often simply no need to navigate through dynamically built paths
  92. “Web 2.0” features
  93. And even when you offer “Web 2.0” features such as customer ratings, you can asynchronously recompute (parts of) the pages using the new rating information
  94. Some bookstore content modifications are not very critical. They can be updated with a lag, on a geographical basis
  95. You see: many parts of an online bookstore seem dynamic, but can actually be pre-computed and delivered (lagged) as static content in web terms
  96. It’s all about the frequency of change, the distances, the breadth of distribution and the big data pain
  97. Aha…
  98. And what about purely dynamic features
  99. Book search
  100. Even this ultimately dynamic-sounding feature can be (partially) de-dynamized: consider the full-text index as static content, not necessarily the data itself
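
One way to read this, as a sketch rather than the talk’s concrete implementation (data, file name and helper are invented): build the inverted index offline and ship it as a static file, so searching only touches that file, not the database.

    # Build a full-text inverted index offline and treat the resulting file
    # as static content the CDN distributes. Data and names are made up.
    import collections, json

    books = {"1": "erlang in practice", "2": "the practice of programming"}

    index = collections.defaultdict(list)
    for book_id, text in books.items():
        for word in set(text.split()):
            index[word].append(book_id)

    with open("index.json", "w") as f:   # this file is what the CDN delivers
        json.dump(index, f)

    def search(word):
        # runs close to the user; no round-trip to the origin database
        with open("index.json") as f:
            return sorted(json.load(f).get(word, []))

    print(search("practice"))            # -> ['1', '2']
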
  101. Shopping cart
  102. Sure, you cannot pre-compute the shopping cart. But maybe you also don’t need to synchronize a German customer’s cart around the whole world – you can keep it “local” instead
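
A minimal sketch of that “keep it local” idea (region names and endpoints are hypothetical):

    # Pin a shopping cart to the customer's region instead of replicating it
    # worldwide. Regions and store endpoints are invented for illustration.
    REGIONAL_CART_STORES = {
        "eu": "cart-eu.example.com",
        "us": "cart-us.example.com",
    }

    def cart_store_for(customer):
        # a German customer's cart lives only in the EU store
        return REGIONAL_CART_STORES.get(customer["region"], "cart-us.example.com")

    print(cart_store_for({"id": 42, "region": "eu"}))  # -> cart-eu.example.com
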
  103. Owning big data doesn’t necessarily mean owning 100% dynamic data in web terms
  104. Aha…
  105. And now let’s distribute it with CDN – content delivery network
  106. Huh?
  107. Akamai web traffic dominance
  108. Akamai web traffic monitoring
  109. Akamai EdgePlatform
  110. 73,000 servers, 70 countries, 1,000 networks, 30% of the world’s web traffic (OMG, is the rest Google?)
  111. There are several CDN providers offering such (worldwide) infrastructures
  112. And now let’s get a little “insane” 
  113.–116. (screenshot slides: repeated DNS lookups, each answer returning a different IP address)
  117. Huh???
  118. Yeah, something’s going on behind the scenes 
  119. How does this technically work
  120. CDN is like a deputy. You make a contract, and it takes over parts of your platform
  121. From there, it delivers to your users the content you tell it to deliver, while being much closer to them and much smarter than you about managing the load
  122. A CDN has its own infrastructure, including actual nodes directly at the backbones, offering web caching, server load balancing, request routing and, based upon these techniques, content delivery services
  123. As you have seen earlier: the CDN’s DNS infrastructure returned a different IP address each time, with a TTL of 20 seconds
  124. This is done either through DNS cache “splitting” or dynamically, based on the IP address of the origin name server that made the DNS A query
  125. (diagram slide)
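
You can reproduce that observation yourself; here is a small sketch (www.example.com stands in for any CDN-served hostname, and the stdlib resolver shows addresses but not TTLs – a resolver library would also expose the TTL=20-style values):

    # Resolve a CDN-served hostname repeatedly and watch the answers change.
    import socket, time

    HOST = "www.example.com"  # substitute a hostname actually served by a CDN

    for _ in range(3):
        infos = socket.getaddrinfo(HOST, 80, proto=socket.IPPROTO_TCP)
        print(sorted({info[4][0] for info in infos}))
        time.sleep(21)        # wait out the short TTL; the answer may differ
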
  126. What you can now expect is that the returned IP address leads you to a load balancer – your gate into a whole sub-infrastructure of the CDN, which balances between, for example, web caches
  127. (diagram: the name server answers A queries with cache addresses such as 10.2.3.40 / 50.6.7.80; users reach those caches over the last mile; the caches replicate among each other and refresh from your servers, e.g. 1.2.3.4 / 5.6.7.8)
  128. The CDN uses different algorithms to decide where to route user requests: based upon current load, cost, location etc.
  129. But in the end, your content gets delivered to the user. If it expires, the CDN refreshes it from your servers in the background
  130. According to HTTP/1.1, a web cache is controlled by three mechanisms: freshness, validation and invalidation
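
A hedged sketch of how an origin exposes the first two mechanisms (the handler, body and ETag value are invented; invalidation is triggered by unsafe methods such as PUT/POST/DELETE on the cached URL):

    # An origin supporting freshness (Cache-Control: max-age) and
    # validation (ETag / If-None-Match). All values are examples.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BODY, ETAG = b"<html>catalog page</html>", '"v42"'

    class Origin(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.headers.get("If-None-Match") == ETAG:
                self.send_response(304)          # validation: cached copy still good
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Cache-Control", "max-age=20")  # freshness window
            self.send_header("ETag", ETAG)
            self.end_headers()
            self.wfile.write(BODY)

    # HTTPServer(("", 8080), Origin).serve_forever()  # uncomment to run locally
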
  131. As the very last step, you might have to offer the “last mile” – the very last application access, e.g. the last item view or similar. Here, the user hits your server
  132. kk 
  133. How can I benefit from this having big data
  134. When your big data consists of e.g. images / videos, you can treat this data as static and thus push downloads/uploads to the CDN
  135. So, you segment your users and keep your own servers free of load. What you might lose is consistency between segments
  136.–137. (diagram slides)
  138. If you use the CDN to collect your new data, you might need complex replication mechanisms
  139. Depending on your agreement with the CDN provider, the data volume, frequency etc., you can pick one of the two replication directions: push or pull
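
The two directions in a toy sketch (in-memory dicts stand in for the origin and the CDN edge; this is an illustration, not a provider API):

    # Push vs. pull replication between origin and edge.
    origin, edge = {"p1": "v1"}, {}

    def push(key):
        # push: the origin actively copies changed items out to the edge
        edge[key] = origin[key]

    def pull(key):
        # pull: the edge fetches (and caches) an item on demand from the origin
        if key not in edge:
            edge[key] = origin[key]
        return edge[key]

    push("p1")         # triggered by a write at the origin
    print(pull("p1"))  # triggered by a read at the edge
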
  140. Master-slave: 1 master (M), n slaves (S), many clients (C). Clients read/write at the master and read from the slaves; the master replicates to the slaves
  141. Multimaster: n masters (M), many clients (C). Clients read/write at any master; the masters replicate among each other
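
As a tiny sketch of the difference (the labels follow the two topologies above; the function is made up):

    # Where requests may go in each replication topology.
    def route(op, topology):
        if topology == "master-slave":
            return "master" if op == "write" else "master or any slave"
        return "any master"  # multimaster: reads and writes go anywhere

    print(route("write", "master-slave"))  # -> master
    print(route("read", "multimaster"))    # -> any master
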
  142. You should look for a replication strategy that minimizes complexity – e.g. make one site the master and the other site the slave
  143. Or you pre-compute static content out of your dynamic big data – a sort of snapshot – and distribute it with the CDN
  144. So, you keep your database servers almost free of load. Complexity comes with the snapshot / cache management
  145. Or you can even push some functional parts of your platform to the CDN, such as search, the cart etc.
  146. You win a lot dealing with big data, but you are more dependent on the CDN provider, and your overall architecture is weaker
  147. Or if you really want to experiment, you can even try to push whole executed database query results to the CDN, like you would with memory caches
  148. You can experiment with all sorts of cache policies: write-through, write-back, write-allocate
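
A toy sketch of the first two policies (write-allocate, not shown, would additionally load a missing item into the cache on a write miss; the dicts are stand-ins, not a real cache):

    # Write-through vs. write-back over a toy backing store.
    backing, cache, dirty = {}, {}, set()

    def write_through(key, value):
        cache[key] = value          # cache and backing store updated synchronously
        backing[key] = value

    def write_back(key, value):
        cache[key] = value          # only the cache is updated immediately...
        dirty.add(key)

    def flush():
        for key in dirty:           # ...the backing store catches up later,
            backing[key] = cache[key]  # e.g. on eviction or on a timer
        dirty.clear()

    write_back("query:42", "cached result set")
    flush()
    print(backing["query:42"])
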
  149. That’s really cool, but much more complex and less reliable than a cluster-distributed memory cache 
  150. kk 
  151. With all that in mind: you can have a lot of your big data out there on a CDN
  152. Thank you 
  153. (image slide)
