Big Data and CDN Pavlo Baron
Pavlo Baron
 
www.pbit.org [email_address] @pavlobaron
What is Big Data
Big Data describes datasets that grow so large that they become awkward to work with using on-hand database management tools
Huh?
Somewhere a mosquito coughs…
…  and somewhere else a data center gets flooded with data
Huh???
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) get shared each month on Facebook
Twitter users are tweeting an average of 55 million tweets a day in total, many of them containing links etc.
OMG!
But there is much more: cameras, sensors, RFID, logs, geolocation, GIS and so on
kk  
There are several perspectives on Big Data
Data storage and archiving
Data preparation
Live data provisioning
Data analysis / analytics
Real-time event and stream processing
Data visualization
Where does Big Data come from
“Uncontrolled” human activities on the world wide web, or Web 2.0 – if you like
Every human leaves a vast number of data marks on the web every day: intentionally, accidentally and unknowingly
Intentionally: we blog, tweet, upload, flatter, link, etc…
And: the web has become an industry of its own. With us in the thick of it
Accidentally: we are humans and we make mistakes…
Unknowingly: we get tricked, misled, controlled, logged etc…
The vast number of data marks we leave on the web every day gets copied, duplicated. Data explodes.
Data flowing on streams at a very high rate from many actors
The amount of data flying over the air has become enormous, and it’s growing unpredictably
It’s no longer only nuclear reactors that have hi-tech sensors and generate tons of data
And our physically huge globe…
… has become a tiny electronic ball. It’s completely wired. Data needs just seconds to “circumnavigate” the world
Laws and regulations force us to store and archive all sorts of data, and it’s getting more and more
Human knowledge grows extremely fast. It’s far too gigantic for one single brain
Big Brother – Big Data. We get observed, filmed, recorded, logged, geolocated etc.
Panic!
Don’t panic. Get over it. Brace yourself for the battle.
First of all, some major changes have happened
Instead of huge expensive cabinets…
… we can use lots of cheap commodity hardware
Physics hit the wall…
… and we need to think parallel
Our physically huge globe…
… has become a tiny electronic ball. It’s completely wired
Spontaneous requirements…
… can be covered by the fog (aka cloud)
And what are my weapons
Cut your data in smaller pieces
Make those pieces bite-size (manageable)
Bring the data closer to those who need it
Bring the data closer to where it’s physically accessed
Give up relations – where you don’t need them
Give up actuality – where you don’t need it
Find optimal and effective replication mechanisms
Consider latency a tuning knob – if you can
Consider availability a tuning knob – if you can
Be prepared to deal with an unlimited amount of data – depending on the perspective
Know your data
Know your volumes
Know your scenarios
Consider it for what it is: a science
Right tool for the job
kk  
And how does this technically work
Live data provisioning
What’s the problem
Your users are widely spread, maybe all over the world
And you own Big Data, which has many facets – geographic, financial etc.
And your classic silo architecture could break under the weight of such data
And why would I need that
You’re just starting out and want to be one of those. Aha, ok
Or you’ve simply grown to a certain level
Now you need to segment your users and thus be faster and more reliable at their locations,…
…  to keep your servers free of load and thus avoid bottlenecks,…
… to cut your big data into smaller, more manageable chunks
What are my weapons
If your content is static in web terms, you are already well prepared
In many cases, you can make your dynamic data static (pre-compute content)
Huh?
Let’s take a look at an online bookstore
Hey, the online bookstore is completely dynamic – it’s a shop system!
Really?
Static media
Images, videos, audio files etc. – do they really change that often? Or do they change at all?
Book description
Even when you change the prices, you can still pre-compute the page ahead of time – you don’t need to compute the content while the page is being accessed
Browse mode
This is a classic use case for static content pre-computation. There is often simply no need to navigate through dynamically built paths
“Web 2.0” features
And even when you offer “Web 2.0” features such as rating by customers, you can asynchronously recompute (parts of) the pages using the new rating information
Some book store content modifications are not very critical. They can be updated with a lag, on a geographical basis
You see: many parts of an online bookstore seem dynamic, but can actually be pre-computed and delivered (lagged) as static content in web terms
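As a hedged sketch of what such pre-computation could look like (the function names and data layout are made up for illustration, not taken from the talk):

```python
# Minimal sketch of content pre-computation: whenever book data changes, render the
# page once and write it out as a static file that a web server or CDN can deliver
# without touching the shop backend on every request. All names are illustrative.
import pathlib

def render_book_page(book: dict) -> str:
    # A real shop would use a full template engine; this is a trivial stand-in.
    return (f"<html><body><h1>{book['title']}</h1>"
            f"<p>{book['price']} EUR</p></body></html>")

def precompute_book_page(book: dict, out_dir: str = "static_pages") -> pathlib.Path:
    path = pathlib.Path(out_dir) / f"book_{book['id']}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_book_page(book), encoding="utf-8")
    return path

# Called from the catalogue/pricing job, not from the request path:
precompute_book_page({"id": 42, "title": "Some Book", "price": "29.90"})
```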
It’s all about the frequency of change, distances, wideness of distribution and the big data pain
Aha…
And what about pure dynamic features
Book search
Even this ultimately dynamic-sounding feature can be (partially) de-dynamized. Consider the full-text index as static content, not necessarily the data itself
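A hedged sketch of that idea – building the index offline and shipping it as a static file; the book data and file name are invented for illustration:

```python
# Build an inverted index offline and publish it as a plain file that can be cached
# and delivered like any other static asset.
import json
import re
from collections import defaultdict

books = {1: "Big Data and CDN", 2: "Content delivery networks in practice"}

index = defaultdict(list)
for book_id, title in books.items():
    for word in re.findall(r"\w+", title.lower()):
        index[word].append(book_id)

with open("search_index.json", "w", encoding="utf-8") as f:
    json.dump(index, f)

# The search page can then fetch search_index.json (served statically, e.g. via the
# CDN) and resolve a query like "cdn" to [1] without ever hitting the database.
```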
Shopping cart
Sure, you cannot pre-compute the shopping cart. But maybe you also don’t need to synchronize a German customer’s cart to the whole world, and can keep it “local” instead
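A small illustrative sketch of such a “local” cart, with made-up regions and in-memory dicts standing in for real per-region databases or caches:

```python
# Route cart reads and writes to a per-region store instead of synchronizing
# every cart world-wide. Regions and stores are stand-ins.
REGIONAL_CART_STORES = {"eu": {}, "us": {}}

def cart_store_for(user: dict) -> dict:
    return REGIONAL_CART_STORES[user["region"]]

def add_to_cart(user: dict, book_id: int, qty: int = 1) -> None:
    cart = cart_store_for(user).setdefault(user["id"], {})
    cart[book_id] = cart.get(book_id, 0) + qty

add_to_cart({"id": "u1", "region": "eu"}, book_id=42)
# The German customer's cart now lives only in the "eu" store; nothing is replicated.
```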
Owning big data doesn’t necessarily mean owning 100% dynamic data in terms of web
Aha…
And now let’s distribute it with CDN – content delivery network
Huh?
Akamai web traffic dominance
Akamai web traffic monitoring
Akamai EdgePlatform
73,000 servers, 70 countries, 1,000 networks, 30% of the world’s web traffic (OMG, is the rest Google?)
There are several CDN providers offering such infrastructures world-wide
And now let’s get a little “insane”
(Slides: repeated DNS lookups of a CDN-hosted name, each returning a different IP address with a short TTL)
Huh???
Yeah, something’s going on behind the scenes  
How does this technically work
CDN is like a deputy. You make a contract, and it takes over parts of your platform
From here, it delivers to your users the content you tell it to deliver, but it is much closer to them and much more intelligent than you when it comes to managing the load
The CDN has its own infrastructure with nodes directly at the backbones, offering web caching, server load balancing, request routing and, based upon these techniques, content delivery services
What you have seen earlier: the CDN’s DNS infrastructure returned a different IP address each time, with TTL = 20
This is done either through DNS cache “splitting” or dynamically, based on the IP address of the origin name server which made the DNS A query
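One way to observe this yourself is to repeat the A-record lookup and watch the answers and TTLs change; a sketch using the third-party dnspython package (the hostname is only a placeholder):

```python
# Repeated A-record lookups of a CDN-hosted name typically return different
# addresses with a short TTL.
import time
import dns.resolver

for _ in range(3):
    answer = dns.resolver.resolve("www.example-cdn-customer.com", "A")
    print("TTL:", answer.rrset.ttl, "addresses:", [r.address for r in answer])
    time.sleep(answer.rrset.ttl + 1)   # let the record expire before asking again
```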
 
What you can now expect is that the returned IP address leads you to a load balancer – your gate to a whole sub-infrastructure of the CDN which balances between, for example, web caches
(Diagram: the name server answers A queries with different cache IP addresses, e.g. 10.2.3.40 and 50.6.7.80; users reach the nearest caches over the last mile, the caches replicate among each other and refresh expired content from your servers)
The CDN uses different algorithms to decide where to route user requests: based upon current load, cost, location etc.
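A toy version of such a routing decision, purely illustrative – a real CDN weighs many more inputs (cost, health, network distance, …):

```python
# Pick the cache node for a user based on region and current load.
caches = [
    {"name": "fra-1", "region": "eu", "load": 0.7},
    {"name": "nyc-1", "region": "us", "load": 0.2},
]

def route(user_region: str) -> dict:
    # Prefer a node in the user's region, break ties by the lowest load.
    return min(caches, key=lambda c: (c["region"] != user_region, c["load"]))

print(route("eu")["name"])   # -> fra-1
```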
But in the end, your content gets delivered to the user. If it expires, CDN refreshes it from your servers in the background
According to HTTP/1.1, a web cache has to be controlled by: freshness, validation, invalidation
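A sketch of those three mechanisms from the client/cache side, using the third-party requests package; the URL and header values are illustrative:

```python
import requests

url = "https://www.example.com/book_42.html"

# Freshness: the origin states how long a stored copy may be reused without asking back.
resp = requests.get(url)
print(resp.headers.get("Cache-Control"))       # e.g. "max-age=300"

# Validation: once the copy is stale, the cache revalidates it with a conditional request.
etag = resp.headers.get("ETag")
if etag:
    revalidated = requests.get(url, headers={"If-None-Match": etag})
    print(revalidated.status_code)             # 304 means "unchanged, keep your copy"

# Invalidation: an unsafe request (POST/PUT/DELETE) to the same URL signals caches
# on the path that their stored copy must be invalidated.
```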
As the very last step, you might have to offer the “last mile” – the very last application access, e.g. the last item view or similar. Here, the user hits your server
kk  
How can I benefit from this having big data
When you have e.g. images / videos as your big data, you can consider this data static and thus push downloads and uploads to the CDN
So, you segment your users and keep your own servers free of load. What you might lose is consistency between segments
If you use CDN to collect your new data, you might need some complex replication mechanisms
Depending on your agreement with the CDN provider, the data amount, frequency etc., you can pick one of the replication directions: push or pull
Master-slave: 1 master (M), n slaves (S), many clients (C). (Diagram: clients read and write via the master and read from the slaves; the master replicates to the slaves)
Multimaster: n masters (M), many clients (C). (Diagram: clients read and write via any master; the masters replicate with each other)
You should look for a replication strategy which minimizes complexity, and thus make one site the master and the other site the slave
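A minimal in-memory sketch of that master-slave setup (illustrative only – a real system would replicate asynchronously over the network):

```python
# All writes go to one master, which pushes every change to its slaves;
# clients read from the replica closest to them.
class Node:
    def __init__(self):
        self.data = {}

class Master(Node):
    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value
        for slave in self.slaves:          # push replication to every slave
            slave.data[key] = value

eu_cache, us_cache = Node(), Node()
master = Master([eu_cache, us_cache])
master.write("book:42:rating", 4.7)
print(us_cache.data["book:42:rating"])     # a US client reads its local replica
```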
Or you pre-compute static content out of your dynamic big data – a sort of snapshot – and distribute it with the CDN
So, you keep your database servers almost free of load. Complexity comes with the snapshot / cache management
Or you can even push some functional parts of your platform to CDN such as searches, cart etc.
You win a lot dealing with big data, but you are more dependent on the CDN provider, and your overall architecture is weaker
Or, if you really want to experiment, you can even try to push whole executed database queries to the CDN, like you would do with memory caches
You can experiment with all sorts of caches: write-through, write-back, write-allocate
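For example, a write-through cache keeps the backing store and the cache in sync on every write; a hedged sketch with made-up names:

```python
# Every write updates the cache and the backing store synchronously, so cached
# reads are never stale. A write-back cache would only mark entries dirty and
# flush them to the store later.
class WriteThroughCache:
    def __init__(self, store: dict):
        self.store = store                      # stand-in for the database / origin
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value                 # write-allocate: keep the entry cached
        self.store[key] = value                 # ...and write straight through

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.store[key]   # fill the cache on a miss
        return self.cache[key]

db = {}
cache = WriteThroughCache(db)
cache.write("query:top10-books", ["Big Data and CDN"])
print(cache.read("query:top10-books"), db["query:top10-books"])
```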
That’s really cool, but even more complex and unreliable than a cluster-distributed memory cache
kk  
With all that in mind: you can have a lot of your big data out there with CDN
Thank you  
 
BigData & CDN - OOP2011 (Pavlo Baron)

In this talk I have discussed some ideas of BigData distribution using CDNs (Content Delivery Networks). These ideas included not only the static content, but had primarily content pre-computation in focus. I have also discussed some basic technical tricks of global content distribution
