Big Data comes from a variety of sources: human activity online generates vast amounts of data every day, whether produced intentionally, accidentally, or without the user's knowledge – social media activity, sensor readings, logs, and more. Content delivery networks (CDNs) can help distribute big data by caching content on servers located closer to users. Pushing content to a CDN offloads work from origin servers and improves performance, but it also segments users and requires replication strategies to maintain consistency. Useful techniques include pre-computing static content from dynamic sources, pushing searches and other functions to the CDN, and experimenting with different cache models. Overall, a CDN can be an effective way to distribute big data, but it also introduces more complexity and a dependence on the CDN provider.
89. Even when prices change, you can still pre-compute the page ahead of time – you don’t need to compute the content while the page is being accessed
93. And even when you offer “Web 2.0” features such as customer ratings, you can asynchronously recompute (parts of) the pages using the new rating information
94. Some bookstore content modifications are not very critical. They can be updated with a lag, on a geographical basis
95. You see: many parts of an online bookstore seem dynamic, but can actually be pre-computed and delivered (lagged) as static content in web terms
96. It’s all about the frequency of change, the distances, the breadth of distribution, and the big data pain
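The pre-computation idea above can be sketched in a few lines of Python. Everything here is illustrative – the product data, the `render_page`/`precompute` helpers, and the toy rating aggregation are assumptions for the sketch, not part of any real bookstore:

```python
import time

# Illustrative product data and a static-content store a CDN could
# pull from (all names here are assumptions for the sketch).
products = {
    "book-42": {"title": "Big Data Basics", "price": 29.90, "rating": 4.2},
}
static_store = {}  # URL path -> (html, generated_at)

def render_page(product_id):
    """Render a product page to plain HTML ahead of access time."""
    p = products[product_id]
    return (f"<h1>{p['title']}</h1>"
            f"<p>Price: {p['price']:.2f} EUR</p>"
            f"<p>Rating: {p['rating']:.1f}/5</p>")

def precompute(product_id):
    """Re-render the page after a change - not while it is accessed."""
    static_store[f"/products/{product_id}.html"] = (
        render_page(product_id), time.time())

def on_rating_submitted(product_id, stars):
    """Ratings update the data now, but the page is recomputed
    asynchronously (here: directly; in reality: a background job)."""
    p = products[product_id]
    p["rating"] = (p["rating"] + stars) / 2  # toy aggregation
    precompute(product_id)

products["book-42"]["price"] = 24.90   # price change ...
precompute("book-42")                  # ... still pre-computable
on_rating_submitted("book-42", 5)      # rating change, async recompute
html, _ = static_store["/products/book-42.html"]
```

The pages end up as plain static files that a CDN can cache and serve; only the (lagged) recompute ever touches the origin.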
100. Even this ultimately dynamic-sounding feature can be (partially) de-dynamized. Consider treating the full-text index as static content – the index, not necessarily the data itself
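As a sketch of that idea: pre-compute an inverted index from the catalogue and serialize it as a static JSON file that a CDN can cache. The documents and the `build_index`/`search` helpers are illustrative assumptions:

```python
import json

# Toy catalogue; in a real shop this would come from the database.
docs = {
    "book-1": "distributed systems and big data",
    "book-2": "cooking with big flavours",
}

def build_index(documents):
    """Pre-compute an inverted index (word -> sorted doc ids).
    The result is plain static content that a CDN can cache."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.split()):
            index.setdefault(word, []).append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Serialize the index as a static file: searches can then run at the
# edge (or even in the browser) without hitting the origin database.
static_index_json = json.dumps(build_index(docs), sort_keys=True)

def search(index_json, word):
    """A lookup against the shipped static index."""
    return json.loads(index_json).get(word, [])
```

The index file only needs to be regenerated (lagged) when the catalogue changes; every search in between is a static lookup.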
102. Sure, you cannot pre-compute the shopping cart. But maybe you also don’t need to synchronize a German customer’s cart to the whole world – you can keep it “local” instead
103. Owning big data doesn’t necessarily mean owning 100% dynamic data in web terms
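Keeping a cart “local” might look like this minimal sketch: each region gets its own cart store, and a cart is only ever read or written in its customer’s home region. The region names and the routing rule are made up for illustration:

```python
# Per-region cart stores: a cart lives only in its home region, so it
# never has to be replicated world-wide. Regions and the routing rule
# are illustrative assumptions.
cart_stores = {"eu": {}, "us": {}, "asia": {}}

def home_region(customer):
    # In practice this would come from GeoIP or the customer profile.
    return customer["region"]

def add_to_cart(customer, item):
    store = cart_stores[home_region(customer)]
    store.setdefault(customer["id"], []).append(item)

def get_cart(customer):
    return cart_stores[home_region(customer)].get(customer["id"], [])

hans = {"id": "c-1", "region": "eu"}   # a German customer
add_to_cart(hans, "book-42")
# The cart exists only in the "eu" store; "us" and "asia" never see it.
```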
120. A CDN is like a deputy: you sign a contract, and it takes over parts of your platform
121. From then on, it delivers to your users the content you tell it to deliver – but from much closer to them, and with much more intelligence about managing the load
122. A CDN has its own infrastructure, including nodes placed directly at the backbones, offering web caching, server load balancing, request routing and, built on these techniques, content delivery services
123. As you have seen earlier, the CDN’s DNS infrastructure returned a different IP address each time, with TTL = 20
124. This is done either through DNS cache “splitting” or dynamically, based on the IP address of the name server that made the DNS A query
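The behaviour described above – a different A record per query, with a short TTL – can be simulated with a toy name server. The class, IPs, and hostname are illustrative assumptions, not a real DNS implementation:

```python
import itertools

class CdnNameServer:
    """Toy model of a CDN name server that rotates through cache IPs
    and hands out a short TTL, so clients re-resolve frequently and
    can be re-routed quickly."""

    def __init__(self, cache_ips, ttl=20):
        self._rotation = itertools.cycle(cache_ips)
        self.ttl = ttl

    def resolve_a(self, hostname):
        """Answer an A query with the next cache IP and the TTL."""
        return next(self._rotation), self.ttl

ns = CdnNameServer(["10.2.3.40", "50.6.7.80"])
answers = [ns.resolve_a("shop.example.com") for _ in range(4)]
```

The short TTL is the key: it forces clients back to the name server every 20 seconds, which is the CDN’s chance to re-route them.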
126. You can now expect the returned IP address to lead you to a load balancer – your gate into a whole sub-infrastructure of the CDN, which balances between, for example, web caches
127. [Diagram: a client’s A query to the CDN name server returns cache IPs (e.g. 10.2.3.40, 50.6.7.80); users access the caches (“cache access”), the caches replicate among each other (“inter-cache replication”) and refresh content from your servers (“cache refresh”); only the “last mile” requests reach your servers]
128. The CDN uses different algorithms to decide where to route user requests: based upon current load, cost, location, etc.
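Such a routing decision could be sketched as a simple weighted cost function. The node data, weights, and scoring formula here are invented for illustration:

```python
# Candidate cache nodes with made-up metrics.
nodes = [
    {"name": "fra-1", "load": 0.7, "cost": 1.0, "distance_km": 200},
    {"name": "ams-1", "load": 0.3, "cost": 1.2, "distance_km": 400},
    {"name": "iad-1", "load": 0.1, "cost": 0.8, "distance_km": 6000},
]

def route(nodes, w_load=10.0, w_cost=5.0, w_dist=0.01):
    """Pick the node with the lowest weighted combination of current
    load, delivery cost, and distance to the user."""
    def score(node):
        return (w_load * node["load"]
                + w_cost * node["cost"]
                + w_dist * node["distance_km"])
    return min(nodes, key=score)

best = route(nodes)  # the nearby-but-busy node loses to a less loaded one
```

Real CDNs fold many more signals into this decision, but the principle is the same: rank candidate nodes and send the request to the cheapest one.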
129. But in the end, your content gets delivered to the user. If it expires, the CDN refreshes it from your servers in the background
130. According to HTTP/1.1, a web cache has to be controlled by:
– Freshness
– Validation
– Invalidation
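A sketch of how the first two mechanisms look from the origin’s side, using real HTTP/1.1 header names but illustrative values (invalidation would then correspond to publishing a new ETag or explicitly purging the cache):

```python
from email.utils import formatdate

def make_response_headers(etag, max_age=60):
    """Origin response headers controlling a downstream web cache."""
    return {
        # Freshness: the cache may serve the response for max_age
        # seconds without contacting the origin again.
        "Cache-Control": f"max-age={max_age}",
        "Date": formatdate(usegmt=True),
        # Validation: after expiry the cache revalidates with
        # If-None-Match; an unchanged ETag means "304 Not Modified".
        "ETag": etag,
    }

def handle_conditional_get(request_headers, current_etag):
    """Origin-side handling of a cache's revalidation request."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304  # still valid: no body is re-sent
    return 200      # changed (or first fetch): full response

headers = make_response_headers('"v1"')
revalidated = handle_conditional_get({"If-None-Match": '"v1"'}, '"v1"')
changed = handle_conditional_get({"If-None-Match": '"v0"'}, '"v1"')
```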
131. As the very last step, you might still have to serve the “last mile” – the very last application access, e.g. the final item view or similar. Here, the user does hit your server