Dealing with Web Scale Data

Gautham Pai, Founder
jnaapti

A Few Guidelines
Ask questions – be active

What I cover depends on how active you are

Learn concepts before technology

You will be bombarded with several concepts, tools and
technologies – just remember that you are learning to bridge
concepts and technology.

After this program, you should be comfortable dabbling with
these concepts on your own – even reading things that are
not covered today.

http://jnaapti.com/

The Different Vases

Source: http://www.flickr.com/photos/bachmont/1382572541/


The Different Vases :(
Not preferable

Ideal!

Sufficient


Quick Poll
How many of you are from a CS background?

Knowledge of:

Data Structures

Algorithms

Databases

Have heard of:

NoSQL

Key-Value Stores

Cloud Computing

MapReduce

Hadoop


Part 0 – Setting the Context

What is this talk about?
2 themes in this talk:

About data – how is it stored, how do we work with
it

About understanding technology via concepts
learnt


How much data are we talking about really?
200 million Tweets per day – as of Jun 2011

Wikipedia dump

current revisions only - 31GB uncompressed

entire history runs into multiple TBs uncompressed

Common Crawl data – tens of TB

Tumblr – adding 3TB of new data every day

Google processes 25PB of data per day

Facebook – 135+ billion messages a month

Facebook – 130TB of logs generated per day

Vestas - Wind data - 18 to 24 petabytes of data to be processed


We are dealing with a lot more data...
Increase in the number of sensor devices

Larger audience of users using our applications via the
web and social networks results in increased data
generation

Cost of storage is falling – so we never discard any of
the data


What's in it for me?
Scrabulous case study

Built by 2 young chaps from Kolkata

Both were in their early 20s when
they built it

One was still in college.

500,000 users daily – back in 2008,
$25,000 in ad revenue per month

These days, lots of apps are being built by
college undergraduates.

If they can do it, you can do it too!

Source: Wikipedia


You have all it takes
You have access to a lot
of the tools that big
corporations use for free

You have computing
power available cheaply

You have access to a lot
of the data for free


What do I need then?

All you need is a little intelligence and a lot of
perseverance and you are on your way!


Questions to ask
Ok, you have the resources

You build a cool web
application

It is an overnight hit - can you
handle it?

What happens if the server has
a disk crash?

Can we prevent website outages on
account of hardware failures?

Slashdot Effect


Looking for answers
What do technology companies like Google/Facebook/Twitter
use to manage data? What challenges do they face in managing
such huge volumes of data? How do they analyze such data?

Image Source: http://opencompute.org/


From concept to technology
We learn quite a few subjects in
Computer Science – data structures,
algorithms, databases, networking,
operating systems, graph theory, etc.

Are we ever going to use this/need this
as engineers?

How do I use my knowledge of CS to
understand the latest developments in
the industry?

Image Source:http://www.flickr.com/photos/nics_events/2223583947/


From concept to technology
This talk is about connecting concepts to real world
examples

Image Source:http://www.flickr.com/photos/nics_events/2223583947/


A few snappy examples
Analysis of question papers from various companies

Analysis of image patterns in your photos and movie
collections

Analysis of your Facebook friends

2nd degree connections

Who is active at what time?

Who talks about what?


What is this section all about?
Before dealing with big-data problems, we first need to
know how data is handled.

This section tries to answer questions like:

How is it that 0's and 1's are sufficient to do anything
that a computer does?

Why do we need data structures?

Why do we need databases – why can't I just store all
data as flat files?


Computers – A Bit Processor
Computers only understand bits

They have a way to store and process these bits

It is up to users to give the bits a “meaning”

0 0 1 0 0 1 0 1
1 0 0 1 0 0 1 0
0 1 1 1 1 1 1 0
0 0 1 1 0 1 1 0
0 1 0 0 1 0 0 0
0 0 1 0 0 1 0 1
1 0 0 1 0 0 1 0
0 1 1 1 1 1 1 0
0 0 1 1 0 1 1 0
0 1 0 0 1 0 0 0


Data Structures
A data structure is like a cast

Pour your bits into it and a 'shape' is created

The 'shape' helps us give a meaning to the bits

Image Source: http://www.flickr.com/photos/andrein/3020194734/


Programming Languages
The human mind does not understand bits. We need higher-level
constructs to process bits. This is where programming languages
come in. They act as a bridge between what humans want to do and
what machines understand.

Image Source: http://www.flickr.com/photos/jurvetson/5872448596/


Programming Languages
Variables

Types

Operators

Conditionals

Looping

Libraries

Examples (Python):

    a = 10
    b = 20
    c = a + b

    if condition:
        do_this()

    for i in range(10):
        do_this()

    urllib.request.urlopen('http://yahoo.com/').read()

    [s.lower() for s in list_of_strings]


Primitive Types
Languages usually have two primitive types

Numbers – Integers, Floats, Doubles etc

Strings – A sequence of characters put together

Why these two types? Why not just strings?

Examples: 'bangalore', 123, 567.89, 0, -123, -567.89, '123'
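One way to see why numbers and strings are kept apart is to try the same operations on both. The sketch below is only an illustration (the variable names are made up for this example): the number 123 and the string '123' look alike but behave differently under addition and sorting.

```python
# Numbers support arithmetic; strings are sequences of characters.
number_sum = 123 + 1          # integer addition
text_join = '123' + '1'       # string concatenation, not addition
converted = int('123') + 1    # convert the string to a number first

# Ordering differs too: numbers compare by value, strings character by character.
numbers_sorted = sorted([10, 9, 100])
strings_sorted = sorted(['10', '9', '100'])

print(number_sum, text_join, converted)
print(numbers_sorted, strings_sorted)
```

If everything were stored as strings, every arithmetic operation would need a conversion step like `int('123')` first.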


Composite Types (or Collections)
The world is complex

We cannot model everything with only strings and numbers

We need ways to put primitive values together to form more complex types

Collections are a bag of values put together

Bottom up v/s Top down

Examples:

Name → First Name + Last Name

Phone No → (Country Code) Area Code + Subscriber Number

Address → Door No + Street + City + State + Pin Code

Composite of composites: Person → Name + Phone No + Address

Group of People


Collections – General Object Containers
We can represent anything in the world using collections

Collections can be mapped to bits

Computers can interpret those bits

As a matter of fact, this is what JSON allows you to do
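A rough sketch of this idea, using Python's standard `json` module: a person is built from nested collections, serialized to text, turned into bytes, and read back. The field names and values here are invented for illustration only.

```python
import json

# A composite of composites: Person -> Name + Phone No + Address (hypothetical values).
person = {
    "name": {"first": "Asha", "last": "Rao"},
    "phone": {"country_code": "+91", "area_code": "80", "subscriber": "5550100"},
    "address": {"door_no": "12", "street": "MG Road", "city": "Bangalore",
                "state": "Karnataka", "pin_code": "560001"},
}
group_of_people = [person]                  # a list of maps

as_text = json.dumps(group_of_people)       # collections -> JSON text
as_bytes = as_text.encode("utf-8")          # text -> bytes (ultimately bits)
restored = json.loads(as_bytes.decode("utf-8"))

print(len(as_bytes), "bytes")
print(restored[0]["name"]["first"])
```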


Collections
Three basic types of collections:

Lists

Sets

Maps


Collections – Lists
Grocery shopping example

Order of items matters

Do items need to be of the same type?

The key identifier is the position of the item in the list

Operations on a list:

add an item to list

remove an item from the list

get an item from the list at a specific position
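The list operations above can be sketched with Python's built-in `list`. The grocery items are just sample data.

```python
groceries = ["toothpaste", "matchbox", "tomatoes"]

groceries.append("chips")        # add an item to the list
groceries.remove("matchbox")     # remove an item from the list
second_item = groceries[1]       # get the item at a specific position (0-based)

# In Python, items do not need to be of the same type.
mixed = ["chips", 15, 1.5]

print(groceries, second_item, mixed)
```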


Collections – Sets
Items in a set are unique

There is no definite order

Operations on a set:

Add items to the set

Test if an item exists in the set

Remove an item from the set
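The same three operations, sketched with Python's built-in `set`. The label names are sample data chosen for this example.

```python
labels = set()

labels.add("inbox")              # add items to the set
labels.add("work")
labels.add("inbox")              # adding a duplicate has no effect
size_after_adds = len(labels)

has_work = "work" in labels      # test if an item exists in the set
labels.remove("work")            # remove an item from the set

print(size_after_adds, has_work, labels)
```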


Collections - Maps
Lots of maps in the real world

Indices are not always integers in the real world

We may want to identify properties of an item using some name

Examples:

Toothpaste - 1, Rs. 54
Matchbox - 10, Rs. 15
Tomatoes - 1kg, Rs. 10
Chips - 1, Rs. 15

Dictionary of word definitions

Phone book containing phone numbers


Collections – Maps contd...
Maps allow us to associate a key with a value

The name that is used to identify the set of properties is called a key

The properties identified are called the value

Examples:

Grocery list: Item is the key, properties are values

Dictionary as a map: keys are the words, values are the definitions

Phone book as a map: keys are the names, values are the phone numbers


Collections – Maps contd...
Keys don't have a definite order

Operations on a map:

Put a key, value pair

Get a value for a key

Get me all the keys and I will look at them one by one

Important:

The analogy breaks here - Don't get confused with the way a map works – keys don't have an order...

Looking up keys, not values - You don't say get me the word whose definition is ...
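A minimal sketch of these map operations with Python's built-in `dict`; the names and phone numbers are made up. One caveat: current Python dicts happen to remember insertion order, but that is a language detail, and the map concept itself promises no order.

```python
phone_book = {}

phone_book["asha"] = "080-5550100"     # put a key, value pair
phone_book["ravi"] = "033-5550123"

asha_number = phone_book["asha"]       # get a value for a key (fast, direct)

all_names = list(phone_book.keys())    # get all the keys, then look at them one by one

# Looking up by value is not what a map is built for: it needs a full scan.
owners = [name for name, number in phone_book.items() if number == "033-5550123"]

print(asha_number, all_names, owners)
```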


More composite types
List of lists

List of maps

Map of maps

...

Examples:

List of people is a list of maps

Mailboxes containing mails is a map of maps
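Both examples can be written down directly as nested Python collections. This is only a sketch; the addresses, mail ids and field names are hypothetical.

```python
# List of people: a list of maps.
people = [
    {"name": "Asha", "city": "Bangalore"},
    {"name": "Ravi", "city": "Kolkata"},
]

# Mailboxes containing mails: a map (owner -> mailbox) of maps (mail id -> mail).
mailboxes = {
    "asha@example.com": {
        "m1": {"from": "ravi@example.com", "subject": "Hello", "labels": ["inbox"]},
        "m2": {"from": "news@example.com", "subject": "Weekly digest", "labels": ["inbox", "news"]},
    },
    "ravi@example.com": {},
}

cities = [p["city"] for p in people]
subject = mailboxes["asha@example.com"]["m1"]["subject"]

print(cities, subject)
```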


Interfaces and Implementation
Lists implemented using Arrays or Linked Lists

Maps implemented using Hashtables


Hashtables
Run the key through a magic
function that gives you a number

The number identifies a slot in an
array (ideally a different slot for each key)

The magic function is called a
“hash function” - it is chosen such
that there are minimal collisions
and most uniform distribution
Image Source: Wikipedia
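To make the idea concrete, here is a toy hashtable, assuming separate chaining (each slot holds a small list) as the way to handle collisions. `ToyHashTable` is a name invented for this sketch; Python's real `dict` is implemented differently.

```python
class ToyHashTable:
    """A toy map: hash the key to pick a slot, chain entries that collide."""

    def __init__(self, slots=8):
        self.buckets = [[] for _ in range(slots)]

    def _slot(self, key):
        # The "magic function": Python's built-in hash(), reduced to a slot number.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self.buckets[self._slot(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ToyHashTable(slots=2)    # few slots, so collisions are guaranteed
table.put("toothpaste", 54)
table.put("matchbox", 15)
table.put("chips", 15)
table.put("chips", 20)
print(table.get("chips"))
```

With only two slots and three keys, at least two keys share a slot, yet lookups still return the right value because each slot keeps its own short list.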


Gmail – An Example
What data structures do we use
here?

Mail

Mailbox

Person

Label

A mailbox has a list of mails

A mail can be represented
using a map


Gmail – An Example
What is the mailbox size? How much RAM does a system have?

If all the data of the world could fit into the RAM of a single machine,
we wouldn't have a lot of the problems we face

Luckily, that's not the case!

Properties of RAMs

Are limited in their capacity

Are volatile (data disappears on reboot)

Max data in memory on a single machine is about 256GB (as of this talk)

Conclusion: We need the disk


Hmm... Our First “Big” Data Problem
Let us say, the data is present as a huge 7 GB file in the
disk.

What is the amount of time it takes to read this file
into memory?

How do I measure disk speeds?


Measuring Disk Read Speed
$ date;cat a_very_large_file > /dev/null;date
$ iotop
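The same measurement can be sketched in Python. `measure_read_speed` is a helper written for this example; it reads a file in chunks and reports MB/s. The demo runs on a small temporary file, so the operating system's cache will make the number far higher than a real disk read of a very large file.

```python
import os
import tempfile
import time


def measure_read_speed(path, chunk_size=1024 * 1024):
    """Read the whole file sequentially and return the throughput in MB/s."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass                      # discard the data, like > /dev/null
    elapsed = time.perf_counter() - start
    return size_mb / elapsed if elapsed > 0 else float("inf")


# Demo on a 4 MB temporary file; point the function at a very large file instead
# to get a more realistic figure.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name

speed_mb_per_s = measure_read_speed(demo_path)
os.remove(demo_path)
print(f"{speed_mb_per_s:.0f} MB/s")
```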


Disk Read Speed
We can get disk read speeds close to 80MB/s

Let's round it off to 100MB/s

Reading 7000MB would take 70 seconds

Would you wait if Gmail took 70 seconds to fetch your mails?

Remember, parallel read accesses and writes slow it down further.

Hmm, OK, this doesn't work. We need something faster. What's the solution?


How do we solve this?
Imagine a world where there are no databases - you
have a hard-disk and you are asked to solve this
problem.

We need to be able to read only the data we want as
quickly as we can.

How do we solve this?


Solution
Store data in fixed-size records and then have a way to
jump to the starting location of a specific record
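A minimal sketch of this idea, assuming a made-up record layout of a 4-byte id followed by a 20-byte name. Because every record has the same size, record number n starts at byte n × record size, so we can seek straight to it instead of reading everything before it. An in-memory `io.BytesIO` stands in for a file on disk.

```python
import io
import struct

# Hypothetical layout: unsigned 4-byte id + 20-byte name = 24 bytes per record.
RECORD = struct.Struct("<I20s")


def write_records(f, rows):
    for record_id, name in rows:
        # Short names are padded with zero bytes up to 20 bytes.
        f.write(RECORD.pack(record_id, name.encode("utf-8")))


def read_record(f, n):
    f.seek(n * RECORD.size)           # jump to the start of record n
    record_id, raw_name = RECORD.unpack(f.read(RECORD.size))
    return record_id, raw_name.rstrip(b"\x00").decode("utf-8")


storage = io.BytesIO()                # stands in for a file opened in binary mode
write_records(storage, [(1, "asha"), (2, "ravi"), (3, "meera")])

third = read_record(storage, 2)       # reads 24 bytes, not the whole file
print(RECORD.size, third)
```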


Relational Databases
Relational databases are an abstraction of your
filesystem to deal with “relational” data.


A word about Abstraction
Reading from a disk

Instruct the hardware to move the read head to a specific location, now
read the data

Reading from a file

Open the file, Read it, Close it

Reading from a database

Connect to the DB, query for data, Close connection

One of the skills you can pick up as an engineer is being able to define an
operation at every level of abstraction


Relational Database Design
Define Entities and their Relationships

Handling 1..1, 1..n and m..n relationships

Perform normalization

Take the entities and their relationships and come up
with tables, fields, primary keys and foreign keys

Define queries to add, update, fetch and delete data
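The design steps above can be sketched with Python's built-in `sqlite3` module. The `person` and `mail` tables are a hypothetical 1..n design (one person has many mails) chosen for this example, not a schema from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")           # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")     # SQLite enforces foreign keys only when asked

# Entities become tables, with primary keys and a foreign key for the 1..n relationship.
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE mail ("
    " id INTEGER PRIMARY KEY,"
    " person_id INTEGER NOT NULL REFERENCES person(id),"
    " subject TEXT NOT NULL)"
)

# Add
conn.execute("INSERT INTO person (id, name) VALUES (?, ?)", (1, "asha"))
conn.executemany(
    "INSERT INTO mail (person_id, subject) VALUES (?, ?)",
    [(1, "hello"), (1, "invoice")],
)
# Update
conn.execute("UPDATE mail SET subject = ? WHERE subject = ?", ("hello again", "hello"))
# Delete
conn.execute("DELETE FROM mail WHERE subject = ?", ("invoice",))
# Fetch, joining the two tables through the foreign key
subjects = [
    row[0]
    for row in conn.execute(
        "SELECT mail.subject FROM mail"
        " JOIN person ON person.id = mail.person_id"
        " WHERE person.name = ?",
        ("asha",),
    )
]
conn.close()
print(subjects)
```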


Mapping Design to Implementation
Data is stored in tables (which map to entities)

Tables contain records (rows) and fields (columns)

Records are of fixed length

Records are stored sequentially


Relational Databases – Storage Structure
Use hash-tables to point to records in the tables – so
individual records can be retrieved without having to
search the entire dataset.

This process is called “indexing”.

In theory you can have many such indexes.

Foreign keys are also indexed to speed up the lookup.
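A small sketch of what an index buys us, using a Python `dict` as the hash-table index and a list as the sequentially stored table. The records are sample data, and real database engines use more elaborate structures.

```python
# Records stored sequentially: (id, name)
records = [(101, "asha"), (205, "ravi"), (309, "meera")]


def find_by_scan(wanted_id):
    """Without an index: look at every record until we find a match."""
    for record_id, name in records:
        if record_id == wanted_id:
            return name
    return None


# Indexing: map each key to the position of its record.
index_by_id = {record_id: position for position, (record_id, _) in enumerate(records)}
# A table can have many such indexes, for example one on the name field as well.
index_by_name = {name: position for position, (_, name) in enumerate(records)}


def find_by_index(wanted_id):
    """With an index: jump straight to the record's position."""
    position = index_by_id.get(wanted_id)
    return records[position][1] if position is not None else None


print(find_by_scan(309), find_by_index(309), records[index_by_name["ravi"]])
```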


Data Storage Structures
Ordered/unordered Flat files

ISAM

Heaps

Hash buckets

B+ Trees


Part 2 – Dealing with Web
Scale Data

Web Application Design
Client/Server

Distributed computing

Models/Views/Controllers (MVC)


Client/Server Model


Client/Server Model – Separate DB Layer


Problem 1 – Too Many Requests
What if a thousand users access my server at the same time?

If the server can handle 200 such requests in parallel in one
second, what if I have 400 requests per second?

1st second → 400 requests arrive, 200 are served, 200 are left waiting

2nd second → 600 requests to handle (400 new + 200 from the previous second)

Results in server thrashing

Solution: Load Balanced Setup
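The arithmetic behind this problem can be sketched as a tiny simulation. `backlog_over_time` is a helper written for this example; it assumes a constant arrival rate and a fixed serving capacity per second, which real traffic will not follow exactly.

```python
def backlog_over_time(arrivals_per_second, capacity_per_second, seconds):
    """Return (pending, served, left_waiting) for each second."""
    waiting = 0
    history = []
    for _ in range(seconds):
        pending = waiting + arrivals_per_second       # new requests plus leftovers
        served = min(pending, capacity_per_second)
        waiting = pending - served
        history.append((pending, served, waiting))
    return history


history = backlog_over_time(arrivals_per_second=400, capacity_per_second=200, seconds=3)
for second, (pending, served, waiting) in enumerate(history, start=1):
    print(f"second {second}: pending={pending} served={served} waiting={waiting}")
```

The backlog grows by 200 every second and never drains, which is why adding capacity behind a load balancer is the next step.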



Load Balancing
Load balancing is a way of parallelizing processing
across multiple machines

The load balancer acts as a proxy that streams
requests and responses between the client and the
processing server.

Eg: HAProxy

Stateful and Stateless Architectures
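A rough sketch of the simplest balancing policy, round robin, where each new request goes to the next server in turn. `RoundRobinBalancer` and the server names are invented for this illustration; this is not how HAProxy is configured or implemented. The policy works best when the servers are stateless, so any of them can handle any request.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backend servers in a fixed rotation."""

    def __init__(self, servers):
        self._rotation = itertools.cycle(servers)

    def pick(self):
        return next(self._rotation)


balancer = RoundRobinBalancer(["app1", "app2", "app3"])
assignments = [balancer.pick() for _ in range(5)]
print(assignments)
```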


Problem 2 – Even More Requests
What if the Load Balancer itself becomes the
bottleneck?

Solution:

Round Robin DNS

Building multiple independent clusters


Clustered Setup



Problem 3 – The Stateful Database
A single database cannot handle all requests from all
users.

Unlike front-end servers, databases are not “stateless”

If we are trying to only read information, it's fine, but
if we are trying to write information, this is a problem.


Scale Up v/s Scale Out
Scale up means to add resources (CPUs or memory) to
a single system in order to increase its
processing capabilities

Scale up has limitations in how much we can scale –
but is easier to do

Scale out means to add more nodes to a system

Scale out can provide near-linear scalability and is often less
expensive, but is complex compared to scale-up



Scale Up Solution to the DB Problem
Increase the system's capacity by adding more
resources to the system – faster disks, more RAM,
faster processors, more cores etc

Introduce on-the-fly compression of data in the
database

Scale up is not scalable enough



Scale Out Solutions to the DB Problem
Until the virtualization revolution and until we reached
the limits of hardware, we were looking at scale up
solutions rather than scale out solutions

Partition your data and put it on multiple systems
– a subset of the rows in each system

This is called Sharding
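One common way to decide which system holds which row is to hash the row's key and take it modulo the number of shards. The sketch below assumes that scheme (other partitioning schemes exist); the shards are plain dictionaries standing in for separate database servers, and `shard_for`, `put_row` and `get_row` are names made up for this example.

```python
import hashlib

SHARD_COUNT = 4
shards = [dict() for _ in range(SHARD_COUNT)]   # each dict stands in for one database server


def shard_for(key, shard_count=SHARD_COUNT):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def put_row(key, row):
    shards[shard_for(key)][key] = row


def get_row(key):
    return shards[shard_for(key)][key]


for i in range(1000):
    put_row(f"user{i}", {"id": i})

rows_per_shard = [len(shard) for shard in shards]
print(rows_per_shard, get_row("user42"))
```

Each shard ends up with a subset of the rows, and a lookup touches only the one shard that owns the key.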


Issues with Sharding
No clear way of partitioning the data

Maintaining ACID (Atomicity, Consistency, Isolation,
Durability) properties is complex

Joining data across machines is complex

Re-sharding is complex
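To see why re-sharding hurts, assume keys are assigned to shards by "hash modulo number of shards" (one common scheme, assumed here for illustration). Growing from 4 to 5 shards changes the result for most keys, so most rows have to be moved between machines.

```python
import hashlib


def shard_for(key, shard_count):
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


keys = [f"user{i}" for i in range(10000)]

# A key stays put only if its shard number is the same under both shard counts.
moved = sum(1 for key in keys if shard_for(key, 4) != shard_for(key, 5))
moved_fraction = moved / len(keys)

print(f"{moved_fraction:.0%} of keys change shard when going from 4 to 5 shards")
```

For a uniform hash, a key keeps its shard only when its hash gives the same remainder for 4 and 5, which happens for about one key in five, so around 80% of the data moves.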


Other Issues with Relational Databases
Data could be unstructured/semi-structured

Impedance mismatch (ORM issues)

Sparse values are not handled well - results in wastage of
storage (although some engines handle this today)

Changes in schema are difficult

Not all data requires ACID/Transactional support

Normalization results in more queries and that means
more disk accesses - some apps can do without them


The NoSQL Revolution
The NoSQL revolution happened to solve the many issues faced
with storing web-scale data in relational databases

NoSQL stores, as the name suggests, don't use SQL to store and
retrieve data

Widely adopted in web applications these days, several
solutions available

Still in research – no clear winner and therefore difficult to
choose among alternatives


Advantages of NoSQL Stores
They don't require fixed schemas

Avoid joins

Sharding (Scale out) is easier – some even do it
automatically

Many of the implementations replicate the data and
thus avoid SPOFs (Single Point of Failure)


Examples of NoSQL Stores
MongoDB

CouchDB

Neo4J

Cassandra

BigTable

...


Types of NoSQL Stores
Key/Value

Document Stores

Graph Databases

Object Databases

RDF Databases

...


NoSQL Storage Structures
Distributed Hashtables

Consistent Hashing

Order-Preserving Partitioning

B-tree

COW B-tree

Stratified B-tree
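Of the structures listed, consistent hashing is easy to sketch. Nodes and keys are hashed onto the same ring, and a key belongs to the first node point found clockwise from it. The `HashRing` class below is a simplified illustration written for this example (with virtual points per node, here called replicas), not the implementation used by any particular store.

```python
import bisect
import hashlib


def ring_hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=100):
        self._replicas = replicas     # virtual points per node, for a more even spread
        self._points = []             # sorted positions on the ring
        self._owner = {}              # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._replicas):
            point = ring_hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def node_for(self, key):
        # First point clockwise from the key's position, wrapping around at the end.
        idx = bisect.bisect(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {key: ring.node_for(key) for key in keys}

ring.add("node-e")
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]

print(f"{len(moved)} of {len(keys)} keys moved, all of them to the new node")
```

Unlike plain modulo hashing, adding a node here moves only the keys that the new node takes over; every other key stays where it was.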


Part 3 – Analyzing Web Scale
Data

Examples of Web Scale Data Analysis
Distributed Grep - Look for a pattern in all the Tweets

Inverted Index Building - This is what is used by search
engines

Sentiment Analysis

Competition Analysis

Log Analysis
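As one example from this list, an inverted index maps each word to the documents that contain it. The sketch below builds one in two steps, a map step that emits (word, document) pairs and a reduce step that groups them by word, on two tiny made-up documents. A real search engine does this across many machines and with far more careful text processing.

```python
from collections import defaultdict

documents = {
    "doc1": "big data is big",
    "doc2": "data about data",
}


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word occurrence."""
    return [(word, doc_id) for word in text.lower().split()]


def reduce_phase(pairs):
    """Group the pairs by word, collecting the set of documents for each word."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return index


pairs = [pair for doc_id, text in documents.items() for pair in map_phase(doc_id, text)]
inverted_index = reduce_phase(pairs)

print(sorted(inverted_index["data"]), sorted(inverted_index["big"]))
```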


Understanding the problem of Analysis
Unlike in the case of retrieving data, in the case of
analysis, we need to read through everything, but
disk reads are slow.

Let's do some simple math:

1 Hard Disk read speed is 100MB/s

100 Hard Disks read in parallel gives 10GB/s!

Can we exploit this parallelism?
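The math can be worked through for a concrete size. The sketch assumes 1 TB of data (taken as 1,000,000 MB), split evenly across the disks, with each disk reading at the 100MB/s figure used above. `read_time_seconds` is a helper made up for this example.

```python
def read_time_seconds(data_mb, disks, mb_per_second_per_disk=100):
    """Time to read the data if it is spread evenly and all disks read in parallel."""
    return data_mb / (disks * mb_per_second_per_disk)


one_tb_in_mb = 1_000_000

single_disk = read_time_seconds(one_tb_in_mb, disks=1)
hundred_disks = read_time_seconds(one_tb_in_mb, disks=100)

print(f"1 disk: {single_disk:.0f} s, 100 disks: {hundred_disks:.0f} s")
```

That is close to three hours on one disk against well under two minutes on a hundred disks, provided the data is already spread out before the reading starts.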


The Coin Counting Example
You have a sack full of coins, and you are asked to separate
them into 1, 2, 5 and 10 Rs coins and tell how many of each
are present.

Now, let's say you have a few sacks full of coins and it will take
you a lot of time to count them yourself – so you call a few other
people to help you out.

Now, let's say there are a few rooms full of coins (like in some
large temples in India) – how will you count them?


Coin Counting Problem – in depth
You can't add more people to the same room – the
room is already full.

You can get a few more rooms, ask people to take some
coins to the other room and then do the counting
there, and come back with the coins and the final count.

This will mean a lot of “traffic” in the corridor.

So what's a better solution?


A Possible Solution to the Coin Counting Problem

Unload the coins in different rooms rather than in the
same room.

Then get workers in different rooms. With an increase
in coins, increase the number of rooms and workers.

Let the workers in each room work independently.

This is how Map/Reduce frameworks work
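The coin counting solution can be sketched in a few lines. Each room is counted independently (the map step), and only the small per-room tallies are brought together and added up (the reduce step). The rooms and coins are sample data; in a real framework each room would be a separate machine working in parallel.

```python
from collections import Counter

# Each inner list is one room full of coins (denominations in Rs).
rooms = [
    [1, 2, 5, 10, 1, 1],
    [5, 5, 10],
    [2, 2, 1, 10, 10],
]


def count_room(coins):
    """Map step: one worker counts the coins in its own room."""
    return Counter(coins)


per_room_counts = [count_room(room) for room in rooms]

# Reduce step: only the small tallies travel down the corridor, not the coins.
total_counts = sum(per_room_counts, Counter())

print(dict(total_counts))
```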


Traditional Parallel Processing
Use of threads, sharing data, synchronization

Results in Deadlocks, Livelocks, Starvation etc

Handling failures is complex

Parallel Programming is hard this way.


Requirements from a parallel processing framework
Higher level programming constructs – don't need to deal with sockets,
threading, locking, sharing data etc

Manage failures - if a task fails or a system breaks down, we want the
framework to transparently manage it

Recoverability - If a system fails, another system must be able to pick up
its workload

Replication – if a system fails, we don't lose data – the framework
should replicate data in multiple nodes

Scalability – Adding more compute nodes should help us increase the
compute capacity


Pulling data Or Pushing Computations?
Pulling data for computation results in a bottleneck

Every “database store” also has a “processor”.

Instead of pulling the data for computation, can we
think about pushing the computation out to where the
data resides?

The computation is just "bytes", maybe a few MB of object
code, which is still trivial compared to the data it works
on


MapReduce
Concept introduced by Google in 2004

Framework is inspired by map and reduce functions
found in functional programming languages

Hadoop is an open-source implementation of
MapReduce
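For reference, this is what the map and reduce functions look like in Python itself, on a single machine. The sketch counts the words in two sample lines: `map` transforms each line independently, and `functools.reduce` folds the results into one value. MapReduce frameworks borrow the shape of these two steps and run them across many machines.

```python
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# map: apply the same function to every item, independently of the others.
words_per_line = list(map(lambda line: len(line.split()), lines))

# reduce: combine the mapped results into a single value.
total_words = reduce(lambda a, b: a + b, words_per_line, 0)

print(words_per_line, total_words)
```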


MapReduce Frameworks
Data is spread throughout machines before starting
the task

Computation is done in the nodes where data is stored

Data is replicated in multiple machines to increase
reliability

Tasks are executed on multiple nodes just in case one
of them is running slow


Using the Common Crawl Data – A Case Study
The dump is a few tens of TB in size

Where/How do you download it?

Answer: You don't need to download it

Instead you push your computation to where the data
exists, perform your computation and then only fetch
results you are interested in!


Recap
My knowledge of computer science:

Am I ever going to use this/need this as an
engineer?

How do I use this knowledge to understand the
latest developments in software engineering?

Hope you have an answer now!


Parting Thoughts
Technology changes very rapidly – don't expect to be
spoon-fed

Practise, Practise, Practise - Katas

Concept before Technology

Try out new things – even if they are not related to your
project/curriculum

Read and understand other people's code

Read a lot, for example: http://highscalability.com/


We at jnaapti conduct workshops and provide
training on these technologies – contact us at
http://jnaapti.com/ for more details


For feedback/clarification email:
gautham-at-jnaapti.com


Thanks and...


All the Best


Happy Hacking


Sources
Twitter - http://blog.twitter.com/2011/03/numbers.html

Tumblr - http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-bill

Facebook log data - http://www.facebook.com/note.php?note_id=409881258919

Facebook messages - http://highscalability.com/blog/2010/11/16/facebooks-new-real-time

Vestas - http://www-01.ibm.com/software/success/cssdb.nsf/cs/RMUE-8NMJQ

