Dealing with Web Scale Data


Gautham Pai, Founder
jnaapti
A Few Guidelines
Ask questions – be active

  What I cover depends on how active you are

Learn concepts before technology

  You will be bombarded with several concepts, tools and
  technologies – just remember that you are learning to bridge
  concepts and technology.

  After this program, you should be comfortable dabbling with
  these concepts on your own – even reading things that are
  not covered today.

                        http://jnaapti.com/
The Different Vases




Source: http://www.flickr.com/photos/bachmont/1382572541/

The Different Vases :(
                               Not preferable

Ideal!




Sufficient




             Source: http://www.flickr.com/photos/bachmont/1382572541/

Quick Poll
How many of you are from a CS background?

Knowledge of:

  Data Structures

  Algorithms

  Databases

Have heard of:

  NoSQL

  Key-Value Stores

  Cloud Computing

  MapReduce

  Hadoop


Part 0 – Setting the Context
What is this talk about?
2 themes in this talk:

  About data – how is it stored, how do we work with
  it

  About understanding technology via concepts
  learnt




How much data are we talking about really?
 200 million Tweets per day – as of Jun 2011

 Wikipedia dump

   current revisions only - 31GB uncompressed

   entire history runs into multiple TBs uncompressed

 Common Crawl data – tens of TB

 Tumblr – adding 3TB of new data every day

 Google processes 25PB of data per day

 Facebook – 135+ billion messages a month

 Facebook – 130TB of logs generated per day

 Vestas - Wind data - 18 to 24 petabytes of data to be processed


We are dealing with a lot more data...
Increase in the number of sensor devices

Larger audience of users using our applications via the
web and social networks results in increased data
generation

Cost of storage is falling – so we never discard any of
the data




What's in it for me?
Scrabulous case study

  Built by 2 young chaps from Kolkata

  Both were in their early 20's when
  they built it

  One was still in college.

  500,000 users daily – back in 2008,
  $25,000 in ad revenue per month

These days lots of apps are being built by
college undergraduates.

Source: Wikipedia



 If they can do it, you can do it too!


You have all it takes
You have access to a lot
of the tools that big
corporations use for free

You have computing
power available cheaply

You have access to a lot
of the data for free



What do I need then?




All you need is a little intelligence and a lot of
    perseverance and you are on your way!




Questions to ask
Ok, you have the resources

You build a cool web
application

It is an overnight hit - can you
handle it?

What happens if the server has
a disk crash?

Can we prevent website
outages on account of
hardware failures?

Image: Slashdot Effect

Looking for answers
What do technology companies like Google/Facebook/Twitter
use to manage data? What challenges do they face in managing
 such huge volumes of data? How do they analyze such data?




                     Image Source: http://opencompute.org/

From concept to technology
We learn quite a few subjects in
Computer Science – data structures,
algorithms, databases, networking,
operating systems, graph theory, etc.

Are we ever going to use this/need this
as engineers?

How do I use my knowledge of CS to
understand the latest developments in
the industry?



                                       Image Source:http://www.flickr.com/photos/nics_events/2223583947/

From concept to technology
This talk is about connecting concepts to real world
                     examples




                              Image Source:http://www.flickr.com/photos/nics_events/2223583947/

A few snappy examples
Analysis of question papers from various companies

Analysis of image patterns in your photos and movie
collections

Analysis of your Facebook friends

  2nd degree connections

  Who is active at what time?

  Who talks about what?

Part 1 – Dealing with Data
What is this section all about?
Before dealing with big-data problems, we first need to
know how data is handled.

This section tries to answer questions like:

  How is it that 0's and 1's are sufficient to do anything
  that a computer does?

  Why do we need data structures?

  Why do we need databases – why can't I just store all
  data as flat files?

Computers – A Bit Processor
Computers only understand bits

They have a way to store and process these bits

It is up to users to give the bits a “meaning”

    0 0 1 0 0 1 0 1
    1 0 0 1 0 0 1 0
    0 1 1 1 1 1 1 0
    0 0 1 1 0 1 1 0
    0 1 0 0 1 0 0 0
    0 0 1 0 0 1 0 1
    1 0 0 1 0 0 1 0
    0 1 1 1 1 1 1 0
    0 0 1 1 0 1 1 0
    0 1 0 0 1 0 0 0




Data Structures
Data structure is like a
cast

Pour your bits into it and
a 'shape' is created

The 'shape' helps us
provide a meaning to the
bits                             Image Source: http://www.flickr.com/photos/andrein/3020194734/




Programming Languages
The human mind does not think in bits. We need higher-level
constructs to process bits. This is where programming languages
come in. They act as a bridge between what humans want to do and
what machines understand.




                 Image Source: http://www.flickr.com/photos/jurvetson/5872448596/

Programming Languages
Variables

Types

Operators

Conditionals

Looping

Libraries

Examples (Python):

    import urllib.request

    a = 10
    b = 20
    c = a + b

    if condition:
        do_this()

    for i in range(10):
        do_this()

    urllib.request.urlopen('http://yahoo.com/').read()

    [s.lower() for s in list_of_strings]

Primitive Types
Languages usually have two primitive types

  Numbers – Integers, Floats, Doubles etc

  Strings – A sequence of characters put together

Examples: 'bangalore', 123, 567.89, 0, -123, -567.89, '123'

Why these two types? Why not just strings?
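One way to approach the "why not just strings?" question is to compare how the same operator behaves on each type. The following is a minimal Python sketch; the variable names are illustrative only.

```python
# The same characters behave differently depending on their type.
as_number = 123
as_string = '123'

# '+' adds numbers but concatenates strings.
number_sum = as_number + as_number      # 246
string_sum = as_string + as_string      # '123123'

# Numbers compare by value; strings compare character by character.
numeric_order = 9 < 10                  # True
string_order = '9' < '10'               # False, because '9' sorts after '1'

# Converting between the two types is an explicit step.
converted = int(as_string) + 1          # 124
```

Arithmetic and ordering on numbers would have to be re-implemented on top of characters if everything were a string, which is one plausible reason languages keep both types.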


Composite Types (or Collections)
The world is complex

We cannot model everything with only strings and numbers

We need ways to put primitive values together to form more complex types

Collections are a bag of values put together

Bottom up v/s Top down

Examples:

  Name → First Name + Last Name

  Phone No → (Country Code) Area Code + Subscriber Number

  Address → Door No + Street + City + State + Pin Code

  Composite of composites: Person → Name + Phone No + Address

  Group of People


Collections – General Object Containers
We can represent anything in the world using collections

As a matter of fact, this is what JSON allows you to do

Collections can be mapped to bits

Computers can interpret those bits
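As a rough illustration of collections being mapped to bits and back, here is a small Python sketch using the standard `json` module. The person record and all of its values are made up for the example.

```python
import json

# A person modelled as a composite of composites (all values are made up).
person = {
    "name": {"first": "Asha", "last": "Rao"},
    "phone": {"country_code": "91", "area_code": "80", "subscriber": "5550100"},
    "address": {"door_no": "12", "street": "MG Road", "city": "Bangalore",
                "state": "Karnataka", "pin_code": "560001"},
}

# Collection -> text -> bytes (the bits a computer stores or transmits).
as_text = json.dumps(person, sort_keys=True)
as_bytes = as_text.encode("utf-8")

# Bytes -> text -> collection: the same structure comes back.
restored = json.loads(as_bytes.decode("utf-8"))
```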



Collections
Three basic types of collections:

  Lists

  Sets

  Maps




Collections – Lists
Grocery shopping example

Order of items matters

Do items need to be of the same type?

The key identifier is the position of the item in the list

Operations on a list:

  add an item to list

  remove an item from the list

  get an item from the list at a specific position
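The list operations above can be sketched with Python's built-in `list`. The grocery items are illustrative.

```python
# A grocery list: order matters and items are identified by position.
groceries = ["toothpaste", "matchbox", "tomatoes"]

groceries.append("chips")        # add an item to the end of the list
groceries.remove("matchbox")     # remove an item (first match by value)
second_item = groceries[1]       # get the item at position 1 (0-based)

# Python lists may hold items of different types; other languages may not allow this.
mixed = ["chips", 15, 10.5]
```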

Collections – Sets
Items in a set are unique

There is no definite order

Operations on a set:

  Add items to the set

  Test if an item exists in the set

  Remove an item from the set
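The same operations, sketched with Python's built-in `set`. The visitor names are illustrative.

```python
# A set of unique visitors: duplicates are ignored and there is no position.
visitors = set()

visitors.add("alice")            # add items to the set
visitors.add("bob")
visitors.add("alice")            # adding a duplicate changes nothing
count_after_adds = len(visitors)

has_bob = "bob" in visitors      # test if an item exists in the set
visitors.remove("bob")           # remove an item from the set
```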



Collections - Maps
Lots of maps in the real world

Indices are not always integers in the real world

We may want to identify properties of an item, using some name

Examples:

  Grocery list:
    Toothpaste - 1, Rs. 54
    Matchbox - 10, Rs. 15
    Tomatoes - 1kg, Rs. 10
    Chips - 1, Rs. 15

  Dictionary of word definitions

  Phone book containing phone numbers



Collections – Maps contd...
Maps allow us to associate a key with a value

The name that is used to identify the set of properties is called a key

The properties identified are called the value

Examples:

  Grocery list: Item is the key, properties are values

  Dictionary as a map: keys are the words, values are the definitions

  Phone book as a map: keys are the names, values are the phone numbers

Collections – Maps contd...
Keys don't have a definite order

Operations on a map:

  Put a key, value pair

  Get a value for a key

  Get me all the keys and I will look at them one by one

Important: The analogy breaks here - don't get confused with the
way a map works – keys don't have an order...

Looking up keys, not values - you don't say "get me the word whose
definition is ..."
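A short sketch of these map operations using Python's `dict`. The phone numbers are made up. Note that a Python `dict` happens to remember insertion order, but that is a property of this implementation and not of maps in general.

```python
# A phone book as a map: names are keys, phone numbers are values.
phone_book = {}

phone_book["Asha"] = "555-0100"          # put a key, value pair
phone_book["Ravi"] = "555-0199"

ashas_number = phone_book["Asha"]        # get a value for a key
missing = phone_book.get("Kiran")        # a missing key gives None with get()

# Get all the keys and look at them one by one.
names = sorted(phone_book.keys())

# Looking up by value is not what a map is built for: it needs a full scan.
owners = [name for name, number in phone_book.items() if number == "555-0199"]
```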

More composite types
List of lists

List of maps

Map of maps

...

Examples:

  List of people is a list of maps

  Mailboxes containing mails is a map of maps




Interfaces and Implementation
Lists implemented using Arrays or Linked Lists

Maps implemented using Hashtables




Hashtables
Run the key through a magic
function that gives you a number

The number points to a slot in an array
(ideally a different slot for each key)

The magic function is called a
“hash function” - it is chosen such
that there are minimal collisions
and most uniform distribution
                                                    Image Source: Wikipedia
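To make the idea concrete, here is a toy hash table in Python. It is a sketch for illustration, not how any particular language implements its maps: the class name `SimpleHashTable` and the choice of separate chaining (a small list per slot) to handle collisions are assumptions of this example.

```python
class SimpleHashTable:
    """A toy hash table that resolves collisions with a list per slot."""

    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _slot_for(self, key):
        # The "magic function": hash() gives an integer, modulo maps it to a slot.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._slot_for(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)     # overwrite an existing key
                return
        bucket.append((key, value))          # new key (possibly a collision)

    def get(self, key, default=None):
        bucket = self.slots[self._slot_for(key)]
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        return default


table = SimpleHashTable()
table.put("toothpaste", 54)
table.put("chips", 15)
table.put("chips", 20)                       # replaces the earlier value
```

A lookup only inspects one slot, so its cost does not grow with the table as long as the hash function spreads keys evenly.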




Gmail – An Example
What data structures do we use
here?

  Mail

  Mailbox

  Person

  Label

A mailbox has a list of mails

A mail can be represented
using a map
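A minimal sketch of that modelling in Python. The field names (`from`, `to`, `subject`, `body`, `labels`) and the addresses are assumptions made for the example and are not Gmail's actual data model.

```python
# A mail represented as a map (all field names and values are illustrative).
mail_1 = {
    "from": "ravi@example.com",
    "to": ["asha@example.com"],
    "subject": "Trip plan",
    "body": "Shall we leave on Friday?",
    "labels": {"inbox", "travel"},
}
mail_2 = {
    "from": "newsletter@example.com",
    "to": ["asha@example.com"],
    "subject": "Weekly digest",
    "body": "This week in tech...",
    "labels": {"promotions"},
}

# A mailbox has a list of mails; mailboxes are kept in a map keyed by owner.
mailboxes = {"asha@example.com": [mail_1, mail_2]}

# Example query: subjects of all mails carrying the "inbox" label.
inbox_subjects = [
    mail["subject"]
    for mail in mailboxes["asha@example.com"]
    if "inbox" in mail["labels"]
]
```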

Gmail – An Example
What is the mailbox size? How much RAM does a system have?

If all the data of the world could fit into the RAM of a single machine,
we wouldn't have a lot of the problems we face

Luckily, that's not the case!

Properties of RAMs

  Are limited in their capacity

  Are volatile (data disappears on reboot)

  Memory on a single machine has an upper limit (say, 256GB)

                    Conclusion: We need the disk

Hmm... Our First “Big” Data Problem
Let us say, the data is present as a huge 7 GB file in the
disk.

What is the amount of time it takes to read this file
into memory?

How do I measure disk speeds?




Measuring Disk Read Speed
$ date;cat a_very_large_file > /dev/null;date
$ iotop
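The same measurement can be sketched in Python. The function name `measure_read_speed` and the 8MB demo file are assumptions of this example. On a freshly written file the operating system's cache usually serves the data, so the number reported here will be far higher than the raw disk speed. Use a file larger than RAM for a realistic figure.

```python
import os
import tempfile
import time


def measure_read_speed(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks and return the throughput in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)


# Demo on a small temporary file (8MB of zero bytes).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (8 * 1024 * 1024))
    demo_path = tmp.name

demo_speed_mb_s = measure_read_speed(demo_path)
os.remove(demo_path)
```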




Disk Read Speed
We can get disk read speeds close to 80MB/s

Let's round it off to 100MB/s

Reading 7000MB would take 70 seconds

Would you wait if Gmail took 70 seconds to fetch your mails?

Remember, parallel read accesses and writes slow it down further.



   Hmm, ok, this doesn't work, we need something faster, solution?




How do we solve this?
Imagine a world where there are no databases - you
have a hard-disk and you are asked to solve this
problem.

We need to be able to read only the data we want as
quickly as we can.



               How do we solve this?


Solution
Store data in fixed sized records and then have a way to
   jump to the starting location of a specific record
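A minimal Python sketch of this idea, using the standard `struct` module. The record layout (a 4-byte id followed by a 20-byte name) and the function names are assumptions chosen for the example.

```python
import os
import struct
import tempfile

# Hypothetical record layout: 4-byte unsigned id + 20-byte name, no padding.
RECORD_FORMAT = "<I20s"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)   # 24 bytes per record


def write_records(path, records):
    """Write (id, name) pairs as fixed-size records, one after another."""
    with open(path, "wb") as f:
        for record_id, name in records:
            # Names shorter than 20 bytes are padded with zero bytes.
            f.write(struct.pack(RECORD_FORMAT, record_id, name.encode("utf-8")))


def read_record(path, index):
    """Jump straight to record number `index` without reading the others."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)            # start = index * record size
        raw = f.read(RECORD_SIZE)
    record_id, name = struct.unpack(RECORD_FORMAT, raw)
    return record_id, name.rstrip(b"\0").decode("utf-8")


demo_path = os.path.join(tempfile.mkdtemp(), "items.dat")
write_records(demo_path, [(1, "toothpaste"), (2, "matchbox"), (3, "chips")])
third_record = read_record(demo_path, 2)
os.remove(demo_path)
```

Because every record has the same size, the position of record `n` is simple arithmetic, so only 24 bytes are read no matter how large the file is.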




Relational Databases
Relational databases are an abstraction of your
   filesystem to deal with “relational” data.




A word about Abstraction
Reading from a disk

  Instruct the hardware to move the read head to a specific location, now
  read the data

Reading from a file

  Open the file, Read it, Close it

Reading from a database

  Connect to the DB, query for data, Close connection



One of the skills you can pick up as an engineer is being able to define an
operation at every level of abstraction

Relational Database Design
Define Entities and their Relationships

Handling 1..1, 1..n and m..n relationships

Perform normalization

Take the entities and their relationships and come up
with tables, fields, primary keys and foreign keys

Define queries to add, update, fetch and delete data



Mapping Design to Implementation
Data is stored in tables (which map to entities)

Tables contain records (rows) and fields (columns)

Records are of fixed length

Records are stored sequentially




Relational Databases – Storage Structure
Use lookup structures such as hash tables to point to records in
the tables – so individual records can be retrieved without having
to search the entire dataset.

This process is called “indexing”.

In theory you can have many such indexes.

Foreign keys are also indexed to speed up the lookup.
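As a rough sketch of indexing, the Python below scans a data file once to build a map from key to byte offset, and then uses that map to fetch a single record directly. The file layout (one `key,value` line per record) and the function names are assumptions of this example. Real database engines use more elaborate on-disk structures.

```python
import os
import tempfile


def build_index(path):
    """Scan the file once and map each key to the byte offset of its line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b",", 1)[0].decode("utf-8")
            index[key] = offset
    return index


def lookup(path, index, key):
    """Fetch one record by seeking to its offset instead of scanning."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return f.readline().decode("utf-8").rstrip("\n")


demo_path = os.path.join(tempfile.mkdtemp(), "users.csv")
with open(demo_path, "wb") as f:
    f.write(b"u1,Asha\nu2,Ravi\nu3,Kiran\n")

user_index = build_index(demo_path)
found = lookup(demo_path, user_index, "u3")
os.remove(demo_path)
```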




Data Storage Structures
Ordered/unordered Flat files

ISAM

Heaps

Hash buckets

B+ Trees




Part 2 – Dealing with Web
Scale Data
Web Application Design
Client/Server

Distributed computing

Models/Views/Controllers (MVC)




Client/Server Model




Client/Server Model – Separate DB Layer




Problem 1 – Too Many Requests
What if a thousand users access my server at the same time?

If the server can handle 200 such requests in parallel in one
second, what if 400 requests arrive every second?

  1st second → 400 requests arrive, 200 are served, 200 are left waiting

  2nd second → 600 requests pending (400 new + 200 from the previous second)

Results in server thrashing
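The growth of the backlog can be worked out with a few lines of Python. This is a simplified model that assumes a constant arrival rate and that unserved requests simply wait. The function name is illustrative.

```python
def simulate_backlog(arrivals_per_second, capacity_per_second, seconds):
    """Return the number of requests pending in each second
    (new arrivals plus leftovers) before that second's work is done."""
    pending_history = []
    backlog = 0
    for _ in range(seconds):
        pending = backlog + arrivals_per_second
        pending_history.append(pending)
        # Whatever exceeds the server's capacity carries over.
        backlog = max(0, pending - capacity_per_second)
    return pending_history


# 400 requests/s arriving at a server that handles 200 requests/s.
history = simulate_backlog(400, 200, 4)      # [400, 600, 800, 1000]
```

The pending count grows by 200 every second and never recovers, which is why adding capacity (more servers) is the way out.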

Solution: Load Balanced Setup




Load Balancing
Load balancing is a way of parallelizing processing
across multiple machines

The load balancer acts as a proxy that streams
requests and responses between the client and the
processing server.

Eg: HAProxy

Stateful and Stateless Architectures
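One simple way a load balancer can pick a processing server is round robin. The sketch below shows only the selection logic in Python. The class name and backend addresses are hypothetical, and it says nothing about how HAProxy is actually configured. It assumes stateless servers, so that any server can handle any request.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backends in turn so that load is spread evenly."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(list(backends))

    def pick_backend(self):
        return next(self._cycle)


balancer = RoundRobinBalancer(["app1:8000", "app2:8000", "app3:8000"])

# Five incoming requests are assigned to the three servers in turn.
assignments = [balancer.pick_backend() for _ in range(5)]
```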


Problem 2 – Even More Requests
What if the Load Balancer itself becomes the
bottleneck?

Solution:

  Round Robin DNS

  Building multiple independent clusters




Clustered Setup




Problem 3 – The Stateful Database
A single database cannot handle all requests from all
users.

Unlike front-end servers, databases are not “stateless”

If we are trying to only read information, it's fine, but
if we are trying to write information, this is a problem.




Scale Up v/s Scale Out
Scale up means to add resources (CPUs or memory) to
a single system in order to increase its
processing capabilities

  Scale up has limitations in how much we can scale –
  but is easier to do

Scale out means to add more nodes to a system

  Scale out can provide near-linear scalability and is often less
  expensive, but is complex compared to scale-up

Scale Up Solution to the DB Problem
Increase the system's capacity by adding more
resources to the system – faster disks, more RAM,
faster processors, more cores etc

Introduce on-the-fly compression of data in the
database



           Scale up is not scalable enough


Scale Out Solutions to the DB Problem
Until the virtualization revolution and until we reached
the limits of hardware, we were looking at scale up
solutions rather than scale out solutions

Partition your data and put them on multiple systems
– a subset of the rows in each system

This is called Sharding
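A minimal sketch of hash-based sharding in Python, with each shard simulated by an in-memory dictionary. The number of shards, the `user_id` key and the function names are assumptions of this example. Real systems may partition by range or by a lookup directory instead.

```python
import hashlib

NUM_SHARDS = 4
# Each dict stands in for a separate database server.
shards = [{} for _ in range(NUM_SHARDS)]


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def put_row(user_id, row):
    shards[shard_for(user_id)][user_id] = row


def get_row(user_id):
    return shards[shard_for(user_id)].get(user_id)


for n in range(1000):
    put_row("user%d" % n, {"name": "User %d" % n})

rows_per_shard = [len(shard) for shard in shards]

# How many keys would live on a different shard if a fifth shard were added?
moved = sum(
    1 for n in range(1000)
    if shard_for("user%d" % n, 4) != shard_for("user%d" % n, 5)
)
```

With plain modulo hashing, changing the number of shards relocates a large share of the keys (`moved` counts them), which hints at why re-sharding is considered complex.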




Issues with Sharding
No clear way of partitioning the data

Maintaining ACID (Atomicity, Consistency, Isolation,
Durability) properties is complex

Joining data across machines is complex

Re-sharding is complex




Other Issues with Relational Databases
Data could be unstructured/semi-structured

Impedance mismatch (ORM issues)

Sparse values are not handled well - results in wastage of
storage (although some engines handle this today)

Changes in schema are difficult

Not all data require ACID/Transactional support

Normalization results in more queries and that means
more disk accesses - some apps can do without them

The NoSQL Revolution
The NoSQL revolution happened to solve the many issues faced
with storing web-scale data in relational databases

NoSQL stores, as the name suggests, don't use SQL to store and
retrieve data

Widely adopted in web applications these days, several
solutions available

Still in research – no clear winner and therefore difficult to
choose among alternatives



Advantages of NoSQL Stores
They don't require fixed schemas

Avoid joins

Sharding (Scale out) is easier – some even do it
automatically

Many of the implementations replicate the data and
thus avoid SPOFs (Single Point of Failure)




Examples of NoSQL Stores
MongoDB

CouchDB

Neo4J

Cassandra

BigTable

...



Types of NoSQL Stores
Key/Value

Document Stores

Graph Databases

Object Databases

RDF Databases

...



NoSQL Storage Structures
Distributed Hashtables

Consistent Hashing

Order-Preserving Partitioning

B-tree

COW B-tree

Stratified B-tree



Part 3 – Analyzing Web Scale
Data
Examples of Web Scale Data Analysis
Distributed Grep - Look for a pattern in all the Tweets

Inverted Index Building - This is what is used by search
engines

Sentiment Analysis

Competition Analysis

Log Analysis
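To give a feel for one of these analyses, here is a small single-machine sketch of inverted index building in Python. The sample tweets and the simple whitespace tokenisation are assumptions of the example. A real search engine does far more (stemming, ranking, distribution across machines).

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Made-up tweets keyed by id.
tweets = {
    1: "Big data is big",
    2: "Data at web scale",
    3: "Hello web",
}

tweet_index = build_inverted_index(tweets)

# Which tweets mention "data"? Which mention "web"?
data_matches = sorted(tweet_index["data"])
web_matches = sorted(tweet_index["web"])
```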



Understanding the problem of Analysis
Unlike in the case of retrieving data, in the case of
analysis, we need to read through everything, but
disk reads are slow.

Let's do some simple math:

  1 Hard Disk read speed is 100MB/s

  100 Hard Disks read in parallel gives 10GB/s!

Can we exploit this parallelism?
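The arithmetic can be extended to a full scan. The sketch below assumes 1TB of data (taken as 1,000,000MB), spread evenly over the disks, with every disk sustaining 100MB/s. These are idealised assumptions for illustration.

```python
def scan_seconds(data_mb, disks, mb_per_second_per_disk=100):
    """Time to read all the data when it is split evenly across the disks."""
    return data_mb / (disks * mb_per_second_per_disk)


ONE_TB_IN_MB = 1_000_000

single_disk = scan_seconds(ONE_TB_IN_MB, disks=1)      # 10000.0 seconds (~2.8 hours)
hundred_disks = scan_seconds(ONE_TB_IN_MB, disks=100)  # 100.0 seconds
```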


The Coin Counting Example
You have a sack full of coins, and you are asked to separate
them into 1, 2, 5 and 10 Rs coins and tell how many of each
are present.

Now, let's say you have a few sacks full of coins and it will take
you a lot of time to count them yourself – so you call a few other
people to help you out.

Now, let's say there are a few rooms full of coins (like in some
large temples in India) – how will you count them?


Coin Counting Problem – in depth
You can't add more people to the same room – the
room is already full.

You can get a few more rooms, ask people to take some
coins to the other room and then do the counting
there, and come back with the coins and the final count.

This will mean a lot of “traffic” in the corridor.

So what's a better solution?


A Possible Solution to the Coin Counting Problem

Unload the coins in different rooms rather than in the
same room.

Then get workers in different rooms. With an increase
in coins, increase the number of rooms and workers.

Let the workers in each room work independently.



     This is how Map/Reduce frameworks work
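The coin counting solution can be sketched in Python. Each room is counted independently (here by a pool of threads standing in for workers on separate machines) and only the small per-room tallies are combined at the end. The room contents and function names are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_room(coins):
    """Work done inside one room: tally the coins by denomination."""
    return Counter(coins)


def combine(room_counts):
    """Final step: add up the small per-room tallies."""
    total = Counter()
    for counts in room_counts:
        total.update(counts)
    return total


# Three rooms, each holding coins of 1, 2, 5 and 10 Rs.
rooms = [
    [1, 2, 2, 5],
    [10, 10, 1],
    [5, 5, 2, 1],
]

# Workers in each room count independently; no coins leave the room.
with ThreadPoolExecutor() as pool:
    per_room = list(pool.map(count_room, rooms))

totals = combine(per_room)
```

Only the tallies travel through the "corridor", not the coins themselves.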


Traditional Parallel Processing
Use of threads, sharing data, synchronization

Results in Deadlocks, Livelocks, Starvation etc

Handling failures is complex



       Parallel Programming is hard this way.




Requirements from a parallel processing framework
 Higher level programming constructs – don't need to deal with sockets,
 threading, locking, sharing data etc

 Manage failures - if a task fails or a system breaks down, we want the
 framework to transparently manage it

 Recoverability - If a system fails, another system must be able to pick up
 its workload

 Replication – if a system fails, we don't lose data – the framework
 should replicate data in multiple nodes

 Scalability – Adding more compute nodes should help us increase the
 compute capacity

Pulling data Or Pushing Computations?
Pulling data for computation results in a bottleneck

Every “database store” also has a “processor”.

Instead of pulling the data for computation, can we
think about pushing the computation out to where the
data resides?

The computation itself is small: perhaps a few MB of object
code, which is trivial compared to the data it works
on

MapReduce
Concept introduced by Google in 2004

Framework is inspired by map and reduce functions
found in functional programming languages

Hadoop is an open-source implementation of
MapReduce
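The functional-programming roots can be seen with Python's built-in `map` and `functools.reduce`. The word count below runs on a single machine and is not Hadoop's API. It leaves out the distribution, shuffling and fault tolerance that a real framework provides. The input lines and function names are illustrative.

```python
from functools import reduce

lines = ["big data", "web scale data", "big web"]


def map_line(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(totals, pair):
    """Reduce step: fold one (word, count) pair into the running totals."""
    word, count = pair
    totals[word] = totals.get(word, 0) + count
    return totals


# Apply the map function to every line, then flatten the emitted pairs.
mapped = map(map_line, lines)
pairs = [pair for line_pairs in mapped for pair in line_pairs]

# Fold all the pairs into a single dictionary of word counts.
word_counts = reduce(reduce_counts, pairs, {})
```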




MapReduce Frameworks
Data is spread throughout machines before starting
the task

Computation is done in the nodes where data is stored

Data is replicated in multiple machines to increase
reliability

The same task may be executed on multiple nodes in case one
of them is running slow


Using the Common Crawl Data – A Case Study
 The dump is a few tens of TB in size

 Where/How do you download it?

 Answer: You don't need to download it

 Instead you push your computation to where the data
 exists, perform your computation and then only fetch
 results you are interested in!




Recap
My knowledge of computer science:

  Am I ever going to use this/need this as an
  engineer?

  How do I use this knowledge to understand the
  latest developments in software engineering?



         Hope you have an answer now!


Parting Thoughts
Technology changes very rapidly – don't expect to be
spoon-fed

Practise, Practise, Practise - Katas

Concept before Technology

Try out new things – even if they are not related to your
project/curriculum

Read and understand other people's code

Read a lot, for example: http://highscalability.com/

We at jnaapti conduct workshops and provide
training on these technologies – contact us at
      http://jnaapti.com/ for more details




For feedback/clarification email:
    gautham-at-jnaapti.com




Thanks and...




All the Best




Happy Hacking




Sources
Twitter - http://blog.twitter.com/2011/03/numbers.html

Tumblr -
http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-bill

Facebook log data -
http://www.facebook.com/note.php?note_id=409881258919

Facebook messages -
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time

Vestas -
http://www-01.ibm.com/software/success/cssdb.nsf/cs/RMUE-8NMJQ

                        http://jnaapti.com/

More Related Content

Similar to Dealing with web scale data

Metadata in a Crowd: Shared Knowledge Production
Metadata in a Crowd: Shared Knowledge ProductionMetadata in a Crowd: Shared Knowledge Production
Metadata in a Crowd: Shared Knowledge Production
Kevin Rundblad
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
Eli White
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
butest
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
butest
 
Georgia Tech Hack Day
Georgia Tech Hack DayGeorgia Tech Hack Day
Georgia Tech Hack Day
Christian Heilmann
 
Hacking For Innovation
Hacking For InnovationHacking For Innovation
Hacking For Innovation
Christian Heilmann
 
A tale of two proxies
A tale of two proxiesA tale of two proxies
A tale of two proxies
SensePost
 
unit 3(big data, AI).pptx
unit 3(big data, AI).pptxunit 3(big data, AI).pptx
unit 3(big data, AI).pptx
NavdeepMathur6
 
Tokens, Complex Systems, and Nature
Tokens, Complex Systems, and NatureTokens, Complex Systems, and Nature
Tokens, Complex Systems, and Nature
Trent McConaghy
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
Department of Communication Science, University of Amsterdam
 
Introduction Big data
Introduction Big data  Introduction Big data
Introduction Big data
مروان الوجيه
 
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
guest5b1607
 
Yahoo for the Masses
Yahoo for the MassesYahoo for the Masses
Yahoo for the Masses
Christian Heilmann
 
DataHub
DataHubDataHub
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)
Pavlo Baron
 
ItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information IntegrationItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information Integration
keepingfoundthingsfound
 
Semantic Web 2.0
Semantic Web 2.0Semantic Web 2.0
Semantic Web 2.0
hchen1
 
Walter api
Walter apiWalter api
Walter api
Nicholas Schiller
 
Semantic.edu, an introduction
Semantic.edu, an introductionSemantic.edu, an introduction
Semantic.edu, an introduction
Bryan Alexander
 
Semantic Web, NON-technically speaking
Semantic Web, NON-technically speakingSemantic Web, NON-technically speaking
Semantic Web, NON-technically speaking
guest30b2a1
 

Similar to Dealing with web scale data (20)

Metadata in a Crowd: Shared Knowledge Production
Metadata in a Crowd: Shared Knowledge ProductionMetadata in a Crowd: Shared Knowledge Production
Metadata in a Crowd: Shared Knowledge Production
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Georgia Tech Hack Day
Georgia Tech Hack DayGeorgia Tech Hack Day
Georgia Tech Hack Day
 
Hacking For Innovation
Hacking For InnovationHacking For Innovation
Hacking For Innovation
 
A tale of two proxies
A tale of two proxiesA tale of two proxies
A tale of two proxies
 
unit 3(big data, AI).pptx
unit 3(big data, AI).pptxunit 3(big data, AI).pptx
unit 3(big data, AI).pptx
 
Tokens, Complex Systems, and Nature
Tokens, Complex Systems, and NatureTokens, Complex Systems, and Nature
Tokens, Complex Systems, and Nature
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
Introduction Big data
Introduction Big data  Introduction Big data
Introduction Big data
 
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petaby...
 
Yahoo for the Masses
Yahoo for the MassesYahoo for the Masses
Yahoo for the Masses
 
DataHub
DataHubDataHub
DataHub
 
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)
 
ItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information IntegrationItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information Integration
 
Semantic Web 2.0
Semantic Web 2.0Semantic Web 2.0
Semantic Web 2.0
 
Walter api
Walter apiWalter api
Walter api
 
Semantic.edu, an introduction
Semantic.edu, an introductionSemantic.edu, an introduction
Semantic.edu, an introduction
 
Semantic Web, NON-technically speaking
Semantic Web, NON-technically speakingSemantic Web, NON-technically speaking
Semantic Web, NON-technically speaking
 



Dealing with web scale data

  • 1. Dealing with Web Scale Data Gautham Pai, Founder jnaapti
  • 2. A Few Guidelines Ask questions – be active What I cover depends on how active you are Learn concepts before technology You will be bombarded with several concepts, tools and technologies – just remember that you are learning to bridge concepts and technology. After this program, you should be comfortable dabbling with these concepts on your own – even reading things that are not covered today. http://jnaapti.com/
  • 3. The Different Vases Source: http://www.flickr.com/photos/bachmont/1382572541/ http://jnaapti.com/
  • 4. The Different Vases :( Not preferable Ideal! Sufficient Source: http://www.flickr.com/photos/bachmont/1382572541/ http://jnaapti.com/
  • 5. Quick Poll How many of you are from a CS background? Knowledge of: Data Structures, Algorithms, Databases. Have heard of: NoSQL, Key-Value Stores, Cloud Computing, MapReduce, Hadoop. http://jnaapti.com/
  • 6. Part 0 – Setting the Context
  • 7. What is this talk about? 2 themes in this talk: About data – how is it stored, how do we work with it About understanding technology via concepts learnt http://jnaapti.com/
  • 8. How much data are we talking about really? 200 million Tweets per day (as of Jun 2011); Wikipedia dump: current revisions only are 31GB uncompressed, entire history runs into multiple TBs uncompressed; Common Crawl data: 10s of TBs; Tumblr: adding 3TB of new data every day; Google processes 25PB of data per day; Facebook: 135+ billion messages a month; Facebook: 130TB of logs generated per day; Vestas wind data: 18 to 24 petabytes of data to be processed. http://jnaapti.com/
  • 9. We are dealing with a lot more data... Increase in the number of sensor devices Larger audience of users using our applications via the web and social networks results in increased data generation Cost of storage is falling – so we never discard any of the data http://jnaapti.com/
  • 10. What's in it for me? Scrabulous case study: built by 2 young chaps from Kolkata. Both were in their early 20s when they built it; one was still in college. 500,000 users daily back in 2008, and $25,000 in ad revenue per month. These days lots of apps are being built by college undergraduates. If they can do it, you can do it too! Source: Wikipedia http://jnaapti.com/
  • 11. You have all it takes You have access to a lot of the tools that big corporations use for free You have computing power available cheaply You have access to a lot of the data for free http://jnaapti.com/
  • 12. What do I need then? All you need is a little intelligence and a lot of perseverance and you are on your way! http://jnaapti.com/
  • 13. Questions to ask Ok, you have the resources. You build a cool web application. It is an overnight hit - can you handle it? What happens if the server has a disk crash? Can we prevent website outages on account of hardware failures? (Slide label: Slashdot Effect) http://jnaapti.com/
  • 14. Looking for answers What do technology companies like Google/Facebook/Twitter use to manage data? What challenges do they face in managing such huge volumes of data? How do they analyze such data? Image Source: http://opencompute.org/ http://jnaapti.com/
  • 15. From concept to technology We learn quite a few subjects in Computer Science – data structures, algorithms, databases, networking, operating systems, graph theory, etc. Are we ever going to use this/need this as engineers? How do I use my knowledge of CS to understand the latest developments in the industry? Image Source:http://www.flickr.com/photos/nics_events/2223583947/ http://jnaapti.com/
  • 16. From concept to technology This talk is about connecting concepts to real world examples Image Source:http://www.flickr.com/photos/nics_events/2223583947/ http://jnaapti.com/
  • 17. A few snappy examples Analysis of question papers from various companies. Analysis of image patterns in your photos and movie collections. Analysis of your Facebook friends: 2nd degree connections, who is active at what time, who talks about what. http://jnaapti.com/
  • 18. Part 1 – Dealing with Data
  • 19. What is this section all about? Before dealing with big-data problems, we first need to know how data is handled. This section tries to answer questions like: How is it that 0's and 1's are sufficient to do anything that a computer does? Why do we need data structures? Why do we need databases – why can't I just store all data as flat files? http://jnaapti.com/
  • 20. Computers – A Bit Processor Computers only understand bits. They have a way to store and process these bits. It is up to users to give the bits a “meaning”. (Illustration: rows of bits such as 0 0 1 0 0 1 0 1 1 0 0 1 0 0 1 0) http://jnaapti.com/
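To make the last point concrete, here is a minimal Python sketch that reads the same 16 bits in two different ways. The bit pattern and the two interpretations are chosen for illustration and are not taken from the slide.

```python
bits = "0100100001101001"  # 16 bits picked for this example (an assumption)

# Interpretation 1: one unsigned 16-bit integer.
as_number = int(bits, 2)

# Interpretation 2: two 8-bit ASCII characters.
raw = as_number.to_bytes(2, byteorder="big")
as_text = raw.decode("ascii")

print(as_number)  # 18537
print(as_text)    # Hi
```

The bits never change; only the "meaning" we choose to give them does.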
  • 21. Data Structures Data structure is like a cast Pour your bits into it and a 'shape' is created The 'shape' helps us provide a meaning to the bits Image Source: http://www.flickr.com/photos/andrein/3020194734/ http://jnaapti.com/
  • 22. Programming Languages Human mind does not understand bits. We need higher level constructs to process bits. This is where programming languages come in. They act as a bridge between what humans want to do and what machines understand. Image Source: http://www.flickr.com/photos/jurvetson/5872448596/ http://jnaapti.com/
  • 23. Programming Languages Concepts: Variables, Types, Operators, Conditionals, Looping, Libraries. Examples: a = 10, b = 20; c = a + b; if condition: do_this(); for i in range(10): do_this(); urllib.urlopen('http://yahoo.com/').read(); [str.lower() for str in list_of_strings] http://jnaapti.com/
  • 24. Primitive Types Languages usually have two primitive types: Numbers (Integers, Floats, Doubles etc.) and Strings (a sequence of characters put together). Why these two types? Why not just strings? Examples: 'bangalore', 123, 567.89, 0, -123, -567.89, '123' http://jnaapti.com/
  • 25. Composite Types (or Collections) The world is complex. We cannot model everything with only strings and numbers. We need ways to put primitive values together to form more complex types. Collections are a bag of values put together. Bottom up v/s Top down. Examples: Name → First Name + Last Name; Phone No → (Country Code) Area Code + Subscriber Number; Address → Door No + Street + City + State + Pin Code; Composite of composites: Person → Name + Phone No + Address; Group of People. http://jnaapti.com/
  • 26. Collections – General Object Containers We can represent anything in the world using collections. Collections can be mapped to bits. Computers can interpret those bits. As a matter of fact, this is what JSON allows you to do. http://jnaapti.com/
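One possible answer to "why not just strings?", sketched in Python: the same characters behave differently depending on whether they are treated as numbers or as strings. The variable names are illustrative.

```python
# As numbers, arithmetic and ordering behave the way we expect.
number_sum = 123 + 1         # 124
number_order = 9 < 10        # True

# As strings, "+" joins characters and ordering goes character by character.
string_sum = '123' + '1'     # '1231'
string_order = '9' < '10'    # False, because '9' sorts after '1'

print(number_sum, number_order, string_sum, string_order)
```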
  • 27. Collections Three basic types of collections: Lists Sets Maps http://jnaapti.com/
  • 28. Collections – Lists Grocery shopping example. Order of items matters. Do items need to be of the same type? The key identifier is the position of the item in the list. Operations on a list: add an item to the list, remove an item from the list, get an item from the list at a specific position. http://jnaapti.com/
  • 29. Collections – Sets Items in a set are unique There is no definite order Operations on a set: Add items to the set Test if an item exists in the set Remove an item from the set http://jnaapti.com/
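The list and set operations named on the last two slides can be sketched with Python's built-in `list` and `set`. The grocery items are example data only.

```python
# List: order matters, items are identified by position.
groceries = ['toothpaste', 'matchbox', 'tomatoes']
groceries.append('chips')        # add an item to the list
groceries.remove('matchbox')     # remove an item from the list
second = groceries[1]            # get the item at a specific position

# Set: items are unique and have no definite order.
seen = {'toothpaste', 'chips'}
seen.add('chips')                # adding a duplicate changes nothing
has_chips = 'chips' in seen      # test if an item exists in the set
seen.discard('toothpaste')       # remove an item from the set

print(groceries, second, seen, has_chips)
```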
  • 30. Collections - Maps Lots of maps in the real world. Indices are not always integers in the real world. We may want to identify properties of an item using some name. Examples: grocery list (Toothpaste - 1, Rs. 54; Matchbox - 10, Rs. 15; Tomatoes - 1kg, Rs. 10; Chips - 1, Rs. 15); dictionary of word definitions; phone book containing phone numbers. http://jnaapti.com/
  • 31. Collections – Maps contd... Maps allow us to associate a key with a value. The name that is used to identify the set of properties is called a key. The properties identified are called the value. Examples: Grocery list: item is the key, properties are values; Dictionary as a map: keys are the words, values are the definitions; Phone book as a map: keys are the names, values are the phone numbers. http://jnaapti.com/
  • 32. Collections – Maps contd... Keys don't have a definite order. Operations on a map: put a key, value pair; get a value for a key; get me all the keys and I will look at them one by one. Important: the analogy breaks here - don't get confused with the way a map works – keys don't have an order... Looking up keys, not values - you don't say get me the word whose definition is ... http://jnaapti.com/
  • 33. More composite types List of lists, List of maps, Map of maps, ... Examples: a list of people is a list of maps; mailboxes containing mails is a map of maps. http://jnaapti.com/
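The map operations above, sketched with a Python `dict` using the grocery list from slide 30. The field names `quantity` and `price_rs` are assumptions made for this example.

```python
grocery = {
    'toothpaste': {'quantity': '1', 'price_rs': 54},
    'matchbox': {'quantity': '10', 'price_rs': 15},
    'tomatoes': {'quantity': '1kg', 'price_rs': 10},
}

grocery['chips'] = {'quantity': '1', 'price_rs': 15}   # put a key, value pair
tomato_price = grocery['tomatoes']['price_rs']         # get a value for a key

# Maps look up by key, not by value: finding the items with a given
# price means going through all the keys one by one.
items_costing_15 = sorted(
    item for item in grocery if grocery[item]['price_rs'] == 15
)

print(tomato_price, items_costing_15)
```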
  • 34. Interfaces and Implementation Lists implemented using Arrays or Linked Lists Maps implemented using Hashtables http://jnaapti.com/
  • 35. Hashtables Run the key through a magic function that gives you a number. The number identifies a slot in an array. The magic function is called a “hash function” - it is chosen such that there are minimal collisions and the most uniform distribution. Image Source: Wikipedia http://jnaapti.com/
  • 36. Gmail – An Example What data structures do we use here? Entities: Mail, Mailbox, Person, Label. A mailbox has a list of mails. A mail can be represented using a map. http://jnaapti.com/
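A toy hash table in Python may help make this concrete. It is a sketch only: the hash function (sum of character codes) is deliberately naive so that a collision is easy to see, and the class and method names are invented for this example. Collisions are handled by keeping a small list of pairs in each slot.

```python
class ToyHashTable:
    """A minimal hash table with chaining; string keys only."""

    def __init__(self, slots=8):
        self.buckets = [[] for _ in range(slots)]

    def _slot(self, key):
        # Toy hash function: sum of character codes, wrapped to the array size.
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self.buckets[self._slot(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ToyHashTable()
table.put('ab', 1)
table.put('ba', 2)   # same characters, so this toy hash puts it in the same slot
table.put('ab', 3)   # replaces the earlier value for 'ab'
print(table.get('ab'), table.get('ba'))  # 3 2
```

A better hash function would spread 'ab' and 'ba' across different slots, which is what "minimal collisions" on the slide refers to.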
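As a rough sketch of that last point, a mailbox can be modelled in Python as a list of maps. The field names and addresses below are hypothetical and say nothing about how Gmail actually stores mail.

```python
mailbox = [
    {'from': 'alice@example.com', 'subject': 'Lunch?', 'labels': {'personal'}},
    {'from': 'bob@example.com', 'subject': 'Build failed', 'labels': {'work', 'alerts'}},
]

# Each mail is a map, so its properties are looked up by name.
work_subjects = [mail['subject'] for mail in mailbox if 'work' in mail['labels']]
print(work_subjects)  # ['Build failed']
```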
  • 37. Gmail – An Example What is the mailbox size? How much RAM does a system have? If all the data of the world could fit into the RAM of a single machine, we wouldn't have a lot of the problems we face. Luckily, that's not the case! Properties of RAM: it is limited in capacity; it is volatile (data disappears on reboot); max data in memory is 256GB. Conclusion: We need the disk. http://jnaapti.com/
  • 38. Hmm... Our First “Big” Data Problem Let us say the data is present as a huge 7 GB file on the disk. What is the amount of time it takes to read this file into memory? How do I measure disk speeds? http://jnaapti.com/
  • 39. Measuring Disk Read Speed $ date;cat a_very_large_file > /dev/null;date $ iotop http://jnaapti.com/
  • 40. Disk Read Speed We can get disk read speeds close to 80MB/s Let's round it off to 100MB/s Reading 7000MB would take 70 seconds Would you wait if Gmail took 70 seconds to fetch your mails? Remember, parallel read accesses and writes slow it down further. Hmm, ok, this doesn't work, we need something faster, solution? http://jnaapti.com/
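The arithmetic on this slide is just size divided by throughput. A small Python sketch (the function name is illustrative) shows both the rounded figure and the measured 80MB/s figure:

```python
def read_time_seconds(size_mb, throughput_mb_per_s):
    """Time to read size_mb sequentially at a steady throughput."""
    return size_mb / throughput_mb_per_s

rounded = read_time_seconds(7000, 100)   # the slide's rounded-off speed
measured = read_time_seconds(7000, 80)   # the speed actually observed

print(rounded)   # 70.0
print(measured)  # 87.5
```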
  • 41. How do we solve this? Imagine a world where there are no databases - you have a hard-disk and you are asked to solve this problem. We need to be able to read only the data we want as quickly as we can. How do we solve this? http://jnaapti.com/
  • 42. Solution Store data in fixed sized records and then have a way to jump to the starting location of a specific record http://jnaapti.com/
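A minimal Python sketch of this idea, assuming a hypothetical record layout of a 4-byte id followed by a 20-byte name. Because every record has the same size, record n starts at byte n × record size, so we can seek straight to it instead of reading the whole file.

```python
import struct
import tempfile

# Hypothetical layout: little-endian 4-byte unsigned id + 20-byte name.
RECORD = struct.Struct('<I20s')

def write_records(f, rows):
    for record_id, name in rows:
        # Short names are padded with zero bytes up to 20 bytes.
        f.write(RECORD.pack(record_id, name.encode('utf-8')))

def read_record(f, n):
    f.seek(n * RECORD.size)   # jump to the starting location of record n
    record_id, raw_name = RECORD.unpack(f.read(RECORD.size))
    return record_id, raw_name.rstrip(b'\x00').decode('utf-8')

with tempfile.TemporaryFile() as f:
    write_records(f, [(1, 'asha'), (2, 'bala'), (3, 'chitra')])
    third = read_record(f, 2)

print(third)  # (3, 'chitra')
```

The trade-off is that names longer than 20 bytes do not fit, which is one reason fixed-size fields need to be chosen with care.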
  • 43. Relational Databases Relational databases are an abstraction of your filesystem to deal with “relational” data. http://jnaapti.com/
  • 44. A word about Abstraction Reading from a disk: instruct the hardware to move the read head to a specific location, then read the data. Reading from a file: open the file, read it, close it. Reading from a database: connect to the DB, query for data, close the connection. One of the skills you can pick up as an engineer is being able to define an operation at every level of abstraction. http://jnaapti.com/
  • 45. Relational Database Design Define Entities and their Relationships Handling 1..1, 1..n and m..n relationships Perform normalization Take the entities and their relationships and come up with tables, fields, primary keys and foreign keys Define queries to add, update, fetch and delete data http://jnaapti.com/
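These design steps can be tried out with Python's built-in `sqlite3` module. The sketch below assumes two hypothetical entities, `person` and `mail`, in a 1..n relationship (one person sends many mails); the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')   # SQLite needs this to enforce foreign keys

# Entities become tables; the 1..n relationship becomes a foreign key.
conn.execute('CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)')
conn.execute('''
    CREATE TABLE mail (
        id INTEGER PRIMARY KEY,
        sender_id INTEGER NOT NULL REFERENCES person(id),
        subject TEXT NOT NULL
    )
''')

# Queries to add, update, fetch and delete data.
conn.execute("INSERT INTO person (id, name) VALUES (1, 'Asha')")
conn.executemany(
    'INSERT INTO mail (sender_id, subject) VALUES (?, ?)',
    [(1, 'Hello'), (1, 'Lunch?')],
)
conn.execute("UPDATE mail SET subject = 'Lunch today?' WHERE subject = 'Lunch?'")
rows = conn.execute('''
    SELECT person.name, mail.subject
    FROM mail JOIN person ON person.id = mail.sender_id
    ORDER BY mail.id
''').fetchall()
conn.execute("DELETE FROM mail WHERE subject = 'Hello'")
remaining = conn.execute('SELECT COUNT(*) FROM mail').fetchone()[0]
conn.close()

print(rows)       # [('Asha', 'Hello'), ('Asha', 'Lunch today?')]
print(remaining)  # 1
```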
  • 46. Mapping Design to Implementation Data is stored in tables (which map to entities) Tables contain records (rows) and fields (columns) Records are of fixed length Records are stored sequentially http://jnaapti.com/
  • 47. Relational Databases – Storage Structure Use hash-tables to point to records in the tables – so individual records can be retrieved without having to search the entire dataset. This process is called “indexing”. In theory you can have many such indexes. Foreign keys are also indexed to speed up the lookup. http://jnaapti.com/
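A small Python sketch of what an index buys us, using a `dict` as the hash table to mirror the slide's framing. The records are example data. Many real engines use B-tree variants for indexes (B+ trees are listed on the next slide), so treat this as an illustration of the idea rather than of any particular database.

```python
records = [(17, 'asha'), (42, 'bala'), (8, 'chitra')]  # (key, name), in storage order

def find_by_scan(key):
    """Without an index: look at every record until the key turns up."""
    for position, (record_key, _name) in enumerate(records):
        if record_key == key:
            return position
    return None

# With an index: one hash lookup gives the record's position directly.
index = {record_key: position for position, (record_key, _name) in enumerate(records)}

position = index[8]
print(records[position])  # (8, 'chitra')
```

The index has to be updated on every insert and delete, which is the price paid for faster lookups.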
  • 48. Data Storage Structures Ordered/unordered Flat files ISAM Heaps Hash buckets B+ Trees http://jnaapti.com/
  • 49. Part 2 – Dealing with Web Scale Data
  • 50. Web Application Design Client/Server Distributed computing Models/Views/Controllers (MVC) http://jnaapti.com/
  • 51. Client/Server Model http://jnaapti.com/
  • 52. Client/Server Model – Separate DB Layer http://jnaapti.com/
  • 53. Problem 1 – Too Many Requests What if a thousand users access my server at the same time? If the server can handle 200 requests in parallel in one second, what if I have 400 requests per second? 1st second → 400 requests arrive, 200 are served and 200 are left over. 2nd second → 600 requests to handle (200 are from the previous second). This results in server thrashing. Solution: Load Balanced Setup http://jnaapti.com/
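The growing backlog can be simulated in a few lines of Python. This is a simplified model that assumes a constant arrival rate and that unserved requests simply wait; the function name is illustrative.

```python
def pending_per_second(seconds, arrivals_per_s, capacity_per_s):
    """Requests the server faces at the start of each second."""
    waiting = 0
    history = []
    for _ in range(seconds):
        waiting += arrivals_per_s                    # new requests arrive
        history.append(waiting)
        waiting = max(0, waiting - capacity_per_s)   # the server handles what it can
    return history

overloaded = pending_per_second(3, arrivals_per_s=400, capacity_per_s=200)
print(overloaded)  # [400, 600, 800]
```

As long as arrivals exceed capacity, the queue grows by the difference every second and never recovers.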
  • 54. Load Balancing http://jnaapti.com/
  • 55. Load Balancing Load balancing is a way of parallelizing processing across multiple machines The load balancer acts as a proxy that streams requests and responses between the client and the processing server. Eg: HAProxy Stateful and Stateless Architectures http://jnaapti.com/
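One simple policy a load balancer can use is round robin: hand each new request to the next server in turn. The Python sketch below shows only that policy, with invented class and server names; it is not how HAProxy or any specific product is implemented.

```python
import itertools

class RoundRobinBalancer:
    """Assigns each incoming request to the next server in a fixed cycle."""

    def __init__(self, servers):
        self._next_server = itertools.cycle(servers)

    def route(self, request):
        return next(self._next_server), request


balancer = RoundRobinBalancer(['app1', 'app2', 'app3'])
assigned = [balancer.route(f'req{i}')[0] for i in range(5)]
print(assigned)  # ['app1', 'app2', 'app3', 'app1', 'app2']
```

This works cleanly when the servers are stateless. If a server keeps per-user state, the balancer has to keep sending that user to the same server.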
  • 56. Problem 2 – Even More Requests What if the Load Balancer itself becomes the bottleneck? Solution: Round Robin DNS Building multiple independent clusters http://jnaapti.com/
  • 57. Clustered Setup http://jnaapti.com/
  • 59. Problem 3 – The Stateful Database A single database cannot handle all requests from all users. Unlike front-end servers, databases are not “stateless” If we are trying to only read information, it's fine, but if we are trying to write information, this is a problem. http://jnaapti.com/
  • 60. Scale Up v/s Scale Out Scale up means to add resources (CPUs or memory) to a single system in order to increase its processing capabilities. Scale up has limitations in how much we can scale, but is easier to do. Scale out means to add more nodes to a system. Scale out provides linear scalability, is less expensive, but is complex compared to scale-up. http://jnaapti.com/
  • 61. Scale Up Solution to the DB Problem http://jnaapti.com/
  • 62. Scale Up Solution to the DB Problem Increase the system's capacity by adding more resources to the system – faster disks, more RAM, faster processors, more cores etc Introduce on-the-fly compression of data in the database Scale up is not scalable enough http://jnaapti.com/
  • 63. Scale Out Solutions to the DB Problem http://jnaapti.com/
  • 64. Scale Out Solutions to the DB Problem http://jnaapti.com/
  • 65. Scale Out Solutions to the DB Problem Until the virtualization revolution and until we reached the limits of hardware, we were looking at scale up solutions rather than scale out solutions Partition your data and put them on multiple systems – a subset of the rows in each system This is called Sharding http://jnaapti.com/
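A minimal Python sketch of one common way to decide which shard holds a row: hash the row's key and take it modulo the number of shards. The choice of MD5, the shard count and the user names are assumptions made for this example.

```python
import hashlib

def shard_for(key, shard_count):
    """Map a row key to a shard number in a stable, repeatable way."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % shard_count

SHARD_COUNT = 4
shards = {n: [] for n in range(SHARD_COUNT)}
for user in ['asha', 'bala', 'chitra', 'dev', 'esha', 'farid']:
    shards[shard_for(user, SHARD_COUNT)].append(user)

print(shards)
```

A stable hash such as MD5 is used instead of Python's built-in `hash()`, which is randomized per process for strings and would send the same key to different shards on different runs.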
  • 66. Issues with Sharding No clear way of partitioning the data Maintaining ACID (Atomicity, Consistency, Isolation, Durability) properties is complex Joining data across machines is complex Re-sharding is complex http://jnaapti.com/
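To see why re-sharding is painful, the Python sketch below counts how many keys land on a different shard when a hash-modulo scheme grows from 4 to 5 shards. The scheme and the key names are assumptions for illustration. With a uniform hash, roughly four keys in five are expected to move.

```python
import hashlib

def shard_for(key, shard_count):
    """Hash-modulo placement: stable for a fixed shard count."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % shard_count

keys = [f'user{n}' for n in range(1000)]
moved = sum(1 for key in keys if shard_for(key, 4) != shard_for(key, 5))
print(f'{moved} of {len(keys)} keys change shard')
```

Every moved key means data copied between machines, which is the motivation for techniques such as consistent hashing.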
  • 67. Other Issues with Relational Databases Data could be unstructured/semi-structured Impedance mismatch (ORM issues) Sparse values are not handled well - results in wastage of storage (although some engines handle this today) Changes in schema are difficult Not all data require ACID/Transactional support Normalization results in more queries and that means more disk accesses - some apps can do without them http://jnaapti.com/
  • 68. The NoSQL Revolution The NoSQL revolution happened to solve the many issues faced with storing web-scale data in relational databases. NoSQL stores, as the name suggests, don't use SQL to store and retrieve data. Widely adopted in web applications these days, with several solutions available. Still in research: no clear winner, and therefore difficult to choose among alternatives. http://jnaapti.com/
  • 69. Advantages of NoSQL Stores They don't require fixed schemas Avoid joins Sharding (Scale out) is easier – some even do it automatically Many of the implementations replicate the data and thus avoid SPOFs (Single Point of Failure) http://jnaapti.com/
  • 70. Examples of NoSQL Stores MongoDB CouchDB Neo4J Cassandra BigTable ... http://jnaapti.com/
  • 71. Types of NoSQL Stores Key/Value Document Stores Graph Databases Object Databases RDF Databases ... http://jnaapti.com/
  • 72. NoSQL Storage Structures Distributed Hashtables Consistent Hashing Order-Preserving Partitioning B-tree COW B-tree Stratified B-tree http://jnaapti.com/
  • 73. Part 3 – Analyzing Web Scale Data
  • 74. Examples of Web Scale Data Analysis Distributed Grep - Look for a pattern in all the Tweets Inverted Index Building - This is what is used by search engines Sentiment Analysis Competition Analysis Log Analysis http://jnaapti.com/
  • 75. Understanding the problem of Analysis Unlike in the case of retrieving data, in the case of analysis we need to read through everything, but reads from disk are slow. Let's do some simple math: 1 hard disk reads at 100MB/s; 100 hard disks read in parallel give 10GB/s! Can we exploit this parallelism? http://jnaapti.com/
  • 76. The Coin Counting Example You have a sack full of coins, and you are asked to separate them into 1, 2, 5 and 10 Rs coins and tell how many of each are present. Now, let's say you have a few sacks full of coins and it will take you a lot of time to count them yourself, so you call a few other people to help you out. Now, let's say there are a few rooms full of coins (like in some large temples in India) – how will you count them? http://jnaapti.com/
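The same arithmetic as a Python sketch, applied to a 1TB dataset (taken here as 1,000,000MB). It assumes the data is spread evenly across the disks and that nothing else, such as the network, becomes the bottleneck.

```python
def scan_time_seconds(data_mb, disks, mb_per_s_per_disk=100):
    """Time to read all the data when every disk reads its share in parallel."""
    return data_mb / (disks * mb_per_s_per_disk)

one_disk = scan_time_seconds(1_000_000, disks=1)
hundred_disks = scan_time_seconds(1_000_000, disks=100)

print(one_disk)       # 10000.0 seconds, close to 3 hours
print(hundred_disks)  # 100.0 seconds
```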
  • 77. Coin Counting Problem – in depth You can't add more people to the same room – the room is already full. You can get a few more rooms, ask people to take some coins to the other room and then do the counting there, and come back with the coins and the final count. This will mean a lot of “traffic” in the corridor. So what's a better solution? http://jnaapti.com/
  • 78. A Possible Solution to the Coin Counting Problem Unload the coins in different rooms rather than in the same room. Then get workers in different rooms. With an increase in coins, increase the number of rooms and workers. Let the workers in each room work independently. This is how Map/Reduce frameworks work http://jnaapti.com/
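The coin counting solution can be sketched in Python in a single process: each room is counted independently, then the per-room tallies are combined. The coin values in each room are made-up example data, and a real framework would run the per-room counting on different machines.

```python
from collections import Counter

rooms = [
    [1, 2, 5, 10, 1, 1],
    [5, 5, 2],
    [10, 1, 2, 2],
]

def count_room(coins):
    """The work done inside one room, independent of all other rooms."""
    return Counter(coins)

def merge_counts(partials):
    """Combine the tallies that come back from each room."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

totals = merge_counts(count_room(room) for room in rooms)
print(dict(sorted(totals.items())))  # {1: 4, 2: 4, 5: 3, 10: 2}
```

Only the small tallies travel between rooms, never the coins themselves, which is what keeps the "corridor" free of traffic.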
  • 79. Traditional Parallel Processing Use of threads, sharing data, synchronization Results in Deadlocks, Livelocks, Starvation etc Handling failures is complex Parallel Programming is hard this way. http://jnaapti.com/
  • 80. Requirements from a parallel processing framework Higher level programming constructs – don't need to deal with sockets, threading, locking, sharing data etc Manage failures - if a task fails or a system breaks down, we want the framework to transparently manage it Recoverability - If a system fails, another system must be able to pick up its workload Replication – if a system fails, we don't lose data – the framework should replicate data in multiple nodes Scalability – Adding more compute nodes should help us increase the compute capacity http://jnaapti.com/
  • 81. Pulling data Or Pushing Computations? Pulling data for computation results in a bottleneck. Every “database store” also has a “processor”. Instead of pulling the data for computation, can we think about pushing the computation out to where the data resides? The computation is small in bytes, maybe a few MB of object code, which is still trivial compared to the data it works on. http://jnaapti.com/
  • 82. MapReduce Concept introduced by Google in 2004. The framework is inspired by the map and reduce functions found in functional programming languages. Hadoop is an open-source implementation of MapReduce. http://jnaapti.com/
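A single-process Python sketch of the map, shuffle and reduce phases, using word counting as the task. It only illustrates the programming model: the function names are invented, the input lines are example data, and none of this is Hadoop's API.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all the values for one key into a single result."""
    return key, sum(values)

lines = ['big data is big', 'data is everywhere']
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, a framework is free to run them on many machines at once.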
  • 83. MapReduce Frameworks Data is spread throughout machines before starting the task Computation is done in the nodes where data is stored Data is replicated in multiple machines to increase reliability Tasks are executed on multiple nodes just in case one of them is running slow http://jnaapti.com/
  • 84. Using the Common Crawl Data – A Case Study The dump is a few 10s of TBs in size Where/How do you download it? Answer: You don't need to download it Instead you push your computation to where the data exists, perform your computation and then only fetch results you are interested in! http://jnaapti.com/
  • 85. Recap My knowledge of computer science: Am I ever going to use this/need this as an engineer? How do I use this knowledge to understand the latest developments in software engineering? Hope you have an answer now! http://jnaapti.com/
  • 86. Parting Thoughts Technology changes very rapidly – don't expect to be spoon-fed Practise, Practise, Practise - Katas Concept before Technology Try out new things – even if they are not related to your project/curriculum Read and understand other people's code Read a lot, for example: http://highscalability.com/ http://jnaapti.com/
  • 87. We at jnaapti conduct workshops and provide training on these technologies – contact us at http://jnaapti.com/ for more details http://jnaapti.com/
  • 88. For feedback/clarification email: gautham-at-jnaapti.com http://jnaapti.com/
  • 90. All the Best http://jnaapti.com/
  • 91. Happy Hacking http://jnaapti.com/
  • 92. Sources Twitter - http://blog.twitter.com/2011/03/numbers.html Tumblr - http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-bill Facebook log data - http://www.facebook.com/note.php?note_id=409881258919 Facebook messages - http://highscalability.com/blog/2010/11/16/facebooks-new-real-time Vestas - http://www-01.ibm.com/software/success/cssdb.nsf/cs/RMUE-8NMJQ http://jnaapti.com/