The Big Data Stack

Zubair Nabi
zubair.nabi@cantab.net
7 January, 2014

Data Timeline
• 0 (fork()) to 2003: 5EB
• 2012: 2.7ZB
• 2015: 8ZB

Example*: Facebook
• 2.5B – content items shared
• 2.7B – ‘Likes’
• 300M – photos uploaded
• 105TB – data scanned every 30 minutes
• 500+TB – new data ingested
• 100+PB – data warehouse

* VP Engineering, Jay Parikh – 2012

Example: Facebook’s Haystack*
• 65B photos
– 4 images of different sizes stored for each photo
– For a total of 260B images and 20PB of storage

• 1B new photos uploaded each week
– Increment of 60TB

• At peak traffic, 1M images served per second
• An image request is like finding a needle in a
haystack
*Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in
Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX conference on Operating systems
design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 1-8.
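
The Haystack figures above can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1PB = 10^15 bytes, 1TB = 10^12 bytes), which the slide does not state; the variable names are illustrative only.

```python
# Figures taken from the slide; decimal units are an assumption.
PHOTOS = 65 * 10**9                  # 65B photos
SIZES_PER_PHOTO = 4                  # 4 images stored per photo
TOTAL_STORAGE_BYTES = 20 * 10**15    # 20PB
WEEKLY_PHOTOS = 10**9                # 1B new photos per week
WEEKLY_BYTES = 60 * 10**12           # 60TB weekly increment

total_images = PHOTOS * SIZES_PER_PHOTO
avg_image_bytes = TOTAL_STORAGE_BYTES / total_images
bytes_per_new_photo = WEEKLY_BYTES / WEEKLY_PHOTOS

print(f"total images: {total_images:,}")
print(f"average stored image: {avg_image_bytes / 1000:.0f} KB")
print(f"storage per new photo (all sizes): {bytes_per_new_photo / 1000:.0f} KB")
```

The two per-item averages differ because one covers the whole existing store and the other covers only the weekly increment.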

More Examples
• The LHC at CERN generates 22PB of data annually
(after throwing away around 99% of readings)
• The Square Kilometre Array (under construction)
is expected to generate hundreds of PB each day
• Farecast, a part of Bing, searches through 225B
flight and price records to advise customers on
their ticket purchases

More Examples (2)
• The amount of annual traffic flowing over the
Internet is around 700EB
• Walmart handles in excess of 1M transactions
every hour (25PB in total)
• 400M Tweets every day

Big Data
• Large datasets whose processing and storage
requirements exceed all traditional paradigms and
infrastructure
– On the order of terabytes and beyond

• Generated by web 2.0 applications, sensor networks,
scientific applications, financial applications, etc.
• Radically different tools needed to record, store,
process, and visualize
• Moving away from the desktop
• Offloaded to the “cloud”
• Poses challenges for computation, storage, and
infrastructure

The Stack
• Presentation layer
• Application layer: processing + storage
• Operating System layer
• Virtualization layer (optional)
• Network layer (intra- and inter-data center)
• Physical infrastructure layer
Can roughly be called the “cloud”

Presentation Layer
• Acts as the user-facing end of the entire
ecosystem
• Forwards user queries to the backend
(potentially the rest of the stack)
• Can be both local and remote
• For most web 2.0 applications, the
presentation layer is a web portal

Presentation Layer (2)
• For instance, the Google search website is a
presentation layer
– Takes user queries
– Forwards them to a scatter-gather application
– Presents the results to the user (within a time
bound)
• Made up of many technologies, such as HTTP, HTML,
AJAX, etc.
• Can also be a visualization library
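
The query flow above (take a query, forward it to a scatter-gather application, return results within a time bound) might be sketched as follows. This is a toy illustration, not how any particular search engine works: the shards are in-memory lists, and `search_shard`, `scatter_gather`, and the `time_bound` parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def search_shard(shard, query):
    """Hypothetical per-shard search: return the documents containing the query."""
    return [doc for doc in shard if query in doc]


def scatter_gather(shards, query, time_bound=0.5):
    """Scatter the query to every shard, gather whatever finishes in time."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(shards)))
    futures = [pool.submit(search_shard, shard, query) for shard in shards]
    done, _not_done = wait(futures, timeout=time_bound)
    pool.shutdown(wait=False)  # do not block on shards that missed the deadline
    results = []
    for future in futures:     # keep shard order so the output is deterministic
        if future in done:
            results.extend(future.result())
    return results


shards = [["big data", "small data"], ["big table"], ["storm"]]
print(scatter_gather(shards, "big"))
```

Shards that miss the deadline are simply left out of the answer, which trades completeness for a bounded response time.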

Application Layer
• Serves as the back-end
• Either computes a result for the user, or
fetches a previously computed result or
content from storage
• The execution is predominantly distributed
• The computation itself might entail cross-disciplinary (across sciences) technology

Processing
• Can be a custom solution, such as a scatter-gather application
• Might also be an existing data intensive
computation framework, such as MapReduce,
Spark, MPI, etc. or a stream processing
system, such as IBM Infosphere Streams,
Storm, S4, etc.
• Analytics engines: R, Matlab, etc.

Numbers Everyone Should Know*

Operation                            Time (nsec)    Time (if 1 nsec were 1 sec)
L1 cache reference                   0.5            0.5s
Branch mispredict                    5              5s
L2 cache reference                   7              7s
Mutex lock/unlock                    25             25s
Main memory reference                100            1m40s
Send 2K over 1Gbps network           20,000         5h30m
Read 1MB sequentially from memory    250,000        ~3days
Disk seek                            10,000,000     ~116days
Read 1MB sequentially from disk      20,000,000     8months
Send packet CA -> NL -> CA           150,000,000    4.75years

* Jeff Dean. Designs, lessons and advice from building large distributed systems. Keynote from
LADIS, 2009.
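
To get a feel for these latencies at human scale, each figure can be stretched so that 1 nsec becomes 1 sec. The sketch below does that conversion for a few of the entries; the `humanize` helper and the 365-day year are assumptions made for illustration.

```python
def humanize(seconds):
    """Render a duration in the largest unit that fits (365-day years assumed)."""
    units = [
        ("years", 365 * 24 * 3600),
        ("days", 24 * 3600),
        ("hours", 3600),
        ("minutes", 60),
    ]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:g} seconds"


# Latencies in nsec, copied from the table above.
latencies_nsec = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Send 2K over 1Gbps network": 20_000,
    "Disk seek": 10_000_000,
    "Send packet CA -> NL -> CA": 150_000_000,
}

for operation, nsec in latencies_nsec.items():
    # Scaling 1 nsec to 1 sec means the number stays the same, only the unit changes.
    print(f"{operation}: {humanize(nsec)}")
```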

Ubiquitous Computation: Machine Learning
• Making predictions based on existing data
• Classifying emails into spam and non-spam
• American Express analyzes the monthly
expenditures of its cardholders to suggest
products to them
• Facebook uses it to figure out the order of
Newsfeed stories, friend and page
recommendations, etc.
• Amazon uses it to make product
recommendations while Netflix employs it for
movie recommendations

Case Study: MapReduce
• Designed by Google to process large amounts of data
– Google’s “hammer for 80% of their data crunching”
– Original paper has 9000+ citations

• The user only needs to write two functions
• The framework abstracts away work distribution,
network connectivity, data movement, and
synchronization
• Can seamlessly scale to hundreds of thousands of
machines
• Open-source version, Hadoop, being used by everyone,
from Yahoo and Facebook to LinkedIn and The New
York Times

Case Study: MapReduce (2)
• Used for embarrassingly parallel applications,
most divide-and-conquer algorithms
• For instance, the count of each word in a
billion-document library can be calculated in
fewer than 10 lines of custom code
• Data is stored on a distributed filesystem
• map() -> groupBy -> reduce()
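
The map() -> groupBy -> reduce() flow can be sketched on a single machine as below. This is a toy, in-memory illustration of the programming model, not the Hadoop or Google API; `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict


def map_fn(document):
    """User-supplied map: emit (word, 1) for every word in a document."""
    for word in document.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-supplied reduce: sum the counts collected for one word."""
    return word, sum(counts)


def run_mapreduce(documents, map_fn, reduce_fn):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            groups[key].append(value)          # the groupBy (shuffle) step
    return dict(reduce_fn(key, values) for key, values in groups.items())


print(run_mapreduce(["big data", "big table"], map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` are word-count specific; everything in `run_mapreduce` is what the framework would provide.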

Case Study: Storm
• Used to analyze “data in motion”
– Originally designed at BackType, which was later acquired by
Twitter; now an Apache open-source project

• Each datapoint, called a tuple, passes through a
processing pipeline
Source (spout) -> Operator(s) (bolt) -> Sink

• The user only needs to provide the code for each
operator and a graph specification (topology)
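
The spout -> bolt -> sink pipeline can be imitated with Python generators. The sketch below is not the Storm API: it runs in one process, handles only a linear chain rather than a general graph, and all function names are hypothetical.

```python
def sentence_spout(sentences):
    """Source: emit each sentence as a one-field tuple."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Operator: split each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Operator: emit a running count for every word seen so far."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def collect_sink(stream):
    """Sink: collect the output tuples into a list."""
    return list(stream)


def run_topology(source, bolts, sink):
    """Wire the operators together; here the 'topology' is a simple chain."""
    stream = source
    for bolt in bolts:
        stream = bolt(stream)
    return sink(stream)


print(run_topology(sentence_spout(["a b", "a"]), [split_bolt, count_bolt], collect_sink))
```

Each tuple flows through the whole chain as soon as it is produced, which is the "data in motion" idea in miniature.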

Storage
• Most Big Data solutions revolve around data
without any structure (possibly from
heterogeneous sources)
• The scale of the data makes a cleaning phase next
to impossible
• Therefore, storage solutions need to explicitly
support unstructured and semi-structured data
• Traditional RDBMS being replaced by NoSQL and
NewSQL solutions
– Varying from document stores to key-value stores

Storage (2)
1. Relational database management systems (RDBMS):
IBM DB2, MySQL, Oracle DB, etc. (structured data)
2. NoSQL: Key-value stores, document stores, graphs,
tables, etc. (semi-structured and unstructured data)
– Document stores: MongoDB, CouchDB, etc.
– Graphs: FlockDB, etc.
– Key-value stores: Dynamo, Cassandra, Voldemort, etc.
– Tables: BigTable, HBase, etc.

3. NewSQL: The best of both worlds: Spanner, VoltDB,
etc.

NoSQL
• Different Semantics:
– RDBMS provide ACID semantics:
• Atomicity: The entire transaction either succeeds or fails
• Consistency: Data within the database remains consistent after each
transaction
• Isolation: Transactions are sandboxed from each other
• Durability: Transactions are persistent across failures and restarts

– Overkill in case of most user-facing applications
– Most applications are more interested in availability and willing
to sacrifice consistency leading to eventual consistency

• High Throughput: Most NoSQL databases sacrifice
consistency for availability leading to higher throughput (in
some cases an order of magnitude)
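
Atomicity, the first of the ACID properties, can be seen in a small example using SQLite from the Python standard library. The `accounts` table, the names, and the `transfer` helper are invented for illustration; the point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move money between accounts as a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds: both updates are applied
try:
    transfer(conn, "alice", "bob", 500)  # fails midway: the debit is rolled back
except ValueError:
    pass
print(balances(conn))
```

An eventually consistent store would typically not offer this all-or-nothing guarantee across multiple keys, which is part of what it trades for availability and throughput.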

Case Study: BigTable*
• Distributed multi-dimensional table
• Indexed by both row-key as well as column-key
• Rows are maintained in lexicographic order and
are dynamically partitioned into tablets
• Implemented atop GFS
• Multiple tablet servers and a single master
* Fay Chang, et al. 2006. Bigtable: a distributed storage system for structured data. In
Proceedings of the 7th symposium on Operating systems design and implementation
(OSDI '06). USENIX Association, Berkeley, CA, USA, 205-218.
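
The data model described above (cells indexed by row-key and column-key, rows kept in lexicographic order and split into tablets) can be sketched in memory as follows. This is not the BigTable implementation: `TinyTable` is a hypothetical class, and splitting by a fixed number of rows per tablet is a simplifying assumption.

```python
import bisect


class TinyTable:
    """Toy table: rows sorted by key, cells addressed by (row_key, column_key)."""

    def __init__(self, rows_per_tablet=2):
        self.row_keys = []   # kept in lexicographic order
        self.rows = {}       # row_key -> {column_key: value}
        self.rows_per_tablet = rows_per_tablet  # assumption: split by row count

    def put(self, row_key, column_key, value):
        if row_key not in self.rows:
            bisect.insort(self.row_keys, row_key)  # preserve sorted order
            self.rows[row_key] = {}
        self.rows[row_key][column_key] = value

    def get(self, row_key, column_key):
        return self.rows[row_key][column_key]

    def tablets(self):
        """Partition the sorted row range into contiguous tablets."""
        step = self.rows_per_tablet
        return [self.row_keys[i:i + step] for i in range(0, len(self.row_keys), step)]


table = TinyTable()
table.put("org.apache", "title", "Apache")
table.put("com.google", "title", "Google")
table.put("com.cnn", "title", "CNN")
print(table.get("com.cnn", "title"))
print(table.tablets())
```

Because rows are sorted, each tablet covers a contiguous key range, so a range scan over nearby keys touches few tablets.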

Case Study: Spanner*
• A database that stretches across the globe, seamlessly
operating across hundreds of datacenters and millions
of machines, and trillions of rows of information
• Took Google 4 and a half years to design and develop
• Time is of the essence in distributed systems; (possibly
geo-distributed) machines, applications, processes, and
threads need to be synchronized
* James C. Corbett, et al. 2012. Spanner: Google’s globally-distributed database. In Proceedings of
the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX
Association, Berkeley, CA, USA, 251-264.

Case Study: Spanner (2)
• Spanner consists of a “TrueTime API”, which
makes use of atomic clocks and GPS!
• Ensures consistency for the entire system
• Even if two commits (with an agreed-upon ordering)
take place at opposite ends of the globe (say US and
China), their ordering will be preserved
• For instance, the Google ad system (an online
auction where ordering matters) can span the
entire globe
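
How a clock API with bounded uncertainty can preserve commit ordering may be sketched as below. The slide does not describe the mechanism; the interval-returning `now()`, the `after()` check, and the idea of waiting out the uncertainty before acknowledging a commit follow the Spanner paper. Everything else is an assumption: the clock is the local monotonic clock, the uncertainty `epsilon` is a made-up constant, and `TrueTimeSketch` and `commit` are hypothetical names.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTimeSketch:
    """Simulated clock that reports an interval instead of a single instant."""

    def __init__(self, epsilon=0.005, clock=time.monotonic):
        self.epsilon = epsilon  # assumed uncertainty bound, in seconds
        self.clock = clock

    def now(self):
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, timestamp):
        """True once the timestamp has definitely passed."""
        return self.now().earliest > timestamp


def commit(tt, sleep=time.sleep):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    timestamp = tt.now().latest
    while not tt.after(timestamp):  # the "commit wait"
        sleep(tt.epsilon / 4)
    return timestamp


tt = TrueTimeSketch()
first = commit(tt)
second = commit(tt)  # starts only after the first commit was acknowledged
print(first < second)
```

The wait costs roughly twice the uncertainty per commit in this sketch, which is why tight clock bounds from atomic clocks and GPS matter.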

Cluster Managers
• Mix different programming paradigms
– For instance, batch-processing with stream-processing

• Cluster consolidation
– No need to manually partition cluster across multiple frameworks

• Data sharing
– Pass data from, say, MapReduce to Storm and vice versa

• Higher level job orchestration
– The ability to have a graph of heterogeneous job types

• Examples include YARN, Mesos, and Google’s Omega

Operating System Layer
• Consists of the traditional operating system
stack with the usual suspects, Windows,
variants of *nix, etc.
• Alternatives exist, though, specialized for the
cloud or multicore systems
• Exokernels, multikernels, and unikernels

Virtualization Layer
• Allows multiple operating systems to run on top
of the same physical hardware
• Enables infrastructure sharing, isolation, and
optimized utilization
• Different allocation strategies possible
• Easier to dedicate CPU and memory but not the
network
• Allocation either in the form of VMs or containers
• VMware, Xen, LXC, etc.

Network Layer
• Connects the entire ecosystem together
• Consists of the entire protocol stack
• Tenants assigned to Virtual LANs
• Multiple protocols available across the stack
• Most datacenters employ traditional Ethernet as the L2
fabric, although optical, wireless, and Infiniband are
not far-fetched
• Software Defined Networks have also enabled more
informed traffic engineering
• Run-of-the-mill tree topologies being replaced by
radical recursive and random topologies

Physical Infrastructure Layer
• The physical hardware itself
• Servers and network elements
• Mechanism for power distribution, wiring, and cooling
• Servers are connected in various topologies using different
interconnects
• Dubbed as datacenters
• Modular and self-contained, container-sized datacenters can be
moved at will
• “We must treat the datacenter itself as one massive warehouse-scale
computer” – Luiz André Barroso and Urs Hölzle, Google*
* Urs Hoelzle and Luiz Andre Barroso. 2009. The Datacenter as a Computer: An Introduction to
the Design of Warehouse-Scale Machines (1st ed.). Morgan and Claypool Publishers.

Power Generation
• According to the New York Times in 2012,
datacenters are collectively responsible for the
energy equivalent of 7-10 nuclear power
plants running at full capacity
• Datacenters have started using renewable
energy sources, such as solar and wind power
• Engendering the paradigm of “move
computation wherever renewable sources
exist”

Heat Dissipation
• The scale of the setup necessitates radical cooling
mechanisms
• Facebook in Prineville, US, “the Tibet of North
America”
– Rather than use inefficient water chillers, the datacenter
pulls the outside air into the facility and uses it to cool
down the servers

• Google in Hamina, Finland, on the banks of the Baltic
Sea
– The cooling mechanism pulls sea water through an
underground tunnel and uses it to cool down the servers

Case Study: Google
• All that infrastructure enables Google to:
– Index 20B web pages a day
– Handle in excess of 3B search queries daily
– Provide email storage to 425M Gmail users
– Serve 3B YouTube videos a day
