The Big Data Stack
Zubair Nabi
zubair.nabi@cantab.net
7 January, 2014
Data Timeline
[Timeline figure: 0, fork(), 2003 (5EB), 2012 (2.7ZB), 2015 (8ZB)]
Example*: Facebook
• 2.5B – content items shared
• 2.7B – ‘Likes’
• 300M – photos uploaded
• 105TB – data scanned every 30 minutes
• 500+TB – new data ingested
• 100+PB – data warehouse
* VP Engineering, Jay Parikh – 2012
Example: Facebook’s Haystack*
• 65B photos
– 4 images of different size stored for each photo
– For a total of 260B images and 20PB of storage

• 1B new photos uploaded each week
– Increment of 60TB

• At peak traffic, 1M images served per second
• An image request is like finding a needle in a
haystack
*Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in
Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX conference on Operating systems
design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 1-8.
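The figures on this slide can be sanity-checked with a little arithmetic. The sketch below assumes decimal units (1PB = 10^15 bytes); the variable names are illustrative, and the derived per-image averages are rough estimates rather than numbers taken from the paper.

```python
# Back-of-the-envelope check of the Haystack figures (decimal units assumed).
PB = 10**15
TB = 10**12
KB = 10**3

photos = 65 * 10**9          # 65B photos
sizes_per_photo = 4          # 4 images of different size per photo
total_storage = 20 * PB      # 20PB of storage overall

images = photos * sizes_per_photo
avg_image_kb = total_storage / images / KB

weekly_photos = 10**9        # 1B new photos each week
weekly_growth = 60 * TB      # increment of 60TB
avg_new_photo_kb = weekly_growth / weekly_photos / KB

print(f"total images: {images // 10**9}B")
print(f"average stored image: {avg_image_kb:.0f}KB")
print(f"average new photo (all sizes together): {avg_new_photo_kb:.0f}KB")
```

Under these assumptions the 260B total follows directly from 65B × 4, and the averages come out in the tens of kilobytes per image.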
More Examples
• The LHC at CERN generates 22PB of data annually
(after throwing away around 99% of readings)
• The Square Kilometre Array (under construction)
is expected to generate hundreds of PB each day
• Farecast, a part of Bing, searches through 225B
flight and price records to advise customers on
their ticket purchases
More Examples (2)
• The amount of annual traffic flowing over the
Internet is around 700EB
• Walmart handles in excess of 1M transactions
every hour (25PB in total)
• 400M Tweets every day
Big Data
• Large datasets whose processing and storage
requirements exceed all traditional paradigms and
infrastructure
– On the order of terabytes and beyond

• Generated by web 2.0 applications, sensor networks,
scientific applications, financial applications, etc.
• Radically different tools needed to record, store,
process, and visualize
• Moving away from the desktop
• Offloaded to the “cloud”
• Poses challenges for computation, storage, and
infrastructure
The Stack
• Presentation layer
• Application layer: processing + storage
• Operating System layer
• Virtualization layer (optional)
• Network layer (intra- and inter-data center)
• Physical infrastructure layer
Can roughly be called the “cloud”
Presentation Layer
• Acts as the user-facing end of the entire
ecosystem
• Forwards user queries to the backend
(potentially the rest of the stack)
• Can be both local and remote
• For most web 2.0 applications, the
presentation layer is a web portal
Presentation Layer (2)
• For instance, the Google search website is a
presentation layer
– Takes user queries
– Forwards them to a scatter-gather application
– Presents the results to the user (within a time
bound)
• Made up of many technologies, such as HTTP, HTML,
AJAX, etc.
• Can also be a visualization library
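A minimal sketch of the scatter-gather pattern with a time bound, using Python's standard `concurrent.futures`. The `search_shard` function, the shard delays, and the 0.3 s bound are all hypothetical stand-ins for real backend calls; this is not how any particular search engine is implemented.

```python
import concurrent.futures
import time

def search_shard(shard_id, query, delay):
    """Hypothetical backend call: pretend to search one index shard."""
    time.sleep(delay)
    return f"shard{shard_id}: results for '{query}'"

def scatter_gather(query, shard_delays, time_bound):
    """Fan the query out to every shard and keep whatever returns in time."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(shard_delays))
    futures = [pool.submit(search_shard, i, query, d)
               for i, d in enumerate(shard_delays)]
    # Gather: wait at most `time_bound` seconds for the shards to answer.
    done, not_done = concurrent.futures.wait(futures, timeout=time_bound)
    pool.shutdown(wait=False)  # do not block on stragglers
    return sorted(f.result() for f in done), len(not_done)

# Two fast shards and one straggler that misses the time bound.
results, late = scatter_gather("big data", [0.01, 0.02, 1.0], time_bound=0.3)
print(results)
print("shards that missed the deadline:", late)
```

The presentation layer would render `results` even though one shard is missing, trading completeness for a bounded response time.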
Application Layer
• Serves as the back-end
• Either computes a result for the user, or
fetches a previously computed result or
content from storage
• The execution is predominantly distributed
• The computation itself might entail cross-disciplinary (across sciences) technology
Processing
• Can be a custom solution, such as a scatter-gather application
• Might also be an existing data intensive
computation framework, such as MapReduce,
Spark, MPI, etc. or a stream processing
system, such as IBM Infosphere Streams,
Storm, S4, etc.
• Analytics engines: R, Matlab, etc.
Numbers Everyone Should Know*
Operation                          Time (nsec)    Scaled time (1 nsec = 1 s)
L1 cache reference                         0.5    0.5s
Branch mispredict                            5    5s
L2 cache reference                           7    7s
Mutex lock/unlock                           25    25s
Main memory reference                      100    1m40s
Send 2K over 1Gbps network              20,000    5h30m
Read 1MB sequentially from memory      250,000    ~3 days
Disk seek                           10,000,000    ~4 months
Read 1MB sequentially from disk     20,000,000    ~8 months
Send packet CA -> NL -> CA         150,000,000    4.75 years
* Jeff Dean. Designs, lessons and advice from building large distributed systems. Keynote from
LADIS, 2009.
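The human-scale column appears to stretch each nanosecond to one second. Assuming that mapping, the conversion can be sketched as follows; the `scaled` helper and the choice of sample operations are illustrative only.

```python
def scaled(nsec):
    """Render `nsec` as if every nanosecond lasted one second."""
    seconds = nsec  # assumed mapping: 1 nsec -> 1 s
    for unit, size in (("years", 365 * 86400), ("days", 86400),
                       ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds} s"

# A few rows from the table above, in nanoseconds.
latencies_nsec = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Send 2K over 1Gbps network": 20_000,
    "Send packet CA -> NL -> CA": 150_000_000,
}

for operation, nsec in latencies_nsec.items():
    print(f"{operation:<30} {scaled(nsec)}")
```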
Ubiquitous Computation: Machine
Learning
• Making predictions based on existing data
• Classifying emails into spam and non-spam
• American Express analyzes the monthly
expenditures of its cardholders to suggest
products to them
• Facebook uses it to figure out the order of
Newsfeed stories, friend and page
recommendations, etc.
• Amazon uses it to make product
recommendations while Netflix employs it for
movie recommendations
Case Study: MapReduce
• Designed by Google to process large amounts of data
– Google’s “hammer for 80% of their data crunching”
– Original paper has 9000+ citations

• The user only needs to write two functions
• The framework abstracts away work distribution,
network connectivity, data movement, and
synchronization
• Can seamlessly scale to hundreds of thousands of
machines
• Open-source version, Hadoop, being used by everyone,
from Yahoo and Facebook to LinkedIn and The New
York Times
Case Study: MapReduce (2)
• Used for embarrassingly parallel applications,
most divide-and-conquer algorithms
• For instance, the count of each word in a
billion document library can be calculated in
less than 10 lines of custom code
• Data is stored on a distributed filesystem
• map() -> groupBy -> reduce()
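The map() -> groupBy -> reduce() flow can be sketched for word count in plain Python. This is a single-process illustration of the programming model, not the Hadoop API; `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict

def map_fn(document):
    """map(): emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """reduce(): sum all the counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(documents):
    # groupBy: collect the values of every key emitted by map().
    groups = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

counts = run_mapreduce(["big data", "big stack", "data data"])
print(counts)
```

Only `map_fn` and `reduce_fn` are the user's code; everything in `run_mapreduce` is what the framework abstracts away.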
Case Study: Storm
• Used to analyze “data in motion”
– Originally designed at BackType, which was later acquired by
Twitter; now an Apache open-source project

• Each datapoint, called a tuple, passes through a
processing pipeline
Source (spout) -> Operator(s) (bolt) -> Sink

• The user only needs to provide the code for each
operator and a graph specification (topology)
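The spout -> bolt -> sink pipeline can be imitated with Python generators. This is only a sketch of the dataflow idea: it does not use the Storm API, the topology is reduced to a linear list of operators, and all function names are hypothetical.

```python
def spout(lines):
    """Source: emit one tuple per incoming line."""
    for line in lines:
        yield (line,)

def split_bolt(tuples):
    """Operator: split each line tuple into word tuples."""
    for (line,) in tuples:
        for word in line.split():
            yield (word,)

def count_bolt(tuples):
    """Operator: emit the running count of every word seen so far."""
    counts = {}
    for (word,) in tuples:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

def run(source, topology):
    """Wire the operators together in order and drain them into a sink."""
    stream = source
    for bolt in topology:
        stream = bolt(stream)
    return list(stream)  # the sink simply collects the tuples

output = run(spout(["a b", "a"]), [split_bolt, count_bolt])
print(output)
```

Each tuple flows through the operators one at a time, which is what distinguishes this style from the batch-oriented MapReduce model.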
Storage
• Most Big Data solutions revolve around data
without any structure (possibly from
heterogeneous sources)
• The scale of the data makes a cleaning phase next
to impossible
• Therefore, storage solutions need to explicitly
support unstructured and semi-structured data
• Traditional RDBMS being replaced by NoSQL and
NewSQL solutions
– Varying from document stores to key-value stores
Storage (2)
1. Relational database management systems (RDBMS):
IBM DB2, MySQL, Oracle DB, etc. (structured data)
2. NoSQL: Key-value stores, document stores, graphs,
tables, etc. (semi-structured and unstructured data)
– Document stores: MongoDB, CouchDB, etc.
– Graphs: FlockDB, etc.
– Key-value stores: Dynamo, Cassandra, Voldemort, etc.
– Tables: BigTable, HBase, etc.
3. NewSQL: The best of both worlds: Spanner, VoltDB,
etc.
NoSQL
• Different Semantics:
– RDBMS provide ACID semantics:
• Atomicity: The entire transaction either succeeds or fails
• Consistency: Data within the database remains consistent after each
transaction
• Isolation: Transactions are sandboxed from each other
• Durability: Transactions are persistent across failures and restarts
– Overkill in case of most user-facing applications
– Most applications are more interested in availability and willing
to sacrifice consistency, leading to eventual consistency

• High Throughput: Most NoSQL databases sacrifice
consistency for availability leading to higher throughput (in
some cases an order of magnitude)
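Eventual consistency can be illustrated with two replicas that accept writes independently and reconcile later. The sketch below assumes a simple last-write-wins policy keyed on a version number; real NoSQL stores use a variety of reconciliation schemes, and the `Replica` class is hypothetical.

```python
class Replica:
    """Toy replica storing key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Last-write-wins: keep the entry with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Anti-entropy pass: pull every entry from another replica."""
        for key, (version, value) in other.data.items():
            self.write(key, value, version)

a, b = Replica(), Replica()
a.write("status", "online", version=1)   # the write is acknowledged by a only
stale = b.read("status")                 # b has not seen the write yet
b.sync_from(a)                           # replicas converge in the background
fresh = b.read("status")
print(stale, fresh)
```

The write returns without waiting for `b`, which is where the availability and throughput gains come from, at the price of the stale read.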
Case Study: BigTable*
• Distributed multi-dimensional table
• Indexed by both row-key as well as column-key
• Rows are maintained in lexicographic order and
are dynamically partitioned into tablets
• Implemented atop GFS
• Multiple tablet servers and a single master
* Fay Chang, et al. 2006. Bigtable: a distributed storage system for structured data. In
Proceedings of the 7th symposium on Operating systems design and implementation
(OSDI '06). USENIX Association, Berkeley, CA, USA, 205-218.
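Because rows are kept in lexicographic order, finding the tablet that holds a row is a search over the tablet boundaries. A minimal sketch, assuming each tablet is identified by its position and described by the first row key of the next tablet; the `TabletIndex` class and the sample keys are hypothetical.

```python
import bisect

class TabletIndex:
    """Maps a row key to a tablet number using sorted split keys."""

    def __init__(self, split_keys):
        # split_keys[i] is the first row key stored in tablet i + 1.
        self.split_keys = sorted(split_keys)

    def tablet_for(self, row_key):
        # Binary search over the lexicographically ordered boundaries.
        return bisect.bisect_right(self.split_keys, row_key)

index = TabletIndex(["com.google", "org.apache"])

# Cells are addressed by row key and column key.
table = {
    ("com.cnn.www", "contents"): "<html>...",
    ("com.google.maps", "contents"): "<html>...",
    ("org.wikipedia", "contents"): "<html>...",
}

placement = {row: index.tablet_for(row) for row, _column in table}
print(placement)
```

Adjacent row keys land in the same tablet, so range scans over a key prefix touch few tablet servers.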
Case Study: Spanner*
• A database that stretches across the globe, seamlessly
operating across hundreds of datacenters and millions
of machines, and trillions of rows of information
• Took Google 4 and a half years to design and develop
• Time is of the essence in distributed systems; (possibly
geo-distributed) machines, applications, processes, and
threads need to be synchronized
* James C. Corbett, et al. 2012. Spanner: Google’s globally-distributed database. In Proceedings of
the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX
Association, Berkeley, CA, USA, 251-264.
Case Study: Spanner (2)
• Spanner consists of a “TrueTime API”, which
makes use of atomic clocks and GPS!
• Ensures consistency for the entire system
• Even if two commits (with agreed upon ordering)
take place at other ends of the globe (say US and
China), their ordering will be preserved
• For instance, the Google ad system (an online
auction where ordering matters) can span the
entire globe
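The ordering guarantee rests on a clock that reports an uncertainty interval and on waiting out that uncertainty before a commit becomes visible. The sketch below simulates that idea with integer milliseconds; `FakeTrueTime`, its fixed `epsilon`, and the manual `advance` method are assumptions made for illustration and are not Google's API.

```python
class FakeTrueTime:
    """Simulated clock with a fixed uncertainty of `epsilon` milliseconds."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.true_time = 0  # stands in for real time, advanced manually

    def advance(self, ms):
        self.true_time += ms

    def now(self):
        """Return (earliest, latest) bounds on the current time."""
        return self.true_time - self.epsilon, self.true_time + self.epsilon

    def after(self, timestamp):
        """True once `timestamp` has definitely passed."""
        earliest, _latest = self.now()
        return earliest > timestamp

def commit(clock):
    """Pick a timestamp, then wait until it is certainly in the past."""
    _earliest, latest = clock.now()
    timestamp = latest
    waited = 0
    while not clock.after(timestamp):   # the "commit wait"
        clock.advance(1)
        waited += 1
    return timestamp, waited

clock = FakeTrueTime(epsilon=4)
first_ts, first_wait = commit(clock)
second_ts, second_wait = commit(clock)   # starts after the first is visible
print(first_ts, first_wait, second_ts)
```

Any commit that starts after the first one has finished waiting is assigned a larger timestamp, regardless of where it runs. The cost is a wait of roughly twice the clock uncertainty.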
Framework Plurality
Cluster Managers
• Mix different programming paradigms
– For instance, batch-processing with stream-processing

• Cluster consolidation
– No need to manually partition cluster across multiple frameworks

• Data sharing
– Pass data from, say, MapReduce to Storm and vice versa

• Higher level job orchestration
– The ability to have a graph of heterogeneous job types

• Examples include YARN, Mesos, and Google’s Omega
Operating System Layer
• Consists of the traditional operating system
stack with the usual suspects, Windows,
variants of *nix, etc.
• Alternatives exist though. Specialized for the
cloud or multicore systems
• Exokernels, multikernels, and unikernels
Virtualization Layer
• Allows multiple operating systems to run on top
of the same physical hardware
• Enables infrastructure sharing, isolation, and
optimized utilization
• Different allocation strategies possible
• Easier to dedicate CPU and memory but not the
network
• Allocation either in the form of VMs or containers
• VMWare, Xen, LXC, etc.
Network Layer
• Connects the entire ecosystem together
• Consists of the entire protocol stack
• Tenants assigned to Virtual LANs
• Multiple protocols available across the stack
• Most datacenters employ traditional Ethernet as the L2
fabric, although optical, wireless, and Infiniband are
not far-fetched
• Software Defined Networks have also enabled more
informed traffic engineering
• Run-of-the-mill tree topologies being replaced by
radical recursive and random topologies
Physical Infrastructure Layer
• The physical hardware itself
• Servers and network elements
• Mechanisms for power distribution, wiring, and cooling
• Servers are connected in various topologies using different
interconnects
• Dubbed as datacenters
• Modular and self-contained, container-sized datacenters can be
moved at will
• “We must treat the datacenter itself as one massive warehouse-scale computer” – Luiz André Barroso and Urs Hölzle, Google*
* Urs Hoelzle and Luiz Andre Barroso. 2009. The Datacenter as a Computer: An Introduction to
the Design of Warehouse-Scale Machines (1st ed.). Morgan and Claypool Publishers.
Power Generation
• According to the New York Times in 2012,
datacenters are collectively responsible for the
energy equivalent of 7-10 nuclear power
plants running at full capacity
• Datacenters have started using renewable
energy sources, such as solar and wind power
• Engendering the paradigm of “move
computation wherever renewable sources
exist”
Heat Dissipation
• The scale of the setup necessitates radical cooling
mechanisms
• Facebook in Prineville, US, “the Tibet of North
America”
– Rather than use inefficient water chillers, the datacenter
pulls the outside air into the facility and uses it to cool
down the servers

• Google in Hamina, Finland, on the banks of the Baltic
Sea
– The cooling mechanism pulls sea water through an
underground tunnel and uses it to cool down the servers
Case Study: Google
• All that infrastructure enables Google to:
– Index 20B web pages a day
– Handle in excess of 3B search queries daily
– Provide email storage to 425M Gmail users
– Serve 3B YouTube videos a day
The Big Data Stack

More Related Content

What's hot

Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
Krish_ver2
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
Vigen Sahakyan
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case Study
Nati Shalom
 
Datamining - On What Kind of Data
Datamining - On What Kind of DataDatamining - On What Kind of Data
Datamining - On What Kind of Data
wina wulansari
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Distributed Operating System,Network OS and Middle-ware.??
Distributed Operating System,Network OS and Middle-ware.??Distributed Operating System,Network OS and Middle-ware.??
Distributed Operating System,Network OS and Middle-ware.??
Abdul Aslam
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
Sigmoid
 
Google file system GFS
Google file system GFSGoogle file system GFS
Google file system GFS
zihad164
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
hktripathy
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Deepika ParthaSarathy
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
Dr. C.V. Suresh Babu
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
Dr. C.V. Suresh Babu
 
Distributed file system
Distributed file systemDistributed file system
Distributed file system
Anamika Singh
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
King Julian
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
Viet-Trung TRAN
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
Heman Hosainpana
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
Abhinav Tyagi
 
Unit 2
Unit 2Unit 2

What's hot (20)

Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case Study
 
Datamining - On What Kind of Data
Datamining - On What Kind of DataDatamining - On What Kind of Data
Datamining - On What Kind of Data
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Distributed Operating System,Network OS and Middle-ware.??
Distributed Operating System,Network OS and Middle-ware.??Distributed Operating System,Network OS and Middle-ware.??
Distributed Operating System,Network OS and Middle-ware.??
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
Google file system GFS
Google file system GFSGoogle file system GFS
Google file system GFS
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
Distributed file system
Distributed file systemDistributed file system
Distributed file system
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Unit 2
Unit 2Unit 2
Unit 2
 

Viewers also liked

Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
Abdullah Çetin ÇAVDAR
 
AOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itAOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itZubair Nabi
 
AOS Lab 10: File system -- Inodes and beyond
AOS Lab 10: File system -- Inodes and beyondAOS Lab 10: File system -- Inodes and beyond
AOS Lab 10: File system -- Inodes and beyondZubair Nabi
 
Topic 13: Cloud Stacks
Topic 13: Cloud StacksTopic 13: Cloud Stacks
Topic 13: Cloud Stacks
Zubair Nabi
 
The Anatomy of Web Censorship in Pakistan
The Anatomy of Web Censorship in PakistanThe Anatomy of Web Censorship in Pakistan
The Anatomy of Web Censorship in PakistanZubair Nabi
 
AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!Zubair Nabi
 
MapReduce and DBMS Hybrids
MapReduce and DBMS HybridsMapReduce and DBMS Hybrids
MapReduce and DBMS Hybrids
Zubair Nabi
 
AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!Zubair Nabi
 
Raabta: Low-cost Video Conferencing for the Developing World
Raabta: Low-cost Video Conferencing for the Developing WorldRaabta: Low-cost Video Conferencing for the Developing World
Raabta: Low-cost Video Conferencing for the Developing WorldZubair Nabi
 
AOS Lab 8: Interrupts and Device Drivers
AOS Lab 8: Interrupts and Device DriversAOS Lab 8: Interrupts and Device Drivers
AOS Lab 8: Interrupts and Device DriversZubair Nabi
 
AOS Lab 7: Page tables
AOS Lab 7: Page tablesAOS Lab 7: Page tables
AOS Lab 7: Page tablesZubair Nabi
 
AOS Lab 11: Virtualization
AOS Lab 11: VirtualizationAOS Lab 11: Virtualization
AOS Lab 11: VirtualizationZubair Nabi
 
MapReduce Application Scripting
MapReduce Application ScriptingMapReduce Application Scripting
MapReduce Application Scripting
Zubair Nabi
 
AOS Lab 6: Scheduling
AOS Lab 6: SchedulingAOS Lab 6: Scheduling
AOS Lab 6: SchedulingZubair Nabi
 
AOS Lab 5: System calls
AOS Lab 5: System callsAOS Lab 5: System calls
AOS Lab 5: System callsZubair Nabi
 
AOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksAOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksZubair Nabi
 
AOS Lab 2: Hello, xv6!
AOS Lab 2: Hello, xv6!AOS Lab 2: Hello, xv6!
AOS Lab 2: Hello, xv6!Zubair Nabi
 
Topic 14: Operating Systems and Virtualization
Topic 14: Operating Systems and VirtualizationTopic 14: Operating Systems and Virtualization
Topic 14: Operating Systems and Virtualization
Zubair Nabi
 
Topic 9: MR+
Topic 9: MR+Topic 9: MR+
Topic 9: MR+
Zubair Nabi
 
AOS Lab 12: Network Communication
AOS Lab 12: Network CommunicationAOS Lab 12: Network Communication
AOS Lab 12: Network CommunicationZubair Nabi
 

Viewers also liked (20)

Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
 
AOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itAOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on it
 
AOS Lab 10: File system -- Inodes and beyond
AOS Lab 10: File system -- Inodes and beyondAOS Lab 10: File system -- Inodes and beyond
AOS Lab 10: File system -- Inodes and beyond
 
Topic 13: Cloud Stacks
Topic 13: Cloud StacksTopic 13: Cloud Stacks
Topic 13: Cloud Stacks
 
The Anatomy of Web Censorship in Pakistan
The Anatomy of Web Censorship in PakistanThe Anatomy of Web Censorship in Pakistan
The Anatomy of Web Censorship in Pakistan
 
AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!
 
MapReduce and DBMS Hybrids
MapReduce and DBMS HybridsMapReduce and DBMS Hybrids
MapReduce and DBMS Hybrids
 
AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!AOS Lab 1: Hello, Linux!
AOS Lab 1: Hello, Linux!
 
Raabta: Low-cost Video Conferencing for the Developing World
Raabta: Low-cost Video Conferencing for the Developing WorldRaabta: Low-cost Video Conferencing for the Developing World
Raabta: Low-cost Video Conferencing for the Developing World
 
AOS Lab 8: Interrupts and Device Drivers
AOS Lab 8: Interrupts and Device DriversAOS Lab 8: Interrupts and Device Drivers
AOS Lab 8: Interrupts and Device Drivers
 
AOS Lab 7: Page tables
AOS Lab 7: Page tablesAOS Lab 7: Page tables
AOS Lab 7: Page tables
 
AOS Lab 11: Virtualization
AOS Lab 11: VirtualizationAOS Lab 11: Virtualization
AOS Lab 11: Virtualization
 
MapReduce Application Scripting
MapReduce Application ScriptingMapReduce Application Scripting
MapReduce Application Scripting
 
AOS Lab 6: Scheduling
AOS Lab 6: SchedulingAOS Lab 6: Scheduling
AOS Lab 6: Scheduling
 
AOS Lab 5: System calls
AOS Lab 5: System callsAOS Lab 5: System calls
AOS Lab 5: System calls
 
AOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksAOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocks
 
AOS Lab 2: Hello, xv6!
AOS Lab 2: Hello, xv6!AOS Lab 2: Hello, xv6!
AOS Lab 2: Hello, xv6!
 
Topic 14: Operating Systems and Virtualization
Topic 14: Operating Systems and VirtualizationTopic 14: Operating Systems and Virtualization
Topic 14: Operating Systems and Virtualization
 
Topic 9: MR+
Topic 9: MR+Topic 9: MR+
Topic 9: MR+
 
AOS Lab 12: Network Communication
AOS Lab 12: Network CommunicationAOS Lab 12: Network Communication
AOS Lab 12: Network Communication
 

Similar to The Big Data Stack

Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
musrath mohammad
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computing
Sachin Gowda
 
module4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdfmodule4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdf
SumanthReddy540432
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
praveen bhat
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
ShivanandaVSeeri
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
Jayant Mukherjee
 
ElasticSearch as (only) datastore
ElasticSearch as (only) datastoreElasticSearch as (only) datastore
ElasticSearch as (only) datastore
Tomas Sirny
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
chariorienit
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
Srinath Perera
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
ShreyasKv13
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
VishalBH1
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
betalab
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
Geoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
Geoffrey Fox
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 

Similar to The Big Data Stack (20)

Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computing
 
module4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdfmodule4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdf
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Bertenthal
BertenthalBertenthal
Bertenthal
 
ElasticSearch as (only) datastore
ElasticSearch as (only) datastoreElasticSearch as (only) datastore
ElasticSearch as (only) datastore
 
Big data
Big dataBig data
Big data
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 

More from Zubair Nabi

Topic 15: Datacenter Design and Networking
Topic 15: Datacenter Design and NetworkingTopic 15: Datacenter Design and Networking
Topic 15: Datacenter Design and Networking
Zubair Nabi
 
Lab 5: Interconnecting a Datacenter using Mininet
Lab 5: Interconnecting a Datacenter using MininetLab 5: Interconnecting a Datacenter using Mininet
Lab 5: Interconnecting a Datacenter using Mininet
Zubair Nabi
 
Topic 12: NoSQL in Action
Topic 12: NoSQL in ActionTopic 12: NoSQL in Action
Topic 12: NoSQL in Action
Zubair Nabi
 
Lab 4: Interfacing with Cassandra
Lab 4: Interfacing with CassandraLab 4: Interfacing with Cassandra
Lab 4: Interfacing with Cassandra
Zubair Nabi
 
Topic 10: Taxonomy of Data and Storage
Topic 10: Taxonomy of Data and StorageTopic 10: Taxonomy of Data and Storage
Topic 10: Taxonomy of Data and Storage
Zubair Nabi
 
Topic 11: Google Filesystem
Topic 11: Google FilesystemTopic 11: Google Filesystem
Topic 11: Google Filesystem
Zubair Nabi
 
Lab 3: Writing a Naiad Application
Lab 3: Writing a Naiad ApplicationLab 3: Writing a Naiad Application
Lab 3: Writing a Naiad Application
Zubair Nabi
 
Topic 8: Enhancements and Alternative Architectures
Topic 8: Enhancements and Alternative ArchitecturesTopic 8: Enhancements and Alternative Architectures
Topic 8: Enhancements and Alternative Architectures
Zubair Nabi
 
Topic 7: Shortcomings in the MapReduce Paradigm
Topic 7: Shortcomings in the MapReduce ParadigmTopic 7: Shortcomings in the MapReduce Paradigm
Topic 7: Shortcomings in the MapReduce Paradigm
Zubair Nabi
 
Lab 1: Introduction to Amazon EC2 and MPI
Lab 1: Introduction to Amazon EC2 and MPILab 1: Introduction to Amazon EC2 and MPI
Lab 1: Introduction to Amazon EC2 and MPI
Zubair Nabi
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
Zubair Nabi
 

More from Zubair Nabi (11)

Topic 15: Datacenter Design and Networking
Topic 15: Datacenter Design and NetworkingTopic 15: Datacenter Design and Networking
Topic 15: Datacenter Design and Networking
 
Lab 5: Interconnecting a Datacenter using Mininet
Lab 5: Interconnecting a Datacenter using MininetLab 5: Interconnecting a Datacenter using Mininet
Lab 5: Interconnecting a Datacenter using Mininet
 
Topic 12: NoSQL in Action
Topic 12: NoSQL in ActionTopic 12: NoSQL in Action
Topic 12: NoSQL in Action
 
Lab 4: Interfacing with Cassandra
Lab 4: Interfacing with CassandraLab 4: Interfacing with Cassandra
Lab 4: Interfacing with Cassandra
 
Topic 10: Taxonomy of Data and Storage
Topic 10: Taxonomy of Data and StorageTopic 10: Taxonomy of Data and Storage
Topic 10: Taxonomy of Data and Storage
 
Topic 11: Google Filesystem
Topic 11: Google FilesystemTopic 11: Google Filesystem
Topic 11: Google Filesystem
 
Lab 3: Writing a Naiad Application
Lab 3: Writing a Naiad ApplicationLab 3: Writing a Naiad Application
Lab 3: Writing a Naiad Application
 
Topic 8: Enhancements and Alternative Architectures
Topic 8: Enhancements and Alternative ArchitecturesTopic 8: Enhancements and Alternative Architectures
Topic 8: Enhancements and Alternative Architectures
 
Topic 7: Shortcomings in the MapReduce Paradigm
Topic 7: Shortcomings in the MapReduce ParadigmTopic 7: Shortcomings in the MapReduce Paradigm
Topic 7: Shortcomings in the MapReduce Paradigm
 
Lab 1: Introduction to Amazon EC2 and MPI
Lab 1: Introduction to Amazon EC2 and MPILab 1: Introduction to Amazon EC2 and MPI
Lab 1: Introduction to Amazon EC2 and MPI
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 

The Big Data Stack

  • 1. The Big Data Stack Zubair Nabi zubair.nabi@cantab.net 7 January, 2014
  • 4. Example*: Facebook
      • 2.5B – content items shared
      • 2.7B – ‘Likes’
      • 300M – photos uploaded
      • 105TB – data scanned every 30 minutes
      • 500+TB – new data ingested
      • 100+PB – data warehouse
      * VP Engineering, Jay Parikh – 2012
  • 5. Example: Facebook’s Haystack* • 65B photos – 4 images of different size stored for each photo – For a total of 260B images and 20PB of storage • 1B new photos uploaded each week – Increment of 60TB • At peak traffic, 1M images served per second • An image request is like finding a needle in a haystack *Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 1-8.
  • 6. More Examples • The LHC at CERN generates 22PB of data annually (after throwing away around 99% of readings) • The Square Kilometre Array (under construction) is expected to generate hundreds of PB each day • Farecast, a part of Bing, searches through 225B flight and price records to advise customers on their ticket purchases
  • 7. More Examples (2) • The amount of annual traffic flowing over the Internet is around 700EB • Walmart handles in excess of 1M transactions every hour (25PB in total) • 400M Tweets every day
  • 8. Big Data • Large datasets whose processing and storage requirements exceed all traditional paradigms and infrastructure – On the order of terabytes and beyond • Generated by web 2.0 applications, sensor networks, scientific applications, financial applications, etc. • Radically different tools needed to record, store, process, and visualize • Moving away from the desktop • Offloaded to the “cloud” • Poses challenges for computation, storage, and infrastructure
  • 9. The Stack • Presentation layer • Application layer: processing + storage • Operating System layer • Virtualization layer (optional) • Network layer (intra- and inter-data center) • Physical infrastructure layer Can roughly be called the “cloud”
  • 10. Presentation Layer • Acts as the user-facing end of the entire ecosystem • Forwards user queries to the backend (potentially the rest of the stack) • Can be both local and remote • For most web 2.0 applications, the presentation layer is a web portal
  • 11. Presentation Layer (2) • For instance, the Google search website is a presentation layer – Takes user queries – Forwards them to a scatter-gather application – Presents the results to the user (within a time bound) • Made up of many technologies, such as HTTP, HTML, AJAX, etc. • Can also be a visualization library
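The scatter-gather pattern mentioned above (forward the query to many back-end workers, then present whatever arrives within a time bound) can be sketched roughly as follows. This is only an illustration, not Google's implementation: the names `search_shard` and `scatter_gather`, the in-memory shards, and the default time bound are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def search_shard(shard, query):
    """Hypothetical per-shard lookup: return the documents containing the query."""
    return [doc for doc in shard if query in doc]

def scatter_gather(shards, query, time_bound_s=0.2):
    """Scatter the query to every shard, then gather the answers that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(shards)))
    futures = [pool.submit(search_shard, shard, query) for shard in shards]
    # Wait at most time_bound_s; stragglers end up in the not-done set.
    done, _not_done = wait(futures, timeout=time_bound_s)
    pool.shutdown(wait=False)  # do not block the caller on slow shards
    results = []
    for future in futures:  # iterate in shard order for a stable result
        if future in done:
            results.extend(future.result())
    return results
```

Calling `scatter_gather(shards, "data")` returns the matches from every shard that answered within the bound; slow shards are simply left out of the response, which is the trade-off a time-bounded front end accepts.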
  • 12. Application Layer • Serves as the back-end • Either computes a result for the user, or fetches a previously computed result or content from storage • The execution is predominantly distributed • The computation itself might entail cross-disciplinary (across sciences) technology
  • 13. Processing • Can be a custom solution, such as a scatter-gather application • Might also be an existing data-intensive computation framework, such as MapReduce, Spark, MPI, etc. or a stream processing system, such as IBM InfoSphere Streams, Storm, S4, etc. • Analytics engines: R, Matlab, etc.
  • 14. Numbers Everyone Should Know*
      Operation                             Time (nsec)     Time (scaled so that 1 nsec = 1 s)
      L1 cache reference                    0.5             0.5s
      Branch mispredict                     5               5s
      L2 cache reference                    7               7s
      Mutex lock/unlock                     25              25s
      Main memory reference                 100             1m40s
      Send 2K over 1Gbps network            20,000          5h30m
      Read 1MB sequentially from memory     250,000         ~3 days
      Disk seek                             10,000,000      ~4 months
      Read 1MB sequentially from disk       20,000,000      8 months
      Send packet CA -> NL -> CA            150,000,000     4.75 years
      * Jeff Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.
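The human-scale column in the slide above comes from stretching one nanosecond to one second. A small sketch of that conversion is given below; the nanosecond figures are the ones quoted on the slide, while the `humanize` helper and its choice of units are assumptions made for this example.

```python
# Latencies in nanoseconds, as quoted on the slide.
LATENCIES_NS = [
    ("L1 cache reference", 0.5),
    ("Branch mispredict", 5),
    ("L2 cache reference", 7),
    ("Mutex lock/unlock", 25),
    ("Main memory reference", 100),
    ("Send 2K over 1Gbps network", 20_000),
    ("Read 1MB sequentially from memory", 250_000),
    ("Disk seek", 10_000_000),
    ("Read 1MB sequentially from disk", 20_000_000),
    ("Send packet CA -> NL -> CA", 150_000_000),
]

def humanize(seconds):
    """Render a duration in seconds using the largest unit that fits."""
    units = [("years", 365 * 86_400), ("days", 86_400), ("hours", 3_600), ("minutes", 60)]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:g} seconds"

for operation, nanoseconds in LATENCIES_NS:
    # Scaling 1 ns to 1 s means the nanosecond count is read directly as seconds.
    print(f"{operation:<36} {humanize(nanoseconds)}")
```

Read this way, a main memory reference takes under two minutes, while a round trip between California and the Netherlands takes several years.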
  • 15. Ubiquitous Computation: Machine Learning • Making predictions based on existing data • Classifying emails into spam and non-spam • American Express analyzes the monthly expenditures of its cardholders to suggest products to them • Facebook uses it to figure out the order of Newsfeed stories, friend and page recommendations, etc. • Amazon uses it to make product recommendations while Netflix employs it for movie recommendations
  • 16. Case Study: MapReduce • Designed by Google to process large amounts of data – Google’s “hammer for 80% of their data crunching” – Original paper has 9000+ citations • The user only needs to write two functions • The framework abstracts away work distribution, network connectivity, data movement, and synchronization • Can seamlessly scale to hundreds of thousands of machines • Open-source version, Hadoop, being used by everyone, from Yahoo and Facebook to LinkedIn and The New York Times
  • 17. Case Study: MapReduce (2) • Used for embarrassingly parallel applications, most divide-and-conquer algorithms • For instance, the count of each word in a billion document library can be calculated in less than 10 lines of custom code • Data is stored on a distributed filesystem • map() -> groupBy -> reduce()
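The map() -> groupBy -> reduce() flow for the word-count example can be sketched on a single machine as below. This is not Hadoop's or Google's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the in-memory dictionary stands in for the grouping (shuffle) step that a real framework performs across machines.

```python
from collections import defaultdict

def map_fn(document):
    """User-written map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """User-written reduce: sum all the counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(documents):
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            groups[key].append(value)  # the groupBy step
    return dict(reduce_fn(key, values) for key, values in groups.items())
```

Only `map_fn` and `reduce_fn` correspond to the two functions the user writes; everything in `run_mapreduce` is what the framework would take care of, including distribution and data movement.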
  • 18. Case Study: Storm • Used to analyze “data in motion” – Originally designed at BackType, which was later acquired by Twitter; now an Apache open-source project • Each datapoint, called a tuple, passes through a processing pipeline: Source (spout) -> Operator(s) (bolt) -> Sink • The user only needs to provide the code for each operator and a graph specification (topology)
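The spout, bolt, and sink pipeline described above can be mimicked with Python generators. The sketch below does not use Storm's actual API; the function names and the linear (rather than general graph) topology are assumptions chosen to keep the example short.

```python
def sentence_spout(sentences):
    """Source (spout): emit one tuple per incoming sentence."""
    for sentence in sentences:
        yield (sentence,)

def split_bolt(stream):
    """Operator (bolt): split each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Operator (bolt): keep a running count and emit (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

def run_topology(source, bolts, sink):
    """Wire the source through each bolt in order and push results to the sink."""
    stream = source
    for bolt in bolts:
        stream = bolt(stream)
    for tup in stream:
        sink(tup)
```

For instance, `run_topology(sentence_spout(lines), [split_bolt, count_bolt], print)` prints an updated count each time a word flows through, processing tuples one at a time rather than in a batch.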
  • 19. Storage • Most Big Data solutions revolve around data without any structure (possibly from heterogeneous sources) • The scale of the data makes a cleaning phase next to impossible • Therefore, storage solutions need to explicitly support unstructured and semi-structured data • Traditional RDBMS being replaced by NoSQL and NewSQL solutions – Varying from document stores to key-value stores
  • 20. Storage (2)
      1. Relational database management systems (RDBMS): IBM DB2, MySQL, Oracle DB, etc. (structured data)
      2. NoSQL: Key-value stores, document stores, graphs, tables, etc. (semi-structured and unstructured data)
          – Document stores: MongoDB, CouchDB, etc.
          – Graphs: FlockDB, etc.
          – Key-value stores: Dynamo, Cassandra, Voldemort, etc.
          – Tables: BigTable, HBase, etc.
      3. NewSQL: The best of both worlds: Spanner, VoltDB, etc.
  • 21. NoSQL
      • Different Semantics:
          – RDBMS provide ACID semantics:
              • Atomicity: The entire transaction either succeeds or fails
              • Consistency: Data within the database remains consistent after each transaction
              • Isolation: Transactions are sandboxed from each other
              • Durability: Transactions are persistent across failures and restarts
          – Overkill in case of most user-facing applications
          – Most applications are more interested in availability and willing to sacrifice consistency, leading to eventual consistency
      • High Throughput: Most NoSQL databases sacrifice consistency for availability, leading to higher throughput (in some cases by an order of magnitude)
  • 22. Case Study: BigTable* • Distributed multi-dimensional table • Indexed by both row-key as well as column-key • Rows are maintained in lexicographic order and are dynamically partitioned into tablets • Implemented atop GFS • Multiple tablet servers and a single master * Fay Chang, et al. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th symposium on Operating systems design and implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 205-218.
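Because rows are kept in lexicographic order and split into tablets, finding the tablet server for a row reduces to a binary search over tablet boundaries. The sketch below illustrates that idea only; `TabletMap`, its split keys, and the server names are hypothetical and do not reflect BigTable's real metadata hierarchy.

```python
import bisect

class TabletMap:
    """Toy routing table: sorted split keys partition the row-key space into tablets."""

    def __init__(self, split_keys, servers):
        # split_keys[i] is the last row key held by tablet i;
        # the final tablet is open-ended, so it needs one extra server entry.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one more server than split keys")
        self.split_keys = sorted(split_keys)
        self.servers = servers

    def locate(self, row_key):
        """Return the server holding the tablet whose key range covers row_key."""
        index = bisect.bisect_left(self.split_keys, row_key)
        return self.servers[index]
```

With `TabletMap(["g", "p"], ["ts1", "ts2", "ts3"])`, keys up to and including "g" go to `ts1`, keys after "g" up to "p" go to `ts2`, and everything later goes to `ts3`. Splitting a tablet dynamically would amount to inserting a new split key and server.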
  • 23. Case Study: Spanner* • A database that stretches across the globe, seamlessly operating across hundreds of datacenters and millions of machines, and trillions of rows of information • Took Google 4 and a half years to design and develop • Time is of the essence in distributed systems; (possibly geo-distributed) machines, applications, processes, and threads need to be synchronized * James C. Corbett, et al. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, USA, 251-264.
  • 24. Case Study: Spanner (2) • Spanner includes a “TrueTime API”, which makes use of atomic clocks and GPS! • Ensures consistency for the entire system • Even if two commits (with agreed upon ordering) take place at opposite ends of the globe (say US and China), their ordering will be preserved • For instance, the Google ad system (an online auction where ordering matters) can span the entire globe
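One way to see how bounded clock uncertainty can preserve commit ordering is the "commit wait" idea from the Spanner paper: pick a timestamp at the upper edge of the current time interval, then wait until that timestamp has definitely passed before acknowledging the commit. The sketch below is a simulation under stated assumptions, not Spanner's code: `SimulatedTrueTime`, its fixed `epsilon`, and the `advance` method (which stands in for real time passing) are all hypothetical.

```python
class SimulatedTrueTime:
    """Toy clock that reports time as an interval of width 2 * epsilon."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.true_time = 0.0  # hidden "real" time, advanced manually in this simulation

    def now(self):
        """Return (earliest, latest) bounds on the current time."""
        return (self.true_time - self.epsilon, self.true_time + self.epsilon)

    def after(self, timestamp):
        """True only once timestamp has definitely passed."""
        return self.now()[0] > timestamp

    def advance(self, delta):
        self.true_time += delta

def commit(clock, tick=1.0):
    """Assign a commit timestamp, then wait out the clock uncertainty."""
    timestamp = clock.now()[1]  # latest possible current time
    while not clock.after(timestamp):
        clock.advance(tick)  # stands in for sleeping
    return timestamp
```

Any commit that starts after an earlier commit has returned is guaranteed a strictly larger timestamp, regardless of which machine assigns it, as long as the uncertainty bound holds. The cost is a wait of roughly twice the uncertainty per commit, which is why tight clock bounds matter.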
  • 26. Cluster Managers • Mix different programming paradigms – For instance, batch-processing with stream-processing • Cluster consolidation – No need to manually partition cluster across multiple frameworks • Data sharing – Pass data from, say, MapReduce to Storm and vice versa • Higher level job orchestration – The ability to have a graph of heterogeneous job types • Examples include YARN, Mesos, and Google’s Omega
  • 27. Operating System Layer • Consists of the traditional operating system stack with the usual suspects, Windows, variants of *nix, etc. • Alternatives exist though. Specialized for the cloud or multicore systems • Exokernels, multikernels, and unikernels
  • 28. Virtualization Layer • Allows multiple operating systems to run on top of the same physical hardware • Enables infrastructure sharing, isolation, and optimized utilization • Different allocation strategies possible • Easier to dedicate CPU and memory but not the network • Allocation either in the form of VMs or containers • VMware, Xen, LXC, etc.
  • 29. Network Layer
      • Connects the entire ecosystem together
      • Consists of the entire protocol stack
      • Tenants assigned to Virtual LANs
      • Multiple protocols available across the stack
      • Most datacenters employ traditional Ethernet as the L2 fabric, although optical, wireless, and Infiniband are not far-fetched
      • Software Defined Networks have also enabled more informed traffic engineering
      • Run-of-the-mill tree topologies being replaced by radical recursive and random topologies
  • 30. Physical Infrastructure Layer
      • The physical hardware itself
      • Servers and network elements
      • Mechanism for power distribution, wiring, and cooling
      • Servers are connected in various topologies using different interconnects
      • Dubbed as datacenters
      • Modular and self-contained, container-sized datacenters can be moved at will
      • “We must treat the datacenter itself as one massive warehouse-scale computer” – Luiz André Barroso and Urs Hölzle, Google*
      * Urs Hoelzle and Luiz Andre Barroso. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (1st ed.). Morgan and Claypool Publishers.
  • 31. Power Generation • According to the New York Times in 2012, datacenters are collectively responsible for the energy equivalent of 7-10 nuclear power plants running at full capacity • Datacenters have started using renewable energy sources, such as solar and wind power • Engendering the paradigm of “move computation wherever renewable sources exist”
  • 32. Heat Dissipation • The scale of the setup necessitates radical cooling mechanisms • Facebook in Prineville, US, “the Tibet of North America” – Rather than use inefficient water chillers, the datacenter pulls the outside air into the facility and uses it to cool down the servers • Google in Hamina, Finland, on the banks of the Baltic Sea – The cooling mechanism pulls sea water through an underground tunnel and uses it to cool down the servers
  • 34. Case Study: Google • All that infrastructure enables Google to: – Index 20B web pages a day – Handle in excess of 3B search queries daily – Provide email storage to 425M Gmail users – Serve 3B YouTube videos a day

Editor's Notes

  1. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
  2. http://blog.davidsingleton.org/one-nanosecond-is-to-one-second-as-one-second-is-to-31-7-years/