1. 1: Big Data and Warehouse-scale Computing
Zubair Nabi
zubair.nabi@itu.edu.pk
April 17, 2013
Zubair Nabi 1: Big Data and Warehouse-scale Computing April 17, 2013 1 / 23
2. Outline
1 Introduction
2 Ecosystem
7. From the very beginning
From the dawn of civilization to the year 2003, we created 5EB of
information
We now create the same amount of data every 2 days!
By 2012, we had spawned 2.7ZB of data
Following the same trend, we will have 8ZB by 2015
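As a rough check on the figures above, the growth rate implied by the 2012 and 2015 numbers can be worked out in a few lines. This is a back-of-the-envelope sketch: the constant compound-growth model and the decimal unit conversion (1 ZB = 1000 EB) are assumptions, not something the slides state.

```python
# Figures taken from the slide; the constant-growth model is an assumption.
ZB_2012 = 2.7
ZB_2015 = 8.0
YEARS = 2015 - 2012

# Compound annual growth factor implied by the two data points.
annual_factor = (ZB_2015 / ZB_2012) ** (1 / YEARS)

# "5 EB every 2 days" expressed as a yearly volume in ZB (assuming 1 ZB = 1000 EB).
EB_PER_TWO_DAYS = 5
zb_per_year_at_that_rate = EB_PER_TWO_DAYS / 2 * 365 / 1000

print(f"implied annual growth factor: {annual_factor:.2f}")
print(f"5 EB every 2 days is about {zb_per_year_at_that_rate:.2f} ZB per year")
```

Under these assumptions the data volume grows by roughly 44% per year, and 5EB every 2 days corresponds to a little under 1ZB per year.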
13. Big Data
Large datasets whose processing and storage requirements exceed all
traditional paradigms and infrastructure
On the order of exabytes and beyond
Generated by web 2.0 applications, sensor networks, scientific
applications, financial applications, etc.
Radically different tools are needed to record, store, process, and visualize such data
Moving away from the desktop
Offloaded to the “cloud”
19. Example: Facebook’s “Haystack”
65 billion photos
4 images of different sizes stored for each photo
For a total of 260 billion images and 20PB of storage
1 billion new photos uploaded each week (increment of 60TB)
At peak traffic, 1 million images are served per second
An image request is like finding a needle in a haystack
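The Haystack numbers above can be cross-checked with simple arithmetic. The sketch below uses only the figures on the slide; decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes) are an assumption.

```python
# Figures from the slide; decimal byte units are an assumption.
PHOTOS = 65 * 10**9
SIZES_PER_PHOTO = 4
TOTAL_STORAGE_BYTES = 20 * 10**15   # 20 PB
WEEKLY_PHOTOS = 1 * 10**9
WEEKLY_BYTES = 60 * 10**12          # 60 TB

# Every photo is stored at four sizes, so the image count is four times larger.
images = PHOTOS * SIZES_PER_PHOTO

# Average stored bytes per image across the whole store.
avg_image_bytes = TOTAL_STORAGE_BYTES / images

# Bytes added per newly uploaded photo (all of its sizes together).
bytes_per_new_photo = WEEKLY_BYTES / WEEKLY_PHOTOS

print(f"total images: {images:,}")
print(f"average bytes per stored image: {avg_image_bytes:,.0f}")
print(f"bytes added per uploaded photo: {bytes_per_new_photo:,.0f}")
```

The image count matches the 260 billion quoted on the slide, and the average stored image comes out at a few tens of kilobytes.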
25. More examples
The LHC at CERN generates 22PB of data annually (after throwing away
around 99% of readings)
The Square Kilometre Array (under construction) is expected to generate
hundreds of PB each day
Farecast, a part of Bing, searches through 225 billion flight and price
records to advise customers on their ticket purchases
The amount of annual traffic flowing over the Internet is around 700EB
Walmart handles in excess of 1 million transactions every hour (25PB in
total)
400 million Tweets every day
33. Big data ecosystem
Presentation layer
Application layer: frameworks + storage
Operating system layer
Virtualization layer (optional)
Network layer (intra- and inter-data center)
Physical infrastructure
Can roughly be called the “cloud”
39. Presentation Layer
Acts as the user-facing end of the entire ecosystem
Forwards user queries to the backend (potentially the rest of the stack)
Can be both local and remote
For most web 2.0 applications, the presentation layer is a web portal
For instance, the Google search website is a presentation layer: it takes
user queries, forwards them to a scatter-gather application, and presents
the results to the user (within a time bound)
Made up of many technologies, such as HTTP, HTML, AJAX, etc.
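The scatter-gather pattern with a time bound, mentioned in the search example above, can be sketched as follows. This is a minimal single-machine illustration using threads, not a description of how Google implements it; `search_shard`, `scatter_gather`, the shard count, and the deadline are all hypothetical names and values chosen for the example.

```python
import concurrent.futures
import time


def search_shard(shard_id: int, query: str) -> list:
    """Hypothetical backend: each shard answers after a small, different delay."""
    time.sleep(0.01 * shard_id)
    return [f"shard{shard_id}:{query}"]


def scatter_gather(query: str, num_shards: int, deadline_s: float,
                   backend=search_shard) -> list:
    """Send the query to every shard and keep whatever arrives before the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=num_shards)
    # Scatter: one request per shard, all issued in parallel.
    futures = [pool.submit(backend, shard, query) for shard in range(num_shards)]
    # Gather: wait at most deadline_s, then answer with the shards that finished.
    done, _not_done = concurrent.futures.wait(futures, timeout=deadline_s)
    results = []
    for future in done:
        results.extend(future.result())
    # Do not block on stragglers; their late answers are simply dropped.
    pool.shutdown(wait=False)
    return sorted(results)


print(scatter_gather("big data", num_shards=3, deadline_s=1.0))
```

The design choice illustrated here is that a late shard degrades the result (fewer hits) rather than the response time, which is one way to honour a time bound.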
43. Application Layer
Serves as the back-end
Either computes a result for the user, or fetches a previously computed
result or content from storage
The execution is predominantly distributed
The computation itself might entail cross-disciplinary (across sciences)
technology
46. Computation
Can be a custom solution, such as a scatter-gather application
Might also be an existing data-intensive computation framework, such as
MapReduce, Dryad, MPI, etc., or a stream processing system, such as
Storm, S4, etc.
Analytics engines: R, Matlab, etc.
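To give a feel for the MapReduce programming model named above, here is a single-process word count. It is only a sketch of the map, group-by-key, and reduce steps; `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical names, and this is not the API of Hadoop or of any other real framework, which would distribute these phases across many machines.

```python
from collections import defaultdict


def map_words(document: str):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: list):
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, mapper, reducer) -> dict:
    """Run map, group the intermediate pairs by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


word_counts = run_mapreduce(["big data", "big clusters"], map_words, reduce_counts)
print(word_counts)
```

The user writes only the mapper and the reducer; everything in `run_mapreduce` stands in for what a framework would handle, including partitioning and fault tolerance in a real deployment.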
53. Storage
1 Relational database management systems (RDBMS): MySQL, Oracle
  DB, IBM DB2, etc. (structured data)
2 NoSQL: Key-value stores, document stores, graphs, tables, etc.
  (semi-structured and unstructured data)
    Document stores: MongoDB, CouchDB, etc.
    Graphs: FlockDB, etc.
    Key-value stores: Dynamo, Cassandra, Voldemort, etc.
    Tables: BigTable, HBase, etc.
3 NewSQL: The best of both worlds: Spanner, VoltDB, etc.
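The difference between the structured and semi-structured/unstructured models listed above can be illustrated with the Python standard library alone. This is a toy sketch: SQLite stands in for an RDBMS and plain dictionaries stand in for a document store and a key-value store; the `photos` table, the keys, and the sample values are all made up for illustration.

```python
import json
import sqlite3

# Relational (structured): a fixed schema, queried declaratively with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, owner TEXT, width INTEGER)")
conn.execute("INSERT INTO photos VALUES (1, 'alice', 640)")
relational_row = conn.execute(
    "SELECT owner, width FROM photos WHERE id = 1"
).fetchone()

# Document (semi-structured): self-describing records whose fields may vary.
document_store = {}
document_store["photo:1"] = json.dumps(
    {"owner": "alice", "width": 640, "tags": ["sunset"]}
)
document = json.loads(document_store["photo:1"])

# Key-value (unstructured): an opaque value that can only be looked up by key.
kv_store = {"photo:1": b"\xff\xd8 raw image bytes"}
blob = kv_store["photo:1"]

print(relational_row, document["tags"], len(blob))
```

The relational version can answer ad hoc queries over any column, while the key-value version trades that away for a very simple access path that is easier to partition across machines.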
55. Operating System Layer
Consists of the traditional operating system stack with the usual suspects:
Windows, variants of *nix, etc.
Alternatives exist, though, specialized for the cloud or multicore systems
61. Virtualization Layer
Allows multiple operating systems to run on top of the same physical
hardware
Enables infrastructure sharing, isolation, and optimized utilization
Different allocation strategies possible
Easier to dedicate CPU and memory but not the network
Allocation either in the form of VMs or containers
Examples: VMware, Xen, LXC, etc.
65. Network Layer
Connects the entire ecosystem together
Consists of the entire protocol stack
Tenants assigned to Virtual LANs
Multiple protocols available across the stack
71. Physical Infrastructure Layer
The physical hardware itself
Servers and network elements
Mechanisms for power distribution, wiring, and cooling
Servers are connected in various topologies using different interconnects
Dubbed datacenters
“We must treat the datacenter itself as one massive warehouse-scale
computer” – Luiz André Barroso and Urs Hölzle
80. Example: Google
All that infrastructure enables Google to:
Index 20 billion web pages a day
Handle in excess of 3 billion search queries daily
Provide email storage to 425 million Gmail users
Serve 3 billion YouTube videos a day
81. References
1 Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel.
2010. Finding a needle in Haystack: Facebook’s photo storage. In
Proceedings of the 9th USENIX Conference on Operating Systems Design
and Implementation (OSDI ’10). USENIX Association, Berkeley, CA, USA.
2 Luiz André Barroso and Urs Hölzle. 2009. The Datacenter as a
Computer: An Introduction to the Design of Warehouse-Scale Machines
(1st ed.). Morgan and Claypool Publishers.