2. Outline
Distributed file systems
Introduction to Big Data
Storage paradigms (RDBMS, NoSQL, and NewSQL)
Writing an application on top of distributed storage (Cassandra)
3. File system
The purpose of a file system is to:
Organize and store data
Support sharing of data among users and applications
Ensure persistence of data after a reboot
Examples include FAT, NTFS, ext3, ext4, etc.
4. Distributed file system
Self-explanatory: the file system is distributed across many
machines
The DFS provides a common abstraction to the dispersed files
Each DFS has an associated API through which clients perform
normal file operations, such as create, read, write, etc.
Maintains a namespace which maps logical names to physical
names
Simplifies replication and migration
Examples include the Network File System (NFS), Andrew File System
(AFS), Google File System (GFS), Hadoop Distributed File System
(HDFS), etc.
5. Introduction to GFS
Designed by Google to meet its massive storage needs
Shares many goals with previous distributed file systems such as
performance, scalability, reliability, and availability
At the same time, design driven by key observations of their
workload and infrastructure, both current and future
6. Design Goals
Failure is the norm rather than the exception: GFS must constantly
monitor itself and automatically recover from failures
The system stores a fair number of large files: Optimize for large files, on
the order of GBs, but still support small files
Most applications perform large, sequential writes that are mostly append
operations: Support small writes but do not optimize for them
Most operations are producer-consumer queues or many-way merging:
Support concurrent reads or writes by hundreds of clients simultaneously
Applications process data in bulk at a high rate: Favor throughput over
latency
7. Files
Files are sliced into fixed-size chunks
64MB
Each chunk is identifiable by an immutable and globally unique
64-bit handle
Chunks are stored by chunkservers as local Linux files
Reads and writes to a chunk are specified by a handle and a byte
range
Each chunk is replicated on multiple chunkservers
3 by default
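The fixed chunk size makes locating data trivial. Below is a minimal Python sketch, not part of any real GFS client library, of how a client could map a file byte offset to a chunk index plus an offset inside that chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB GFS chunk size

def locate(file_offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset inside chunk)."""
    return file_offset // CHUNK_SIZE, file_offset % CHUNK_SIZE
```

The client then asks the master for the handle and replica locations of that chunk index, and contacts a chunkserver with the handle and byte range.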
8. Architecture
Consists of a single master and
multiple chunkservers
The system can be accessed by
multiple clients
Both the master and
chunkservers run as user-space
server processes on commodity
Linux machines
9. Master
In charge of all filesystem metadata
Namespace, access control information, mapping between files and
chunks, and current locations of chunks
Holds this information in memory and persists namespace and mapping changes to an operation log
Also in charge of chunk leasing, garbage collection, and chunk
migration
Periodically sends each chunkserver a heartbeat signal to check
its state and send it instructions
Clients interact with it to access metadata but all data-bearing
communication goes directly to the relevant chunkservers
As a result, the master does not become a performance bottleneck
10. Master: Consistency Model
All namespace mutations (such as file creation) are atomic as they
are exclusively handled by the master
Namespace locking guarantees atomicity and correctness
The operation log maintained by the master defines a global total
order of these operations
11. Mutation Operations
Each chunk has many replicas
The primary replica holds a lease from the master
It decides the order of all mutations for all replicas
12. Write Operation
Client obtains the location of replicas and
the identity of the primary replica from the
master
It then pushes the data to all replica nodes
The client issues an update request to
primary
Primary forwards the write request to all
replicas
It waits for a reply from all replicas before
returning to the client
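The control flow above can be sketched as a toy Python simulation. All class and method names here are hypothetical, and the real protocol pushes data along a chunkserver chain separately from the control messages shown:

```python
class Replica:
    """One chunk replica; applies mutations in the order given."""
    def __init__(self):
        self.log = []  # (serial number, data) pairs, in applied order

    def apply(self, serial, data):
        self.log.append((serial, data))
        return True  # ack back to the primary


class Primary(Replica):
    """Lease holder: picks a serial order and forwards it to secondaries."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data):
        serial = self.next_serial  # the primary decides mutation order
        self.next_serial += 1
        self.apply(serial, data)
        acks = [r.apply(serial, data) for r in self.secondaries]
        return all(acks)  # reply to client only after every replica acks
```

Because every replica applies mutations in the serial order the primary chose, all replicas end up with identical logs.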
13. Record Append Operation
Append location is chosen by GFS and communicated to the client
Primary forwards the write request to all replicas
It waits for a reply from all replicas before returning to the client
If the record fits in the current chunk, it is written and communicated
to the client
If it does not, the chunk is padded and the client is told to try the next
chunk
Performed atomically
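The fit-or-pad decision can be sketched in Python (the helper `try_append` is illustrative, not a real GFS call):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in GFS

def try_append(used: int, record_len: int):
    """Decide whether a record fits in the current chunk.

    Returns (write_offset, new_used) if it fits, or (None, CHUNK_SIZE)
    to signal that the rest of the chunk was padded and the client must
    retry on the next chunk.
    """
    if used + record_len <= CHUNK_SIZE:
        return used, used + record_len
    return None, CHUNK_SIZE
```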
14. Chunk Placement
Put on chunkservers with below average disk space usage
Limit the number of “recent” creations on a chunkserver, so that
it does not experience a traffic spike due to its fresh data
For reliability, replicas spread across racks
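A hedged Python sketch of such a placement policy follows. The ranking by disk usage and the rack-spreading rule come from the slide; the actual GFS heuristics differ in detail, and the cap on recent creations is omitted here:

```python
def place_chunk(servers, replicas=3):
    """Pick chunkservers for a new chunk.

    servers: list of (name, rack, disk_used_fraction) tuples.
    Prefers servers with low disk usage and spreads replicas
    across racks for reliability.
    """
    ranked = sorted(servers, key=lambda s: s[2])  # least-used disks first
    chosen, racks = [], set()
    # First pass: at most one replica per rack.
    for name, rack, _ in ranked:
        if rack not in racks:
            chosen.append(name)
            racks.add(rack)
        if len(chosen) == replicas:
            return chosen
    # Second pass: fill remaining slots, allowing rack reuse.
    for name, _, _ in ranked:
        if name not in chosen:
            chosen.append(name)
        if len(chosen) == replicas:
            return chosen
    return chosen
```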
15. Stale Replica Detection
Each chunk is assigned a version number
Each time a new lease is granted, the version number is
incremented
Stale replicas with outdated version numbers are simply garbage
collected
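Detection then reduces to a version comparison, sketched below in Python (the function name is hypothetical):

```python
def stale_replicas(master_version: int, replica_versions: dict) -> set:
    """Return the chunkservers holding stale copies of a chunk.

    A replica is stale if its version number is behind the latest
    version the master recorded when it last granted a lease, which
    happens when the chunkserver missed mutations while down.
    """
    return {cs for cs, v in replica_versions.items() if v < master_version}
```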
16. Garbage Collection
Reclamation is lazy: chunks are not reclaimed at delete time
Each chunkserver communicates the subset of its current chunks
to the master in the heartbeat signal
Master pinpoints chunks which have been orphaned, i.e., no longer
reachable from any file; such chunks are garbage
The chunkserver finally reclaims that space
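At its core this is a set difference between what the chunkserver reports and what the master's metadata still references, sketched in Python (names are illustrative):

```python
def orphaned(reported_chunks: set, live_chunks: set) -> set:
    """Chunks a chunkserver reported that the master no longer maps
    to any file; the master tells the server it may reclaim them."""
    return reported_chunks - live_chunks
```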
17. Introduction to HDFS
Open-source clone of GFS
Comes packaged with Hadoop
Master is called the NameNode and chunkservers are called
DataNodes
Chunks are known as blocks
Exposes a Java API and a command-line interface
20. Problem
Today, Government agencies at the Federal, State and Local level are
confronting the same challenge that commercial organizations have been
struggling with in recent years: how to best capture and utilize the increasing
amount of data that is coming from more sources than ever before.
21. The current framework: the Web, multidisciplinary and complex
24. Big Data
Large datasets whose processing and storage requirements exceed traditional
paradigms and infrastructure
25. 3 Vs of Big Data
The “BIG” in big data isn’t just about volume
26. Big data ecosystem
Presentation layer
Application layer: frameworks + storage
Operating system layer
Virtualization layer (optional)
Network layer (intra- and inter-data center)
Physical infrastructure
Can roughly be called the “cloud”
27. More Examples of big data…
Index 20 billion web pages a day and handle in excess of 3 billion search queries daily
Provide email storage to 425 million Gmail users
Serve 3 billion YouTube videos a day
400 million Tweets every day
In March 2012, the Obama Administration announced the Big Data Research and Development
Initiative, $200 million in new R&D investments, which will explore how Big Data could be used to address
important problems facing the government.
28. Why are they collecting all this data?
Target Marketing
• To send you catalogs for exactly
the merchandise you typically
purchase.
• To suggest medications that
precisely match your medical
history.
• To “push” television channels to
your set instead of your “pulling”
them in.
• To send advertisements on those
channels just for you!
Targeted Information
• To know what you need before
you even know you need it
based on past purchasing
habits!
• To notify you of your expiring
driver’s license or credit cards or
last refill on a Rx, etc.
• To give you turn-by-turn
directions to a shelter in case of
emergency.
35. What is the problem
Traditionally, computation has been processor-bound
For decades, the primary push was to increase the
computing power of a single machine – Faster
processor, more RAM
Distributed systems evolved to allow developers to use
multiple machines for a single job – At compute
time, data is copied to the compute nodes
36. Getting the data to the processors becomes the bottleneck
Quick calculation – typical disk data transfer rate: 75 MB/sec –
time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
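The arithmetic behind that figure, checked in Python:

```python
size_gb = 100      # data to move to the processor
rate_mb_s = 75     # typical disk transfer rate from the slide
seconds = size_gb * 1024 / rate_mb_s  # convert GB to MB, divide by MB/s
minutes = seconds / 60                # roughly 22.8 minutes
```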
37. Failure of a component may cost a lot
What do we need when a job fails?
– May result in a graceful degradation of application performance, but the entire system does not completely fail
– Should not result in the loss of any data
– Should not affect the outcome of the job
39. Introduction
Data is everywhere and is the driving force behind our lives
The address book on your phone is data
So is the newspaper that you read every morning
Everything you see around you is a potential source of data which
might be useful for a certain application
We use this data to share information and make a more informed
decision about different events
Datasets can easily be classified on the basis of their structure
Structured
Unstructured
Semi-structured
40. Structured Data
Formatted in a universally understandable and identifiable way
In most cases, structured data is formally specified by a schema
Your phone's address book is structured because it has a schema
consisting of name, phone number, address, email address, etc.
Most traditional databases contain structured data revolving around
data laid out across columns and rows
Each field also has an associated type
Possible to search for items based on their data types
41. Unstructured Data
Data without any conceptual definition or type
Can vary from raw text to binary data
Processing unstructured data requires parsing and tagging on the fly
In most cases, consists of simple log files
42. Semi-structured Data
Occupies the space between the structured and unstructured data
spectrum
For instance, while binary data has no structure, audio and video files
have meta-data which has structure, such as author, time of creation,
etc.
Can also be labelled as having a self-describing structure
44. Database Management Systems (DBMS)
Used to store and manage data
Support for large amounts of data
Ensure concurrency, sharing, and locking
Security is useful too; to enable fine-grained access control
Ability to keep working in the face of failure
45. Relational Database Management Systems
(RDBMS)
The most popular and predominant storage system in use
Data in different files is connected by using a key field
Data is laid out in different tables, with a key field that identifies each row
The same key field is used to connect one table to another
For instance, a relation might have customer ID as key and her details as
data; another table might have the same key but different data, say her
purchases; yet another table with the same key might have a breakdown
of her preferences
Examples include Oracle Database, MS SQL Server, MySQL, IBM DB2, and
Teradata
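The key-field idea can be illustrated with Python's built-in sqlite3 module. The schema and data below are toy examples, not from the slides:

```python
import sqlite3

# Toy schema: two tables connected by the customer_id key field.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER, item TEXT);
""")
con.execute("INSERT INTO customers VALUES (1, 'Alice')")
con.execute("INSERT INTO purchases VALUES (1, 'laptop')")

# The shared key field connects one table to the other.
rows = con.execute("""
    SELECT c.name, p.item
    FROM customers AS c
    JOIN purchases AS p ON c.customer_id = p.customer_id
""").fetchall()
```

The join reassembles a customer's details and purchases from separate tables using nothing but the shared key.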
46. RDBMS and Structured Data
As structured data follows a predefined schema, it naturally maps on to a
relational database system
The schema defines the type and structure of the data and its relations
Schema design is an arduous process and needs to be done before
the database can be populated
Another consequence of a strict schema is that it is non-trivial to
extend it
For instance, adding a new attribute to an existing row necessitates
adding a new column to the entire table
Extremely suboptimal in tables with millions of rows
47. RDBMS and Semi- and Un-structured Data
Unstructured data has no notion of schema while semi-structured data
only has a weak one
Data within such datasets also has an associated type
In fact, types are application-centric: It might be possible to interpret a
field as a float in one application and as a string in another
While it is possible, with human intervention, to glean structure from
unstructured data, it is an extremely expensive task
Structureless data generated by real-time sources can change the
number of attributes and their types on the fly
RDBMS would require the creation of a new table each time such a
change takes place
Therefore, unstructured and semi-structured data does not fit the
relational model
49. NoSQL
It's not about saying that SQL should never be used, or that SQL is
dead
50. NoSQL
is simply
Not Only SQL!
It's about recognizing that for some problems
other storage solutions are better suited
51. NoSQL
Database management without relational
model, schema free
Usually not ACID
Eventually consistent data
Distributed, fault-tolerant
Large amounts of data
Low and predictable response time (latency)
Scalability & elasticity (at low cost!)
High availability
Flexible schemas / semi-structured data
52. Some NoSQL use cases
1. Massive data volumes
Massively distributed architecture required to store the data
Google, Amazon, Yahoo, Facebook – 10-100K servers
2. Extreme query workload
Impossible to efficiently do joins at that scale with an RDBMS
3. Schema evolution
Schema flexibility (migration) is not trivial at large scale
Schema changes can be gradually introduced with NoSQL
53. Three (emerging) NOSQL categories
Key-value stores
Based on DHTs / Amazon's Dynamo paper
Data model: (global) collection of K-V pairs
Example: Dynomite, Voldemort, Tokyo
BigTable Clones
Based on Google's BigTable paper
Data model: big table, column families
Example: HBase, Hypertable
54. Document databases
Inspired by Lotus Notes
Data model: collections of K-V collections
Example: CouchDB, MongoDB
Three (emerging) NOSQL categories…
57. NewSQL
NewSQL is a class of modern relational database management systems that
seek to provide the same scalable performance of NoSQL systems for
OLTP workloads while still maintaining the ACID guarantees of a traditional
single-node database system
58. NewSQL
SQL as the primary interface.
ACID support for transactions
Non-locking concurrency control.
High per-node performance.
Parallel, shared-nothing architecture.
Radically better scalability and
performance
59. A hybrid of traditional RDBMS and NoSQL
Scalability and performance of NoSQL and ACID guarantees of RDBMS
Use SQL as the primary language
Ability to scale out and run over commodity hardware
Classified into:
1 New Databases: Designed from scratch
2 New MySQL Storage Engines: Keep MySQL as interface but replace the
storage engine
3 Transparent Clustering: Add pluggable features to existing databases to
ensure scalability
62. Why column store?
Can be significantly faster than row stores for some applications
Fetch only required columns for a query
Better cache effects
Better compression (similar attribute values within a column)
But can be slower for other applications
OLTP with many row inserts, etc.
Long war between the column store and row store camps
64. Introduction
Borrows concepts from both Dynamo and BigTable
Originally developed by Facebook but now an Apache open source
project
Designed for Facebook's Inbox Search to efficiently store, index, and
search messages
65. Design Goals
Processing of a large amount of data
Highly scalable
Reliability at a massive scale
High throughput writes without sacrificing read efficiency
66. Introduction…
http://cassandra.apache.org/
• Developed by Facebook (inbox), now Apache
– Facebook now developing its own version again
• Based on Google BigTable (data model) and Amazon Dynamo (partitioning & consistency)
• P2P
– Every node is aware of all other nodes in the cluster
• Design goals
– High availability
– Eventual consistency (improves HA)
– Incremental scalability / elasticity
– Optimistic replication
67. Data model
– Same as BigTable
– Super Columns (nested Columns) and Super Column Families
– column order in a CF can be specified (name, time)
• Cluster membership
– Gossip – every node gossips to 1-3 other nodes about the state of the cluster (merging incoming
info with its own)
– Changes in the cluster (node in/out, failure) propagate quickly (O(log N) rounds)
– Probabilistic failure detection (sliding window, Exp(α) or N(μ, σ²))
• Dynamic partitioning
– Consistent hashing
– Ring of nodes
– Nodes can be “moved” on the ring for load balancing
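Consistent hashing can be sketched in a few lines of Python. This is a toy ring without the virtual nodes production Cassandra uses, and the class and function names are illustrative:

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    """Deterministic position on the ring (MD5 chosen for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """A key belongs to the first node at or after its ring position,
    wrapping around at the end of the ring."""
    def __init__(self, nodes):
        self.positions = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        idx = bisect_right(self.positions, (ring_hash(key), ""))
        if idx == len(self.positions):
            idx = 0  # wrap around the ring
        return self.positions[idx][1]
```

The incremental-scalability property falls out directly: when a node joins, it takes over only the arc between its predecessor and itself, so every other key keeps its old owner.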