2. Outline
Distributed file systems
Introduction to Big Data
Storage paradigms (RDBMS, NoSQL, and NewSQL)
Writing an application on top of distributed storage (Cassandra)
3. File system
The purpose of a file system is to:
Organize and store data
Support sharing of data among users and applications
Ensure persistence of data after a reboot
Examples include FAT, NTFS, ext3, ext4, etc.
4. Distributed file system
Self-explanatory: the file system is distributed across many
machines
The DFS provides a common abstraction to the dispersed files
Each DFS has an associated API through which clients perform
normal file operations, such as create, read, write, etc.
Maintains a namespace which maps logical names to physical
names
Simplifies replication and migration
Examples include the Network File System (NFS), Andrew File System
(AFS), Google File System (GFS), Hadoop Distributed File System
(HDFS), etc.
5. Introduction to GFS
Designed by Google to meet its massive storage needs
Shares many goals with previous distributed file systems such as
performance, scalability, reliability, and availability
At the same time, design driven by key observations of their
workload and infrastructure, both current and future
6. Design Goals
Failure is the norm rather than the exception: GFS must constantly
monitor itself and automatically recover from failures
The system stores a fair number of large files: Optimize for large files, on
the order of GBs, but still support small files
Most applications perform large, sequential writes that are mostly append
operations: Support small writes but do not optimize for them
Most operations are producer-consumer queues or many-way merging:
Support concurrent reads or writes by hundreds of clients simultaneously
Applications process data in bulk at a high rate: Favor throughput over
latency
7. Files
Files are sliced into fixed-size chunks
64MB
Each chunk is identifiable by an immutable and globally unique
64-bit handle
Chunks are stored by chunkservers as local Linux files
Reads and writes to a chunk are specified by a handle and a byte
range
Each chunk is replicated on multiple chunkservers
3 by default
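The fixed chunk size makes locating data trivial. Below is a minimal Python sketch, not part of any real GFS client library, of how a client could map a file byte offset to a chunk index plus an offset inside that chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB GFS chunk size

def locate(file_offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset inside chunk)."""
    return file_offset // CHUNK_SIZE, file_offset % CHUNK_SIZE
```

The client then asks the master for the handle and replica locations of that chunk index, and contacts a chunkserver with the handle and byte range.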
8. Architecture
Consists of a single master and
multiple chunkservers
The system can be accessed by
multiple clients
Both the master and
chunkservers run as user-space
server processes on commodity
Linux machines
9. Master
In charge of all filesystem metadata
Namespace, access control information, mapping between files and
chunks, and current locations of chunks
Holds this information in memory and persists namespace and mapping changes to an operation log
Also in charge of chunk leasing, garbage collection, and chunk
migration
Periodically sends each chunkserver a heartbeat signal to check
its state and send it instructions
Clients interact with it to access metadata but all data-bearing
communication goes directly to the relevant chunkservers
As a result, the master does not become a performance bottleneck
10. Master: Consistency Model
All namespace mutations (such as file creation) are atomic as they
are exclusively handled by the master
Namespace locking guarantees atomicity and correctness
The operation log maintained by the master defines a global total
order of these operations
11. Mutation Operations
Each chunk has many replicas
The primary replica holds a lease from the master
It decides the order of all mutations for all replicas
12. Write Operation
Client obtains the location of replicas and
the identity of the primary replica from the
master
It then pushes the data to all replica nodes
The client issues an update request to
primary
Primary forwards the write request to all
replicas
It waits for a reply from all replicas before
returning to the client
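The control flow above can be sketched as a toy Python simulation. All class and method names here are hypothetical, and the real protocol pushes data along a chunkserver chain separately from the control messages shown:

```python
class Replica:
    """One chunk replica; applies mutations in the order given."""
    def __init__(self):
        self.log = []  # (serial number, data) pairs, in applied order

    def apply(self, serial, data):
        self.log.append((serial, data))
        return True  # ack back to the primary


class Primary(Replica):
    """Lease holder: picks a serial order and forwards it to secondaries."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data):
        serial = self.next_serial  # the primary decides mutation order
        self.next_serial += 1
        self.apply(serial, data)
        acks = [r.apply(serial, data) for r in self.secondaries]
        return all(acks)  # reply to client only after every replica acks
```

Because every replica applies mutations in the serial order the primary chose, all replicas end up with identical logs.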
13. Record Append Operation
Append location is chosen by GFS and communicated to the client
Primary forwards the write request to all replicas
It waits for a reply from all replicas before returning to the client
If the record fits in the current chunk, it is written and communicated
to the client
If it does not, the chunk is padded and the client is told to try the next
chunk
Performed atomically
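The fit-or-pad decision can be sketched in Python (the helper `try_append` is illustrative, not a real GFS call):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in GFS

def try_append(used: int, record_len: int):
    """Decide whether a record fits in the current chunk.

    Returns (write_offset, new_used) if it fits, or (None, CHUNK_SIZE)
    to signal that the rest of the chunk was padded and the client must
    retry on the next chunk.
    """
    if used + record_len <= CHUNK_SIZE:
        return used, used + record_len
    return None, CHUNK_SIZE
```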
14. Chunk Placement
Put on chunkservers with below average disk space usage
Limit the number of “recent” creations on a chunkserver, so that
it does not experience a traffic spike due to its fresh data
For reliability, replicas spread across racks
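A hedged Python sketch of such a placement policy follows. The ranking by disk usage and the rack-spreading rule come from the slide; the actual GFS heuristics differ in detail, and the cap on recent creations is omitted here:

```python
def place_chunk(servers, replicas=3):
    """Pick chunkservers for a new chunk.

    servers: list of (name, rack, disk_used_fraction) tuples.
    Prefers servers with low disk usage and spreads replicas
    across racks for reliability.
    """
    ranked = sorted(servers, key=lambda s: s[2])  # least-used disks first
    chosen, racks = [], set()
    # First pass: at most one replica per rack.
    for name, rack, _ in ranked:
        if rack not in racks:
            chosen.append(name)
            racks.add(rack)
        if len(chosen) == replicas:
            return chosen
    # Second pass: fill remaining slots, allowing rack reuse.
    for name, _, _ in ranked:
        if name not in chosen:
            chosen.append(name)
        if len(chosen) == replicas:
            return chosen
    return chosen
```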
15. Stale Replica Detection
Each chunk is assigned a version number
Each time a new lease is granted, the version number is
incremented
Stale replicas with outdated version numbers are simply garbage
collected
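Detection then reduces to a version comparison, sketched below in Python (the function name is hypothetical):

```python
def stale_replicas(master_version: int, replica_versions: dict) -> set:
    """Return the chunkservers holding stale copies of a chunk.

    A replica is stale if its version number is behind the latest
    version the master recorded when it last granted a lease, which
    happens when the chunkserver missed mutations while down.
    """
    return {cs for cs, v in replica_versions.items() if v < master_version}
```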
16. Garbage Collection
Reclamation is lazy: chunks are not reclaimed at delete time
Each chunkserver communicates the subset of its current chunks
to the master in the heartbeat signal
Master pinpoints chunks which have been orphaned, i.e., no longer
reachable from any file; such chunks are garbage
The chunkserver finally reclaims that space
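At its core this is a set difference between what the chunkserver reports and what the master's metadata still references, sketched in Python (names are illustrative):

```python
def orphaned(reported_chunks: set, live_chunks: set) -> set:
    """Chunks a chunkserver reported that the master no longer maps
    to any file; the master tells the server it may reclaim them."""
    return reported_chunks - live_chunks
```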
17. Introduction to HDFS
Open-source clone of GFS
Comes packaged with Hadoop
Master is called the NameNode and chunkservers are called
DataNodes
Chunks are known as blocks
Exposes a Java API and a command-line interface
20. Problem
Today, Government agencies at the Federal, State and Local level are
confronting the same challenge that commercial organizations have been
struggling with in recent years: how to best capture and utilize the increasing
amount of data that is coming from more sources than ever before.
21. The current framework: the Web, multidisciplinary and complex
24. Big Data
Large datasets whose processing and storage requirements exceed traditional
paradigms and infrastructure
25. 3 Vs of Big Data
The “BIG” in big data isn’t just about volume
26. Big data ecosystem
Presentation layer
Application layer: frameworks + storage
Operating system layer
Virtualization layer (optional)
Network layer (intra- and inter-data center)
Physical infrastructure
Can roughly be called the “cloud”
27. More Examples of big data…
Index 20 billion web pages a day and handle in excess of 3 billion search queries daily
Provide email storage to 425 million Gmail users
Serve 3 billion YouTube videos a day
400 million Tweets every day
In March 2012, the Obama Administration announced the Big Data Research and Development
Initiative, $200 million in new R&D investments, which will explore how Big Data could be used to address
important problems facing the government.
28. Why are they collecting all this data?
Target Marketing
• To send you catalogs for exactly
the merchandise you typically
purchase.
• To suggest medications that
precisely match your medical
history.
• To “push” television channels to
your set instead of your “pulling”
them in.
• To send advertisements on those
channels just for you!
Targeted Information
• To know what you need before
you even know you need it
based on past purchasing
habits!
• To notify you of your expiring
driver’s license or credit cards or
last refill on a Rx, etc.
• To give you turn-by-turn
directions to a shelter in case of
emergency.
35. What is the problem
Traditionally, computation has been processor-bound
For decades, the primary push was to increase the
computing power of a single machine – Faster
processor, more RAM
Distributed systems evolved to allow developers to use
multiple machines for a single job – At compute
time, data is copied to the compute nodes
36. Getting the data to the processors becomes the bottleneck
Quick calculation – typical disk data transfer rate: 75 MB/sec –
time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
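The arithmetic behind that figure, checked in Python:

```python
size_gb = 100      # data to move to the processor
rate_mb_s = 75     # typical disk transfer rate from the slide
seconds = size_gb * 1024 / rate_mb_s  # convert GB to MB, divide by MB/s
minutes = seconds / 60                # roughly 22.8 minutes
```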
37. Failure of a component may cost a lot
What do we need when a job fails?
– May result in a graceful degradation of application performance, but the entire system does not completely fail
– Should not result in the loss of any data
– Should not affect the outcome of the job
39. Introduction
Data is everywhere and is the driving force behind our lives
The address book on your phone is data
So is the newspaper that you read every morning
Everything you see around you is a potential source of data which
might be useful for a certain application
We use this data to share information and make a more informed
decision about different events
Datasets can easily be classified on the basis of their structure
Structured
Unstructured
Semi-structured
40. Structured Data
Formatted in a universally understandable and identifiable way
In most cases, structured data is formally specified by a schema
Your phone's address book is structured because it has a schema
consisting of name, phone number, address, email address, etc.
Most traditional databases contain structured data revolving around
data laid out across columns and rows
Each field also has an associated type
Possible to search for items based on their data types
41. Unstructured Data
Data without any conceptual definition or type
Can vary from raw text to binary data
Processing unstructured data requires parsing and tagging on the fly
In most cases, consists of simple log files
42. Semi-structured Data
Occupies the space between the structured and unstructured data
spectrum
For instance, while binary data has no structure, audio and video files
have meta-data which has structure, such as author, time of creation,
etc.
Can also be labelled as having a self-describing structure
44. Database Management Systems (DBMS)
Used to store and manage data
Support for large amounts of data
Ensure concurrency, sharing, and locking
Security is useful too; to enable fine-grained access control
Ability to keep working in the face of failure
45. Relational Database Management Systems
(RDBMS)
The most popular and predominant storage system in use
Data in different files is connected by using a key field
Data is laid out in different tables, with a key field that identifies each row
The same key field is used to connect one table to another
For instance, a relation might have customer ID as key and her details as
data; another table might have the same key but different data, say her
purchases; yet another table with the same key might have a breakdown
of her preferences
Examples include Oracle Database, MS SQL Server, MySQL, IBM DB2, and
Teradata
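The key-field idea can be illustrated with Python's built-in sqlite3 module. The schema and data below are toy examples, not from the slides:

```python
import sqlite3

# Toy schema: two tables connected by the customer_id key field.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER, item TEXT);
""")
con.execute("INSERT INTO customers VALUES (1, 'Alice')")
con.execute("INSERT INTO purchases VALUES (1, 'laptop')")

# The shared key field connects one table to the other.
rows = con.execute("""
    SELECT c.name, p.item
    FROM customers AS c
    JOIN purchases AS p ON c.customer_id = p.customer_id
""").fetchall()
```

The join reassembles a customer's details and purchases from separate tables using nothing but the shared key.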
46. RDBMS and Structured Data
As structured data follows a predefined schema, it naturally maps on to a
relational database system
The schema defines the type and structure of the data and its relations
Schema design is an arduous process and needs to be done before
the database can be populated
Another consequence of a strict schema is that it is non-trivial to
extend it
For instance, adding a new attribute to an existing row necessitates
adding a new column to the entire table
Extremely suboptimal in tables with millions of rows
47. RDBMS and Semi- and Un-structured Data
Unstructured data has no notion of schema while semi-structured data
only has a weak one
Data within such datasets also has an associated type
In fact, types are application-centric: It might be possible to interpret a
field as a float in one application and as a string in another
While it is possible, with human intervention, to glean structure from
unstructured data, it is an extremely expensive task
Structureless data generated by real-time sources can change the
number of attributes and their types on the fly
RDBMS would require the creation of a new table each time such a
change takes place
Therefore, unstructured and semi-structured data does not fit the
relational model
49. NoSQL
It's not about saying that SQL should never be used, or that SQL is
dead
50. NoSQL
is simply
Not Only SQL!
It's about recognizing that for some problems
other storage solutions are better suited
51. NoSQL
Database management without relational
model, schema free
Usually not ACID
Eventually consistent data
Distributed, fault-tolerant
Large amounts of data
Low and predictable response time (latency)
Scalability & elasticity (at low cost!)
High availability
Flexible schemas / semi-structured data
52. Some NoSQL use cases
1. Massive data volumes
Massively distributed architecture required to store the data
Google, Amazon, Yahoo, Facebook – 10-100K servers
2. Extreme query workload
Impossible to efficiently do joins at that scale with an RDBMS
3. Schema evolution
Schema flexibility (migration) is not trivial at large scale
Schema changes can be gradually introduced with NoSQL
53. Three (emerging) NOSQL categories
Key-value stores
Based on DHTs / Amazon's Dynamo paper
Data model: (global) collection of K-V pairs
Example: Dynomite, Voldemort, Tokyo
BigTable Clones
Based on Google's BigTable paper
Data model: big table, column families
Example: HBase, Hypertable
54. Document databases
Inspired by Lotus Notes
Data model: collections of K-V collections
Example: CouchDB, MongoDB
Three (emerging) NOSQL categories…
57. NewSQL
NewSQL is a class of modern relational database management systems that
seek to provide the same scalable performance of NoSQL systems for
OLTP workloads while still maintaining the ACID guarantees of a traditional
single-node database system
58. NewSQL
SQL as the primary interface.
ACID support for transactions
Non-locking concurrency control.
High per-node performance.
Parallel, shared-nothing architecture.
Radically better scalability and
performance
59. A hybrid of traditional RDBMS and NoSQL
Scalability and performance of NoSQL and ACID guarantees of RDBMS
Use SQL as the primary language
Ability to scale out and run over commodity hardware
Classified into:
1 New Databases: Designed from scratch
2 New MySQL Storage Engines: Keep MySQL as interface but replace the
storage engine
3 Transparent Clustering: Add pluggable features to existing databases to
ensure scalability
62. Why column store?
Can be significantly faster than row stores for some applications
Fetch only required columns for a query
Better cache effects
Better compression (similar attribute values within a column)
But can be slower for other applications
OLTP with many row inserts, etc.
Long war between the column store and row store camps
64. Introduction
Borrows concepts from both Dynamo and BigTable
Originally developed by Facebook but now an Apache open source
project
Designed for Facebook's Inbox Search to efficiently store, index, and
search messages
65. Design Goals
Processing of a large amount of data
Highly scalable
Reliability at a massive scale
High throughput writes without sacrificing read efficiency
66. Introduction…
http://cassandra.apache.org/
• Developed by Facebook (inbox), now Apache
– Facebook now developing its own version again
• Based on Google BigTable (data model) and Amazon Dynamo (partitioning & consistency)
• P2P
– Every node is aware of all other nodes in the cluster
• Design goals
– High availability
– Eventual consistency (improves HA)
– Incremental scalability / elasticity
– Optimistic replication
67. Data model
– Same as BigTable
– Super Columns (nested Columns) and Super Column Families
– column order in a CF can be specified (name, time)
• Cluster membership
– Gossip – every node gossips to 1-3 other nodes about the state of the cluster (merging incoming
info with its own)
– Changes in the cluster (node in/out, failure) propagate quickly (O(log N) rounds)
– Probabilistic failure detection (sliding window, Exp(α) or N(μ, σ²))
• Dynamic partitioning
– Consistent hashing
– Ring of nodes
– Nodes can be “moved” on the ring for load balancing
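Consistent hashing can be sketched in a few lines of Python. This is a toy ring without the virtual nodes production Cassandra uses, and the class and function names are illustrative:

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    """Deterministic position on the ring (MD5 chosen for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """A key belongs to the first node at or after its ring position,
    wrapping around at the end of the ring."""
    def __init__(self, nodes):
        self.positions = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        idx = bisect_right(self.positions, (ring_hash(key), ""))
        if idx == len(self.positions):
            idx = 0  # wrap around the ring
        return self.positions[idx][1]
```

The incremental-scalability property falls out directly: when a node joins, it takes over only the arc between its predecessor and itself, so every other key keeps its old owner.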