Understanding and building big data Architectures - NoSQL

Understanding and Building Big Data Architectures
Ran.ga.na.than B, ThoughtWorks
@ran_than
Part 1 - Storage and NoSQL

Img-src: http://dev.assets.neo4j.com.s3.amazonaws.com/wp-content/uploads/graph-data-jim-webber-presentation.png

Basics
Story of Sir William Osler, University of Oxford

Img-src:http://i.dailymail.co.uk/i/pix/2009/07/01/article-1196775-0543FB03000005DC-763_634x364.jpg

Latency
● Hibernia Express
● A 3,000-mile fiber-optic cable
● Runs across the Atlantic Ocean to connect London and New York
● Goal: about 5 ms lower latency than existing transatlantic links
● To be used by financial institutions for trading
Src: http://shop.oreilly.com/product/0636920028048.do

HDD Speed
❏ ~122MB per sec
❏ 1TB in 2hr 22 minutes
❏ SSDs are 2-3 times faster
Multiple disks reading parallel:
❏ With 100 HDDs, each reading 1/100 of the data, it takes roughly 1.5 minutes in the ideal case
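
A rough check of the arithmetic on this slide. The sketch assumes 1 TB = 1024 × 1024 MB, a sustained 122 MB/s per disk, and perfectly even, overhead-free parallel reads; the last assumption is optimistic and the names are illustrative.

```python
# Back-of-the-envelope read times.
# Assumptions: 1 TB = 1024 * 1024 MB, 122 MB/s sustained per disk,
# data split evenly across disks with no coordination overhead.
DATA_MB = 1024 * 1024
RATE_MB_PER_S = 122

def read_minutes(disks: int) -> float:
    """Minutes to read DATA_MB when it is split evenly across `disks`."""
    return DATA_MB / (RATE_MB_PER_S * disks) / 60

single = read_minutes(1)
parallel = read_minutes(100)
print(f"1 disk:    {single:.0f} min")
print(f"100 disks: {parallel:.1f} min")
```

Under ideal scaling the 100-disk figure lands between one and two minutes; real clusters add controller, network, and coordination overhead on top of that.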

DataCenter
Img-src: https://fortunedotcom.files.wordpress.com/2015/06/screen-shot-2015-06-24-at-11-54-41-am.png?w=1024

Cluster and nodes
Img-src:https://en.wikipedia.org/wiki/Computer_cluster#/media/File:Cubieboard_HADOOP_cluster.JPG

Clusters
1. Multi-node, e.g. Hadoop: each node has a specific
responsibility (such as master or worker).
2. Peer-to-peer, e.g. Cassandra: all nodes are equal.

Expectations from Data Systems
Non-functional parts of the application.

Expectations from Architecture
❏ Reliability
❏ Scalability
❏ Maintainability
❏ High Availability
❏ Fault Tolerance
❏ Security
❏ Compliance
❏ Compatibility

Components
❏ Storage
❏ Cache
❏ Search
❏ Stream Processing
❏ Batch Processing
❏ Data Exchange Protocols

NoSQL Databases
“Database Admins walked into a NoSQL bar. A little while later
they walked out because they couldn’t find a table.”

Why NoSQLs?
❏ Scalability
❏ Cost
❏ Flexibility
❏ Availability
❏ Migrations

CAP theorem
Strong Consistency, High Availability, and Partition-Tolerance
Img-src:http://image.slidesharecdn.com/cap-131117230434-phpapp02/95/dynamo-and-bigtable-in-light-of-the-cap-theorem-12-638.jpg?cb=1384729712

CP “when your business requirements dictate
atomic reads and writes”
Src: http://robertgreiner.com/2014/08/cap-theorem-revisited/

AP “when the system needs to continue to function
in spite of external errors”
Src: http://robertgreiner.com/2014/08/cap-theorem-revisited/

Activity
Design ticket booking with scenarios of
CA, CP, AP

ACID
Atomic, Consistent, Isolated, and Durable

BASE
● Basically Available: if a single node fails, part of the
data becomes unavailable, but the data layer as a whole stays
operational.
● Soft state: state that is not necessarily persisted to
disk and may change over time even without new input; after a
failure it can usually be restored.
● Eventually consistent: the system becomes consistent
over time, provided it receives no new input during
that time.

Key-Value DBs
Memcached, Redis, Riak, Voldemort, ...

Implementation 1 - Arrays
● Only int as key
● Values are of same type

Implementation 2 - Associative Arrays
Key    Value
user1  Mike
user2  Mary
user3  Nina
On hotspace

Simple Storage Design
- put key value: appends the pair to a file as one line
- get key: greps the file for the key and returns its value
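
A minimal sketch of this design, assuming a `key,value` line format (so keys must not contain commas and neither side may contain newlines). The function names `put` and `get` mirror the slide; the file name and sample data are made up for illustration.

```python
import os
import tempfile

def put(path: str, key: str, value: str) -> None:
    """Append one 'key,value' line; older lines for the key stay in the file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key},{value}\n")

def get(path: str, key: str):
    """Scan the whole file like grep would; the last matching line wins."""
    result = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result

db = os.path.join(tempfile.mkdtemp(), "kv.log")
put(db, "user1", "Mike")
put(db, "user2", "Mary")
put(db, "user1", "Michael")  # an update is just another append
print(get(db, "user1"))
```

Writes are cheap sequential appends, while every read walks the entire file.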
What are the problems with this?
Activity
How can we improve this?

- Add an in-memory index that maps each key to the byte offset of its latest value in the file.
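
One way the index could look, as a sketch: a Python dict from key to the byte offset where that key's most recent record starts. `IndexedLog` is a hypothetical name, and the `key,value` line format is an assumption of this example.

```python
import os
import tempfile

class IndexedLog:
    """Append-only file plus an in-memory dict of key -> byte offset."""

    def __init__(self, path: str):
        self.path = path
        self.index = {}
        open(path, "ab").close()  # make sure the file exists

    def put(self, key: str, value: str) -> None:
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # where this record will start
            f.write(record)
        self.index[key] = offset  # later writes overwrite the older offset

    def get(self, key: str):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the record, no scan
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.partition(",")[2]

store = IndexedLog(os.path.join(tempfile.mkdtemp(), "kv.log"))
store.put("user1", "Mike")
store.put("user2", "Mary")
store.put("user1", "Michael")
print(store.get("user1"), store.index)
```

A read now costs one seek and one line read, at the price of keeping every key in memory.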
Activity

- Segments
- Compaction
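
A sketch of compaction under simplifying assumptions: each segment is modelled as an in-memory list of `(key, value)` records in write order rather than a file, and `compact` is an illustrative name.

```python
def compact(segments):
    """Merge segments (oldest first), keeping only the latest value per key."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value  # newer records overwrite older ones
    return list(latest.items())

old_segment = [("user1", "Mike"), ("user2", "Mary"), ("user1", "Michael")]
new_segment = [("user3", "Nina"), ("user2", "Maria")]

merged = compact([old_segment, new_segment])
print(merged)
```

Five records shrink to three, one per live key, which is where the storage saving comes from.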
Activity

- Sorted Key-Value
- Sparse index
- SSTable
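
A sketch of how a sparse index over sorted records can work. A list stands in for the on-disk SSTable, list positions stand in for byte offsets, and the stride of 3 is an arbitrary choice for the example.

```python
import bisect

# An SSTable segment: records sorted by key (a list stands in for the file).
sstable = [("apple", 1), ("banana", 2), ("cherry", 3), ("grape", 4),
           ("kiwi", 5), ("mango", 6), ("peach", 7), ("plum", 8)]

STRIDE = 3  # index only every 3rd record
sparse_pos = list(range(0, len(sstable), STRIDE))
sparse_keys = [sstable[i][0] for i in sparse_pos]

def lookup(key):
    """Find the indexed key at or before `key`, then scan one short block."""
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first record
    start = sparse_pos[i]
    for k, v in sstable[start:start + STRIDE]:
        if k == key:
            return v
    return None

print(sparse_keys, lookup("kiwi"))
```

Because the records are sorted, the index needs only a fraction of the keys: any lookup narrows to one block and scans at most `STRIDE` records.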
Activity

- Memtable
- Segments as SSTable
Img-src: https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/

Simple Storage Design - Overall
- Writes go into an in-memory balanced tree (red-black or AVL), the memtable
=> faster writes
- When the memtable reaches a size threshold (e.g. 64 MB), write it to disk
as an SSTable and clear the memtable
- Reads check the memtable first, then the most recent segments, using each
segment's in-memory sparse index => faster reads
- Run a merging and compaction process in the background
=> less storage and faster reads
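
The overall flow can be sketched in a few lines. This is a toy model, not how any particular database implements it: `TinyLSM` is a hypothetical name, a dict sorted at flush time stands in for the balanced tree, the flush threshold counts entries instead of bytes, and segments live in memory rather than on disk.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # stand-in for a red-black/AVL tree
        self.sstables = []   # newest segment last; each sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted, immutable segment."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:  # newest data first
            return self.memtable[key]
        for segment in reversed(self.sstables):  # then newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all segments into one, newer values overwriting older."""
        merged = {}
        for segment in self.sstables:  # oldest first
            merged.update(segment)
        self.sstables = [sorted(merged.items())]

db = TinyLSM(memtable_limit=2)
for key, value in [("b", "1"), ("a", "2"), ("a", "3"), ("c", "4"), ("d", "5")]:
    db.put(key, value)
print(db.get("a"), len(db.sstables))
db.compact()
```

A real segment scan would use the sparse index instead of the linear loop in `get`; the loop keeps the sketch short.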

Assignment
Let us do the e-commerce design with
only key value pair

Document DBs
CouchDB, MongoDB, MarkLogic, DocumentDB, OrientDB, ...

Img-src: http://blog.philipphauer.de/wp-content/uploads/2015/05/Match-OO-Document.png

Query - MongoDB
Img-src: http://bicortex.com/introduction-to-mongodb-nosql-database-for-sql-developers-part-3/

Query - CouchDB
// emit the first letter of each pokemon's name
// `pouch` is assumed to be an existing PouchDB database instance
var myMapReduceFun = {
  map: function (doc) {
    emit(doc.name.charAt(0));
  },
  reduce: '_count'
};
// count the pokemon whose names start with 'P'
pouch.query(myMapReduceFun, {
  key: 'P', reduce: true, group: true
}).then(function (result) {
  // handle result
}).catch(function (err) {
  // handle errors
});

What’s cool?
● Flexible schema.
● Embedded docs come in one read.

Not so cool
● Less familiar to most teams than relational databases.
● Needs more space.
● Doesn’t speak SQL.

Assignment
Let us do the e-commerce design with
only document database

Graph DBs
Neo4j, TitanDB, OrientDB, ...

Relationships in relational DBs mean joins

Why?
❏ Modelling and storing relationships in RDBMS is
complicated
❏ Performance degrades with number and levels of
relationships.
❏ Query complexity grows
❏ Adding new type requires schema redesign

With Neo4j, you can traverse 4M+ relationships
per second per core

Img-src: http://blog.octo.com/wp-content/uploads/2012/07/RequestInSQL.png
Squeezing into a table structure

Img-src: http://blog.octo.com/wp-content/uploads/2012/07/RequestInGraph.png
Closer to the whiteboard model

Graph consists of
❏ Vertices
❏ Edges
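
To make vertices and edges concrete, here is a small sketch using a plain adjacency list and a breadth-first traversal. The data, relationship names, and `within_hops` function are invented for illustration, and this is not how Neo4j stores or queries graphs.

```python
from collections import deque

# Vertices are names; edges are directed (from, relationship, to) triples.
edges = [
    ("Mike", "FRIEND", "Mary"),
    ("Mary", "FRIEND", "Nina"),
    ("Nina", "FRIEND", "Omar"),
    ("Mike", "WORKS_AT", "Acme"),
]

adjacency = {}
for src, rel, dst in edges:
    adjacency.setdefault(src, []).append((rel, dst))

def within_hops(start, rel, max_hops):
    """Breadth-first traversal following only `rel` edges, up to max_hops."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for r, nxt in adjacency.get(node, []):
            if r == rel and nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found

print(within_hops("Mike", "FRIEND", 2))
```

Each extra hop is another step along stored edges rather than another join, which is the contrast the preceding slides draw with relational tables.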

Column-oriented DBs
Apache HBase, Cassandra, BigTable, ...

Src: https://www.safaribooksonline.com/library/view/hbase-the-definitive/9781449314682/httpatomoreillycomsourceoreillyimages889228.png

Src: http://static.oschina.net/uploads/img/201303/12072155_ROPI.gif

Img-src: http://www.slideshare.net/romain_jacotin/undestand-google-bigtable-is-as-easy-as-playing-lego-bricks-lecture-by-romain-jacotin

When is this better?
❏ Huge number of columns, with queries on few columns
❏ Aggregation
❏ Column level update
❏ Column data is uniform; so better compression
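
A small sketch of why these cases favour a column layout. The records are made up, Python lists stand in for on-disk storage, and run-length encoding is just one simple example of a compression scheme that benefits from uniform column data.

```python
# The same three records, laid out two ways (hypothetical data).
rows = [
    {"id": 1, "city": "Hyderabad", "amount": 120},
    {"id": 2, "city": "Pune", "amount": 80},
    {"id": 3, "city": "Hyderabad", "amount": 50},
]

# Column layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A row store would have to read whole records to sum one field.
total_from_rows = sum(row["amount"] for row in rows)

# A column store reads only the one column it needs.
total_from_columns = sum(columns["amount"])

def run_length_encode(values):
    """Collapse runs of equal neighbours into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

encoded_city = run_length_encode(sorted(columns["city"]))
print(total_from_columns, encoded_city)
```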

Time Series Data
A measurement and the time of measurement, recorded repeatedly
Img src: https://www.safaribooksonline.com/library/view/time-series-databases/9781491920909/images/tsdn_0103.png.jpg

Why - Time Series Data
● Trends

When - Time Series Data
● Huge amount of data
● Mostly query based on time
● Stock exchange
● Sensor data. E.g: Trucks
● Cell towers for usage patterns

Time Series Data
● IoT
● Logs

Replication

Sharding
Which bucket has your data?

Master-Slave
- Master for writes and real-time reads
- Slaves for reads

Table partitioning
- By rows or columns, to parallelize reads

Feature-specific DBs
- Different DB servers for specific features of the application
- Can this scale?

Federated Tables
A Federated Table is a table that points to a table in another
database instance (usually on another server). It can be seen as
a view onto this remote table.
- Administration overhead
- Security
- Access over network
- Okay for reporting/analytical tasks

Range based
Split data based on a range condition, e.g. zip code or region.
- Non-uniform distribution
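
A sketch of range-based routing by zip code. The split points and sample zip codes are hypothetical, chosen only to show how clustered keys skew the distribution.

```python
import bisect
from collections import Counter

# Exclusive upper bounds of the first three ranges (made-up split points).
# Four shards in total: 0, 1, 2, 3.
BOUNDARIES = [300000, 500000, 700000]

def range_shard(zip_code: int) -> int:
    """Return the index of the range that contains zip_code."""
    return bisect.bisect_right(BOUNDARIES, zip_code)

# Customers clustered in a few cities land mostly on one shard.
zip_codes = [500001, 500032, 500081, 560001, 110001]
load = Counter(range_shard(z) for z in zip_codes)
print(dict(load))
```

Range queries stay cheap because neighbouring keys share a shard, but a popular range turns its shard into a hotspot.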

Hash based
Take a hash of the key, apply a modulo operation, and place the
data on the server given by the remainder.
- Uniform distribution
- Range queries may take time
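
A sketch of hash-based placement. It uses MD5 from `hashlib` as a stable hash (an assumption of this example; Python's built-in `hash()` for strings varies between processes), and the shard count of 4 and the key names are arbitrary.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash the key, then take the remainder to pick a shard."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Consecutive keys scatter across all shards.
counts = Counter(hash_shard(f"user{i}") for i in range(1000))
print(sorted(counts.items()))
```

The same scattering that evens out the load also means a range query such as "user100 to user200" has to visit every shard.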

Co-ordinators
- Take the request; if it contains the shard key, talk to the correct shard
- Otherwise, co-ordinate across shards and give the combined result back
- Monitor health
- Take care of rebalancing
- Can be any node, picked at random, which then completes the task
- Or a dedicated set of co-ordinators
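
The first two responsibilities can be sketched as follows. `Coordinator` is a hypothetical class, in-memory dicts stand in for shard servers, and CRC32 modulo the shard count is just one possible placement rule; health monitoring and rebalancing are left out.

```python
import zlib

class Coordinator:
    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key: str) -> dict:
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key: str, value) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: str):
        """Shard key present: exactly one shard is contacted."""
        return self._shard_for(key).get(key)

    def find(self, predicate):
        """No shard key in the query: ask every shard and merge the results."""
        results = []
        for shard in self.shards:
            results.extend((k, v) for k, v in shard.items() if predicate(v))
        return sorted(results)

coordinator = Coordinator(num_shards=3)
coordinator.put("user1", "Mike")
coordinator.put("user2", "Mary")
coordinator.put("user3", "Nina")
print(coordinator.get("user2"), coordinator.find(lambda name: name.startswith("M")))
```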

Take care, while sharding
● Balance your shards with a proper shard key.
● Choose the correct number of shards, e.g. 12.
● Give time for rebalancing. When increasing server capacity,
add nodes early and allow time for your shards to move.
● Shard on denormalized data.
● Try to have the shard key as part of your queries.

THANK YOU
For questions or suggestions:
Ran.ga.na.than B
@ran_than
