Modern software design in Big data era

Wenjin Gu
wenjin.gu@genesys.com
A quick Demo
A simple Java program runs 10 times faster after
adding some dummy (padding) variables at the end of the
class declaration.

Numbers Everyone Should Know
(taken from Jeff Dean – Google keynote)
• L1 cache reference: 0.5 ns
• Branch mispredict: 5 ns
• L2 cache reference: 7 ns
• Mutex lock/unlock: 25 ns
• Main memory reference: 100 ns
• Compress 1K w/cheap compression algorithm: 3,000 ns
• Send 2K bytes over 1 Gbps network: 20,000 ns
• Read 1 MB sequentially from memory: 250,000 ns
• Round trip within same datacenter: 500,000 ns
• Disk seek: 10,000,000 ns
• Read 1 MB sequentially from disk: 20,000,000 ns
• Send packet CA -> Netherlands -> CA: 150,000,000 ns

Some facts
• Latency: L1 << L2 << RAM << disk
• Sequential access is much faster than random
access (10 times or more)
• Cheap compression is faster than transferring
data over the network
• Transfer time: 1 Gbps network < disk < 100 Mbps network
Zippy (now Snappy): encode @ 300 MB/s, decode @ 600 MB/s, 2-4x compression
gzip: encode @ 25 MB/s, decode @ 200 MB/s, 4-6x compression
https://code.google.com/p/snappy/

Key to Performance – Improve memory
efficiency
Java is bad at memory efficiency:
int (4 bytes) -> Integer (16 bytes): always prefer
primitive types, but a map key must be an Object
1M records, each record has 5 string fields: 82 MB
a. Use Map<String, Map<String, String>>: 706 MB
b. Use Map<String, String[]>: 495 MB
c. Use Map<String, byte[][]>: 292 MB
d. Use ByteBuffer + Trove map: 92 MB
http://java-performance.info/overview-of-memory-saving-techniques-java/

Bloom Filter – Hash without value
Question: How to support
remove?
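
A minimal sketch of the idea, assuming a fixed-size bit array and double hashing derived from String.hashCode(). The class name, sizes, and hash scheme are illustrative choices, not taken from any particular library:

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter: stores only hash positions, never the values.
final class BloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of hash positions per key

    BloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Double hashing: position i = h1 + i * h2 (mod size).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);   // rotated copy used as a second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    // false => definitely absent; true => probably present (false positives possible).
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because several keys can share a bit, clearing bits on removal could erase other keys. One common answer to the question above is a counting variant that keeps a small counter per position instead of a single bit.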

Merkle Tree (Tree of Hash)
Used by Cassandra (anti-entropy repair between replicas)
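
A rough sketch of a tree of hashes, assuming SHA-256 and a simple pairing rule where an odd node at the end of a level is paired with itself. Both choices are assumptions for illustration, not how any specific system does it. Two replicas with equal roots hold equal data; unequal roots mean at least one block differs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes only the root hash of a Merkle tree.
final class MerkleTree {
    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) {
                md.update(p);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] root(List<String> blocks) {
        if (blocks.isEmpty()) {
            return sha256();                       // hash of empty input
        }
        // Leaves: hash of each data block.
        List<byte[]> level = new ArrayList<>();
        for (String b : blocks) {
            level.add(sha256(b.getBytes(StandardCharsets.UTF_8)));
        }
        // Inner nodes: hash of the two children, level by level up to the root.
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                next.add(sha256(left, right));
            }
            level = next;
        }
        return level.get(0);
    }
}
```

A full implementation would keep the inner nodes so that two replicas can walk down from a mismatching root and exchange only the blocks that differ.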

Data locality – Key to Performance
• On the cache level, the CPU always fetches data in whole
cache lines (typically 64 bytes at once)
  - Place variables used by the same thread close together
  - Place variables used by different threads at least 64
    bytes apart (Java 8 introduced @Contended)
http://daniel.mitterdorfer.name/articles/2014/false-sharing/
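
The padding trick can be sketched as follows. The class names and the seven-long padding are assumptions for illustration; the JVM is free to reorder fields, so manual padding is a heuristic and not a guarantee:

```java
// Hypothetical counter padded with dummy fields so that two instances
// are unlikely to share one 64-byte cache line.
final class PaddedCounter {
    volatile long value;
    long p1, p2, p3, p4, p5, p6, p7;   // dummy variables, never read
}

final class FalseSharingDemo {
    // Each thread increments its own counter; returns the sum of both counters.
    static long run(int iterations) {
        PaddedCounter[] counters = { new PaddedCounter(), new PaddedCounter() };
        Thread[] threads = new Thread[counters.length];
        for (int t = 0; t < threads.length; t++) {
            final PaddedCounter c = counters[t];
            threads[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    c.value++;          // single writer per counter
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counters[0].value + counters[1].value;
    }
}
```

To observe the effect, time run(...) with System.nanoTime() for a large iteration count, then remove the padding fields and time it again. The size of the difference depends on the hardware and the JVM.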

• On the memory and disk level, repeatedly using the same
data set is faster due to a warm cache
• On the disk level, sequential access is 10 times faster
than random access => write data sequentially in
blocks
Example: CommitLog, Bigtable row range

• On the network level, data locality means computing
on data locally. Instead of moving data to the computation,
move the computation to the data. (CPU is faster than the
network, so moving computation is cheaper than moving data)

Data Decoupling – key to Scalability
Model data from the readers' and writers' perspective to eliminate
hotspots, instead of grouping data conceptually
Example:
• Unlike many traditional file systems, GFS does not have a per-
directory data structure that lists all the files in that directory.
GFS logically represents its namespace as a lookup table
mapping full pathnames to metadata. (agent group, access
group vs. agent skills)
• column (family) based database.
Anti-pattern: User settings in CfgPerson

Data Decoupling – key to Scalability
Normalization or denormalization? That is the
question.
We have been taught for decades that normalization is
good: small size + consistency
But it creates strong data coupling => hard to
scale

Data Immutability – key to Scalability
• Always available, no contention
• Always consistent, no need to synchronize
• Can be replicated freely whenever needed

Data Immutability – key to Scalability
• Append instead of update (GFS)
• Merge instead of update (SSTable)
• Add tombstone instead of delete (Cassandra)

SSTable (LSM-Tree)
• Commit Log (node): sequential write to maximize write throughput (vs B+ tree)
• SSTable (column family): immutable sorted string table, index table is always in memory
• Merge to remove tombstone
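
The merge step can be sketched like this, assuming each table is a sorted map and a deletion is recorded as a sentinel value. The TOMBSTONE marker and the class name are hypothetical and do not reflect Cassandra's actual on-disk format:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical compaction of two immutable sorted tables into a new one.
final class SSTableMerge {
    static final String TOMBSTONE = "<tombstone>";   // illustrative deletion marker

    static TreeMap<String, String> compact(Map<String, String> older,
                                           Map<String, String> newer) {
        TreeMap<String, String> merged = new TreeMap<>(older);
        merged.putAll(newer);                         // newer entries win
        merged.values().removeIf(TOMBSTONE::equals);  // drop deleted keys for good
        return merged;
    }
}
```

The inputs are never modified: the merge writes a new table, after which the old ones can be discarded.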

Shared nothing architecture
• nodes are independent and self-sufficient
• no single point of contention across the
system
• The invention of DHT

Hash is great, but inconsistency is a
showstopper

Consistent Hash – nodes and keys are hashed
into one keyspace
Karger et al. (MIT, 1997); Chord (MIT, 2001)
Cassandra,
MapReduce
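
A minimal ring sketch, assuming an FNV-1a-style 32-bit hash and virtual nodes named node#i. The class name, the hash, and the naming scheme are all illustrative assumptions:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical consistent-hash ring: nodes and keys share one integer keyspace.
final class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // FNV-1a-style hash over chars; chosen only for brevity.
    static int hash(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    void addNode(String node) {
        for (int v = 0; v < virtualNodes; v++) {
            ring.put(hash(node + "#" + v), node);
        }
    }

    void removeNode(String node) {
        for (int v = 0; v < virtualNodes; v++) {
            ring.remove(hash(node + "#" + v), node);  // only if still owned by node
        }
    }

    // A key belongs to the first node at or after its hash, wrapping around.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in ring");
        }
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

When a node is removed, only the keys that were on that node move; every other key keeps its owner, which is what plain modulo hashing cannot offer.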

HRW (Highest Random Weight) hashing
An alternative solution: hash each (host, object)
pair and pick the host with the highest weight
w1 = h(S1, O), w2 = h(S2, O), ..., wn = h(Sn, O)
Winner: wO = max {w1, w2, ..., wn}
David Thaler and Chinya Ravishankar (University of
Michigan, 1996)
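
The formulas above can be sketched in a few lines. The FNV-1a-style 64-bit weight function and the name-based tie-break are assumptions made for illustration, not part of the original scheme's definition:

```java
import java.util.List;

// Hypothetical HRW picker: w_i = h(S_i, O), winner = server with max weight.
final class RendezvousHash {
    // FNV-1a-style 64-bit hash over the combined server and object names.
    static long weight(String server, String object) {
        long h = 0xcbf29ce484222325L;
        String s = server + "/" + object;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    static String pick(List<String> servers, String object) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no servers");
        }
        String best = null;
        long bestWeight = 0;
        for (String s : servers) {
            long w = weight(s, object);
            // Ties are broken by server name so the result does not depend on list order.
            if (best == null || w > bestWeight
                    || (w == bestWeight && s.compareTo(best) < 0)) {
                best = s;
                bestWeight = w;
            }
        }
        return best;
    }
}
```

Removing a server only affects the objects it was winning; every other object keeps its winner because its maximum is unchanged.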

MapReduce: the post office model

MapReduce
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Word count:
Split file
map: (void, line) → list(word, 1)
Shuffle
reduce: (word, list(1)) → list(word, count)
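
The word count steps can be sketched in a single process as follows. This is not the Hadoop API; the class and method names are hypothetical and only mirror the map, shuffle, and reduce phases listed above:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory word count that follows the three phases.
final class WordCount {
    // map: (void, line) -> list(word, 1)
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // shuffle: group all emitted values by key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce: (word, list(1)) -> count
    static int reduce(String word, List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {            // each line stands in for one split
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }
}
```

In a real cluster the map calls run on the nodes that hold each split, and the shuffle moves data across the network so that all values for one key reach the same reducer.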

Apache Spark
• Developed by Berkeley AMPLab
• Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
• Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)
• Hadoop MapReduce works through the disk -> slow
RDDs are a distributed memory model -> fast
• Traditional distributed memory supports fine-grained
updates -> fault tolerance needs extensive
logging or replication
RDDs are immutable, created by coarse-grained
transformations (map, join, filter) -> lost data can be
quickly rebuilt

Other interesting algorithms
• HyperLogLog (Cassandra)
• Skip List (Lucene, Redis, LevelDB)
• MurmurHash (Google, Cassandra)
• BallTree (Google Maps)
• Fractal Tree (MySQL, MongoDB)
• Dynamic Time Warping

Check list
• Calculate performance in your design
• Estimate data size before you build it
• Good designs are always tailored
• Know your tools (Guava, GS Collections,
protobuf, Snappy…)
• Share with others
