Wenjin Gu
wenjin.gu@genesys.com
Modern software design in Big data era
A quick demo
A simple Java program runs about 10 times faster after
some dummy variables are added at the end of the
class declaration.
False sharing
Numbers Everyone Should Know
(taken from Jeff Dean – Google keynote)
• L1 cache reference: 0.5 ns
• Branch mispredict: 5 ns
• L2 cache reference: 7 ns
• Mutex lock/unlock: 25 ns
• Main memory reference: 100 ns
• Compress 1K w/cheap compression algorithm: 3,000 ns
• Send 2K bytes over 1 Gbps network: 20,000 ns
• Read 1 MB sequentially from memory: 250,000 ns
• Round trip within same datacenter: 500,000 ns
• Disk seek: 10,000,000 ns
• Read 1 MB sequentially from disk: 20,000,000 ns
• Send packet CA->Netherlands->CA: 150,000,000 ns
Some facts
• Latency: L1 << L2 << RAM << disk
• Sequential access is much faster than random
access (10 times or more)
• Cheap compression is faster than transferring
the data over the network
• Transfer time: 1 Gbps network < disk < 100 Mbps network
Zippy (open-sourced as Snappy): encode @ 300 MB/s, decode @ 600 MB/s, 2-4x compression
gzip: encode @ 25 MB/s, decode @ 200 MB/s, 4-6x compression
https://code.google.com/p/snappy/
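The compression claim can be checked with back-of-envelope arithmetic from the latency table. The sketch below is only an estimate: it uses the two listed figures (3,000 ns to compress 1K, 20,000 ns to send 2K over 1 Gbps), assumes a 2x compression ratio (the low end of the Zippy range), and ignores decompression on the receiving side. The class and method names are made up for illustration.

```java
// Back-of-envelope estimate built only from the latency numbers listed above.
final class CompressionEstimate {
    static final long SEND_2K_NS = 20_000;    // send 2K bytes over a 1 Gbps network
    static final long COMPRESS_1K_NS = 3_000; // compress 1K with a cheap algorithm

    // Time to send the given number of bytes uncompressed.
    static long sendNs(long bytes) {
        return bytes * SEND_2K_NS / 2048;
    }

    // Time to compress the given number of bytes.
    static long compressNs(long bytes) {
        return bytes * COMPRESS_1K_NS / 1024;
    }

    // Time to compress first, then send the smaller payload.
    // ratio = 2 means the compressed payload is half the original size (an assumption).
    static long compressThenSendNs(long bytes, int ratio) {
        return compressNs(bytes) + sendNs(bytes / ratio);
    }

    public static void main(String[] args) {
        long oneMb = 1L << 20;
        System.out.println("send raw:           " + sendNs(oneMb) + " ns");
        System.out.println("compress then send: " + compressThenSendNs(oneMb, 2) + " ns");
    }
}
```

Under these assumptions, 1 MB takes about 10.2 ms to send raw and about 8.2 ms to compress and send, so the compressed path wins even at the lowest ratio.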
Key to Performance - Improve memory
efficiency
Java is bad at memory efficiency:
int (4 bytes) -> Integer (16 bytes): always prefer
primitive types, but a map key must be an Object
1M records, each with 5 string fields: 82 MB
a. Use Map<String, Map<String, String>>: 706 MB
b. Use Map<String, String[]>: 495 MB
c. Use Map<String, byte[][]>: 292 MB
d. Use ByteBuffer + Trove map: 92 MB
http://java-performance.info/overview-of-memory-saving-techniques-java/
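As a rough illustration of option c, the sketch below stores each record's fields as UTF-8 byte arrays instead of String objects. It is not the code from the linked article; the class name CompactRecords and its methods are hypothetical, and the actual savings depend on the JVM version and on how much of the text is ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical record store: one byte[] per field instead of one String per field.
final class CompactRecords {
    private final Map<String, byte[][]> records = new HashMap<>();

    // Encodes every field as UTF-8 and stores the packed record under the key.
    void put(String key, String... fields) {
        byte[][] packed = new byte[fields.length][];
        for (int i = 0; i < fields.length; i++) {
            packed[i] = fields[i].getBytes(StandardCharsets.UTF_8);
        }
        records.put(key, packed);
    }

    // Decodes a single field on demand; returns null if the key is unknown.
    String field(String key, int index) {
        byte[][] packed = records.get(key);
        return packed == null ? null : new String(packed[index], StandardCharsets.UTF_8);
    }
}
```

The trade-off is CPU for memory: every read pays for a decode, so this suits data that is stored in bulk and read rarely.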
Bloom Filter – Hash without value
Question: How to support
remove?
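A minimal Bloom filter sketch follows, to make "hash without value" concrete: only bit positions are stored, never the items. The double-hashing scheme derived from String.hashCode is an assumption chosen for brevity, not a recommendation; a real filter would use a stronger hash.

```java
import java.util.BitSet;

// Minimal Bloom filter: stores only bits, so it can answer "definitely not present"
// or "possibly present", but can never return the item itself.
final class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th bit position for an item, using simple double hashing (illustrative only).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(item, i));
        }
    }

    // false means the item was never added; true may be a false positive.
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }
}
```

On the remove question: clearing bits is unsafe because several items may share a bit. One common answer is a counting Bloom filter, which replaces each bit with a small counter that is incremented on add and decremented on remove.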
Merkle Tree (tree of hashes)
Cassandra (anti-entropy repair)
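A Merkle tree lets two replicas compare a single root hash instead of every block. The sketch below computes such a root with SHA-256. The tree shape (pairwise hashing, last node duplicated on odd levels) is one common convention and an assumption here, not a description of how Cassandra builds its trees.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

final class MerkleTree {
    // SHA-256 over the concatenation of the given byte arrays.
    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] part : parts) {
                md.update(part);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hashes each block, then repeatedly hashes pairs until one root remains.
    static byte[] root(List<String> blocks) {
        if (blocks.isEmpty()) {
            return sha256();
        }
        List<byte[]> level = new ArrayList<>();
        for (String block : blocks) {
            level.add(sha256(block.getBytes(StandardCharsets.UTF_8)));
        }
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                // On an odd-sized level, the last node is paired with itself.
                byte[] right = i + 1 < level.size() ? level.get(i + 1) : left;
                next.add(sha256(left, right));
            }
            level = next;
        }
        return level.get(0);
    }
}
```

If two roots match, the replicas agree on every block; if they differ, comparing child hashes top-down narrows the mismatch to a few blocks without transferring the rest.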
Data locality – Key to Performance
• At the cache level, the CPU always fetches data in whole
cache lines (typically 64 bytes at a time)
  - Place variables used by the same thread close together
  - Place variables used by different threads at least 64
    bytes apart (Java 8 introduced @Contended)
http://daniel.mitterdorfer.name/articles/2014/false-sharing/
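The effect can be observed with a small experiment. The sketch below is a variant of the demo idea rather than the original demo: instead of padding fields in a class, it uses an AtomicLongArray and picks slots that are either adjacent or 128 bytes apart. Timings vary by machine and JVM, so no expected numbers are given, and the adjacent slots only very likely (not certainly) share a cache line.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class FalseSharingDemo {
    // Two threads each increment their own slot; returns the elapsed time in nanoseconds.
    static long run(AtomicLongArray counters, int slotA, int slotB, long iterations) {
        Thread a = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.incrementAndGet(slotA);
            }
        });
        Thread b = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                counters.incrementAndGet(slotB);
            }
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long n = 20_000_000L;
        AtomicLongArray counters = new AtomicLongArray(32);
        long adjacent = run(counters, 0, 1, n); // 8 bytes apart: very likely the same cache line
        long apart = run(counters, 8, 24, n);   // 128 bytes apart: different cache lines
        System.out.println("adjacent slots: " + adjacent / 1_000_000 + " ms");
        System.out.println("padded slots:   " + apart / 1_000_000 + " ms");
    }
}
```

The two threads never touch the same counter in either run; any slowdown in the adjacent case comes from the cores invalidating each other's copy of the shared cache line.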
Data locality – Key to Performance
• At the memory and disk level, reusing the same
data set is faster because the cache is warm
• At the disk level, sequential access is 10 times faster
than random access => write data sequentially in
blocks
Example: CommitLog, Bigtable row range
Data locality – Key to Performance
• At the network level, data locality means computing
on data locally: instead of moving data to the computation,
move the computation to the data. (The CPU is faster than
the network, so moving computation is cheaper than moving data.)
Data Decoupling – key to Scalability
Model data from the reader/writer perspective to eliminate hotspots,
instead of grouping data conceptually
Example:
• Unlike many traditional file systems, GFS does not have a
per-directory data structure that lists all the files in that directory.
GFS logically represents its namespace as a lookup table
mapping full pathnames to metadata. (agent group, access
group vs. agent skills)
• Column (family) based databases
Anti-pattern: User settings in CfgPerson
Data Decoupling – key to Scalability
Normalization or denormalization? That is the
question.
For decades we have been taught that normalization is
good: small size + consistency
But it creates strong data coupling => hard to
scale
Data Immutability – key to Scalability
• Always available, no contention
• Always consistent, no need to synchronize
• Can be replicated freely whenever needed
Data Immutability – key to Scalability
• Append instead of update (GFS)
• Merge instead of update (SSTable)
• Add tombstone instead of delete (Cassandra)
SSTable (LSM-Tree)
• Commit Log (node): sequential write to maximize write throughput (vs B+ tree)
• SSTable (column family): immutable sorted string table; the index table is always in memory
• Merge to remove tombstone
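The merge step can be sketched as follows. This is a simplified in-memory model, not Cassandra's compaction code: each table is a sorted map, Optional.empty() stands in for a tombstone, and the two tables passed in are assumed to be the only ones holding these keys (otherwise dropping a tombstone could resurrect an older value in a third table).

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

final class SsTableMerge {
    // Merges two immutable tables into a new one. Newer entries win, and keys whose
    // latest entry is a tombstone (Optional.empty()) are dropped from the result.
    static TreeMap<String, String> compact(Map<String, Optional<String>> older,
                                           Map<String, Optional<String>> newer) {
        TreeMap<String, Optional<String>> merged = new TreeMap<>(older);
        merged.putAll(newer); // a newer write or delete overrides the older value
        TreeMap<String, String> live = new TreeMap<>();
        for (Map.Entry<String, Optional<String>> entry : merged.entrySet()) {
            entry.getValue().ifPresent(value -> live.put(entry.getKey(), value));
        }
        return live;
    }
}
```

Neither input is modified; the result is a third table, which is what lets readers keep using the old tables until the merge finishes.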
Shared nothing architecture
• nodes are independent and self-sufficient
• no single point of contention across the
system
• The invention of DHT
Hashing is great, but inconsistency (keys moving when
nodes join or leave) is a showstopper
Consistent hashing: nodes and keys meet in
one keyspace
Karger (MIT, 2001 - Chord)
Cassandra,
MapReduce
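A consistent hash ring can be sketched with a sorted map: nodes and keys are hashed into the same integer keyspace, and a key belongs to the first node found at or after its position. The class below is illustrative only; the virtual-node scheme and the bit-mixing hash are assumptions, and a production system would use a stronger hash.

```java
import java.util.Map;
import java.util.TreeMap;

final class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Spreads the bits of String.hashCode; illustrative, not a production-quality hash.
    private static int hash(String s) {
        int h = s.hashCode();
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h;
    }

    // Places the node at several positions on the ring to even out the load.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        ring.values().removeIf(node::equals);
    }

    // First node at or after the key's position, wrapping around to the start.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

Removing a node only reassigns the keys that node owned; every other key keeps its owner, which is the property plain modulo hashing lacks.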
HRW (highest random weight) hashing
An alternative solution: hash the data together with
each host, then pick the best fit
w1 = h(S1, O), w2 = h(S2, O), ..., wn = h(Sn, O)
Winner: wO = max {w1, w2, ..., wn}
David Thaler and Chinya Ravishankar (University of
Michigan, 1996)
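The formula above translates almost directly into code. In this sketch the weight function h(S, O) is an assumed stand-in (String.hashCode followed by a 64-bit bit mixer); any well-distributed hash of the (server, object) pair would do.

```java
import java.util.List;

final class RendezvousHash {
    // w = h(S, O): a pseudo-random weight for a (server, object) pair.
    static long weight(String server, String object) {
        long z = (server + "/" + object).hashCode();
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // The winner is the server with the highest weight for this object.
    // Ties go to the server listed first. Returns null for an empty list.
    static String pick(List<String> servers, String object) {
        String winner = null;
        long best = Long.MIN_VALUE;
        for (String server : servers) {
            long w = weight(server, object);
            if (winner == null || w > best) {
                winner = server;
                best = w;
            }
        }
        return winner;
    }
}
```

Unlike a hash ring, this needs no shared data structure: each client computes the winner from the server list alone, at the cost of one hash per server per lookup.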
MapReduce The post office model
MapReduce
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Word count:
Split file
map: (void, line) → list(word, 1)
Shuffle
reduce: (word, list(1)) → list(word, count)
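The word count steps above can be simulated on a single machine. The sketch below is plain Java, not the Hadoop API; it only mirrors the map, shuffle, and reduce signatures so the data flow is visible. Lowercasing and splitting on non-word characters are assumptions about what counts as a word.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class WordCount {
    // map: (void, line) -> list(word, 1)
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // shuffle: group all emitted values by key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // reduce: (word, list(1)) -> count
    static int reduce(String word, List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    // Runs the three phases in sequence over the input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }
}
```

In a real cluster each map call runs where its file split lives and the shuffle moves pairs across the network, but the function signatures stay the same.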
Apache Spark
• Developed by Berkeley AMPLab
• Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
• Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs)
• Hadoop MapReduce works on disk -> slow
RDDs are a distributed memory model -> fast
• Traditional distributed memory supports fine-grained
updates -> no fault tolerance, or it needs
extensive logging or replication
RDDs are immutable and created by coarse-grained
transformations (map, join, filter) ->
quickly rebuilt
Other interesting algorithms
• HyperLogLog (Cassandra)
• Skip List (Lucene, Redis, LevelDB)
• MurmurHash (Google, Cassandra)
• BallTree (Google Maps)
• Fractal Tree (MySQL, MongoDB)
• Dynamic Time Warping
Checklist
• Calculate performance in your design
• Estimate data size before you build it
• Good designs are always tailored
• Know your tools (Guava, GS Collections,
protobuf, Snappy…)
• Share with others

Modern software design in Big data era

  • 2. A quick Demo A simple Java program speeds up 10 times by adding some dummy variables at the end of the class declaration.
  • 4. Numbers Everyone Should Know (taken from Jeff Dean – Google keynote) •L1 cache reference 0.5 ns •Branch mispredict 5 ns •L2 cache reference 7 ns •Mutex lock/unlock 25 ns •Main memory reference 100 ns •Compress 1K w/cheap compression algorithm 3,000 ns •Send 2K bytes over 1 Gbps network 20,000 ns •Read 1 MB sequentially from memory 250,000 ns •Round trip within same datacenter 500,000 ns •Disk seek 10,000,000 ns •Read 1 MB sequentially from disk 20,000,000 ns •Send packet CA->Netherlands->CA 150,000,000 ns
  • 6. Some facts
      • Access time: L1 << L2 << RAM << disk
      • Sequential access is much faster than random access (10 times or more)
      • Cheap compression is faster than transferring the data over the network
      • Transfer time: 1 Gbps network < disk < 100 Mbps network
      • Zippy (now Snappy): encode @ 300 MB/s, decode @ 600 MB/s, 2-4X compression
      • gzip: encode @ 25 MB/s, decode @ 200 MB/s, 4-6X compression
      • https://code.google.com/p/snappy/
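The claim that cheap compression beats sending raw data can be checked with the latency numbers from slide 4. The sketch below is only a back-of-the-envelope estimate: it assumes costs scale linearly with size, uses a 2X compression ratio (the low end of the Zippy range quoted above), and ignores decompression on the receiving side. The class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate; the two constants come from the latency table on slide 4.
final class CompressionEstimate {
    static final double COMPRESS_NS_PER_KB = 3_000.0;  // "Compress 1K w/cheap compression algorithm"
    static final double SEND_NS_PER_KB = 20_000.0 / 2; // "Send 2K bytes over 1 Gbps network"

    /** Estimated time to send sizeKb kilobytes uncompressed. */
    static double sendRawNs(double sizeKb) {
        return sizeKb * SEND_NS_PER_KB;
    }

    /** Estimated time to compress sizeKb kilobytes and send the smaller result. */
    static double compressThenSendNs(double sizeKb, double ratio) {
        return sizeKb * COMPRESS_NS_PER_KB + (sizeKb / ratio) * SEND_NS_PER_KB;
    }

    public static void main(String[] args) {
        double sizeKb = 1024; // 1 MB payload
        System.out.printf("raw:        %.0f ns%n", sendRawNs(sizeKb));
        System.out.printf("compressed: %.0f ns%n", compressThenSendNs(sizeKb, 2.0));
    }
}
```

Under these assumptions a 1 MB payload takes roughly 10.2 ms raw versus 8.2 ms compressed, and the gap widens with better compression ratios or slower links.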
  • 7. Key to performance: improve memory efficiency
      • Java is bad at memory efficiency: int (4 bytes) -> Integer (16 bytes). Always prefer primitive types, but note that map keys must be objects.
      • 1M records, each with 5 string fields: 82 MB
          a. Map<String, Map<String, String>>: 706 MB
          b. Map<String, String[]>: 495 MB
          c. Map<String, byte[][]>: 292 MB
          d. ByteBuffer + Trove map: 92 MB
      • http://java-performance.info/overview-of-memory-saving-techniques-java/
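As an illustration of option (c), the sketch below stores each record's fields as UTF-8 byte arrays instead of String objects. It is a minimal example, not the code behind the measurements quoted above; the class name `CompactRecords` and its methods are hypothetical. The actual savings depend on the JVM version and on the data.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical record store: one byte[] per field instead of one String per field.
final class CompactRecords {
    private final Map<String, byte[][]> records = new HashMap<>();

    /** Encodes every field as UTF-8 and stores the record under the given key. */
    void put(String key, String... fields) {
        byte[][] packed = new byte[fields.length][];
        for (int i = 0; i < fields.length; i++) {
            packed[i] = fields[i].getBytes(StandardCharsets.UTF_8);
        }
        records.put(key, packed);
    }

    /** Decodes a single field on demand; returns null for an unknown key. */
    String field(String key, int index) {
        byte[][] packed = records.get(key);
        return packed == null ? null : new String(packed[index], StandardCharsets.UTF_8);
    }
}
```

The trade-off is CPU for memory: every read pays for a decode, so this layout suits data that is stored far more often than it is read.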
  • 8. Bloom filter: a hash set that stores no values. Question: how can it support removal?
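One common answer to the removal question is a counting Bloom filter, which replaces each bit with a small counter. The sketch below is one possible illustration, not necessarily the answer the slide had in mind. The double-hashing scheme built on `String.hashCode()` is an assumption chosen for brevity; a production filter would use a stronger hash and guard against counter overflow.

```java
// Counting Bloom filter sketch: counters instead of bits, so items can be removed.
final class CountingBloomFilter {
    private final int[] counters;
    private final int hashCount;

    CountingBloomFilter(int size, int hashCount) {
        this.counters = new int[size];
        this.hashCount = hashCount;
    }

    // Double hashing: the i-th position is derived from two base hashes.
    // The second hash is a hypothetical choice, forced odd so it is never zero.
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) counters[position(item, i)]++;
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (counters[position(item, i)] == 0) return false;
        }
        return true;
    }

    /** Only safe for items that were actually added; otherwise other items may become false negatives. */
    void remove(String item) {
        if (!mightContain(item)) return;
        for (int i = 0; i < hashCount; i++) counters[position(item, i)]--;
    }
}
```

The price of removal is memory: each slot grows from one bit to a counter, which erodes part of the space advantage that makes Bloom filters attractive.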
  • 9. Merkle tree (tree of hashes). Example: Cassandra gossip
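A Merkle tree lets two replicas compare a whole data set by exchanging a single root hash. The sketch below computes such a root with SHA-256. It is a simplified illustration: the rule of pairing an odd node with itself is an assumption, and it is not a description of how Cassandra builds its trees.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

final class MerkleTree {
    /** SHA-256 over the concatenation of the given byte arrays. */
    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (byte[] part : parts) digest.update(part);
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 must be available on every Java platform
        }
    }

    /** Root hash of the data blocks; an odd node at any level is paired with itself. */
    static byte[] root(List<String> blocks) {
        if (blocks.isEmpty()) throw new IllegalArgumentException("no blocks");
        List<byte[]> level = new ArrayList<>();
        for (String block : blocks) level.add(sha256(block.getBytes(StandardCharsets.UTF_8)));
        while (level.size() > 1) {
            List<byte[]> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = i + 1 < level.size() ? level.get(i + 1) : left;
                parents.add(sha256(left, right));
            }
            level = parents;
        }
        return level.get(0);
    }
}
```

If two roots match, the replicas hold the same blocks; if they differ, comparing child hashes level by level narrows the difference down to the blocks that need to be exchanged.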
  • 10. Data locality: key to performance
      • At the cache level, the CPU always fetches data in whole cache lines (64 bytes at a time)
          • Place variables used by the same thread close together
          • Place variables used by different threads at least 64 bytes apart (Java 8 introduced @Contended)
      • http://daniel.mitterdorfer.name/articles/2014/false-sharing/
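The effect of keeping per-thread variables 64 bytes apart can be tried with a small experiment like the one below. It is a sketch under stated assumptions, not the demo program from slide 2: the class names are hypothetical, object layout is decided by the JVM, and the timings vary by hardware, so treat any speedup as indicative only.

```java
final class FalseSharingDemo {
    static class Counter {
        volatile long value;
    }

    // Hypothetical padded variant: seven dummy longs (56 bytes) make each object
    // larger than a 64-byte cache line, so two instances cannot share a line for
    // their value fields.
    static final class PaddedCounter extends Counter {
        long p1, p2, p3, p4, p5, p6, p7;
    }

    /** Two threads each increment their own counter; returns elapsed nanoseconds. */
    static long run(Counter a, Counter b, long iterations) {
        Thread t1 = new Thread(() -> { for (long i = 0; i < iterations; i++) a.value++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iterations; i++) b.value++; });
        long start = System.nanoTime();
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long n = 50_000_000L;
        System.out.println("unpadded ns: " + run(new Counter(), new Counter(), n));
        System.out.println("padded ns:   " + run(new PaddedCounter(), new PaddedCounter(), n));
    }
}
```

Each counter is written by exactly one thread, so the results are correct in both runs. Only the time differs, because unpadded counters allocated back to back are likely to land on the same cache line.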
  • 11. Data locality: key to performance
      • At the memory and disk level, reusing the same data set is faster because the cache is warm
      • At the disk level, sequential access is 10 times faster than random access => write data sequentially in blocks
      • Examples: commit log, Bigtable row ranges
  • 12. Data locality: key to performance
      • At the network level, data locality means computing on the data where it lives: instead of moving data to the computation, move the computation to the data. (The CPU is faster than the network, so moving the computation is cheaper than moving the data.)
  • 13. Data decoupling: key to scalability
      • Model data from the reader/writer perspective to eliminate hotspots, instead of grouping data conceptually
      • Examples:
          • Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. (agent group, access group vs. agent skills)
          • Column (family) based databases
      • Anti-pattern: user settings in CfgPerson
  • 14. Data decoupling: key to scalability
      • Normalization or denormalization? That is the question.
      • For decades we have been taught that normalization is good: small size + consistency
      • But it creates strong data coupling, which makes the system hard to scale
  • 15. Data immutability: key to scalability
      • Always available, no contention
      • Always consistent, no need to synchronize
      • Can be replicated freely whenever needed
  • 16. Data immutability: key to scalability
      • Append instead of update (GFS)
      • Merge instead of update (SSTable)
      • Add a tombstone instead of deleting (Cassandra)
  • 17. SSTable
      • SSTable: immutable sorted string table; its index table is always in memory
      • Merging removes tombstones
  • 18. SSTable (LSM-tree)
      • Commit log (per node): sequential writes to maximize write throughput (vs. B+ tree)
      • SSTable (per column family): immutable sorted string table; its index table is always in memory
      • Merging removes tombstones
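The append, merge and tombstone ideas from the last few slides fit together as in the toy store below. It is an in-memory sketch only: the class name, the tombstone marker value and the full-merge compaction are assumptions made for illustration, and the commit log and on-disk format of a real LSM-tree are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.SortedMap;
import java.util.TreeMap;

final class TinyLsmStore {
    private static final String TOMBSTONE = "\u0000tombstone"; // hypothetical marker value

    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<SortedMap<String, String>> sstables = new ArrayList<>(); // oldest first

    void put(String key, String value) {
        memtable.put(key, value);
    }

    /** Deleting writes a tombstone; older tables are never modified. */
    void delete(String key) {
        memtable.put(key, TOMBSTONE);
    }

    /** Freezes the memtable into an immutable sorted table. */
    void flush() {
        if (memtable.isEmpty()) return;
        sstables.add(Collections.unmodifiableSortedMap(memtable));
        memtable = new TreeMap<>();
    }

    /** Reads check the memtable first, then tables from newest to oldest. */
    Optional<String> get(String key) {
        String value = memtable.get(key);
        for (int i = sstables.size() - 1; value == null && i >= 0; i--) {
            value = sstables.get(i).get(key);
        }
        return value == null || value.equals(TOMBSTONE) ? Optional.empty() : Optional.of(value);
    }

    /** Merges all tables into one; newer entries win and tombstones are dropped. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> table : sstables) merged.putAll(table);
        merged.values().removeIf(TOMBSTONE::equals);
        sstables.clear();
        if (!merged.isEmpty()) sstables.add(Collections.unmodifiableSortedMap(merged));
    }

    int tableCount() {
        return sstables.size();
    }
}
```

Dropping tombstones is safe here only because `compact()` merges every table at once, so no older copy of a deleted key can survive elsewhere.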
  • 19. Shared-nothing architecture
      • Nodes are independent and self-sufficient
      • No single point of contention across the system
      • The invention of the DHT (distributed hash table)
  • 20. Hashing is great, but inconsistency is a showstopper (with hash mod N, changing the number of nodes remaps most keys)
  • 21. Consistent hashing: data and nodes are hashed into the same keyspace
      • Karger et al. (MIT, 1997); used by Chord (2001)
      • Cassandra, MapReduce
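Consistent hashing can be sketched as a sorted ring on which both nodes and keys are placed; a key belongs to the first node found clockwise from its position. The version below is a minimal illustration, not Cassandra's implementation. The use of virtual nodes, the `node#i` naming and SHA-256 as the hash are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // First 8 bytes of SHA-256 as the ring position (an arbitrary choice for this sketch).
    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Places the node at several points on the ring to even out the load. */
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i), node);
    }

    /** The owner is the first node at or after the key's position, wrapping around. */
    String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

Adding a node only takes over keys that fall just before its ring positions; every other key keeps its owner, which is the property plain `hash mod N` lacks.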
  • 22. HRW (highest random weight) hashing
      • An alternative solution: hash each host together with the data object and pick the best fit
      • w1 = h(S1, O), w2 = h(S2, O), ..., wn = h(Sn, O)
      • Winner: wO = max {w1, w2, ..., wn}
      • David Thaler and Chinya Ravishankar (University of Michigan, 1996)
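The HRW rule translates almost directly into code: compute a weight h(S, O) for every server S and pick the maximum. In the sketch below the weight function (first 8 bytes of SHA-256 over `server|object`) and the class name are assumptions; any well-mixed hash of the pair would do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

final class RendezvousHash {
    /** w = h(S, O): a pseudo-random weight for the (server, object) pair. */
    static long weight(String server, String object) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest((server + "|" + object).getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the server with the highest weight for this object. */
    static String pick(List<String> servers, String object) {
        if (servers.isEmpty()) throw new IllegalArgumentException("no servers");
        String best = null;
        long bestWeight = 0;
        for (String server : servers) {
            long w = weight(server, object);
            if (best == null || w > bestWeight) {
                best = server;
                bestWeight = w;
            }
        }
        return best;
    }
}
```

When a server disappears, only the objects it was winning move, each to its runner-up. No ring or shared state is needed, at the cost of evaluating one hash per server on every lookup.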
  • 23. MapReduce: the post office model
  • 24. MapReduce
      • map: (K1, V1) → list(K2, V2)
      • reduce: (K2, list(V2)) → list(K3, V3)
      • Word count:
          • Split file
          • map: (void, line) → list(word, 1)
          • Shuffle
          • reduce: (word, list(1)) → list(word, count)
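The word count pipeline above can be mimicked on a single machine to make the three phases concrete. This is a plain-Java sketch, not Hadoop code; the class and method names are hypothetical, and a real framework would run the map and reduce calls in parallel across nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class WordCount {
    /** map: (void, line) -> list(word, 1) */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    /** shuffle: group all emitted values by key */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    /** reduce: (word, list(1)) -> count */
    static int reduce(String word, List<Integer> ones) {
        int count = 0;
        for (int one : ones) count += one;
        return count;
    }

    /** Runs the three phases in sequence over the input lines. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(word, ones)));
        return counts;
    }
}
```

Because `map` looks at one line and `reduce` at one word, both can be spread over many machines; only the shuffle needs to move data between them.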
  • 26. Apache Spark
      • Developed by Berkeley AMPLab
      • Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
      • Resilient Distributed Datasets (RDDs)
  • 27. Resilient Distributed Datasets (RDDs)
      • Hadoop MapReduce works on disk -> slow; RDDs are a distributed memory model -> fast
      • Traditional distributed memory supports fine-grained updates -> no fault tolerance without extensive logging or replication
      • RDDs are immutable and created by coarse-grained transformations (map, join, filter) -> they can be rebuilt quickly
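The reason immutable, coarse-grained datasets are cheap to recover can be shown with a toy class that remembers how it was derived rather than logging every update. This is a single-machine sketch of the lineage idea only and does not use the Spark API; `LineageDataset` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class LineageDataset<T> {
    private final Supplier<List<T>> lineage; // recipe to rebuild this dataset from its parent
    private List<T> cached;                  // in-memory copy, which may be lost

    private LineageDataset(Supplier<List<T>> lineage) {
        this.lineage = lineage;
    }

    static <T> LineageDataset<T> of(List<T> source) {
        List<T> copy = Collections.unmodifiableList(new ArrayList<>(source));
        return new LineageDataset<T>(() -> copy);
    }

    /** Coarse-grained transformation: returns a new dataset, the parent is untouched. */
    <R> LineageDataset<R> map(Function<T, R> f) {
        return new LineageDataset<R>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : collect()) out.add(f.apply(item));
            return Collections.unmodifiableList(out);
        });
    }

    LineageDataset<T> filter(Predicate<T> keep) {
        return new LineageDataset<T>(() -> {
            List<T> out = new ArrayList<>();
            for (T item : collect()) if (keep.test(item)) out.add(item);
            return Collections.unmodifiableList(out);
        });
    }

    /** Materializes the data, recomputing from the lineage if the cache is empty. */
    List<T> collect() {
        if (cached == null) cached = lineage.get();
        return cached;
    }

    /** Simulates losing the in-memory copy, for example after a node failure. */
    void evict() {
        cached = null;
    }
}
```

After `evict()`, the next `collect()` replays the recorded transformation on the parent and yields the same result, so no per-update log or replica is needed.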
  • 28. Other interesting algorithms
      • HyperLogLog (Cassandra)
      • Skip list (Lucene, Redis, LevelDB)
      • MurmurHash (Google, Cassandra)
      • Ball tree (Google Maps)
      • Fractal tree (MySQL, MongoDB)
      • Dynamic time warping
  • 29. Checklist
      • Calculate performance in your design
      • Estimate data size before you build it
      • Good designs are always tailored
      • Know your tools (Guava, GS Collections, protobuf, Snappy…)
      • Share with others