2. Definition
● Concurrency a.k.a the number of simultaneous requests; latency
● Throughput a.k.a the total number of items processed per unit time
● Extensibility - designing the application for the ability to add new features etc.
● We'd be mostly talking about the first two.
3. Concurrency & Performance
● Scalability is measured as the number of requests/users an application supports without degrading performance.
● Performance is mostly a measure of individual request processing time.
4. Handling Scale
● Throttling
● Cache
● Stateful vs. stateless
● Asynchronous vs. synchronous
● Service oriented design
5. Where (Multi tiered)
● At the client (Browser)
○ HTTP headers
○ Asynchronous calls
○ local DB
● At the server (Web tier/application tier)
○ Cache -- distributed
○ Stateless
○ Asynchronous
● DB
○ Cap theorem
6. Client
● HTTP headers
○ Cache headers (Cache-Control, Pragma) not only enable caching on browsers but also help intelligent proxies.
○ YSlow/Google PageSpeed guidelines are always useful.
○ ETags and long expiry times are very good practices.
○ Sprites and image maps
● Ajax is good for scalability but may sometimes cause performance issues.
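The ETag/long-expiry practice above can be sketched as follows; a minimal sketch in which `cacheHeadersFor`, `etagFor` and the sample payload are hypothetical, not a real framework API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.LinkedHashMap;
import java.util.Map;

public class CacheHeaders {
    // Hypothetical helper: derive an ETag from the body and pair it with a
    // long expiry, so browsers and intermediate proxies can cache the response.
    static Map<String, String> cacheHeadersFor(byte[] body) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Cache-Control", "public, max-age=31536000"); // one year
        headers.put("ETag", etagFor(body));
        return headers;
    }

    // Strong ETag: a hex digest of the body, quoted per the HTTP spec.
    static String etagFor(byte[] body) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(body);
            StringBuilder sb = new StringBuilder("\"");
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.append('"').toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sprite = "pretend these are sprite image bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(cacheHeadersFor(sprite));
        // A later request carrying If-None-Match with the same ETag can be
        // answered with 304 Not Modified, saving the body transfer entirely.
    }
}
```

The same pair of headers is what makes sprites and other static assets effectively free after the first visit.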
7. Client Server Network
● Always compress responses.
● Even for JSON the bandwidth gains are great.
● In server-to-server calls, consider binary or otherwise more efficient protocols.
● Even on the web, network-layer protocols like SPDY are interesting.
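A quick way to see the JSON bandwidth gain is to gzip a repetitive payload, as a server would for a client sending `Accept-Encoding: gzip`; the sample records below are made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressJson {
    // Gzip a response body in memory, as the web tier would before sending it.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // JSON is highly redundant: the same field names recur in every record,
        // which is exactly what dictionary-based compression exploits.
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < 500; i++) {
            json.append("{\"id\":").append(i).append(",\"status\":\"ok\"},");
        }
        json.setLength(json.length() - 1);
        json.append("]");
        byte[] raw = json.toString().getBytes(StandardCharsets.UTF_8);
        byte[] zipped = gzip(raw);
        System.out.printf("raw=%d bytes, gzipped=%d bytes%n", raw.length, zipped.length);
    }
}
```

The zipped size is typically a small fraction of the raw size for payloads like this, which is why compressing JSON responses pays off despite the CPU cost.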
8. Server -- Numbers all should know
● http://static.googleusercontent.com/media/research.
google.com/en//people/jeff/stanford-295-talk.pdf
● Writes are heavy.
● A disk seek costs more than a network round trip combined with a memory read.
● Global shared data is expensive, if locking is involved.
● Reads do not need to be transactional, just consistent.
● Eventual consistency is useful.
9. Server - Cache (Low latency)
● Cache
○ Complete HTML response
○ Output from Database
● Cache strategy is determined by the audience of the data
○ Is it a broadcast (same for everyone)?
○ Is it a multicast (shared by a group)?
○ A unicast (per user)?
● Cache works best for broadcast data.
● Distributed Caching with consistent hash works very well.
● The main pitfall is cache purging (invalidation).
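The consistent-hashing idea above can be sketched as a hash ring with virtual nodes on a `TreeMap`; a minimal sketch with hypothetical cache node names, not a production client:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHash(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Hash a string to a point on the ring (first 8 bytes of an MD5 digest).
    static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Each physical node appears many times on the ring for even key spread.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i));
    }

    // A key maps to the first node clockwise from its hash on the ring.
    String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHash ring = new ConsistentHash(100);
        ring.addNode("cache-a");
        ring.addNode("cache-b");
        ring.addNode("cache-c");
        System.out.println("user42 -> " + ring.nodeFor("user42"));
        // Removing one node only remaps the keys that lived on it; the rest
        // keep their assignment, so most of the cache stays warm.
        ring.removeNode("cache-b");
        System.out.println("user42 -> " + ring.nodeFor("user42"));
    }
}
```

This is why consistent hashing "works very well" for distributed caches: adding or removing a node invalidates roughly 1/N of the keys instead of nearly all of them.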
10. Server (Concurrency)
● Sequential processing leaves the CPU and other resources idle.
● Write parallelism is very important.
● But Shared globals are heavy, hence a trade off.
● In case of Java, JMM understanding is necessary.
● Amdahl’s Law helps in determining the maximum gain
that can be achieved with parallel implementations.
● Even when parallelized, a small fraction of sequential work can cause a loss of throughput.
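Amdahl's Law from the bullet above, as a worked example: speedup = 1 / (s + (1 - s)/n), where s is the sequential fraction and n the number of workers.

```java
public class Amdahl {
    // Amdahl's Law: speedup = 1 / (s + (1 - s) / n),
    // where s is the sequential fraction and n is the number of workers.
    static double speedup(double sequentialFraction, int workers) {
        return 1.0 / (sequentialFraction + (1.0 - sequentialFraction) / workers);
    }

    public static void main(String[] args) {
        // Even 5% sequential work caps the gain: with 16 cores the speedup
        // is only ~9.1x, and no number of cores can exceed 1/0.05 = 20x.
        System.out.printf("16 cores, 5%% sequential: %.1fx%n", speedup(0.05, 16));
        System.out.printf("1024 cores, 5%% sequential: %.1fx%n", speedup(0.05, 1024));
    }
}
```

This is the quantitative version of the last bullet: the sequential fraction, not the core count, sets the ceiling.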
11. Server (State?full:less)
● Given shared access is expensive, keeping state on
server is heavy.
● Sessions, if kept in shared memory, are great.
● No session and share-nothing works best.
● Even a cache is better.
● Generally stateless code is modular, easier to unit test
and easier to profile.
● Keep state on the function stack rather than on the heap.
● Stateless design helps in scaling out.
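A minimal contrast between the two styles above; the shopping-cart domain is an invented illustration, not from the deck:

```java
import java.util.Map;

public class StatelessDemo {
    // Stateful: the result depends on instance state accumulated across calls,
    // so the next request must reach the same server (sticky sessions).
    static class StatefulCart {
        private double total = 0;

        double addItem(double price) {
            total += price;
            return total;
        }
    }

    // Stateless: everything the computation needs arrives with the request
    // (from the client or a shared store), so any server in the pool can
    // handle it - and it is trivial to unit test.
    static double cartTotal(Map<String, Double> itemPrices) {
        return itemPrices.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(cartTotal(Map.of("book", 12.5, "pen", 1.5))); // 14.0
    }
}
```

The stateless version is a pure function of its arguments, which is exactly what makes scale-out (adding identical servers behind a load balancer) safe.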
12. Server Synchronous/Asynchronous
● Waiting for I/O, network connections, DB queries is bad.
● What about a “query of death” on writes?
● Writes, unless very small, should be kept asynchronous.
● Helps on parallelization.
● Reliable queues can improve latency.
● idempotent code helps in avoiding many pitfalls.
● Generally asynchrony is achieved via
○ Queue/topic-based infrastructure
■ Good for event processing and propagation of events
○ Incremental batches
● Async I/O servers: Node.js, nginx, Apache event MPM
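The idempotency point above, sketched with an in-memory queue; `processed`, `applied` and the message IDs are hypothetical, and a real system would keep the dedup set in shared storage:

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotentConsumer {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    final AtomicInteger applied = new AtomicInteger();

    // Applying the same message twice has no extra effect, so a reliable
    // queue may safely redeliver on failure (at-least-once delivery).
    void handle(String messageId) {
        if (!processed.add(messageId)) return; // duplicate: skip silently
        applied.incrementAndGet();             // the real write would go here
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        IdempotentConsumer consumer = new IdempotentConsumer();
        // Simulate a redelivery: msg-1 arrives twice.
        queue.put("msg-1");
        queue.put("msg-2");
        queue.put("msg-1");
        while (!queue.isEmpty()) consumer.handle(queue.take());
        System.out.println("applied: " + consumer.applied.get()); // applied: 2
    }
}
```

This is the pitfall idempotency avoids: with at-least-once queues, a non-idempotent consumer would apply the duplicate write twice.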
13. Debugging for Scale
● Profile
○ In Java
■ gc logs
■ JVisualVM
■ Thread and memory dumps
○ GNU/Linux
■ hprof
■ strace
■ gdb
■ system utilities
14. Scale Horizontal vs. vertical
● For a stateless, asynchronous, idempotent and multithreaded application, horizontal scaling works very well.
● Easier to understand with storage, a.k.a databases.
15. Database
● Which type of DBMS ?
○ RDBMS
○ Keyspace-based multi-column-family stores
○ Document based
○ Graph
○ Any other NoSQL?
○ Solr and Elasticsearch
16. Database scale out limitation
● CAP theorem
○ Consistency
○ Availability
○ Partition tolerance
○ Only two of the three can be guaranteed simultaneously
● Eventual consistency is the preferred choice.
17. RDBMS
● Always query on an index.
● For an RDBMS, a query of death is a death knock.
● Generally, writing to one master and reading from multiple slaves works better.
● To normalize or not?
● Normalize for extensibility.
● Use Solr/NoSQL for read scale.
● One complex multi-table join or multiple simple queries? (performance vs. scale)
18. NoSQL
● Several options, ranging from document databases to multi-column-family stores
● We mostly use
○ Mongo
○ Cassandra
○ Neo4j (in some cases)
○ Titan
● Provide very high throughput with manageable
clustering/sharding
19. Mongo (iBeat)
● Increasing data volumes threaten scalability and availability
● Though search is available, it’s not very
efficient.
● The limit of a single document is 16 MB.
● Repairing the DB and reindexing do impact performance.
20. Mongo (iBeat ..)
● Mongo sharding as a solution
● Data volume per replica set decreased.
● GridFS was used to work around the document size limit.
● With fewer documents per shard, the overhead of indexes etc. reduced.
● But sharding itself, with a large amount of data, took a long time to complete.
21. Big Data
● Normally associated with data so large and complex that traditional data management/visualization tools fail to capture, curate or process it.
● The current definition identifies three aspects, a.k.a the 3 Vs
○ Volume
○ Velocity
○ Variety
● General usage is in
○ Genetic algorithms
○ Machine learning
○ Natural language processing
○ Time series analysis (e.g. attribution analysis)
○ Visualizations
22. Big Data
● Our usage is
○ Analytics
○ User preference, personalization, profiling
○ Recommendation
○ Decision support system
● The standard well-known open source ecosystems
○ Hadoop
○ Event processors/stream engines, e.g. Storm, Spark, S4
23. Big data (Hadoop..)
● Hadoop - originally a component of Nutch, it is now the biggest driver of big data technologies.
● MapReduce - a mechanism/framework for running massively parallel computations, originally published by Google.
● MapReduce - the trick is distributed sorting.
● New languages for statistical computation, e.g. R
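MapReduce's map/shuffle/reduce shape - with the shuffle playing the role of the distributed sort - can be mimicked on one machine with Java streams; a toy word count, not Hadoop code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCount {
    // map: emit one token per word; shuffle: group by key into sorted order
    // (the single-machine analogue of MapReduce's distributed sort);
    // reduce: sum the counts per key.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data big deal", "data wins")));
        // {big=2, data=2, deal=1, wins=1}
    }
}
```

In real Hadoop the grouping step runs across machines: mappers partition output by key, and the framework sorts and merges each partition before it reaches a reducer.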
24. Hadoop stack components
Image borrowed from http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop-2013-part-two-projects/
25. Big data - Real time analysis
● While MapReduce is a great throughput solution, it doesn't help with real-time or near-real-time processing.
● Ecosystems are evolving, coupled either with MapReduce or with HDFS.
● Storm/Spark Streaming augment MapReduce-based computations.