2. Definition
● Concurrency a.k.a the number of simultaneous requests; latency
● Throughput a.k.a the total number of items processed per unit time
● Extensibility - designing the application for the ability to add new features etc.
● We'd be mostly talking about the first two.
3. Concurrency & Performance
● Scalability is measured as the number of requests/users an application supports without degrading performance.
● Performance is mostly a measure of individual request processing time.
4. Handling Scale
● Throttling
● Cache
● Stateful vs. stateless
● Asynchronous vs. synchronous
● Service oriented design
5. Where (Multi tiered)
● At the client (Browser)
○ HTTP headers
○ Asynchronous calls
○ local DB
● At the server (Web tier/application tier)
○ Cache -- distributed
○ Stateless
○ Asynchronous
● DB
○ Cap theorem
6. Client
● HTTP headers
○ Cache headers (Cache-Control, Pragma) not only enable caching on browsers but also help intelligent proxies.
○ YSlow/Google PageSpeed guidelines are always useful.
○ ETags and long expiry times are very good practices.
○ Sprites and image maps
● Ajax is good for scalability but may sometimes cause performance issues.
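The ETag/long-expiry practice above can be sketched as follows; a minimal sketch in which `cacheHeadersFor`, `etagFor` and the sample payload are hypothetical, not a real framework API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.LinkedHashMap;
import java.util.Map;

public class CacheHeaders {
    // Hypothetical helper: derive an ETag from the body and pair it with a
    // long expiry, so browsers and intermediate proxies can cache the response.
    static Map<String, String> cacheHeadersFor(byte[] body) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Cache-Control", "public, max-age=31536000"); // one year
        headers.put("ETag", etagFor(body));
        return headers;
    }

    // Strong ETag: a hex digest of the body, quoted per the HTTP spec.
    static String etagFor(byte[] body) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(body);
            StringBuilder sb = new StringBuilder("\"");
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.append('"').toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sprite = "pretend these are sprite image bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(cacheHeadersFor(sprite));
        // A later request carrying If-None-Match with the same ETag can be
        // answered with 304 Not Modified, saving the body transfer entirely.
    }
}
```

The same pair of headers is what makes sprites and other static assets effectively free after the first visit.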
7. Client Server Network
● Always compress responses.
● Even for JSON the bandwidth gains are great.
● In server-to-server calls, consider binary or otherwise more efficient protocols.
● Even on the web, network-layer protocols like SPDY are interesting.
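A quick way to see the JSON bandwidth gain is to gzip a repetitive payload, as a server would for a client sending `Accept-Encoding: gzip`; the sample records below are made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressJson {
    // Gzip a response body in memory, as the web tier would before sending it.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // JSON is highly redundant: the same field names recur in every record,
        // which is exactly what dictionary-based compression exploits.
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < 500; i++) {
            json.append("{\"id\":").append(i).append(",\"status\":\"ok\"},");
        }
        json.setLength(json.length() - 1);
        json.append("]");
        byte[] raw = json.toString().getBytes(StandardCharsets.UTF_8);
        byte[] zipped = gzip(raw);
        System.out.printf("raw=%d bytes, gzipped=%d bytes%n", raw.length, zipped.length);
    }
}
```

The zipped size is typically a small fraction of the raw size for payloads like this, which is why compressing JSON responses pays off despite the CPU cost.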
8. Server -- Numbers all should know
● http://static.googleusercontent.com/media/research.
google.com/en//people/jeff/stanford-295-talk.pdf
● Writes are heavy.
● A disk seek costs more than a network round trip combined with a memory read.
● Global shared data is expensive, if locking is involved.
● Reads do not need to be transactional, just consistent.
● Eventual consistency is useful.
9. Server - Cache (Low latency)
● Cache
○ Complete HTML response
○ Output from Database
● Cache strategy is determined by the audience of the data
○ Is it a broadcast (same for everyone)?
○ Is it a multicast (shared by a group)?
○ A unicast (per user)?
● Cache works best for broadcast data.
● Distributed Caching with consistent hash works very well.
● The main pitfall is cache purging (invalidation).
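The consistent-hashing idea above can be sketched as a hash ring with virtual nodes on a `TreeMap`; a minimal sketch with hypothetical cache node names, not a production client:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHash(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Hash a string to a point on the ring (first 8 bytes of an MD5 digest).
    static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Each physical node appears many times on the ring for even key spread.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i));
    }

    // A key maps to the first node clockwise from its hash on the ring.
    String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHash ring = new ConsistentHash(100);
        ring.addNode("cache-a");
        ring.addNode("cache-b");
        ring.addNode("cache-c");
        System.out.println("user42 -> " + ring.nodeFor("user42"));
        // Removing one node only remaps the keys that lived on it; the rest
        // keep their assignment, so most of the cache stays warm.
        ring.removeNode("cache-b");
        System.out.println("user42 -> " + ring.nodeFor("user42"));
    }
}
```

This is why consistent hashing "works very well" for distributed caches: adding or removing a node invalidates roughly 1/N of the keys instead of nearly all of them.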
10. Server (Concurrency)
● Sequential processing leaves the CPU and other resources idle.
● Write parallelism is very important.
● But Shared globals are heavy, hence a trade off.
● In case of Java, JMM understanding is necessary.
● Amdahl’s Law helps in determining the maximum gain
that can be achieved with parallel implementations.
● Even when parallelized, a small fraction of sequential work can cause a loss of throughput.
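Amdahl's Law from the bullet above, as a worked example: speedup = 1 / (s + (1 - s)/n), where s is the sequential fraction and n the number of workers.

```java
public class Amdahl {
    // Amdahl's Law: speedup = 1 / (s + (1 - s) / n),
    // where s is the sequential fraction and n is the number of workers.
    static double speedup(double sequentialFraction, int workers) {
        return 1.0 / (sequentialFraction + (1.0 - sequentialFraction) / workers);
    }

    public static void main(String[] args) {
        // Even 5% sequential work caps the gain: with 16 cores the speedup
        // is only ~9.1x, and no number of cores can exceed 1/0.05 = 20x.
        System.out.printf("16 cores, 5%% sequential: %.1fx%n", speedup(0.05, 16));
        System.out.printf("1024 cores, 5%% sequential: %.1fx%n", speedup(0.05, 1024));
    }
}
```

This is the quantitative version of the last bullet: the sequential fraction, not the core count, sets the ceiling.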
11. Server (State?full:less)
● Given shared access is expensive, keeping state on
server is heavy.
● Sessions, if kept in shared memory, are great.
● No session and share-nothing works best.
● Even a cache is better.
● Generally stateless code is modular, easier to unit test
and easier to profile.
● Keep state on the function stack rather than on the heap.
● Stateless design helps in scaling out.
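A minimal contrast between the two styles above; the shopping-cart domain is an invented illustration, not from the deck:

```java
import java.util.Map;

public class StatelessDemo {
    // Stateful: the result depends on instance state accumulated across calls,
    // so the next request must reach the same server (sticky sessions).
    static class StatefulCart {
        private double total = 0;

        double addItem(double price) {
            total += price;
            return total;
        }
    }

    // Stateless: everything the computation needs arrives with the request
    // (from the client or a shared store), so any server in the pool can
    // handle it - and it is trivial to unit test.
    static double cartTotal(Map<String, Double> itemPrices) {
        return itemPrices.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(cartTotal(Map.of("book", 12.5, "pen", 1.5))); // 14.0
    }
}
```

The stateless version is a pure function of its arguments, which is exactly what makes scale-out (adding identical servers behind a load balancer) safe.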
12. Server Synchronous/Asynchronous
● Waiting for I/O, network connections, DB queries is bad.
● What about a “query of death” on writes?
● Writes, unless very small, should be kept asynchronous.
● Helps on parallelization.
● Reliable queues can improve latency.
● idempotent code helps in avoiding many pitfalls.
● Generally asynchrony is achieved via
○ Queue/topic-based infrastructure
■ Good for event processing and propagation of events
○ Incremental batches
● Async I/O servers: Node.js, nginx, Apache event MPM
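The idempotency point above, sketched with an in-memory queue; `processed`, `applied` and the message IDs are hypothetical, and a real system would keep the dedup set in shared storage:

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotentConsumer {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    final AtomicInteger applied = new AtomicInteger();

    // Applying the same message twice has no extra effect, so a reliable
    // queue may safely redeliver on failure (at-least-once delivery).
    void handle(String messageId) {
        if (!processed.add(messageId)) return; // duplicate: skip silently
        applied.incrementAndGet();             // the real write would go here
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        IdempotentConsumer consumer = new IdempotentConsumer();
        // Simulate a redelivery: msg-1 arrives twice.
        queue.put("msg-1");
        queue.put("msg-2");
        queue.put("msg-1");
        while (!queue.isEmpty()) consumer.handle(queue.take());
        System.out.println("applied: " + consumer.applied.get()); // applied: 2
    }
}
```

This is the pitfall idempotency avoids: with at-least-once queues, a non-idempotent consumer would apply the duplicate write twice.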
13. Debugging for Scale
● Profile
○ In Java
■ gc logs
■ JVisualVM
■ Thread and memory dumps
○ GNU/Linux
■ hprof
■ strace
■ gdb
■ system utilities
14. Scale Horizontal vs. vertical
● For a stateless, asynchronous, idempotent and multithreaded application, horizontal scaling works very well.
● Easier to understand with storage, a.k.a databases.
15. Database
● Which type of DBMS ?
○ RDBMS
○ Keyspace-based multi-column-family stores
○ Document based
○ Graph
○ Any other NoSQL?
○ Solr and Elasticsearch
16. Database scale out limitation
● CAP theorem
○ Consistency
○ Availability
○ Partition tolerance
○ Only two of the three can be guaranteed simultaneously
● Eventual consistency is the preferred choice.
17. RDBMS
● Always query on an index.
● For an RDBMS, a query of death is a death knock.
● Generally, writing to one master and reading from multiple slaves works better.
● To normalize or not?
● Normalize for extensibility.
● Use Solr/NoSQL for read scale.
● One complex multi-table join or multiple simple queries? (performance vs. scale)
18. NoSQL
● Several options, ranging from document databases to multi-column-family stores
● We mostly use
○ Mongo
○ Cassandra
○ Neo4j (in some cases)
○ Titan
● Provide very high throughput with manageable
clustering/sharding
19. Mongo (iBeat)
● Increasing data volumes threaten scalability and availability
● Though search is available, it’s not very
efficient.
● The limit of a single document is 16 MB.
● Repairing the DB and reindexing do impact performance.
20. Mongo (iBeat ..)
● Mongo sharding as a solution
● Data volume per replica set decreased.
● GridFS was used to work around the document size limit.
● With fewer documents per shard, the overhead of indexes etc. reduced.
● But sharding itself, with a large amount of data, took a long time to complete.
21. Big Data
● Normally associated with data so large and complex that traditional data management/visualization tools fail to capture, curate or process it.
● The current definition identifies three aspects, a.k.a the 3 Vs
○ Volume
○ Velocity
○ Variety
● General usage is in
○ Genetic algorithms
○ Machine learning
○ Natural language processing
○ Time series analysis (e.g. attribution analysis)
○ Visualizations
22. Big Data
● Our usage is
○ Analytics
○ User preference, personalization, profiling
○ Recommendation
○ Decision support system
● The standard well-known open source ecosystems
○ Hadoop
○ Event processors/stream engines, e.g. Storm, Spark, S4
23. Big data (Hadoop..)
● Hadoop - originally a component of Nutch, it is now the biggest driver of big data technologies.
● MapReduce - a mechanism/framework for running massively parallel computations, originally published by Google.
● MapReduce - the trick is distributed sorting.
● New languages for statistical computation, e.g. R
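MapReduce's map/shuffle/reduce shape - with the shuffle playing the role of the distributed sort - can be mimicked on one machine with Java streams; a toy word count, not Hadoop code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCount {
    // map: emit one token per word; shuffle: group by key into sorted order
    // (the single-machine analogue of MapReduce's distributed sort);
    // reduce: sum the counts per key.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data big deal", "data wins")));
        // {big=2, data=2, deal=1, wins=1}
    }
}
```

In real Hadoop the grouping step runs across machines: mappers partition output by key, and the framework sorts and merges each partition before it reaches a reducer.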
24. Hadoop stack components
Image borrowed from http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop-2013-part-two-projects/
25. Big data - Real time analysis
● While MapReduce is a great throughput solution, it doesn't help with real-time or near-real-time processing.
● Ecosystems are evolving, coupled either with MapReduce or with HDFS.
● Storm/Spark Streaming augment MapReduce-based computations.