- LiveJournal began as a college hobby project in 1999 and has since grown to over 10 million accounts
- As LiveJournal scaled up, they moved from a single server architecture to multiple server clusters to improve load balancing and high availability
- LiveJournal built several open source tools to address scaling needs, including memcached for caching, MogileFS for distributed file storage, and TheSchwartz for asynchronous job dispatch
usenix
1. http://danga.com/words/
LiveJournal: Behind The Scenes
Scaling Storytime
June 2007
USENIX
Brad Fitzpatrick
brad@danga.com
danga.com / livejournal.com / sixapart.com
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/1.0/ or send a letter to
Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
2. http://danga.com/words/
The plan...
Refer to previous presentations for more details...
http://danga.com/words/
Questions anytime! Yell. Interrupt.
Part 0:
− show where talk will end up
Part I:
− What is LiveJournal? Quick history.
− LJ’s scaling history
Part II:
− explain all our software,
− explain all the moving parts
3. http://danga.com/words/
LiveJournal Backend: Today
(Roughly.)
[Architecture diagram. Components shown:]
− net.
− BIG-IP: bigip1, bigip2
− perlbal (httpd/proxy): proxy1 ... proxy5
− mod_perl: web1 ... webN
− Memcached: mc1 ... mcN
− Global Database: master_a, master_b, slave1 ... slave5
− User DB Clusters 1 ... N: uc1a/uc1b, uc2a/uc2b, uc3a/uc3b, ..., ucNa/ucNb
− Job Queues (xN): jqNa, jqNb
− MogileFS Database: mog_a, mog_b
− Mogile Trackers: tracker1, tracker3
− Mogile Storage Nodes: sto1, sto2, ..., sto8
− djabberd (three instances)
− gearmand: gearmand1 ... gearmandN
− “workers”: gearwrkN, theschwkN
− slave1 ... slaveN
4. http://danga.com/words/
LiveJournal Overview
college hobby project, Apr 1999
4-in-1:
− blogging
− forums
− social-networking (“friends”)
− aggregator: “friends page”
− “friends” can be external RSS/Atom
10M+ accounts
Open Source!
− server,
− infrastructure,
− original clients,
12. http://danga.com/words/
Quick Scaling History
1 server to hundreds...
you can do all this with just 1 server!
− then you’re ready for tons of servers, without pain
− don’t repeat our scaling mistakes
13. http://danga.com/words/
Terminology
Scaling:
− NOT: “How fast?”
− But: “When you add twice as many servers, are you twice as fast (or have twice the capacity)?”
Fast still matters,
− 2x faster: 50 servers instead of 100... that’s some good money
− but that’s not what scaling is.
14. http://danga.com/words/
Terminology
“Cluster”
− varying definitions... basically:
− making a bunch of computers work together for some purpose
− what purpose?
load balancing (LB),
high availability (HA)
Load Balancing?
High Availability?
Venn Diagram time!
− I love Venn Diagrams
20. http://danga.com/words/
Two Servers - Problems
Two single points of failure!
No hot or cold spares
Site gets slow again.
− CPU-bound on web node
− need more web nodes...
21. http://danga.com/words/
Four Servers
3 webs, 1 db
Now we need to load-balance!
LVS, mod_backhand, whackamole, BIG-IP, Alteon, pound, Perlbal, etc.
− ...
28. http://danga.com/words/
Spreading Writes
Our database machines already did RAID
We did backups
So why put user data on 6+ slave machines? (~12+ disks)
− overkill redundancy
− wasting time writing everywhere!
29. http://danga.com/words/
Partition your data!
Spread your databases out, into “roles”
− roles that you never need to join between
different users
or accept you'll have to join in app
Each user assigned to a numbered HA cluster
Each cluster has multiple machines
− writes self-contained in cluster (writing to 2-3 machines, not 6)
34. http://danga.com/words/
User Clusters
SELECT userid, clusterid FROM user WHERE user='bob'
userid: 839
clusterid: 2
SELECT .... FROM ... WHERE userid=839 ...
OMG i like totally hate my parents they just dont understand me and i h8 the world omg lol rofl *! :^-^^;
add me as a friend!!!
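A minimal sketch of the two-step lookup on this slide, using in-memory SQLite databases as stand-ins for the global database and the user clusters. The `posts` table and the `posts_for` helper are hypothetical; only the `user`, `userid`, `clusterid` names and the values for 'bob' come from the slide.

```python
import sqlite3

# Stand-in for the global database: maps a username to (userid, clusterid).
global_db = sqlite3.connect(":memory:")
global_db.execute(
    "CREATE TABLE user (userid INTEGER PRIMARY KEY, user TEXT UNIQUE, clusterid INTEGER)"
)
global_db.execute("INSERT INTO user VALUES (839, 'bob', 2)")

# Stand-ins for the user clusters; each one holds data only for its own users.
# The table name "posts" is hypothetical.
clusters = {n: sqlite3.connect(":memory:") for n in (1, 2, 3)}
for db in clusters.values():
    db.execute(
        "CREATE TABLE posts (userid INTEGER, postid INTEGER, body TEXT, "
        "PRIMARY KEY (userid, postid))"
    )
clusters[2].execute("INSERT INTO posts VALUES (839, 1, 'add me as a friend!!!')")


def posts_for(username):
    """Ask the global DB which cluster owns the user, then query only that cluster."""
    row = global_db.execute(
        "SELECT userid, clusterid FROM user WHERE user = ?", (username,)
    ).fetchone()
    if row is None:
        return []
    userid, clusterid = row
    return clusters[clusterid].execute(
        "SELECT postid, body FROM posts WHERE userid = ? ORDER BY postid", (userid,)
    ).fetchall()


print(posts_for("bob"))
```

The global lookup is small and cacheable; everything else about the user lives on one cluster, so no query needs to join across clusters.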
35. http://danga.com/words/
Details
per-user numberspaces
− don't use AUTO_INCREMENT
− PRIMARY KEY (user_id, thing_id)
− so:
Can move/upgrade users 1-at-a-time:
− per-user “readonly” flag
− per-user “schema_ver” property
− user-moving harness
job server that coordinates, distributed long-lived user-mover clients who ask for tasks
− balancing disk I/O, disk space
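One possible way to get per-user numberspaces without AUTO_INCREMENT, sketched with SQLite. The `counter` and `thing` tables and the `add_thing` helper are assumptions for illustration; the slide only specifies the composite key (user_id, thing_id).

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One counter row per user replaces a table-wide AUTO_INCREMENT.
db.execute("CREATE TABLE counter (user_id INTEGER PRIMARY KEY, max_id INTEGER NOT NULL)")
db.execute(
    "CREATE TABLE thing (user_id INTEGER, thing_id INTEGER, body TEXT, "
    "PRIMARY KEY (user_id, thing_id))"
)


def add_thing(user_id, body):
    """Allocate the next thing_id in this user's own numberspace and insert the row."""
    with db:  # one transaction: bump the counter and insert together
        db.execute("INSERT OR IGNORE INTO counter VALUES (?, 0)", (user_id,))
        db.execute("UPDATE counter SET max_id = max_id + 1 WHERE user_id = ?", (user_id,))
        (thing_id,) = db.execute(
            "SELECT max_id FROM counter WHERE user_id = ?", (user_id,)
        ).fetchone()
        db.execute("INSERT INTO thing VALUES (?, ?, ?)", (user_id, thing_id, body))
    return thing_id


first = add_thing(839, "first post")
second = add_thing(839, "second post")
other = add_thing(7, "someone else's first post")
print(first, second, other)
```

Because IDs are unique only within a user, a user's rows can be copied to another cluster without colliding with IDs already allocated there.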
36. http://danga.com/words/
Shared Storage
(SAN, SCSI, DRBD...)
Turn pair of InnoDB machines into a cluster
− looks like 1 box to outside world. floating IP.
One machine at a time mounting fs, running MySQL
Heartbeat to move IP, {un,}mount filesystem, {stop,start} mysql
filesystem repairs,
innodb repairs,
don’t lose any committed transactions.
No special schema considerations
MySQL 4.1 w/ binlog sync/flush options
− good
− The cluster can be a master or slave as well
37. http://danga.com/words/
Shared Storage: DRBD
Linux block device driver
− “Network RAID 1”
− Shared storage without sharing!
− sits atop another block device
− syncs w/ another machine's block device
cross-over gigabit cable ideal. network is faster than random writes on your disks.
InnoDB on DRBD: HA MySQL!
− can hang slaves off HA pair,
− and/or,
− HA pair can be slave of a master
[Diagram: two machines, each labeled sda, drbd, ext3, mysql, with a floater ip between them]
38. http://danga.com/words/
MySQL Clustering Options:
Pros & Cons
No magic bullet...
− Master/Slave
doesn’t scale with writes
− Master/Master
special schemas
− DRBD
only HA, not LB
− MySQL Cluster
special-purpose
− ....
lots of options!
− :)
− :(
40. http://danga.com/words/
Caching
caching's key to performance
− store result of a computation or I/O for quicker future
access (classic space/time trade-off)
Where to cache?
− mod_perl/php internal caching
memory waste (address space per apache child)
− shared memory
limited to single machine, same with Java/C#/Mono
− MySQL query cache
flushed per update, small max size
− HEAP tables
fixed length rows, small max size
41. http://danga.com/words/
memcached
http://www.danga.com/memcached/
our Open Source, distributed caching system
implements a dictionary ADT, with network API
run instances wherever free memory
two-level hash
− client hashes* to server,
− server has internal dictionary (hash table)
no “master node”, nodes aren’t aware of each other
protocol simple, XML-free
− clients: c, perl, java, c#, php, python, ruby, ...
popular, fast
scalable
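The two-level hash can be sketched in a few lines. This is a toy model, not the memcached client or server: `FakeServer` and `Client` are hypothetical names, and each "server" is just an in-process dictionary.

```python
import zlib


class FakeServer:
    """Stand-in for one memcached instance: just an in-process dictionary."""

    def __init__(self, name):
        self.name = name
        self.table = {}


class Client:
    def __init__(self, servers):
        self.servers = servers

    def _pick(self, key):
        # First level: the client hashes the key to choose a server.
        return self.servers[zlib.crc32(key.encode()) % len(self.servers)]

    def set(self, key, value):
        # Second level: the chosen server stores the value in its own hash table.
        self._pick(key).table[key] = value

    def get(self, key):
        return self._pick(key).table.get(key)


servers = [FakeServer(f"mc{i}") for i in range(1, 5)]
client = Client(servers)
client.set("user:839", "bob")
print(client.get("user:839"))
```

The servers never talk to each other; every client that uses the same server list and hash function finds the same key on the same node.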
53. http://danga.com/words/
Client hashing onto a memcached node
Up to client how to pick a memcached node
Traditional way:
− CRC32(<key>) % <num_servers>
− (servers with more memory can own more slots)
− CRC32 was least common denominator for all
languages to implement, allowing cross-language
memcached sharing
− con: can’t add/remove servers without hit rate
crashing
“Consistent hashing”
− can add/remove servers with minimal <key> to
<server> map changes
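A rough comparison of the two schemes when a fifth server is added. The ring below is one common way to do consistent hashing, not necessarily what any particular memcached client does; `Ring`, `points_per_server`, and the use of SHA-256 for ring positions are assumptions.

```python
import bisect
import hashlib
import zlib


def modulo_pick(key, servers):
    """Traditional scheme: CRC32(key) % number of servers."""
    return servers[zlib.crc32(key.encode()) % len(servers)]


class Ring:
    """Consistent-hash ring; each server owns many points (count is an assumed value)."""

    def __init__(self, servers, points_per_server=100):
        self.points = sorted(
            (self._hash(f"{s}#{i}"), s) for s in servers for i in range(points_per_server)
        )
        self.hashes = [h for h, _ in self.points]

    @staticmethod
    def _hash(text):
        return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

    def pick(self, key):
        # First point clockwise from the key's hash, wrapping around at the end.
        i = bisect.bisect(self.hashes, self._hash(key)) % len(self.points)
        return self.points[i][1]


keys = [f"key{i}" for i in range(10000)]
old = ["mc1", "mc2", "mc3", "mc4"]
new = old + ["mc5"]

moved_modulo = sum(modulo_pick(k, old) != modulo_pick(k, new) for k in keys)
old_ring, new_ring = Ring(old), Ring(new)
moved_ring = sum(old_ring.pick(k) != new_ring.pick(k) for k in keys)

print(f"modulo: {moved_modulo / len(keys):.0%} of keys moved")
print(f"ring:   {moved_ring / len(keys):.0%} of keys moved")
```

Going from 4 to 5 servers, the modulo scheme is expected to remap roughly four keys in five, while the ring remaps roughly one in five, and every key that moves on the ring moves to the new server.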
54. http://danga.com/words/
memcached internals
libevent
− epoll, kqueue...
event-based, non-blocking design
− optional multithreading, thread per CPU (not per client)
slab allocator
reference-counted objects
− slow clients can’t block other clients from altering namespace or data
LRU
all internal operations O(1)
56. http://danga.com/words/
Web Load Balancing
BIG-IP, Alteon, Juniper, Foundry
− good for L4 or minimal L7
− not tricky / fun enough. :-)
Tried a dozen reverse proxies
− none did what we wanted or were fast enough
Wrote Perlbal
− fast, smart, manageable HTTP web server / reverse proxy / LB
− can do internal redirects
and a dozen other tricks
57. http://danga.com/words/
Perlbal
Perl
parts optionally in C with plugins
single threaded, async event-based
− uses epoll, kqueue, etc.
console / HTTP remote management
− live config changes
handles dead nodes, smart balancing
multiple modes
− static webserver
− reverse proxy
− plug-ins (Javascript message bus.....)
plug-ins
− GIF/PNG altering, ....
59. http://danga.com/words/
Perlbal: Persistent Connections
perlbal to backends (mod_perls)
− know exactly when a connection is ready for a new request
no complex load balancing logic: just use whatever's free.
beats managing “weighted round robin” hell.
clients persistent; not tied to a specific backend connection
[Diagram for the persistent-connections slide: two Clients and two Apache backends connected through Perlbal (PB). Client-side labels: “reqA1, A2” and “reqB1, B2”. Backend-side labels: “reqA1, B2” and “reqB1, A2”.]
63. http://danga.com/words/
Perlbal: can verify new backend connections
connects to backends are often fast, but...
are you talking to the kernel’s listen queue?
or apache? (did apache accept() yet?)
send OPTIONS request to see if apache is there
− Apache can reply to OPTIONS request quickly,
− then Perlbal knows that conn is bound to an apache process, not waiting in a kernel queue
Huge improvement to user-visible latency!
(and more fair/even load balancing)
#include <sys/socket.h>
int listen(int sockfd, int backlog);
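A small demonstration of why a successful connect() proves little: the kernel completes the handshake for any socket that has called listen(), whether or not the process ever calls accept(). The sketch uses two toy listeners on localhost; `backend_answers` and `serve_one` are hypothetical helpers, and this is not Perlbal's code.

```python
import socket
import threading


def backend_answers(addr, timeout=0.5):
    """Return True if something at addr actually reads an OPTIONS request and replies."""
    with socket.create_connection(addr, timeout=timeout) as conn:
        conn.sendall(b"OPTIONS * HTTP/1.0\r\n\r\n")
        try:
            return conn.recv(1024).startswith(b"HTTP/")
        except socket.timeout:
            return False  # connected, but nobody has accept()ed and answered


def serve_one(listener):
    """Toy backend: accept a single connection and answer it."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(1024)
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n")


# Backend 1 listens but never calls accept(): connections sit in the listen queue.
stuck = socket.socket()
stuck.bind(("127.0.0.1", 0))
stuck.listen(5)

# Backend 2 listens and has a thread waiting in accept().
ready = socket.socket()
ready.bind(("127.0.0.1", 0))
ready.listen(5)
threading.Thread(target=serve_one, args=(ready,), daemon=True).start()

stuck_ok = backend_answers(stuck.getsockname())
ready_ok = backend_answers(ready.getsockname())
print("stuck backend answered:", stuck_ok)
print("ready backend answered:", ready_ok)

stuck.close()
ready.close()
```

Both connects succeed, but only the backend that has accepted the connection replies, which is the distinction the OPTIONS probe is meant to expose.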
66. http://danga.com/words/
Perlbal: cooperative large file serving
internal redirects
− mod_perl can pass off serving a big file to Perlbal
either from disk, or from other URL(s)
− client sees no HTTP redirect
− “Friends-only” images
one, clean URL
mod_perl does auth, and is done.
perlbal serves.
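A sketch of the backend side of an internal redirect for a “friends-only” image. It assumes Perlbal's X-REPROXY-FILE response header, which is not named on the slide; `is_friend` and `path_for` are hypothetical application callables.

```python
def respond_to_image_request(user, image, is_friend, path_for):
    """Backend handler sketch: authorize, then hand the file off to the proxy.

    is_friend and path_for are hypothetical callables supplied by the application.
    Returns (status, headers, body).
    """
    if not is_friend(user, image["owner"]):
        return 403, {}, b"Forbidden"
    # The backend sends no file bytes; the header tells the proxy what to serve.
    return 200, {"X-REPROXY-FILE": path_for(image)}, b""


image = {"owner": "bob", "name": "cat.jpg"}
allowed = respond_to_image_request(
    "alice", image,
    is_friend=lambda viewer, owner: viewer == "alice",
    path_for=lambda img: f"/images/{img['owner']}/{img['name']}",
)
denied = respond_to_image_request(
    "mallory", image,
    is_friend=lambda viewer, owner: viewer == "alice",
    path_for=lambda img: f"/images/{img['owner']}/{img['name']}",
)
print(allowed)
print(denied)
```

The backend process is free again as soon as it has made the auth decision; the slow part, streaming the file to the client, is left to the proxy.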
68. http://danga.com/words/
And the reverse...
Now Perlbal can buffer uploads as well..
− Problems:
LifeBlog uploading
−cellphones are slow
LiveJournal/Friendster photo uploads
−cable/DSL uploads still slow
− decide to buffer to “disk” (tmpfs, likely)
on any of: rate, size, time
blast at backend, only when full request is in
72. http://danga.com/words/
MogileFS
our distributed file system
open source
userspace
based all around HTTP (NFS support now removed)
hardly unique
− Google GFS
− Nutch Distributed File System (NDFS)
production-quality
− lot of users
− lot of big installs
73. http://danga.com/words/
MogileFS: Why
alternatives at the time were either:
− closed, non-existent, expensive, in development, complicated, ...
− scary/impossible when it came to data recovery
new/uncommon/unstudied on-disk formats
because it was easy
− initial version = 1 weekend! :)
− current version = many, many weekends :)
74. http://danga.com/words/
MogileFS: Main Ideas
files belong to classes, which dictate:
− replication policy, min replicas, ...
tracks what disks files are on
− set disk's state (up, temp_down, dead) and host
keep replicas on devices on different hosts
− (default class policy)
− No RAID!
− multiple tracker databases
− all share same database cluster (MySQL, etc..)
big, cheap disks
− dumb storage nodes w/ 12, 16 disks, no RAID
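The default placement policy (replicas on devices on different hosts, skipping devices that are not up) can be sketched as below. The device records and their keys are hypothetical, and this is not MogileFS's actual replication code.

```python
def pick_replica_devices(devices, min_replicas):
    """Pick devices for one file so that no two replicas share a host.

    devices: list of dicts with hypothetical keys "id", "host", "state".
    min_replicas: how many copies the file's class asks for (at least 1).
    Returns a list of device ids, or raises if too few hosts are usable.
    """
    chosen, used_hosts = [], set()
    for dev in devices:
        if dev["state"] != "up" or dev["host"] in used_hosts:
            continue
        chosen.append(dev["id"])
        used_hosts.add(dev["host"])
        if len(chosen) == min_replicas:
            return chosen
    raise RuntimeError("not enough distinct hosts with devices that are up")
```

Losing a whole host then costs at most one replica of any file, which is why the storage nodes can do without RAID.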
78. http://danga.com/words/
Trackers' Database(s)
Abstract as of Mogile 2.x
− MySQL
− SQLite (joke/demo)
− Pg/Oracle coming soon?
− Also future:
wrapper driver, partitioning any above
− small metadata in one driver (MySQL Cluster?),
− large tables partitioned over 2-node HA pairs
Recommended config:
− 2xMySQL InnoDB on DRBD
− 2 slaves underneath HA VIP
1 for backups
read-only slave for use during master failover window
79. http://danga.com/words/
MogileFS storage nodes (mogstored)
HTTP transport
− GET
− PUT
− DELETE
mogstored listens on 2 ports...
− HTTP. --server={perlbal,lighttpd,...}
configs/manages your webserver of choice.
perlbal is default. some people like apache, etc
− management/status:
iostat interface, AIO control, multi-stat() (for faster fsck)
files on filesystem, not DB
− sendfile()! future: splice()
− filesystem can be any filesystem
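To make the "plain HTTP verbs over files on a filesystem" idea concrete, here is a tiny stand-in storage node built on Python's standard library. It is not mogstored (which manages Perlbal or another web server); the handler factory name and the file-naming scheme are assumptions.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote


def make_storage_handler(root):
    """Build a handler class that stores each URL path as a file under root."""

    class StorageHandler(BaseHTTPRequestHandler):
        def _file(self):
            # Percent-encode the whole path (including "/") into one file
            # name so a request cannot escape the root directory.
            return os.path.join(root, quote(self.path, safe=""))

        def _reply(self, status, body=b""):
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_PUT(self):
            length = int(self.headers.get("Content-Length", 0))
            with open(self._file(), "wb") as f:
                f.write(self.rfile.read(length))
            self._reply(201)

        def do_GET(self):
            try:
                with open(self._file(), "rb") as f:
                    self._reply(200, f.read())
            except FileNotFoundError:
                self._reply(404)

        def do_DELETE(self):
            try:
                os.remove(self._file())
                self._reply(204)
            except FileNotFoundError:
                self._reply(404)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return StorageHandler
```

Serving it is one line, for example `ThreadingHTTPServer(("127.0.0.1", 7500), make_storage_handler("/var/mogdata")).serve_forever()`, where the port and directory are placeholders.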
86. http://danga.com/words/
Gearman
system to load balance function calls...
scatter/gather bunch of calls in parallel,
different languages,
db connection pooling,
spread CPU usage around your network,
keep heavy libraries out of caller code,
...
...
87. http://danga.com/words/
Gearman Pieces
gearmand
− the function call router
− event-loop (epoll, kqueue, etc)
workers.
− Gearman::Worker – perl/ruby
− register/heartbeat/grab jobs
clients
− Gearman::Client[::Async] -- perl
− also Ruby Gearman::Client
− submit jobs to gearmand
− opaque (to server) “funcname” string
− optional opaque (to server) “args” string
− optional coalescing key
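The pieces above (workers registering a function name, clients submitting opaque arguments, an optional coalescing key) can be modelled in-process. This toy router has no networking and does not speak the Gearman protocol; every name in it is hypothetical.

```python
import itertools


class ToyJobRouter:
    """In-process stand-in for a function-call router."""

    def __init__(self):
        self.workers = {}              # funcname -> callable from a worker
        self.pending = {}              # slot -> {"args": ..., "waiters": [...]}
        self._ids = itertools.count()

    def register(self, funcname, func):
        """A worker announces it can run funcname."""
        self.workers[funcname] = func

    def submit(self, funcname, args=b"", coalesce_key=None):
        """Queue a job; submissions sharing a coalescing key share one run."""
        waiter = {"done": False, "result": None}
        if coalesce_key is None:
            slot = (funcname, ("uniq", next(self._ids)))   # never merged
        else:
            slot = (funcname, ("key", coalesce_key))
        job = self.pending.setdefault(slot, {"args": args, "waiters": []})
        job["waiters"].append(waiter)
        return waiter

    def run_pending(self):
        """Run each pending job once; return the number of worker calls."""
        jobs, self.pending = self.pending, {}
        calls = 0
        for (funcname, _), job in jobs.items():
            # Raises KeyError if no worker registered funcname.
            result = self.workers[funcname](job["args"])
            calls += 1
            for waiter in job["waiters"]:
                waiter["done"], waiter["result"] = True, result
        return calls
```

The router never inspects the function name or the arguments; their meaning is agreed only between client and worker.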
97. http://danga.com/words/
Gearman Uses
Image::Magick outside of your mod_perls!
DBI connection pooling (DBD::Gofer + Gearman)
reducing load, improving visibility
“services”
− can all be in different languages, too!
98. http://danga.com/words/
Gearman Uses, cont..
running code in parallel
− query ten databases at once
running blocking code from event loops
− DBI from POE/Danga::Socket apps
spreading CPU from ev loop daemons
calling between different languages,
...
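The scatter/gather pattern ("query ten databases at once") is sketched here with local threads standing in for Gearman workers. That substitution is an assumption made to keep the example self-contained; with Gearman the calls would be spread across machines.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(func, inputs, max_workers=10):
    """Run func on every input in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, inputs))
```

If each call blocks for about the same time, the caller waits roughly as long as one call rather than the sum of all of them.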
99. http://danga.com/words/
Gearman Misc
Guarantees:
− none! hah! :)
please wait for your results.
if client goes away, no promises
− all retries on failures are done by client
but server will notify client(s) if working worker goes away.
No policy/conventions in gearmand
− all policy/meaning between clients <-> workers
...
103. http://danga.com/words/
TheSchwartz
Like gearman:
− job queuing system
− opaque function name
− opaque “args” blob
− clients are either:
submitting jobs
workers
But unlike gearman:
− reliable job queueing system
− not low latency
− fire & forget (as opposed to gearman, where you wait for result)
currently library, not network service
104. http://danga.com/words/
TheSchwartz Primitives
insert job
“grab” job (atomic grab)
− for 'n' seconds.
mark job done
temp fail job for future
− optional notes, rescheduling details..
replace job with 1+ other jobs
− atomic.
...
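The primitives listed above can be sketched on top of SQLite. The table layout and function names are invented for illustration and are not TheSchwartz's real schema or API.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    db = sqlite3.connect(path, isolation_level=None)  # autocommit mode
    db.execute("""CREATE TABLE IF NOT EXISTS job (
        id INTEGER PRIMARY KEY, funcname TEXT, arg BLOB,
        run_after REAL DEFAULT 0, grabbed_until REAL DEFAULT 0)""")
    return db


def insert_job(db, funcname, arg, run_after=0):
    cur = db.execute(
        "INSERT INTO job (funcname, arg, run_after) VALUES (?, ?, ?)",
        (funcname, arg, run_after))
    return cur.lastrowid


def grab_job(db, funcname, seconds, now=None):
    """Claim one runnable job for `seconds`; return (id, arg) or None."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT id, arg, grabbed_until FROM job WHERE funcname = ? "
        "AND run_after <= ? AND grabbed_until <= ? ORDER BY id LIMIT 1",
        (funcname, now, now)).fetchone()
    if row is None:
        return None
    # Compare-and-set: succeeds only if nobody grabbed the job in between.
    cur = db.execute(
        "UPDATE job SET grabbed_until = ? WHERE id = ? AND grabbed_until = ?",
        (now + seconds, row[0], row[2]))
    return (row[0], row[1]) if cur.rowcount == 1 else None


def mark_done(db, job_id):
    db.execute("DELETE FROM job WHERE id = ?", (job_id,))


def temp_fail(db, job_id, retry_at):
    """Release the job and make it runnable again at retry_at."""
    db.execute("UPDATE job SET run_after = ?, grabbed_until = 0 WHERE id = ?",
               (retry_at, job_id))


def replace_job(db, job_id, new_jobs):
    """Atomically swap one job for one or more (funcname, arg) jobs."""
    db.execute("BEGIN")
    try:
        db.execute("DELETE FROM job WHERE id = ?", (job_id,))
        for funcname, arg in new_jobs:
            db.execute("INSERT INTO job (funcname, arg) VALUES (?, ?)",
                       (funcname, arg))
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
```

A grab is a lease: if the worker dies before marking the job done, the lease expires after `seconds` and another worker can pick the job up.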
105. http://danga.com/words/
TheSchwartz
backing store:
− a database
− uses Data::ObjectDriver
MySQL,
Postgres,
SQLite,
....
but HA: you tell it @dbs, and it finds one to insert job into
− likewise, workers foreach (@dbs) to do work
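The "give it @dbs and it finds one" behaviour might be sketched like this. The helper name and the random ordering are assumptions; `insert` stands for whatever call writes the job into a single database.

```python
import random


def insert_into_any(dbs, insert, rng=random):
    """Try the databases in random order; return the first that accepts the job.

    insert(db) must raise an exception if db is unavailable.
    """
    for db in rng.sample(dbs, len(dbs)):
        try:
            insert(db)
            return db
        except Exception:
            continue  # this database is down or refusing writes; try the next
    raise RuntimeError("no database accepted the job")
```

Because workers loop over the same list, a job inserted into any live database is eventually found, and losing one database only delays the jobs stored there.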
106. http://danga.com/words/
TheSchwartz uses
outgoing email (SMTP client)
− millions of emails per day
− TheSchwartz::Worker::SendEmail
− Email::Send::TheSchwartz
LJ notifications
− ESN: event, subscription, notification
one event (new post, etc) -> thousands of emails, SMSes, XMPP messages, etc...
pinging external services
atomstream injection
.....
dozens of users
shared farm for TypePad, Vox, LJ
107. http://danga.com/words/
gearmand + TheSchwartz
gearmand: not reliable, low-latency, no disks
TheSchwartz: higher latency, reliable, disks
In TypePad:
− TheSchwartz, with gearman to fire off TheSchwartz workers.
disks, but low-latency
future: no disks, SSD/Flash, MySQL Cluster
109. http://danga.com/words/
djabberd
Our Jabber/XMPP server
powers our “LJ Talk” service
S2S: works with GoogleTalk, etc
perl, event-based (epoll, etc)
done 300,000+ conns
tiny per-conn memory overhead
− release XML parser state if possible
110. http://danga.com/words/
djabberd hooks
everything is a hook
− not just auth! like, everything.
− auth,
− roster,
− vcard info (avatars),
− presence,
− delivery,
− inter-node cluster delivery,
− ala mod_perl, qpsmtpd, etc.
async hooks
− hook phases can take as long as they want before they answer, or decline to next phase in hook chain...
− we use Gearman::Client::Async
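An asynchronous hook chain of this kind can be sketched with callbacks: each handler either answers or declines, and it may do so long after it was called. This is a generic illustration, not djabberd's API; all names are hypothetical.

```python
def run_hook_chain(handlers, event, on_answer, on_fallthrough):
    """Offer event to each handler in turn until one answers.

    Each handler is called as handler(event, answer, decline) and must
    eventually call exactly one of the two callbacks, possibly later
    (for example from an event loop once a remote lookup completes).
    If every handler declines, on_fallthrough(event) is called.
    """
    remaining = list(handlers)

    def try_next():
        if not remaining:
            on_fallthrough(event)
            return
        handler = remaining.pop(0)
        handler(event, on_answer, try_next)

    try_next()
```

Nothing blocks while a handler is thinking: the chain simply resumes when that handler invokes one of its callbacks.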