Narayan Newton, Lead systems engineer at Tag1 Consulting
Matthew Cheney, Chaos Wizard at Pantheon
The ACLU is a non-profit founded almost 100 years ago, and we have over 3 million supporters
Our mission is to defend and preserve the individual rights and liberties guaranteed by the Constitution and laws of the United States.
To put it succinctly, we consider ourselves to be the first responders for the Constitution
We take on issues like:
Voting rights
Reproductive Freedom
the intersection of privacy and technology
For example,
Led the fight against Japanese-American internment camps during WWII
Took on and defeated the 1996 Communications Decency Act, which censored the Internet by banning "indecent" speech
Marriage equality: we brought the first lawsuit in the country seeking the freedom to marry for same-sex couples in 1970.
We appear before the Supreme Court more than any other organization except the Department of Justice.
We maintain about 40 Drupal websites at the ACLU
but today we’re going to talk about just one really important website: action.aclu.org
Take action online
Sign an online petition
Send a letter to an elected official
Request legal aid from the ACLU
Where our members can go to support us
Fundraising
Sign up to volunteer
action.aclu.org is currently on Drupal 7, but for the time period we’re talking about it was on Drupal 6
It's critical for our organization that our websites are available and performant.
Before Pantheon, around 2013, we were on dedicated hosting
There was an initiative at the ACLU to build our online presence
But we found our infrastructure wasn’t quite up to the task of handling the increased traffic
site slowness
some site outages
Problems with our infrastructure
Database strain (we were using core Drupal search because Solr wasn’t set up)
Hardware upgrades took weeks and weeks
Maintaining test and development environments and Varnish took a lot of developer time
ACLU CTO Marco Carbone, who is an old-school Drupal dev, heard about Pantheon while attending a DrupalCon event
We did our research and decided they’d be a great host for us. Matt’s going to tell you why.
-- This may not be surprising, but hosting websites is hard work.
-- Not as hard as resisting executive overreach through constitutional law, of course.
-- Need to Use Lots of Technologies and Do Lots of Things 365 days a year
-- Plus you need to keep it all up to date and adopt NEW stuff when it comes along
-- What does knowing Git have to do with civil rights?
-- It's about as necessary as this guinea pig wearing sunglasses.
-- I mean it's great to know how Git works, but it's not necessary
-- The world is full of challenges, why add to the things you need to do?
-- Things move quickly. Organizations need to be able to respond.
-- Time/Resources need to be focused on organizational goals.
-- Even more true with “Ambitious Digital Experiences”.
-- Be the Pyramidion you want to be in the world
-- Leverage the expertise of others through reusable modules/libraries
-- Benefit from a community of practice around web development
-- Leveraging the expertise of others is why people use CLOUD
-- Drupal is getting more complicated. Web is getting more ambitious.
-- Features You Need Require Specialized Knowledge to Make. Even More to Maintain.
-- Security is an Ongoing Challenge Requiring Lots of Knowledgeable People
-- Performance/Scalability Takes a Village
-- Horizontally Scaling PHP is Hard Work
-- Hosting Platforms That Have This Tech Work Really Hard To Make it Awesome
-- It Won't Solve All Your Performance Problems
-- But It will Provide you a SOLID Starting Point
-- Be prepared, you never know what is going to happen
-- It's Not About Having all The Answers, It’s About Having the Right Tools
Pat:
After switching to Pantheon, our site was quite stable… until Nov. 8th 2016
We received $7.2 million in the 5 days after the 2016 election.
Compare that to the $25k in donations we received in the 5 days after the 2012 election
In the 5 days after the election our websites saw over 4 million page views
Compare that to 400,000 page views the year before
Our web traffic increased to more than 10x what we were used to seeing in the days after the election, essentially overnight
This was a great outpouring of support for our organization
but we started seeing small performance issues
Those small performance issues turned into a really big performance issue on Nov 16, 2016
The ACLU’s executive director appeared on the Rachel Maddow show.
Rachel Maddow Appearance Nov. 16 2016
500 peak form submissions per minute
~15 minutes site outage
Only able to sustain ~300 submissions per minute
This graph shows the spike in HTTP 500 errors our site was returning during the Maddow appearance
Huge missed opportunity for us.
Supporters were trying to donate to us, send letters to their elected officials via our site and sign up for our email lists, but they were being met by errors
Luckily ACLU management realized this wouldn’t be a one-time spike
They realized the Trump era meant that we’d be seeing spikes like this on the regular for the next 4 - 8 years
But we didn’t have 4 - 8 years to fix these performance issues
The next spike could come at ANY time
so we called in Tag1 for an emergency weekend of performance work
Tag1 Brought in to look at outage period
The issue was clearly that we were DB bound. We brought in 3 engineers, including myself, to review New Relic traces
Developed indexes, fixed queries, worked in concert with the ACLU team to deploy fixes.
An example of what type of thing we were doing.
This is a fairly typical Ubercart-esque query, with the addition of an og table.
An interesting quirk of this additional table is that it lacks all indexes. This is more common than you might think.
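We looked at the table to find the dataset's natural key and pushed a primary key and some additional keys for filtering and joining.
Note, we have a key on (oid, gid, nid), but then I also have separate indexes on gid and on nid. Why? Because of the order of gid and nid in the primary index: neither is its leading column, so a query that filters only on gid or only on nid cannot use the primary key to seek to its rows.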
As you can see, we went from 200k rows to 76.
And here is the result of just that change. The green query marked fundraiser_og is this query, and you can see it basically dropping out of the graph.
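The column-order point can be reproduced on a toy table. This is a minimal sketch, not the ACLU schema: SQLite stands in for the site's actual database, and the table layout, the `amount` column and the index name `fundraiser_og_gid` are all assumptions made for illustration. It compares the query plan for a gid-only filter before and after adding a dedicated index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the og table; the real schema is not shown in the talk.
conn.execute(
    "CREATE TABLE fundraiser_og ("
    " oid INTEGER, gid INTEGER, nid INTEGER, amount INTEGER,"
    " PRIMARY KEY (oid, gid, nid))"
)
conn.executemany(
    "INSERT INTO fundraiser_og VALUES (?, ?, ?, ?)",
    [(i, i % 50, i % 200, 10) for i in range(5000)],
)


def plan(sql):
    """Return the steps of SQLite's query plan for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT amount FROM fundraiser_og WHERE gid = 7"

# gid is the second column of the primary key, so the key cannot be used
# to seek straight to gid = 7.
plan_before = plan(query)

# An index whose leading column is gid allows a direct lookup.
conn.execute("CREATE INDEX fundraiser_og_gid ON fundraiser_og (gid)")
plan_after = plan(query)

print(plan_before)
print(plan_after)
```

The first plan should be a full scan of the table and the second a search through the new index. Other databases print their plans differently, but the leading-column rule for composite indexes is the same idea.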
Put together our fixes as a patchset, tested against Multidev
At this point we wanted to ensure that the ACLU site would survive larger traffic spikes, and to find other issues
Turned to Pantheon to set up a production-like environment to enable testing at that capacity
-- performance testing is complicated. just ask Narayan.
-- important to test in as “close a production parity” as possible
-- but setting all this stuff up is hard!
-- robots will drive our cars. raise our children on ipads. tell us what to believe politically
-- is it really too much to ask that they can create production parity development environments on demand?
-- on demand environments are the answer
-- at Pantheon we call this “Multidev”, but it's basically ONE ENVIRONMENT PER GIT BRANCH
---- integrated with new relic, production parity
-- made possible by Containers and Robots
-- Allowed Tag1 and ACLU to quickly iterate and test features
The emergency improvements Tag1 put in over that weekend in late Nov 2016 were very effective.
Made it through:
Giving Tuesday 2016
end of year fundraising pushes
received 15x more donations in our end of year fundraising than previous year (20,548 gifts)
But we weren’t out of the woods yet
Jan 27 2017: President Trump issues Executive Order 13769 (AKA the Muslim travel ban)
Barred people from 7 Muslim-majority countries from entering the US
Thousands protested the executive order at airports across the country
The ACLU fulfilled our reputation as first responders for the Constitution
Within hours, the ACLU—and partnering organizations nationwide—obtained the first injunction to block the order
When news broke of what the ACLU had accomplished
People rushed to our websites
That top line on the graph there shows page views in the days before, during and after the executive order
The line at the bottom of the screen shows the same dates from the previous year
The big spike is at almost 4 million hits; the same day the previous year is at 44,000
85x traffic spike… almost 2 orders of magnitude
Donations over the weekend after the executive order were six times the organization’s yearly average
So how did our websites hold up during this crazy post-executive order weekend?
Rachel Maddow Appearance
Able to sustain 300 submissions per minute
~15 minutes site outage
Executive Order
900 peak form submissions per minute
Sustained 500 submissions per minute for ~8 hours
We did have a 10 minute ‘site outage’
We did 2 smart things to mitigate this outage
New Relic alerts when traffic got high or response times increased
Static CDN-hosted donation page
After the dust settled, we took some time and confirmed what we previously suspected
slow responses from one of our payment gateways were the root cause of the site outage
we still had some issues with database performance to address
Once again, we handed the reins over to Tag1
So at this point we know things are better, but that we are still having issues at very high load. We are past the easy fixes you can detect at low load situations and need actual traffic
I built a botnet
initial results of the patchset
Started seeing issues with the DB and with external curl requests to the payment gateway
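For readers who have not run a load test, here is a much smaller, single-machine stand-in for the idea. It is a sketch only: the notes do not describe the actual tooling, and `run_load` and the `submit` callable (which would wrap one form submission against a test environment) are hypothetical names. It fires requests from a pool of worker threads and summarizes errors and latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(submit, total_requests, workers):
    """Call submit(i) total_requests times across worker threads and summarize."""

    def timed(i):
        start = time.perf_counter()
        try:
            submit(i)
            ok = True
        except Exception:
            # Any exception counts as a failed submission.
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": len(results),
        "errors": sum(1 for ok, _ in results if not ok),
        # Approximate 95th percentile latency.
        "p95_seconds": latencies[int(0.95 * (len(latencies) - 1))],
    }
```

Passing the submission logic in as a callable keeps the harness independent of any one site, and lets it be exercised against a fake before it is pointed at a real test environment.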
We took a two pronged approach to fix these issues
First, we started looking at the external requests. It was very unclear what was actually happening with our payment gateways
Developed curl_log to log the actual responses from the gateway, but also to sanitize them
Finally found that there was an issue with CDN
curl_load balance was developed
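The "log the responses, but sanitize them" idea behind curl_log can be sketched as follows. This is not the curl_log code, which the notes do not show; the function names, the field names in `SENSITIVE_KEYS` and the card-number pattern are all assumptions for illustration.

```python
import json
import logging
import re

logger = logging.getLogger("gateway_log")

# Hypothetical field names; a real gateway payload would differ.
SENSITIVE_KEYS = {"card_number", "cvv", "expiration", "password"}
# Any run of 13 to 19 digits is treated as a possible card number.
CARD_LIKE = re.compile(r"\b\d{13,19}\b")


def sanitize(value):
    """Return a copy of a decoded payload with sensitive data masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else sanitize(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    if isinstance(value, str):
        return CARD_LIKE.sub("[REDACTED]", value)
    return value


def log_gateway_response(url, status, elapsed_seconds, body):
    """Log one gateway call with its timing, after sanitizing the body."""
    record = {
        "url": url,
        "status": status,
        "elapsed_seconds": round(elapsed_seconds, 3),
        "body": sanitize(body),
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

Recording the elapsed time next to each sanitized response is what makes slow gateway calls visible after the fact without writing payment details to the log.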
Turned towards DB issues, which were more about overall load than about specific bad queries
Legacy deployment, don’t want to patch every module. Fabian developed query_cache
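The general idea of caching at the query layer, rather than patching each module, might look like the sketch below. The notes do not describe how query_cache actually works, so the `QueryCache` class, its time-based expiry and the `run_query` callable are assumptions chosen for illustration.

```python
import time


class QueryCache:
    """Cache results of read queries for a short time, keyed by SQL and arguments."""

    def __init__(self, run_query, ttl_seconds=30.0, clock=time.monotonic):
        self._run_query = run_query  # callable(sql, args) that hits the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (sql, args) -> (expires_at, result)

    def query(self, sql, args=()):
        key = (sql, tuple(args))
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            # Still fresh: serve from memory and skip the database.
            return hit[1]
        result = self._run_query(sql, args)
        self._entries[key] = (now + self._ttl, result)
        return result
```

Because callers keep issuing the same SQL, nothing above the query layer has to change. The trade-off in this sketch is staleness: a result can be up to `ttl_seconds` old, so it only suits queries where that is acceptable.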
We also developed rate_limit, which is sort of a performance tool and sort of a security tool. It allows us to rate-limit specific actions in Drupal itself.
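Limiting specific actions inside the application can be sketched with a sliding-window counter. This is an illustration under assumptions, not the rate_limit code: the `ActionRateLimiter` class, the choice of a sliding window and the idea of keying on an action name plus a client identifier are all hypothetical.

```python
import time
from collections import defaultdict, deque


class ActionRateLimiter:
    """Allow at most `limit` occurrences of an action per client per time window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._events = defaultdict(deque)  # (action, client_id) -> timestamps

    def allow(self, action, client_id):
        now = self._clock()
        events = self._events[(action, client_id)]
        # Forget events that have aged out of the window.
        while events and now - events[0] >= self._window:
            events.popleft()
        if len(events) >= self._limit:
            return False
        events.append(now)
        return True
```

A caller would check something like `limiter.allow("donation_submit", client_ip)` before doing the expensive work. Keeping a separate budget per action means a flood against one form does not block the others.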
How well did this second and final round of changes serve us?
We got a chance to find out when the Trump administration’s FCC repealed 'Net Neutrality' rules for internet providers in mid-December 2017
The internet reacted with outrage and once again the action.aclu.org website was a conduit for that outrage
This time, we nailed it.
Nov 2016 Before changes
Rachel Maddow Appearance
Maxed out at 300 form submissions per minute
~15 minute outage
January 2017 After first round of changes
Executive Order
900 peak form submissions per minute
~10 minute mitigated outage
After the second round of changes
we were able to hit a peak of 1,900 form submissions per minute
and easily sustained 500 submissions per minute for 10 hours (probably indefinitely)
This was a big victory for us… 100s of thousands
In his Nov 2016 appearance on the Maddow show our exec dir said about Pres. Trump’s election
While the rest of the organization was ready, the website wasn’t quite prepared.
But after a year of:
work by the developers and management at the ACLU
leveraging Tag1’s expertise
and Pantheon’s infrastructure having our backs
We’re now confident that our websites are really ready to be used in their full capacity to defend civil liberties in the US