Let me start by giving you context. In 2003 I started the Performance Engineering team at Blackboard. Ratatouille (2007), Pixar
Our CEO’s message to me was to make Blackboard the simplest, least complicated Enterprise Learning System on the market.
It’s cool to scale, but even cooler if anyone can manage the system.
I agreed that it was critical to make the system simple, yet elegant enough to be Enterprise.
So the journey began.
Ask any of these folks and they will tell you that “cooking” is easy if you follow their specific instructions. They subscribe to the idea that anyone can follow instructions and create a masterpiece.
He doesn’t subscribe to the theory that a good chef follows instructions. Good old Gordon Ramsay wants his chefs to be all-knowing, experienced risk takers. He wants them to know what they are trying to create, not just create what they read.
In our world, thanks to the O’Reilly books, there’s a belief that everyone can write software and manage systems. http://oreilly.com/
I have a few points…so please hear me out.
Last year I presented what you could call a “cookbook” for building and deploying a highly scalable and responsive Blackboard system. http://www.slideshare.net/sfeldman13/scaling-blackboard-learn-for-high-performance-and-delivery
I led you down this path. The path looked really serene off in the distance. http://www.flickr.com/photos/peterocks/6580193237/sizes/l/in/photostream/
That path, as I said a second ago, was to build this amazing structure. http://www.flickr.com/photos/uke-003/5465192496/lightbox/
It turns out that for some, the path looked more like this. The path looked stable, yet a little unsettling. http://www.flickr.com/photos/myklroventine/3891088196/sizes/l/in/set-72157612033047648/
For many of you it was really just a bunch of walls that you tried to get around or over. http://www.flickr.com/photos/myklroventine/4144318354/sizes/l/in/set-72157623209598498/
You thought you were creating this perfect meal…but instead… http://www.prinsenhof.com/page.asp?iTaalID=3&iPageID=205!207!
…that perfect meal looked more like Clark Griswold’s Christmas turkey. It looked perfect on the outside, but had nothing on the inside. The reason is that while I told you all of these great things about “WHAT YOU NEED TO DO” from a setup perspective, I failed to tell you what you needed to do to keep it up and running. http://www.imdb.com/title/tt0097958/
What I want to do today is go back to the topic of building a scalable and responsive system. I would like to convince you that following someone else’s instructions can be very dangerous if, over time, nothing more is gained than a few memories and no real experience. I’ve poked a little bit of fun at myself with the title of my presentation. I’m going to try to convince you that anyone can cook, but it takes a little bit more to be a chef. I want to make what I believe to be a convincing argument that managing an enterprise system takes a little bit more work than what’s written in a manual or given in a presentation. I’m going to try my hardest to deliver my message in a somewhat unusual way. I’m going to tell you two abstract stories that have nothing to do with Blackboard or software in general.
The first story takes place a long time ago, in 1881, in the country we know today as Panama. At the time it was a territory of Colombia, called New Granada. Colombia had gained its independence from Spain just prior to 1820. The Isthmus of Panama had long been known as a potential site for an expressway between the Atlantic and the Pacific. In 1534, Charles V, the King of Spain, ordered a survey for a route through the Americas that would ease the travel effort. http://www.smplanet.com/imperialism/joining.html
The French were the first to attempt to fund and construct the Panama Canal. In 1869 they had just experienced a major success after spending 10 years constructing the Suez Canal in Egypt, connecting the Mediterranean with the Red Sea and thus bypassing all of Africa to bridge Europe with Asia. The Suez Canal is a lockless canal. Water essentially just transfers between the two bodies.
Here’s a quick view of what the Panama Canal looks like from the East (Limon Bay) to the West (Panama City). http://bringinghomebeck.blogspot.com/2011/02/panama.html
Actual construction of a sea-level canal was begun in 1882 by a French company under Ferdinand de Lesseps. He had completed the Suez Canal in 1869. By comparison, however, building the Suez Canal had been simple. http://www.panama-guide.com/article.php/20051007164213526
Mismanagement, dishonesty, and terrible epidemics of disease in Panama forced the French company into bankruptcy in 1889. During seven years of digging, 22,000 men had died of tropical diseases. This was equivalent to wiping out the entire construction crew twice, for the total number of men employed at any one time did not average more than 10,000.
What you are looking at is a picture of yellow fever (left) and malaria (right). The French canal builders did not know that the deadly malaria and yellow fever were caused by bites of certain mosquitoes. Around this time in the region, specifically in Cuba, conclusions were being reached about the causes of yellow fever and malaria. That information was starting to become accessible to the medical community in this region of the Americas. The problem is that the information was ignored. Doctors, nurses and “medical folk” (non-educated practitioners of medicine) still held a strong belief in miasma theory to explain how diseases spread. Serious errors were made in sanitation. French physicians were said to have ordered the legs of hospital beds placed in water to keep ants and other crawling bugs from the patients. The water became an additional breeding place for mosquitoes, which already were swarming in from marshes, streams, and pools in the hot, rainy region.
So in the early 1900s, nearly 20 years after the canal was started, the United States negotiated its way into exclusive rights with Colombia to finish building the canal. The US had attempted to build a canal through Nicaragua, but that attempt failed due to finances. In the gap between 1890 and 1904, a second French initiative took place, but it was mainly an effort to construct a revamped rail system. Canal digging essentially stopped for 14 years.
In 1904, President Theodore Roosevelt appointed John Findlay Wallace, formerly Chief Engineer and finally General Manager of the Illinois Central Railroad, as Chief Engineer of the Panama Canal Project. The project had to go through a couple of major changes before it could be seriously considered a worthy effort. First, yellow fever and malaria had to be eradicated from the area. Disease was out of control, and eliminating it was an absolute necessity to make the project feasible. Second, the canal itself had to be redesigned to be more practical given the topography of Panama. It was never going to be built if it had to be a lockless construction. http://www.panama-guide.com/article.php/20051007164213526
Credit goes to two United States Army colonels for succeeding where the French had failed. Colonel George Washington Goethals, as engineer in chief after 1907, directed construction. Colonel William Crawford Gorgas of the Medical Corps, as chief sanitary officer, led the battle against disease. Later both men became major generals.

Gorgas Conquers the Mosquito

Two medical discoveries had been made that prepared the way for the achievement of Colonel Gorgas. In 1898 Dr. Ronald Ross, an English army surgeon, had discovered that malaria is transmitted by the bite of the Anopheles mosquito. In 1901 Dr. Walter Reed, a surgeon in the United States Army, and his associates had proved that yellow fever is passed from man to man by the Aedes mosquito. Gorgas himself, while serving as chief sanitary officer in Havana, Cuba, had directed the development of practical methods of sanitation based on these discoveries. With this invaluable knowledge and experience as a guide, he set to work to make the Canal Zone a safe place for men to work, live, and raise their families.

He drained every lake, swamp, pond, and ditch that could be emptied. Over those that could not be drained, he spread a film of oil to destroy mosquito eggs and larvae. He cut grass jungles to the ground, destroyed vermin, and burned rubbish. He raised all buildings above the ground and screened windows, doors, and porches. He ordered householders to cover every vessel that held water.

All railway cars were screened, and a hospital car was added to every train. Hospitals were built for isolation and treatment. Cities were given sewers and pure water. Ships coming from disease-ridden areas were placed under strict quarantine. To guard against bubonic plague, rats and fleas were killed and houses made ratproof.

http://www.panama-guide.com/article.php/20051007164213526
When he first went to Panama, Gorgas called it the most unhealthful place in the world. Today the former Canal Zone is said to be one of the world's healthiest places. Few disease-carrying or pest insects are now found in the area. Here was one of the most impressive victories ever won by science against disease, and the cost of all the sanitary measures involved was about a penny a day for each inhabitant.
What makes the Panama Canal story even more interesting is comparing it to the London Cholera Epidemic of the mid-1800s. The Ghost Map story has been chronicled over and over. You might be familiar with it as well. It is a brilliant telling of the importance of understanding “Root Cause”. In the first story you have the French, who never got to root cause. They were more concerned with corruption and overwhelmed by constant failures. The French were in constant firefighting mode, after they started off in the mode of pragmatic expert. They believed that every aspect of their project, from the design of the canal to the construction of hospitals (early in construction), was right on track. They had a prescription, specifically their experience of building the Suez Canal, a very different project from both a geographical and topographical perspective. That prescription didn’t make much sense.

The US had a very different experience. Realizing that the French had spent nearly 20 years failing, the US asked the important question of why they failed. They didn’t invest in construction until they were able to make a huge dent in yellow fever and malaria. They were successful with the former, but the latter took extra time. The key is that the US needed to understand the Root Cause of death in the Panama Canal Zone.
The story begins in London in the middle of the 19th century. At the beginning of the century, London had close to 1 million residents. In 50 years’ time, it had more than doubled. At the time London had the world’s largest and most dense population. The Victorians were building this metropolitan area, but they were building something this big for the first time. The city was becoming more industrial, but the supporting infrastructure was highly primitive, dating back to capabilities from the Elizabethan era. I’m specifically talking about sanitation and health.

Basements were filled with cesspits. Garbage and livestock were everywhere. There was no separation or distinction between the two. The city was incredibly smelly; the stench was simply unbearable. Cholera, which is a bacterial disease, was the great killer of the 19th century. Cholera is an acute diarrheal disease caused by an infection in the intestines that can kill even a healthy adult in a matter of hours. Symptoms, including severe watery diarrhea, can surface in as little as two hours or up to five days after infection, and can then trigger extreme dehydration and kidney failure. It was taking thousands of people at a time. It’s spread through drinking water. When contracted, there’s a high percentage of death (close to 50%) due to dehydration. The first big epidemic hit in the 1830s. The public health authorities were convinced that the stench of London was the cause of the outbreak. It was a pretty scary time, as families and neighborhoods would suddenly lose someone. http://london1850.com/
There were a couple of public health interventions. At one point the government convinced the residents of London to empty their cesspits into the Thames River as a means of eradicating the smell. There was still a foundational belief at the time that Cholera came from the air (miasma theory) and that by eliminating the waste, the smells would start to go away. What they didn’t realize is that by dumping the waste in the river, they were accelerating the spread of Cholera amongst the city’s residents. By 1850, Cholera had killed over 70,000 citizens of the UK. It had ravaged hundreds of thousands of Europeans as well. It was at the time still unsolved. Even with the advancement of science, it was still commonly thought that disease and illness were always spread through the air. There’s definitely logic in why people associated Cholera with miasma theory. The first reason is that Cholera tended to concentrate in big cities. The second is that big cities like London smelled horribly. http://www.sewerhistory.org/grfx/disease/cholera/cholera.html
But most of John Snow’s information was being rejected by both society and the doctors/scientists of the time. The districts of the nine Water Companies which now supply London are distinguished by separate colors on the annexed Map. The sources of supply are also given, the number of tenements, and the position and level of the various Filtering, Storage and Service Reservoirs, and Engine Establishments.

This map would be important to Snow’s argument, as residents north of the Thames were getting their water supply from sources north of the region, whereas if you lived south of the Thames, you were getting your water from the Thames. The problem is that much of the raw sewage was being dumped into the Thames. His research showed that residents in the north were less likely to contract Cholera than residents in the south.
In 1854, Snow was living in Sackville Street, Piccadilly, about 10 minutes walking distance from Broad Street, Golden Square and Berwick Street. A few cases of cholera occurred in the last part of August but the main epidemic started during the night of 31 August. He described it as ‘the most terrible outbreak of cholera which ever occurred in this kingdom’. It was an outbreak that claimed over 500 lives in 10 days, and he believed there would have been more fatalities had the population not left the area so quickly. As soon as he became aware of the outbreak he considered water supplies and became suspicious that there was ‘some contamination of the much-frequented street pump in Broad Street’. On 3 September he collected some samples of water from the pump for analysis. It showed, however, so little impurity that he hesitated to come to a conclusion. Over the next couple of days he did identify some ‘small white flocculent particles’ and decided to investigate the situation thoroughly. This investigation comprised taking a list from the Registrar General’s Office of the deaths from cholera which had been listed during the week ending 2 September. He then undertook detailed enquiries into the circumstances of each death in the area to ascertain where the deceased had obtained their drinking water. In 83% of the cases he found that the dead had been in the habit of drinking the water from the Broad Street pump. http://thevictorianist.blogspot.com/2010/11/john-snow-and-1854-london-cholera.html
Sometime in 1854 a baby by the last name of Lewis contracted Cholera at 40 Broad Street. Her mother washed her diapers in the cesspit while waiting for the doctor to arrive. The cesspit sat very close to an extremely popular drinking well on Broad Street. Even though the Nuisance Act had been put in place, the house still had the cesspit. It’s believed that the cesspit leaked Cholera into the drinking supply.

Nearly 10% of the neighborhood died within 7 days, with no explanation other than that they had contracted Cholera. Snow heard about the outbreak and went right into the affected area to study everything he could about it.

What’s important about Lewis is that she’s considered the “Index” patient, whose case could be used to challenge the London Medical Gazette, which 5 years before had questioned Snow’s original thesis that Cholera was contracted through water. The Gazette had asked for an isolated experiment to prove Snow’s theory. http://www.sewerhistory.org/grfx/disease/cholera/cholera.html
At the end of September the outbreak was all but over, with the death toll standing at 616 Sohoites. But Snow's theories were yet to be proved. There were several unexplained deaths from cholera that did not at first appear to be linked to the Broad Street pump water -- notably, a widow living in West End, Hampstead, who had died of cholera on 2 September, and her niece, who lived in Islington, who had succumbed with the same symptoms the following day. Since neither of these women had been near Soho for a long time, Dr Snow rode up to Hampstead to interview the widow's son. He discovered from him that the widow had once lived in Broad Street, and that she had liked the taste of the well-water there so much that she had sent her servant down to Soho every day to bring back a large bottle of it for her by cart. The last bottle of water -- which her niece had also drunk from -- had been fetched on 31 August, at the very start of the Soho epidemic. http://eagereyes.org/criticism/review-steven-johnson-the-ghost-map
Here’s another visualization of the deaths. It’s represented as a scatter plot in which the color tone reflects the death density. The darker the red, the greater the number of deaths. The data is quite fascinating because it raises the question of why there are outliers so far away from the red concentration. It also raises the question of why other densely populated areas close to 40 Broad Street don’t have as many deaths. It’s not like people were surviving by drinking the water at the pump; more likely they were drinking from other sources. http://www.flickr.com/photos/panopticonsoftware/3232167209/
There were many other factors that led Snow to isolate the cause of the cholera to the Broad Street pump. For instance, of the 530 inmates of the Poland Street workhouse, which was only round the corner, only five people had contracted cholera; but no one from the workhouse drank the pump water, for the building had its own well. Among the 70 workers in a Broad Street brewery, where the men were given an allowance of free beer every day and so never drank water at all, there were no fatalities at all. And an army officer living in St John's Wood had died after dining in Wardour Street, where he too had drunk a glass of water from the Broad Street well.

Still no one believed Snow. A report by the Board of Health a few months later dismissed his "suggestions" that "the real cause of whatever was peculiar in the case lay in the general use of one particular well, situate [sic] at Broad Street in the middle of the district, and having (it was imagined) its waters contaminated by the rice-water evacuations of cholera patients. After careful inquiry," the report concluded, "we see no reason to adopt this belief."

Snow had been working on the summary statistics (i.e., 83 deaths). Now he was looking at the data spatially, trying to better understand the distribution of deaths in relation to location (two simple data points).

Eventually, Snow was able to convince the authorities to remove the handle of the pump, in part because they had already lost 500+ people in a week. Snow’s work gained traction. Within a year of the Broad Street pump outbreak, society in general started to accept Snow’s theories not as theories, but as proof. They were making major strides in water sanitation by building sewer infrastructure, as well as by advancing thoughts on clean water (i.e., boiling). http://www.targetprocess.com/articles/information-visualization/
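The comparisons Snow drew can be put side by side as simple rates. The sketch below uses only the counts quoted above; the helper name `rate_pct` and the dictionary keys are made up for illustration, and note that the workhouse figure counts people who contracted cholera while the brewery figure counts fatalities.

```python
def rate_pct(count, population):
    """Share of a group that was affected, as a percentage."""
    return 100 * count / population

# Counts quoted in the text. The two rows measure slightly different things:
# the workhouse row is cholera cases, the brewery row is fatalities.
groups = {
    "workhouse": (5, 530),  # Poland Street workhouse, had its own well
    "brewery": (0, 70),     # Broad Street brewery, workers drank beer
}

rates = {name: rate_pct(count, population)
         for name, (count, population) in groups.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% affected")
```

Both groups sat in the middle of the outbreak yet show rates under 1%, which is the kind of contrast that pointed Snow toward the pump rather than the air.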
This is an example of time series data. It’s the 90th percentile of all response times for a consecutive stream of tests. It’s a comparison of the same test, run daily. There can be a lot of harm if this is the only data being observed. Why? Because the data is highly erratic. It fluctuates upward and downward through 70% of the samples until it hits a wall. The wall is conclusive evidence that there is a problem, and the problem lasts for quite a while. In fact, in the first 30% there was a performance issue. It appeared to be resolved somewhere in the middle of the builds, only to degrade immensely. Nothing was identified during the middle portion of these samples to indicate that there had been a problem early on. It wasn’t until the dramatic issue toward the later part of the build cycle that an issue was called out.
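One way to avoid waiting for "the wall" is to compare each build's 90th percentile against a trailing baseline. This is a minimal sketch, not the tooling behind the chart: the sample values, the `window` of 5 builds, and the 1.25x `tolerance` are all assumptions chosen for illustration.

```python
import math
from statistics import median

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def flag_regressions(p90_by_build, window=5, tolerance=1.25):
    """Return indexes of builds whose p90 exceeds the trailing-median baseline by `tolerance`."""
    flagged = []
    for i in range(window, len(p90_by_build)):
        baseline = median(p90_by_build[i - window:i])
        if p90_by_build[i] > baseline * tolerance:
            flagged.append(i)
    return flagged

# Hypothetical response times (seconds) from one build's test run.
build_samples = [1.2, 1.4, 1.1, 1.3, 9.8, 1.2, 1.5, 1.3, 1.2, 1.4]
p90 = percentile(build_samples, 90)

# Hypothetical p90 values for consecutive daily builds.
p90_series = [2.0, 2.1, 1.9, 2.0, 2.2, 2.1, 3.0, 2.0, 2.1, 6.5]
flagged = flag_regressions(p90_series)

print("p90 for this build:", p90)
print("builds worth a second look:", flagged)
```

With these made-up numbers the early blip is flagged as well as the late wall, and the single 9.8 second sample in `build_samples` does not move the p90 at all, which is a reminder that a percentile line by itself can hide outliers.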
Histograms are important as they teach us about the natural clustering and association of data. What’s important about this data is the immense variability to the right of the 2nd bin. The first 2 bins account for nearly 98% of all samples. There are dozens of response time bins to the right which demonstrate “how bad” performance of a page request can actually be. This is the data we have to be careful with. Most likely we would look at the first 2 bins and move on. We wouldn’t see a response time issue from this. It’s quite amazing that we have dozens of response times that range from 14.2 seconds to as high as 220.5 seconds. These bins aren’t necessarily uniform. At 42 through 64 seconds we see an uptick in poor responsiveness that doesn’t appear in the 4 bins to the left or the remaining bins to the right.

This data to the right of 7.1 seconds is what we have to investigate. We need to explain why the response times can be so bad.
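The same idea can be sketched in a few lines: bin the response times, then pull out the long tail for its own inspection instead of letting the first bins dominate. The samples, the 10 second bin width, and the 10 second tail threshold below are hypothetical and are not the data behind the slide.

```python
from collections import Counter

def histogram(samples, bin_width):
    """Count samples per fixed-width bin, keyed by each bin's lower edge."""
    counts = Counter(int(s // bin_width) * bin_width for s in samples)
    return dict(sorted(counts.items()))

def tail(samples, threshold):
    """Return the samples above `threshold` and their share of all samples."""
    slow = sorted(s for s in samples if s > threshold)
    return slow, len(slow) / len(samples)

# Hypothetical response times in seconds: mostly fast, with two very slow requests.
samples = [0.8] * 90 + [3.5] * 8 + [45.0, 120.0]

bins = histogram(samples, bin_width=10)
slow, share = tail(samples, threshold=10)

print("bins:", bins)
print(f"tail: {slow} ({share:.0%} of samples)")
```

In this made-up data the first bin holds 98% of the samples, so the chart looks healthy, yet the two requests in the tail are the ones a user would remember.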
Looking at data like this, what immediately comes to my mind is how dangerous it is. It doesn’t look all that bad. Note that the Y axis says less than 12%. We can only assume the full scale is 100%.
What does this scatterplot tell us? It shows a clustering of response times for accessing a student’s or teacher’s course home page. There are a handful of samples in the 20 to 80 second range, but why? The same system is returning sub-6 second response times at the same time as it is servicing 20+ second response times.
Add 2 more pieces of data: application CPU in blue and database CPU in red. It only leads to more questions. It makes sense that response times are high when CPU is low, right? Response times are high after spikes or increases in CPU. In this case, CPU is never saturated. Nearly 50% of CPU resources remain available, yet response times are 2X to 10X higher. This is where the data needs to be analyzed more deeply. Outliers need to be exposed.
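Associating the two kinds of data can start as simply as joining each response time with the CPU readings taken at the same moment and listing the slow requests that CPU cannot explain. This is a rough sketch under assumed names and thresholds: the `Sample` record, the 20 second `slow_s` cutoff, and the 80% `cpu_busy_pct` cutoff are illustrative, not values from the tests described here.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: int       # seconds since the start of the test (hypothetical)
    response_s: float    # page response time in seconds
    app_cpu_pct: float   # application server CPU utilization
    db_cpu_pct: float    # database server CPU utilization

def unexplained_slow(samples, slow_s=20.0, cpu_busy_pct=80.0):
    """Slow requests that occurred while neither CPU was close to saturation."""
    return [
        s for s in samples
        if s.response_s >= slow_s
        and max(s.app_cpu_pct, s.db_cpu_pct) < cpu_busy_pct
    ]

# Hypothetical joined samples.
samples = [
    Sample(0, 2.1, 35, 40),
    Sample(60, 4.8, 48, 52),
    Sample(120, 42.0, 45, 50),   # slow, yet about half the CPU is idle
    Sample(180, 25.0, 92, 60),   # slow, and the app tier is busy
    Sample(240, 5.5, 50, 47),
]

suspects = unexplained_slow(samples)
for s in suspects:
    print(f"t={s.timestamp}s response={s.response_s}s "
          f"app_cpu={s.app_cpu_pct}% db_cpu={s.db_cpu_pct}%")
```

The requests this returns are the ones where extra capacity would not have helped, so they are the place to start looking for locks, slow queries, or other causes outside of CPU.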
Cookbook for Administrating Blackboard Learn
A Cookbook for Administration of Learn
Stephen Feldman
Blackboard, Inc.
Product Development
Techniques for Presenting Data
o Time series statistics can provide benefit when viewed with appropriate context.
o Present data in percentiles.
o Statistical averages should be analyzed with standard deviation.
o Histograms can tell a compelling story about groups of outliers.
o Scatterplots can tell a compelling story about individual outliers.
o Associate different types of data: response times with resource utilization or throughput rates (clusters).
o Correlation can lead to causation, but be confident and careful.
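The point about averages and standard deviation is easy to demonstrate with two made-up test runs that share the same mean. The values below are hypothetical response times in milliseconds, and `summarize` is just an illustrative helper.

```python
from statistics import mean, pstdev

def summarize(samples_ms):
    """Mean, population standard deviation, and worst case for one test run."""
    return {
        "mean": mean(samples_ms),
        "stdev": pstdev(samples_ms),
        "max": max(samples_ms),
    }

# Two hypothetical runs with identical averages.
steady_ms = [2000, 2100, 1900, 2000, 2000, 2100, 1900, 2000, 2000, 2000]
spiky_ms = [500] * 9 + [15500]

steady = summarize(steady_ms)
spiky = summarize(spiky_ms)

print("steady:", steady)
print("spiky: ", spiky)
```

Both runs average 2 seconds, but the second has a standard deviation of 4.5 seconds and one request that took 15.5 seconds. Reported as an average alone, the two runs would look identical.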
What Have We Learned?
o It’s dangerous to use past experience without considering the data of the present.
o It’s dangerous to ignore data.
o Problems cannot be solved simply with extra capacity.
o Looking at time-series statistical averages can be just as dangerous.
o Seek out statistical anomalies, as they will tell us so much about how bad or defective the user experience truly is.
We value your feedback! Please fill out a session evaluation.