3. My name is Chris Wanstrath. I go by @defunkt online.
4. inside github: And today I’m going to talk about GitHub.
5. inside github: That’s me.
6. GitHub is what we like to call “social coding.”
7. You can see what your friends are doing from your dashboard or news feed
8. Everyone has a profile showing off their code and activity
9. And you can do things like leave comments on commits.
10. But it wasn’t always like this.
11. Originally we just wanted to make a git hosting site. In fact, that was the first tagline.
12. git repository hosting: That’s what we wanted to do: give us and our friends a place to share git repositories.
13. a brief history: Let’s start with a brief history.
14. It’s not easy to set up a git repository. It never was. But back in 2007 I really wanted to.
15. I had seen Torvalds’ talk on YouTube about git. But it wasn’t really about git - it was more about distributed version control. It answered many of my questions and clarified DVCS ideas. I still wasn’t sold on the whole idea, and I had no idea what it was good for.
16. CVS is stupid: But when Torvalds says “CVS is stupid”
17. and so are you: “and so are you,” the natural reaction for me is...
18. To start learning git.
19. At the time the biggest and best free hosting site was repo.or.cz.
20. Right after I had seen the Torvalds video, the god project was posted up on repo.or.cz. I was interested in the project, so I finally got a chance to try it out with some other people.
21. Namely this guy, Tom Preston-Werner. Seen here in his famous “I put ketchup on my ketchup” shirt.
22. I managed to make a few contributions to god before realizing that repo.or.cz was not different. git was not different. Just more of the same - centralized, inflexible code hosting.
23. This is what I always imagined. No rules. Project belongs to you, not the site. Share, fork, change - do what you want. Give people tools and get out of their way. Less ceremony.
24. So, we set off to create our own site. A git hub - learning, code hosting, etc.
25. We started with the code browsing and commit viewing...
26. But once we added the current version of the dashboard, we knew this was different.
27. And eventually “git repository hosting” gave way to “social coding”
28. Unleash Your Code - Join 500,000 coders with over 1,500,000 repositories: What’s special about GitHub is that people use the site in spite of git. Many git haters use the site because of what it is - more than a place to host git repositories, but a place to share code with others.
29. 2007 october: The first commit was on a Friday night in October, around 10pm.
30. 2008 january: We launched the beta in January at Steff’s on 2nd street in San Francisco’s SOMA district. The first non-github user was wycats, and the first non-github project was merb-core. They wanted to use the site for their refactoring and 0.9 branch.
31. 2008 april: A few short months after that we launched to the public.
32. Along the way we managed to pick up Scott Chacon, our VP of R&D
33. Tekkub, our level 80 support druid
34. Melissa Severini, who keeps us all in check
35. Kyle Neath, who makes the site pretty
36. Ryan Tomayko, who helps keep the site running smoothly.
37. Zach Holman, head of enterprise
38. Rick Olson, Rails extraordinaire
39. Eston Bond, Design Generalissimo
40. Corey Donohoe, Director of Shipology
41. And Brian Lopez, our bleeding edge cowboy
42. Oh yeah, and the other founders: PJ and Tom.
43. github.com: That’s where we’re at today. So let’s talk about the technical details of the website: github.com
44. .com as opposed to fi, which I’m not going to get into today. You’ll have to invite PJ out if you want to hear about that.
45. We also have a store
46. A job board
47. And do git training
48. the web site: As everyone knows, a web “site” is really a bunch of different components. Some of them generate and deliver HTML to you, but most of them don’t. Our site consists of four major code “frameworks” or “apps”
49. rails #GitHub.com, Gist, etc 1
50. resque #Background processing, 50ish different job types currently 2
53. rails: We use Ruby on Rails 2.2.2 as our web framework. It’s kept up to date with all the security patches and includes custom patches we’ve added ourselves, as well as patches we’ve cherry-picked from more recent versions of Rails.
54. rails: GitHub is about 20,000 lines of Rails code, not counting Rails itself, plugins, or gems.
55. We found out Rails was moving to GitHub in March 2008, after we had reached out to them and they had turned us down. So it was a bit of a surprise.
56. rails plugins: We currently have 27 Rails plugins installed, and that number is always changing.
57. shopify / active_merchant
58. lgn21st / s3_swf_upload
59. technoweenie / serialized_attributes
62. rubygems: GitHub depends on about 50 RubyGems.
73. rack: One of the big features in Rails 2.3 is Rack support.
74. We badly wanted this, but didn’t want to invest the time upgrading. So using a few open source libraries we’ve wrapped our Rails 2.2.2 instance in Rack.
75. Now we can use awesome Rack middleware like Rack::Bug in GitHub.
76. Coders created and submitted dozens of Rack middleware for the Coderack competition last year. I was a judge, so I got to see the submissions already. Some of my favorites were
77. nerdEd / rack-validate
78. webficient / rack-tidy
79. talison / rack-mobile-detect: sets the X_MOBILE_DEVICE header to the mobile device, if recognized
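For anyone who hasn’t written one: a Rack middleware is just an object that wraps another app and responds to `call(env)`. Here is a minimal sketch in the spirit of rack-mobile-detect. It is not the gem’s code; the class name `MobileDetectSketch` and the device pattern are made up for illustration, and it needs no gems because a Rack app is plain Ruby.

```ruby
# Hypothetical middleware in the spirit of rack-mobile-detect.
# NOT the gem's real code: the device list below is a made-up sample.
class MobileDetectSketch
  DEVICES = /iPhone|iPod|Android|BlackBerry/i

  def initialize(app)
    @app = app
  end

  # Rack calls this once per request with the environment hash.
  def call(env)
    agent = env["HTTP_USER_AGENT"].to_s
    device = agent[DEVICES]            # matched device name, or nil
    env["X_MOBILE_DEVICE"] = device if device
    @app.call(env)                     # hand off to the wrapped app
  end
end

# A Rack app is any object with #call returning [status, headers, body].
inner_app = lambda do |env|
  body = "device=#{env['X_MOBILE_DEVICE'] || 'none'}"
  [200, { "Content-Type" => "text/plain" }, [body]]
end

app = MobileDetectSketch.new(inner_app)
status, _headers, body = app.call("HTTP_USER_AGENT" => "Mozilla/5.0 (iPhone; U)")
puts "#{status} #{body.first}"
```

The wrapped app never has to parse the user agent itself; it only reads the value the middleware left in `env`.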
80. unicorn: We use unicorn as our application server.
- master / worker
- 16 workers
- preforking
81. unicorn:
- instant restart after kill
- hard 30s request timeouts
- control ram growth
82. unicorn:
- 0 downtime deploys
- protects against bad rails startup
- migrations handled old fashioned way
83. nginx: For serving static content and slow clients, we use nginx. nginx is pretty much the greatest http server ever. It’s simple, fast, and has a great module system.
84. nginx, Limit Zone: Limit simultaneous connections from a client.
85. nginx, Limit Requests: Limit frequency of connections from a client. Anti-DDOS.
86. nginx: I see many people using Rack to do what the Limit modules do. Don’t.
87. nginx, memcached: memcached support. Can serve directly from memcached.
88. nginx, Push Module: comet!
89. git: The next major part of GitHub is git.
90. grit: We wrote an open source library called Grit, which lets us use git from Ruby.
91. mojombo / grit: You can get it here. It originally shelled out to git and just parsed the responses, which worked well for a long time.
92. grit, File.read(): Eventually we realized, however, that File.read() can be 100 times faster
93. grit, system(): than shelling out.
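A rough way to see the gap yourself is to read the same small file both ways and time it. This is only a sketch: the temp file stands in for something like a ref file in a repository, `cat` stands in for a `git` command (so it assumes a Unix-like system), and the ratio you get will depend on your machine rather than matching the number above.

```ruby
require "benchmark"
require "tempfile"

# Stand-in for a small file inside a repository; path and contents are invented.
file = Tempfile.new("head")
file.write("ref: refs/heads/master\n")
file.flush

direct  = File.read(file.path)     # one in-process read
shelled = `cat #{file.path}`       # spawns a child process for every call

runs = 200
t_read  = Benchmark.realtime { runs.times { File.read(file.path) } }
t_shell = Benchmark.realtime { runs.times { `cat #{file.path}` } }

puts "same content: #{direct == shelled}"
puts format("File.read: %.4fs  shelling out: %.4fs", t_read, t_shell)
```

Both calls return the same bytes. The difference is the cost of starting a process every time.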
94. One of the first things Scott worked on was rewriting the core parts of Grit to be pure Ruby. Basically a Ruby implementation of Git.
95. mojombo / grit: And that’s what we run now.
96. smoke: Kinda. Eventually we needed to move our git repositories off of our web servers. Today our HTTP servers are distinct from our git servers. The two communicate using smoke.
97. smoke: “Grit in the cloud.” Instead of reading and writing from the disk, Grit makes Smoke calls. The reading and writing then happens on our file servers.
98. bert-rpc: Rather than use Protocol Buffers or Thrift or JSON-RPC, Smoke uses BERT-RPC.
100. bert-rpc: We have four file servers, each running bert-rpc servers. Our front ends and job queue make RPC calls to the backend servers.
101. mojombo / bertrpc: You can grab bert-rpc on GitHub.
102. mojombo / bert: Or if you just want to play with BERT.
103. chimney: We have a proprietary library called chimney. It routes the smoke. I know, don’t blame me.
104. chimney: All user routes are kept in Redis. Chimney is how our BERT-RPC clients know which server to hit. It falls back to a local cache and auto-detection if Redis is down.
105. chimney: It can also be told a backend is down. It was optimized for connection refused, but in reality that wasn’t the real problem - timeouts were.
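Chimney itself is proprietary, so the following is only a guess at the shape of the idea: look the route up in a shared store, remember what you saw, and fall back to that memory when the store is unreachable. Every name here (`RouterSketch`, `FakeStore`, `RouteStoreDown`, the `route:<user>` key layout) is hypothetical, an in-memory fake replaces Redis, and auto-detection is left out.

```ruby
# Hypothetical route lookup in the style described for chimney.
# Names, the error class, and the data shapes are all assumptions.
class RouteStoreDown < StandardError; end

class RouterSketch
  def initialize(store)
    @store = store   # stand-in for Redis: responds to #get(key)
    @local = {}      # last routes we saw, used when the store is down
    @down  = {}      # backends we have been told are unavailable
  end

  def mark_down(host)
    @down[host] = true
  end

  # Returns the file server for a user, or nil if none is usable.
  def route_for(user)
    host = begin
      @store.get("route:#{user}").tap { |h| @local[user] = h if h }
    rescue RouteStoreDown
      @local[user]   # fall back to the local cache
    end
    host unless @down[host]
  end
end

# In-memory fake so the sketch runs without a Redis server.
class FakeStore
  attr_accessor :up

  def initialize(routes)
    @routes = routes
    @up = true
  end

  def get(key)
    raise RouteStoreDown unless @up
    @routes[key]
  end
end

store  = FakeStore.new("route:defunkt" => "fs1", "route:mojombo" => "fs2")
router = RouterSketch.new(store)
puts router.route_for("defunkt")   # looked up in the store
store.up = false
puts router.route_for("defunkt")   # served from the local cache
```

Note what the sketch does not cover: a backend that accepts the connection and then hangs. That is the timeout case, and it needs a deadline on the call itself.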
106. proxymachine: All anonymous git clones hit the front end machines. The git-daemon connects to proxymachine, which uses chimney to proxy your connection between the front end machine and the back end machine (which holds the actual git repository). Very fast, transparent to you.
107. mojombo / proxymachine: proxymachine can be used to proxy any kind of tcp connection. Open source.
108. ssh: Sometimes you need to access a repository over ssh. In those instances, you ssh to an fe and we tunnel your connection to the appropriate backend. To figure that out we use chimney.
114. jobs: We do a lot of work in the background at GitHub.
115. resque: Currently we use a system called Resque.
116. defunkt / resque: You can grab it on GitHub.
117. resque:
- dealing with pushes
- web hooks
- creating events in the database
- generating GitHub Pages
- clearing & warming caches
- search indexing
118. queues: In Resque, a queue is used as both a priority and a localization technique. By localization I mean, “where your workers live.”
119. queues, critical, high, low: These three run on our front end servers. Resque processes them in this order.
120. queues, page: GitHub Pages are generated on their own machine using the `page` queue.
121. queues, archive: And tarball and zip downloads are created on the fly using the `archive` queue on our archiving machines.
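“In this order” means a worker checks its queues in the order they are listed and takes a job from the first one that has any. A toy model of that rule, with plain arrays standing in for the Redis lists and invented job names (the class `QueueSketch` is hypothetical, not Resque’s API):

```ruby
# Toy model of priority-by-queue-order. Plain arrays stand in for the
# Redis lists a real Resque install uses; the job names are invented.
class QueueSketch
  def initialize(*names)
    @order  = names
    @queues = Hash.new { |h, k| h[k] = [] }
  end

  def push(queue, job)
    @queues[queue] << job
  end

  # Take the oldest job from the first non-empty queue, in listed order.
  def reserve
    @order.each do |name|
      job = @queues[name].shift
      return [name, job] if job
    end
    nil
  end
end

q = QueueSketch.new(:critical, :high, :low)
q.push(:low, "warm_cache")
q.push(:high, "web_hook")
q.push(:critical, "push")
q.push(:high, "index_search")

while (entry = q.reserve)
  name, job = entry
  puts "#{name}: #{job}"
end
```

The `low` job was enqueued first but runs last. The same list also controls placement: a worker started with only `page` never sees anything else.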
122. search: On GitHub, you can search code, repositories, and people.
123. solr: Solr is basically an HTTP interface on top of Lucene. This makes it pretty simple to use in your code. We use solr because of its ability to incrementally add documents to an index.
124. Here I am searching for my name in source code
125. solr: We’ve had some problems making it stable, but luckily the guys at Pivotal have given us some tips. Like bumping the Java heap size. Whatever that means.
126. database: Our database story is pretty uninteresting.
127. mysql: We use mysql 5.
128. master / slave: All reads and writes go to the master. We use the slave for backups and failover.
129. caching: On the site we do a ton of caching using memcached.
130. fragments: We cache chunks of HTML all over. Usually they are invalidated by some action.
131. fragments: Formerly we invalidated most of our fragments using a generation scheme, where you put a number into a bunch of related keys and increment it when you want all those caches to be missed (thus creating new cache entries with fresh data).
132. fragments: But we had high cache eviction due to low ram and hardware constraints, and found that scheme did more harm than good. We also noticed some cached data we wanted to remain forever was being evicted due to the slabs with generational keys filling up fast.
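If the generation scheme is new to you, here is roughly how it works. This is a sketch only: a Hash stands in for memcached, and the class name and key layout (`<group>:gen`, `<group>:v<n>:<name>`) are assumptions, not what the site used.

```ruby
# Generation-keyed cache sketch. A Hash stands in for memcached, and the
# key layout is an assumption made for illustration.
class GenerationCacheSketch
  attr_reader :store

  def initialize
    @store = {}
  end

  def generation(group)
    @store["#{group}:gen"] ||= 0
  end

  # Every related key embeds the group's current generation number.
  def key(group, name)
    "#{group}:v#{generation(group)}:#{name}"
  end

  def fetch(group, name)
    @store[key(group, name)] ||= yield
  end

  # One increment makes every old key unreachable. Nothing is deleted,
  # so stale entries linger until the cache evicts them.
  def expire(group)
    @store["#{group}:gen"] = generation(group) + 1
  end
end

cache = GenerationCacheSketch.new
cache.fetch("repo:1", "sidebar") { "<div>3 forks</div>" }
cache.fetch("repo:1", "header")  { "<h1>grit</h1>" }
cache.expire("repo:1")
puts cache.fetch("repo:1", "sidebar") { "<div>4 forks</div>" }
puts cache.store.keys.inspect
```

Look at the keys printed at the end: the old `v0` entries are still there. On a cache that is short on memory, that dead weight is what crowds out the entries you actually wanted to keep.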
133. page: We cache entire pages using nginx’s memcached module. Lots of HTML, but also other data which gets hit a lot and changes rarely:
134. page:
- network graph json
- participation graph data
Always looking to stick more into page caches.
135. object: We do basic object caching of ActiveRecord objects such as repositories and users all over the place. Caches are invalidated whenever the objects are saved.
136. associations: We also cache associations as arrays of IDs. Grab the array, then do a get_multi on its contents to get a list of objects. That way we don’t have to worry about caching stale objects.
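A sketch of that association trick, under stated assumptions: a Hash-backed `CacheSketch` stands in for memcached, the key names are invented, and the `loader` lambda stands in for a database read. The list entry holds only IDs, so each object lives in exactly one cache entry.

```ruby
# Association caching sketch. The Hash-backed cache, key names, and the
# loader lambda are assumptions, not the site's actual code.
class CacheSketch
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value
  end

  # Like a memcached multi-get: returns only the keys that were found.
  def get_multi(keys)
    keys.each_with_object({}) do |k, found|
      found[k] = @data[k] if @data.key?(k)
    end
  end
end

# Fetch a user's repositories via a cached ID list plus per-object entries.
def cached_repos(cache, user_id, load_repo)
  ids  = cache.get("user:#{user_id}:repo_ids") || []
  keys = ids.map { |id| "repo:#{id}" }
  hits = cache.get_multi(keys)
  ids.zip(keys).map do |id, key|
    hits[key] || load_repo.call(id).tap { |repo| cache.set(key, repo) }
  end
end

cache = CacheSketch.new
cache.set("user:1:repo_ids", [10, 11])
cache.set("repo:10", { id: 10, name: "resque" })
loader = ->(id) { { id: id, name: "loaded-#{id}" } }   # stand-in for a DB read

p cached_repos(cache, 1, loader)
```

When a repository is saved, only its own `repo:<id>` entry has to be invalidated. Every list that mentions it picks up the fresh copy on the next read.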
137. walker: We also have a proprietary caching library called Walker.
138. walker: It originally walked trees and cached them when someone pushed. But now it caches everything related to git:
140. Every git-related page load hits Walker a lot
141. walker: For most big apps, you need to write a caching layer that knows your business domain. Generic, catch-all caching libraries probably won’t do.
142. events: An example of this is our events system.
143. This is one fragment
144. Each of these is a fragment
145. They’re also cached as objects
146. As well as a list of ids
147. And that’s just for the dashboard...
148. optimizations: So what other optimizations have we done?
149. asset servers: Well, we do the common trick of serving assets from multiple subdomains.
150. asset servers: assets0.github.com, assets1.github.com, and so forth.
152. sha asset id: /css/bundle.css?197d742e9fdec3f7 and /js/bundle.js?197d742e9fdec3f7. Now simple code changes won’t force everyone to re-download the css or js bundles.
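Both tricks fit in a few lines. This is a sketch with assumptions called out: the host count of 4 is a guess (the slide only shows assets0 and assets1), the helper names are hypothetical, and the id is derived here by hashing the bundle’s content. The ids on the slide are identical for css and js, so the real ones more likely come from git.

```ruby
require "digest/sha1"
require "zlib"

ASSET_HOSTS = 4   # assumed count; the talk only shows assets0 and assets1

# Pick a stable host per path so a given file always comes from one subdomain.
def asset_host(path)
  "assets#{Zlib.crc32(path) % ASSET_HOSTS}.github.com"
end

# Append an id that changes only when the bundle's bytes change.
# Hashing the content is an assumption; an id could equally come from git.
def asset_url(path, content)
  id = Digest::SHA1.hexdigest(content)[0, 16]
  "http://#{asset_host(path)}#{path}?#{id}"
end

css = "body { color: #333 }"
puts asset_url("/css/bundle.css", css)
puts asset_url("/css/bundle.css", css + " a { color: blue }")
```

The two URLs share a host and differ only in the query string, so browsers keep their cached copy until the bundle itself changes.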
153. bundling: For bundling itself, we use
154. bundling: yui’s compressor for css and
156. scripty 301: Again, for most of these tricks you need to really pay attention to your app. One example is scriptaculous’ wiki.
157. scripty 301: When we changed our wiki URL structure, we set up dynamic 301 redirects for the old urls. Scriptaculous’ old wiki was getting hit so much we put the redirect into nginx itself - this took strain off our web app and made the redirects happen almost instantly.
158. ajax loading: We also load data in via ajax in many places. Sometimes a piece of information will just take too long to retrieve. In those instances, we usually load it in with ajax.
159. If Walker sees that it doesn’t have all the information it needs, it kicks off a job to stick that information in memcached.
160. We then periodically hit a URL which checks if the information is in memcached or not. If it is, we get it and rewrite the page with the new information.
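The flow is: miss, enqueue, poll. Here is a server-side sketch of those three moves. Everything in it is a stand-in: a Hash for memcached, an Array for the job queue, and invented names (`DeferredLoadSketch`, `fetch_or_enqueue`, `poll`) rather than anything from Walker.

```ruby
# Sketch of "load it later" in three moves: miss, enqueue, poll.
# The cache, the job queue, and all names here are invented stand-ins.
class DeferredLoadSketch
  def initialize
    @cache   = {}   # stand-in for memcached
    @pending = []   # stand-in for a background job queue
  end

  # Page render: return cached data, or enqueue a job and say "not yet".
  def fetch_or_enqueue(key, &slow_work)
    return @cache[key] if @cache.key?(key)
    @pending << [key, slow_work] unless @pending.any? { |k, _| k == key }
    nil
  end

  # The URL the browser polls: is the data in the cache yet?
  def poll(key)
    @cache.key?(key) ? { ready: true, data: @cache[key] } : { ready: false }
  end

  # What a background worker would do with each queued job.
  def run_jobs
    until @pending.empty?
      key, work = @pending.shift
      @cache[key] = work.call
    end
  end
end

site = DeferredLoadSketch.new
site.fetch_or_enqueue("commits:grit") { %w[a1b2c3 d4e5f6] }   # first render misses
p site.poll("commits:grit")   # browser polls before the job has run
site.run_jobs
p site.poll("commits:grit")   # data is ready; the page can be rewritten
```

The page renders immediately either way. The slow work happens once, off the request path, and repeat visitors get the cached result.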
161. We use this same trick on the Network Graph
162. Fork Queue
163. ajax loading: and anywhere else it makes sense.
164. comet loading: Very soon this will all be comet, though.
165. monitoring: What do we use for monitoring?
166. nagios: Our support team monitors the health of our machines and core services using nagios. I don’t really touch the thing.
167. Here’s a screenshot from my IE browser, complete with the ICQ plugin
168. resque web: We monitor our queue using Resque’s included Sinatra app.
169. haystack: We use an in-house app called Haystack to monitor arbitrary information, tracked as JSON.
170. Here’s an example of Haystack’s “exceptions” view
171. collectd: We also use collectd to monitor load, RAM usage, CPU usage, and other app-related metrics.
172. pingdom: pingdom sends us SMSes when the site is down. It’s nice.
173. tender: tender is what we use for customer support.
174. It works incredibly well, and they’re constantly improving it.
175. testing: Our testing setup is pretty standard.
176. test unit: We mostly use Ruby’s test/unit. We’ve experimented with other libraries including test/spec, shoulda, and RSpec, but in the end we keep coming back to test/unit.
177. git fixtures: As many of our fixtures are git repositories, we specify in the test what sha we expect to be the HEAD of that fixture. This means we can completely delete a git repository in one test, then have it back in pristine state in another. We plan to move all our fixtures to a similar git-system in the future.
178. machinist: We use machinist for our fixtures.
179. notahat / machinist
180. running_man: Gives us setup_once. Use it to cache machinist fixtures on a per-test-class basis.
181. technoweenie / running_man
182. ci joe: We use ci joe, a continuous integration server, to run our tests after each push. He then notifies us if the tests fail.
183. defunkt / cijoe: You can grab him at GitHub.
184. staging: We also always deploy the current branch to staging. This means you can be working on your branch, someone else can be working on theirs, and you don’t need to worry about reconciling the two to test out a feature. One of the best parts of Git.
186. github.com/security: Having a security page really helps.
187. firstname.lastname@example.org: We get weekly emails to our security email (that people find on the security page), and people are always grateful when we can reassure them or answer their question.
188. regular audits: If you can, find a security consultant to poke your site for XSS vulnerabilities. Having your target audience be developers helps, too.
189. 24/7 monitoring: 24/7 monitoring is cool too.
190. backups: Backups are incredibly important. Don’t just make backups: ensure you can restore them, as well.
191. sql: We keep nightly, off-site backups of our sql databases.