Inside GitHub




Talk given at the Gilt Groupe Experts Talk in NYC, December 2009.





    Inside GitHub Presentation Transcript

    • Hello. Hi everyone.
    • My name is Chris Wanstrath. I go by @defunkt online.
    • inside github And today I’m going to talk about GitHub.
    • inside github That’s me.
    • GitHub is what we like to call “social coding.”
    • You can see what your friends are doing from your dashboard or news feed
    • Everyone has a profile showing off their code and activity
    • And you can do things like leave comments on commits.
    • But it wasn’t always like this.
    • Originally we just wanted to make a git hosting site. In fact, that was the first tagline.
    • git repository hosting git repository hosting. That’s what we wanted to do: give us and our friends a place to share git repositories.
    • It’s not easy to set up a git repository. It never was. But back in 2007 I really wanted to.
    • I had seen Torvalds’ talk on YouTube about git. But it wasn’t really about git - it was more about distributed version control. It answered many of my questions and clarified DVCS ideas. I still wasn’t sold on the whole idea, and I had no idea what it was good for.
    • CVS is stupid But when Torvalds says “CVS is stupid”
    • and so are you “and so are you,” the natural reaction for me is...
    • To start learning git.
    • At the time the biggest and best free hosting site was repo.or.cz.
    • Right after I had seen the Torvalds video, the god project was posted up on repo.or.cz. I was interested in the project, so I finally got a chance to try it out with some other people.
    • Namely this guy, Tom Preston-Werner. Seen here in his famous “I put ketchup on my ketchup” shirt.
    • I managed to make a few contributions to god before realizing that repo.or.cz was not different. git was not different. Just more of the same - centralized, inflexible code hosting.
    • This is what I always imagined. No rules. Project belongs to you, not the site. Share, fork, change - do what you want. Give people tools and get out of their way. Less ceremony.
    • So, we set off to create our own site. A git hub - learning, code hosting, etc.
    • We started with the code browsing and commit viewing...
    • But once we added the current version of the dashboard, we knew this was different.
    • And eventually “git repository hosting” gave way to “social coding”
    • What’s special about GitHub is that people use the site in spite of git. Many git haters use the site because of what it is - more than a place to host git repositories, but a place to share code with others.
    • a brief history So that’s how it all started. Now I want to (briefly) cover some milestones and events.
    • 2007 october The first commit was on a Friday night in October, around 10pm.
    • 2008 january We launched the beta in January at Steff’s on 2nd street in San Francisco’s SOMA district. The first non-github user was wycats, and the first project was merb-core. They wanted to use the site for their refactoring and 0.9 branch.
    • 2008 april A few short months after that we launched to the public.
    • 2009 january In January of this year, we were awarded “Best Bootstrapped Startup” by TechCrunch.
    • 2009 april Then in April we were featured as some of the best young tech entrepreneurs in BusinessWeek. (Finally something to show mom)
    • 2009 june Our Firewall Install, something we’d been talking about since practically day one, was launched in June of 2009.
    • 2009 september And in September we moved to Rackspace, our current hosting provider. (Which some of you may have noticed.)
    • Along the way we managed to pick up Scott Chacon, our VP of R&D
    • Tekkub, our level 80 support druid
    • Melissa Severini, who keeps us all in check
    • Kyle Neath, who makes the site pretty
    • And Ryan Tomayko, who helps keep the site running smoothly.
    • Oh yeah, and the other founders: PJ and Tom.
    • github.com That’s where we’re at today. So let’s talk about the technical details of the website: github.com
    • .com As opposed to FI, our Firewall Install, which I’m not going to get into today. You’ll have to invite PJ out if you want to hear about that.
    • the web app As everyone knows, a web “site” is really a bunch of different components. Some of them generate and deliver HTML to you, but most of them don’t. Either way, let’s start with the HTMLy parts.
    • rails We use Ruby on Rails 2.2.2 as our web framework. It’s kept up to date with all the security patches and includes custom patches we’ve added ourselves, as well as patches we’ve cherry-picked from more recent versions of Rails.
    • We found out Rails was moving to GitHub in March 2008, after we had reached out to them and they had turned us down. So it was a bit of a surprise.
    • rails But there are entire presentations on Rails, so I’m not going to get further into it here. As for whether it scales or not, we’ll let you know when we find out. Because so far it hasn’t come close to presenting a problem.
    • rack One of the big features in Rails 2.3 is Rack support.
    • We badly wanted this, but didn’t want to invest the time upgrading. So using a few open source libraries we’ve wrapped our Rails 2.2.2 instance in Rack.
    • Now we can use awesome Rack middleware like Rack::Bug in GitHub
    • In fact, the Coderack competition is about to open voting to the public this week. Coders created and submitted dozens of Rack middleware for the competition. I was a judge, so I got to see the submissions already. Some of my favorites were:
    • nerdEd / rack-validate
    • webficient / rack-tidy
    • talison / rack-mobile-detect sets the X_MOBILE_DEVICE header to the mobile device, if recognized
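
To make the middleware idea concrete, here is a minimal sketch of a Rack-style middleware in plain Ruby. It is not the code of rack-mobile-detect: the class name, the device pattern, and the way the value is stored in `env` are assumptions for illustration. The only Rack convention relied on is that an app is any object responding to `call(env)` and returning `[status, headers, body]`.

```ruby
# Hypothetical middleware: tags the request with a detected mobile device.
class MobileDetect
  # Illustrative pattern only, not the list used by rack-mobile-detect.
  DEVICES = /iPhone|Android|BlackBerry/

  def initialize(app)
    @app = app
  end

  def call(env)
    device = env["HTTP_USER_AGENT"].to_s[DEVICES]
    env["X_MOBILE_DEVICE"] = device if device
    @app.call(env) # hand the (possibly tagged) request to the next app
  end
end

# A trivial downstream app that echoes what the middleware found.
app = lambda do |env|
  [200, { "Content-Type" => "text/plain" }, [env.fetch("X_MOBILE_DEVICE", "desktop")]]
end

stack = MobileDetect.new(app)
status, _headers, body = stack.call({ "HTTP_USER_AGENT" => "Mozilla/5.0 (iPhone; U)" })
puts "#{status} #{body.first}"
```

Because each middleware only wraps the next `call`, pieces like this can be stacked in front of an existing app without touching the app itself.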
    • unicorn We use unicorn as our application server: master / worker, 16 workers, preforking.
    • unicorn Instant restart after kill, hard 30s request timeouts, control over RAM growth.
    • unicorn Zero-downtime deploys, protection against a bad Rails startup, migrations handled the old-fashioned way.
    • nginx For serving static content and slow clients, we use nginx. nginx is pretty much the greatest HTTP server ever. It’s simple, fast, and has a great module system.
    • nginx Limit Zone: limit simultaneous connections from a client.
    • nginx Limit Requests: limit the frequency of connections from a client. Anti-DDoS.
    • nginx I see many people using Rack to do what the Limit modules do. Don’t.
    • nginx memcached: with memcached support, nginx can serve directly from memcached.
    • nginx Push Module: comet!
    • git The next major part of GitHub is git
    • grit We wrote an open source library called Grit which lets us use git from Ruby
    • mojombo / grit You can get it here. It originally shelled out to git and just parsed the responses, which worked well for a long time.
    • grit File.read() Eventually we realized, however, that File.read() can be 100 times faster
    • grit system() Than shelling out
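
The gap between reading a file in-process and spawning a child process can be measured with a small sketch like the one below. It does not use Grit or git: `cat` is only a stand-in for shelling out (assuming a Unix-like system), and the file content and run count are arbitrary. The actual ratio depends on the machine, so no timing is claimed here.

```ruby
require "benchmark"
require "tempfile"

# Reads the file in-process: no fork, no exec.
def read_directly(path)
  File.read(path)
end

# Spawns a child process for every call, as shelling out to git would.
# `cat` is only a stand-in here and assumes a Unix-like system.
def read_via_process(path)
  IO.popen(["cat", path], &:read)
end

Tempfile.create("object") do |file|
  file.write("197d742e9fdec3f7\n")
  file.flush

  runs = 200
  direct = Benchmark.realtime { runs.times { read_directly(file.path) } }
  forked = Benchmark.realtime { runs.times { read_via_process(file.path) } }

  puts "same bytes: #{read_directly(file.path) == read_via_process(file.path)}"
  puts format("File.read: %.4fs, child process: %.4fs", direct, forked)
end
```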
    • One of the first things Scott worked on was rewriting the core parts of Grit to be pure Ruby. Basically, a Ruby implementation of Git.
    • mojombo / grit And that’s what we run now
    • smoke Kinda. Eventually we needed to move our git repositories off of our web servers. Today our HTTP servers are distinct from our git servers. The two communicate using smoke.
    • smoke “Grit in the cloud.” Instead of reading and writing from the disk, Grit makes Smoke calls. The reading and writing then happens on our file servers.
    • bert-rpc Rather than use Protocol Buffers or Thrift or JSON-RPC, Smoke uses BERT-RPC.
    • bert-rpc bert : erlang :: json : javascript. BERT is an Erlang-based protocol. BERT-RPC is really great at dealing with large binaries, which is a lot of what we do.
    • bert-rpc We have four file servers, each running BERT-RPC servers. Our front ends and job queue make RPC calls to the backend servers.
    • mojombo / bertrpc You can grab bert-rpc on GitHub
    • mojombo / bertrpc Or if you just want to play with BERT
    • chimney We have a proprietary library called chimney. It routes the smoke. I know, don’t blame me.
    • chimney All user routes are kept in Redis. Chimney is how our BERT-RPC clients know which server to hit. It falls back to a local cache and auto-detection if Redis is down.
    • chimney It can also be told a backend is down. It was optimized for connection refused, but in reality that wasn’t the real problem.
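
Chimney itself is proprietary, so the following is only a sketch of the fallback idea described above: look a route up in the central store, remember the answer locally, and serve the remembered answer when the store is unreachable. The class, the error type, and the host names `fs1` and `fs2` are hypothetical, a lambda stands in for Redis, and the auto-detection step is left out.

```ruby
# Hypothetical router: finds which backend holds a user's repositories.
class Router
  class StoreDown < StandardError; end

  def initialize(store)
    @store = store # stand-in for the Redis-backed route table
    @local = {}    # last known routes, used when the store is down
  end

  def route_for(user)
    host = @store.call(user)
    @local[user] = host
    host
  rescue StoreDown
    @local.fetch(user) { raise "no cached route for #{user}" }
  end
end

routes = { "defunkt" => "fs1", "mojombo" => "fs2" }
state  = { up: true }
store  = ->(user) { state[:up] ? routes.fetch(user) : raise(Router::StoreDown) }

router = Router.new(store)
puts router.route_for("defunkt") # looked up in the store
state[:up] = false
puts router.route_for("defunkt") # served from the local cache
```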
    • proxymachine All anonymous git clones hit the front end machines. The git-daemon connects to proxymachine, which uses chimney to proxy your connection between the front end machine and the back end machine (which holds the actual git repository). It’s very fast and transparent to you.
    • mojombo / proxymachine proxymachine can be used to proxy any kind of TCP connection. It’s open source.
    • ssh Sometimes you need to access a repository over ssh. In those instances, you ssh to a front end machine and we tunnel your connection to the appropriate backend. To figure that out we use chimney.
    • jobs We do a lot of work in the background at GitHub
    • resque Currently we use a system called Resque.
    • defunkt / resque You can grab it on GitHub
    • resque Dealing with pushes, web hooks, creating events in the database, generating GitHub Pages, clearing & warming caches, search indexing.
    • queues In Resque, a queue is used as both a priority and a localization technique. By localization I mean “where your workers live.”
    • queues critical, high, low: these three run on our front end servers. Resque processes them in this order.
    • queues page: GitHub Pages are generated on their own machine using the `page` queue.
    • queues archive: And tarball and zip downloads are created on the fly using the `archive` queue on our archiving machines.
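
The "queue as priority and as location" idea can be sketched without Resque or Redis. In the simulation below, a worker is given an ordered list of queue names and always takes from the first non-empty one, so it never touches queues it was not started with. The class and job names are hypothetical, and a Hash of arrays stands in for the Redis lists.

```ruby
# Hypothetical in-memory stand-in for a queue-polling worker.
class PriorityWorker
  def initialize(queues, backend)
    @queues  = queues  # checked in the order given
    @backend = backend # queue name => array of pending jobs
  end

  # Pops from the first non-empty queue, so "low" only runs
  # once "critical" and "high" are drained.
  def reserve
    @queues.each do |name|
      job = @backend[name].shift
      return [name, job] if job
    end
    nil
  end
end

backend = Hash.new { |hash, name| hash[name] = [] }
backend["low"]      << "warm_cache"
backend["critical"] << "process_push"
backend["page"]     << "build_pages"

# A front end worker listens to three queues; "page" lives elsewhere.
front_end = PriorityWorker.new(%w[critical high low], backend)
while (reserved = front_end.reserve)
  puts reserved.join(": ")
end
puts "left for the pages machine: #{backend['page'].inspect}"
```

A worker on the pages machine would be started with only the `page` queue in its list, which is what keeps that work off the front ends.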
    • search On GitHub, you can search code, repositories, and people
    • solr Solr is basically an HTTP interface on top of Lucene. This makes it pretty simple to use in your code. We use solr because of its ability to incrementally add documents to an index.
    • Here I am searching for my name in source code
    • solr We’ve had some problems making it stable, but luckily the guys at Pivotal have given us some tips, like bumping the Java heap size. Whatever that means.
    • database Our database story is pretty uninteresting
    • mysql We use mysql 5
    • master / slave All reads and writes go to the master. We use the slave for backups and failover.
    • caching On the site we do a ton of caching using memcached
    • fragments We cache chunks of HTML all over. Usually they are invalidated by some action.
    • fragments Formerly we invalidated most of our fragments using a generation scheme, where you put a number into a bunch of related keys and increment it when you want all those caches to be missed (thus creating new cache entries with fresh data).
    • fragments But we had high cache eviction due to low RAM and hardware constraints, and found that scheme did more harm than good. We also noticed some cached data we wanted to remain forever was being evicted due to the slabs with generational keys filling up fast.
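
A generation scheme of the kind described above can be sketched as follows. This is an illustration under assumptions, not GitHub's code: a Hash stands in for memcached, and the class, key layout, and group names are hypothetical. The point to notice is that expiring a group never deletes anything; the old entries stay in the store until something evicts them.

```ruby
# Hypothetical generation-keyed cache; a Hash stands in for memcached.
class GenerationCache
  def initialize(store = {})
    @store = store
  end

  # Every key in a group embeds the group's current generation number.
  def key(group, name)
    generation = @store.fetch("gen:#{group}", 0)
    "#{group}:v#{generation}:#{name}"
  end

  # Returns the cached value, or stores and returns the block's result.
  def fetch(group, name)
    @store[key(group, name)] ||= yield
  end

  # Bumping the generation orphans every old key at once. The stale
  # entries are not deleted; they occupy memory until evicted.
  def expire(group)
    @store["gen:#{group}"] = @store.fetch("gen:#{group}", 0) + 1
  end

  def size
    @store.size
  end
end

cache = GenerationCache.new
cache.fetch("repo:42", "sidebar") { "<div>old</div>" }
cache.expire("repo:42")
puts cache.fetch("repo:42", "sidebar") { "<div>new</div>" }
puts "entries held: #{cache.size}"
```

With a real memcached the orphaned entries compete for space with everything else, which matches the eviction pressure described in the talk.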
    • page We cache entire pages using nginx’s memcached module. Lots of HTML, but also other data which gets hit a lot and changes rarely:
    • page Network graph JSON and participation graph data. We’re always looking to stick more into page caches.
    • object We do basic object caching of ActiveRecord objects such as repositories and users all over the place. Caches are invalidated whenever the objects are saved.
    • associations We also cache associations as arrays of IDs. Grab the array, then do a get_multi on its contents to get a list of objects. That way we don’t have to worry about caching stale objects.
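
The ID-array approach can be sketched like this. The cache client, key names, and helper method are hypothetical, a Hash stands in for memcached, and a real version would also load any missing objects from the database, which is omitted here.

```ruby
# Hypothetical cache client; a Hash stands in for memcached.
class FakeCache
  def initialize
    @data = {}
  end

  def set(key, value)
    @data[key] = value
  end

  def get(key)
    @data[key]
  end

  # Returns a hash of only the keys that were found.
  def get_multi(keys)
    keys.each_with_object({}) do |key, found|
      found[key] = @data[key] if @data.key?(key)
    end
  end
end

# The association is cached as a list of IDs; the objects themselves
# are cached once each, under their own keys.
def cached_repos_for(cache, user_id)
  ids  = cache.get("user:#{user_id}:repo_ids") || []
  keys = ids.map { |id| "repo:#{id}" }
  hits = cache.get_multi(keys)
  keys.map { |key| hits[key] }.compact # keep the order of the ID list
end

cache = FakeCache.new
cache.set("repo:1", { name: "resque" })
cache.set("repo:2", { name: "cijoe" })
cache.set("user:7:repo_ids", [2, 1])

puts cached_repos_for(cache, 7).map { |repo| repo[:name] }.join(", ")
```

Since each object lives under a single key, saving a repository only has to invalidate that one entry, and every cached association picks up the fresh copy on the next read.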
    • walker We also have a proprietary caching library called Walker
    • walker It originally walked trees and cached them when someone pushed. But now it caches everything related to git:
    • walker Commits, diffs, commit listing, branches, tags, everything.
    • Every git-related page load hits Walker a lot
    • walker For most big apps, you need to write a caching layer that knows your business domain. Generic, catch-all caching libraries probably won’t do.
    • events An example of this is our events system
    • This is one fragment
    • Each of these is a fragment
    • They’re also cached as objects
    • As well as a list of ids
    • And that’s just for the dashboard...
    • optimizations So what other optimizations have we done
    • asset servers Well we do the common trick of serving assets from multiple subdomains
    • asset servers assets0.github.com, assets1.github.com, and so forth.
    • sha asset id Instead of using timestamps for asset ids, which may end up hitting the disk multiple times on each request, we set the asset id to be the sha of the last commit which modified a javascript or css file.
    • sha asset id /css/bundle.css?197d742e9fdec3f7 and /js/bundle.js?197d742e9fdec3f7. Now simple code changes won’t force everyone to re-download the css or js bundles.
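
A sketch of how the two asset tricks could fit together is shown below. The helper class, the host count, and the use of a CRC to pick a host are assumptions for illustration; the talk does not say how the sha is obtained or how hosts are chosen, so the id is simply passed in once rather than computed per request.

```ruby
require "zlib"

# Hypothetical helper: builds asset URLs with a deploy-wide asset id.
class AssetPaths
  def initialize(asset_id, hosts: 4)
    @asset_id = asset_id # computed once, never per request
    @hosts    = hosts    # assumed number of assetsN subdomains
  end

  def url(source)
    # A stable checksum keeps each asset on the same subdomain across
    # requests and processes, so browser caches stay warm.
    host = "assets#{Zlib.crc32(source) % @hosts}.github.com"
    "http://#{host}#{source}?#{@asset_id}"
  end
end

# The id from the slide; in practice it would be the sha of the last
# commit that touched a javascript or css file.
assets = AssetPaths.new("197d742e9fdec3f7")
puts assets.url("/css/bundle.css")
puts assets.url("/js/bundle.js")
```

The query string only changes when a commit touches a stylesheet or script, so a deploy that changes only Ruby code leaves every cached bundle valid.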
    • bundling For bundling itself, we use
    • bundling yui’s compressor for css and
    • bundling google’s closure compiler for javascript. We don’t use the most aggressive setting because it means changing your javascript to appease the compression gods, which we haven’t committed to yet.
    • scripty 301 Again, for most of these tricks you need to really pay attention to your app. One example is scriptaculous’ wiki
    • scripty 301 When we changed our wiki URL structure, we set up dynamic 301 redirects for the old URLs. Scriptaculous’ old wiki was getting hit so much we put the redirect into nginx itself. This took strain off our web app and made the redirects happen almost instantly.
    • ajax loading We also load data in via ajax in many places. Sometimes a piece of information will just take too long to retrieve. In those instances, we usually load it in with ajax.
    • If Walker sees that it doesn’t have all the information it needs, it kicks off a job to stick that information in memcached.
    • We then periodically hit a URL which checks if the information is in memcached or not. If it is, we get it and rewrite the page with the new information.
    • We use this same trick on the Network Graph
    • Fork Queue
    • ajax loading and anywhere else it makes sense.
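
The server side of this polling pattern can be sketched as below. Walker is proprietary, so every name here is hypothetical: a Hash stands in for memcached, an Array stands in for the job queue, and the worker is simulated inline. The endpoint either returns the data or reports "not ready" and makes sure a job has been queued.

```ruby
# Hypothetical polling endpoint for data that is slow to compute.
class SlowData
  def initialize(cache, queue)
    @cache = cache # stand-in for memcached
    @queue = queue # stand-in for the background job queue
  end

  # Called by the first page render and again by each poll request.
  def status(key)
    if @cache.key?(key)
      { ready: true, data: @cache[key] }
    else
      @queue << key unless @queue.include?(key) # enqueue only once
      { ready: false }
    end
  end
end

cache = {}
queue = []
endpoint = SlowData.new(cache, queue)

p endpoint.status("network:defunkt/resque") # not ready, job enqueued
# A background worker picks up the job and fills the cache.
cache[queue.shift] = "graph json"
p endpoint.status("network:defunkt/resque") # ready, page can be rewritten
```

The browser keeps requesting the status URL on a timer and swaps the content in once `ready` is true.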
    • comet loading Very soon this will all be comet, though.
    • monitoring What do we use for monitoring?
    • nagios Our support team monitors the health of our machines and core services using nagios. I don’t really touch the thing.
    • Here’s a screenshot from my IE browser, complete with the ICQ plugin
    • resque web We monitor our queue using Resque’s included Sinatra app
    • haystack We use an in-house app called Haystack to monitor arbitrary information, tracked as JSON.
    • Here’s an example of Haystack’s “exceptions” view
    • collectd We also use collectd to monitor load, RAM usage, CPU usage, and other app-related metrics
    • pingdom pingdom sends us SMSes when the site is down. It’s nice.
    • tender tender is what we use for customer support.
    • It works incredibly well, and they’re constantly improving it.
    • testing Our testing setup is pretty standard
    • test unit We mostly use Ruby’s test/unit. We’ve experimented with other libraries including test/spec, shoulda, and RSpec, but in the end we keep coming back to test/unit
    • git fixtures As many of our fixtures are git repositories, we specify in the test what sha we expect to be the HEAD of that fixture. This means we can completely delete a git repository in one test, then have it back in pristine state in another. We plan to move all our fixtures to a similar git-system in the future.
    • ci joe We use ci joe, a continuous integration server, to run our tests after each push. He then notifies us if the tests fail.
    • defunkt / cijoe You can grab him at github
    • staging We also always deploy the current branch to staging. This means you can be working on your branch, someone else can be working on theirs, and you don’t need to worry about reconciling the two to test out a feature. One of the best parts of Git.
    • security
    • github.com/security Having a security page really helps.
    • security@github.com We get weekly emails to our security email (that people find on the security page), and people are always grateful when we can reassure them or answer their question.
    • consultant If you can, find a security consultant to poke your site for XSS vulnerabilities. Having your target audience be developers helps, too.
    • backups Backups are incredibly important. Don’t just make backups: ensure you can restore them, as well.
    • sql We keep nightly, off-site backups of our SQL databases.
    • git And the same for all our git repositories.
    • Thanks. Thanks for coming.