A guest lecture I gave for the "Internet Technology" course at my old University (Bath). I tried to pull together all of the things I wish I'd been told before I started building things on the Web.
1. How to build the Web
Simon Willison
30th November 2007
2. This talk
• Modern client-side engineering
• Server-side engineering and web frameworks
• Web application security
• Building sites that scale
3. What to build / How to build it
What to build:
• Product design
• Information architecture
• User experience
• Social software design
• Usability
• Marketing
• ...
How to build it:
• Browsers! Client-side engineering
• Servers! Server-side engineering
7. “Yahoo! Juku is a comprehensive, 3-6 month
program to train professional front end
developers. The curriculum includes advanced
topics in JavaScript, DOM, HTML, CSS, YUI,
performance, and accessibility.
Why train raw recruits to this degree? Well,
in the San Francisco Bay Area,
including the Silicon Valley, it’s
hard-as-heck to find good front end
programmers and web designers.”
http://developer.yahoo.net/blog/archives/2007/11/the_harvard_of.html
9. Polyglots
http://ideology.com.au/polyglot/
POLYGLOT - a program in eight languages (15 February 1991; 10th Anniversary Edition, 1 December 2001), written by Kevin Bungard, Peter Lisle, and Chris Tham.
[Slide shows the full source of the polyglot, a single file that prints "hello polyglots" in every language, with labels pointing out Perl, Pascal, Fortran, COBOL, PostScript, bash/sh/csh and x86 assembler alongside the C code.]
11. Rendering engines
• Presto: Opera desktop, Opera mobile, Nintendo Wii, Nintendo DS
• WebKit: Safari, iPhone, Nokia Series 60, Google Android
• Gecko: Firefox, Iceweasel, Camino, Galeon
• Internet Explorer: sadly still 85% of the market
12. IE is the problem child
• Microsoft simply stopped updating it once
they had won the browser wars... IE 6 came
out in 2001!
• Still has shaky support for CSS 2.1
• Many JavaScript APIs developed before
standards even existed
• Requires a disproportionate amount of
development time
• Status of IE 8 is uncertain
13. Recommendations
• Develop to the standards using Firefox
• The cases where IE deviates from the
standards are relatively well understood, and
can usually be worked around
• Avoid CSS hacks; conditional comments are
your friend
<!--[if IE]><link rel="stylesheet" type="text/css" href="/static/ieonly.css"><![endif]-->
16. Accessibility
• Assistive technology thrives on semantic HTML
• <label> elements for forms
• <h1>...<h6> headers for structure
• Avoiding tables for layout
• Watch a video of a screen reader user; they may well
browse faster than you do
• Accessibility is much more than just screen readers -
colour blindness, motor disorders, learning
disabilities, even just poor eyesight
17. JavaScript
“JavaScript was a rushed little hack for
Netscape 2 that was then frozen prematurely
during the browser wars, and evolved
significantly only once by ECMA. So its early
flaws were never fixed, and worse, no
virtuous cycle of fine-grained community
feedback [...] ever occurred.”
-Brendan Eich
18. But despite that...
• JavaScript is actually a really neat little
language
• Functions are first-class objects
• Lexical closures
• Objects are hash tables
• If you take the time to learn it, it will repay
you handsomely
23. AJAX vs. Ajax
• AJAX: “Asynchronous JavaScript + XML”
• Ajax: “Any technique that allows the client to retrieve more data from the server without reloading the whole page”
24. Unobtrusive JavaScript
• JavaScript isn't always available
• Security conscious organisations (and
users) sometimes disable it
• Some devices may not support it (mobile
phones for example)
• Assistive technologies (screen readers)
may not play well with it
• Search engine crawlers won't execute it
• Unobtrusive: stuff still works without it!
25. Progressive enhancement
• Start with solid markup
• Use CSS to make it look good
• Use JavaScript to enhance the usability of the
page
• The content remains accessible no matter
what
27. labels.js
• One of the earliest examples of this
technique, created by Aaron Boodman (now
of Greasemonkey and Google Gears fame)
30. How it works
<label for="search">Search</label>
<input type="text" id="search" name="q">
• Once the page has loaded, the JavaScript:
• Finds any label elements linked to a text field
• Moves their text into the associated text field
• Removes them from the DOM
• Sets up the event handlers to remove the
descriptive text when the field is focused
• Clean, simple, reusable
31. easytoggle.js
• An unobtrusive technique for revealing
panels when links are clicked
<ul>
<li><a href="#panel1" class="toggle">Panel 1</a></li>
<li><a href="#panel2" class="toggle">Panel 2</a></li>
<li><a href="#panel3" class="toggle">Panel 3</a></li>
</ul>
<div id="panel1">...</div>
<div id="panel2">...</div>
<div id="panel3">...</div>
34. How it works
• When the page has loaded...
• Find all links with class="toggle" that reference an
internal anchor
• Collect the elements that are referenced by those
anchors
• Hide all but the first
• Set up event handlers to reveal different panels when a
link is clicked
• Without JavaScript, links still jump to the right point
35. Django filter lists
• Large multi-select boxes aren't much fun
• Painful to scroll through
• Easy to lose track of what you have
selected
• Django's admin interface uses unobtrusive
JavaScript to improve the usability here
38. • Ajax is often used to avoid page refreshes
• So...
• Write an app that uses full page refreshes
• Use unobtrusive JS to "hijack" links and
form buttons and use Ajax instead
• Jeremy Keith coined the term "Hijax" to
describe this
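A rough sketch of what the server side of this can look like: the same handler returns the full page for a normal request and only the changed fragment when the request was made by the hijacking script. The function name `render_results` and the use of the `X-Requested-With: XMLHttpRequest` header (which several JavaScript libraries send, but which is an assumption here) are illustrative and not taken from the talk.

```python
from html import escape


def render_results(items, headers):
    """Return a full page, or just the results fragment for Ajax requests."""
    fragment = "<ul>" + "".join(
        "<li>" + escape(item) + "</li>" for item in items
    ) + "</ul>"
    # Assumption: the client-side script marks its Ajax requests with this header.
    if headers.get("X-Requested-With") == "XMLHttpRequest":
        return fragment
    # Ordinary request (no JavaScript involved): serve the whole page.
    return "<html><body><h1>Results</h1>" + fragment + "</body></html>"
```

Because the fragment is a subset of the full page, the site keeps working with full page refreshes when JavaScript is unavailable.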
39. JavaScript libraries
“The bad news:
JavaScript is broken.
The good news:
It can be fixed with
more JavaScript!”
- Geek folk saying
40. Main contenders
• Prototype
• The Yahoo! User Interface Library
• The Dojo Toolkit
• jQuery
• It’s worth evaluating these in detail, but if
you only have time to learn one...
42. Client-side performance
• Relatively new field, pioneered by the
performance team at Yahoo!
• A few simple changes can make a huge
difference to perceived loading times
• Example tip: serve your static files (CSS,
images etc) from a separate domain - that
way the cookies from your regular domain
won’t slow down the requests
47. Characteristics of good URLs
• “Cool URIs don’t change”
• Guessable
• Hackable
• Readable over the phone
• Reflects the hierarchy of the site and its data
48. A good URL
• simonwillison.net/2007/Nov/27/thumbnail/
• Short, hackable, no implementation exposed
• No matter what you’re building, including
the year can be really useful in allowing you
to change your opinion on your URLs later
on without breaking old links
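A minimal sketch of how a URL shaped like the one above might be matched on the server, assuming a regex-based router in the style of Django's URL configuration. The pattern and the `parse_entry_url` helper are hypothetical, not the actual configuration behind simonwillison.net.

```python
import re

# Hypothetical pattern for paths shaped like /2007/Nov/27/thumbnail/
ENTRY_URL = re.compile(
    r"^/(?P<year>\d{4})/(?P<month>[A-Z][a-z]{2})/(?P<day>\d{1,2})/(?P<slug>[\w-]+)/$"
)


def parse_entry_url(path):
    """Return the URL's components as a dict, or None if the path doesn't match."""
    match = ENTRY_URL.match(path)
    return match.groupdict() if match else None


print(parse_entry_url("/2007/Nov/27/thumbnail/"))
```

Nothing in the path reveals the language or file layout behind it, and the year component leaves room to introduce a different scheme for later content without touching old links.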
49. The Open Source stack
• The only option I would consider
• Open source means:
• Zero vendor lock-in; many open-source
components are interchangeable
• Better support (fix it yourself, or pay
someone smart to fix it for you)
• Fewer bugs and better quality code
51. Dynamic languages
• Social applications in particular are almost
impossible to get right first time
• Development only really starts after you’ve
launched something and seen what people use
it for
• Speed and flexibility of development are critical
• Dynamic languages let you get more done with
fewer lines of code (which means fewer bugs)
74. How do you build a site like lawrence.com?
• Interns - unpaid labour!
• A big relational database
• Newspaper people are baffled by these...
• ... so you need a good interface for it
• And as many development shortcuts as possible
75. Characteristics
• Clean URLs
• Loosely coupled components
• Designer-friendly templates
• Less code
• The “good bits” from PHP
77. The Django workflow
• Build the models
• Instant admin! Content people can start
adding data
• Writing the views
• Throw the templates to the designers
79. Open source Django
• Django has been open-source since mid-2005
• The newspaper has been able to hire
excellent developers from the community
• The newspaper CMS is sold as Ellington;
one of the features is that you can hire your
own Django developers to modify it
• Django has been hugely improved by
contributions from outside the newspaper
84. All frameworks provide:
• A recommended way of laying out code
• Separation of application and presentation
logic using a template system
• An ORM, to reduce the amount of code
needed to talk to a database
• Reusable components for common tasks
88. • SQL injection is inexcusable
• If the environment you are using doesn’t
protect against this for you (through
parameterised queries), use a different tool
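As an illustration of parameterised queries, here is a small sketch using Python's built-in `sqlite3` module. The `users` table and the `find_user` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("simon",))


def find_user(conn, name):
    # The ? placeholder passes the value separately from the SQL text,
    # so it is never interpreted as part of the query.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


hostile = "x' OR '1'='1"
print(find_user(conn, "simon"))
print(find_user(conn, hostile))
```

The hostile string is treated as an (unlikely) user name rather than as SQL, so it matches no rows.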
89. Cross-site scripting
• The most common security hole on the web
http://example.com/search?q=<script>alert("hello");</script>
You searched for <?php echo $_GET['q']; ?>
• Massive security hole!
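The usual fix is to escape user input before writing it into the page. A minimal sketch in Python, using the standard library's `html.escape`; the `render_search_heading` function is a hypothetical stand-in for the PHP template line above.

```python
from html import escape


def render_search_heading(query):
    # escape() turns <, >, & and quotes into HTML entities,
    # so the browser displays the input instead of executing it.
    return "You searched for " + escape(query)


print(render_search_heading('<script>alert("hello");</script>'))
```

Template systems that escape output by default remove the need to remember this call on every variable.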
90. XSS attackers can...
• Replace your logo with something obscene
• Steal your users’ authentication cookies
• Re-target login forms to point to a password
stealing script
• Perform any action that the user is allowed
to perform themselves
• Create self-propagating worms
93. HTML is dangerous
• It’s best not to allow un-trusted users to
submit HTML at all
• If you let them submit HTML, you’ll need an
industrial grade HTML parser (which
emulates browsers, not just the HTML spec)
and a very restrictive whitelist
• CSS can include JavaScript, and even regular
CSS positioning can be used for phishing
94. CSRF
• Much less widely understood than XSS...
• ... but almost certainly more common
• Cross-site request forgery attacks allow
attackers to force your users to take actions on
your site that they didn’t mean to take
• <img src="http://example.com/admin/delete.php?id=5">
• Not just GET; hidden forms allow POST as well
97. Defence against CSRF
• You need to know if the form that is being
submitted is one that you served up from
your own site (as opposed to an evil form
created by an attacker)
• Include a hidden form field with a token
generated by your site and associated with
the logged in user in a non-predictable way
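One way to sketch such a token, assuming each logged-in user has a session ID and the site holds a secret key: derive the token from both with an HMAC. The names `SECRET_KEY`, `make_csrf_token` and `is_valid_csrf_token` are illustrative only; real frameworks have their own implementations.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-site secret; a real site would load this from configuration.
SECRET_KEY = secrets.token_bytes(32)


def make_csrf_token(session_id):
    """Token for the hidden form field, unpredictable without SECRET_KEY."""
    return hmac.new(SECRET_KEY, session_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_valid_csrf_token(session_id, submitted_token):
    """Check that the submitted token matches the one issued for this session."""
    expected = make_csrf_token(session_id)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected.encode("utf-8"), submitted_token.encode("utf-8"))
```

An attacker's form cannot include a valid token, because computing one requires both the site's secret and the victim's session ID.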
100. Scalability is not performance
Scalable systems increase their performance
as new hardware is added, proportional to
the hardware’s capacity
101. Vertical vs. horizontal
• Vertical scaling: buy a bigger machine
• More RAM
• More CPU(s)
• “Big iron” costing $100,000+
• Horizontal scaling: buy more machines
• Almost always better than vertical scaling
• But... software must be designed to scale out
104. “Shared nothing”
• Rasmus Lerdorf, the creator of PHP,
describes this as a key principle of scaling
• Application servers (web servers running
PHP) have no shared state - everything
stateful is pushed out to the database layer
• This lets you trivially horizontally scale your
application servers behind a load balancer
• Now you just have to scale the data layer...
105. Four steps to building a scalable data layer
• Add caching
• De-normalise where necessary
• Add database replication
• Add sharding
106. Caching
• You could cache to disk or shared memory...
• ... but you’re better off using memcached
• Distributed key/value in-memory caching
system, first developed for LiveJournal
• Facebook, YouTube, Wikipedia, Flickr...
def get_obj(obj_id):
    obj = memcache.get(obj_id)
    if obj is None:  # cache miss: rebuild from the database
        obj = construct_obj_from_database(obj_id)
        memcache.set(obj_id, obj)  # store it for the next request
    return obj
107. “Normalised data is for sissies”
Cal Henderson, Flickr
• You can get a major speed-up by duplicating
some data (e.g. counts) in your database
• Your application logic will need to keep
everything in sync
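A small sketch of a denormalised count, using `sqlite3` and a made-up photos/comments schema: the number of comments is stored on the photo row, and the application updates it in the same transaction that inserts the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photos (id INTEGER PRIMARY KEY,
                         comment_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, photo_id INTEGER, body TEXT);
    INSERT INTO photos (id) VALUES (1);
""")


def add_comment(conn, photo_id, body):
    # Both statements run in one transaction, so the stored count
    # cannot drift from the real number of comment rows.
    with conn:
        conn.execute("INSERT INTO comments (photo_id, body) VALUES (?, ?)",
                     (photo_id, body))
        conn.execute("UPDATE photos SET comment_count = comment_count + 1 WHERE id = ?",
                     (photo_id,))


def comment_count(conn, photo_id):
    # Cheap single-row read instead of a COUNT(*) over the comments table.
    row = conn.execute("SELECT comment_count FROM photos WHERE id = ?",
                       (photo_id,)).fetchone()
    return row[0]
```

Every code path that adds or removes comments has to go through logic like this, which is the cost of the faster read.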
108. Replication
• Master-slave replication lets you set up
copies of the database to accelerate reads
Diagram: all writes go to the master; the master replicates to three slaves; reads are spread across all slaves.
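In application code, this split can be sketched as a small router that hands out the master for writes and rotates through the slaves for reads. The `ReplicatedDatabase` class is hypothetical, and the connections are plain strings standing in for real database handles.

```python
import itertools


class ReplicatedDatabase:
    """Routes writes to the master and spreads reads across the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # cycle() yields the slaves in round-robin order, forever.
        self._slaves = itertools.cycle(slaves)

    def for_write(self):
        return self.master

    def for_read(self):
        return next(self._slaves)


db = ReplicatedDatabase("master", ["slave1", "slave2", "slave3"])
print(db.for_write())
```

Adding read capacity then means adding another slave to the list; write capacity is still limited by the single master.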
109. Replication
• Master-master replication provides redundant
masters, but doesn’t really improve write
performance (both still have to make the same
number of writes)
Diagram: all writes go to the two masters; the masters replicate to three slaves; reads are spread across all slaves.
110. Sharding
• Sometimes known as federation
• Users 1-1000 are on database A, 1001-2000
are on database B...
• Often requires a large scale re-write of the
system
• Much harder to do in social applications
where relationships span multiple databases
• WordPress MU is an interesting case-study
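A sketch of range-based shard lookup, assuming 1,000 users per database with user 1000 being the last user on database A. The shard names and the `shard_for_user` function are made up for illustration.

```python
USERS_PER_SHARD = 1000
# Hypothetical shard names, in user-ID order.
SHARDS = ["database_a", "database_b", "database_c"]


def shard_for_user(user_id):
    """Return the name of the database that holds this user's data."""
    if user_id < 1:
        raise ValueError("user IDs start at 1")
    index = (user_id - 1) // USERS_PER_SHARD
    if index >= len(SHARDS):
        raise LookupError("no shard configured for user %d" % user_id)
    return SHARDS[index]


print(shard_for_user(1500))
```

Every query then has to start by asking which shard to talk to, which is why anything joining data across users on different shards becomes awkward.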
112. Scalable business models
• Scaling gets a lot easier if you build it in to your
business model
• 37signals products (Basecamp, Highrise)
shard naturally based on individual customer
accounts - and more customers means more
money for servers
• Second Life shards by land area, and land
has to be bought by users - they’re essentially a
3D web hosting company
113. Build it on Amazon
• S3 - Simple Storage Service
• Cheap, robust key-value storage of both
small and large files
• EC2 - Elastic Compute Cloud
• On-demand instant virtual servers, billed
by the hour
• SQS - Simple Queue Service