Your search crawler grabs the list of external links from the database and begins running its process. When it completes a page, it stores it in an index.
This application would be resource intensive and couldn’t be run on every page save, so it would have to be run from a cron script.
Google App Engine
Myles Braithwaite <myles@monkeyinyoursoul.com>
Monkey in your Soul
Who am I?
• Indie/Hobby Programmer.
What is Google App Engine?
• Allows you to run your web applications on Google’s infrastructure
• Scaling is really simple
• No capital needed for system administrators
Google’s Server Environment
• Web Server (HTTP and HTTPS)
• Datastore using Bigtable (schema-free)
• Authentication of users through Google Accounts
• Scheduled Tasks (i.e. Cron)
Sandbox
• Applications can only access other computers through HTTP and email.
• Users can only access GAE applications through HTTP.
• An application cannot write to the filesystem.
• Application code only runs in response to an HTTP request and has 30 seconds to run.
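One way to live with the 30-second limit is to do the work in small batches and stop before the deadline, leaving the rest for a later request. A minimal sketch in plain Python; `process_within_budget`, its parameters, and the 25-second default are illustrative assumptions, not part of the App Engine SDK.

```python
import time


def process_within_budget(items, handler, budget_seconds=25.0, clock=time.monotonic):
    """Handle items one by one until the time budget is used up.

    Returns (done, remaining): the handler results so far and the items
    that were not started. The clock is only checked between items, so a
    single slow handler call can still overrun the budget.
    """
    deadline = clock() + budget_seconds
    done = []
    remaining = list(items)
    while remaining and clock() < deadline:
        item = remaining.pop(0)
        done.append(handler(item))
    return done, remaining


if __name__ == "__main__":
    # Hypothetical usage: "crawl" a few URLs, keep whatever is left for later.
    urls = ["http://bsite.com/", "http://csite.com/"]
    done, remaining = process_within_budget(urls, len)
    print(done, remaining)
```

The leftover items could be saved and picked up by the next scheduled run.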
Getting Started with App Engine
• Download the SDK @ code.google.com/appengine
GAE SDK
GAE Project
Cron/Scheduled Tasks
cron.yaml
Deploy your App
Windows or a Mac
GAE Roadmap
• Task queues for performing background processing
• Ability to receive and process incoming email
• Support for sending and receiving XMPP (Jabber) messages
Message Queues
• Let’s say you run a wiki about widgets.
• And you want to index all the external links.
• But you are on a shared host and have limited resources.
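To make the scenario concrete, here is a rough sketch of how a wiki might pull the external links out of one rendered page using only the Python standard library. `LinkCollector`, `external_links`, and the example hostnames are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html, page_url):
    """Return the unique http(s) links in html that point to another host."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc.lower()
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc.lower() == own_host:
            continue  # skip links back into the wiki itself
        if absolute not in found:
            found.append(absolute)
    return found


if __name__ == "__main__":
    page = '<a href="/wiki/Widget">in</a> <a href="http://bsite.com/page">out</a>'
    print(external_links(page, "http://asite.com/wiki/Home"))
```

The resulting list is what the wiki would store in its database for a crawler to work through.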
[Diagram. Labels: ASITE.COM, Database, Search, Crawler, Index, BSITE.COM, CSITE.COM]
Application requests websites it wants indexed.
[Diagram. Labels: ASITE.COM, Database, GAE, Search Application, Crawler, Index, BSITE.COM, CSITE.COM]
GAE responds with crawled data.
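In its simplest form, the crawler's index could be a mapping from each word to the pages that contain it. The sketch below works under that assumption only; `build_index` and `search` are hypothetical helpers, and nothing here is an App Engine API.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Build an inverted index from {url: page_text}.

    Returns {word: sorted list of URLs whose text contains that word}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    results = set(index.get(words[0], []))
    for word in words[1:]:
        results &= set(index.get(word, []))
    return sorted(results)


if __name__ == "__main__":
    # Made-up crawled text for the two external sites.
    crawled = {
        "http://bsite.com/": "Blue widgets for sale",
        "http://csite.com/": "Widgets and gadgets",
    }
    print(search(build_index(crawled), "blue widgets"))
```

A real index would also need ranking and persistent storage, which this sketch leaves out.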
Fear of the Elephant
• AppScale: http://appscale.cs.ucsb.edu/
• To run GAE on EC2 or Eucalyptus
• Somewhat stable (except no data persistence)
• GAE2Django: http://tinyurl.com/gae2d/
• A bridge to run GAE code on Django.
• Somewhat stable (except no database support)