The technology of the Human Protein Reference Database (draft, 2003)

Between 2002 and 2004, I managed the technology team that built the Human Protein Reference Database (http://hprd.org) at the Institute of Bioinformatics in Bangalore and Johns Hopkins University in Baltimore. These are my notes on the tech from sometime in 2003, rediscovered in 2014 when I was looking through old files.


Usage Rights

CC Attribution License

  • Insert points here outlining the data requirements of HPRD.
  • Needs more slides before this explaining the organization of a project in Zope.
  • Backup for statements on C and PHP programmers:
    In a C function, all variables have to be declared first with an explicit data type before they can be used. Variables cannot be declared just before use. C programmers tend to reuse temporary variables in a long function.
    A C programmer new to Python will therefore tend to write C code translated into Python. Examples of this coding style are initializing temporary variables to blank values (“” for strings and 0 for integers) and reusing the same variables instead of deleting them and using new ones, or better, writing nested functions.
    An example problem caused by this style: one part of a long function expects a temporary variable to hold a blank value, but the variable now contains something else because an earlier part of the function was extended to use it and the programmer forgot to reset it afterwards. Such bugs can wreak havoc in code that was functioning perfectly before.
    The problem with PHP programmers is not as severe. Because PHP’s object orientation isn’t very good, PHP programmers again tend to write a bunch of functions when they should have defined a new class instead. The same code management problems follow.
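The temporary-variable bug described in the note above can be sketched as follows. This is a hypothetical illustration, not code from HPRD: the function names and data are invented, and the "Pythonic" version uses small module-level helpers where the note suggests nested functions.

```python
def report_c_style(names, scores):
    """C-style Python: temporaries initialized to blanks up front and reused."""
    tmp = ""
    total = 0
    # Upper part, added later: borrows tmp to build initials and never resets it.
    for name in names:
        tmp += name[0]
    initials = tmp
    # Lower part, written first: silently assumes tmp is still "".
    for score in scores:
        total += score
        tmp += str(score) + " "
    return initials, tmp.strip(), total


def initials_of(names):
    """Each helper owns its own local state, so nothing can leak between steps."""
    return "".join(name[0] for name in names)


def score_line(scores):
    return " ".join(str(score) for score in scores)


def report_pythonic(names, scores):
    return initials_of(names), score_line(scores), sum(scores)
```

With the inputs `["Ada", "Bob"]` and `[3, 4]`, the C-style version returns a score line polluted by the leftover initials, while the version built from small functions does not.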

Presentation Transcript

  • Human Protein Reference Database
    An analysis of the technology powering the database and website, and how it was developed.
    Kiran Jonnalagadda
  • Facts About HPRD
    • HPRD is a database of all disease-causing proteins in the human body.
    • It is the most comprehensive database of its kind in the world today.
    • Unlike most other biological databases, HPRD is protein-centric, not gene-centric.
  • Factors Leading to Choice of DB
    • The biologists hadn’t settled on what information was to be stored and therefore the data type definitions changed often.
    • Several data types were fairly similar to others but not the same.
    • Future extensions had to be built by tech-savvy biologists with minimal assistance from programmers.
  • What We Used
    • The Zope application server, comprising:
        – The Web publishing object framework.
        – ZODB, the object database storage system.
        – ZCatalog, the indexing and search system.
        – ZEO, the stand-alone database server for multiple front-end Web servers.
  • Why an RDBMS Was Not Suited
    • Data type definitions changed frequently. In an RDBMS, this would have meant redefining tables every week.
    • The code currently has about forty data classes. Imagine having that many data tables, plus tables for relationships between them, all under frequent revision.
  • How Zope Handled These Issues
    • Zope is built on Python, which offers dynamic data structures.
    • ZODB uses this ability to make the entire database look like one large data structure, transparently swapping unused parts to disk and recovering them as needed.
    • ZCatalog indexes data for searching.
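The idea of a database that looks like one large data structure can be sketched with the standard library's `shelve` module. This is only a stand-in for ZODB, chosen so the example runs without extra packages: unlike ZODB, `shelve` needs an explicit reassignment to persist a change, and the record keys and field names here are invented.

```python
import os
import shelve
import tempfile


def store_and_reload():
    """Store a record, add a field with no schema change, and read it back."""
    path = os.path.join(tempfile.mkdtemp(), "records")

    with shelve.open(path) as db:
        # The store behaves like a dictionary that happens to live on disk.
        db["P1"] = {"name": "example protein"}

        # A new field needs no table redefinition: just add it to the record.
        record = db["P1"]
        record["interactions"] = ["P2", "P3"]
        db["P1"] = record  # shelve persists on assignment (ZODB tracks this itself)

    # Reopen the store to show the data really came back from disk.
    with shelve.open(path) as db:
        return dict(db["P1"])
```

In a relational schema, the `interactions` field would typically have required a new column or a join table; here the record simply grows.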
  • At Zope’s Core is Python
    • Python is a dynamic language.
    • When I say dynamic, I mean everything is dynamic!
    • Code, variables, classes, modules, everything can be modified at run-time.
    • Most of Zope is built around this ability. Zope could not have been implemented in another language.
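A minimal sketch of what "everything can be modified at run-time" means in practice, written in present-day Python 3 rather than the Python of 2003. The class and attribute names are invented for illustration.

```python
class Record:
    def __init__(self, title):
        self.title = title


record = Record("example")

# Add an attribute to a single instance at run-time.
record.species = "human"


# Add a method to the class at run-time; existing instances see it immediately.
def describe(self):
    return f"{self.title} ({getattr(self, 'species', 'unknown')})"


Record.describe = describe

# Build a whole new class at run-time from a name, base classes and attributes.
Motif = type("Motif", (Record,), {"kind": "motif"})
```

Here `record.describe()` works even though `record` was created before the method existed, and `Motif` inherits it without any class statement having been written.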
  • Data Storage in Zope
    • In Zope, data is stored in instances of a data class.
    • The data class has variables, which are like fields, and methods, which manipulate data.
    • Instances of a data class (objects) are stored in the ZODB, making the database.
    • Objects can contain other objects, forming hierarchies.
  • Components of Zope
    • ZServer (formerly Medusa)
        – Handles incoming requests.
        – Does HTTP, FTP, WebDAV, XML-RPC; soon SOAP.
    • ZPublisher
        – Maps URLs to objects and handles security.
    • ZODB (Zope Object DataBase)
        – Stores objects on disk in a transactional DB.
    • ZEO (Zope Enterprise Objects)
        – ZODB server for multiple Zope front-end servers.
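How objects that contain other objects can be addressed by URL is sketched below. This is a simplified model of the idea of mapping URL paths onto an object hierarchy, not ZPublisher's actual algorithm; `Folder`, `Page` and `traverse` are hypothetical names, and security checks are left out.

```python
class Folder:
    """A container object; children are looked up by name."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def add(self, child):
        self.children[child.name] = child
        return child


class Page:
    """A leaf object with no children."""

    def __init__(self, name, body):
        self.name = name
        self.body = body


def traverse(root, url_path):
    """Walk from root one path segment at a time; return None if missing."""
    node = root
    for segment in url_path.strip("/").split("/"):
        if not segment:
            continue  # tolerate "/" and doubled slashes
        children = getattr(node, "children", None)
        if children is None or segment not in children:
            return None
        node = children[segment]
    return node


root = Folder("")
proteins = root.add(Folder("proteins"))
entry = proteins.add(Page("p1", "example entry"))
```

With this hierarchy, `traverse(root, "/proteins/p1")` returns the `entry` object, and a path that does not exist returns `None`.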
  • Security in Zope
    • Security is fine-grained.
    • Security is defined around four concepts:
        – Users, Roles, Permissions and Hierarchies.
    • A user is assigned one or more roles.
    • A role is assigned a set of permissions.
    • This set can be reassigned at different positions in the hierarchy.
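The relationship between users, roles, permissions and the hierarchy can be sketched roughly as below. This is a simplified model, not Zope's security API: it assumes that the nearest position in the hierarchy that defines a role's permissions wins, and all names (`Node`, `allowed`, the roles and users) are invented.

```python
class Node:
    """A position in the object hierarchy with optional local settings."""

    def __init__(self, name, parent=None, role_permissions=None):
        self.name = name
        self.parent = parent
        # Local settings: role name -> set of permissions granted here.
        self.role_permissions = role_permissions or {}

    def permissions_for(self, role):
        """Use the nearest definition, looking here first and then upwards."""
        node = self
        while node is not None:
            if role in node.role_permissions:
                return node.role_permissions[role]
            node = node.parent
        return set()


def allowed(roles, node, permission):
    """A user may act if any of their roles carries the permission at node."""
    return any(permission in node.permissions_for(role) for role in roles)


# Users are assigned roles; roles are assigned permissions at the root.
user_roles = {"alice": ["Curator"], "guest": ["Anonymous"]}
root = Node("root", role_permissions={
    "Curator": {"view", "edit"},
    "Anonymous": {"view"},
})
# The Anonymous role's permission set is reassigned lower in the hierarchy.
drafts = Node("drafts", parent=root, role_permissions={"Anonymous": set()})
```

In this sketch a guest can view the root but not the drafts folder, while a curator keeps both permissions everywhere because nothing below the root overrides that role.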
  • Security Outside Zope
    • Zope’s security mechanism is limited to the Web front.
    • It is applied only to objects that directly interface with the end-user.
    • Code written in a module in the filesystem has no security restrictions. It can do anything.
  • Limitations in Zope
    • The API for creating extensions (called Products) is complicated and poorly documented.
    • The Property Manager interface is too primitive. It only handles the very basic data types such as strings, integers, boolean fields, selection lists and multi-line text.
  • Our Extensions to Zope
    • A framework for separating Zope specifics from our data types, making it much simpler to add new data types.
    • An extended property management system that could handle changes in data type definitions and automatically migrate data.
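The slides do not say how the extended property management system worked, so the following is only a guess at the general shape of such a mechanism: a data type declares its properties, and stored data is brought in line with the current declaration when it is loaded. `Property`, `migrate` and the field names are all hypothetical.

```python
class Property:
    """Declares one field of a data type."""

    def __init__(self, name, default, renamed_from=None):
        self.name = name
        self.default = default          # a value, or a callable producing one
        self.renamed_from = renamed_from

    def make_default(self):
        return self.default() if callable(self.default) else self.default


def migrate(state, schema):
    """Return a copy of a stored record that matches the current schema."""
    migrated = {}
    for prop in schema:
        if prop.name in state:
            migrated[prop.name] = state[prop.name]          # unchanged field
        elif prop.renamed_from in state:
            migrated[prop.name] = state[prop.renamed_from]  # renamed field
        else:
            migrated[prop.name] = prop.make_default()       # newly added field
    return migrated  # fields no longer in the schema are dropped


# Current definition of a hypothetical data type.
schema = [
    Property("name", ""),
    Property("aliases", list),
    Property("gene_symbol", "", renamed_from="symbol"),
]

# A record saved under an older definition.
old_record = {"name": "example", "symbol": "EX1", "obsolete": 1}
new_record = migrate(old_record, schema)
```

Using a callable such as `list` for the default gives every migrated record its own fresh list instead of one shared between records.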
  • Part II: User Interface
    The rationale behind decisions affecting how a user experiences the database.
  • User Interface Design
    • We started with exposing Zope’s hierarchy as the public user interface
    • But there were some elements such as the category browser and the
  • Templates for the Web UI
    • Choice of DTML and ZPT for templates.
    • ZPT for templating system.
  • Part III: Project Management Lessons
    What we learnt about managing a project across continents and distant time zones.
  • Project Management Issues 1
    • We learnt the hard way that a project manager’s place is with his team, not with the client.
    • Productivity suffers in the absence of an effective collaboration tool.
    • E-mail and instant messengers are not effective collaboration tools.
  • Project Management Issues 2
    • Collaboration over e-mail imposes the burden of articulation on the communicator, which many dislike and therefore avoid.
    • Instant messaging prevents collecting thoughts before presenting them and is therefore a poor planning tool.
  • Collaboration Tools
    • We experimented with several collaboration systems, with varying effectiveness:
        – Phone calls.
        – Instant messengers.
        – Wikis.
        – Issue tracking software.
        – Mailing lists.
  • Phone Calls
    • Next best thing to face-to-face discussions.
    • But only connect two people unless nonstandard equipment is used.
    • International calls are usually too expensive for the resulting gain.
  • Instant Messengers
    • Provide critical communication between geographically distributed team members.
    • But the pressure of maintaining continuity in a conversation hinders pausing to gather thoughts.
    • Typing is much slower than talking. Therefore little else gets done alongside.
  • Wikis
    • The easy hyperlinking system of a wiki combined with structured text makes presenting information a snap.
    • With a little code thrown in, Wikis could make a wonderful project management tool.
    • A changed page notification system is needed or changes go unnoticed.
  • Issue Tracking Software
    • We use Bugzilla to track issues.
    • But in eight months of use, only 30 issues have been reported through it.
    • The other few hundred were reported over e-mail, instant messengers and in person.
    • Clearly, the problem is with Bugzilla’s usability. The search for a new system is on.
  • Mailing Lists
    • E-mail is a push medium: the latest is always on top of your inbox.
    • E-mail makes an effective to-do list in an interface the user is comfortable with.
    • Mailing lists are e-mail in broadcast mode.
    • Mailing lists have been the most effective collaboration tool we’ve used so far.
  • Issues With Programmers
    • Programmer skill levels and attitudes vary.
    • C programmers tend to write C code in Python.
    • PHP programmers tend to write PHP code in Python.
    • Learning Python is easy but thinking in Python takes a long time.
  • Programming Tools We Used
    • CVS for source control.
    • ViewCVS for a Web front-end to CVS.
    • Vim in GUI mode for source editing (preferred editor of everyone in the team).
    • The print statement for debugging.
  • Tools We Should Have Used
    • WingIDE is a $35 piece of software that provides an interactive Python debugger usable with Zope. A few minutes of use would have more than paid for it, given the hours of programmer time we instead spent debugging with the print statement.
  • Part IV: Things Needing Fixing
    Mistakes we made during development, how they affect things now, and how they can be fixed.
  • Naming Conventions
    • We started by assuming HPRD was gene-centric and named several things GeneSomething.
    • In code, this can be considered just an identifier.
    • But in a URL, such names can confuse users and need renaming.
  • Reusable Modules
    • All of the code currently sits in one directory.
    • Several important pieces are general-purpose and not tied to how HPRD uses them.
    • These modules could be separated and contributed independently to the open source code pool.
  • Data in Code
    • There are bits of implementation-specific data embedded in code in some places, particularly related to graph generation.
    • These were introduced as quick patches for a temporary problem but have remained in place for months now.
    • These need to be taken out so that the code is truly reusable.
  • Documentation
    • DocStrings needed in code.
    • Consistent language in DocStrings.
    • HTML documentation files to be distributed with code.
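As a small illustration of the kind of docstring the last slide asks for, here is a sketch using a hypothetical helper function, not one taken from the HPRD code. The example inside the docstring follows the `doctest` convention, so the documentation can be checked against the code.

```python
def normalize_symbol(symbol):
    """Return a symbol in canonical form.

    Surrounding whitespace is removed and letters are upper-cased.

    >>> normalize_symbol("  abc1 ")
    'ABC1'
    """
    return symbol.strip().upper()
```

If this lived in a module file, `python -m doctest <file>` would run the embedded example, and `python -m pydoc -w <module>` would write an HTML page built from the docstrings, which is one way to produce documentation files for distribution with the code.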