Top-down and bottom-up informatics: who has the high ground?


iEvoBio: Informatics for Phylogenetics, Evolution, and Biodiversity, at the Oregon Convention Center, Portland, Oregon, USA. June 29-30, 2010.

Published in: Technology
  • Thanks for that introduction, and good morning everybody. My name is Vince Smith and I am based at the Natural History Museum in London. For about the past 5 years, my colleague Dave Roberts and I have been involved in a number of large-scale biodiversity informatics projects in the UK and Europe. These projects have employed a range of different development styles. Some have been very tightly specified from the outset, identifying in minute detail the precise scope and functions of the project and how these will be implemented. These are classic top-down projects, where the overall project shape and design is specified from the outset and a project team slowly works together to build it. In contrast, we have also been involved in projects that have employed a very different development style. These so-called bottom-up projects, rather than specifying the overall shape of a problem, specify the component building parts and then assemble these parts together as required. Through experience it has become clear to us that these two development styles are strongly correlated with the number of people that engage with these projects and with their respective impact amongst their target user communities. What I want to do is share with you some data on this and make some general remarks about top-down and bottom-up processes when developing projects. I’ll do this by showcasing various UK and European projects we have been involved in. If nothing else I hope to raise awareness of these projects in the US, but hopefully there will also be some messages about the way we plan large-scale informatics projects that will improve their long-term impact. Designers have two different starting points when conceiving new structural forms: top-down and bottom-up. Top-down is the classical, Cartesian-center technique of picking the overall shape first and then filling in the parts.
Bottom-up, as the name implies, is the opposite: it starts with geometric components as the initial building blocks. Through repetition and variation according to logical rules, they grow to define larger systems. Bottom-up and top-down can refer to two aspects: the technical aspects of the development process and the sociological patterns of adoption. We will consider both of these. Bottom-up conceptual approaches are found throughout other art disciplines, but they are still rare in informatics.
  • So by way of background, in recent years there has been substantial investment by UK and European research funders in initiatives aimed at developing systems to more effectively collect, manage and share biodiversity data. On this slide I have listed a few of the initiatives my colleague Dave Roberts and I have been involved in. There are in fact many others that could be listed, but I’m focusing on those where we have had significant engagement, with the exception of the SYNTHESYS project, which is interesting for reasons I will come on to. Some key points to note are that: 1) almost $50M (about €40M at present rates) has been committed to these projects over a roughly nine-year period. This represents a very substantial investment in taxonomy and systematics. 2) To be fair, not all of this money has gone into informatics and technical development. A significant chunk, perhaps at least half, has been spent on overcoming the social challenges of setting up these informatics structures. Nevertheless a very significant amount of money has been spent on technical development. 3) Conceptually, a number of these projects have overlapping goals. Usually this is because they are supporting different funding agencies within Europe, but there is also considerable conceptual overlap with projects outside Europe, especially in the US. 4) And finally, these projects often employ very different development approaches. The technical implementation of some of these projects has been specified to a fine level of detail from the outset. These so-called top-down projects are often designed by committee and take a very long time to deliver their software products, but what is delivered is often of a very high standard and technically very robust. In contrast, some projects, which I’ve labelled as bottom-up, have developed much more organically. A basic solution to a technical problem is implemented very quickly, often with known flaws.
The user community then comments on and engages with the software, which is then iteratively fixed by developers. The result is that bottom-up development yields solutions very quickly and cheaply, but often with less quality and robustness than top-down development processes. Having been involved with a number of these projects for about the past 5 years, employing a mix of development approaches, it is fitting to review the status and impact of these projects and to draw some conclusions on the most appropriate development approaches for large-scale biodiversity informatics projects.
  • So to recap, my main goal in this presentation is to raise awareness of these initiatives, not least because some of them are relatively unknown this side of the Atlantic. I want to review the major informatics products to assess their impact, primarily through metrics of user engagement, because this is arguably the most important measure of success for a project. If people show sustained engagement with a project, that suggests it was probably worth doing, but if that engagement is minimal (and in the case of some projects absent) it was probably not worth the effort. Through this we can identify those aspects of these informatics projects that have succeeded and those that have failed. We can also identify some common themes and correlate these with the different styles of development that have been used in these projects. Finally, we can identify some of the outstanding challenges that need to be tackled. So what I’m going to do is briefly go through these projects.
  • And I’m going to start with something called EDIT, which is an €11.5 million, EU-funded, 5-year project designed to foster institutional integration amongst 27 of the major natural history museums and organisations based in Europe. The eventual goal was to work toward a European Museum of Natural History in which the partner institutions would band together (or federate) their facilities to create a more efficient research infrastructure. The theory was that this would operate at all levels throughout these organisations, from coordinating research and collections activities right the way through to the development of policies and management. Needless to say, much of this effort does not involve informatics, but roughly half of the budget was spent on informatics products: specifically the EDIT cyberplatform for taxonomy and the Scratchpad project, and I’ll just say a few words on each of these.
  • So the EDIT cyberplatform is a collection of integrated research tools that supports the workflow of generating and managing taxonomic (and to a limited extent systematic) data. Specifically, this suite of tools includes systems for managing taxonomic classifications, collection records, descriptions, literature, GIS data etc. At its heart is a database called the Common Data Model (or CDM), which acts as a repository for these data and is accessed through a Java library that allows developers to build interfaces to these data. The software comes in 3 types of installation package, depending on whether you plan to use the system as an individual, a local research group or a set of distributed researchers, and an increasing level of technical knowledge is required depending upon how you use it. The software was developed in a very top-down way, mainly by a core team of developers based in Germany. In consequence it has taken 4 years to deliver, at a cost of about €5.1M. Because of this there are only three user communities, and each of these is paid by the EDIT project to use the software. There is also very limited wider uptake, but again this is probably because it has taken so long to develop. These points have repeatedly been picked up in the review documents for the project. So much effort has been spent developing these products that it is not clear whether anyone will use them, and there is very little time within the remaining period of EDIT to demonstrate that people will.
  • A very different approach to development was taken by the Scratchpad project. This project was set up because many of the participants in EDIT could not afford to wait 4 or 5 years while a technical platform was developed to support their work. The Scratchpads are essentially websites for taxonomists, currently hosted by the Natural History Museum in London. These sites act as a research and publication platform for their users through a series of flexible software modules that support the taxonomic workflow. The entire project uses the Drupal Content Management System as a technical backend that supports the basic functions, and we have written modules that provide the specialist functions needed by the taxonomic research community. The entire project was built from the outset using agile design methodologies that allow us to rapidly and constantly iterate the software in the light of user feedback. The result is that we now have an ecosystem of more than 2,000 users working within 150 different user communities. The project has also been sufficiently successful that we have been awarded follow-on funding in the form of a project called ViBRANT, which I won’t go into detail on here.
  • We have quite a lot of information on the level of user engagement with the Scratchpads, which tells us something about the Scratchpads’ impact. The project has seen roughly linear growth in the number of sites, users and nodes (pages). Approximately 10% of users return every month, across about one third of all sites. We also use tricks to incentivize growth of the sites by showing users their statistics and those of comparable sites. From our own research we know of cases where users effectively compete to maximise their own usage stats. Another key feature is that (as much as is possible) we are very fast at responding to user feedback. We have also commissioned sociological studies of our users to understand what they are doing on their sites, because what they say they do does not necessarily correlate with what they actually do. Another point to note, because we get many comments about this, is that the sites themselves often don’t look pretty. This is because we have invested a lot more effort in functionality than in aesthetics.
  • CATE (Creating a Taxonomic e-Science) is a piece of software specifically supporting the construction of taxonomic revisions on the web. The system supports the construction of consensus-style taxonomic classifications, where community contributions are peer reviewed before being accepted as part of the main classification or stored as alternate classifications. The project was constructed in a top-down manner, working directly with two user communities that were explicitly funded by the project. In consequence it took most of the project to develop the software, and no other user communities have adopted the system at present. However, a follow-on project has been funded, focused on providing monographic-style treatments for monocot plants, and this project will combine the Scratchpads and their community editing functionality with a central database.
  • SYNTHESYS is a very different kind of project to EDIT, although financially it is larger. SYNTHESYS mainly provides small amounts of funding to cover the accommodation and travel expenses of researchers visiting the collections of participating research institutions. However, there is a significant informatics component to the project that has supported the development of various data standards and protocols for exchanging biodiversity data. SYNTHESYS has also funded the development of various website portals for accessing collections data: specifically the BioCASE portal for specimen data and observations, and the GeoCASE portal providing access to geoscience records. Like the Cyberplatform, these informatics aspects have been constructed through largely top-down processes. In consequence there are relatively low levels of user engagement. Precise statistics are hard to come by, but these data provide basic information on access to the data portals, and it is clear that over an entire year some of these portals have had as few as 259 visitors!
  • LifeWatch: €3M of planning! Top-down. A technical framework to support ecosystem service valuations, biological observations, phylogenetics, and species dynamics & distribution (more?). Estimated completion date is 2014, with an estimated cost of €370M!
  • So that was a very quick review of a few of the larger European biodiversity informatics initiatives currently ongoing. From these, and a few others I have not had time to mention, it is possible to spot a few commonalities and emerging trends. Generally speaking, top-down development processes are correlated with low usage rates. In many projects a user community is explicitly funded to use the system. This is good because it guarantees a user community, at least for the duration of the project, but risky because that community may have expectations that are not matched by the reality of the product. Top-down development is typically slow. Bottom-up development depends upon user feedback. Bottom-up development encourages sustained user engagement. This is not to say that all top-down processes are bad. A key problem for many of these infrastructures is sustaining them, or at least giving users the confidence that they can be sustained long enough for users to extract value from them. This is really only something that can be provided top-down by institutions, organisations and societies that have a vested interest in sustaining the system and perhaps making it part of their core business, and it is not something that universities are typically very good at. What is really crucial in the successful implementation of these projects is the bottom-up development agility required to produce a product quickly and then iterate that software rapidly in response to user feedback. Defining a set of basic generic building blocks for a system and then refining these with more and more specialist functions is much more effective than trying to plan the minute details of a project and then finding out that the assumptions on which the design was based were wrong, when the lack of development agility means they cannot easily be put right.
Another challenge for the future is that there is still a great deal of duplication of effort within the biodiversity informatics community. Many of the systems we are building have significantly overlapping features, even (in some cases) when funded by exactly the same project. This is even more evident when you look for duplication of effort with projects outside of Europe. This duplication can be positive because it breeds competition, which drives up standards. However, it is also damaging, because many of these systems will ultimately depend on network effects from a critical mass of users, and if this user base is diluted across multiple projects that conceptually overlap, this cannot be achieved. Central to this is having agreed metrics of impact for these systems so we can more effectively identify successful projects. Collecting these kinds of metrics from outside a project can be enormously difficult, and without any common standard on what we are collecting it is difficult to compare projects. One way of addressing these problems is through better international collaboration, particularly with respect to funding. At the moment, many of the problems we are trying to tackle are global, but the political constraints on how we fund this work mean that we have to emphasise national or regional interests, and in particular there is very little, if any, mechanism for jointly funding Anglo-American or Euro-American initiatives at present. This is something our funding agencies need to address if we are to develop these infrastructures more efficiently and allow users to benefit from network effects in the future. With that I’ll stop there. Thanks very much.
  • So within the taxonomy and systematics research community we have this enormously ambitious goal: to identify and inventory all the Earth’s species; to document their relationships (both their evolutionary and their ecological relationships); and finally to publish these data in such a way that others can readily use them. To date we have described about 1.8 million species. Unfortunately we have not been very consistent in the way we name these species, so there are actually at least 10 million names. This information is published in about 300 million pages of text, produced over the last 250 years. And it is estimated that there are between 1.5 and 3 billion specimens representing all these species in natural history museums and private collections around the world. In a sense this forms our data set for biodiversity science, and we keep adding to this data set at the rate of about 20-30k species every year. The number of people doing all this work is remarkably small. Our best estimates suggest that there are just 4-6,000 professional taxonomists worldwide. On top of this there are a number of unpaid specialists (amateurs), many of whom make very significant contributions to biodiversity science. Finally, there are many more so-called citizen scientists who have a passing interest in natural history and at some level have something to contribute. So collectively the scale of the task, the size of the data set and the small number of people doing this work make the challenge formidable. The rationale is that we can address these problems much more efficiently with technology. Specifically, informatics tools can be used to more efficiently document and describe the diversity of life on Earth.
  • The major research funders in the UK and Europe, just like in the US, picked up on this some time ago, often making very bold and ambitious claims about how technology can transform not only how we do our science but also the kinds of science we do. The aim was to make science more collaborative, supporting distributed and data-intensive research using high performance computing and visualisation techniques, and especially using the web as a kind of instrument of scientific research. As a result, as far back as 2001 several so-called e-Science initiatives were established, which effectively ring-fenced funding for using technology to develop infrastructures that support the major sciences. It took some time for the UK and European taxonomic and systematics research communities to respond to this, but eventually they did.

    1. 1. Top-down & bottom-up informatics: who has the high ground? Vincent S. Smith & Dave Roberts
    2. 2. Investment in Biodiversity Informatics
        Recent EU/UK initiatives:
        • EDIT* (€11.5M, ’06-’11)
        • CATE* (£487k, ’06-’09)
        • SYNTHESYS I+II (€13M+€7M, ’04-’13)
        • LifeWatch (planning)* (€3M, ’08-’11)
        • ViBRANT* (€4.75M, ’10-’13)
        • eMonocot* (£1.96M, ’10-’13)
        Key facts:
        • Almost €40M ($50M USD) committed over 9 years
        • Not all informatics! (address social & technical issues)
        • Some overlap (intra and inter EU)
        • Different development approaches (top-down & bottom-up)
        … time to review status & impacts
    3. 3. Goals (briefly…)
        • Raise awareness of these initiatives
        • Review their major informatics products
        • Review their impact (through metrics of user engagement)
        • Identify successes & failures, commonalities & trends
        • Correlate these with different development models
        • Identify outstanding challenges
    4. 4. European Distributed Institute of Taxonomy (EDIT)
        27 leading EU taxonomic institutions, 5 years, EU funded, €11.5 million
        Goals…
        • Foster institutional integration
        • Towards a European Museum of Natural History
        • Functionally federate existing facilities
        • Operate at all levels (research, collections, policy, management)
        Key products…
        • Cyberplatform
        • Scratchpads
        • Annual field course (training)
        • Integrated loans policy
        • Inventory task force
        • European e-Journal of Taxonomy
    5. 5. EDIT informatics products: the Cyberplatform (€5.1M)
        • Collection of integrated tools
        • Supports the taxonomic workflow (taxonomy, collection records, descriptions, literature, GIS etc)
        • Common Data Model (repository)
        • Accessed through a Java library
        • 3 types of installation package
        • Top-down design
        • 3 main user groups (paid)
        • Currently limited wider uptake
        “Focus on continuous refinement … without being backed by an appropriate marketing and outreach activity will prevent interested newcomers to get hands-on the EDIT toolkit.” EDIT Cyberplatform review, 2010
    6. 6. EDIT informatics products: Scratchpads (€300k)
        • Hosted websites for taxonomists
        • Research & publication platform
        • Modular (Drupal) & flexible
        • Supports the taxonomic workflow
        • Bottom-up design, agile dev.
        • Ecosystem of communities (150+)
        • 2,000+ users (unpaid) from 2007
        • Follow on project (ViBRANT, €4.75M)
    7. 7. EDIT informatics products: Scratchpads - user engagement statistics
        • Linear growth (sites, users, nodes)
        • 10% of users return every month
        • Monthly returns across 1/3rd of sites
        • Incentivize growth (competition)
        • Rapid response to user feedback
        • Sociological study
        • Sites don’t look pretty
        [Chart: sites, users and nodes, 2007-2009]
        As of 27 June 2010: Sites: 153; Users: 2,015; Nodes: 275,849
    8. 8. CATE: Creating a Taxonomic e-Science (€593k)
        • Software supporting web revisions
        • Consensus taxonomy
        • Peer review workflow
        • Top down design (EDIT CDM)
        • Two exemplar (paid) user groups: Aroid plants, Sphingid moths
        • No uptake by other communities
        • Follow on project (eMonocot €2.4M)
    9. 9. SYNTHESYS: Synthesis of Systematic Resources (€??M)
        • Paid support for access to collections
        • Mostly not informatics
        • Multiple phases
        • Top down design
        • Data standards & protocols: ABCD (collections & observations); contributions to TAPIR protocol; National Collections Descriptions
        • Portal interfaces to data collections: BioCASE (spec. & observations); GeoCASE (geoscience extension)
        Low levels of engagement with portals (Aug 2008 – July 2009)
    10. 10. LifeWatch - an EU ESFRI project (European Strategy on Research Infrastructures)
        • €3M of planning!
        • Top-down
        • A technical framework to support: ecosystem service valuations; biological observations; phylogenetics; species dynamics & distribution; more?
        • Estimated completion 2014
        • Estimated cost €370M!!!
    11. 11. Observations & Conclusions
        • Top-down development correlated with low usage (usually paid)
        • Top-down development is typically slow
        • Bottom-up development is dependent on user feedback
        • Bottom-up development encourages sustained user engagement
        Ongoing challenges:
        • Sustainability (institutional top-down support)
        • Development agility (bottom-up), not designed by committee
        • Duplication of effort (intra and inter EU)
        • Agreed metrics of impact (user engagement)
        • International collaboration (& funding)
    12. 12. Questions?
    13. 15. Background: the challenge
        Goal…
        • Inventory the Earth’s species
        • Document their relationships
        • “Publish” & apply these data
        Data set…
        • 1.8M described spp. (10M names)
        • 300M pages (over last 250 years)
        • 1.5-3B specimens
        People…
        • 4-6,000 taxonomists
        • 30-40,000 “pro-amateurs”
        • Many more citizen scientists?
    14. 16. e-Science applied to biodiversity research
        “the growing capacity of computing, storage, communication and software systems – offers the prospect of not only to automate science… but revolutionise how science is performed” 2001 UK Research Councils e-Science Initiative
        Solutions to make science…
        • More collaborative
        • Network-based & distributed
        • Data intensive with terascale computing
        • Support high performance visualisation
        Result…
        • Ring fenced (e-science) funding in UK and EU
        • European taxonomy / systematics community responded (slowly)