Perl cures coronary heart disease



Life sciences such as genetics and biology have traditionally benefited from Perl, with excellent projects such as BioPerl leading the way. Unfortunately, in medical research and epidemiology the picture is different: researchers are struggling with the ever-increasing size and complexity of datasets. This presentation briefly describes the situation I faced when I first joined a research team working on coronary heart disease, what I did to make things better, and how I achieved one small victory for Perl.

  • I come from both a sciency and a commercial environment, where large datasets were used by multiple stakeholders and sharing was a good thing.
  • Bioinformatics is essentially computer science plus molecular biology, dealing with DNA, RNA etc. Its practitioners are used to extremely large datasets: the raw human genome was 30 TB. There has been substantial innovation in both hardware and software, and there are established standards for storing, searching and visualizing information. The bioinformatics community is international and collaborative, and data is shared amongst peers. The BioPerl project is an excellent example of this: a big collection of Perl modules for doing many operations on bioinformatics data, an international collaboration with many people working on it, cross-platform, with a plethora of tools based on it. It has very good documentation and there's even an O'Reilly book on it.
  • Epidemiology on the other hand, and clinical epidemiology in particular, is all about collecting and analyzing clinical data on patients. Traditionally it is very expensive to follow up people with medical exams, questionnaires etc., and the typical study would have fewer than 5000 individuals. Paper is king and everything is based on it, with data gathering only slowly making the transition to electronic formats. Times have changed, however: there's a bunch of NHS IT projects going on to bring medical data together, electronic health records have come into play, and so on. There is more and more data available from multiple sources such as GP surgeries, hospitals, the Office for National Statistics and government data sources.
  • So what are people doing? When I first joined my new job I saw that people were not happy. The size of the data is ever increasing: I am dealing with a 6m-patient database with over 5bn rows. Of course it was delivered as text files. Of course I had to sign 40-page forms to obtain the data. Data is a well-kept secret and there is very little sharing going on. Researchers are struggling to actually manage the data rather than analyze it: data cleaning, formatting, specifications (or the lack of them). Statistical packages are used to manage the data, which in my head is not entirely appropriate. Only very recently did funding organizations start requiring research teams to actually hire somebody dedicated to managing and curating these datasets. Some common patterns emerged, which I examined in an academic fashion.
  • Fear leads to hate, hate leads to anger, anger is the path to suffering. And only one person is happy with all those.
  • So what did I try to do? I took a small step for man and created the Medical namespace after emailing the dev list. I started thinking of ways to create something like BioPerl but for medical-specific modules. There already are several modules on CPAN which are of interest: DICOM is an image format that is widely used, and UMLS is a structured ontology used in biomedical sciences. The main issue is to expose these, and others, to non-Perl people, aka normal people.
  • The NHS deals with 1m patients per 36 hours. The NHS number is essentially a ten-digit UID that everybody gets assigned, and it is based on the mod 11 algorithm. Of course, this is the NHS, so there are 21 different formats of old-school NHS numbers floating around. I looked on CPAN and could not find anything, but it's no problem: I just created Medical::NHSNumber, which was the first module for the Medical namespace.
  • The ICD10 coding system is basically one huge ontology for coding diseases, signs, symptoms, test results etc. Every time you visit the hospital you get a series of codes according to what the problem was; it's very, very widely used. What do most people do? They try to open it in Excel… And how do you get all the parents of a term if you want to? Weeeeelll, we use this search function and paste results into another spreadsheet and then we use Stata to check it… OK, OK, stop. Medical::ICD10 is a very simple module doing very simple things and saving people time, coupled with a very basic web interface.
  • Another thing I looked at was the standards used to describe the data, or perhaps more appropriately, the lack of standards to describe the data. Documentation is delivered as an Excel file or an email or a Word document, with cryptic variable names and all that fun. So I said: there is an established data documentation standard called the DDI, why don't we use it and make our lives easier? I created two modules and a bunch of scripts and turned a flat Excel file of little usability into something better, much better. Similarly, study registration, in the interest of transparency, is something we use all the time, so I created a module for people to use.
  • It turns out people do want to make their lives easier. We got ActiveState and we have an excellent resource for learning Perl. Finally, I introduced Perl to people in my team. Most of them got scared away but one of them was happy with Perl. She even considered Python. Still, in my book it's a win.
  • So, life after Perl: what did I do? I introduced a new namespace. I created several modules, internal and external. I created better data documentation using Perl and promoted standards. And I introduced Perl to normal people. Was all this technically complicated? Probably not, it was very straightforward in the majority of cases. Was it worth it? This is how my work was after Perl.
  • Please help out! Introduce Perl to your academic group. Contribute to the Medical namespace. Help design and implement MedPerl. Use more standards at work if you are not already using them. And finally, shamelessly: please join the UCL Perl users group if you are from UCL. Thanks!
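
The Modulus 11 check mentioned in the NHS number note can be sketched in a few lines. This is an illustrative Python sketch, not the code of Medical::NHSNumber: the function name `is_valid_nhs_number` is hypothetical, and it only covers the new-style ten-digit number, not the 21 old-school formats. The first nine digits are weighted 10 down to 2, and the tenth digit is the check digit.

```python
import re


def is_valid_nhs_number(candidate: str) -> bool:
    """Return True if `candidate` passes the Modulus 11 check.

    Hypothetical helper for illustration; handles only new-style
    ten-digit NHS numbers.
    """
    # Tolerate the common "943 476 5919" presentation with spaces or dashes.
    digits = re.sub(r"[\s-]", "", candidate)
    if not re.fullmatch(r"[0-9]{10}", digits):
        return False

    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    weights = range(10, 1, -1)
    total = sum(int(d) * w for d, w in zip(digits[:9], weights))

    # The check digit is 11 minus the remainder; 11 maps to 0,
    # and 10 means no valid check digit exists for this prefix.
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        return False
    return check == int(digits[9])
```

Calling `is_valid_nhs_number("943 476 5919")` exercises the happy path. Passing the checksum only shows that a number is well formed, not that it has been issued to anybody.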

    1. Perl cures coronary heart disease (well, sort of)
       Spiros Denaxas, @fruit90210
       London BioGeeks, 24th Feb. 2011
    2. Talk outline
        • Who am I?
        • Background of two different worlds
        • Life before Perl
        • What I did
        • Life after Perl
        • Please help out!
    3. Hi, I am Spiros.
        • Who are you and what do you want?
            • Computer Science
            • Bioinformatics
            • Nestoria
            • UCL Clinical Epidemiology
    4. Bioinformatics vs. epidemiology
        • Bioinformatics
            • Computer science + molecular biology
            • Extremely large datasets
            • Both hardware and software innovation
            • Established standards (storage, searching,…)
            • Data sharing and collaboration
        • BioPerl project
            • Collection of Perl modules
            • International collaboration
            • Cross-platform
            • Plethora of tools based on it
            • Good documentation
    5. Bioinformatics vs. epidemiology
        • (Clinical) Epidemiology
            • Collect and analyze clinical data on patients
            • Traditionally very expensive
            • Typical study: less than 5000 individuals.
            • Everything was/is based on paper.
        • Times changed:
            • Electronic Health Records (EHR)
            • NHS IT / Connecting For Health (CFH)
        • More data available from multiple sources:
            • GP, Hospitals, ONS, Government
    6. Epidemiology now
        • Ever increasing size of datasets (6M+)
        • Increasing complexity of structured ontologies
        • Data is a well kept secret
        • Researchers are struggling
            • Data quality, formatting, specifications (lack of)
        • Focus on analysis, not management.
            • Stata / SPSS
            • Text is king
        • Funding for data management (academia vs. “The Real World”)
        • Some common patterns emerged…
    7. Life before Perl
    8. At least he’s happy!
    9. What did I do
        • One small step for man
        • Created the “Medical” namespace
        • Started thinking of “MedPerl”
        • It's all about exposure to non-Perl people (aka normal people)
        • We already have some medical modules:
            • UMLS::Interface
            • Image::ExifTool::DICOM
            • UMLS::Similarity
        • And I took it from there on…
    10. NHS numbers
        • NHS deals with 1M patients / 36 hours
        • (new school) NHS number = 10 digit UID
        • Modulus 11 algorithm
        • Of course, 21 different formats of old school
        • Nothing on CPAN?
        • No problem, Medical::NHSNumber
            • is_valid()
    11. International Classification of Diseases (ICD10)
        • Created by the World Health Organization (WHO)
        • A coding of diseases and signs, symptoms, […]
        • Widely used
        • 15,000 terms, essentially an ontology
        • Excel ?%@?(%!?#?@$@!#%@
        • No problem, Medical::ICD10
            • get_term()
            • get_parent_term()
            • get_child_terms()
    12. Standards? What's that?
        • Lack of formalized standards
        • Data is delivered in CSV
        • Documentation in Excel, Word, emails, inline comments
        • Why don’t we use DDI v. 2?
        • MINAP::Describe and Data::DDI
            • Clean, recode, describe
            • Internal module
            • Twiki as output
        • Study registration
            • WebService::ClinicalTrialsdotGov
    14. Introduced Perl
        • People do want to make their lives easier
        • Excellent resource:
        • Introduced Perl to team members
            • Most of them scared away
            • One was happy (yay!)
            • (Also considered Python)
    15. Life after Perl
        • Introduced a new namespace
        • Created several modules (internal and external)
        • Created better data documentation using Perl
        • Promoted standards
        • Introduced Perl to “Normal People”
        • Was it complicated?
        • Was it worth it?
    16. Life after Perl
    17. Please help out!
        • Introduce Perl to your academic group
        • Contribute to the “Medical” namespace
        • Help design and implement “medperl”
        • Use standards
        • Join the UCL Perl Users Group
    18. Thank you.
        • Questions?
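
The ICD10 slide lists three lookups: get_term(), get_parent_term() and get_child_terms(). A minimal Python sketch of that kind of parent/child navigation might look like the following. It is not the Medical::ICD10 implementation: the in-memory table is a handful of entries typed in by hand for illustration (a real tool would load the full WHO release), and `get_ancestors` is a hypothetical extra answering the "all the parents of a term" question from the notes.

```python
from typing import List, Optional

# Tiny hand-typed excerpt for illustration only: code -> (description, parent).
ICD10_EXCERPT = {
    "I00-I99": ("Diseases of the circulatory system", None),
    "I20-I25": ("Ischaemic heart diseases", "I00-I99"),
    "I21": ("Acute myocardial infarction", "I20-I25"),
    "I21.0": ("Acute transmural myocardial infarction of anterior wall", "I21"),
    "I25": ("Chronic ischaemic heart disease", "I20-I25"),
}


def get_term(code: str) -> str:
    """Return the description for a code."""
    return ICD10_EXCERPT[code][0]


def get_parent_term(code: str) -> Optional[str]:
    """Return the parent code, or None at the top of the hierarchy."""
    return ICD10_EXCERPT[code][1]


def get_child_terms(code: str) -> List[str]:
    """Return the codes whose direct parent is `code`, sorted."""
    return sorted(c for c, (_, parent) in ICD10_EXCERPT.items() if parent == code)


def get_ancestors(code: str) -> List[str]:
    """Walk up the hierarchy and return every parent, nearest first."""
    ancestors = []
    parent = get_parent_term(code)
    while parent is not None:
        ancestors.append(parent)
        parent = get_parent_term(parent)
    return ancestors
```

Storing each code with a pointer to its parent keeps the upward walk trivial, which is exactly the step that is painful to do by searching and pasting in a spreadsheet.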