Data Science Provenance: From Drug Discovery to Fake Fans


Knowledge work adds value to raw data; how that work is performed determines how reliably its results can be reproduced and scrutinized. With a brief diversion into epistemology, the presentation outlines the challenges facing practitioners and consumers of Big Data analysis, and demonstrates how these were tackled at Inforsense (a life sciences workflow analytics platform) and Musicmetric (social media analytics for music).

The talk covers the following issues with concrete examples:
- Representations of provenance
- Considerations to allow analysis computation to be recreated
- Reliable collection of noisy data from the internet
- Archiving of data and accommodating retrospective changes
- Using linked data to direct Big Data analytics



  1. Data Science Provenance: From Drug Discovery to Fake Fans
     Dr Jameel Syed, @tilapia
  2. Overview
     - Knowledge work adds value to raw data
     - How it is performed determines whether results can be reliably reproduced and scrutinized
     - Solving parts of the problem: Inforsense (life sciences workflow analytics platform) and Musicmetric (social media analytics for music)
     - What provenance is and why it is important
     - Representations of provenance
     - Considerations to allow analysis computation to be recreated
     - Reliable collection of noisy data from the Internet
     - Archiving of data and accommodating retrospective changes
     - Using linked data to direct Big Data analytics
  3. What is Data (Science) Provenance?
     - Scientific research is generally held to be of good provenance when it is documented in detail sufficient to allow reproducibility. Scientific workflows assist scientists and programmers with tracking their data through all transformations, analyses, and interpretations. Data sets are reliable when the processes used to create them are reproducible and analyzable for defects. Current initiatives to effectively manage, share, and reuse ecological data are indicative of the increasing importance of data provenance.
     - Reproducibility of data and research process
       - Explanation: why were the end conclusions reached?
       - Debugging and verification
       - Sharing, auditing
       - Re-application
  4. The Economist, October 19th 2013
     - Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers.
     - Ideally, research protocols should be registered in advance and monitored in virtual notebooks. This would curb the temptation to fiddle with the experiment’s design midstream so as to make the results look more substantial than they are. ... Where possible, trial data also should be open for other researchers to inspect and test.
     - Nature, Vol 500, 1st August 2013
  5. Reinhart and Rogoff's spreadsheet error
     - "Growth in a Time of Debt": a paper shaping decisions affecting national economies
     - BBC, 20 April 2013:
       - After some correspondence, Reinhart and Rogoff provided Thomas with the actual working spreadsheet they'd used to obtain their results. "Everyone says seeing is believing, but I almost didn't believe my eyes," he says.
       - The Harvard professors had accidentally only included 15 of the 20 countries under analysis in their key calculation (of average GDP growth in countries with high public debt). Australia, Austria, Belgium, Canada and Denmark were missing.
       - Businessweek FAQ
     - "Spreadsheets: The Ununderstood Dark Matter Of IT"
     - The Y2K bug was not just COBOL!
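The mechanics of the error are easy to reproduce: an aggregation whose input range silently stops short returns a plausible-looking but wrong answer. A minimal sketch with made-up growth figures (the numbers below are illustrative, not the paper's data):

```python
# Hypothetical GDP growth figures for 20 countries (illustrative only).
growth = {
    "Australia": 3.1, "Austria": 2.0, "Belgium": 2.6, "Canada": 2.2,
    "Denmark": 2.1, "Finland": 2.6, "France": 1.9, "Germany": 1.8,
    "Greece": 2.9, "Ireland": 4.5, "Italy": 1.5, "Japan": 2.4,
    "Netherlands": 2.3, "NewZealand": 2.5, "Norway": 3.0,
    "Portugal": 2.7, "Spain": 2.8, "Sweden": 2.2, "UK": 2.0, "US": 2.9,
}
countries = sorted(growth)   # all 20 countries
truncated = countries[:15]   # a spreadsheet range that stops 5 rows short

full_avg = sum(growth[c] for c in countries) / len(countries)
trunc_avg = sum(growth[c] for c in truncated) / len(truncated)
# Both numbers look reasonable; only comparing them reveals the mistake.
```

Nothing in the truncated average signals that rows are missing, which is exactly why provenance of the calculation, not just the result, matters.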
  6. Open Data Science
     - Open Source Software is the foundation
     - Open Access to data and methodology: errors happen, but are they found?
     - Many efforts...
       - Open Access publication (PubMed, ...)
       - Mozilla Science Lab (@MozillaScience)
       - Open Knowledge Foundation
       - Open Data Institute
     - Licensing
       - Panton Principles
       - Creative Commons licensed data
       - Non-commercial API access
  7. Inforsense
     - Workflow analytics platform for Life Sciences
       - "in silico" research / e-Science
       - Process representation and re-use
       - Which data sets were used, where are they from, how were they computed?
     - Spin-out from research at Imperial College London (Discovery Net e-Science project)
     - Used by pharmaceutical and biotech companies
  8. "Big Data"
     - Gene chips (DNA microarrays): rather than a PhD on a few genes, tens of thousands at a time (and the culmination of the Human Genome Project)
     - High-throughput screening (HTS): drug discovery; thousands of automated experiments per day
     - What to do with the data?
       - Paper published
       - Data set sometimes published
       - Reproduce and extend the methodology manually
  9. Representations
     - How to represent or codify ideas? (beyond writing a traditional paper)
     - Statistician, Business Intelligence Analyst, Data Scientist, Software Engineer
       - Some coding? How much?
     - Scientists have been using Fortran for decades (and S+, R, Matlab...)
     - GRAIL (1969, RAND Corporation): flow charts and light pens
     - Bioinformaticians (back in the day): Perl hackers, open source, sharing data
  10. Declarative Workflows
      - Academic paper and data set → encoded as workflow → computed results
      - What should the set of operations be?
        - Deterministic, no side effects
        - Common functions between workflows
      - Functional composition
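A workflow built from deterministic, side-effect-free operations can be sketched as plain functional composition; the step names below are illustrative, not actual Inforsense operators:

```python
from functools import reduce

def compose(*steps):
    """Left-to-right composition: compose(f, g)(x) == g(f(x))."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

def drop_missing(rows):
    """Remove records containing missing values (a pure function)."""
    return [r for r in rows if None not in r]

def normalise(rows):
    """Scale each record by its own maximum value (also pure)."""
    return [tuple(x / max(r) for x in r) for r in rows]

# The workflow is itself a value: it can be stored, shared, and re-run
# on new data, which is exactly what makes the result reproducible.
workflow = compose(drop_missing, normalise)
result = workflow([(1, 2, 4), (None, 3, 5), (2, 2, 2)])
```

Because every step is pure, re-running the workflow on the same inputs always reproduces the same outputs, and common steps can be shared between workflows.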
  11. Functional Programming
      - "Functional programming combines the flexibility and power of abstract mathematics with the intuitive clarity of abstract mathematics."
  12. Declarative vs Imperative
      - Maths proof scrutiny: axioms and deductive steps; describe assumptions
      - Functional composition
        - No side effects!
        - The code documents itself!
      - Combination: no silver bullet (in-memory speed vs out-of-core scale)
        - "e-Lab notebook"
        - Inline visualisations (see also Mathematica)
        - Hadoop does the heavy lifting (ETL): Pig, Hive, Cascading (Scalding, Cascalog), Crunch/Scrunch, Java MR
  13. Live vs Static
      - A static representation of knowledge does not allow for discourse with the data and process
        - In Phaedrus, Socrates says: "Writing shares a strange feature with painting. The offspring of painting stand there as if they were alive, but if anyone asks them anything they are solemnly silent"... "alone, it cannot defend itself or come to its own support"
      - Writing programs or solving problems?
      - Encapsulate and generalize a specific instance of a process
        - To run again
        - To run on similar data (making a tool to solve problems)
      - Russell Jurney: Agile Data Science book
  14. Metadata of datasets
      - What is this?
        5.1,3.5,1.4,0.2,setosa
        4.9,3.0,1.4,0.2,setosa
        4.7,3.2,1.3,0.2,setosa
        4.6,3.1,1.5,0.2,setosa
        5.0,3.6,1.4,0.2,setosa
        5.4,3.9,1.7,0.4,setosa
        4.6,3.4,1.4,0.3,setosa
        5.0,3.4,1.5,0.2,setosa
        4.4,2.9,1.4,0.2,setosa
  15. Modified version of:

      1. Title: Iris Plants Database
         Updated Sept 21 by C.Blake - Added discrepency information
      2. Sources:
         (a) Creator: R.A. Fisher
         (b) Donor: Michael Marshall (
         (c) Date: July, 1988
      3. Past Usage:
         - Publications: too many to mention!!! Here are a few.
           1. Fisher,R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
           2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
      ...
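The answer to "what is this?" lives in the metadata: pairing the bare rows with the attribute names documented alongside the dataset makes each record self-describing. A minimal sketch (column names follow the UCI iris description):

```python
import csv
import io

# Attribute names from the dataset's accompanying documentation;
# without them the rows below are just anonymous numbers.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

raw = "5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n"
records = [dict(zip(COLUMNS, row)) for row in csv.reader(io.StringIO(raw))]
# Each record now carries its field names with it.
```

Keeping the metadata attached to the data, rather than in someone's head, is what stops a dataset becoming an orphan.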
  16. Process Methodology
      - Mathematical method / Scientific method
        - Understanding / Characterize from experience and observation
        - Analysis / Hypothesis: a proposed explanation
        - Synthesis / Deduction: prediction from the hypothesis
        - Review/Extend / Test and experiment
      - CRISP-DM
      - OSEMN ('awesome'), Hilary Mason: Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data
  17. Musicmetric (Semetric Ltd)
      - Analytics for musical artists (and beyond)
        - Collecting data from the Internet/APIs
        - Provenance of data
        - Linked entities
      - Hadoop-based Big Data processing → NoSQL → RESTful API
        - Nathan Marz / "Lambda Architecture"
      - Used by record labels, artist managers, brand owners, festivals, publishers, broadcasters
  18. Lots of data about lots of entities
  19. I read it on the Internet, it must be true?
      - Collection and archiving of web data is not straightforward
      - Dealing with noisy or incorrect data
        - Issues with data from APIs
        - Filter between raw data and data used in analysis (preprocessing/data cleaning)
        - Data and metadata retrospectively changing
        - Present processed data, with access to raw data
      - Sample rate frequency
        - Collect hourly, present daily
        - Interpolation to accommodate irregularities in update frequency
      - Anomalies...
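The "collect hourly, present daily" point can be sketched with simple linear interpolation over an irregularly sampled series (an illustration only; the actual Musicmetric preprocessing is not described here):

```python
def interpolate(series, at):
    """Linearly interpolate a sorted list of (timestamp, value) pairs
    at time `at`; timestamps here are hours since collection began."""
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        if t0 <= at <= t1:
            frac = (at - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    raise ValueError("time outside observed range")

# An update arrived at hour 0 and hour 48, but the hour-24 poll failed;
# the daily presentation still needs a value for day one.
observed = [(0, 1000.0), (48, 1100.0)]
day_one = interpolate(observed, 24)
```

Presenting the interpolated value while archiving the raw observations keeps the published series regular without discarding the evidence of how it was derived.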
  20. Fake fans
      - "Fake Fans" or "Fake Followers"
        - Social media activity caused by artificially created and controlled social media user profiles → fraud
        - "Buying fans" to get noticed
      - Fan count goes up
        - Collect more data, detect and remove anomalous data
        - "Daily diff" time series: how many fans did I gain today (compared to yesterday)?
      - Fan count goes down
        - Twitter et al. try to fix the problem → massive removal of fans → this is also a problem!
      - Data Science for pre-processing
        - Predict what is normal using all historical data (for artist, for data source)
        - Death event detector :-/
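The "daily diff" view and a crude anomaly flag can be sketched as follows; the counts and the median/MAD threshold are illustrative, not the production detector, which models per-artist, per-source baselines from all historical data:

```python
import statistics

# Fan counts sampled once per day; day 4 shows a suspicious jump.
counts = [10000, 10040, 10075, 10110, 18110, 18150]

# "Daily diff": how many fans were gained each day?
diffs = [b - a for a, b in zip(counts, counts[1:])]

# Robust thresholding: flag days whose gain sits far outside the
# historical spread (median/MAD resists the outlier it is hunting).
med = statistics.median(diffs)
mad = statistics.median([abs(d - med) for d in diffs])
anomalies = [i for i, d in enumerate(diffs) if abs(d - med) > 10 * mad]
```

A median-based spread is deliberate here: a single bought-fans spike inflates the mean and standard deviation enough to hide itself, while the median and MAD are barely moved by it.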
  21. Versioning (raw) Data
      - Git
      - Dat: version control for data (a git alternative); @maxogden
      - Figshare
      - S3 (e.g. DataSift Twitter firehose)
      - Dropbox? ("consumerisation" of enterprise tools)
      - TSV not CSV (consider bz2 rather than gzip; don't forget to md5sum)
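The md5sum advice amounts to recording a checksum next to each archived snapshot, so later consumers can verify the raw data has not silently changed. A minimal sketch:

```python
import hashlib

def checksum(payload: bytes) -> str:
    """MD5 digest of an archived snapshot, for integrity checking."""
    return hashlib.md5(payload).hexdigest()

# A tiny TSV snapshot, as the slide suggests (TSV not CSV).
snapshot = b"artist_id\tfans\n42\t10000\n"
digest = checksum(snapshot)
# Store `digest` in the archive manifest alongside the snapshot file;
# re-hashing on read detects retrospective changes to the raw data.
```

MD5 is fine for detecting accidental corruption or silent edits; for adversarial tampering a stronger hash such as SHA-256 would be the usual choice.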
  22. Tate collection on GitHub
  23. Using linked data to direct Big Data analytics
      - Linked data platform
        - Profiles for The Beatles
        - Puff Daddy / P Diddy, Prince / TAFKAP
        - Macklemore & Ryan Lewis, Simon & Garfunkel
        - Canonicalise URLs
        - Temporal logic? IDs change; not good, but it happens (MusicBrainz NGS)
        - RESTful API / UUIDs / external IDs
      - Manual curation separated from data processing
        - Resist all temptation for any manual manipulation of data!
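Canonicalisation can be sketched as a mapping from external aliases to one internal ID, so analytics aggregate over the entity rather than its names; the IDs and aliases below are made up, not Musicmetric's:

```python
# One internal ID per entity, however many external names it has.
# Puff Daddy and P Diddy are the same artist; so are Prince and TAFKAP.
CANONICAL = {
    "puff-daddy": "b2c3d4",
    "p-diddy":    "b2c3d4",
    "prince":     "e5f6a7",
    "tafkap":     "e5f6a7",
}

def resolve(alias: str) -> str:
    """Map an external alias to its canonical internal ID."""
    return CANONICAL[alias.strip().lower()]
```

Keeping this table as curated data, separate from the processing code, is what lets the curation change without touching (or manually manipulating) the analytics pipeline.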
  24. Future
      - Data to knowledge
        - Value chain of data
        - Provenance is key to this
      - Epistemology / Justified True Belief
        - Semantic Web
        - Big Metadata: the Internet of Things (the archetypes, not the physical objects)
  25. Summary
      - How you made your discovery is as important as the discovery
        - Reproduce, debug, verify, share, re-apply
      - Open Data Science
      - How to represent (declarative vs imperative, maths vs software engineering)
        - Separate Data Science from Software Engineering in a well-defined way
        - Don't orphan data from how it was computed
      - Don't rely on your input data/metadata (a) never changing, (b) always being available
        - Version control and share your (meta)data
      - Resist all temptation for any manual manipulation of data!
      - Consider the entities you are analysing
  26. Thank You!
      - Any questions?
      - Jameel Syed, @tilapia