© 2018 KNIME AG. All Rights Reserved.
Interactive and reproducible data
analysis with the open-source KNIME
Analytics Platform
Greg Landrum, Ph.D.
KNIME AG
@dr_greg_landrum
ACS New Orleans
19 March 2018
Topics
• A brief intro to KNIME
– The company
– The software
• Context: some data analysis problems we’re trying
to help with using workflows
• A case study of reproducible interactive data
analysis in KNIME
KNIME, the company
• KNIME AG founded in 2008
• Offices in Zürich (HQ), Konstanz, Berlin, and Austin
• 40+ employees
• Maintainer of the Open Source KNIME Analytics Platform
– comprehensive data loading, processing, analysis, modeling platform
– visual frontend
– open: to all sorts of data, other tools (R and Python, etc.), various user
personas
– 20+ open source releases since 2006
– open source.
• KNIME Server
– 14 commercial product releases since 2008
• KNIME cloud offerings
The KNIME® Analytics Platform
Visual KNIME Workflows
• Nodes perform tasks on data
• Workflows combine nodes to model data flow
• Each node shows its input(s), outputs, and status
• Node status: Not Configured, Idle, Executed, Error
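To make the node/workflow idea concrete, here is a toy sketch in plain Python. It is not KNIME code: the `Node` class, the `run_workflow` helper, and the reading of "Idle" as "configured but not yet run" are all assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    NOT_CONFIGURED = "Not Configured"
    IDLE = "Idle"  # assumed meaning: configured, not yet executed
    EXECUTED = "Executed"
    ERROR = "Error"


class Node:
    """A toy node: wraps one function applied to the upstream output."""

    def __init__(self, name, task=None):
        self.name = name
        self.task = task
        self.status = Status.IDLE if task else Status.NOT_CONFIGURED
        self.output = None

    def configure(self, task):
        self.task = task
        self.status = Status.IDLE

    def execute(self, data):
        if self.status is Status.NOT_CONFIGURED:
            raise RuntimeError(f"{self.name} is not configured")
        try:
            self.output = self.task(data)
            self.status = Status.EXECUTED
        except Exception:
            self.output = None
            self.status = Status.ERROR
        return self.output


def run_workflow(nodes, data):
    """Run nodes in sequence; stop at the first node that fails."""
    for node in nodes:
        data = node.execute(data)
        if node.status is Status.ERROR:
            break
    return data
```

A linear chain is the simplest case; a real workflow is a graph with several inputs and outputs per node, which this sketch leaves out.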
Over 2000 native and embedded nodes included:
• Data Access: MySQL, Oracle, ...; SAS, SPSS, ...; Excel, Flat, ...; Hive, Impala, ...; XML, JSON, PMML; Text, Doc, Image, ...; Web Crawlers, Industry Specific, Community / 3rd party, ...
• Transformation: Row, Column, Matrix; Text, Image, Networks, Time Series; Java, Python; Community / 3rd party, ...
• Analysis & Mining: Statistics, Machine Learning, Data Mining, Web Analytics, Text Mining, Network Analysis, Social Media Analysis, R, Weka, Python, Community / 3rd party, ...
• Visualization: R, Python, JFreeChart, JavaScript, Community / 3rd party, ...
• Deployment: via BIRT; PMML, XML, JSON; Databases, Excel, Flat, etc.; Text, Doc, Image; Industry Specific; Community / 3rd party, ...
• Big Data: Hive, Impala, HDFS, Vertica, Teradata/Aster, Spark, MLlib, Community / 3rd party, ...
Free E-Learning Course: Web Page
• Hands-on e-learning course
• Data Access, ETL, Analytics, Control
Structures, Visualization
• Around 50 small units
• … with exercises
• … and with solutions on the
EXAMPLES server
• Final exercises to test your
knowledge!
https://www.knime.org/knime-introductory-course
The KNIME Software Ecosystem
Deployment:
- to Applications
- to Humans
Collaboration:
- Best Practices
- Sharing Expertise
Automation:
- Scheduling
- (Model) Management
Components:
- KNIME Analytics Platform
- KNIME Extensions
- KNIME Supported Extensions
- Partner Extensions
- Community Extensions
- KNIME Server
KNIME Server
• Shared Repositories
• Access Management
• Web Enablement
• Flexible Execution
Some data analysis problems
• Interactive data analysis and modeling
• Repeatability and reproducibility
• The need to use multiple tools and multiple data
sources
• Collaboration between users with different
sophistication levels
• Deployment
• Just staying organized
I think workflows can help
with these
I think KNIME can help with
these
Interactive data analysis and modeling
• Fairly often the whole process of data
preprocessing, analysis, and modeling can’t be (or
shouldn’t be) fully automated.
• We want/need a human in the loop
• Would be lovely if this weren’t painful
Interactive
Repeatability and reproducibility
• I can reproduce what I did before or repeat the
same process with a different data set/method
• You can do the same thing
• Not necessarily talking about strict reproducibility
(out to the 15th decimal place), but if we miss that
we should be able to discover where deviations come
from
• Would be lovely if this weren’t painful
Reproducible
The need to use multiple tools and multiple data sources
• There is no one-size-fits-all solution (or “one-stop
shop”)
• We’re inevitably going to be using more than one
piece of software and working with data from more
than one source.
• Would be lovely if this weren’t painful
Open
Collaboration between users with different sophistication levels
• Some personae [1]:
– The scripter/programmer: “I’ve got this great new
method you should try”
– The tool user: “I’ll use software, but there’s no way I’m
writing code”
– The “stakeholder”: “Those folks are doing useful stuff and
I need their results, but I don’t have time to learn some
complex new piece of software.”
• Would be lovely if enabling collaboration between
these different personae weren’t painful
1 Yes, these are stereotypes
Collaborative
Deployment
• Once I’ve built something I’d like to make it available
to my colleagues
– Sharing models
– Sharing methods
– Sharing results
• Would be lovely if this weren’t painful
Deployable
Just staying organized
• I can usually remember where my scripts are
• There’s no way I can remember where yours are
• It would be lovely if it weren’t painful to find stuff
Findable
The case study: HTS hit list triage
Background
• The problem: Processing a hit list from a high-throughput
phenotypic screen for malaria.
– Clean up the hit list
– Suggest compounds to be sent to a validation assay
• Data source: 2014 Teach-Discover-Treat challenge
http://www.tdtproject.org/challenge-1---malaria-hts.html
• Additional info:
– https://github.com/sriniker/TDT-tutorial-2014
– Riniker et al. https://f1000research.com/articles/6-1136/v2
Approach we’ll take: cleanup
• Remove “ugly” molecules:
– PAINS filters [1,2]: containing substructures that are likely to
interfere with/have interfered with the assay.
– “Rapid elimination of swill” (REOS) [3]: Too big, complicated
or greasy.
• Don’t want to apply these filters mindlessly, so we
should always look at the results and allow manual
rescue
1. Baell, J. B. & Holloway, G. A. J. Med. Chem. 53, 2719–40 (2010).
2. http://rdkit.blogspot.ch/2015/08/curating-pains-filters.html
3. Walters, W. P. & Namchuk, M. Nat. Rev. Drug Discov. 2, 259–66 (2003).
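The filter-then-rescue idea can be sketched in a few lines of plain Python. This is an illustration only, not the workflow's implementation: the `Hit` record, the function names, and the property windows are assumptions (the windows are placeholders, not the published REOS values), and the PAINS matches are assumed to have been computed upstream by a cheminformatics toolkit.

```python
from dataclasses import dataclass, field

# Placeholder property windows for illustration; NOT the published REOS values.
PROPERTY_LIMITS = {
    "mol_weight": (200.0, 500.0),
    "logp": (-5.0, 5.0),
    "rotatable_bonds": (0, 8),
}


@dataclass
class Hit:
    compound_id: str
    properties: dict
    pains_alerts: list = field(default_factory=list)  # names of matched alerts


def flag_reasons(hit, limits=PROPERTY_LIMITS):
    """Return the reasons a hit would be filtered out (empty list = clean)."""
    reasons = [f"PAINS: {name}" for name in hit.pains_alerts]
    for prop, (low, high) in limits.items():
        value = hit.properties[prop]
        if not low <= value <= high:
            reasons.append(f"{prop}={value} outside [{low}, {high}]")
    return reasons


def triage(hits, rescued_ids=()):
    """Split hits into kept and removed; rescued IDs are kept despite flags."""
    kept, removed = [], []
    for hit in hits:
        reasons = flag_reasons(hit)
        if reasons and hit.compound_id not in rescued_ids:
            removed.append((hit, reasons))
        else:
            kept.append(hit)
    return kept, removed
```

Returning the reasons alongside each removed compound is what makes the manual step practical: a person reviews the flagged list, then passes the IDs worth keeping back in as `rescued_ids`.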
Approach we’ll take: selection for validation
• We want good coverage of the chemical space of
the HTS actives, but would ideally also like to learn
something from the validation results
• Approach:
– Start with a diverse subset of the cleaned actives
– Pick neighbors of each of these so that we have some SAR
information in the results
https://github.com/sriniker/TDT-tutorial-2014
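One way to sketch "diverse subset plus neighbors" in plain Python is shown here. The slides don't say which picking method the workflow uses, so the greedy MaxMin-style picker is an assumption, as are the function names and the representation of each fingerprint as a set of feature IDs.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_diverse(fps, n_picks, first=0):
    """Greedy MaxMin: repeatedly add the item farthest from all current picks."""
    picks = [first]
    while len(picks) < min(n_picks, len(fps)):
        candidates = [i for i in range(len(fps)) if i not in picks]
        # Distance of a candidate to the pick set = distance to its closest pick.
        best = max(
            candidates,
            key=lambda i: min(1.0 - tanimoto(fps[i], fps[p]) for p in picks),
        )
        picks.append(best)
    return picks


def pick_neighbors(fps, seed, n_neighbors, exclude=()):
    """Return the indices most similar to fps[seed], skipping excluded ones."""
    others = [i for i in range(len(fps)) if i != seed and i not in exclude]
    others.sort(key=lambda i: tanimoto(fps[seed], fps[i]), reverse=True)
    return others[:n_neighbors]


def select_for_validation(fps, n_seeds, n_neighbors):
    """Map each diverse seed to its nearest not-yet-selected neighbors."""
    seeds = pick_diverse(fps, n_seeds)
    used = set(seeds)
    selection = {}
    for seed in seeds:
        neighbors = pick_neighbors(fps, seed, n_neighbors, exclude=used)
        used.update(neighbors)
        selection[seed] = neighbors
    return selection
```

The seeds give coverage of chemical space; the neighbors around each seed are what provide the SAR information once validation results come back.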
The workflows
• Download (with data) from
EXAMPLES folder in KNIME itself
Deploying it
• Both workflows are built from a series of wrapped
metanodes. Each of these becomes a separate page
in a Web Portal app when the workflow is copied to
the KNIME Server.
Deployable
Collaborative
Cleanup workflow (part 1)
Interactive
Reproducible
Cleanup workflow (part 2)
The output (in Excel)
Selection workflow
Interactive
The output (in Excel)
Reviewing…
• Discussed today:
– Interactive data analysis and modeling
– Repeatability and reproducibility
– Collaboration between users with different sophistication
levels
– Deployment
• For the 40-minute version of this presentation:
– Just staying organized
– The need to use multiple tools and multiple data sources
US Roadshow
KNIME is hiring!
• Software developers (Java and/or JavaScript)
• Data scientists (Application Scientists)
• Director of product marketing
Positions open in Austin, Berlin, Konstanz, and Zürich
More info: https://www.knime.com/careers
Backups
Validating Reproducibility
• Built-in support for validating that results generated
from one run to the next are the same
• Can be automated across multiple workflows or
groups of workflows
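The idea of checking one run against the next can be sketched outside KNIME as well. The snippet below is not KNIME's built-in mechanism; `compare_runs`, the nested-dict table layout (row ID, then column name), and the default tolerance are assumptions chosen for illustration.

```python
import math


def compare_runs(reference, current, rel_tol=1e-9, abs_tol=0.0):
    """Compare two runs' results, keyed by row ID and then column name.

    Returns a list of human-readable differences; an empty list means the
    runs agree within the given tolerance.
    """
    differences = []
    for row_id in sorted(set(reference) | set(current)):
        if row_id not in current:
            differences.append(f"{row_id}: missing from current run")
            continue
        if row_id not in reference:
            differences.append(f"{row_id}: not in reference run")
            continue
        ref_row, cur_row = reference[row_id], current[row_id]
        for column in sorted(set(ref_row) | set(cur_row)):
            ref_val, cur_val = ref_row.get(column), cur_row.get(column)
            if isinstance(ref_val, float) and isinstance(cur_val, float):
                # Floats only need to agree within tolerance, not bit for bit.
                same = math.isclose(ref_val, cur_val, rel_tol=rel_tol, abs_tol=abs_tol)
            else:
                same = ref_val == cur_val
            if not same:
                differences.append(f"{row_id}.{column}: {ref_val!r} != {cur_val!r}")
    return differences
```

Reporting each mismatch by row and column, rather than a single pass/fail flag, makes it easier to find where a deviation comes from when strict reproducibility is missed.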
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by
KNIME.com AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.
