
Understanding Jupyter notebooks using bioinformatics examples


Understanding Jupyter notebooks using examples and tools from the CSIRO bioinformatics team – VariantSpark and GT-Scan2


Transcript

  1. The next terminal – Jupyter. With examples from bioinformatics. @lynnlangit
  2. “How often do you use the terminal?” @lynnlangit
  3. Terminal customizations: prompt, output aesthetics, code comments, graphics. @lynnlangit
  4. Terminal, improved
  5. Terminal, improved
  6. What does this code do? @lynnlangit
  7. “But it’s not good enough.” Why not? @lynnlangit
  8. Machine learning: too much data to process? Or too much code? Can you ‘see’ what is happening? @lynnlangit
  9. What does this code do? Which algorithm? @lynnlangit
  10. Visualizing data processing in ML code: which algorithm? @lynnlangit
  11. Now – more data, much more… IoT increases data volume and complexity exponentially. @lynnlangit
  12. “If you can SEE it (your data and code), you can work with it better.” Inspired by Mathematica – thanks, Stephen Wolfram. @lynnlangit
  13. Next terminal -> a better Python REPL • Fernando Pérez, 2001 • IPython (interactive Python) • modeled on Mathematica notebooks • IPython Notebook (IP[y]) -> in the browser • 2014: IPython Notebook -> Jupyter Notebook @lynnlangit
  14. Enter Jupyter Notebooks. @lynnlangit
  15. Jupyter Notebooks support the ML lifecycle (a minimal code sketch follows): 1. Collect data – retrieve files, query SQL databases, call web services, “scrape” web pages. 2. Prepare data – explore, validate, and clean data; engineer features. 3. Train model – prepare the training set, experiment, test the model, visualize. 4. Evaluate model – test performance, compare models, validate the model, visualize. 5. Deploy model – export the model file, prepare the job, deploy a container, re-package the model. Throughout: execute code blocks (Python, R, … code; SQL queries; shell commands), write documentation (Markdown), and visualize data (viz tools…).
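  A minimal sketch of those five steps as they might run in notebook cells. The file name, label column, and scikit-learn model are illustrative choices, not from the talk:

      import pandas as pd
      import joblib
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split

      # 1-2: collect and prepare data ('samples.csv' and 'label' are hypothetical)
      df = pd.read_csv('samples.csv').dropna()
      X, y = df.drop(columns=['label']), df['label']
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

      # 3: train a model
      model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

      # 4: evaluate it
      print('accuracy:', accuracy_score(y_test, model.predict(X_test)))

      # 5: deploy -- export the model file for re-packaging into a container
      joblib.dump(model, 'model.joblib')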
  16. Jupyter visualizations – so many possibilities.
  17. Notebook customizations: multiple runtimes (languages, shared output); code or equations (LaTeX math); comments (Markdown, wiki-like); graphics (visualizations, charting results). The result is live documentation and reproducible research; an example cell follows. @lynnlangit
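  For instance, a single Markdown cell can mix wiki-style prose with LaTeX math (the content below is illustrative):

      ## Out-of-bag error
      The forest's error rate is $\epsilon = \frac{FP + FN}{N}$,
      where $N$ is the number of samples.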
  18. Example: Jupyter locally. @lynnlangit
  19. Mathematica evolved… Jupyter Notebook: market leader; started for single use; academic community; GitHub integration; added JupyterHub for collaboration. Zeppelin Notebook: started for collaboration; enterprise security. Vendor notebooks: Databricks for Apache Spark; Jupyter-like, but a proprietary format. @lynnlangit
  20. Running notebooks: desktop – install and run (see the commands below); local server – can use JupyterHub for groups; cloud – a large number of options. @lynnlangit
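  A concrete sketch of the desktop option (assumes Python and pip are already installed; run these in a terminal):

      pip install notebook    # install Jupyter Notebook
      jupyter notebook        # start the local server, then browse to http://localhost:8888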
  21. Extending and refactoring open notebooks: • write functions in one notebook • link to it from another notebook (a sketch follows) • write extensions (nbextensions.com)
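  One way to link notebooks is IPython's %run magic, which can execute another notebook inside the current session; the notebook and helper names below are hypothetical:

      # in analysis.ipynb: pull in functions defined in a second notebook
      %run ./shared_functions.ipynb

      # anything defined there is now in scope, e.g. a hypothetical helper:
      variants = load_variants('chr22.vcf')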
  22. Raising the bar: personalized medicine via genomic analysis. @lynnlangit
  23. Reproducible research – experiments as code. @lynnlangit
  24. Genomic research tools – two examples (Bioinformatics | Denis C. Bauer | @allPowerde): GT-Scan2 – how can genome engineering be made more effective? VariantSpark – how can disease genes be found in population-size cohorts?
  25. Transformational bioinformatics (Denis C. Bauer | @allPowerde): machine learning on 1.7 trillion data points. https://www.projectmine.com/about/
  26. VariantSpark – parallelize random forest for scalability (Bioinformatics | Denis C. Bauer | @allPowerde): • Spark ML’s RF was designed for ‘big’ low-dimensional data • the full genome-wide profile does NOT fit into an executor’s memory. “Cursed” big data, e.g. genomics: a moderate number of samples with many features; the feature set is too large to be handled by a single executor.
  27. VariantSpark – parallelize RF to scale with features (Bioinformatics | Denis C. Bauer | @allPowerde): Yggdrasil – faster decision trees via column partitioning in Spark (Firas Abuzaid, Spark Summit 2016). Flip the matrix: partition by column; a toy illustration follows.
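  A back-of-the-envelope sketch of why flipping the matrix helps for wide data (the numbers are illustrative; this is not VariantSpark's actual implementation):

      # wide genomic data: few samples, very many features
      n_samples, n_features = 1_000, 10_000_000

      # Row partitioning: even one sample per executor still carries ALL features,
      # so the per-executor memory floor is a full genome-wide profile.
      min_values_row_partition = 1 * n_features   # 10,000,000 values

      # Column partitioning ("flip the matrix"): an executor holds every sample
      # but only its slice of features, so the floor shrinks with the slice.
      min_values_col_partition = n_samples * 1    # 1,000 values

      print(min_values_row_partition // min_values_col_partition)  # 10,000x smaller floor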
  28. Wide RF – scalable with both features and samples (Bioinformatics | Denis C. Bauer | @allPowerde).
  29. New – Python API for VariantSpark:

      from pyspark.sql import SparkSession
      from varspark import VariantsContext  # import path assumed; check the VariantSpark docs

      # set up context and input parameters ('sc' is the SparkContext the notebook provides)
      spark = SparkSession(sc)
      vc = VariantsContext(spark)
      label = vc.load_label('dius/data/chr22-labels.csv', 'col_name')
      features = vc.import_vcf('dius/data/chr22_1000.vcf')

      # instantiate analysis (parameters are type-checked)
      imp_analysis = features.importance_analysis(label)

      # get significant factors as both a tuple list and a dataframe
      imp_vars = imp_analysis.important_variables(20)
      most_imp_var = imp_vars[0][0]
      imp_df = imp_analysis.variable_importance()
      oob_error = imp_analysis.oob_error()

      # convert to work with common Python tools
      pandas_imp_df = imp_df.toPandas()
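  From there the results work with standard Python tooling; for instance (a hypothetical continuation that assumes the pandas dataframe carries 'variable' and 'importance' columns):

      # chart the top-ranked variants with pandas/matplotlib
      top = pandas_imp_df.sort_values('importance', ascending=False).head(20)
      top.plot.barh(x='variable', y='importance')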
  30. Demo: VariantSpark – Jupyter for genomics research. @lynnlangit
  31. Cloud-based Jupyter PaaS: • AWS SageMaker • Azure Notebooks • others… @lynnlangit
  32. Example: GT-Scan2 – Jupyter for genomics research. @lynnlangit
  33. Tools for Jupyter – Binder for GitHub: point it at your GitHub repo containing Jupyter notebooks and a requirements.txt; it builds a Docker image, and you can run your notebooks (a setup sketch follows). @lynnlangit
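  A minimal Binder setup sketch (package versions are illustrative): the repo needs only the notebooks plus a requirements.txt, and mybinder.org builds and launches it from a URL of the form https://mybinder.org/v2/gh/<user>/<repo>/<branch>:

      # requirements.txt -- one Python dependency per line
      pandas==1.5.3
      matplotlib==3.7.1
      scikit-learn==1.2.2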
  34. Example: Binder. @lynnlangit
  35. Future of Jupyter for research – academic institutions and research labs: UC Berkeley, UC Davis, UC San Diego; Cal Poly San Luis Obispo; Clemson University; CU Boulder; the Universities of Illinois, Minnesota, Missouri, Rochester, and Texas; MIT; Michigan State; Texas A&M. @lynnlangit
