Predicting Gene Loss in Plants: Lessons Learned From Laptop-Scale Data
1. Predicting Gene Loss in Plants: Lessons Learned from Laptop-Scale Data
@PhilippBayer
Forrest Fellow, Edwards group
School of Biological Sciences
University of Western Australia
2. Who am I?
• Originally from Germany. PhD in Applied Bioinformatics at UQ, worked on genotyping-by-sequencing methods, finished 2016.
• Now Forrest Fellow at UWA, Perth, in the Edwards group
3. My toolbox
• Originally did everything in Python – self-taught
• Jupyter notebooks on my laptop, scripts on our servers
• Scikit-learn, pandas, fastai/keras
• Nowadays lots of R – workflowr, RStudio, caret
• Whichever works. String fiddling in Python, then stats analysis/plotting in R.
4. ‘Science’ vs ‘craft’
• I think ML is much more a ‘craft’ than a ‘science’
• It’s very hard to predict whether thing A or thing B will be more accurate or perform better; in many cases methods will perform similarly
• At some point you develop a gut feel for what may and may not work -> craft!
5. The project
• Used sequencing data for ~300 lines of Brassica oleracea (cabbage), B. rapa, and B. napus (canola)
6. XGBoost model
• Can we find out which genomic elements predict gene variability? Lots of homeologous recombination, lots of transposon activity
• Built one feature table per species (B. napus / B. oleracea / B. rapa), one row per gene
• Each table includes the size of the chromosome, whether the gene is within 1/2/3 kb of various transposons, whether the gene is in a syntenic block, etc., to predict the column ‘is this gene variable?’
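A feature table like the one described above can be sketched with pandas; the column and gene names here are illustrative assumptions, not the ones actually used in the talk:

```python
import pandas as pd

# Hypothetical sketch of the per-gene feature table; column names and
# values are illustrative, not the actual ones from the talk.
features = pd.DataFrame({
    "gene_id":           ["BnaA01g000010", "BnaA01g000020"],
    "chrom_size_bp":     [23_267_856, 23_267_856],
    "tes_within_1kb":    [0, 2],    # transposons within 1 kb of the gene
    "tes_within_2kb":    [1, 3],
    "tes_within_3kb":    [2, 5],
    "in_syntenic_block": [1, 0],
    "is_variable":       [0, 1],    # target: lost in at least one line?
})

X = features.drop(columns=["gene_id", "is_variable"])
y = features["is_variable"]
print(X.shape)  # (2, 5)
```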
8. XGBoost model
• Used XGBoost, one of the current state-of-the-art machine learning approaches for not-so-big data and feature tables (~ tables of numbers)
• Goal of the model: is a given gene ‘core’ or ‘variable’ (lost in at least one plant)?
• Input data:
• 120,000 canola genes (rows)
• Transposons of different classes (columns)
• Position on chromosome (columns)
9. XGBoost
n_estimators is probably the most important parameter: the higher it is, the longer training takes, the more accuracy you get, and the more overfitting you get too! Everything downstream takes longer as well.
11. … but??
• Can we trust that? We should check the confusion matrix!
                 Predicted core   Predicted variable
Actual core              19914                  148
Actual variable           3310                  507
12. … but??
• The confusion matrix shows us that in this case, accuracy is misleading!
• XGBoost mostly predicts ‘core’ and calls it a day.
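The arithmetic behind that warning, using the counts from the matrix above:

```python
import numpy as np

# Confusion matrix from the previous slide:
#                predicted core   predicted variable
cm = np.array([[19914,   148],   # actual core
               [ 3310,   507]])  # actual variable

accuracy = cm.trace() / cm.sum()          # all correct / all predictions
variable_recall = cm[1, 1] / cm[1].sum()  # fraction of variable genes found
print(round(accuracy, 3), round(variable_recall, 3))  # 0.855 0.133
```

So ~85% accuracy, but only ~13% of the variable genes, the class the model was supposed to find, are detected.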
13. Imbalanced classes
• Most real-life datasets have heavily imbalanced classes
• Example: prediction of a specific cancer – >99% of people won’t develop that cancer, so a model just saying ‘no cancer’ will have >99% accuracy
• Class imbalance will make your models look like they perform well when in reality, they perform terribly
15. Imbalanced classes
• Most models have some kind of parameter for class imbalance; for XGBoost:
• (‘craft’ – in my experience, values other than the suggested default gave better performance)
17. Imbalanced classes
• So after implementing all this stuff, can I get a better class accuracy?
                 Predicted core   Predicted variable
Actual core              16471                 3591
Actual variable           1817                 2000
18. Base model
• Shouldn’t I make a base model first?
• I need to ‘beat’ something! I shouldn’t just use XGBoost because it’s the flashy thing to do!
19. The base model
• Of all of my genes, 84.02% are core – that’s what we have to beat!
• VERY different from the 50/50 you might have assumed for two classes
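That majority-class baseline can be made explicit with scikit-learn's DummyClassifier; a sketch on synthetic labels with roughly the talk's class split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with ~84% core (0) / ~16% variable (1), as in the talk.
rng = np.random.default_rng(0)
y = (rng.random(23_879) > 0.8402).astype(int)
X = np.zeros((y.size, 1))  # features don't matter for this baseline

# Always predict the majority class ('core').
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(round(baseline.score(X, y), 2))  # ~0.84, the accuracy any real model must beat
```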
20. Summary of this part
• Not shown: a whole bunch of experimenting with AUC, ROC, MCC, LightGBM, CatBoost, 10-fold cross-validation, imbalanced-learn, BayesSearchCV for parameter optimisation, fiddling with the probability cut-off, f1 scores (precision/recall)
• (This talk is 15 minutes long, not 15 hours)
• This is – maybe? – all I can get out of this dataset! At some point you have to walk away.
21. What has the model learned?
• That’s the actually interesting part!
• XGBoost has in-built methods for ‘gain’, ‘cover’, ‘weight’ (I always forget which does what) feature importance
• These treat rare or low-variance variables differently
22. Less confusing: Shapley values!
• In a (wrong) nutshell: make all possible combinations of features, then see how the model’s prediction changes based on what you left out
https://christophm.github.io/interpretable-ml-book/shapley.html#shapley
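That "all combinations" idea can be spelled out for a toy two-feature model; this is a hand-rolled illustration of the definition, not how the shap library computes it for trees:

```python
from itertools import combinations
from math import factorial

# Toy model to explain: a known linear function of two features.
def model(x1, x2):
    return 2 * x1 + 3 * x2

baseline = {"x1": 0.5, "x2": 0.5}   # e.g. the feature means
point = {"x1": 1.0, "x2": 0.0}      # the prediction we want to explain

features = list(point)
n = len(features)

def value(subset):
    # Model output when only `subset` features take their real values;
    # the rest are reset to the baseline.
    args = {f: (point[f] if f in subset else baseline[f]) for f in features}
    return model(**args)

def shapley(feature):
    # Average the feature's marginal contribution over all subsets,
    # weighted by how many orderings each subset corresponds to.
    others = [f for f in features if f != feature]
    total = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (value(set(S) | {feature}) - value(set(S)))
    return total

print({f: shapley(f) for f in features})
# For a linear model this recovers coefficient * (x - baseline):
# x1: 2 * (1.0 - 0.5) = 1.0, x2: 3 * (0.0 - 0.5) = -1.5
```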
23. Running SHAP in Python
• Easy to run, but takes a while:
• But it takes much longer than training! With XGBoost, higher model-complexity settings (n_estimators) mean waaaaay longer runtimes
• Comes with three kinds of plots: force plots, dependence plots, and summary plots
28. Shapley values
• Unlike the F-scores reported by XGBoost’s plot_importance, you can compare Shapley values between different models! plot_importance only tells you whether a feature is important; SHAP also tells you whether high or low values matter!
• As expected, in B. napus the further away from centromeres, the higher the Shapley values
30. My ‘sources’
• Some I got from books:
• Géron’s Hands-On Machine Learning (2nd ed) (Tim O’Reilly: ‘one of the best books O’Reilly has published in our entire history’)
• Müller and Guido’s Introduction to Machine Learning with Python
• And heaps of googling (towardsdatascience.com, various Kaggle notebooks)
33. Summary
• Beware class imbalance! Don’t trust any measurement blindly.
• ALWAYS check your predictions manually, either by looking at a confusion matrix or by digging into your raw predictions
• At some point you just have to stop improving your model. This is a craft, not a science – hard to predict when to move on. Better to add features than to fiddle with the model.
34. Summary
• SHAP is a fun way to learn more about what the model actually learned – but the explanation is only as good as your model. A garbage model will have garbage explanations.
• In my case: maybe Shapley can explain core genes, but not variable genes?
• When building your own models, don’t get discouraged by all the things that can go wrong! There is a huge community off- and online to help you!
35. Summary
• All code shown today comes from Jupyter notebooks, all hosted at https://github.com/AppliedBioinformatics/
36. Acknowledgements
Armin Scheben
Andy Yuan
Habib Rijzaani
Clémentine Mercé
Haifei (Ricky) Hu
Robyn Anderson
Cassie Fernandez
Monica Danilevicz
Jacob Marsh
Nicola & Andrew Forrest
Paul Johnson
Rochelle Gunn
Dave Edwards
Jacqueline Batley
Jason Williams
Nirav Merchant
Armand Gilles
Brent Verpaalen
Heaps more on Twitter, but Twitter’s Mentions doesn’t go past last October
Perth Machine Learning Group
Shujun Ou
Contact:
Philipp.bayer@uwa.edu.au
@philippbayer
This is a PCA by chromosome – as you can see, some chromosomes ‘diverge’ more than others, mostly caused by how long chromosomes are.
85%! That’s good, right?!?
But in reality, the model mostly predicts just ‘core’, so not much better!
Notice the ‘generally’ – in my experience, values other than the generally suggested one can give you higher accuracies!
The accuracy is worse now BUT I have more predicted variable genes! Yay!
As a more intuitive example, SHAP in a model of human mortality – sex is encoded as 0 male 1 female
B. napus – homeologous block! AAAND NO TRANSPOSONS
This is again a human example. Dependence plots let you zoom into one feature only, compared with another feature
B. oleracea C on top, B. oleracea C on bottom. In B. oleracea, genes close to centromeres are ‘protected’ from gene loss (low Shapley), but far away has no consequence. In napus, far away genes have high Shapley!