Curators are necessarily detail oriented -- a trait born of, and reinforced by, our efforts to describe biological data accurately and precisely. To ensure comprehensive coverage and meaningful integration of new and existing knowledge, however, it is important to periodically step back from this fine-grained view and assess emergent features in accumulated curation. I will explore how PomBase has used the global "big picture" view of curated data to provide biological summaries, modularise content, and improve data display and access for our users. The global perspective can also be used to detect annotation errors and identify knowledge gaps, thereby improving overall annotation quality. I will also describe the progress we have made in engaging fission yeast researchers in community curation. Finally, I will show that the global curation perspective and community engagement share a common theme: both improve overall understanding, accessibility and reuse of accumulated knowledge by our user community.
Data analysis & integration challenges in genomics (mikaelhuss)
Presentation given at the Genomics Today and Tomorrow event in Uppsala, Sweden, 19 March 2015. (http://connectuppsala.se/events/genomics-today-and-tomorrow/) Topics include APIs, "querying by data set", machine learning.
The Seven Deadly Sins of Bioinformatics (Duncan Hull)
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
Maryann Martone
Making Sense of Biological Systems: Using Knowledge Mining to Improve and Validate Models of Living Systems; NIH COBRE Center for the Analysis of Cellular Mechanisms and Systems Biology, Montana State University, Bozeman, MT
August 24, 2012
Semantic Natural Language Understanding with Spark, UIMA & Machine Learned On... (David Talby)
A text mining system must go way beyond indexing and search to appear truly intelligent. First, it should understand language beyond keyword matching (for example, distinguishing between “Jane has the flu,” “Jane may have the flu,” “Jane is concerned about the flu," “Jane’s sister has the flu, but she doesn’t,” or “Jane had the flu when she was 9” is of critical importance). This is a natural language processing problem. Second, it should “read between the lines” and make likely inferences even if they’re not explicitly written (for example, if Jane has had a fever, a headache, fatigue, and a runny nose for three days, not as part of an ongoing condition, then she likely has the flu). This is a semi-supervised machine learning problem. And third, it should automatically learn the right contextual inferences to make (for example, learning on its own that fatigue is (sometimes) a flu symptom—only because it appears in many diagnosed patients—without a human ever explicitly stating that rule). This is an association-mining problem, which can be tackled via deep learning or via more guided machine-learning techniques.
This is a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records and provides real-time inferencing at scale. The architecture is built out of open source big data components: Kafka and Spark Streaming for real-time data ingestion and processing, Spark for modeling, and Titan and Elasticsearch for enabling low-latency access to results. The data science components include a UIMA pipeline with custom annotators, machine-learning models for implicit inferences, and dynamic ontologies based on deep learning with Word2Vec for representing and learning new relationships between concepts. Source code is publicly available to enable you to hack away on your own.
Introduction to Gene Mining Part A: BLASTn-off! (adcobb)
In this lesson, students will learn to use bioinformatics portals and tools to mine plant versions of human genes. Student handout and teacher resource materials are available at www.Araport.org, Teaching Resources (Community tab). Suitable for grades 9-12 or first year undergraduate students.
Can drug repurposing be saved with AI 202405.pdf (Paul Agapow)
Presented at DigiTechPharma, London May 2024.
What is drug repurposing? Why is it needed? What systematic approaches are there? Is AI a solution? Why not?
IA, la clave de la genomica (May 2024).pdf (Paul Agapow)
A.k.a. AI, the key to genomics. Presented at 1er Congreso Español de Medicina Genómica. Spanish language.
On the failure of applied genomics. On the complexity of genomics, biology, medicine. The need for AI. Barriers.
Digital Biomarkers, a (too) brief introduction.pdf (Paul Agapow)
Presentation at the Artid workshop, U. Bristol, March 2024, on digital biomarkers for improved clinical trials and monitoring of complex diseases, including neurological & movement disorders.
Journal club and talk given to Health Data Analytics MSc, February 2023. Reflecting on how to do good machine learning over biomedical data, the pitfalls and good practices
Where AI will (and won't) revolutionize biomedicine (Paul Agapow)
Presented at the AI & Big Data Expo, London, December 2022.
Given the hype and success of machine learning and AI in other fields, its application in healthcare is only natural.
- However, the actual successes in medicine have been limited, with a number of high-profile failures.
- Here, I propose that biology is uniquely complex, with our lack of domain knowledge limiting the application of AI.
- However, there is reason for cautious optimism, with AI-led approaches shifting the odds in our favour.
Machine learning, health data & the limits of knowledge (Paul Agapow)
Lecture for Imperial College London's MSc in Health Data Analytics, critiquing a recent paper on COVID diagnosis and moving out to talk about good practices (& limits) in ML and model building
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built a robust Data Copilot on these three concepts, one that can help democratize access to company data assets and boost the performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments were conducted implementing PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads), while the hybrid approach runs certain primitives (i.e., sumAt, multiply) in sequential mode.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
THANKS!
✘ Nathan Lau
✘ “Dr Bioinformatician”
✘ “Dr A.I.”
✘ Fabian Klötzl
✘ Kevin G
✘ Iddo Friedberg
✘ Stephen Ross
✘ Katherine James
✘ Richard Emes
✘ Ming Tang
✘ Ian Holmes
✘ Frederick Ross
✘ Ben van Zwanenberg
✘ And anonymous contributors ...
Editor's Notes
PAUL: Good morning and welcome to the Festival of Genomics. Every year we gather here to see talks and workshops about the genomic revolution, about how arcane molecular technologies and advanced computation are driving the greatest revolution in healthcare ever. How we are teasing apart the tangled web of disease, how sequencing is allowing quicker, targeted diagnosis. Every day, we are witness to a deluge of biomedical wonders.
STEPHEN: Unless you actually work in the field and you know the reality - You know how things actually work, how the “analytical sausage” is made - You know about bad software, failing hardware, ignorance of statistics, poorly understood technology, half-baked experiments and rushed study designs.
You know that behind every groundbreaking Nature paper, there’s a poor harassed bioinformatician in a dim basement room complaining to their PI that those results don’t mean what they think they mean, and hoping the PI doesn't open up the raw results in an Excel file and ask why their favourite gene is missing from the top 10 results...
STEPHEN: So, who are we? We are two working, card-carrying bioinformaticians. I am Stephen Newhouse [INSERT SHORT BIO]
PAUL: And I am Paul Agapow, until recently I was the lead of Translational Bioinformatics at the Data Science Institute at Imperial College, before in December joining a large pharmaceutical company. I can’t tell you which one, but I can say that it rhymes with ‘AstraZeneca’.
STEPHEN: Together with Nathan Lau of QMUL, we are the organisers of Bioinformatics London, a regular meetup for bioinformaticians, computational biologists and genomicists in the capital. At our meetings we usually share announcements of jobs or funding opportunities, have a talk, and then adjourn to the pub where we complain endlessly about our jobs and have a laugh...
It's like AA, but for informaticians...
If this sounds attractive to you, why not come along. If you’d like to give a talk, even better!
PAUL: So today we’re talking back to the rest of the Festival, recounting the small indignities and bad science suffered by bioinformaticians every day. We’ll be asking why bioinformatics is so broken. We’ll be talking about experiments that should not have been done and could not have been done, but were done anyway.
For simplicity, we’ll tell all anecdotes and stories from the first person, although they were gathered from our own experience, from the membership of BioinformaticsLondon and from across the web. We'll complain, criticise and maybe even offer a solution or two. Why is Bioinformatics such a mess? Can we fix it?
PAUL: Like much of the world, we seem obsessed by the phrase “big data”, and by the idea that “biomedical big data” will help decipher complex diseases and point the way to patient stratification and precision therapies. All this, despite the fact that “biomedical big data” largely doesn’t exist. Typical clinical trial and transcriptomic datasets have fewer than 100 subjects, and while huge GWAS experiments get a lot of press, typically they are in the low thousands, as are most real-world datasets of interest. Deep learning typically requires thousands, if not tens of thousands, of samples. So does GWAS, although one recent publication alleges that for complex neurological diseases, millions of subjects may be required.
Still, I’ve been asked many times to combine or merge trials or datasets to “boost power”. Despite the fact that adding subjects gathered from a different population, under a different protocol, using different measurements and metrics actually lowers the power of the dataset.
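To put numbers on “underpowered”, here is a back-of-envelope sample-size sketch for a two-sample comparison using the normal approximation (the effect sizes, alpha and power below are illustrative, not taken from any study mentioned here):

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # significance threshold (two-sided)
    z_beta = z(power)            # desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Even a moderate effect (Cohen's d = 0.5) needs ~63 subjects per group;
# a subtle effect (d = 0.2) needs ~393 per group.
print(n_per_group(0.5), n_per_group(0.2))
```

Against these numbers, the sub-100-subject datasets that dominate the field can only detect fairly large effects, and bolting on subjects from a mismatched cohort does not change that.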
STEPHEN: For a decade, a lot of bioinformatics was driven by illegible, unmaintainable, cryptic Perl scripts - a write-once, read-never language that prides itself on being opaque and never coming within sight of a software engineer, proper coding practice, or version control. All done in a rush to get the work done quickly and published.
PAUL: Now we use R.
PAUL: A microarray analysis revealed no significant results, with no genes showing significantly different expression. Despite this, the PI (a clinician) insisted we examine the “top hit”.
It was a testis specific Y-encoded gene. The cohort was all female.
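A cheap sanity check would have caught this before anyone got excited: compare each sample's recorded sex against expression of a Y-linked gene. A minimal sketch (RPS4Y1 is a real Y-linked gene, but the threshold and data layout here are hypothetical placeholders):

```python
def flag_sex_mismatches(samples, y_gene="RPS4Y1", threshold=1.0):
    """Yield sample IDs whose recorded sex contradicts Y-gene expression."""
    for sample_id, sex, expression in samples:
        has_y_signal = expression.get(y_gene, 0.0) > threshold
        if has_y_signal != (sex == "male"):
            yield sample_id

cohort = [
    ("S1", "female", {"RPS4Y1": 0.1}),
    ("S2", "female", {"RPS4Y1": 8.5}),  # suspicious: strong Y signal
    ("S3", "male",   {"RPS4Y1": 6.2}),
]
print(list(flag_sex_mismatches(cohort)))  # -> ['S2']
```

The same pattern - checking metadata against signals the data itself carries - catches mislabelled samples, swapped tubes, and, in this case, a "top hit" that was biologically impossible.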
STEPHEN: Frequently, clinician scientists and PIs have a tendency to hear words and latch onto them, making them a new catchphrase for anything related to bioinformatics and any kind of “advanced” data analysis - seemingly to make themselves sound "informed"...
PAUL: Can I have a Docker? In the cloud? What if we used AI?
PAUL: As a bioinformatician, I’m the goto expert for sequencing, systems biology, genome analysis, phylogenetics, proteomics, microbiomics, lipidomics, high performance computing, machine learning, systems administration, programming, web development, databasing, dev-ops, laying cables, formatting hard drives and finding out why your email isn’t working.
STEPHEN: For a paper on novel ways to interpret and visualise data, another co-author suggested I shouldn’t be an author on a paper because “there would be more bioinformaticians than real scientists in the authorship list”...
Sometimes we get treated worse than PhD students!
PAUL: The PI had a tendency to pose vast, sweeping technical problems and then look at me and say “This is a job for the INFORMATICS TEAM”. There was only one of me and despite my other faults, I have yet to develop multiple personality disorder.
STEPHEN: Let’s do AI! You know, let’s build a deep learning / Bayesian / hierarchical model... And you know all this jargon is just going to give you exactly the same results as a simple correlation, t-test or network analysis, don’t you? Because in the end we are just comparing means between groups… and besides, we often don’t have the numbers... or the compute (GPUs).
PAUL: I was once forwarded a job ad for a sole staff bioinformatician for a busy hospital department. The JD was the result of 6 universities, trusts and departments, listing 73 “key responsibilities” - which works out to about 30 minutes per key responsibility per week.
The named duties ranged from sequencing, database development, research, writing papers and presenting them at conferences, answering user requests, developing software, installing and networking computers, and training, all the way down to keeping track of reagent levels and making sure there was milk in the fridge.
STEPHEN: For a project I was on, the sample processing procedure had been changed part-way through the study, resulting in a massive batch effect that simply couldn’t be corrected for. When I brought this to the attention of PI clinicians, they replied "What do you know about this, you don't have any Nature papers."
For reference, neither did they.
What I had was years of experience working with (bad) data... and they weren’t all Nature papers (because of batch effects).
PAUL: Hey, friendly bioinformatics guy! I have RNAseq from one case...can you tell me which genes are differentially expressed? Can we do analysis on it?
PAUL: Hey, friendly bioinformatics guy! We've got a really interesting multi-omic dataset. Why don’t we analyse it?
STEPHEN: How many samples?
PAUL: 10.
STEPHEN: Uh, what sort of background?
PAUL: Oh, they’re just random patients - all mixed race, super interesting! right?
STEPHEN: How did you get the samples?
PAUL: Buccal swabs
STEPHEN: This could be difficult ...
PAUL: Why not use deep learning and AI?
STEPHEN: [sigh]....I could stick it in Docker?
STEPHEN: Don’t let journals and peer-reviewers off the hook. We’ve had papers rejected because they don’t use the latest tech and the newest sexy software... Sometimes reviewers fail to recognise that the approach they're suggesting is wholly inappropriate for the dataset, or might be economically out of reach of the scientists conducting still-worthy science.
PAUL: Famously, the methodology researcher John Ioannidis has published on mis- and over-interpretation of biomedical research, highlighting researcher bias, unrepresentative samples, underpowered studies, cherry-picking, p-hacking, poor control groups, inappropriate experimental design, multiple hypothesis testing, post hoc analysis and mishandling of outliers. He estimates that 60% of research is incorrect.
I heard this and thought “only 60%?”
Most scientific research involves researchers painting the bullseye on the wall *after* spraying it with bullets.
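The bullseye problem has a simple arithmetic core: test enough hypotheses and "hits" are guaranteed even when every null is true. A toy calculation (the 20,000-gene figure is just a typical transcriptome-scale example):

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected chance 'discoveries' when every null hypothesis is true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cutoff keeping family-wise error at alpha."""
    return alpha / n_tests

n_genes = 20_000
print(expected_false_positives(n_genes))  # 1000.0 spurious hits from noise alone
print(bonferroni_threshold(n_genes))      # 2.5e-06
```

A thousand "significant" genes from pure noise is why uncorrected genome-wide p < 0.05 lists mean nothing, and why the bullseye has to be painted before the bullets fly.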
STEPHEN: You didn’t consult with me at the start of the experimental design to advise on the samples and/or duplicates for significant results.
You didn’t ask me what was required to do this kind of analysis.
You didn’t budget for any analysis or any of my time.
You stored all your data in Excel or - even worse - Word - or even worse - PDFs.
And yet, here you are, blaming me for not being able to prove your hypothesis correct.
STEPHEN: While we’re talking about Excel, let’s commemorate those genes that are mangled in more than 5% of publications.
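The mangling happens because spreadsheet software silently coerces gene symbols that look like dates. A toy illustration of the effect (the month-prefix table mimics Excel-style auto-conversion; the gene list is illustrative, not exhaustive):

```python
# Month-name prefixes that collide with human gene symbols.
MONTHS = {"SEPT": "Sep", "MARCH": "Mar", "DEC": "Dec", "OCT": "Oct"}

def excel_mangle(symbol):
    """Return the date a spreadsheet would coerce the symbol to, or None if safe."""
    for prefix, month in MONTHS.items():
        rest = symbol[len(prefix):]
        if symbol.startswith(prefix) and rest.isdigit():
            return f"{int(rest)}-{month}"  # e.g. SEPT2 -> "2-Sep"
    return None

for gene in ["SEPT2", "MARCH1", "DEC1", "TP53"]:
    print(gene, "->", excel_mangle(gene) or "safe")
```

The conversion is one-way: once "SEPT2" becomes "2-Sep" in a saved sheet, the original symbol is gone, which is how the corruption propagates into supplementary tables and, from there, into more than 5% of publications.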
PAUL: Precision medicine can be defined as repeatedly tweaking your patient cohort, dropping subjects in-and-out, until analysis yields a significant result.
PAUL: Let’s not exclude bioinformatics software authors, who are almost always bioinformaticians themselves.
Science and academia are corrosive to proper programming and software engineering. They are largely unable to pay for professional programmers and engineers, resulting in complex software, platforms and systems being constructed by people who literally learnt their skills from a book. Once, to try and understand the results I was getting from a popular bioinformatics program, I started reading the source code. Each file was several thousand lines long, containing large vestigial lumps of code from previous versions: func_1, func_2, func_4, func_5a. Variables were called h, hh, hh2, foo and - delightfully - thing.
Of course, standard pillars of professional software engineering like version control, refactoring, profiling, and unit testing have no value in an environment where only results matter, not the maintainability of those results. How many programs have been validated on publication with a thorough test suite that demonstrates that the program works? (Rather than the results merely doing little violence to our expectations.) When those programs are updated, how many are retested against the test suite to see that they still work?
STEPHEN: Here is a slightly scandalous piece of advice about academia: you don’t want to be seen as competent.
You don’t want to become known as the person who can do a thing.
Thus it was a grave error of judgment when I helped the grad student in the lab next door by setting up remote access and showing them how to run a program.
This escalated to doing the same for the other students, then their PI, all while slowly becoming known as “the guy who will do X for you, always, and never says no (but should)”.
STEPHEN: How is it possible that people still don’t know how to use BLAST, one of the oldest and most commonly used bioinformatics programs in existence?
(This also affects -num_descriptions and -num_alignments.)
PAUL: When confronted with a complex biomedical problem, there’s a peculiar optimism in the idea “let's make a database”.
The irony is increased even more if the database was populated with dubious or untrustworthy data and then that database is used to annotate and identify new additions to the database, resulting in a runaway train.
And the irony hits maximum if the bioinformatician who constructed the database is then charged with making queries on it.
STEPHEN: Isn’t this all ridiculous?
Data Analysis is basic, foundational to biomedicine and biotech.
Why is it often done so badly?
Why are there so many papers with weak, incorrect or irreproducible results?
Why is there so much bad software?
PAUL: We have built a system that rewards rapid and frequent publication, that incentivizes novel and startling findings. We’ve put a bunch of educated and intelligent people into that system, told them that their success and career progression depends on chasing those measures. Then we’re disappointed when people chase those measures and not valuable things like careful and thorough checking of analysis, reproducibility, documentation and verification.
Shouldn’t we have higher standards? Shouldn’t funders, research councils and departments insist on higher standards for the research they’re overseeing? Shouldn’t we have higher standards for ourselves?
PAUL: Maybe we should give up. Maybe we should just stop being bioinformaticians. I mean this in several ways.
First, there was a time when bioinformatics and biology and biomedicine were almost entirely practiced within the bounds of the university, the hospital and the research institute. People had little or no career flexibility and so just put up with a lot of bad workplaces and bad bosses. This is no longer the case. There is a healthy, expanding commercial sphere for bioinformatics - companies and startups where you can work on interesting and useful problems. The craze for data science has provided another avenue for frustrated bioinformaticians. Those places are not free of the problems we’ve outlined, but there are choices. The academy is not the only show in town.
Second, maybe we should stop calling ourselves “bioinformaticians” and stop doing “bioinformatics”. The terms have become so abused and so hopelessly broad as to be meaningless, potentially encompassing anyone who does anything with a computer related to biology: all the way from a Master's student who can use the web version of BLAST, through mathematical ecologists, website designers, genomicists, research software engineers and biostatisticians, to computational chemists. The words are not doing us any good, so let’s walk away from them. Call yourself anything but a bioinformatician and see if your life gets better.
Finally, let's take that to the ultimate end. Let’s kill bioinformatics. Much as economies have moved from being largely farming and industry to being largely services and “thought work”, biomedical science has moved from lab coats and hospital gowns to being dominated by analysis. We’ve commoditized the manual “wet” side of science and almost everyone does “computer work”. Everyone is a bioinformatician. So maybe we should stop doing bioinformatics and do science instead.
STEPHEN: Solutions?
Educate the PIs/clinicians/basic scientists
N > 100, analysis plan, timelines
Trust us when we say it will take a week
Set up a career track for us
Expectation management
Unplanned work management
Involve us from the beginning: treat us like a statistician