Multivariate Data Analysis Workshop at UC Davis 2012 (Dmitry Grapov)
Introductory Workshop for Multivariate Data Analysis and Visualization
Dmitry Grapov1,2,3*, John W Newman1,2
1 Nutrition, University of California Davis, Davis, CA,
2 USDA/ARS Western Human Nutrition Research Center, Davis, CA
3 Designated Emphasis in Biotechnology, University of California Davis, Davis, CA,
Next-generation “omics” tools are harbingers of the golden age of biology. Biologists are on the cusp of breaking through the veil of complexity surrounding the emergent properties of complex biological systems. However, these same rapid technological advances are also transforming the study of biology into a data-intensive science. The ever-growing gap between data and theory necessitates that biologists become familiar with multivariate computational and visualization methods in order to fully understand their experimental results.
We are offering a summer workshop covering introductory concepts and applications of multivariate data analysis (MDA) and visualization techniques. Join us for a week to familiarize yourself with MDA concepts, covering topics in multiple hypothesis testing, exploratory projection pursuit, multivariate classification and regression modeling, networks, and machine learning. Get experience with MDA through hands-on analyses of real-world data using freely available tools. Learn how to make the most of your time and experimental results by quickly understanding your data’s complexity, main features and inter-relationships.
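As a taste of the exploratory projection methods covered in the workshop, the sketch below runs a principal component analysis in Python with scikit-learn; the built-in iris data and the two-component choice are illustrative assumptions, not workshop materials.

```python
# Minimal sketch of an exploratory projection (PCA) on a samples x variables
# matrix; the iris data stand in for any autoscaled multivariate dataset.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # autoscale: mean 0, unit variance
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)           # sample scores on PC1/PC2

print("variance explained:", pca.explained_variance_ratio_)
print("first two score rows:\n", scores[:2])
```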
Presentation by Bernd Pulverer on EMBO's 'Source Data' and the next generation of open access given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK
Introduction to Research methodology: Orientation for Doctoral Program Course... (niloysarkar)
Despite the critical importance of research, India's current investment in research and innovation is only 0.69% of GDP, compared to 2.8% in the United States of America, 4.3% in Israel and 4.2% in South Korea. (Source: NEP2020, GoI)
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi... (arx-deidentifier)
Presented at IEEE CBMS 2017: When individual-level health data is shared in biomedical research, the privacy of patients and probands must be protected. This is typically achieved with methods of data de-identification, which transform data in such a way that formal guarantees about the degree of protection from re-identification can be provided. In the process, it is important to minimize loss of information to ensure that the resulting data is useful. A typical use case is the creation of predictive models for knowledge discovery and decision support, e.g. to infer diagnoses or to predict outcomes of therapies. A variety of methods have been developed which can be used to build robust statistical classifiers from de-identified data. However, they have not been tuned for practical use and they have not been implemented into mature software tools. To bridge this gap, we have extended ARX, an open source anonymization tool for health data, with several new features. We have implemented a method for optimizing the suitability of de-identified data for building statistical classifiers and a method for assessing the performance of classifiers built from de-identified data. All methods are accessible via a comprehensive graphical user interface. We have used our methods to create logistic regression models from a patient discharge dataset for predicting the costs of hospital stays. The results show that our method enables the creation of privacy-preserving classifiers with optimal prediction accuracy.
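To make the use case concrete, here is a minimal, illustrative sketch (not the ARX API): a quasi-identifier is generalized into coarse bands, standing in for de-identification, and a logistic regression classifier is then trained and evaluated on the generalized data. The synthetic age and length-of-stay features and the cost threshold are assumptions for illustration only.

```python
# Illustrative sketch only (not ARX): generalize a quasi-identifier
# (age -> 10-year bands), then fit a logistic regression on the generalized data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
age = rng.integers(18, 90, size=1000)                 # synthetic quasi-identifier
los = rng.integers(1, 30, size=1000)                  # synthetic length of stay
high_cost = (0.03 * age + 0.3 * los + rng.normal(0, 2, 1000)) > 7

age_banded = (age // 10) * 10                         # generalization step
X = np.column_stack([age_banded, los])
X_tr, X_te, y_tr, y_te = train_test_split(X, high_cost, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
print("accuracy on generalized data:", accuracy_score(y_te, clf.predict(X_te)))
```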
GWA studies are perhaps most often used for studying the genetic basis of human diseases, but this technology also has great utility for studying the natural variation of other organisms.
In this webcast, Ashley Hintz, Field Application Scientist, will discuss the utility of SVS for analyzing plant GWA data, using publicly available SNP data for Arabidopsis thaliana as a case study. Along the way, Ashley will demonstrate how SVS can be used to manage data, analyze population structure, perform genotype QA and ultimately replicate a published genetic association in A. thaliana using EMMAX regression. She will also address the flexibility of SVS for analyzing the genomes of other plant and animal species.
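For readers unfamiliar with association scans, the sketch below runs the simplest possible per-SNP test, an ordinary linear regression of a phenotype on genotype dosage, on synthetic data. It is not EMMAX and omits the mixed-model correction for population structure and relatedness that SVS applies; the SNP counts and effect size are made up for illustration.

```python
# Hedged sketch of a naive per-SNP association scan (ordinary linear regression
# of phenotype on 0/1/2 genotype dosage). NOT EMMAX: no mixed-model correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_samples, n_snps = 200, 500
genotypes = rng.integers(0, 3, size=(n_samples, n_snps))    # 0/1/2 dosages
phenotype = 0.8 * genotypes[:, 42] + rng.normal(size=n_samples)

p_values = np.array([
    stats.linregress(genotypes[:, j], phenotype).pvalue
    for j in range(n_snps)
])
print("top SNP:", p_values.argmin(), "p =", p_values.min())
```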
Master's Thesis - deep genomics: harnessing the power of deep neural networks... (Enrico Busto)
The Human Genome Project [1], an international scientific research project with the goal of determining the sequence of nucleotide base pairs that make up human DNA, lasted roughly 15 years and cost $5 billion (adjusted for inflation). With recent advances in genome sequencing technology, the cost has dropped to a few hundred dollars [2] and a genome can be sequenced overnight.
Being able to access this kind of information may have a deep impact on the way complex diseases are treated: physicians will shift from general-purpose treatments to specific ones, tailored to the individual patient’s genomic features. This approach is referred to as precision medicine.
There are, however, several caveats. First, due to the nature of the problem, knowledge of both the biomedical and computer science domains is required to approach it correctly; second, unlike more classical scenarios such as image classification or object detection, it is much more difficult to assess the accuracy of the system, owing to the complex and multifactorial nature of diseases such as cancer and neurodegenerative disorders.
Moreover, a black-box solution is unlikely to be of any use, due to legal and ethical reasons: interpretability of the model is more crucial than ever.
The goal of this thesis is to explore the possibilities and the limits of techniques based on deep neural networks for the analysis of biomolecular data, experimenting with publicly available datasets.
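As a flavor of the kind of model such a thesis might experiment with (a generic sketch, not code from the thesis), the snippet below defines a tiny 1D convolutional network over one-hot encoded DNA sequences in PyTorch; the sequence length, channel counts, and random labels are placeholder assumptions.

```python
# Minimal sketch: a tiny 1D CNN over one-hot encoded DNA sequences, a common
# architecture family for regulatory genomics tasks. Data are synthetic.
import torch
import torch.nn as nn

batch, seq_len = 8, 100
x = torch.randint(0, 2, (batch, 4, seq_len)).float()   # 4 channels for A/C/G/T
y = torch.randint(0, 2, (batch,)).float()              # synthetic binary labels

model = nn.Sequential(
    nn.Conv1d(4, 16, kernel_size=8), nn.ReLU(),        # scan for sequence motifs
    nn.AdaptiveMaxPool1d(1), nn.Flatten(),              # pool over positions
    nn.Linear(16, 1),                                   # binary prediction logit
)
loss = nn.BCEWithLogitsLoss()(model(x).squeeze(1), y)
loss.backward()
print("loss:", float(loss))
```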
Ontology-Driven Clinical Intelligence: Removing Data Barriers for Cross-Disci... (Remedy Informatics)
The presentation describes how Remedy Informatics is advocating and innovating "flexible standardization" through an ontology-driven approach to clinical research. You will see in greater detail how a foundational, standardized Mosaic Ontology can be extended for more specific research applications and for even more focused disease research.
Drug discovery and development is a long and expensive process that has so notoriously bucked Moore's Law over time that it now has its own law named after it: Eroom's Law ('Moore' reversed). It is estimated that the attrition rate of drug candidates is up to 96%, and the average cost to develop a new drug has reached almost $2.5 billion in recent years. One of the major causes of the high attrition rate is drug safety, which accounts for 30% of drug failures. Even after a drug is approved and on the market, it can be withdrawn due to safety problems. Therefore, evaluating drug safety extensively and as early as possible becomes all the more important to accelerate drug discovery and development. This talk provides a high-level overview of the current process of rational drug design that has been in place for many decades and covers some of the major areas where the application of AI, deep learning and ML based techniques has had the most gains. Specifically, this talk covers a variety of drug safety related AI and ML based techniques currently in use, which can generally be divided into 3 main categories: 1. Classification 2. Regression 3. Read-across. The talk will also cover how a hierarchical classification methodology can simplify the problem of assessing the toxicity of any given chemical compound. We will also address recent progress on predictive models and techniques built for various toxicities, and cover some publicly available databases, tools and platforms that make them easy to leverage. We will also compare and contrast various modeling techniques, including deep learning, and their accuracy using recent research. Finally, the talk will address some of the challenges and limitations yet to be addressed in the area of drug safety assessment.
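The hierarchical-classification idea mentioned above can be illustrated with a short, hedged sketch: a first-level model decides toxic vs non-toxic, and a second-level model assigns a toxicity class only to compounds flagged toxic. The random-forest choice, the synthetic descriptor matrix, and the two toxicity classes are assumptions for illustration, not any published model.

```python
# Hedged illustration of hierarchical classification for toxicity assessment.
# Features are synthetic stand-ins for molecular descriptors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 20))                          # descriptor matrix
is_toxic = (X[:, 0] + X[:, 1] > 0).astype(int)          # level-1 label
tox_class = np.where(X[:, 2] > 0, "hepato", "cardio")   # level-2 label

level1 = RandomForestClassifier(random_state=0).fit(X, is_toxic)
level2 = RandomForestClassifier(random_state=0).fit(
    X[is_toxic == 1], tox_class[is_toxic == 1]          # trained on toxic compounds only
)

def predict_hierarchical(x):
    if level1.predict(x.reshape(1, -1))[0] == 0:
        return "non-toxic"
    return level2.predict(x.reshape(1, -1))[0]

print(predict_hierarchical(X[0]))
```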
Slides of the 2015 Bio Data World Congress show how our analyzegenomes.com services are combined to support precision medicine in the context of modern oncology treatment.
Domains such as drug discovery, data science, and policy studies increasingly rely on the combination of complex analysis pipelines with integrated data sources to reach conclusions. A key question then arises: what are these conclusions based upon? Thus, there is a tension between integrating data for analysis and understanding where that data comes from (its provenance). In this talk, I describe recent work that attempts to facilitate transparency by combining provenance tracked within databases with the data integration and analytics pipelines that feed them. I discuss this with respect to use cases from public policy as well as drug discovery.
Given at: http://ccct.uva.nl/content/ccct-seminar-21-february-2014
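As a generic illustration of the provenance idea (not the speaker's system), a pipeline step can emit a small provenance record alongside its output: input hashes, parameters, and a timestamp, so that downstream conclusions can be traced back to their sources. The file naming and record fields below are assumptions.

```python
# Generic sketch: wrap a pipeline step so it writes a provenance record
# (input hash, parameters, timestamp) next to its result.
import hashlib, json, time

def run_step(name, input_path, params, step_fn):
    with open(input_path, "rb") as fh:
        input_hash = hashlib.sha256(fh.read()).hexdigest()
    result = step_fn(input_path, **params)          # run the actual analysis step
    record = {
        "step": name,
        "input": input_path,
        "input_sha256": input_hash,
        "params": params,
        "timestamp": time.time(),
    }
    with open(f"{name}.prov.json", "w") as fh:      # provenance sidecar file
        json.dump(record, fh, indent=2)
    return result
```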
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research... (Golden Helix Inc)
With a focus on scalable architecture and optimized native code that fully utilizes the available CPU and RAM, we can scale genomic analysis to sizes conventionally considered Big Data on a single host. In this webcast, we demonstrate recent innovations and features in Golden Helix solutions that enable the analysis of big data on your own terms.
Docker in Open Science Data Analysis Challenges by Bruce Hoff (Docker, Inc.)
Typically in predictive data analysis challenges, participants are provided a dataset and asked to make predictions. Participants include with their prediction the scripts/code used to produce it. Challenge administrators validate the winning model by reconstructing and running the source code.
Often data cannot be provided to participants directly, e.g. due to data sensitivity (data may be from living human subjects) or data size (tens of terabytes). Further, predictions must be reproducible from the code provided by participants. Containerization is an excellent solution to these problems: rather than providing the data to the participants, we ask the participants to provide a Dockerized "trainable" model. We run both the training and validation phases of machine learning and guarantee reproducibility 'for free'.
We use the Docker tool suite to spin up and run servers in the cloud to process the queue of submitted containers, each essentially a batch job. This fleet can be scaled to match the level of activity in the challenge. We have used Docker successfully in our 2015 ALS Stratification Challenge and our 2015 Somatic Mutation Calling Tumour Heterogeneity (SMC-HET) Challenge, and are starting an implementation for our 2016 Digital Mammography Challenge.
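A hedged sketch of the "queue of submitted containers" pattern using the docker-py SDK is shown below. The image names, the mounted data path, and the read-only bind are hypothetical; the actual challenge infrastructure (submission queues, the cloud fleet, scoring logic) is not reproduced here.

```python
# Hedged sketch: run each submitted participant image against privately held
# data mounted read-only, so participants never see the data itself.
import docker

client = docker.from_env()
submissions = ["participant-a/model:latest", "participant-b/model:latest"]  # hypothetical images

for image in submissions:
    logs = client.containers.run(
        image,
        volumes={"/secure/challenge-data": {"bind": "/data", "mode": "ro"}},  # hypothetical path
        remove=True,                                  # clean up after each batch job
    )
    print(image, logs.decode()[:200])
```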
Ijricit 01-002 enhanced replica detection in short time for large data sets (Ijripublishers Ijri)
Similarity checking of real-world entities, commonly called data replica (duplicate) detection, is a necessary task these days. Execution time is a critical factor in replica detection for large datasets, yet it must be reduced without compromising the quality of the result. In this work we introduce two replica detection algorithms that provide improved procedures for finding replicated data within limited execution periods, offering a better use of time than conventional techniques. The two algorithms are the progressive sorted neighborhood method (PSNM), which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection even on very large datasets.
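To give a sense of the sorted-neighborhood family that PSNM belongs to, the sketch below implements only the basic, non-progressive idea: sort records by a key and compare each record with its neighbors inside a small sliding window. The toy records, window size, and similarity threshold are illustrative assumptions; the progressive comparison scheduling that distinguishes PSNM is omitted.

```python
# Basic sorted-neighborhood duplicate detection: sort by a blocking key,
# then compare only records within a small sliding window.
from difflib import SequenceMatcher

records = ["Jon Smith", "John Smith", "Jane Doe", "J. Smyth", "Janet Doe"]
window = 2

sorted_recs = sorted(records)                        # sort by the key (here, the string itself)
pairs = []
for i, rec in enumerate(sorted_recs):
    for j in range(i + 1, min(i + 1 + window, len(sorted_recs))):
        sim = SequenceMatcher(None, rec, sorted_recs[j]).ratio()
        if sim > 0.8:                                # similarity threshold (assumed)
            pairs.append((rec, sorted_recs[j], round(sim, 2)))

print(pairs)
```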
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks (Carole Goble)
Keynote presentation at the iConference 2015, Newport Beach, California, 26 March 2015.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
EUGM 2013 - Andrea de Souza (Broad Institute): Setting the stage for the “SD”... (ChemAxon)
Building on the success of the Molecular Libraries Program (MLP), the Broad Institute MLP team is co-leading with the National Center for Advancing Translational Sciences (NCATS) an NIH-sponsored project across 7 institutions to augment the data in PubChem with the creation of the Bioassay Research Database (BARD). The BARD platform standardizes the representation of bioassays in a next-generation repository and provides a user-friendly interface that supports sophisticated queries and data mining. Data originating from publicly-funded chemical biology research efforts will be presented with appropriate context, including structured assay and result annotations. These annotations use relevant ontologies including, for example, the BioAssay Ontology, Gene Ontology, and the Unit Ontology. We simplified the representation of ontologies into a hierarchical data dictionary to enable data producers to more easily create and upload projects, assays, and results, while creating two separate user interfaces for data consumers. The BARD WebQuery interface offers a Google-like search with auto-suggest functionality for complex queries, such as retrieving all assays and results for biological pathways like "DNA repair" or "oxidative stress", and presents this information in a rich user interface that includes spreadsheet support for structure-activity relationship analyses. Compounds, projects, and assays can be exported into an Amazon-like query cart for refining queries, and additional computations can be executed on datasets via community-developed plug-ins, including promiscuity analyses via the BioActivity Data Associative Promiscuity Pattern Learning Engine (BADAPPLE) and a CYP450 metabolism site prediction plugin (http://www.farma.ku.dk/smartcyp/) using 2D structure fingerprints. Integration between the WebQuery and Desktop clients enables power users to initiate analyses in WebQuery and gain more insight via the Desktop client.
Lastly, as industry and academia work together to innovate in small-molecule therapeutics, we have created an initial specification for the Assay Definition Standard. This standard, through the Assay Definition Format, has been used as the medium of data file transfer for data upload. We expect that the chemical biology community now has an opportunity to leverage this standard to routinely transfer assay and result data within and between information systems and organizations.
This presentation will highlight the BARD platform with a focus on representing the cumulative body of work that exploits the ChemAxon toolkit.
Prote-OMIC Data Analysis and Visualization (Dmitry Grapov)
Introductory lecture to multivariate analysis of proteomic data.
Material from the UC Davis 2014 Proteomics Workshop.
See more at: http://sourceforge.net/projects/teachingdemos/files/2014%20UC%20Davis%20Proteomics%20Workshop/
Branch: An interactive, web-based tool for building decision tree classifiers (Benjamin Good)
A crucial task in modern biology is the prediction of complex phenotypes, such as breast cancer prognosis, from genome-wide measurements. Machine learning algorithms can sometimes infer predictive patterns, but there is rarely enough data to train and test them effectively, and the patterns they identify are often expressed in forms (e.g. support vector machines, neural networks, random forests composed of tens of thousands of trees) that are very difficult to understand. In addition, it is generally unclear how to include prior knowledge in the course of their construction.
Decision trees provide an intuitive visual form that can capture complex interactions between multiple variables. Effective methods exist for inferring decision trees automatically but it has been shown that these techniques can be improved upon via the manual interventions of experts. Here, we introduce Branch, a new Web-based tool for the interactive construction of decision trees from genomic datasets. Branch offers the ability to: (1) upload and share datasets intended for classification tasks (in progress), (2) construct decision trees by manually selecting features such as genes for a gene expression dataset, (3) collaboratively edit decision trees, (4) create feature functions that aggregate content from multiple independent features into single decision nodes (e.g. pathways) and (5) evaluate decision tree classifiers in terms of precision and recall. The tool is optimized for genomic use cases through the inclusion of gene and pathway-based search functions.
Branch enables expert biologists to easily engage directly with high-throughput datasets without the need for a team of bioinformaticians. The tree building process allows researchers to rapidly test hypotheses about interactions between biological variables and phenotypes in ways that would otherwise require extensive computational sophistication. In so doing, this tool can both inform biological research and help to produce more accurate, more meaningful classifiers.
A prototype of Branch is available at http://biobranch.org/
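The evaluation workflow Branch supports can be approximated with a short sketch (not the Branch codebase): restrict a shallow decision tree to a small, hand-picked feature subset, standing in for expert-chosen genes, and report precision and recall on held-out data. The dataset and the picked feature indices are arbitrary assumptions.

```python
# Hedged sketch: a shallow decision tree over a hand-picked feature subset,
# evaluated with precision and recall on a held-out split.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
picked = [0, 7, 20]                                   # stand-in for expert-chosen features
X_tr, X_te, y_tr, y_te = train_test_split(X[:, picked], y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
pred = tree.predict(X_te)
print("precision:", precision_score(y_te, pred), "recall:", recall_score(y_te, pred))
```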
Being FAIR: Enabling Reproducible Data Science (Carole Goble)
Talk presented at Early Detection of Cancer Conference, OHSU, Portland, Oregon USA, 2-4 Oct 2018, http://earlydetectionresearch.com/ in the Data Science session
Building bioinformatics resources for the global community (ExternalEvents)
http://www.fao.org/about/meetings/wgs-on-food-safety-management/en/
Building bioinformatics resources for the global community. Presentation from the Technical Meeting on the impact of Whole Genome Sequencing (WGS) on food safety management and GMI-9, 23-25 May 2016, Rome, Italy.
Metabolomics and Beyond: Challenges and Strategies for Next-gen Omic Analyses
Peter (Yun-shao) Sung's Resume 2016 (III)
PETER YUN-SHAO SUNG
112-20 72th Drive, Apt D13, Forest Hills, NY 11375 | U.S. Permanent Resident | 347.393.9026 | yss265@nyu.edu | LinkedIn | GitHub | GoogleScholar
EDUCATION
NEW YORK UNIVERSITY, COURANT INSTITUTE 2016
Master of Science in Computer Science, GPA: 3.74
CORNELL UNIVERSITY 2010
Master of Engineering in Biomedical Engineering, GPA: 3.5
NATIONAL TAIWAN UNIVERSITY 2007
Bachelor of Science in Engineering Science/Ocean Engineering, GPA: 3.2
RELATED COURSES
Fundamental Algorithms, Computational Machine Learning, Operating Systems, Heuristic Problem Solving (link), Real-Time Big Data (link), Production Quality Software, Deep Learning, Search Engine Architecture (link)
PROFESSIONAL EXPERIENCES
Bioinformatic Analyst 2010-present
Department of Pathology, Memorial Sloan-Kettering Cancer Center, NYC
• Designed and automated a novel analysis pipeline for genome mutation diagnosis from clinical tumor data (>100M reads)
• Published over 30 papers on identifying novel genomic signatures leading to sarcoma development
Software Engineer Intern 2015 - Sep 2015
Orderhood, LLC (link)
• Developed an on-demand delivery service with Node and React; implemented real-time rendering of runners' best routes
• Achieved a 52% dashboard efficiency improvement through backend refactoring and developed an API for robust RPC handling
Software Engineer 2014-2015
Massive Bio, LLC (link)
• Designed and developed open-source software modules for various steps in a standard bioinformatics pipeline
• Designed and benchmarked NGS tools for mutation detection, achieving up to 99% sensitivity and 90% specificity
Bioinformatician 2013-2014
Institute of Computational Biomedicine, Weill Cornell Medical College, NYC
• Improved open-source C++ tools to handle >100M reads and developed pipelines yielding a >50x efficiency improvement
MACHINE LEARNING PROJECTS
Learning on Music Structure with Spectral Clustering (link)
• Invented a novel model for training a machine to identify batches of musical structures based on Laplacian spectral clustering
Music Genre Classification (link)
• Invented novel scatter feature extraction with VLAD for efficient learning, achieving 82% accuracy versus the original 60%
Search Engine Based Movie Recommendation System (link)
• Implemented and deployed our MapReduce method on AWS, wrapped with a self-designed scalable file system
• Built a RESTful website to classify user preferences and make recommendations accordingly
Big Data for Stock Analysis (link)
• Applied MapReduce to US stock correlation analysis and sentiment analysis of news from the past 5 years
SELECTED PUBLICATIONS (over 33 publications; complete list available)
1. Identification of Recurrent NAB2-STAT6 Gene Fusions in Solitary Fibrous Tumor/Hemangiopericytoma by Clinical Sequencing. Nature Genetics 45, 180–185 (2013). (Impact factor: 35.5)
2. Monoclonality of Multifocal Epithelioid Hemangioendothelioma of the Liver by Analysis of WWTR1-CAMTA1 Breakpoints. Cancer Genetics, 2012. (Equally Contributing Author)
AWARDS AND ACHIEVEMENTS
Tuition reimbursement, Memorial Sloan-Kettering Cancer Center, NYC 2013
Top 0.97% of over 70,000 test takers in Mathematics on the Department Required Test, Taiwan 2004
Member of the school team for the International Physics Olympiad, Taiwan 2003
LEADERSHIP
Student Leader 2005-2006
Ten Outstanding Young Leaders Foundation, Taipei, Taiwan
• Conducted workshops and reported achievements to the board of directors, including the president, in the Legislative Yuan of Taiwan
EXPERIENCE AND SKILLS
Technology: Python, C++, Torch, JavaScript, React, Node, Perl, R, FLUX, Shell, MapReduce, EC2, Beanstalk, S3, Hive, Hadoop
Skills: CNN, RNN, LSTM, GAN, VLAD, SGD, kmeans, kmeans++, Principal Component Analysis
Language: Mandarin (fluent), English (fluent), Taiwanese (fluent)