Presented this slide deck to analytic and evaluation professionals at the Ohio Program Evaluators' Group's bi-annual conference. Discussed how to reduce large, complex datasets into smaller, manageable projects.
Scholarly Communication for Bioinformatics Students, by Philip Bourne
Presentation made to the incoming bioinformatics and systems biology students at UCSD on how they could get involved in changing scholarly communication. Given February 28, 2011
GWA studies are perhaps most often used for studying the genetic basis of human diseases, but this technology also has great utility for studying the natural variation of other organisms.
In this webcast, Ashley Hintz, Field Application Scientist, will discuss the utility of SVS for analyzing plant GWA data, using publicly available SNP data for Arabidopsis thaliana as a case study. Along the way, Ashley will demonstrate how SVS can be used to manage data, analyze population structure, perform genotype QA and ultimately replicate a published genetic association in A. thaliana using EMMAX regression. She will also address the flexibility of SVS for analyzing the genomes of other plant and animal species.
The ChemSpider database is an online resource containing more than 26 million chemicals drawn from over 400 data sources. As a result, the database is a rich resource supporting the verification and elucidation of chemical structures, and it is used by mass spectrometrists around the world through both the online user interface and the application programming interface. This presentation will provide an overview of how ChemSpider can be used for structure identification, including (1) direct interaction with the online interface; (2) integration with mass spectrometry vendor software; (3) applications to the identification of “known unknowns,” with a comparison against the capabilities of CAS SciFinder; and (4) the hosting of online mass spectral data.
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata, by Michel Dumontier
Biomedical researchers will remain stymied in their ability to take full advantage of the Big Data revolution if they can never find the datasets that they need to analyze, if there is lack of clarity about what particular datasets contain, and if data are insufficiently described.
CEDAR, an NIH BD2K Center of Excellence, aims to develop methods and tools to vastly ease the burden of authoring good experimental metadata, and to maximally use this information to zero in on datasets of interest.
Natural Language Processing on Non-Textual Data, by gpano
Talk by Casey Stella, presented at the SF Data Mining Hadoop Summit Meetup, on June 8, 2015. Notebook available at https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/ipython/clinical2vec.ipynb
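The core idea of the talk, treating sequences of non-textual events such as clinical codes as "sentences" so that word-embedding techniques apply, can be sketched without any NLP library by building co-occurrence vectors over a context window. This is a simplified stand-in for the word2vec training in the linked notebook; the event codes below are invented for illustration.

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(sequences, window=2):
    """Build a co-occurrence vector for each code from event sequences."""
    vecs = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, code in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[code][seq[j]] += 1
    return {c: dict(ctx) for c, ctx in vecs.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical patient event sequences (diagnosis/drug/lab codes):
patients = [
    ["dx:diabetes", "rx:metformin", "dx:neuropathy"],
    ["dx:diabetes", "rx:metformin", "lab:hba1c"],
    ["dx:hypertension", "rx:lisinopril", "lab:bmp"],
]
vecs = cooccurrence_vectors(patients)
# Codes that appear in similar contexts end up with similar vectors:
print(cosine(vecs["dx:diabetes"], vecs["rx:metformin"]))
```

In the actual talk a trained word2vec model replaces these raw counts, but the intuition (context defines meaning, even for non-words) is the same.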
Dear Editor: I read your publication ethics issue on “bogus impact factors” with great interest (1). I would like to draw attention to a growing trend of manipulating citation counts. There are several ethical approaches to increasing the number of citations for a published paper (2). However, it is apparent that some manipulation of citation counts is occurring (3, 4). Self-citations, “those in which the authors cite their own works,” account for a significant portion of all citations (5). With the advent of information technology, it is easy to identify unusual citation trends for a paper or a journal. A web application to calculate the single-publication h-index based on (6) is available online (7, 8). A tool developed by Francisco Couto (9) can measure authors’ citation impact while excluding self-citations. Self-citation is ethical when it is a necessity. Nevertheless, there is a threshold for self-citations. Thomson Reuters’ Web of Science (WoS), which currently lists journal impact factors, considers self-citation acceptable up to a rate of 20%; anything over that is considered suspect (10). In some journals, even 5% is considered a high rate of self-citation. The ‘Journal Citation Report’ is a reliable source for checking the acceptable level of self-citation in any field of study. The Public Policy Group of the London School of Economics (LSE) published a handbook, “Maximizing the Impacts of Your Research,” describing self-citation rates across different groups of disciplines and indicating that they vary by up to 40% (11). Unfortunately, there is no significant penalty for the most frequent self-citers, and the effect of self-citation remains positive even at very high rates (5). However, WoS has dropped some journals from its database because of anomalous citation trends (4). The same policy should also be applied to the most frequent self-citers.
Publication ethics should be adhered to by all who wish to conduct research and publish their findings.
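As a concrete illustration of the 20% WoS threshold discussed above, flagging a suspect self-citation rate is simple arithmetic. The function names and citation counts below are invented for illustration and are not from any actual citation tool.

```python
def self_citation_rate(total_citations, self_citations):
    """Fraction of an author's or journal's citations that are self-citations."""
    if total_citations == 0:
        return 0.0
    return self_citations / total_citations

def flag(total, self_cites, threshold=0.20):
    """Apply a WoS-style rule: rates above the threshold are suspect."""
    rate = self_citation_rate(total, self_cites)
    return "suspect" if rate > threshold else "acceptable"

print(flag(150, 25))   # about 16.7% self-citations
print(flag(150, 40))   # about 26.7% self-citations
```

Some journals, as noted, would apply a much stricter threshold (e.g., 0.05) in the same calculation.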
Database of Rose Varieties, Eucarpia Leiden 2009, by renesmulders
A presentation on the use of microsatellite markers to genotype over 700 rose varieties for identification purposes, given at the 23rd International Eucarpia Symposium (Section Ornamentals) on “Colourful Breeding and Genetics” in Leiden, The Netherlands, September 2009. Published in Acta Horticulturae (ISHS) 836: 169-174 (2009).
Ontology-Driven Clinical Intelligence: Removing Data Barriers for Cross-Disci..., by Remedy Informatics
The presentation describes how Remedy Informatics advocates and innovates "flexible standardization" through an ontology-driven approach to clinical research. It shows in detail how a foundational, standardized Mosaic Ontology can be extended for specific research applications and further specialized for focused disease research.
Building Biomedical Knowledge Graphs for In-Silico Drug Discovery, by Vaticle
The rapid development and spread of analytical tools in the biomedical sciences has produced a wealth of information about biological components and their functions. Though important individually, their biological characteristics need to be understood in relation to their interactions with other biological components, which requires integrating vast amounts of complex, semantically rich, heterogeneous data.
Traditional systems are inadequate for accurately modelling and handling data at this scale and complexity, making solutions that speed up the integration and querying of such data a necessity.
In this talk, we present various approaches being used in organisations to build biomedical computational pipelines to address these problems using tools such as Machine Learning and TypeDB. In particular, we discuss how to create an accurate and scalable semantic representation of molecular level biomedical data by presenting examples from drug discovery, precision medicine and competitive intelligence.
Speaker: Tomás Sabat
Tomás is the Chief Operating Officer at Vaticle, dedicated to building a strongly-typed database for intelligent systems. He works directly with TypeDB's open source and enterprise users so they can fulfil their potential with TypeDB and change the world. He focuses mainly on life sciences, cyber security, finance and robotics.
Enabling the Computational Future of Biology.pdf, by Vaticle
Computational biology has revolutionised biomedicine. The volume of data it is generating is growing exponentially. This requires tools that enable computational and non-computational biologists to collaborate and derive meaningful insights. However, traditional systems are inadequate to accurately model and handle data at this scale and complexity.
In this talk, we discuss how TypeDB enables biologists to build a deeper understanding of life, and increase the probability of groundbreaking discoveries, across the life sciences.
Speaker: Tomás Sabat
Tomás is the Chief Operating Officer at Vaticle. He works closely with TypeDB's open source and enterprise users who use TypeDB to build applications in a wide number of industries including financial services, life sciences, cybersecurity and supply chain management. A graduate of the University of Cambridge, Tomás has spent the last seven years founding and building businesses in the technology industry.
This slide deck covers an introduction to using Excel in machine learning. Specifically, it reviews supervised learning with categorical data and continuous data.
Presented this slide deck to digital analytics professionals at Columbus Web Analytics' monthly networking event. Discussed how to interpret correlation values.
I presented this slide deck to the Columbus Web Analytics Wednesday group on 11.12.14. I created the presentation based on a journal article I published with David Taylor in the International Journal of Retail & Distribution Management.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23..., by John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2..., by pchutichetpong
M Capital Group (“MCG”) expects demand and the evolving supply picture to be shaped by institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as advancing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next four years.
Whilst competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has made key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and the shift to a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow more than 3.6x by value by 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany has an economic structure heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
5. Descriptive Statistics
Data level | Permissible operation | Statistics
Nominal | Description | Mode
Ordinal | Order | Median
Interval | Distance | Mean, Std Dev, Skew, Kurtosis
Ratio | Zero | Harmonic mean
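The table above can be read as: each measurement level unlocks certain statistics. A minimal sketch using Python's standard library (the sample data and variable names are ours, for illustration only):

```python
from statistics import mode, median, mean, stdev, harmonic_mean

colors = ["red", "blue", "red", "green"]    # nominal: mode only
print(mode(colors))

ranks = [1, 2, 2, 3, 5]                     # ordinal: median is meaningful
print(median(ranks))

temps_c = [20.1, 21.5, 19.8, 22.0]          # interval: mean and std dev apply
print(mean(temps_c), stdev(temps_c))

speeds = [40, 60]                           # ratio: harmonic mean needs a true zero
print(harmonic_mean(speeds))                # e.g., average speed over equal distances
```

Applying a statistic from a lower row to data from a higher row (say, a mean of ordinal ranks) produces a number but not a defensible one, which is the point of the table.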
16. Pairing Big Data in Bite Size Morsels
Dr. Michael A. Levin, Otterbein University, @MichaelALevin
Editor's Notes
Clean your data. All datasets need a good cleaning.
Experts can help you reduce columns or variables.
They can also help make sense of the data or the underlying model.
Look for lack of variance in your responses.
Look for ties. Too many ties create problems when running Spearman's rank rho.
Handle missing and inconsistent data.
Are you measuring the same variables or constructs or ideas twice?
Treat your large dataset as a known population and calculate a mean and a standard deviation for a variable of interest. Create a confidence interval or mean based sample size.
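The last note, treating the large dataset as a known population and sizing a sample from a confidence interval, follows the standard formula n = (z·σ/E)² for estimating a mean within margin E. A sketch under illustrative choices of z-value and margin (the population values are made up):

```python
from math import ceil
from statistics import mean, stdev

def sample_size(sigma, margin, z=1.96):
    """n = (z * sigma / E)^2, rounded up, for estimating a mean within +/- margin."""
    return ceil((z * sigma / margin) ** 2)

# Treat the full dataset as the population of interest:
population = [23, 25, 31, 28, 22, 27, 30, 26, 24, 29]
sigma = stdev(population)   # spread of the variable of interest
mu = mean(population)       # known "population" mean
# How many rows to sample to estimate the mean within +/- 1 unit at 95% confidence:
print(sample_size(sigma, margin=1.0))
```

On a genuinely large dataset the same two summary statistics drive the calculation; only sigma and the tolerable margin change.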