ODAM is an Experiment Data Table Management System (EDTMS) that gives you open access to your data and makes them ready to be mined - a data explorer comes as a bonus.
1. ODAM, Open Data for Access and Mining: an Experiment Data Table Management System (EDTMS)
Give open access to your data and make them ready to be mined - a data explorer as a bonus
Daniel Jacob
UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
May 2016
2. Daniel Jacob – INRA UMR 1332 – May 2016
The experimental context: needs / wishes
[Workflow diagram: seeding -> harvesting -> sample preparation -> sample analysis, with sample identifiers attached at each step; the experiment design and the experiment data tables are exposed through a web API]
1 - Identifiers centrally managed
2 - Data sharing & data availability: make both metadata and data available for data mining
3 - Facilitate the subsequent data mining: develop, if needed, lightweight tools - R scripts (Galaxy), lightweight GUI (R shiny)
ODAM, Open Data for Access and Mining (an EDTMS): the core idea in one shot
3. Daniel Jacob – INRA UMR 1332 – May 2016
Open Data for Access and Mining: the core idea in one shot
Implementation of an Experiment Data Tables Management System (EDTMS).
Data capture with minimal effort (PUT): merely dropping data files (the experiment data tables) into a data repository (e.g. a local NAS or a distant storage space mounted on myhost.org) should allow users to access them by web API (GET on http://myhost.org/).
Data can then be downloaded, explored and mined.
No database schema, no programming code and no additional configuration on the server side.
4. Daniel Jacob – INRA UMR 1332 – May 2016
Preparation and cleaning of the data subset files
Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv
• Whatever the kind of experiment, this assumes a design of experiment (DoE) involving individuals, samples or other things as the main objects of study (e.g. plants, tissues, bacteria, …)
• This also assumes the observation of dependent variables resulting from the effects of some controlled experimental factors.
• Moreover, the objects of study usually each have an identifier, and the variables can be quantitative or qualitative.
• We can have either a single object type of study or several kinds; in the latter case, a relationship must exist between the object types, which we assume to be of the "obtainedFrom" type.
5. Daniel Jacob – INRA UMR 1332 – May 2016
Classification of each column within its right category
Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv
Column categories: identifier, factor, quantitative, qualitative, link
Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).
• You have to organize your data subsets so that links can be established between them.
• In practice, this means adding a column containing the identifiers corresponding to the entity to which you want to connect the subset, implying an "obtainedFrom" relation.
• Note that this duplication of identifiers must be the only redundant information across all data subsets.
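To make the link column concrete, here is a minimal sketch of two linked subsets with hypothetical values (the Lot and SampleID column names follow the FRIM1 example used throughout this deck; HarvestDate and Tissue are invented for illustration). SampleID identifies each row of samples.tsv, while its Lot column duplicates the harvests.tsv identifier and thereby carries the "obtainedFrom" relation:

harvests.tsv (identifier: Lot):
Lot   HarvestDate   Treatment
1     2015-06-02    Control
2     2015-06-09    Water Stress

samples.tsv (identifier: SampleID; link column: Lot -> harvests):
SampleID   Lot   Tissue
365        1     fruit pericarp
366        2     fruit pericarp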
6. Daniel Jacob – INRA UMR 1332 – May 2016
Connections between the data subset files based on identifiers
Entities (concepts): Plants, Harvests, Samples, Compounds, Enzymes
Data subset files: plants.tsv -> harvests.tsv -> samples.tsv -> compounds.tsv and enzymes.tsv
Each subset carries the identifier of its central entity; a link between two subsets is carried out through identifiers (implying an "obtainedFrom" relation).
Column categories: identifier, factor, quantitative, qualitative, link
7. Daniel Jacob – INRA UMR 1332 – May 2016
Creation of the metadata files
In order to allow data to be explored and mined, we have to adjoin some minimal but relevant metadata. For that, two metadata files are required:
• s_subsets.tsv: associates with each data subset a key concept corresponding to the main entity of the subset, and records the "obtainedFrom" relations between these concepts
• a_attributes.tsv: allows each attribute (concept/variable) to be annotated with some minimal but relevant metadata
Note: data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values). TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas.
8. Daniel Jacob – INRA UMR 1332 – May 2016
Creation of the metadata files
s_subsets.tsv: this metadata file associates a key concept with each data subset file. For each subset it records:
• a unique rank number of the data subset
• the key concept (i.e. the main entity) associated with the subset, in the form of a short name
• the data file name
• the identifier of the central entity of the subset
• the link between two subsets, carried out through identifiers (implying an "obtainedFrom" relation)
For the example dataset:
Rank  Subset     Data file      Identifier / link column
1     Plants     plants.tsv     PlanteID (identifier)
2     Harvests   harvests.tsv   Lot (identifier; linked to Plants)
3     Samples    samples.tsv    SampleID (identifier; linked to Harvests)
4     Compounds  compounds.tsv  SampleID (link to Samples)
5     Enzymes    enzymes.tsv    SampleID (link to Samples)
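Rendered as an actual tab-separated file, the table above might look like the following sketch; the column headers here are illustrative only, since the deck does not spell out the exact s_subsets.tsv column names:

Rank  Subset     File           Identifier  ObtainedFrom
1     Plants     plants.tsv     PlanteID    -
2     Harvests   harvests.tsv   Lot         Plants
3     Samples    samples.tsv    SampleID    Harvests
4     Compounds  compounds.tsv  SampleID    Samples
5     Enzymes    enzymes.tsv    SampleID    Samples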
9. Daniel Jacob – INRA UMR 1332 – May 2016
Creation of the metadata files
a_attributes.tsv: this metadata file allows each attribute (variable) of each data subset (Plants, Harvests, Samples, Compounds, …) to be annotated with some minimal but relevant metadata, in particular its category: identifier, factor, quantitative or qualitative.
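A sketch of a few a_attributes.tsv rows (values are hypothetical; the 'entry' and 'category' columns are the ones the web API fields on slide 15 refer to, the other columns are illustrative):

Subset     Attribute  Entry      Category      Description
samples    SampleID   sampleid   identifier    unique sample identifier
samples    Treatment  treatment  factor        watering condition (Control / Water Stress)
compounds  Glucose    glucose    quantitative  glucose content of the sample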
10. Daniel Jacob – INRA UMR 1332 – May 2016
Updating the metadata files
Additional subsets/attributes can be added to s_subsets.tsv and a_attributes.tsv step by step, as soon as data are produced.
11. Daniel Jacob – INRA UMR 1332 – May 2016
Uploading your datasets to the data repository
Data capture with minimal effort (PUT): merely dropping data files on the data repository (e.g. a NAS mounted as Z: storage on myhost.org) should allow users to access them by web API (GET).
Your data subset files form a dataset entry (named 'frim1' as an example) within the data repository.
Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).
No database schema, no programming code and no additional configuration on the server side.
12. Daniel Jacob – INRA UMR 1332 – May 2016
Checking online whether your data subset files are consistent:
http://myhost.org/check/frim1
(myhost.org serves the data repository stored on the NAS)
Many checks can be run automatically for you.
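These checks can also be triggered from a script. A minimal sketch in R, assuming the /check/ endpoint returns a plain-text report (myhost.org and frim1 are the placeholder host and dataset names used on the slide):

# Request the consistency report for the 'frim1' dataset entry
# (assumes the endpoint answers with plain text, one message per line)
report <- readLines("http://myhost.org/check/frim1", warn = FALSE)
cat(report, sep = "\n")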
13. Daniel Jacob – INRA UMR 1332 – May 2016
Open Data, Access and Mining: web API
Data storage covers the whole experimental workflow: seeding, harvesting, sample preparation, sample analysis.
Minimal effort (PUT), maximal efficiency (GET). After depositing your complete dataset as described previously (TSV format, data linking, preparation and cleaning of the data subset files, check):
• An open access is given to your data through the web API
• They are ready to be mined
• No specific code or additional configuration is needed
Example dataset: FRIM1 (*)
(*) https://www.erasysbio.net/index.php?index=266
14. Daniel Jacob – INRA UMR 1332 – May 2016
Open Data, Access and Mining: web API
REST services: a hierarchical tree of resource naming (URL) for retrieving both data and metadata:
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
The tree branches into <data format> (xml/tsv/json), then <dataset name> (e.g. frim1), then <subset> or (<subset>), then <entry> or <category> (one of the categories: identifier, factor, quantitative, qualitative, link), and finally <value>.
Example dataset: FRIM1 (*)
(*) https://doi.org/10.5281/zenodo.154041
15. Daniel Jacob – INRA UMR 1332 – May 2016
Open Data, Access and Mining: web API
REST services: hierarchical tree of resource naming (URL)
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
Field / Description / Examples:
• <data format> - format of the retrieved data; possible values are 'xml' or 'csv' (example: xml)
• <dataset name> - short name (tag) of your dataset (example: frim1)
• <subset> - short name of a data subset (example: samples)
• <entry> - name of an attribute entry, defined by the user in the a_attributes file, column 'entry' (example: sampleid)
• <category> - name of the attribute category, assigned by the user in the a_attributes file, column 'category'; possible values are 'identifier', 'factor', 'qualitative', 'quantitative' (example: quantitative)
• (<subset>) - set of data subsets obtained by merging all the subsets with a lower rank than the specified subset, following the pathway defined by the "obtainedFrom" links (example: (samples) = plants + harvests + samples)
• <value> - exact value of the desired entry or category (examples: 1, factor)
16. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Open Data, Access and Mining: web API
REST services: hierarchical tree of resource naming (URL)
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
• Get the subset list of a dataset:
http://myhost.org/getdata/<data format>/<dataset name>
• Get all values within a data subset:
http://myhost.org/getdata/<data format>/<dataset name>/<subset>
• Get values within a data subset for a specific value of an entry:
http://myhost.org/getdata/<data format>/<dataset name>/<subset>/<entry>/<value>
• Get all values within a set of data subsets:
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)
• Get values within a set of data subsets for a specific value of an entry:
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<entry>/<value>
• Get the attribute list within a set of data subsets for a specific category:
http://myhost.org/getdata/<data format>/<dataset name>/(<subset>)/<category>
A sketch of how these patterns can be queried from R follows below.
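Because every pattern above resolves to a plain GET URL, any HTTP-capable client can use them. A minimal sketch in R based on read.delim and the 'tsv' format (the helper name odam_get and the choice of tsv are assumptions for illustration):

# Hypothetical helper: build a getdata URL from its path parts
# and read the tab-separated response into a data frame.
odam_get <- function(host, dataset, ...) {
  path <- paste(c("getdata", "tsv", dataset, ...), collapse = "/")
  read.delim(paste0("http://", host, "/", path), stringsAsFactors = FALSE)
}
# e.g. all values within the 'samples' data subset of 'frim1':
samples <- odam_get("myhost.org", "frim1", "samples")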
17. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Open Data Access via web API: examples based on FRIM1
Retrieving metadata and data:
http://myhost.org/getdata/xml/frim1
http://myhost.org/getdata/xml/frim1/plants
http://myhost.org/getdata/xml/frim1/harvests/lot/1
http://myhost.org/getdata/xml/frim1/(compounds)/quantitative
18. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Open Data Access via web API: examples based on FRIM1
http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control
(samples) = plants + harvests + samples: a set of data subsets obtained by merging all the subsets with a lower rank than the specified subset, following the pathway defined by the "obtainedFrom" links. An R sketch of this query follows below.
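A minimal sketch of the same query in R (using the tsv format instead of xml so the response loads directly into a data frame; this substitution is an assumption for illustration):

# Get all 'Control' samples across the merged plants + harvests +
# samples subsets; parentheses are valid URL characters.
url <- "http://myhost.org/getdata/tsv/frim1/(samples)/treatment/Control"
control <- read.delim(url, stringsAsFactors = FALSE)
str(control)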
19. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Open Data Access via web API: application layer
minimal effort, maximal efficiency
Use existing tools:
- Spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …
20. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Open Data Access via web API: application layer
Retrieving data within R: the R package Rodam
21. Daniel Jacob – INRA UMR 1332 –May 2016
Open Data Access via web API: Rodam package
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
<data format> = tsv, <dataset name> = frim1, (<subset>) = (samples), <entry> = sample, <value> = 365
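The request shown above returns plain tab-separated text, so it can be reproduced in base R without any dedicated package (Rodam wraps such calls; reading the URL directly with read.delim is shown here as an illustration):

# Fetch sample 365 from the merged (samples) view of 'frim1'
# and load the tab-separated response into a data frame.
url <- "http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365"
sample365 <- read.delim(url, stringsAsFactors = FALSE)
print(sample365)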
22. Daniel Jacob – INRA UMR 1332 –May 2016
Open Data Access via web API: Rodam package
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor
<data format> = tsv, <dataset name> = frim1, (<subset>) = (activome), <category> = factor
Read the metadata (i.e. the category types within the data), then get the data subset 'activome' along with its metadata. A sketch follows below.
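A minimal sketch of both steps in base R (reading the factor metadata, then the merged data subset itself; using read.delim on the tsv format is an assumption rather than the Rodam API):

# 1) Which attributes of the merged (activome) view are experimental factors?
factors <- read.delim(
  "http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor",
  stringsAsFactors = FALSE)
# 2) Retrieve the full merged (activome) data subset with those factors.
activome <- read.delim(
  "http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)",
  stringsAsFactors = FALSE)
str(factors); dim(activome)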
23. Daniel Jacob – INRA UMR 1332 –May 2016
Open Data Access via web API: Rodam package
24. Daniel Jacob – INRA UMR 1332 –May 2016
Open Data Access via web API: Rodam package
Experimentation / analysis → data / metadata → data mining?
Make both metadata and data available for data mining (MFA, rCCA, pLDA, …).
ODAM facilitates the subsequent data mining.
[Figure: 'activome' and 'qNMR_metabo' data (log10 transformed) across all developmental stages and all treatments (Control, Water Stress).]
An MFA sketch on such data follows below.
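As an illustration of the data-mining step, a minimal MFA sketch with the FactoMineR package; the two blocks stand in for the 'activome' and 'qNMR_metabo' subsets, and the synthetic data, sizes, and column names are assumptions (real tables would first be aligned on their common sample identifiers):

# Hypothetical illustration: multiple factor analysis on two blocks
# of variables measured on the same 20 samples.
library(FactoMineR)
set.seed(1)
activome_vars <- as.data.frame(matrix(rnorm(20 * 5), 20, 5,
                   dimnames = list(NULL, paste0("enz", 1:5))))  # stand-in enzyme activities
qnmr_vars     <- as.data.frame(matrix(rnorm(20 * 8), 20, 8,
                   dimnames = list(NULL, paste0("met", 1:8))))  # stand-in NMR metabolites
res <- MFA(cbind(activome_vars, qnmr_vars),
           group = c(5, 8),        # number of variables in each block
           type  = c("s", "s"),    # both blocks: scaled quantitative
           name.group = c("activome", "qNMR_metabo"))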
25. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Open Data Access via web API: application layer
minimal effort, maximal efficiency
Use existing tools:
- Spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …
Develop, if needed, lightweight tools:
- R scripts (Galaxy), lightweight GUIs (R shiny)
26. Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
27. Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
28. Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
29. Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
To remove an item from the selection: (i) click on it, then (ii) press the 'Suppr' (Delete) key.
30. Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
31. Daniel Jacob – INRA UMR 1332 –May 2016
FRIM - Fruit Integrative Modelling
EDTMS
ODAM
Explore several possibilities by interacting with the graph.
32. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
To summarize
1. Preparation and cleaning of the data subset files
2. Classification of each column within its right category
3. Connection of the dataset files based on identifiers
4. Creation of the definition files, namely s_subsets.tsv and a_attributes.tsv
5. Deposit of the dataset files in the data repository
6. Online check that your data subset files are consistent
7. Online test of the web services on your dataset
8. Use of the web API through an application layer (R scripts, data explorer, …)
Data subset files and their associated metadata files must comply with the TSV format (Tab-Separated Values); an illustrative metadata file sketch follows below.
Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas (see https://en.wikipedia.org/wiki/Tab-separated_values).
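For illustration only, a minimal a_attributes.tsv sketch: the 'entry' and 'category' columns come from the field table earlier in this document, while the 'subset' column and all the values are assumptions made up for the example (columns are tab-separated):

subset	entry	category
samples	sampleid	identifier
samples	treatment	factor
activome	enzyme1	quantitative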
33. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Advantages of this approach
Data sharing & data availability
- The "plants" array can be created even before the seeds are planted.
- Similarly, the "harvests" array can be created as soon as the harvests are done, before any analysis.
- Thus, these arrays are generated only once in the project, and sharing can be set up as soon as the seeds are planted. Each analysis then complements the dataset as soon as it produces its own data subset.
- Data are accessible to everyone as soon as they are produced.
Identifiers centrally managed
- Data are archived and compiled, so no laborious investigation is needed to find out who holds the right identifiers.
[Diagram: sample identifiers flow through the whole workflow — seeding → harvesting → samples preparation → samples analysis.]
34. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Advantages of this approach
Facilitate the subsequent publication of data
- Data are already readily available online via the web API.
- Nothing prevents taking these data to populate existing databases, adjoining more elaborate annotations.
- Neither administrator privileges nor any programming skills are required.
Data capture: minimal effort (PUT). Data analysis/mining: maximum efficiency (GET).
35. Daniel Jacob – INRA UMR 1332 –May 2016
EDTMS
ODAM
Advantages of this approach: minimal effort, maximum efficiency
Format the data
- Based on TSV: the choice to keep the scientists' good old way of using worksheets, thus (i) using the same tool for both data files and metadata definition files, and (ii) requiring no programming skills.
Give access through a web services layer
- Based on current standards (REST).
Use existing tools
- Spreadsheets, RStudio, BioStatFlow (biostatflow.org), Galaxy, Cytoscape, …
Develop, if needed, lightweight tools
- R scripts, lightweight GUIs (R shiny)
36. Daniel Jacob – INRA UMR 1332 –May 2016
Have fun!
Daniel Jacob
UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
May 2016
Open Data for Access and Mining
https://hub.docker.com/r/odam/getdata/
https://github.com/INRA/ODAM
https://cran.r-project.org/package=Rodam
https://zenodo.org/record/154041
An online example: http://www.bordeaux.inra.fr/pmb/dataexplorer/