More ways of symbol grounding for knowledge graphs? – Paul Groth
This document discusses various ways to ground the symbols used in knowledge graphs. It describes the traditional "symbol grounding problem" where symbols are defined based only on other symbols. It then outlines several approaches to grounding symbols in non-symbolic ways, such as by linking them to perceptual modalities like images, audio, and simulation. It also discusses grounding symbols via embeddings, relationships to physical entities, and operational semantics. The document argues that richer grounding could help integrate these notions and enhance interoperability, exchange, identity, and reasoning over knowledge graphs.
This document discusses issues, opportunities, and challenges related to big data. It provides an overview of big data characteristics like volume, variety, velocity, and veracity. It also describes Hadoop and HDFS for distributed storage and processing of big data. The document outlines issues in big data like storage, management, and processing challenges due to scale. Opportunities in big data analytics are also presented. Finally, challenges like heterogeneity, scale, timeliness, and ownership are discussed along with approaches like Hadoop, Spark, NoSQL databases, and Presto for tackling big data problems.
This presentation sets out some of the challenges around citing and identifying datasets and introduces DataCite, the international data citation initiative. DataCite was founded on 1 December 2009 to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence.
This presentation was given by Adam Farquhar at the STM Publishers Association Innovation Conference on 4-Dec-2009.
DataCite – Bridging the gap and helping to find, access and reuse data – Herbert Gruttemeier, INIST-CNRS (OpenAIRE)
OpenAIRE Interoperability Workshop (8 Feb. 2013).
EUDAT Service Suite Overview - EUDAT Summer School (Shaun de Witt, CCFE) – EUDAT
Shaun will give an overview of the EUDAT service suite, explaining the key function and role of the different B2 services and how they interconnect. Examples will be given of how each service has been used by communities, to explain the nature and scale of the service provision. We show how the B2 Service Suite can be linked to the data lifecycle and the role of each component in data management planning. By the end of this talk, users should have a good overview of each of the B2 services, how they do (or will) fit together, and how they can be used as part of a coherent data management plan.
Visit https://eudat.eu/eudat-summer-school
Big data Mining Using Very-Large-Scale Data Processing Platforms – IJERA Editor
Big Data consists of large-volume, complex, growing data sets with multiple, heterogeneous sources. With the tremendous development of networking, data storage, and data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological and biomedical sciences. MapReduce is a programming model with the parallel processing ability needed to analyse data at this scale: it allows easy development of scalable parallel applications that process big data on large clusters of commodity machines. Google's MapReduce, or its open-source equivalent Hadoop, is a powerful tool for building such applications.
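As a concrete illustration of the programming model described above, here is a minimal, self-contained Python sketch of the map, shuffle, and reduce phases for a word count, the canonical MapReduce example. It runs in a single process; a framework such as Hadoop distributes the same three phases across a cluster. The documents are invented for illustration.

```python
from collections import defaultdict
from itertools import chain

documents = [
    "big data needs big tools",
    "map reduce maps then reduces",
]

# Map phase: emit (word, 1) pairs for every word in every document.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group values by key, as the framework would do across nodes.
groups = defaultdict(list)
for key, value in chain.from_iterable(map_phase(d) for d in documents):
    groups[key].append(value)

# Reduce phase: aggregate the grouped values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # e.g. {'big': 2, 'data': 1, ...}
```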
CyVerse Austria provides computational infrastructure and data management services through open source code from CyVerse US. It is a cooperation initiative between the University of Graz, Medical University of Graz, and Graz University of Technology to conduct mutual research and bundle existing resources in molecular biomedicine, neurosciences, pharmaceutical/medical technology, biotechnology, and quantitative biomedicine/modeling. CyVerse was originally developed in 2008 by the University of Arizona and receives funding from the National Science Foundation, maintaining over 50,000 active users from thousands of academic institutions.
Today libraries face new and growing challenges when enabling access to information. The growing amount of information, combined with new non-textual media types, demands constant adaptation of established workflows and standard definitions. Knowledge, as published through scientific literature, is the last step in a process originating from primary scientific data. These data are analysed, synthesised and interpreted, and the outcome of this process is published as a scientific article. Access to the original data as the foundation of knowledge has become an important issue throughout the world, and different projects have started to find solutions.
Nevertheless, science itself is international: scientists are involved in global unions and projects, they share their scientific information with colleagues all over the world, and they use national as well as foreign information providers.
When facing the challenge of increasing access to research data, a possible approach is global cooperation for data access via national representatives:
* a global cooperation, because scientists work globally, scientific data are created and accessed globally.
* with national representatives, because most scientists are embedded in their national funding structures and research organisations.
DataCite was officially launched on December 1st 2009 in London and has 12 information institutions and libraries from nine countries as members. By assigning DOI names to data sets, data becomes citable and can easily be linked to from scientific publications.
Data integration with text is an important aspect of scientific collaboration. DataCite takes global leadership in promoting the use of persistent identifiers for datasets, to satisfy the needs of scientists. Through its members, it establishes and promotes common methods, best practices, and guidance. The member organisations work independently with data centres and other holders of research data sets in their own domains. Based on the work of the German National Library of Science and Technology (TIB) as the first DOI registration agency for data, DataCite has registered over 850,000 research objects with DOI names, thus starting to bridge the gap between data centres, publishers and libraries.
This presentation will introduce the work of DataCite and give examples how scientific data can be included in library catalogues and linked to from scholarly publications.
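As a small, hedged illustration of how a DOI makes data citable: the sketch below uses DOI content negotiation, a documented feature of the doi.org resolver (see citation.crosscite.org), to turn a DOI into a formatted citation string. The DOI used is the NIST report cited later on this page; style support can vary by registration agency.

```python
# pip install requests
import requests

doi = "10.6028/NIST.SP.1500-1"  # a DOI that appears later on this page

# Ask the DOI resolver for a formatted citation instead of a redirect
# to the landing page.
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "text/x-bibliography; style=apa"},
    timeout=30,
)
resp.raise_for_status()
print(resp.text)  # an APA-style citation string
```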
David Carter publications and associated research grants 2015 – Dave Carter
David Carter is an E.C. researcher at the Digital Learning Research Group in Plymouth. He has published several papers on topics like modeling police futures, constructing management simulation models, and using groups to support judgment parameter estimation. He has also received research grants totaling over £3 million from various sources like the ESRC, Joint Opto-Electronic Research Scheme, and the Home Office to lead projects on developing composite structures with optical fibers, contributing to a polymer optical fiber project, coordinating technology improvements for a command and staff trainer system, providing analytical support for business process development in policing, and contributing to an online role play trainer as a non-delivery partner.
The document discusses the need for an NIH Data Commons to address challenges with data sharing and storage. It describes how factors like increasing data volumes, availability of cloud technologies, and emphasis on FAIR data principles are driving the need for a centralized data platform. The proposed NIH Data Commons would provide findable, accessible, interoperable and reusable data through cloud-based services and tools. It would enable data-driven science by facilitating discovery, access and analysis of biomedical data across different sources. Plans are outlined to develop and test an initial Data Commons pilot using existing genomic and other biomedical datasets.
The Data Lifecycle - EUDAT Summer School (Yann Le Franc) – EUDAT
Yann will introduce the notion of data life cycles (DLCs) as an overarching framework for the workshop. This presentation will explain the key activities and roles identified by EUDAT and undertaken by researchers and data service providers in the process of creating, analysing, managing, sharing and archiving research data. It will highlight how the EUDAT service suite addresses this data lifecycle to support researchers with their key data requirements. He will then present the current research work undertaken in EUDAT to model community-specific DLCs, their relation to the concept of provenance, and the prototype services currently being developed to bridge the identified gaps in DLC coverage.
Visit https://eudat.eu/eudat-summer-school
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S... – Edward Curry
Digital transformation is driving a new wave of large-scale datafication in every aspect of our world. Today our society creates data ecosystems where data moves among actors within complex information supply chains that can form around an organization, community, sector, or smart environment. These ecosystems of data can be exploited to transform our world and present new challenges and opportunities in the design of intelligent systems. This talk presents my recent work on using the dataspace paradigm as a best-effort approach to data management within data ecosystems. The talk explores the theoretical foundations and principles of dataspaces and details a set of specialized best-effort techniques and models to enable loose administrative proximity and semantic integration of heterogeneous data sources. Finally, I share my perspectives on future dataspace research challenges, including multimedia data, data governance and the role of dataspaces to enable large-scale data sharing within Europe to power data-driven AI.
The Australian National Data Service (ANDS) aims to establish an Australian Research Data Commons by providing services for researchers to manage and share research data. Key services discussed include Identify My Data, which allows researchers to allocate persistent identifiers to datasets, and Register My Data, which registers public descriptions of data collections. ANDS uses persistent identifiers and the Handle system to ensure datasets remain identifiable even if their location changes, and is joining DataCite to offer digital object identifiers for published data.
This document provides an introduction to a course on big data analytics. It discusses the characteristics of big data, including large scale, variety of data types and formats, and fast data generation speeds. It defines big data as data that requires new techniques to manage and analyze due to its scale, diversity and complexity. The document outlines some of the key challenges in handling big data and introduces Hadoop and MapReduce as technologies for managing large datasets in a scalable way. It provides an overview of what topics will be covered in the course, including programming models for Hadoop, analytics tools, and state-of-the-art research on big data technologies and optimizations.
Big Stream Processing Systems, Big Graphs – Petr Novotný
Big Data is a recent phenomenon. Everyone talks about it, but do you really know what Big Data is? Join our four-part series about Big Data and you will get answers to your questions!
We will cover Introduction to Big Data and available platforms which we can use to deal with Big Data. And in the end, we are going to give you an insight into the possible future of dealing with Big Data.
After the two previous episodes you know the basics about Big Data. Yet it can get more complicated than that, usually when you have to deal with data that is generated in real time. In that case, you are dealing with a Big Stream.
This episode of our series will be focused on processing systems capable of dealing with Big Streams. But analysing data without a graphical representation will not be very convenient for us, and this is where we have to use a platform capable of visualising Big Graphs. All these topics will be covered in today's presentation.
#CHEDTEB
www.chedteb.eu
Data repositories are the core components of an Open Data Ecosystem. To gain a comprehensive model of the data ecosystem supporting tools and services, FAIR principles, joint storage of open data and clinical data and the integration of analysis tools should be considered. The aim was to create a data ecosystem model suitable for the sharing of open data together with sensitive data. For this purpose several tools and services were included in our data ecosystem model: Research Data Marts, I2b2 / tranSMART, CKAN, Dataverse, figshare, OSF (Open Science Framework), ... This multitude of services supports research data repositories. Different types of repositories are connected and supplement each other in the storage, release and sharing of data with different degrees of protection and data ownership. Tools to analyze, browse and visualize data are integrated in the data flow between repositories. Results of our ecosystem analysis:
It doesn't matter where one stores data, because everything is connected for data sharing: institutional repositories with dataverses, data marts, general repositories, domain-specific repositories, figshare, etc. Data governance and privacy protection are integrated at the early stage of data generation.
This document discusses data, data curation, and data visualization. It begins by providing background on the speaker and their experience. It then covers topics like how much data is generated daily on the internet, by organizations like Twitter and Facebook. It discusses what data is, challenges of data curation, and tools that can be used for data curation. It also touches on semantic web, open data protocols, and examples of great data visualization. It emphasizes thinking about how to best share and visualize data for users in an understandable way.
This document defines big data and discusses techniques for integrating large and complex datasets. It describes big data as collections that are too large for traditional database tools to handle. It outlines the "3Vs" of big data: volume, velocity, and variety. It also discusses challenges like heterogeneous structures, dynamic and continuous changes to data sources. The document summarizes techniques for big data integration including schema mapping, record linkage, data fusion, MapReduce, and adaptive blocking that help address these challenges at scale.
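To make one of these techniques concrete, here is a toy Python sketch of blocking for record linkage: rather than comparing every pair of records, which is quadratic in the dataset size, records are grouped by a cheap blocking key and only pairs within the same block are compared. The records and the key choice are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Smith, John"},
    {"id": 2, "name": "Smyth, Jon"},
    {"id": 3, "name": "Brown, Ann"},
]

# Blocking key: the first letter of the surname. Real systems use richer
# keys (phonetic codes, sorted n-grams) or learn them adaptively.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["name"][0].lower()].append(rec)

# Only records sharing a block become candidate pairs for comparison.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2)]: only the plausible match is compared
```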
This document discusses big data and NoSQL databases. It defines big data as data with high volume, velocity, and variety that is difficult for traditional databases to handle. NoSQL databases are presented as an alternative designed for big data by allowing flexible schemas and easy scaling across data centers. The document uses Apache Cassandra as an example of a NoSQL database that can serve as a primary data store, handle real-time and batch analytics, and accommodate structured and unstructured data.
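As a hedged sketch of the Cassandra usage described above, the snippet below uses the DataStax Python driver (cassandra-driver) to create a keyspace and a table and to run simple writes and reads. The contact point, keyspace, and schema are placeholders; a real deployment would pick replication settings to match its data centers.

```python
# pip install cassandra-driver
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text,
        ts timestamp,
        value double,
        PRIMARY KEY (sensor_id, ts)
    )
""")

# Writes and reads go through the same session; the driver routes requests.
session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
    ("sensor-1", datetime.now(timezone.utc), 21.5),
)
for row in session.execute(
    "SELECT ts, value FROM demo.readings WHERE sensor_id = %s", ("sensor-1",)
):
    print(row.ts, row.value)

cluster.shutdown()
```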
1) Data life cycles describe the stages data passes through from creation to obsolescence, including creating, processing, analyzing, preserving, accessing, and reusing data.
2) The document proposes modeling data life cycles and their relations to EUDAT services using the W3C PROV standard to track provenance.
3) A proof-of-concept service is being built to allow graphical representation of data life cycles, create life cycle plans and templates, and capture provenance during execution by filling templates.
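A minimal sketch of this PROV modelling idea using the Python prov package: a cleaning activity uses a raw dataset entity and generates a cleaned one, mirroring a single step of a data life cycle. The namespace and names are hypothetical, and the EUDAT proof-of-concept service itself is not reproduced here.

```python
# pip install prov
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")  # hypothetical namespace

# One life-cycle step: a cleaning activity uses the raw dataset
# and generates the cleaned dataset.
raw = doc.entity("ex:raw-dataset")
clean = doc.entity("ex:cleaned-dataset")
cleaning = doc.activity("ex:data-cleaning")

doc.used(cleaning, raw)
doc.wasGeneratedBy(clean, cleaning)
doc.wasDerivedFrom(clean, raw)

print(doc.get_provn())  # serialise the provenance graph in PROV-N notation
```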
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA) – EUDAT
EUDAT and PRACE joined forces to help research communities gain access to high-quality managed e-Infrastructures whose resources can be connected together to enable cross-utilization use cases and make them accessible without any technical barrier. The capability to couple data and compute resources together is considered one of the key factors to accelerate scientific innovation and advance research frontiers. The goal of this session was to present the EUDAT services and the results of the collaboration activity achieved so far, and to deliver a hands-on session on how to write a Data Management Plan, or DMP. The DMP is a useful instrument for researchers to reflect on and communicate about the way they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
Visit: https://www.eudat.eu/eudat-summer-school
How cloud computing can accelerate your research. Presentation given at Moscow State University on 19th May 2015.
Apply for Azure for Research Awards at http://research.microsoft.com/en-US/projects/azure/awards.aspx
1. Big Data solutions are useful for web analytics problems that can be parallelized, but may not be as effective for more complex computations.
2. When starting with a new Big Data system, businesses should determine if it can do what existing solutions cannot, if existing solutions can be improved, and if it can integrate with current systems.
3. Context is important when choosing a Big Data solution, as different business needs may require different approaches.
Research data sharing enables validation and new analyses of results, ensures efficient use of public funds, and counters misconduct. Funding agencies can encourage open data practices by requiring long-term storage, promoting data publication, and helping make data findable through catalogs. They should work with research communities to understand infrastructure needs, partner with libraries on preservation, and consider discipline-specific approaches rather than one-size-fits-all solutions.
eROSA Stakeholder WS1: Big Data and Open Science in agricultural and environm... – e-ROSA
This document discusses challenges and opportunities around big data and open science in agricultural and environmental research. It provides a historic perspective on the evolution of data and modeling capabilities over time. While new technologies promise to make data access and analysis easier, realities often involve continuing to use existing approaches and hybrid solutions. The document recommends a focus on improving methodologies to semantically link diverse data sources. Adopting open science practices will require changes to research culture as well as technologies. The workshop aims to discuss needed services and integration across generic and domain-specific research infrastructures to advance open science in agriculture.
Data Science and AI in Biomedicine: The World has Changed – Philip Bourne
This document discusses the changing landscape of data science and AI in biomedicine. Some key points:
- We are at a tipping point where data science is becoming a driver of biomedical research rather than just a tool. Biomedical researchers need to become data scientists.
- Data science is interdisciplinary and touches every field due to the rise of digital data. It requires openness, translation of findings, and consideration of responsibilities like algorithmic bias.
- Advances like AlphaFold2 show the power of large collaborative efforts combining data, computing resources, engineering, and domain expertise. This points to the need for public-private partnerships and new models of open data sharing.
- The definition of
Mapping (big) data science (15 Dec 2014) – 대학(원)생 Han Woo PARK
This document discusses big data mapping and issues. It begins with definitions and characteristics of big data, including volume, velocity, variety, variability and complexity. It then covers the background of data science and trends in big data research and development. Finally, it addresses social issues and implications related to big data, including potential divides between developed and developing countries, academic and commercial researchers, and those with and without computational skills.
New forms of data for the social sciences: Smarter cities, more efficient organisations, and healthier communities. Wednesday 3rd November 2015, UCL, London, United Kingdom
UK e-Infrastructure: Widening Access, Increasing Participation – Neil Chue Hong
A talk given at the ICHEC Annual Seminar by Neil Chue Hong, reflecting on the rise of Grid and Web 2.0, and how this might enable increased participation and use of computing infrastructure for e-Science and research.
This document discusses managing research data for open science based on the UK experience. It outlines key aspects of open science such as making research more open, global, collaborative and closer to society. The document discusses mandates for open research data from funding bodies in the UK and EU, including stipulations in Horizon 2020 and requirements from EPSRC. It defines what constitutes research data and examines challenges around research data management, including technology issues, people issues, policy issues and resources. The importance of data skills training for researchers and data professionals is also covered.
This document outlines a data science enablement roadmap created by the Advanced Center of Excellence at Modern Renaissance Corporation. The roadmap consists of 1 introductory course and 3 advanced courses that can earn a student a master's level certificate in data science. The introductory course provides a broad overview of topics like algorithms, statistics, machine learning, and big data platforms. The advanced courses focus on specific skills like machine learning with R, modern data platforms using Hadoop, and advanced big data analytics techniques. The goal is to give students a versatile, practical skill set for a career in data science or big data engineering.
Big data and the dark arts - Jisc Digital Media 2015 – Jisc
There still remains a certain misunderstanding of the very definition of "big data" and perceived hype around the term. This workshop clarified the concepts and gave examples of relevant big data projects.
Big data provides opportunities for social science research by enabling new ways to answer existing questions and allowing entirely new questions to be asked. Large and diverse datasets can be analyzed from various sources like social media, sensors, and citizen science. This allows researchers to study big populations and questions in real time. Challenges include interdisciplinary collaboration, ensuring data and tools are open and reusable, and developing infrastructure to support analysis of large and diverse datasets.
PDT: Personal Data from Things, and its provenance – Paolo Missier
This document discusses various aspects of the Internet of Things (IoT), including potential architectures and stacks, connectivity and evolution. It examines use cases at different scales, from individual sensors to smart cities. The role of metadata and data provenance is explored for IoT applications involving science, personal data from sensors, and devices that make autonomous decisions. Issues of data ownership, privacy and user control are important considerations for personal data generated by IoT devices. The relationship between IoT and machine-to-machine communication is also briefly discussed.
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a... – University of Bologna
The volume, variety, and high availability of data backing decision support systems have impacted on business intelligence, the discipline providing strategies to transform raw data into decision-making insights. Such transformation is usually abstracted in the “knowledge pyramid,” where data collected from the real world are processed into meaningful patterns. In this context, volume, variety, and data availability have opened for challenges in augmenting the knowledge pyramid. On the one hand, the volume and variety of unconventional data (i.e., unstructured non-relational data generated by heterogeneous sources such as sensor networks) demand novel and type-specific data management, integration, and analysis techniques. On the other hand, the high availability of unconventional data is increasingly attracting data scientists with high competence in the business domain but low competence in computer science and data engineering; enabling effective participation requires the investigation of new paradigms to drive and ease knowledge extraction. The goal of this thesis is to augment the knowledge pyramid from two points of view, namely, by including unconventional data and by providing advanced analytics. As to unconventional data, we focus on mobility data and on the privacy issues related to them by providing (de-)anonymization models. As to analytics, we introduce a higher abstraction level than writing formal queries. Specifically, we design advanced techniques that allow data scientists to explore data either by expressing intentions or by interacting with smart assistants in hand-free scenarios.
e-infrastructures supporting open knowledge circulation - OpenAIRE France – Jean-François Lutz
This document discusses e-infrastructures that support open access to scientific knowledge and data. It notes that science is becoming more collaborative globally and data-driven. E-infrastructures provide crucial enabling technologies for open data sharing, scientific workflows, and virtual collaborations. Future steps include further promoting open access policies and ensuring the long-term preservation and reuse of publicly-funded research outputs and data.
[DSC Croatia 22] Writing scientific papers about data science projects - Mirj... – DataScienceConferenc1
Data science is not only about numbers and how to crunch them; it is also about how to communicate project results to various audiences. Scientific journals and conferences are an excellent venue for reaching a wider audience and gathering valuable comments. The talk will answer the questions: How do you structure a scientific paper in data science? What are the relevant venues for showcasing your work to gain the most relevant reach? To demystify the process of scientific writing, a case study will be presented. Messy process: the story of the birth of one data science paper.
Research Methodology (how to choose Datasets).pptx – Zainab Alhassani
This document provides summaries of several freely available datasets and data repositories for researchers. It describes BuzzFeed News, which shares datasets, analysis, tools and guides used in its articles on GitHub. It also describes Metatext, which aims to democratize access to AI through curated datasets for classification tasks. Papers with Code is described as sharing machine learning papers, code, datasets and evaluation tables to support NLP and ML. Datahub.io focuses on stock market and property data that is frequently updated. Finally, Google Dataset Search is presented as a search engine for datasets to make them universally accessible.
The document describes a vision for the future of research in 2020 and 2030. By 2020, research is increasingly global and interdisciplinary, with open access the default. Peer review remains important but is supplemented by new methods. Openness and collaboration are recognized as critical. By 2030, digital technologies complement researchers and process vast amounts of data. Discovery tools understand relationships between datasets. Researchers pose questions to digital assistants which mine information to provide answers.
Can’t Pay, Won’t Pay, Don’t Pay: Delivering open science, a Digital Research... – Carole Goble
Invited talk, PHIL_OS, March 30-31 2023, Exeter
https://opensciencestudies.eu/whither-open-science. Includes hidden slides.
FAIR and Open Science needs Digital Research Infrastructure, which is a federated system of systems and needs funding models that are fit for purpose
Culture change needed for paying for Open Science’s infrastructure and funding support for data driven research needs more reality and less rhetoric
Data science is an interdisciplinary field that uses scientific methods to extract knowledge and insights from data. It unifies statistics, data analysis, machine learning and related methods. Data science is the future of artificial intelligence and can add value to businesses by turning ideas seen in movies into reality. It involves working with large data sets and machine learning. Data science is primarily used for decisions, predictions, and machine learning by uncovering findings from data. Data science and technology delivers methods for solving data-intensive problems ranging from research to software deployment. Feature engineering is selecting or generating useful columns for modeling. Data cleaning takes up most of a data scientist's time along with exploratory analysis, visualization, machine learning, and communication. Data science education
Similar to Infraestructuras data science_portugal_ipca_industry_4.0_v2
HPC on Cloud for SMEs. The case of bolt tightening. – Andrés Gómez
This document discusses using high performance computing (HPC) resources in the cloud to help small and medium enterprises (SMEs) perform simulations. It describes a case study where HPC resources were used to simulate the bolt tightening process for an SME called Texas Controls. The simulations used Code_Aster software to model the materials, design, sequence and tightening parameters. A Taguchi method was employed to automatically generate 16 parametric simulation jobs. Results were analyzed to determine the optimal tightening strategy. Remote visualization and a graphical user interface were provided to make the HPC resources accessible to the SME. The model was also validated against real sensor data to verify accuracy.
A Web-platform for radiotherapy, a new workflow concept and an information sh... – Andrés Gómez
The ARTFIBio project has the objective of creating an information network to develop predictive individualized models of tumor response to radiotherapy, able to define more effective adaptive treatments.
This presentation shows the web interface that has been developed within the ARTFIBio project to share information among the participants in the project and, in the future, among other researchers in the radiotherapy area.
More info: artfibio@cesga.es
Federated HPC Clouds Applied to Radiation Therapy – Andrés Gómez
Presentation delivered in the Research Track at ISC CLOUD'13 at Heidelberg (Germany) on Sep. 24th 2013.
It describes the Virtual Cluster Architecture developed during the BonFIRE project and the reasons for building it. Some proof-of-concept experiments are also presented.
Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real C... – Andrés Gómez
This document discusses lessons learned from porting two applications, CalcuNetW and GammaMaps, to the Intel Xeon Phi coprocessor. CalcuNetW calculates measurements in complex networks using MKL libraries, while GammaMaps performs dose calculations for radiation therapy using OpenMP pragmas. With minimal modifications using only pragmas, both applications were able to run natively and offload work to the Xeon Phi. Results showed the Xeon Phi providing similar performance to a single Xeon CPU core but with poor I/O performance. Further optimization work is required to fully leverage the Xeon Phi's capabilities.
Software libre y modelos de programación en la investigación con supercomputa... – Andrés Gómez
Presentation given at the II Free Software for Education Congress in July 2013, presenting the results of a survey of CESGA users about their computational needs and the programming tools they use.
Role of public supercomputing centers in the promotion of HPC on Cloud: the C... – Andrés Gómez
The Galicia Supercomputing Centre (CESGA) provides high performance computing resources and services to research institutions in Galicia, Spain. It aims to promote computational science research and technology transfer. CESGA has over 16,000 GFLOPS of computing power and seeks to make HPC resources more accessible to small and medium enterprises through training programs and a cloud infrastructure called CloudPYME. This project aims to validate open source simulation software, provide training, deploy cloud services, and support 10 SMEs in order to increase adoption of HPC and sustainability of the resources.
VCOC BonFIRE presentation at FIRE Engineering Workshop 2012 – Andrés Gómez
Results of the VCOC experiment in the BonFIRE European Project (http://www.bonfire-project.eu). It shows a general fault-tolerant architecture for use in distributed Cloud environments and the usage of application performance indicators to trigger cluster elasticity. More information at www.cesga.es.
End-to-end pipeline agility - Berlin Buzzwords 2024 – Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
Open Source Contributions to Postgres: The Basics (POSETTE 2024) – ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... – sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with "Financial Odyssey," our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
3. CESGA Mission
"Contribute to the advancement of Science and Technical Knowledge, by means of research and application of high performance computing and communications, as well as other information technologies resources, in collaboration with other institutions, for the benefit of society."
6. Our Customers
- Universities (mainly from Galicia)
- R&D&I centres (mainly from Galicia)
- CSIC (around Spain)
- Other institutions from Spain and Europe:
  - Hospitals (R&D only)
  - Companies (mainly SMEs)
  - Other non-profit R&D&I organizations
- Non-fee access for Europeans through:
  - RES open calls
  - PRACE open calls
7. CESGA Computing Infrastructure
- FINIS TERRAE II (HPC): 7,712 cores
- SVG (HTC and Cloud): ~3,300 cores
- Cloud for Industry: 240 cores
- Big Data: 456 cores
- Remote Visualisation: 80 cores
- Online Disk: 1,200 TB
- Storage: 2,200 TB
9. What is Big Data?
Why now:
- Producing data is very cheap (sensors, people, …)
- Storage is also cheap
- Unstructured and high-dimensional data
"Big Data consists of extensive datasets - primarily in the characteristics of volume, variety, velocity, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis."
NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 1, Definitions (NIST Special Publication 1500-1). Gaithersburg, MD. Retrieved from http://dx.doi.org/10.6028/NIST.SP.1500-1
10. The V's: Big Data Challenges
- Volume
- Velocity
- Variety
- Veracity
- Variability
- Value (added value or knowledge)
Adapted from: Demchenko, Y., Grosso, P., & Membrey, P. (2013). Addressing Big Data Issues in Scientific Data Infrastructure. In 2013 International Conference on Collaboration Technologies and Systems (CTS) (pp. 48-55). IEEE. http://doi.org/10.1109/CTS.2013.6567203
11. What is Data Science?
"Data science is the extraction of actionable knowledge directly from data through a process of discovery, or hypothesis formulation and hypothesis testing."
NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 1, Definitions (NIST Special Publication 1500-1). Gaithersburg, MD. Retrieved from http://dx.doi.org/10.6028/NIST.SP.1500-1
Data Scientist: a champion! Collaboration is better.
12. Architecture
NIST Big Data Public Working Group (NBD-PWG). (2015). NIST Big Data Interoperability Framework: Volume 6, Reference Architecture (NIST Special Publication 1500-6). Gaithersburg, MD. Retrieved from http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-6.pdf
13. Big Data Requirements
- Very large storage (TB, PB, EB, …)
- Parallel, very fast I/O (GB/s)
- Computing capacity (move the processing to the data)
- Parallel processing
- Interactive, streamed, and batch processing
- Visualisation (the first step of data analysis)
- Advanced data analytics and ML packages
- Remote access
- Etc.
16. CESGA Solution: Dynamic
Create your own cluster for Data Science:
- Hardware platform for Big Data
- Docker containers orchestrated with Mesos
- Your own cluster configuration, running frameworks such as Spark, Cassandra, or SciDB
- Access through a PaaS API and a web interface
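A minimal PySpark sketch of the kind of job such a self-service cluster is meant to run. The master URL, file name, and column names are placeholders; on the CESGA platform they would come from the provisioned cluster rather than local mode.

```python
# pip install pyspark
from pyspark.sql import SparkSession

# local[*] keeps the sketch runnable anywhere; on a provisioned cluster the
# master URL would point at the cluster manager instead.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# readings.csv and its columns are placeholders for a real dataset.
df = spark.read.csv("readings.csv", header=True, inferSchema=True)
df.groupBy("sensor_id").avg("value").show()

spark.stop()
```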
17. CESGA Solution: HPC
When data processing needs large computing:
- Hardware platform for HPC, plus GPUs
- High-performance storage: Lustre
- High-speed communication: InfiniBand
- Frameworks: R, Theano, TensorFlow, Caffe
- Scheduler: SLURM
- Access: web interface / remote desktop / SSH
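A small TensorFlow sketch of the kind of check and computation a job on this platform might run. It uses the TensorFlow 2 API, which postdates this deck, and device placement falls back to the CPU when the scheduler has not allocated a GPU.

```python
# pip install tensorflow
import tensorflow as tf

# Report the GPUs visible to this process (e.g. those allocated by SLURM).
print("GPUs visible:", tf.config.list_physical_devices("GPU"))

# A small computation; TensorFlow places it on a GPU automatically if one
# is available, otherwise it runs on the CPU.
a = tf.random.normal((1024, 1024))
b = tf.random.normal((1024, 1024))
print(float(tf.reduce_sum(tf.matmul(a, b))))
```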
18. CESGA Data Scientist
- CESGA has no data scientist of its own
- CESGA offers this service in collaboration
- Open to collaborations in Portugal