The document provides an overview of Linked (Open) Data, including RDF, RDFS, and SPARQL. It defines key concepts such as the Linked Data principles of using URIs to identify things on the web and describing the relationships between them. It describes RDF's basic data model of subject-predicate-object triples for making statements about resources, and the RDF serialization formats Turtle and JSON-LD. It also covers the semantic query language SPARQL for querying RDF data.
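The triple model described above can be sketched in a few lines of plain Python. This is an illustrative toy, not a real RDF library (in practice you would use something like rdflib); the URIs and the tiny pattern matcher are invented for the example, with `None` playing the role of a SPARQL variable.

```python
# RDF data as a list of (subject, predicate, object) triples.
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/bob",   "http://xmlns.com/foaf/0.1/name", "Bob"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard,
    much like a variable in a SPARQL basic graph pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Rough analogue of: SELECT ?name WHERE { <.../alice> foaf:name ?name }
for _, _, name in match(triples,
                        s="http://example.org/alice",
                        p="http://xmlns.com/foaf/0.1/name"):
    print(name)  # Alice
```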
Why Data Virtualization? An Introduction by Denodo (Justo Hidalgo)
Data Virtualization means real-time data access and integration. But why do I need it? This presentation tries to answer that question in a simple yet clear way.
By Alberto Pan, CTO of Denodo, and Justo Hidalgo, VP Product Management.
Enabling Digital Transformation: API Ecosystems and Data Virtualization (Denodo)
Watch the full webinar here: https://buff.ly/2KBKzLJ
Digital transformation, as cliché as it sounds, is on top of every decision maker’s strategic initiative list. And at the heart of any digital transformation, no matter the industry or the size of the company, there is an application programming interface (API) strategy. While API platforms enable companies to manage large numbers of APIs working in tandem, monitor their usage, and establish security between them, they are not optimized for data integration, so they cannot easily or quickly integrate large volumes of data between different systems. Data virtualization, however, can greatly enhance the capabilities of an API platform, increasing the benefits of an API-based architecture. With data virtualization as part of an API strategy, companies can streamline digital transformations of any size and scope.
Join us for this webinar to see these technologies in action in a demo and to get the answers to the following questions:
* How can data virtualization enhance the deployment and exposure of APIs?
* How does data virtualization work as a service container, as a source for microservices, and as an API gateway?
* How can data virtualization create managed data services ecosystems in a thriving API economy?
* How are GetSmarter and others leveraging data virtualization to facilitate API-based initiatives?
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica... (Denodo)
Autodesk designed a modern data architecture that heavily uses data virtualization to integrate both legacy data sources and contemporary big data analytics like Spark into a single unified logical data warehouse. In this presentation, you will learn how to build a logical data warehouse using data virtualization and create a single, unified enterprise-wide access and governance point for any data used within the company.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/Ab4PDB.
Solution architecture for big data projects
Keywords: solution architecture, big data, Hadoop, Hive, HBase, Impala, Spark, Apache Cassandra, SAP HANA, Cognos BigInsights
Delivering Quality Open Data by Chelsea Ursaner (Data Con LA)
Abstract: The value of data is exponentially related to the number of people and applications that have access to it. The City of Los Angeles embraces this philosophy and is committed to opening as much of its data as it can in order to stimulate innovation, collaboration, and informed discourse. This presentation will review what you can find and do on our open data portals, as well as our strategy for delivering the best open data program in the nation.
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins (Data Con LA)
Abstract: Companies are adopting big data to perform high-velocity, real-time analytics on very large volumes of data, enabling rapid self-service analysis for business users and never-before-realized use cases. However, such projects have yielded limited value because these big data systems have become siloed from the rest of the enterprise systems holding critical business operational data. Big Data Fabric is a modern data architecture combining data virtualization, data prep, and lineage capabilities to seamlessly integrate these huge, siloed volumes of structured and unstructured data with other enterprise data assets at scale. This presentation uses proven customer case studies in big data and IoT to demonstrate the value of using a big data fabric as a logical data lake for big data analytics.
An introduction to data virtualization in business intelligence (David Walker)
A brief description of what Data Virtualisation is and how it can be used to support business intelligence applications and development. Originally presented to the ETIS Conference in Riga, Latvia in October 2013
Simplifying Cloud Architectures with Data Virtualization (Denodo)
Watch here: https://bit.ly/2yxLo6f
Moving applications and data to the Cloud is a priority for many organizations. The benefits, in terms of flexibility, agility, and cost savings, are driving Cloud adoption. However, the journey to the Cloud is not as easy as many people think. The process of moving applications and data to the Cloud is challenging and can entail widespread disruption across the organization if not carefully managed. Even when systems are migrated to the Cloud, the resultant hybrid or multi-Cloud architecture is more complex for users to navigate, making it harder for them to get the data they need to do their jobs.
Data Virtualization can help organizations at all stages of their journey to the Cloud - during migration and also in the resultant hybrid or multi-Cloud architectures. Attend this webinar to learn how Data Virtualization can:
- Help organizations manage risk and minimize the disruption caused as systems are moved to the Cloud
- Provide a single point of access for data that is both on-premise and in the Cloud, making it easier for users to find and access the data that they need
- Provide a security layer to protect and manage your data when it's distributed across hybrid or multi-Cloud architectures
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa... (Zaloni)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma, thought leader and coauthor of Architecting Data Lakes, offers lessons learned from the field to get you started.
Data Lakes: 8 Enterprise Data Management Requirements (SnapLogic)
2016 is the year of the data lake. As you consider adopting an enterprise data lake strategy to manage more dynamic, poly-structured data, your data integration strategy must also evolve to handle new requirements. Thinking you can simply hire more developers to write code or rely on your legacy rows-and-columns centric tools is a recipe to sink in a data swamp instead of swimming in a data lake.
In this presentation, you'll learn about eight enterprise data management requirements that must be addressed in order to get maximum value from your big data technology investments.
To learn more, visit: https://www.snaplogic.com/big-data
Introduces Microsoft's Data Platform for on-premises and cloud deployments, and the challenges businesses face with data and data sources. Covers the evolution of database systems in the modern world, what businesses are doing with their data, and their new needs in a changing industry landscape.
Dives into the opportunities available for businesses and industry verticals: those already identified, and those not yet explored.
Explains Microsoft's cloud vision and what the Azure platform offers, as Infrastructure as a Service or Platform as a Service, for building your own offerings.
Introduces and demos real-world scenarios and case studies in which businesses have used the cloud and Azure to create new and innovative solutions that unlock this potential.
Where Does Fast Data Strategy Fit within IT Projects (Denodo)
A Fast Data Strategy is a must for organizations to become and remain competitive. There are four use cases where a Fast Data Strategy fits within IT projects: Agile BI, Big Data/Cloud, Data Services, and Single View. In this presentation, you will discover how four customers used data virtualization and a Fast Data Strategy for these use cases.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/UxHMuJ.
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye... (SoftServe)
BI architecture drivers have to change to satisfy new requirements in format, volume, latency, hosting, analysis, reporting, and visualization. In this presentation, delivered at the 2014 SATURN conference, SoftServe's Serhiy and Olha showcased a number of reference architectures that address these challenges and speed up the design and implementation process, making it more predictable and economical:
- Traditional architecture based on an RDBMS data warehouse but modernized with column-based storage to handle high load and capacity
- NoSQL-based architectures that address Big Data batch and stream-based processing and use popular NoSQL and complex event-processing solutions
- Hybrid architecture that combines traditional and NoSQL approaches to achieve completeness that would not be possible with either alone
The architectures are accompanied by real-life projects and case studies that the presenters have performed for multiple companies, including Fortune 100 and start-ups.
This white paper presents the opportunities offered by the data lake and advanced analytics, as well as the challenges of integrating, mining, and analyzing the data collected from these sources. It goes over the important characteristics of the data lake architecture and the Data and Analytics as a Service (DAaaS) model. It also delves into the features of a successful data lake and its optimal design, and covers how data, applications, and analytics are strung together to speed up the insight-generation process with the help of a powerful architecture for mining and analyzing unstructured data: the data lake.
An overview of Hadoop and the data warehouse from technology and business viewpoints. The presentation also includes some of my personal observations and suggestions for people who want to join the field of Big Data.
Understanding Metadata: Why it's essential to your big data solution and how ... (Zaloni)
In this O'Reilly webcast, Ben Sharma (cofounder and CEO of Zaloni) and Vikram Sreekanti (software engineer in the AMPLab at UC Berkeley) discuss the value of collecting and analyzing metadata, and its potential to impact your big data solution and your business.
Watch the replay here: http://oreil.ly/28LO7IW
Designing Fast Data Architecture for Big Data using Logical Data Warehouse a... (Denodo)
Companies such as Autodesk are fast replacing tried-and-true physical data warehouses with logical data warehouses/data lakes. Why? Because they are able to accomplish the same results in one-sixth of the time and with one-quarter of the resources.
In this webinar, Autodesk's Platform Lead, Kurt Jackson, will describe how they designed a modern fast data architecture as a single unified logical data warehouse/data lake using data virtualization and contemporary big data analytics like Spark.
A logical data warehouse/data lake is a virtual abstraction layer over the physical data warehouse, big data repositories, cloud, and other enterprise applications. It unifies both structured and unstructured data in real time to power analytical and operational use cases.
Enabling Data as a Service with the JBoss Enterprise Data Services Platform (prajods)
This presentation was given at JUDCon 2013, Jan 17,18 at Bangalore. Presented by Prajod Vettiyattil and Gnanaguru Sattanathan. The presentation deals with the Why, What and How of Data Services and Data Services Platforms. It also explains the features of the JBoss Enterprise Data Services Platform.
The need for Data Services is explained with 3 Business use cases:
1. Post purchase customer experience improvement for an Auto manufacturer
2. Enterprise Data Access Layer
3. Data Services for regulatory reporting requirements like Dodd-Frank
TUW 184.742 Data as a Service – Concepts, Design & Implementation, and Ecosy... (Hong-Linh Truong)
This presentation is part of the course "184.742 Advanced Services Engineering" at The Vienna University of Technology, in Winter Semester 2012. Check the course at: http://www.infosys.tuwien.ac.at/teaching/courses/ase/
Open and Proprietary Data Economies in Malaysia: The Consumption Perspective (Sandra Hanchard)
BIG DATA MALAYSIA @ Open Government Partnership Seminar and Exhibition
Sandra Hanchard
Kuala Lumpur, 18 August 2015
http://ideas.org.my/events/18-august-2015-open-government-partnership-seminar-and-exhibition/
The 101 of Web 2.0 by Roslan Bakri Zakariah (Ideashare)
Explore the basics of Web 2.0. If you still think a social network is what you get when you've been partying all night long, you should take some time to sit in on this talk. Find out a little about what freemium is and where internet memes come from.
Expand your horizon and prepare to explore your imagination with Roslan. Don't be late; the class is in.
This talk was on deep learning use cases outside of computer vision. It also covered larger-scale patterns of what good deep learning use cases typically look like. It ends with an explanation of anomaly detection and various kinds of anomaly use cases.
Tracxn Startup Research: Data as a Service Landscape, August 2016 (Tracxn)
The top three funded sub-sectors till date are market intelligence (149 investments, $1.3B), financial data providers (158 investments, $1.2B), and geospatial data providers.
Zeta Architecture: The Next Generation Big Data Architecture (MapR Technologies)
The Zeta Architecture is a high-level enterprise architectural construct which enables simplified business processes and defines a scalable way to increase the speed of integrating data into the business. The result? A powerful, data-centric enterprise.
This is the presentation for the talk I gave at JavaDay Kiev 2015. It covers the evolution of data processing systems from simple ones built on a single DWH to complex approaches like the Data Lake, Lambda Architecture, and pipeline architectures.
Creating a Modern Data Architecture for Digital Transformation (MongoDB)
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form... (Alistair Hamilton)
Presentation by Al Hamilton and Cody Johnson to the Canberra Semantic Web Meetup Group on why producers of official statistics are interested in the semantic web community (including Linked Open Data), and outlining experimental work by Cody Johnson on transforming selected Population Census data, released by the ABS in SDMX-ML, into RDF Data Cube Vocabulary format.
Enabling Low-cost Open Data Publishing and Reuse (Marin Dimitrov)
In the space of just a few years we've seen the transformational power of open data: both for transparency and accountability with public data, and for efficiency and innovation with private data in businesses. In its first year, institutions and individuals throughout Europe have supported public sector bodies in releasing data, and numerous start-ups, developers, and SMEs in reusing this data for economic benefit.
However, we are still at the beginning of the open data movement, and there is still more that can be done to make open data simpler to use and to make it available to a wider audience.
The core goal of the DaPaaS project is to provide a Data- and Platform-as-a-Service environment, where 3rd parties (such as governmental organisations, SMEs, developers and larger companies) can publish and host both data sets and data-intensive applications, which can then be accessed by end-user applications in a cross-platform manner. You can find out more about DaPaaS on the detailed about page.
Essentially, DaPaaS aims to make publishing, consumption, and reuse of open data, as well as deploying open data applications, easier and cheaper for SMEs and small public bodies which otherwise may not have sufficient technical expertise, infrastructure and resources required to do so.
see also http://www.slideshare.net/eswcsummerschool/wed-roman-tutopendatapub-38742186
Denodo's Data Catalog: Bridging the Gap between Data and Business (APAC) (Denodo)
Watch full webinar here: https://bit.ly/3nxGFam
Self-service is a major goal of modern data strategists. Denodo's data catalog is a key piece of Denodo's portfolio, bridging the gap between the technical data infrastructure and business users. It provides documentation, search, governance, and collaboration capabilities, as well as data exploration wizards. It is the perfect companion for a virtual layer, fully empowering self-service initiatives with minimal IT intervention, and it gives business users the tools to generate their own insights with proper security, governance, and guardrails.
In this session you will learn about:
- The role of a virtual semantic layer in self-service initiatives
- The key capabilities of Denodo's new Data Catalog
- Best practices and advanced tips for a successful deployment
- How customers are using Denodo's Data Catalog to enable self-service initiatives
Data Profiling, Data Catalogs and Metadata Harmonisation (Alan McSweeney)
These notes discuss the related topics of data profiling, data catalogs, and metadata harmonisation. They describe a detailed structure for data profiling activities and identify various open source and commercial tools and data profiling algorithms. Data profiling is a necessary prerequisite for constructing a data catalog, which makes an organisation's data more discoverable: the data collected during profiling forms the metadata contained in the catalog. This assists with ensuring data quality and is also a necessary activity for Master Data Management initiatives. The notes describe a metadata structure and provide details on metadata standards and sources.
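A minimal sketch of what column-level data profiling computes: per-column row counts, null counts, distinct counts, and most frequent values, which are typical of the metadata a data catalog would store. The sample rows and field names are invented for illustration, and real profiling tools compute far richer statistics.

```python
from collections import Counter

# Tiny illustrative dataset: a list of records with a deliberate null.
rows = [
    {"id": 1, "country": "IE", "revenue": 120.0},
    {"id": 2, "country": "IE", "revenue": None},
    {"id": 3, "country": "DE", "revenue": 300.0},
]

def profile(rows):
    """Compute simple per-column profiling statistics."""
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "rows": len(values),
            "nulls": len(values) - len(non_null),       # completeness check
            "distinct": len(set(non_null)),              # cardinality
            "top": Counter(non_null).most_common(1),     # most frequent value
        }
    return report

print(profile(rows)["revenue"])
```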
Unlock Your Data for ML & AI using Data Virtualization (Denodo)
How Denodo Complements a Logical Data Lake in the Cloud
● Denodo does not substitute data warehouses, data lakes, ETLs...
● Denodo enables the use of all of them together, plus other data sources:
  ○ In a logical data warehouse
  ○ In a logical data lake
  ○ The two are very similar; the only difference is in the main objective
● There are also use cases where Denodo can be used as a data source in an ETL flow
Towards Semantic APIs for Research Data Services (Invited Talk) (Anna Fensel)
Rapid development of Internet and Web technology is changing the state of the art in communicating knowledge and the results of research activities. In particular, semantic technology and linked and open data are becoming key enablers of successful and efficient progress in research. First, I define the research data service (RDS) and discuss typical current and possible future usage scenarios involving RDS. Next, I discuss the state of the art in semantic service and data annotation, API construction, and infrastructural solutions applicable to RDS realisation. Finally, innovative methods of online dissemination, promotion, and efficient communication of research are discussed.
Relational Database explanation with detail (9wldv5h8n)
A relational database is a type of database that stores and provides access to data points that are related to one another. Relational databases are based on the relational model, an intuitive, straightforward way of representing data in tables.
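The relational model described above can be illustrated with Python's built-in sqlite3 module: related data points live in separate tables and are connected through key columns, with the relationship expressed at query time via a join. The table and column names here are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Two tables related through the orders.customer_id foreign key.
con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customer(id), total REAL)")
con.execute("INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace')")
con.execute("INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)")

# A join reconstructs the relationship between the two tables.
rows = con.execute("""
    SELECT c.name, SUM(o.total)
    FROM customer c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Ada', 65.0), ('Grace', 15.0)]
```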
Multi-Model Data Query Languages and Processing Paradigms (Jiaheng Lu)
Specifying users' interests with a formal query language is a typically challenging task, which becomes even harder in the context of multi-model data management because we have to deal with data variety. It usually lacks a unified schema to help the users issuing their queries, or has an incomplete schema as data come from disparate sources. Multi-Model DataBases (MMDBs) have emerged as a promising approach for dealing with this task as they are capable of accommodating and querying the multi-model data in a single system. This tutorial aims to offer a comprehensive presentation of a wide range of query languages for MMDBs and to make comparisons of their properties from multiple perspectives. We will discuss the essence of cross-model query processing and provide insights on the research challenges and directions for future work. The tutorial will also offer the participants hands-on experience in applying MMDBs to issue multi-model data queries.
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...Neo4j
Watch this webinar and learn how Neo4j and ICC Technology can help you remove risk from your data governance by improving the way you approach data lineage. We’ll cover some of the common approaches, driving regulations and biggest risks for banks and finances services.
-Find out how Data Lineage is becoming more complex for Banks and Financial Services companies
-Learn how a native-graph model can improve tracing data sources to targets as well as store transformations.
-Watch a demonstration on how you might approach regulations such as BCBS 239
Solving the Disconnected Data Problem in Healthcare Using MongoDBMongoDB
The data diversity in healthcare and life sciences is exploding and the market is fundamentally changing as a result of healthcare reform. The result is more and more data but it is compartmentalized and disconnected. At Zephyr Health, we have developed a data platform that is able to provide connectivity between thousands of healthcare data assets using an ontology driven approach storing data in MongoDB. This session will show how we break down this very challenging problem and how some of MongoDBs more recent features have been utilized to do so.
proDataMarket presentation at "Spatial Data on The Web"dapaasproject
Presentation at the "Spatial Data on The Web" event, 10th of February 2016, Amersfoort, The Netherlands
http://www.pilod.nl/wiki/Geodata_on_The_Web_Event_10_February_2016
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
2. About me
• Education
– Eng (2003), Technical University of Cluj-Napoca, Romania
– PhD (2008), University of Innsbruck, Austria
• Current positions
– Senior Research Scientist, SINTEF, Norway
– Associate Professor, University of Oslo, Norway
• Expertise and responsibilities
– Initiating, leading, and carrying out (research-intensive) projects on data management and service-oriented topics
– Involved with over 20 large-scale R&D projects at the European level during the past 12 years
3. “Technology for a better society”
• Public and private companies
• Data owners
• Data publishers
• Data integrators and aggregators
• Developers
• Improved data access
• Data-driven decision making
• Cost reduction when working with data
• Reduction of the dependency on generic infrastructure providers (e.g. generic cloud)
• Increase in the speed of making data available
• Increase in the reuse of data
• Data cleaning
• Data transformation
• Data publication
• Data-as-a-Service
• Open data
• Linked data (RDF, SPARQL)
DataGraft
5. Outline
Session #1: Open Data
• Open Data
• (Open) Data Quality Issues
• Linked (Open) Data
– RDF, RDFS, SPARQL
Session #2: DataGraft
• Data-as-a-Service: DataGraft
• Examples and Demo
• Big Data and DataGraft
• Open Data in Malaysian context (by Dennis Gan)
• (Optional: Hands-on)
What is Open Data?
What is Linked Data?
Challenges in (Linked Open) Data?
How to publish Linked Open Data?
Linked Open Data Use Cases?
(Linked) Open Data and Big Data?
7. What can open data do for you?
(Source: The ODI, https://vimeo.com/110800848)
8. Open Data
…is changing the nature of business
…reflects a cultural shift to a more open society
9. Example: Personalized and Localized Urban Quality Index (PLUQI)
The index includes data from various domains:
• Daily life satisfaction: weather, transportation, community, …
• Healthcare level: number of doctors, hospitals, suicide statistics, …
• Safety and security: number of police stations, fire stations, crimes per capita, …
• Financial satisfaction: prices, incomes, housing, savings, debt, insurance, pension, …
• Level of opportunity: jobs, unemployment, education, re-education, …
• Environmental needs and efficiency: green space, air quality, …
10. PLUQI – potential usage
• Place recommendation for travel agencies or travelers
• Policy analysis and optimization for (local) government
• Understanding the citizens’ voice and demands regarding environmental conservation
• Commercial impact analysis for retailers and franchises
• Location recommendation and understanding local issues for real estate
• Risk analysis and management for insurance and financial companies
• Local marketing and sales force optimization for marketers
11. Open Data
• Businesses can develop new ideas, services and applications; improve decision making; achieve cost savings
• Can increase government transparency, accountability and the quality of public services
• Citizens get better and more timely access to public services
Source: McKinsey
http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
Gartner:
By 2016, the use of "open data" will continue to increase — but slowly, and predominantly limited to Type A enterprises.
By 2017, over 60% of government open data programs that do not effectively use open data internally will be scaled back or discontinued.
By 2020, enterprises and governments will fail to protect 75% of sensitive data and will declassify and grant broad/public access to it.
Source: Gartner
http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf
12. Lots of open datasets on the Web…
• A large number of datasets have been published as open data in recent years
• Many kinds of data: cultural, science, finance, statistics, transport, environment, …
• Popular formats: tabular (e.g. CSV, XLS), HTML, XML, JSON, …
13. …but few actually used
• Few applications utilizing open and distributed datasets at present
• Challenges for data consumers
– Data quality issues
– Difficult or unreliable data access
– Licensing issues
• Challenges for data publishers
– Lack of expertise & resources: not easy to publish & maintain high-quality data
– Unclear monetization & sustainability

Open Data Portal   Datasets    Applications
data.gov           ~ 200 000   ~ 80
publicdata.eu      ~ 48 000    ~ 85
data.gov.uk        ~ 31 000    ~ 390
data.norge.no      ~ 620       ~ 60
data.gov.my        ~ 1 065     ~ 10
14. Lots of datasets are in tabular format
– Records organized in silos of collections
– Very few links within and/or across collections
– Difficult to understand the nature of the data
– Difficult to integrate / query
europeandataportal.eu
15. Tim Berners-Lee's 5-star open data rating system
1 star: Openly available on the web as a document
2 stars: Available in a structured format (XLS)
3 stars: Available in non-proprietary formats (CSV)
4 stars: Uses URIs to denote things
5 stars: Linked to other data to provide context
16. 1-Star Benefits
Consumers:
• Ability to look at, print, store, modify and share data
• Ability to use data as input to a system
Publishers:
• Easily publish data
• Ensure transparency

5-Star Benefits
Consumers:
• Discover more (related) data while consuming the data
• Directly learn about the data schema
• ? Have to deal with broken data links
• ? Trust issues
Publishers:
• Make data discoverable
• Increase the value of data
• Gain the same benefits from the links as the consumers
• ? Need to invest resources to link data
• ? May need to clean data
17. Tabular Data vs Graph Data
Tabular Data
• Lots of open datasets are in tabular format (CSV, Excel, TSV, etc.)
• Records organized in silos of collections
• Very few links within and/or across collections
• Difficult to understand the nature of the data
• Difficult to integrate / query
Graph Data (based on Linked Data)
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
• Open standards by W3C
− Data format: RDF
− Knowledge representation: RDFS/OWL
− Query language: SPARQL
http://www.w3.org/standards/semanticweb/data
europeandataportal.eu
20. Tabular data
Tabular data is data that is structured into rows and columns.
Correspondence with reality:
1) Each row represents an entity
2) Each column header represents an attribute of the entity
3) Each column value represents a value of an attribute
4) Each table represents a collection of entities
21. Tabular data files
Tabular data can be stored in different formats:
Tabular text formats (pure tabular data) – delimiter-separated values:
- CSV – comma-separated values
- Less common: TSV – tab-separated values, colon-separated values, etc.
Spreadsheet formats (metadata about the document, tabular data, formulas):
- XLS (Excel spreadsheet)
- XLSX (Excel 2007 format)
22. Tabular data quality issues
When a dataset does not satisfy specified data quality criteria, it contains data quality issues. To provide higher data quality, these issues should be detected and removed.
34. How to resolve data quality issues?
Workflow:
1) Identify data quality issues
2) Define transformation functions to resolve them
3) Execute the transformations and verify the result
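The three-step workflow above can be sketched in plain Python. The dataset and the `clean_country` helper are made up for illustration; a real pipeline would use a cleaning tool or library.

```python
# Hypothetical toy dataset with two quality issues: a missing value
# and an inconsistently formatted country name.
rows = [
    {"name": "Alice", "country": "Norway"},
    {"name": "Bob", "country": "norway "},
    {"name": "Carol", "country": ""},
]

# 1) Identify data quality issues (here: missing country values).
issues = [i for i, r in enumerate(rows) if not r["country"].strip()]

# 2) Define a transformation function to resolve them.
def clean_country(value, default="unknown"):
    value = value.strip().title()   # also normalizes inconsistent casing
    return value if value else default

# 3) Execute the transformation and verify the result.
cleaned = [{**r, "country": clean_country(r["country"])} for r in rows]
assert all(r["country"] for r in cleaned)
```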
35. Transformation function types
By scope:
• Functions on rows
• Functions on columns
• Functions transforming the entire dataset
By caused effect:
• Data reordering functions
• Data extraction functions
• Data manipulation functions
• Data enrichment functions
36. Transformation functions
Scope: Rows
• Add Row – create a new record in a dataset. Effect: data enrichment.
• Take/Drop Rows – extract only relevant rows by index. Effect: data extraction. Resolves: "Rows describing entities not belonging to a collection".
• Shift Row – change a row's position inside a dataset. Effect: data reordering; simplifies quality issue detection.
• Filter Rows – extract only relevant rows by condition. Effect: data extraction. Resolves: "Rows describing entities not belonging to a collection".
Scope: Entire dataset
• Remove Duplicates – remove similar rows. Effect: data extraction. Resolves: "Duplicate rows".
• Sort Dataset – sort the dataset by given column names in a given order. Effect: data reordering; simplifies quality issue detection.
• Reshape Dataset (Melt) – move columns to rows. Effect: data manipulation. Resolves: "Column headers containing attribute values".
• Reshape Dataset (Cast) – move rows to columns by categorizing and aggregating. Effect: data enrichment; simplifies quality issue detection.
• Group and Aggregate – group values by one or more columns and perform aggregation. Effect: data enrichment; simplifies quality issue detection.
Scope: Columns
• Add Column – add a column with a manually specified value. Effect: data enrichment.
• Derive Column – add a column with values computed from other columns. Effect: data enrichment.
• Take/Drop Columns – take or drop selected column(s). Effect: data extraction. Resolves: "Columns not related to model".
• Shift Column – arbitrarily change a column's order. Effect: data reordering; simplifies quality issue detection.
• Merge Columns – merge columns using a custom separator. Effect: data manipulation. Resolves: "Single value is split across multiple columns".
• Split Column – split a column using a custom separator. Effect: data manipulation. Resolves: "Multiple values stored in one column".
• Rename Columns – change column headers. Effect: data manipulation. Resolves: "Incorrect column headers".
• Map Columns – apply a function to all values in a column. Effect: data manipulation. Resolves: "Illegal values", "Missing values", "Inconsistent values".
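To make the Reshape (Melt) operation concrete, here is a minimal pure-Python sketch; the `melt` function and the wide-format sample data are hypothetical illustrations, not DataGraft code.

```python
def melt(rows, id_cols, var_name="variable", value_name="value"):
    """Move value columns into rows: one output row per (id, column) pair."""
    out = []
    for row in rows:
        for col, val in row.items():
            if col not in id_cols:
                melted = {c: row[c] for c in id_cols}
                melted[var_name] = col
                melted[value_name] = val
                out.append(melted)
    return out

# The year headers "2014"/"2015" are attribute values, so melt them into rows.
wide = [{"region": "Oslo", "2014": 120, "2015": 131}]
long = melt(wide, id_cols=["region"])
# long == [{'region': 'Oslo', 'variable': '2014', 'value': 120},
#          {'region': 'Oslo', 'variable': '2015', 'value': 131}]
```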
37. Tabular data cleaning tools
• CLI tools (e.g. Unix awk, csvkit, CSVfix) – lack a convenient user interface
• Programming languages and libraries for data analysis (R, agate for Python) – users need programming knowledge
• Spreadsheet software (Microsoft Excel, LibreOffice Calc, Google Spreadsheets) – not initially created for data cleaning; hard to debug; code is mixed up with data
• Frameworks/tools designed for interactive data cleaning and transformation in an ETL process
38. Example: vehicle registration data
https://www.ssb.no/statistikkbanken/selectvarval/Define.asp?subjectcode=&ProductId=&MainTable=RegKjoretoy&nvl=&PLanguage=1&nyTmpVar=true&CMSSubjectArea=transport-og-reiseliv&KortNavnWeb=bilreg&StatVariant=&checked=true

39. Example: vehicle registration data (continued)
* Data obtained from StatBank Norway https://www.ssb.no/en/statistikkbanken
40. Map columns – applying a function to all values in a column
Effect: data manipulation
Resolves anomalies: illegal values, missing values, inconsistent values
Required parameters, for each column to be mapped:
1) Name of the column to manipulate
2) Name of the function to apply
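A minimal sketch of the Map-columns operation, assuming rows stored as dicts; the `map_column` helper and the vehicle-count sample are hypothetical, loosely modeled on the vehicle registration example.

```python
def map_column(rows, column, fn):
    """Apply fn to every value in the named column (data manipulation)."""
    return [{**row, column: fn(row[column])} for row in rows]

# Fix inconsistent values: counts stored as strings like "1 024".
data = [{"municipality": "Bergen", "vehicles": "1 024"}]
fixed = map_column(data, "vehicles", lambda v: int(v.replace(" ", "")))
# fixed == [{'municipality': 'Bergen', 'vehicles': 1024}]
```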
43. Derive column – add a column with values computed from others
Effect: data enrichment
Adds new information to the data
Required parameters:
1) Name of the derived column
2) Column(s) to derive from
3) Function to derive with
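The Derive-column operation takes exactly the three parameters listed above; this pure-Python sketch (the `derive_column` helper and the fuel-count data are made up) shows them in that order.

```python
def derive_column(rows, new_column, source_columns, fn):
    """Add a column whose values are computed from other columns."""
    return [
        {**row, new_column: fn(*(row[c] for c in source_columns))}
        for row in rows
    ]

data = [{"petrol": 310, "diesel": 145}]
enriched = derive_column(data, "total", ["petrol", "diesel"],
                         lambda p, d: p + d)
# enriched == [{'petrol': 310, 'diesel': 145, 'total': 455}]
```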
46. Cast dataset – move rows to columns by categorizing and aggregating
Effect: data enrichment
Adds new information to the data; simplifies anomaly detection
Required parameters:
1) Column name for the variable (what to categorize and put into headers)
2) Column name for the value (what to aggregate)
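A minimal sketch of the Cast operation: distinct values of the variable column become new headers, and the value column is aggregated per group. The `cast` function and the sample data are hypothetical illustrations.

```python
from collections import defaultdict

def cast(rows, id_col, var_col, value_col, agg=sum):
    """Move rows to columns: one output row per id, one new column per
    distinct value of var_col, aggregating value_col within each cell."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        groups[row[id_col]][row[var_col]].append(row[value_col])
    return [
        {id_col: key, **{var: agg(vals) for var, vals in cols.items()}}
        for key, cols in groups.items()
    ]

long_data = [
    {"region": "Oslo", "fuel": "petrol", "count": 200},
    {"region": "Oslo", "fuel": "petrol", "count": 110},
    {"region": "Oslo", "fuel": "diesel", "count": 145},
]
wide = cast(long_data, "region", "fuel", "count")
# wide == [{'region': 'Oslo', 'petrol': 310, 'diesel': 145}]
```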
53. Linked Data
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
http://www.w3.org/standards/semanticweb/data
54. Linked open data cloud
By Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak - http://lod-cloud.net/, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36956792
55. Linked Data principles
• Every thing is represented by a URI
• URIs of things can be dereferenced
• Things are linked to other things by relating their URIs
56. Linked Data technology
• Data format: RDF
• Knowledge representation: RDFS/OWL
• Query language: SPARQL
• Linking medium: HTTP
59. Resource Description Framework (RDF) Basics
• RDF makes statements about resources (entities)
o Triple data model: subject -> predicate -> object (Alice's age is 34)
• Subjects and objects:
o Resources (URIs of entities) – can have properties related to them (http://my-domain.com/Alice)
o Literals – constant values ("female", "3.14159"); cannot be subjects
o Blank nodes – used to specify composite properties (e.g., an address composed of a country, city, street name, house number, zip code, etc.)
• Relationships (a.k.a. predicates) – relate one subject to one object
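The triple data model can be illustrated with plain Python tuples; this is only an illustration of the subject/predicate/object structure, not an RDF library, and the URIs follow the slide's examples.

```python
from collections import namedtuple

# A triple relates a subject resource, via a predicate, to an object
# that is either another resource (URI) or a literal value.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

triples = [
    # "Alice's age is 34" — the object is a literal
    Triple("http://my-domain.com/Alice",
           "http://xmlns.com/foaf/0.1/age", 34),
    # Resources can also be objects, which is how data gets linked
    Triple("http://my-domain.com/Alice",
           "http://xmlns.com/foaf/0.1/knows",
           "http://example.org/bob#me"),
]
```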
61. RDF serialisation formats (continued)
• RDFa (for HTML and XML embedding)
<body prefix="foaf: http://xmlns.com/foaf/0.1/ schema: http://schema.org/ dcterms: http://purl.org/dc/terms/">
<div resource="http://example.org/bob#me" typeof="foaf:Person">
<p>Bob knows <a property="foaf:knows" href="http://example.org/alice#me">Alice</a>
and was born on the <time property="schema:birthDate" datatype="xsd:date">1990-07-04</time>.</p>
<p>Bob is interested in <span property="foaf:topic_interest"
resource="http://www.wikidata.org/entity/Q12418">the Mona Lisa</span>.</p>
</div>
<div resource="http://www.wikidata.org/entity/Q12418">
<p>The <span property="dcterms:title">Mona Lisa</span> was painted by
<a property="dcterms:creator" href="http://dbpedia.org/resource/Leonardo_da_Vinci">Leonardo da Vinci</a>
and is the subject of the video
<a href="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">'La Joconde à
Washington'</a>. </p>
</div>
<div resource="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
<link property="dcterms:subject" href="http://www.wikidata.org/entity/Q12418"/>
</div>
</body>
63. RDF Schema (RDFS)
• Basic capabilities for describing RDF vocabularies
• Includes concepts to describe:
o classes, class hierarchies (sub-classes) and instances (typing)
o non-standard literal data types
o property hierarchies (sub-properties)
o predicate domain and range
o utility properties (labels, comments, additional information about things, definitions of resources)
o …
67. SPARQL querying – query
Question: What are the nicknames of people that Alice knows?
Query:

PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname .
}

Graph pattern: a:Alice --foaf:knows--> ?someone --foaf:nick--> ?nickname
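A minimal sketch of how such a query evaluates: match the first triple pattern, then join on the shared variable ?someone. The toy triples are hypothetical; this is not a real SPARQL engine.

```python
# Toy triple store (subject, predicate, object).
triples = {
    ("a:Alice", "foaf:knows", "a:Bob"),
    ("a:Alice", "foaf:knows", "a:Carol"),
    ("a:Bob", "foaf:nick", "Bobby"),
}

# Pattern 1 binds ?someone; pattern 2 joins on it and binds ?nickname.
nicknames = [
    nick
    for s1, p1, someone in triples if (s1, p1) == ("a:Alice", "foaf:knows")
    for s2, p2, nick in triples if (s2, p2) == (someone, "foaf:nick")
]
# nicknames == ['Bobby']  (Carol has no foaf:nick, so she contributes nothing)
```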
70. Data integration using Linked Data: using URIs
Example: Relational DB or spreadsheet – a dataset about scientific publications:

ID   Name    Home page
1    Alice   http://alice.org/
2    Tim     https://www.w3.org/People/Berners-Lee/

ID (author)   ISBN                Publication topic
1             978-3-16-148410-0   "On the frictional coefficient of bananas"
1             534-1-22-663975-1   "Do woodpeckers get headaches?"
2             1-933019-33-6       "The Semantic Web"
71. Data integration using Linked Data: using URIs (continued)
Graph representation of the new dataset:
a:Alice --foaf:publications--> http://.../978-3-16-148410-0 --foaf:topic--> "On the frictional coefficient of bananas"
a:Alice --foaf:publications--> http://.../534-1-22-663975-1 --foaf:topic--> "Do woodpeckers get headaches?"
t:Tim --foaf:publications--> http://.../1-933019-33-6 --foaf:topic--> "The Semantic Web"
75. Linked Data is great for Open Data
• Linked Data is a great means to represent data
– Semantics are part of the data
– Naturally linked to other data
– Query language
• How Linked Data can improve Open Data:
– Easier integration; frees data from silos
– Seamless interlinking of data
– Easier to understand the data
– New ways to query and interact with data
76. … but it has been ignored by the mainstream
• Difficult to make it accessible to people
– Publishers
– Developers
– Data workers
• Challenges with using Linked Data
– Lack of tooling and expertise to publish high-quality Linked Data
– Lack of resources to host LOD endpoints / unreliable data access
• DataGraft: packaging Linked Data to make it more approachable to the open data community
78. “Data is the new oil”
…but many of us just need gasoline
Data-as-a-Service
…is the new filling station
79. Data-as-a-Service
• Outsourcing of various data operations to the cloud
• Eliminates
– upfront costs for data infrastructure
– ongoing investment of time and resources in managing the data infrastructure
• A complete package for
– transformation of raw data into meaningful data assets
– reliable delivery of data assets
80. DataGraft was developed to allow data workers to manage their data in a simple, effective, and efficient way: powerful data transformation and reliable data access capabilities.
81. Data Transformation and RDF Publication Process
• Interactive design of transformations?
• Repeatable transformations?
• Reuse/share transformations (user-based access)?
• Cloud-based deployment of transformations?
• Self-serviced process?
• Data and Transformation as-a-Service?
[Figure: pipeline from Raw Data, via Transform, to Prepared Data, then via ontology mapping / Generate RDF to an RDF Graph stored in an RDF Triple Store]
102.
Data records (rows): Add row, Take row(s), Drop row(s), Shift row, Filter rows (grep), Remove duplicate rows
Entire dataset: Sort, Reshape dataset, Group (categorize) and aggregate
Columns: Add column(s), Take column(s), Drop column(s), Move column, Merge columns, Split column, Rename column(s), Apply function to all values in a column
108. Data pages and federated querying
Example question: What is the population of locations, and the total number of persons employed in human health and social work activities?
114. DataGraft key feature: flexible management and sharing of data and transformations
• Interactively build, modify and share data transformations
• Fork, reuse and extend transformations built by other professionals from DataGraft’s transformations catalog
• Share transformations privately or publicly
• Reuse transformations to repeatably clean and transform spreadsheet data
• Programmatically access transformations and the transformation catalogue
115. Reuse of transformations in environmental data publishing
TRAGSA Pilot: 42 transformations (25 created via reuse), ~7.7M triples
ARPA Pilot: 5 transformations (2 created via reuse), ~14K triples
Forking/reusing transformations helped us spend less time on creating new transformations.
116. DataGraft key feature: reliable data hosting and querying services
• Host data on DataGraft’s reliable, cloud-based semantic graph database
• Share data privately or publicly
• Query data through your own SPARQL endpoint
• Programmatically access the data catalogue
• Operations & maintenance performed on behalf of users
120. The context: Statsbygg
• A public sector administration company
• The Norwegian government's key advisor in construction and property affairs
• Building commissioner, property manager, property developer
• Interest: exploit/share property data in novel ways, for the efficiency and sustainability of the property included in the government's civil estate
Example: Reporting state-owned real estate properties in Norway
121. Example: Reporting state-owned real estate properties in Norway (cont’d)
Report (status quo):
• A hard copy of 314 pages and a PDF file
• 6 person-months of effort
• Data collection with spreadsheets
• Quality assurance through e-mail and phone correspondence
Pains: time consuming, poor data quality, static report without live updating
Reporting service:
• Live service
• Efficient sharing of data
• Simplified integration with external datasets
• Live updating
• Reliable access
• …
3rd party services:
• Risk and vulnerability analysis, e.g. buildings affected by flooding
• Analysis of leasing prices
123. Demo Scenario
• Interactively create tabular data transformations
• Reuse/extend data transformations (incl. data annotations)
• RDF data publication and querying
• Integrating and visualising data from different sources
• (Using 3rd party tools with DataGraft)
126. Benefits of DataGraft in use cases
• Simplified data publishing process
• Integration with external data sources using established web standards
• Data that was not publicly available is now published (e.g. air quality data in Oslo)
• Time-efficient publishing
• Repeatable data transformation process
127. DataGraft and Big Data
• Desired features:
– real-time interactivity
– batch transformation capability for large datasets
We are developing a hybrid solution to support both batch and real-time processing.
129. DataGraft – targeted impacts
• Reduction in costs for organisations that lack sufficient expertise and resources to make their data available
• Reduction of the dependency of data owners on generic cloud platforms to build, deploy and maintain their linked data from scratch
• Increase in the speed of publishing new datasets and updating existing datasets
• Reduction in the cost and complexity of developing applications that use data
• Increase in the reuse of data by providing reliable access to numerous datasets hosted on DataGraft.net
130. Example: the benefit of DataGraft in PLUQI
Before: datasets gathering, data transformation, data provisioning/access, implementing the app
• Gathering enough good datasets
• Designing/implementing
After (with DataGraft): datasets gathering, data transformation, data provisioning/access, implementing the app
1. 23% development cost reduction
• Reduced cost of implementing transformations
• Integrating the process is simpler
2. Able to focus on service quality
131. DataGraft in numbers (as of end of Jan 2016)
• 238 registered users
• 607 registered data transformations (208 public)
• 1828 uploaded files
• 192 public data pages
132. DataGraft in the wild
• Investigating crime data in small geographies
• Used DataGraft to transform data and publish RDF
http://benproctor.co.uk/investigating-crime-data-at-small-geographies/
133. Data Science and DataGraft
Greater Data Science:
1. Data Exploration and Preparation
2. Data Representation and Transformation
3. Computing with Data
4. Data Visualization and Presentation
5. Data Modeling
6. Science about Data Science
“50 years of Data Science” by David Donoho
http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
136. Summary
• DataGraft – an emerging Data-as-a-Service solution for making (linked) data more accessible
– Platform, portal, methodology, APIs
– Online service, functional and documented
– Validated through several use cases
• Key features:
– Support for sharable/repeatable/reusable data transformations
– Reliable RDF Database-as-a-Service