The volume, variety, and high availability of the data backing decision support systems have impacted business intelligence, the discipline providing strategies to transform raw data into decision-making insights. Such transformation is usually abstracted as the “knowledge pyramid,” in which data collected from the real world are processed into meaningful patterns. In this context, volume, variety, and availability have opened up new challenges in augmenting the knowledge pyramid. On the one hand, the volume and variety of unconventional data (i.e., unstructured, non-relational data generated by heterogeneous sources such as sensor networks) demand novel, type-specific data management, integration, and analysis techniques. On the other hand, the high availability of unconventional data is increasingly attracting data scientists with high competence in the business domain but low competence in computer science and data engineering; enabling their effective participation requires new paradigms to drive and ease knowledge extraction. The goal of this thesis is to augment the knowledge pyramid from two points of view: by including unconventional data and by providing advanced analytics. As to unconventional data, we focus on mobility data and the related privacy issues, providing (de-)anonymization models. As to analytics, we introduce a higher abstraction level than writing formal queries. Specifically, we design advanced techniques that allow data scientists to explore data either by expressing intentions or by interacting with smart assistants in hands-free scenarios.
Augmented reality allows users to superimpose digital information (typically of an operational type) upon real-world entities. The synergy of analytical frameworks and augmented reality opens the door to a new wave of situated OLAP, in which users within a physical environment are provided with immersive analyses of local contextual data. In this paper we propose an approach that, based on the sensed augmented context (provided by wearable and smart devices), suggests a set of relevant analytical queries to the user. This is done by relying on a mapping between the entities that can be recognized by the devices and the elements of the enterprise data, while also taking into account the queries preferred by users during previous interactions in similar contexts. A set of experimental tests evaluates the proposed approach in terms of efficiency and effectiveness.
http://ceur-ws.org/Vol-2324/Paper02-MGolfarelli.pdf
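The context-to-query mapping described above can be sketched as a toy ranking function; the entity-to-cube mapping, candidate queries, preference scores, and weights below are all invented for illustration, not the paper's actual model.

```python
# Hypothetical sketch: entities recognized in the augmented context are
# mapped to enterprise-data elements, and candidate analytical queries
# are ranked by overlap with the current context plus past preferences.

ENTITY_TO_CUBE = {"shelf": "Product", "forklift": "Warehouse", "pallet": "Shipment"}

CANDIDATE_QUERIES = {
    "stock by product":        {"Product"},
    "shipments per warehouse": {"Warehouse", "Shipment"},
    "sales by month":          {"Product", "Date"},
}

# scores learned from previous interactions in similar contexts (made up)
PAST_PREFERENCE = {"stock by product": 0.9, "sales by month": 0.2,
                   "shipments per warehouse": 0.5}

def rank_queries(sensed_entities):
    context = {ENTITY_TO_CUBE[e] for e in sensed_entities if e in ENTITY_TO_CUBE}
    def score(query):
        overlap = len(CANDIDATE_QUERIES[query] & context) / len(CANDIDATE_QUERIES[query])
        return 0.7 * overlap + 0.3 * PAST_PREFERENCE[query]
    return sorted(CANDIDATE_QUERIES, key=score, reverse=True)

print(rank_queries(["shelf", "pallet"]))
```

With a shelf and a pallet in view, queries touching Product and Shipment rise to the top of the suggestion list.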
[ADBIS 2021] - Optimizing Execution Plans in a Multistore (Chiara Forresi)
Multistores are data management systems that enable query processing across different database management systems (DBMSs); besides the distribution of data, complexity factors like schema heterogeneity and data replication must be resolved through integration and data fusion activities. In a recent work [2], we have proposed a multistore solution that relies on a dataspace to provide the user with an integrated view of the available data and enables the formulation and execution of GPSJ (generalized projection, selection and join) queries. In this paper, we propose a technique to optimize the execution of GPSJ queries by finding the most efficient execution plan on the multistore. In particular, we devise three different strategies to carry out joins and data fusion, and we build a cost model to enable the evaluation of different execution plans. Through the experimental evaluation, we are able to profile the suitability of each strategy to different multistore configurations, thus validating our multi-strategy approach and motivating further research on this topic.
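A minimal sketch of the cost-based choice among alternative join strategies; the three strategies and their cost formulas below are invented stand-ins, not the paper's actual cost model.

```python
# Toy cost model: estimate the cost of each join strategy from input
# cardinalities and pick the cheapest plan (all formulas illustrative).
import math

def cost_merge_join(left_rows, right_rows):
    # sort both inputs, then one linear merge pass
    return (left_rows * math.log2(max(left_rows, 2))
            + right_rows * math.log2(max(right_rows, 2))
            + left_rows + right_rows)

def cost_hash_join(left_rows, right_rows):
    # build a hash table on the smaller input, probe with the larger
    return min(left_rows, right_rows) * 1.5 + max(left_rows, right_rows)

def cost_broadcast_join(left_rows, right_rows, nodes=4):
    # ship the smaller input to every node holding the larger one
    return min(left_rows, right_rows) * nodes + max(left_rows, right_rows)

STRATEGIES = {
    "merge": cost_merge_join,
    "hash": cost_hash_join,
    "broadcast": cost_broadcast_join,
}

def cheapest_plan(left_rows, right_rows):
    return min(STRATEGIES, key=lambda s: STRATEGIES[s](left_rows, right_rows))

print(cheapest_plan(1_000_000, 500))
```

Profiling each strategy against different input sizes, as the paper does against different multistore configurations, is what justifies keeping several strategies around.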
Interlinking Data and Knowledge in Enterprises, Research and Society with Lin... (Christoph Lange)
The Linked Data paradigm has emerged as a powerful enabler for data and knowledge interlinking and exchange using standardised Web technologies.
In this article, we discuss our vision of how the Linked Data paradigm can be employed to evolve the intranets of large organisations -- be they enterprises, research organisations or governmental and public administrations -- into networks of internal data and knowledge.
Data integration is still a key challenge, in particular for large enterprises, and the Linked Data paradigm seems a promising approach to integrating enterprise data. Like the Web of Data, which now complements the original document-centred Web, data intranets may help to enhance and add flexibility to the intranets and service-oriented architectures that exist in large organisations. Furthermore, using Linked Data gives enterprises access to 50+ billion facts from the growing Linked Open Data (LOD) cloud. As a result, a data intranet can help to bridge the gap between structured data management (in ERP, CRM or SCM systems) and semi-structured or unstructured information in documents, wikis or web portals, and make all of these sources searchable in a coherent way.
Keynote at Baltic DB&IS 2014, 9 June 2014, Tallinn, Estonia
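The triple-based data model underlying Linked Data can be illustrated in a few lines; the URIs and the `owl:sameAs` link into the LOD cloud below are made up for the example.

```python
# Facts as (subject, predicate, object) triples; internal identifiers
# are interlinked with external ones, so a single pattern query spans
# both enterprise data and the Linked Open Data cloud.

triples = {
    ("ex:ACME",  "rdf:type",      "ex:Enterprise"),
    ("ex:ACME",  "ex:usesSystem", "ex:CRM-1"),
    ("ex:CRM-1", "rdf:type",      "ex:CRMSystem"),
    ("ex:ACME",  "owl:sameAs",    "dbpedia:ACME_Corp"),  # link into the LOD cloud
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# everything known about ex:ACME, across internal and external sources
for triple in sorted(match(s="ex:ACME")):
    print(triple)
```

A real deployment would use an RDF store and SPARQL rather than in-memory tuples, but the queryable triple graph is the same idea.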
PREDICTING STOCK PRICE MOVEMENTS BASED ON NEWSPAPER ARTICLES USING A NOVEL DE... (webwinkelvakdag)
QUESTION: What is worth millions of dollars today, but not worth anything tomorrow?
ANSWER: Today's newspaper.
For my master's thesis in the Business Analytics program at the Vrije Universiteit Amsterdam, commissioned by Deloitte, I developed an algorithm that learned to predict stock price movements based on newspaper articles. It took over twenty-six thousand articles to train the model, but in the end it was capable of classifying over 56% of articles correctly.
This presentation walks through the algorithm's pipeline from beginning to end in an intuitive, yet technical manner. After the presentation, visitors should have a global understanding of the components of the algorithm as well as why and how they work. The components discussed include Word2Vec, Long Short-Term Memory (LSTM) networks, and a Convolutional Neural Network.
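A heavily simplified, stdlib-only sketch of the pipeline's shape (the real system uses learned Word2Vec embeddings with LSTM and CNN layers): a toy embedding table, mean pooling, and a hand-set linear classifier stand in for the learned components.

```python
# Toy version of the article -> word vectors -> pooled vector -> class
# pipeline. Embeddings and classifier weights are made up, not trained.

TOY_EMBEDDINGS = {          # 2-d "word vectors" (invented)
    "profit":  ( 1.0,  0.2),
    "growth":  ( 0.8,  0.1),
    "loss":    (-0.9,  0.3),
    "lawsuit": (-0.7,  0.4),
}

def embed(article_words):
    # look up known words and mean-pool their vectors
    vecs = [TOY_EMBEDDINGS[w] for w in article_words if w in TOY_EMBEDDINGS]
    n = len(vecs) or 1
    return tuple(sum(v[i] for v in vecs) / n for i in range(2))

def predict(article_words, weights=(1.0, 0.0), bias=0.0):
    x = embed(article_words)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return "up" if score > 0 else "down"

print(predict("profit growth expected".split()))
print(predict("lawsuit causes loss".split()))
```

The real model replaces mean pooling with sequence-aware LSTM/CNN layers, which is what lets word order influence the prediction.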
Presentation about new data, methods and outputs to create knowledge for innovation policy. Presented at the OECD Blue Sky Conference, 20 September 2016.
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted, machine-learning-based systems relying on a wide range of data sources. To be effective, these graphs must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g., information extraction) and refinement (e.g., link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality, and traceability.
Prov-O-Viz is a visualisation service for provenance graphs expressed using the W3C PROV vocabulary. It uses the Sankey-style visualisation from D3js.
See http://provoviz.org
Single view vs. multiple views scatterplots (IJECEIAES)
Among all available visualization tools, the scatterplot has been deeply analyzed through the years, and many researchers have investigated how to improve it to face new challenges. The scatterplot is considered one of the most functional visual representations of data, due to its relative simplicity compared to other multivariable visualization techniques. Even so, one of the most significant and still unsolved challenges in data visualization consists in effectively displaying datasets with many attributes or dimensions, such as multidimensional or multivariate ones. The focus of this research is to compare the single view and the multiple views visualization paradigms for displaying multivariable datasets using scatterplots. A multivariable scatterplot has been developed as a web application to provide the single view tool, whereas for the multiple views visualization, the ScatterDice web app has been slightly modified and adopted as a traditional, yet interactive, scatterplot matrix. Finally, a taxonomy of tasks for visualization tools has been chosen to define the use case and the tests comparing the two paradigms.
Graph Databases Lifecycle Methodology and Tool to Support Index/Store Versio... (Paolo Nesi)
Abstract— Graph databases are finding a place in many different applications: smart city, smart cloud, smart education, etc. In most cases, these applications imply the creation of ontologies and the integration of a large body of knowledge to build a knowledge base as an RDF KB store, with ontologies, static data, historical data, and real-time data. Most RDF stores are endowed with inferential engines that materialize some knowledge as triples during indexing or querying. In these cases, deleting concepts may imply the removal and change of many triples, especially if those triples model the ontological part of the knowledge base or are referred to by many other concepts. For these solutions, graph database versioning is not provided at the level of the RDF store tool, and it is quite complex and time consuming to address with a black-box approach. In most cases indexing is a time-consuming process, and rebuilding the KB may require long, manually edited scripts that are error prone. Therefore, in order to solve these problems, this paper proposes a lifecycle methodology and a tool supporting the versioning of indexes for RDF KB stores. The proposed solution has been developed on the basis of a number of knowledge-oriented projects such as Sii-Mobility (smart city), RESOLUTE (smart city risk assessment), and ICARO (smart cloud). Results are reported in terms of time saving and reliability.
Keywords — RDF Knowledge base versioning, graph stores versioning, RDF store management, knowledge base life cycle.
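The versioning idea the paper motivates, i.e. snapshotting the store before a destructive change so the KB can be restored without a full re-indexing, can be sketched as follows; this is a pure-Python toy, not the proposed tool.

```python
# Toy versioned triple store: a checkpoint saves the current state so a
# cascading concept deletion can be rolled back without rebuilding.
import copy

class VersionedStore:
    def __init__(self):
        self.triples = set()
        self.versions = []              # stack of snapshots

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def checkpoint(self):
        self.versions.append(copy.deepcopy(self.triples))

    def delete_concept(self, concept):
        # deleting a concept cascades to every triple mentioning it
        self.triples = {t for t in self.triples if concept not in t}

    def rollback(self):
        self.triples = self.versions.pop()

store = VersionedStore()
store.add("City", "hasSensor", "Sensor1")
store.add("Sensor1", "reports", "Temperature")
store.checkpoint()
store.delete_concept("Sensor1")        # removes both triples
print(len(store.triples))
store.rollback()
print(len(store.triples))
```

A real RDF store must also track materialized (inferred) triples, which is precisely what makes deletion and versioning hard in practice.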
DISIT Lab overview: smart city, big data, semantic computing, cloud (Paolo Nesi)
Smart City
• Projects: http://www.disit.org/5501
– Sii-Mobility, http://www.sii-mobility.org
– Service Map: http://servicemap.disit.org
– Social Innovation: Coll@bora http://www.disit.org/5479
– Navigation Indoor/outdoor: Mobile Emergency http://www.disit.org/5404
– Mobility and Transport: TRACE-IT, RAISSS, TESYSRAIL
• Tools: http://www.disit.org/5489
– Data gathering, data mining and reconciliation
– Data reasoning, deduction, prediction
– Smart city ontology and reasoning tools
– Service analysis and recommendations
– Autonomous train operator, train signaling
– Risk analysis, decision support systems
– Mobile Applications
Data Analytics - Big data
• Projects: http://www.disit.org/5501
– Linked Open Graph: http://LOG.disit.org
– Sii-Mobility, http://www.sii-mobility.org
– Service on a number of projects
• Tools: http://www.disit.org/5489
– Open data and Linked Open Data
– LOG LOD service and tools
– Data mining and reconciliation
– Data reasoning, deduction, prediction, decision support
– SN Analysis and recommendations
– User behavior monitoring and analysis
Smart Cloud - Computing
• Projects: http://www.disit.org/5501
– ICARO: http://www.disit.org/5482
– Cloud ontology: http://www.disit.org/5604
– Cloud simulator:
– Smart Cloud: http://www.disit.org/6544
• Tools: http://www.disit.org/5489
– Cloud Monitoring
– Smart Cloud Engine and reasoner,
– Service Level Analyzer and control
– Configuration analysis and checker
– Cloud Simulation
Text and Web Mining
• Projects: http://www.disit.org/5501
– OSIM: http://www.disit.org/5482
– SACVAR: http://www.disit.org/5604
– Blog/Twitter Vigilance
• Tools: http://www.disit.org/5489
– Text and web mining, Natural Language Processing
– Service localization
– Web Crawling
– Competence analysis
– Blog Vigilance, sentiment analysis
Social Media and e-Learning
• Projects: http://www.disit.org/5501
– ECLAP, http://www.eclap.eu
– ApreToscana: http://www.apretoscana.org
– Others: AXMEDIS, VARIAZIONI, SMNET, etc.
– Samsung Smart TV: http://www.disit.org/6534
• Tools: http://www.disit.org/5489
– XLMS, Cross Media Learning System
– IPR and content protection and distribution
– Mobile and SmartTv Applications
– Suggestions and recommendations
– Matchmaking solutions
– Media Tools for cross media content
Mobile Computing
• Projects:
– ECLAP: http://www.eclap.eu
– Mobile Medicine: http://mobmed.axmedis.org
– Mobile Emergency: http://www.disit.org/5500
– Smart City, FODD 2015: http://www.disit.org/6593
– Resolute: Mobiles as sensors
• Tools and support:
– Content distribution: e-learning
– Integrated Indoor/outdoor navigation
– User networking and collaboration
– Service localization
– Smart city and services
– OS: iOS, Android, Windows Phone, etc.
– Tech: IoT, iBeacons, NFC, QR, ….
A STUDY - KNOWLEDGE DISCOVERY APPROACHES AND ITS IMPACT WITH REFERENCE TO COGNI... (ijistjournal)
As we all know, the Internet of Things (IoT) is currently booming in the technology market, and everyone is talking about smart cities, especially in India; the smart city keyword goes hand in hand with IoT. IoT is a small word, but it places a big responsibility on the shoulders of technical people to work with it and extract data from it. IoT interconnects multiple things, both living and non-living, and this communication generates a huge amount of data; in this paper we discuss the tools and techniques used for knowledge discovery over such data.
The Internet of Things and knowledge discovery are two sides of the same coin and go together; in the absence of one, there is no use for the other. This paper also focuses on the types of data and data-generating sources, knowledge discovery from that data, the tools useful for discovering knowledge, and the techniques to be followed to discover meaningful information from huge amounts of data, along with their impact.
Data enrichment is vital for leveraging heterogeneous data sources in various business analyses, AI applications, and data-driven services. Knowledge Graphs (KGs) support the enrichment of heterogeneous data sources by making entities first-class citizens: links to entities help interconnect heterogeneous data pieces or even ease access to external data sources to eventually augment the original data. Data annotation algorithms to find and link entities in reference KGs, as well as to identify out-of-KG entities, have been proposed and applied to different types of data, such as tables and texts. However, despite recent progress in annotation algorithms, their output does not always meet the quality requirements that make the enriched data valuable in downstream applications. As a result, semantic data enrichment remains an effort-consuming and error-prone task. In this seminar, we discuss the relationships between annotation algorithms, data enrichment, and KG construction, highlighting challenges and open problems. In addition, we advocate for a native human-in-the-loop perspective that enables users to control the outcome of the enrichment and, eventually, improve the quality of the enriched data. We focus in particular on the annotation and enrichment of tabular data and briefly discuss the application of a similar paradigm to the enrichment of textual data in the legal domain, e.g., court decisions and criminal investigation documents.
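A toy sketch of the table-annotation step with out-of-KG detection: exact label matching links cells to KG entities, while unmatched cells are flagged for human review (the human-in-the-loop step). The KG content and labels are invented.

```python
# Minimal table-column annotator against a tiny reference "KG":
# matched cells get an entity link; the rest become out-of-KG
# candidates that a user should confirm or correct.

KG_LABELS = {"milan": "kg:Milan", "rome": "kg:Rome", "turin": "kg:Turin"}

def annotate_column(cells):
    annotations, out_of_kg = {}, []
    for cell in cells:
        key = cell.strip().lower()
        if key in KG_LABELS:
            annotations[cell] = KG_LABELS[key]
        else:
            out_of_kg.append(cell)       # needs user confirmation
    return annotations, out_of_kg

ann, unknown = annotate_column(["Milan", "Rome", "Bitonto"])
print(ann)
print(unknown)
```

Real annotators rank many candidate entities per cell and score them by context; the out-of-KG list is where the human-in-the-loop control advocated above pays off.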
Fundamental Areas Of Study In Data Science.pdf (BPBOnline)
Data Science is a broad term that encompasses multiple disciplines. It is a rapidly growing field of study that uses scientific methods to extract meaningful insights from input data. This rapid growth has encouraged researchers to explore the multiple disciplines that data science encompasses.
Let's look at a few of the broad areas that are fundamental to mastering data science.
Ontology Building vs Data Harvesting and Cleaning for Smart-city Services (Paolo Nesi)
Presently, a very large number of public and private data sets are available to local governments. In most cases they are not semantically interoperable, and a huge human effort is needed to create integrated ontologies and a knowledge base for the smart city. The smart city ontology is not yet standardized, and much research work is needed to identify models that can easily support data reconciliation, the management of complexity, and reasoning. In this paper, a system for the ingestion and reconciliation of smart-city data, such as the road graph, the services available on the roads, and traffic sensors, is proposed. The system manages a large volume of data coming from a variety of sources, considering both static and dynamic data. These data are mapped to a smart-city ontology and stored in an RDF store, where they are available to applications via SPARQL queries to provide new services to users. The paper presents the process adopted to produce the ontology and the knowledge base, and the mechanisms adopted for verification, reconciliation, and validation. Some examples of the possible usage of the resulting coherent knowledge base are also offered and are accessible from the RDF store and related services. The article also presents the work performed on reconciliation algorithms and their comparative assessment and selection.
Keywords: smart city, knowledge base construction, reconciliation, validation and verification of knowledge base, smart city ontology, linked open graph.
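The reconciliation step can be illustrated with a toy normalization rule that collapses differently spelled source names onto a single knowledge-base URI; the rule, the records, and the `km4c:` prefix usage are invented for the example.

```python
# Toy reconciliation: the same street arrives from different sources
# with different spellings and must map to one URI in the KB.
import re

def normalize(name):
    name = name.lower().strip()
    name = re.sub(r"\bvia\b", "street", name)   # toy normalization rule
    name = re.sub(r"\s+", " ", name)            # squeeze whitespace
    return name

def reconcile(records):
    uri_of = {}
    for source_name in records:
        key = normalize(source_name)
        uri_of.setdefault(key, f"km4c:{key.replace(' ', '_')}")
    return uri_of

records = ["Via  Roma", "via roma", "VIA ROMA "]
print(reconcile(records))   # all three spellings collapse to one URI
```

Real reconciliation combines several such heuristics (string distance, geo-coordinates, type constraints), which is why the paper compares and selects among algorithms.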
EUDAT Webinar "Organise, retrieve and aggregate data using annotations with B... (EUDAT)
Annotate your research data with B2NOTE (www.eudat.eu):
A note in the margins of a book or a scientific paper, a comment on a manuscript: we are all using annotations to add information to existing physical documents. To offer a similar experience with digital content within the EUDAT Collaborative Data Infrastructure (CDI), we developed a service that allows associating additional information to a file, in a computer-readable format, without changing the file or the data record itself. These digital annotations can thus be searched to organize, retrieve and aggregate files, datasets and documents.
Although B2NOTE is a standalone service, it has been designed to be integrated with the existing EUDAT services. In the first pilot version, B2NOTE allows users to annotate files located in B2SHARE. The service is invoked as a “widget” within the B2SHARE user interface. B2NOTE lets you easily and intuitively create three types of annotations: a semantic tag coming from identified ontology repositories (only BioPortal at the moment, but we are working toward integrating more vocabularies), a free-text keyword that can be used when you cannot find a suitable semantic term, and a free-text comment.
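The three annotation types can be sketched as plain records attached to a file identifier; the field names and values below are illustrative and do not reflect the actual B2NOTE data model.

```python
# Toy model of the three annotation kinds (semantic tag, free-text
# keyword, free-text comment) attached to a file, kept separate from
# the file itself so the record never changes.
from dataclasses import dataclass

@dataclass
class Annotation:
    target_file: str         # e.g. an identifier of a B2SHARE file
    kind: str                # "semantic" | "keyword" | "comment"
    value: str
    ontology_uri: str = ""   # only used by semantic tags

notes = [
    Annotation("rec-42", "semantic", "Homo sapiens",
               ontology_uri="http://example.org/onto/homo_sapiens"),
    Annotation("rec-42", "keyword", "field-campaign-2016"),
    Annotation("rec-42", "comment", "Calibration looks off after day 3."),
]

# annotations can then be searched to retrieve and aggregate files
semantic = [n for n in notes if n.kind == "semantic"]
print(len(semantic))
```

Keeping annotations as separate, searchable records is what enables the organize/retrieve/aggregate use cases described above.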
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI (Big Data Week)
Charles Cai has more than two decades of experience and a track record of global transformational programme deliveries -- from vision and evangelism to end-to-end execution -- in global investment banks and energy trading companies, where he excels at designing and building innovative, large-scale Big Data systems for high-volume, low-latency trading, global Energy Trading & Risk Management, and advanced temporal and geospatial predictive analytics, as Chief Front Office Technical Architect and Head of Data Science. He is also a frequent speaker at Google Campus, Big Data Innovation Summit, Cloud World Forum, Data Science London, QCon London, the MoD CIO Symposium, etc., promoting knowledge and best-practice sharing with audiences ranging from developers and data scientists to CXO-level senior executives from both IT and business backgrounds. He has in-depth knowledge of and experience with the Scala, Python, C# / F#, C++, Node.js, Java, R, and Haskell programming languages in Mobile, Desktop, Hadoop/Spark, Cloud, IoT/MCU and BlockChain settings, and holds TOGAF9, EMC-DS, AWS CNE4 and other certifications.
Open Data Day 2016, Km4City, the university as an aggregator of Open Data of the ... (Paolo Nesi)
Open Data Day, UNIMORE, Modena, 5 March 2016.
Data aggregation, the Florence experience,
Smart City, Km4City,
Smart Decision Support,
Data Ingestion manager,
Data aggregation,
User profiling on demand.
Mobility: inter-modality, integrated ticketing, sustainability, interchange hubs, exploitation of stations, etc.
Services: government (e.g., SUAP), education, tourism, cultural heritage, health, etc.
Energy: energy saving, emission reduction, pollution, etc.
Environment: air quality, rivers, weather, waste, etc.
… commerce, industry, etc.
... Critical infrastructures, resilience
Collection of static, quasi-static and real-time data, streams
Open data: geo-localized data, services, statistics, censuses, etc.
Operators' private data: under restricted licenses so that other operators cannot profit from their data
People's personal data: profiles and behaviors via apps, IoT, sensors, the web, etc.
Data integration to make the data semantically interoperable and to support deductions (time, space, …)
Traditional open data collectors offer statistical views but are not suited to producing integrated services
Integration with unifying semantic models such as Km4City
Control rooms of metropolitan cities must:
supervise multiple domains and the interdependencies among mobility, energy, communication, services, traffic flows, pedestrian flows, tourism, etc.;
improve their resilience, i.e., their capacity to react and to absorb shocks;
reduce the social costs of mobility for people,
allowing fewer inconveniences and greater efficiency,
greater sensitivity to citizens' needs,
lower emissions, and better environmental conditions;
provide informational and educational paths so that citizens change their non-virtuous habits;
reduce transport costs and travel times for users, operators, and administrations through optimization solutions.
Presentation by Ciro Cattuto, ISI Foundation, at the VI Summit País Digital 2018 (PAÍS DIGITAL)
Talk "Data Science for private and public good" by Ciro Cattuto, Scientific Director, ISI Foundation, given at the VI Summit País Digital 2018, held on 4-5 September in Santiago, Chile.
Data models in precision agriculture: from IoT to big data analytics (University of Bologna)
Data models are abstract models that standardize data formats and relationships.
In other words, data models describe the concepts that belong to a certain application domain (e.g., "Device" and "Farm" are concepts that belong to the "Agriculture" domain, and "Device" is also a concept in the domain of "Smart Cities").
Over the years, many data models (and ontologies) have been produced for the precision agriculture domain.
On the one hand, such models provide standards for data transmission and representation.
On the other hand, these models are not suited for (automated) data integration and analysis, which are core tasks in building decision support systems for precision agriculture -- digitalized systems that support farmers and technicians in making data-driven decisions.
Following the advancements in big data technologies and Internet of Things systems, managing such systems is increasingly hard and requires not only standards to transmit and represent the data, but also means to automatically integrate heterogeneous data into a uniform medium and to automate data analysis and consumption.
While this is a well-known issue in the field of precision agriculture, where data models usually fuel data silos for ad-hoc independent applications (e.g., smart watering management, autonomous weeding systems, vegetation index computation), the synergy with computer science and database techniques could both answer these challenges and open novel research directions.
In this poster, we (i) describe some of the state-of-the-art models for precision agriculture and their application (e.g., from the FIWARE ecosystem), (ii) factorize the limitations and issues of such models (e.g., inter-domain ambiguities, intra-domain inconsistency, wrong modeling practices), (iii) show how computer science techniques (e.g., entity resolution, data normalization, data provenance collection) can answer these issues, and (iv) introduce novel data-driven research directions for building unifying decision support systems.
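A toy example for point (iii): resolving the same device across two data models that describe it differently; the field names and records below are invented for illustration.

```python
# Two records describing the same soil-moisture device, one in an
# NGSI-LD-like shape and one in a vendor-specific shape; a canonical
# key lets us resolve them to the same entity.

fiware_like = {"id": "urn:ngsi-ld:Device:soil-01",
               "controlledProperty": "soilMoisture"}
vendor_like = {"device_name": "SOIL-01",
               "measures": "soil_moisture"}

def canonical_device_key(record):
    raw = record.get("id") or record.get("device_name", "")
    return raw.split(":")[-1].lower()   # strip URN prefixes, lowercase

def same_device(a, b):
    return canonical_device_key(a) == canonical_device_key(b)

print(same_device(fiware_like, vendor_like))   # both resolve to 'soil-01'
```

This kind of entity resolution is exactly the database technique that can bridge the per-application data silos described above.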
[EDBT2023] Describing and Assessing Cubes Through Intentional Analytics (demo)
The Intentional Analytics Model (IAM) has been envisioned as a way to tightly couple OLAP and analytics by (i) letting users explore multidimensional cubes stating their intentions, and (ii) returning multidimensional data coupled with knowledge insights in the form of annotations of subsets of data. The goal of this demonstration is to showcase the IAM approach using a notebook where the user can create a data exploration session by writing describe and assess statements, whose results are displayed by combining tabular data and charts so as to bring the discovered highlights to the user's attention. The demonstration plan will show the effectiveness of the IAM approach in supporting data exploration and analysis and its added value as compared to a traditional OLAP session by proposing two scenarios with guided interaction and letting users run custom sessions.
The Intentional Analytics Model (IAM) has been devised to couple OLAP and analytics by (i) letting users express their analysis intentions on multidimensional data cubes and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of models (e.g., correlations). Five intention operators were proposed to this end; of these, describe and assess have been investigated in previous papers. In this work we enrich the IAM picture by focusing on the explain operator, whose goal is to provide an answer to the user asking "why does a measure show these values?". Specifically, we propose a syntax for the operator and discuss how enhanced cubes are built by (i) finding the polynomials that best approximate the relationship between a measure and the other cube measures, and (ii) highlighting the most interesting one. Finally, we test the operator implementation in terms of efficiency.
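The core of the explain operator above — fitting polynomials between the target measure and the other cube measures, then highlighting the best one — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dict-of-columns cube layout, the degree penalty, and the toy data are all assumptions.

```python
import numpy as np

def explain(cube, target, max_degree=3):
    """Fit polynomials of increasing degree between each other measure and
    the target measure; return the best fit by a degree-penalized R² score.
    Hypothetical helper, not the operator's actual implementation."""
    y = np.asarray(cube[target], dtype=float)
    best = None
    for measure, values in cube.items():
        if measure == target:
            continue
        x = np.asarray(values, dtype=float)
        for degree in range(1, max_degree + 1):
            coeffs = np.polyfit(x, y, degree)
            ss_res = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
            ss_tot = float(np.sum((y - y.mean()) ** 2))
            r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
            score = r2 - 0.01 * degree  # prefer simpler models on near-ties
            if best is None or score > best[0]:
                best = (score, measure, degree, coeffs, r2)
    return best

# Toy cube: revenue is roughly quadratic in quantity.
cube = {"quantity": [1, 2, 3, 4, 5],
        "revenue": [1.1, 4.2, 8.9, 16.1, 25.2]}
score, measure, degree, coeffs, r2 = explain(cube, "revenue")
```

On this toy data the quadratic fit wins, which matches the intuition of surfacing "the most interesting" approximating polynomial.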
Carrying out OLAP analyses in hands-free scenarios requires lean forms of communication between the users and the system, based for instance on natural language. In this paper we introduce VOOL, a framework specifically devised for vocalizing the insights resulting from OLAP sessions. VOOL is self-configurable, extensible, and is aware of the user's intentions expressed by OLAP operators. To avoid overwhelming the user with very long descriptions, we pursue the vocalization of selected insights automatically extracted from query results. These insights are detected by a set of modules, each returning a set of independent insights that characterize data. After describing and formalizing our approach, we evaluate it in terms of efficiency and effectiveness.
The democratization of data access and the adoption of OLAP in scenarios requiring hand-free interfaces push towards the creation of smart OLAP interfaces. We describe COOL, a framework devised for COnversational OLap applications. COOL interprets and translates a natural language dialog into an OLAP session that starts with a GPSJ (Generalized Projection, Selection, and Join) query and continues with the application of OLAP operators. The interpretation relies on a formal grammar and on a repository storing metadata and values from a multidimensional cube. In case of ambiguous text description, COOL can obtain the correct query either through automatic inference or user interactions to disambiguate the text.
[EDBT2021] Conversational OLAP in Action (Best Demo Award)
Demo Paper presented at EDBT 2021: Conversational OLAP in Action (Best Demo Award)
Link to the paper: https://edbt2021proceedings.github.io/docs/p145.pdf
The democratization of data access and the adoption of OLAP in scenarios requiring hand-free interfaces push towards the creation of smart OLAP interfaces. In this demonstration we present COOL, a tool supporting natural language COnversational OLap sessions. COOL interprets and translates a natural language dialogue into an OLAP session that starts with a GPSJ (Generalized Projection, Selection and Join) query. The interpretation relies on a formal grammar and a knowledge base storing metadata from a multidimensional cube. COOL is portable, robust, and requires minimal user intervention. It adopts an n-gram based model and a string similarity function to match known entities in the natural language description. In case of incomplete text description, COOL can obtain the correct query either through automatic inference or through interactions with the user to disambiguate the text. The goal of the demonstration is to let the audience evaluate the usability of COOL and its capabilities in assisting query formulation and ambiguity/error resolution.
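The n-gram and string-similarity matching step described above can be sketched roughly as follows; the mini knowledge base, the bigram limit, the use of difflib, and the 0.8 threshold are illustrative assumptions, not COOL's actual configuration.

```python
from difflib import SequenceMatcher

# Hypothetical mini knowledge base: entity name -> role in the cube.
KB = {"unit sales": "measure", "product": "attribute",
      "store city": "attribute", "2020": "value"}

def ngrams(tokens, max_n=2):
    """All contiguous word n-grams of length 1..max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def match_entities(text, threshold=0.8):
    """Map each n-gram of the input to its most similar KB entity,
    keeping only matches above a similarity threshold (tolerates typos)."""
    matches = []
    for gram in ngrams(text.lower().split()):
        entity, score = max(((e, SequenceMatcher(None, gram, e).ratio())
                             for e in KB), key=lambda p: p[1])
        if score >= threshold:
            matches.append((gram, entity, KB[entity]))
    return matches

# Misspelled input still resolves to the known entities.
found = match_entities("unit salses by prodct in 2020")
```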
[EDBT2021] Assess Queries for Interactive Analysis of Data Cubes
Paper presented at EDBT 2021: Assess Queries for Interactive Analysis of Data Cubes
Link to the paper: https://edbt2021proceedings.github.io/docs/p41.pdf
Assessment is the process of comparing the actual to the expected behavior of a business phenomenon and judging the outcome of the comparison. In this paper, we propose `assess`, a novel querying operator that supports assessment based on the results of a query on a data cube. This operator requires (1) the specification of an OLAP query over a measure of a data cube, to define the target cube to be assessed; (2) the specification of a reference cube of comparison (benchmark), which represents the expected performance of the measure; (3) the specification of how to perform the comparison between the target cube and the benchmark, and (4) a labeling function that classifies the result of this comparison using a set of labels. After introducing an SQL-like syntax for our operator, we formally define its semantics in terms of a set of logical operators. To support the computation of `assess` we propose a basic plan as well as some optimization strategies, then we experimentally evaluate their performance using a prototype.
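The four ingredients of assess (target cube, benchmark, comparison, labeling function) can be illustrated with a minimal sketch; the dict-based cubes, the ratio comparison, and the three-way thresholds below are assumptions for illustration, not the operator's formal semantics.

```python
# A minimal sketch of the assess workflow on dict-based cubes.

def assess(target, benchmark, compare, label):
    """Join target and benchmark on their coordinates, compare the measure
    values, and classify each outcome with a label."""
    return {coord: label(compare(value, benchmark[coord]))
            for coord, value in target.items() if coord in benchmark}

# Target: 2020 sales by city; benchmark: 2019 sales of the same cities.
sales_2020 = {"Rome": 120.0, "Milan": 80.0, "Turin": 45.0}
sales_2019 = {"Rome": 100.0, "Milan": 100.0, "Turin": 44.0}

ratio = lambda t, b: t / b          # how to compare target vs. benchmark

def three_way(r):                   # labeling function on the comparison
    if r >= 1.1:
        return "good"
    if r <= 0.9:
        return "bad"
    return "fair"

labels = assess(sales_2020, sales_2019, ratio, three_way)
```

Here Rome grows 20% ("good"), Milan drops 20% ("bad"), and Turin is flat ("fair"), mirroring the judge-the-outcome step of assessment.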
[SEBD2020] OLAP Querying of Document Stores in the Presence of Schema Variety
Paper presented at SEBD 2020
Document stores are preferred to relational ones for storing heterogeneous data due to their schemaless nature. However, the absence of a unique schema adds complexity to analytical applications. In a previous paper we have proposed an original approach to OLAP on document stores; its basic idea was to stop fighting against schema variety and welcome it as an inherent source of information wealth in schemaless sources. In this paper we focus on the querying phase, showing how queries can be directly rewritten on a heterogeneous collection in an inclusive way, i.e., also including the concepts present in a subset of documents only.
Authors: Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Stefano Rizzi
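The inclusive rewriting idea above — embracing schema variety instead of fighting it — can be illustrated in miniature; the documents, the synonym map, and the explicit "unknown" group are hypothetical, and a real system would push this logic into the document store's query language.

```python
# Illustrative sketch of inclusive querying over a schemaless collection:
# the concept "brand" appears under different local names, or not at all.

docs = [
    {"brand": "Acme", "price": 10},
    {"producer": "Acme", "price": 12},   # same concept, different name
    {"price": 7},                        # concept missing entirely
]

SYNONYMS = {"brand": ["brand", "producer"]}  # concept -> known local names

def group_sum(collection, concept, measure):
    """Group a measure by a concept, resolving local synonyms and keeping
    documents that lack the concept under an explicit 'unknown' group, so
    that partial schemas still contribute to the answer."""
    out = {}
    for doc in collection:
        key = next((doc[name] for name in SYNONYMS.get(concept, [concept])
                    if name in doc), "unknown")
        out[key] = out.get(key, 0) + doc[measure]
    return out

result = group_sum(docs, "brand", "price")
```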
Paper presented at DOLAP 2020: Towards Conversational OLAP
Link to the presentation: https://youtu.be/IfBc1H46s8Y
Abstract: The democratization of data access and the adoption of OLAP in scenarios requiring hand-free interfaces push towards the creation of smart OLAP interfaces. In this paper, we envisage a conversational framework specifically devised for OLAP applications. The system converts natural language text in GPSJ (Generalized Projection, Selection and Join) queries. The approach relies on an ad-hoc grammar and a knowledge base storing multidimensional metadata and cubes values. In case of ambiguous or incomplete query description, the system is able to obtain the correct query either through automatic inference or through interactions with the user to disambiguate the text. Our tests show very promising results both in terms of effectiveness and efficiency.
Authors: Matteo Francia, Enrico Gallinucci, Matteo Golfarelli
[MIPRO2019] Map-Matching on Big Data: a Distributed and Efficient Algorithm with a Hidden Markov Model
In urban mobility, map-matching aims to project GPS points generated by moving objects onto the road segments representing the actual object positions. Up to now, map-matching has found interesting applications in traffic analysis, frequent path extraction, and location prediction. However, state-of-the-art implementations of map-matching algorithms are either private, sequential, or inefficient. In this paper, we propose an extension of an existing serial algorithm of known efficiency, reformulating it in a distributed way to achieve scalability in real big data scenarios. Furthermore, we enhance the robustness of the algorithm, which is based on a first-order Hidden Markov Model, by introducing a smart strategy to avoid gaps in the matched road segments; indeed, this problem may occur under sparse GPS sampling or in urban areas with highly fragmented road segments. Our implementation is based on Apache Spark and is publicly available on GitHub. It is tested against a dataset with 7.8 million GPS points in Milan.
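A toy version of the first-order HMM matching can be sketched as follows; the Gaussian-like emission on point-to-segment distance and the flat transition penalty are simplified stand-ins for the paper's road-network formulation, not its actual model.

```python
# Toy first-order HMM map-matcher: states are candidate road segments per
# GPS fix; Viterbi in log space picks the most likely segment sequence.

def viterbi_match(candidates, sigma=4.0, beta=1.0):
    """candidates: per-fix list of (segment_id, distance_to_segment)."""
    def emission(dist):            # closer segment -> higher log-likelihood
        return -0.5 * (dist / sigma) ** 2
    def transition(a, b):          # staying on the same segment is free
        return 0.0 if a == b else -beta
    scores = [{seg: emission(d) for seg, d in candidates[0]}]
    back = []
    for t in range(1, len(candidates)):
        row, ptr = {}, {}
        for seg, d in candidates[t]:
            prev, s = max(((p, scores[-1][p] + transition(p, seg))
                           for p in scores[-1]), key=lambda x: x[1])
            row[seg], ptr[seg] = s + emission(d), prev
        scores.append(row)
        back.append(ptr)
    path = [max(scores[-1], key=scores[-1].get)]
    for ptr in reversed(back):     # backtrack the best sequence
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Three fixes; the noisy middle one is slightly closer to segment B, but
# the transition penalty keeps the matched path on segment A (no gap).
cands = [[("A", 1.0), ("B", 9.0)],
         [("A", 3.0), ("B", 2.5)],
         [("A", 1.0), ("B", 8.0)]]
matched = viterbi_match(cands)
```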
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data and advanced analytics
1. PhD Computer Science and Engineering
(Knowledge pyramid: World, Data, Information, Knowledge, Wisdom)
Augmenting the Knowledge Pyramid
with Unconventional Data & Advanced Analytics
Matteo Francia
Supervisor: Prof. Matteo Golfarelli
Cycle XXXIII
2. PhD Computer Science and Engineering
Outline
The knowledge pyramid
Augmenting the knowledge pyramid
Part I: Unconventional data
Part II: Advanced analytics
Advanced analytics in hand-free scenarios
Conclusion
Matteo Francia – University of Bologna 2
3. PhD Computer Science and Engineering
BI & the knowledge pyramid
Business intelligence
Strategies to transform raw data into decision-making insights
Transformation is usually abstracted in the “knowledge pyramid” [1, 2]
Data: symbols representing real-world objects (e.g., store product sales)
Information: processed data (e.g., query the product with the highest profit)
Knowledge: understanding (e.g., mine products often sold together)
Wisdom: knowledge in action (e.g., discount products to optimize profits)
Contribution: augmenting the knowledge pyramid
PART I: unconventional data to improve decision-making
PART II: advanced analytics to climb the pyramid
[1] Jennifer E. Rowley: The wisdom hierarchy: representations of the DIKW hierarchy. J. Inf. Sci. 33(2): 163-180 (2007)
[2] Martin Frické: The knowledge pyramid: a critique of the DIKW hierarchy. J. Inf. Sci. 35(2): 131-142 (2009)
(Knowledge pyramid with BI mappings: World → Data (Operational DB, OLTP) → Information (Data Warehouse, OLAP) → Knowledge (Data Mining) → Wisdom (Decisions))
4. PhD Computer Science and Engineering
Part I: unconventional data
Sensing provides data to support contextual decisions
“World” and “Data” levels
New challenges on unconventional data
Unstructured and non-relational
Transformation requires type-aware techniques
5. PhD Computer Science and Engineering
Contribution: mobility data
Mobility data are at the core of location-based systems
Trajectory: temporal sequence of spatial locations
- Uncertainty: positioning errors
- E.g., GPS (~m) vs GSM (~km)
- Sensitivity: 4 points can identify 95% of individuals [1, 2]
- De-anonymize through raw signatures [3]
- De-anonymize through personal gazetteers [4]
Big data applications
- Map matching [5]: project GPS locations to the most-likely road segments
- Profiling [6]: estimate user profiles and income by frequented places
- Precision farming [7]: monitor and coordinate cropping robots
[1] Yves-Alexandre De Montjoye, et al.: Unique in the crowd: The privacy bounds of human mobility. Scientific reports 3 (2013): 1376.
[2] Fengmei Jin, Wen Hua, Matteo Francia, Pingfu Chao, Maria E. Orlowska, Xiaofang Zhou: A Survey and Experimental Study on Privacy-Preserving Trajectory Data Publishing. (Under review, TKDE)
[3] Fengmei Jin, Wen Hua, Thomas Zhou, Jiajie Xu, Matteo Francia, Maria E. Orlowska, Xiaofang Zhou: Trajectory-Based Spatiotemporal Entity Linking. IEEE Trans. on Know. and Data Eng. (2020).
[4] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Nicola Santolini: DART: De-Anonymization of personal gazetteers through social trajectories. J. Inf. Secur. Appl. 55: 102634 (2020)
[5] Matteo Francia, Enrico Gallinucci, Federico Vitali: Map-Matching on Big Data: a Distributed and Efficient Algorithm with a Hidden Markov Model. MIPRO 2019: 1238-1243
[6] Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Summarization and visualization of multi-level and multi-dimensional itemsets. Inf. Sci. 520: 63-85 (2020)
[7] Giuliano Vitali, Matteo Francia, Matteo Golfarelli, Maurizio Canavari: Crop Management with the IoT: An Interdisciplinary Survey. Agronomy 11.1 (2021): 181.
(Figure: example trajectories Tb, Tg, and Tr on a grid with columns A–D and rows 1–4.)
6. PhD Computer Science and Engineering
Part II: advanced analytics
High availability and accessibility attract new data scientists
High competence in business domain
Low competence in computer science
Since the ’70s, data has been retrieved through relational queries
Requiring comprehension of formal languages and DBMSs
Advanced analytics (semi-automatic transformation)
- “Information” and “Knowledge” levels
Advanced analytics
Intention
Hand-free scenarios
Data summaries
7. PhD Computer Science and Engineering
Contribution: advanced analytics
Hand-free scenarios
Augmented OLAP [1]: recommendation in augmented reality
Conversational OLAP [2, 3]: interpret natural language queries
Express high-level analytic abstractions, not queries
E.g., describe [4, 5] interesting patterns of sales
E.g., assess [6] Italian sales against French sales
Data summaries
Summarization based on multidimensional similarity [7]
Conceptual model for data narratives [8, 9]
[1] Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A-BI+: A framework for Augmented Business Intelligence. Inf. Syst. 92: 101520 (2020)
[2] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: COOL: A framework for conversational OLAP. Inf. Syst. 101752. (2021)
[3] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Conversational OLAP in Action. EDBT 2021: 646-649
[4] Antoine Chédin, Matteo Francia, Patrick Marcel, Veronika Peralta, and Stefano Rizzi. The tell-tale cube. ADBIS, 2020.
[5] Matteo Francia, Patrick Marcel, Verónika Peralta, Stefano Rizzi: Enhancing Cubes with Models to Describe Multidimensional Data. Information Systems Frontiers (2021)
[6] Matteo Francia, Matteo Golfarelli, Patrick Marcel, Stefano Rizzi, Panos Vassiliadis: Assess Queries for Interactive Analysis of Data Cubes. EDBT 2021: 121-132
[7] Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Summarization and visualization of multi-level and multi-dimensional itemsets. Inf. Sci. 520: 63-85 (2020)
[8] Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, Panos Vassiliadis: Towards a Conceptual Model for Data Narratives. ER 2020: 261-270
[9] Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, Panos Vassiliadis: Supporting the Generation of Data Narratives. ER Forum/Posters/Demos 2020: 168-172
8. PhD Computer Science and Engineering
Advanced analytics
Augmented OLAP
Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A-BI+: A framework for Augmented Business Intelligence. Inf. Syst. 92: 101520 (2020)
Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Augmented Business Intelligence. DOLAP 2019
9. PhD Computer Science and Engineering
Application scope
Enable analytics on augmented reality
E.g., an inspector analyzing production rates
Sense the context through augmented devices
E.g., smart glasses
Detect interaction and engagement [1]
Produce analytical reports
Relevant to the sensed context
Cardinality constraint
Near real-time
[1] Yu-Chuan Su, Kristen Grauman: Detecting Engagement in Egocentric Video. ECCV (5) 2016: 454-471
10. PhD Computer Science and Engineering
Data Mart: repository of multidimensional cubes
Cubes representing business facts
Data dictionary
What we can recognize (i.e., md-elements)
Context: subset of md-elements
Mappings to sets of md-elements
A-priori interest
What can we sense?
(Figure: data dictionary covering two cubes — Sales, with measures Quantity and Revenues and hierarchies over Date (Month, Year), Product (Type, Category, Family), and Store (City); and Assembly, with measures AssembledItems and AssemblyTime and dimensions Part, Device, and Date.
Example sensed context: <Object, Seat> dist = 1m; <Object, BikeExcite> dist = 2m; <Location, RoomA.1>; <Date, 16/10/2018>; <Role, Controller>.)
11. PhD Computer Science and Engineering
Recommendation
Context interpretation
Given context T over the data dictionary
Project T to an image of fragments I through mappings
- Fragment: intuitively a “small” query
Add the log
Get queries with positive feedback from similar contexts
- Enrich I to I* with unperceived elements from T
Each fragment has contextual and log relevance
Query generation
Cannot directly translate I* into a well-formed query
High cardinality I* = hardly interpretable “monster query”
(Figure: recommendation pipeline — the sensed context (<Object, Seat> dist = 1m; <Object, BikeExcite> dist = 2m; <Location, RoomA.1>; <Date, 16/10/2018>; <Role, Controller>) and the log feed query generation; query selection picks the context-relevant queries to recommend, returned as analytical reports; the user's feedback on them updates the log.)
12. PhD Computer Science and Engineering
Query generation
Generate queries from image I* of fragments
Each fragment is a query
Depth-first exploration with pruning rules
- Query cardinality can only increase
- Some queries are redundant
(Figure: depth-first exploration of the fragments in I*, derived from μ(T). Each node is a candidate query ⟨group-by set, selection, measures⟩; base fragments such as ⟨{Month}, {}, {AssembledItems}⟩, ⟨{Year}, {}, {AssembledItems}⟩, ⟨{Product}, {Product=BikeExcite}, {Quantity}⟩, ⟨{Part,Type}, {Type=Bike}, {}⟩, and ⟨{Part,Product}, {Product=BikeExcite}, {Quantity}⟩ are merged into larger queries such as ⟨{Month,Part,Product}, {Product=BikeExcite}, {Quantity,AssembledItems}⟩ and ⟨{Year,Part,Type}, {Type=Bike}, {AssembledItems}⟩.)
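The depth-first generation with pruning described above can be sketched as follows; the fragment encoding as ⟨group-by, selection, measures⟩ triples, the domain sizes, and the product-based cardinality estimate are illustrative assumptions.

```python
# Simplified DFS query generation: fragments are (group-by, selection,
# measures) triples, merging is component-wise union, and the cardinality
# estimate (product of domain sizes) is a stand-in for the real bound.

DOMAIN = {"Month": 12, "Year": 3, "Product": 50, "Part": 20}

def cardinality(groupby):
    est = 1
    for attr in groupby:
        est *= DOMAIN[attr]
    return est

def merge(q1, q2):
    return tuple(a | b for a, b in zip(q1, q2))

def generate(fragments, max_card):
    """DFS over unions of fragments; prune when the (monotonically
    increasing) cardinality bound is exceeded, and skip duplicates."""
    seen, out = set(), []
    def dfs(query, i):
        if query in seen or cardinality(query[0]) > max_card:
            return                      # pruning: cardinality only grows
        seen.add(query)
        out.append(query)
        for j in range(i, len(fragments)):
            dfs(merge(query, fragments[j]), j + 1)
    for i, frag in enumerate(fragments):
        dfs(frag, i + 1)
    return out

frags = [(frozenset({"Month"}), frozenset(), frozenset({"AssembledItems"})),
         (frozenset({"Product"}), frozenset({("Product", "BikeExcite")}),
          frozenset({"Quantity"})),
         (frozenset({"Year"}), frozenset(), frozenset({"AssembledItems"}))]
queries = generate(frags, max_card=600)
```

With this bound, merging all three fragments (Month × Product × Year) is pruned, while all six cheaper combinations survive.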
13. PhD Computer Science and Engineering
Query selection
Given rq, the number of queries to recommend, maximize the covered fragments and minimize their overlap
E.g., given two queries q and q'
rel(q) + rel(q') − sim(q, q') · (rel(q) + rel(q')) / 2
Weighted Maximum Coverage Problem (NP-hard)
Greedy: iteratively pick the query maximizing relT
- Only a few queries are retrieved, so this is not expensive
(Figure: two queries q and q' covering overlapping subsets of the fragments in I* and μ(T).)
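The greedy selection under the overlap-discounted relevance above can be sketched as follows; the rel values and the use of Jaccard similarity over covered fragments are illustrative assumptions.

```python
# Greedy pick for the weighted maximum coverage formulation: each new
# query's relevance is discounted by its overlap with queries already
# picked, following rel(q) + rel(q') - sim(q, q') * (rel(q) + rel(q')) / 2.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def select(queries, rq):
    """queries: {name: (rel, covered_fragments)}. Greedily pick rq queries
    maximizing relevance minus overlap with the current picks."""
    picked = []
    while len(picked) < rq and len(picked) < len(queries):
        def gain(name):
            rel, frags = queries[name]
            g = rel
            for p in picked:
                p_rel, p_frags = queries[p]
                g -= jaccard(frags, p_frags) * (rel + p_rel) / 2
            return g
        best = max((n for n in queries if n not in picked), key=gain)
        picked.append(best)
    return picked

queries = {
    "q1": (0.9, frozenset({"f1", "f2", "f3"})),
    "q2": (0.8, frozenset({"f1", "f2"})),   # overlaps q1 heavily
    "q3": (0.5, frozenset({"f4"})),         # disjoint, lower relevance
}
chosen = select(queries, rq=2)
```

Although q2 is more relevant than q3 in isolation, its overlap with q1 discounts it, so the disjoint q3 is picked second — covering more fragments with less redundancy.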
14. PhD Computer Science and Engineering
Test set up
Cube with 109 md-elements
Simulate user moving inside a factory
Given a fixed context and a target query
Assess similarity of the proposed query in similar contexts
𝛽: context similarity
sim: proposed/target query similarity
Effectiveness
(Plot: proposed/target query similarity sim vs. context similarity β, with |T| = 12 and rq = 4, for the target context and for similar contexts; with user experience the best query reaches similarity 0.95 after 2 visits and 0.98 after 4, outperforming the best query without user experience.)
15. PhD Computer Science and Engineering
Research directions
OLAP in augmented reality
Support analytical queries in hand-free scenarios
Recommend relevant data facts from a real-world context
Research directions
Provide (fast) query previews
- Estimate the execution time of each query
- Address query caching and multi-query optimization issues
Correlate context-awareness to data quality [3]
- Relevance, amount, and completeness [4]
[3] Stephanie Watts, Ganesan Shankaranarayanan, Adir Even: Data quality assessment in context: A cognitive perspective. Decis. Support Syst. 48(1): 202-211 (2009)
[4] Diane M. Strong, Yang W. Lee, Richard Y. Wang: Data Quality in Context. Commun. ACM 40(5): 103-110 (1997)
16. PhD Computer Science and Engineering
Information
World
Data
Knowledge
Wisdom
Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: COOL: A framework for conversational OLAP. Inf. Syst. 101752. (2021)
Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Conversational OLAP in Action. EDBT 2021 (best demo award): 646-649
Advanced analytics
Conversational OLAP
17. PhD Computer Science and Engineering
Motivation
Enable analytics through natural language
OLAP provides low-level operators [1]
Users need to have knowledge of the multidimensional model…
… or even programming skills
We introduce COOL (COnversational OLap) [3]
Translate natural language into formal queries
[1] Panos Vassiliadis, Patrick Marcel, Stefano Rizzi: Beyond roll-up's and drill-down's: An intentional analytics model to reinvent OLAP. Information Systems. (2019)
[2] Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A-BI+: A framework for Augmented Business Intelligence. Information Systems. (2020)
[3] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: COOL: A Framework for Conversational OLAP. Information Systems. (2021)
18. PhD Computer Science and Engineering
COOL: architecture
(Figure: COOL architecture, offline part — automatic KB feeding loads metadata & values from the DW into the KB; manual KB enrichment adds synonyms and an ontology.)
19. PhD Computer Science and Engineering
COOL: architecture
(Figure: COOL architecture, online part — speech-to-text turns the user's utterance, e.g. "Sales by Customer and Month", into raw text; interpretation, backed by the KB (metadata & values, synonyms, ontology) and a log of statistics, produces an annotated parse forest; disambiguation & enhancement yield a parse tree for either a full query or an OLAP operator; SQL generation and execution & visualization run the query on the DW.)
21. PhD Computer Science and Engineering
Effectiveness
40 users with heterogeneous OLAP skills
Asked to translate (Italian) analytic goals into English
Users provided good feedback on the interface...
... as well as on the interpretation accuracy
OLAP Familiarity | Full Query: Accuracy, Time (s) | OLAP operator: Accuracy, Time (s)
Low              | 0.91, 141                      | 0.86, 102
High             | 0.91, 97                       | 0.92, 71
22. PhD Computer Science and Engineering
COOL in Action!
[3] Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Conversational OLAP in Action. EDBT (best demo award) 2021: 646-649
23. PhD Computer Science and Engineering
Research directions
COOL (Conversational OLAP)
Support the translation of a natural language conversation into an OLAP session
Analyze data without requiring technological skills
- Add conversational capabilities to Augmented OLAP
Towards an end-to-end conversational solution
Create query summaries that can be returned as short vocal messages
Identify insights out of a large amount of data
Identify the “right” storytelling and user-system interaction
24. PhD Computer Science and Engineering
Conclusion
Data scientists have heterogeneous backgrounds
The need for high-level analytic abstractions and interfaces is well-understood
Advanced analytics work towards (semi-)autonomous data transformation
Data management should be (semi-)automated as well
- Orchestrate data platforms, maintain data lineage, profile data
Unconventional mobility data
Handling trajectory variety and semantics is troublesome
- Difference in sampling rates, speed, accuracy, transportation means
- We need a unifying framework for storage and analysis
Privacy of spatio-temporal data is a concern
- Besides protection, we need scalable solutions
25. PhD Computer Science and Engineering
Publications
Journal articles
1. Matteo Francia, Patrick Marcel, Verónika Peralta, Stefano Rizzi: Enhancing Cubes with
Models to Describe Multidimensional Data. Information Systems Frontiers (2021)
2. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: COOL: A framework for
conversational OLAP. Information Systems (2021)
3. Giuliano Vitali, Matteo Francia, Matteo Golfarelli, Maurizio Canavari: Crop Management
with the IoT: An Interdisciplinary Survey. Agronomy (2021)
4. Fengmei Jin, Wen Hua, Thomas Zhou, Jiajie Xu, Matteo Francia, Maria E. Orlowska,
Xiaofang Zhou: Trajectory-Based Spatiotemporal Entity Linking. IEEE Transactions on
Knowledge and Data Engineering (2020).
5. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Nicola Santolini: DART: De-
Anonymization of personal gazetteers through social trajectories. Journal of
Information Security and Applications. 55: 102634 (2020)
6. Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A-BI+: A framework for Augmented
Business Intelligence. Information Systems 92: 101520 (2020)
7. Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Summarization and visualization of
multi-level and multi-dimensional itemsets. Information Sciences 520: 63-85 (2020)
8. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Social BI to understand the debate
on vaccines on the Web and social media: unraveling the anti-, free, and pro-vax
communities in Italy. Social Network Analysis and Mining 9(1): 46:1-46:16 (2019)
Conference papers
1. Matteo Francia, Matteo Golfarelli, Patrick Marcel, Stefano Rizzi, Panos Vassiliadis:
Assess Queries for Interactive Analysis of Data Cubes. EDBT 2021: 121-132
2. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Conversational OLAP in Action.
EDBT 2021: 646-649 (best demo award)
3. Antoine Chédin, Matteo Francia, Patrick Marcel, Verónika Peralta, Stefano Rizzi: The
Tell-Tale Cube. ADBIS 2020: 204-218
4. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Towards Conversational OLAP.
DOLAP 2020: 6-15
5. Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, Panos Vassiliadis:
Supporting the Generation of Data Narratives. ER Forum/Posters/Demos 2020: 168-172
6. Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, Panos Vassiliadis:
Towards a Conceptual Model for Data Narratives. ER 2020: 261-270
7. Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, Stefano Rizzi: OLAP Querying of
Document Stores in the Presence of Schema Variety. SEBD 2020: 128-135
8. Matteo Francia, Matteo Golfarelli, Stefano Rizzi: Augmented Business Intelligence.
DOLAP 2019
9. Matteo Francia, Enrico Gallinucci, Federico Vitali: Map-Matching on Big Data: a
Distributed and Efficient Algorithm with a Hidden Markov Model. MIPRO 2019: 1238-1243
10. Matteo Francia, Matteo Golfarelli, Stefano Rizzi: A Similarity Function for Multi-Level and
Multi-Dimensional Itemsets. SEBD 2018
11. Matteo Francia, Danilo Pianini, Jacob Beal, Mirko Viroli: Towards a Foundational API for
Resilient Distributed Systems Design. FAS*W@SASO/ICCAC 2017: 27-32
26. PhD Computer Science and Engineering
Thank you.
Information
World
Data
Knowledge
Wisdom
Questions?
Editor's Notes
Yves-Alexandre De Montjoye, et al.: Unique in the crowd: The privacy bounds of human mobility. Scientific reports 3 (2013): 1376.
We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly and with a spatial resolution equal to that given by the carrier's antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals.
Systems like Amazon's recommender can use contextual data (e.g., location); however, there are differences both in "method"/"framework" and in "recommendation".
"Method":
- Amazon relies on "more historical" ground truth; we interpret and "mix" a real-time context made up of several interesting objects detected (and/or engaged) by the system.
- Our system is "end-to-end", i.e., it also covers the management and linking of the data used to build the queries.
"Recommendation": formally, we use a hybrid approach (whereas the classic ones are item-based or collaborative):
- Mix of real-time and historical knowledge: we are not strictly log-based (i.e., the context addresses the cold-start problem), whereas Amazon's suggestion is "other users also bought/viewed…".
- Result cardinality, to fit an augmented device.
- Diversification across different queries, not within a single query.
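The hybrid idea in the notes can be sketched as a score that blends real-time context overlap with historical log frequency; the function and entity names below are hypothetical, not the system's actual API.

```python
def score_query(query_entities: set, context_entities: set,
                log_count: int, max_log: int, alpha: float = 0.5) -> float:
    """Blend context relevance (real-time) with log popularity (historical)."""
    overlap = len(query_entities & context_entities) / max(len(query_entities), 1)
    history = log_count / max(max_log, 1)
    return alpha * overlap + (1 - alpha) * history

context = {"machine", "sensor"}              # entities sensed by the device
candidates = {                               # query -> (entities used, past executions)
    "q_downtime": ({"machine"}, 40),
    "q_energy":   ({"sensor", "machine"}, 5),
    "q_sales":    ({"customer"}, 90),
}
max_log = max(n for _, n in candidates.values())
ranked = sorted(candidates,
                key=lambda q: score_query(candidates[q][0], context,
                                          candidates[q][1], max_log),
                reverse=True)
print(ranked)  # context lifts q_downtime and q_energy above the popular q_sales
```

Note how a popular but contextually irrelevant query ranks last, addressing the cold-start bias of purely log-based recommendation.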
DIFF [17] returns the tuples that maximize the difference between the cells of a cube given as input.
Profile the user's exploration to recommend which unvisited parts of the cube to visit next.
The RELAX operator allows to verify whether a pattern observed at a certain level of detail is present at a coarser level of detail too [19].
Alternative operators have also been proposed in the Cinecubes method [7,8]. The goal of this effort is to facilitate automated reporting, given an original OLAP query as input. To achieve this purpose two operators (expressed as acts) are proposed, namely, (a) put-in-context, i.e., compare the result of the original query to query results over similar, sibling values; and (b) give-details, where drill-downs of the original query's groupers are performed.
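The two Cinecubes acts can be sketched as SQL-string generation over a hypothetical sales(city, year, month, amount) table; the function names and schema are illustrative, not the method's actual interface.

```python
def put_in_context(measure: str, grouper: str, sel_attr: str,
                   value: str, siblings: list) -> str:
    """Act (a): compare the original selection to its sibling values."""
    members = ", ".join(f"'{v}'" for v in [value] + siblings)
    return (f"SELECT {sel_attr}, {grouper}, SUM({measure}) FROM sales "
            f"WHERE {sel_attr} IN ({members}) GROUP BY {sel_attr}, {grouper}")

def give_details(measure: str, grouper: str, finer: str,
                 sel_attr: str, value: str) -> str:
    """Act (b): drill down the original query's grouper one level."""
    return (f"SELECT {grouper}, {finer}, SUM({measure}) FROM sales "
            f"WHERE {sel_attr} = '{value}' GROUP BY {grouper}, {finer}")

# Original query: yearly sales in Bologna
print(put_in_context("amount", "year", "city", "Bologna", ["Modena", "Ferrara"]))
print(give_details("amount", "year", "month", "city", "Bologna"))
```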
Jagadish: The linguistic parse trees in our system are dependency parse trees, in which each node is a word/phrase specified by the user while each edge is a linguistic dependency relationship between two words/phrases.
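The tree structure in this note can be sketched with a minimal, hand-written parse (the parse below is fabricated for illustration, not produced by an actual dependency parser): each node is a word/phrase, each edge carries the dependency relation to its parent.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    dep: str = "root"                 # dependency relation to the parent
    children: list = field(default_factory=list)

    def attach(self, word: str, dep: str) -> "Node":
        child = Node(word, dep)
        self.children.append(child)
        return child

# Hand-written parse of "show sales by city": a root verb with an object
# that in turn carries a nominal modifier.
root = Node("show")
sales = root.attach("sales", "obj")
sales.attach("by city", "nmod")

def walk(node: Node, depth: int = 0):
    """Print the tree with indentation reflecting depth."""
    print("  " * depth + f"{node.word} ({node.dep})")
    for c in node.children:
        walk(c, depth + 1)

walk(root)
```

Translating natural language to a query then amounts to mapping such nodes onto schema elements (measures, dimensions) guided by the edges.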