The document discusses the DAME (Data Mining & Exploration) project, which aims to implement data mining applications and services for massive data analysis and exploration using a distributed computing environment. It seeks to standardize data mining methods and make them interoperable within the virtual observatory. The project has developed several web applications and investigates using a plugin architecture and standardized accounting to improve interoperability between applications and minimize data transfer requirements. The goal is to develop a unified data mining application approach for the virtual observatory.
IEEE P2P 2009 - Kalman Graffi - Monitoring and Management of Structured Peer-... – Kalman Graffi
The peer-to-peer paradigm shows the potential to provide the same functionality and quality as client/server based systems, but at much lower cost. In order to control the quality of peer-to-peer systems, monitoring and management mechanisms need to be applied. Both tasks are challenging in large-scale networks with autonomous, unreliable nodes. In this paper we present a monitoring and management framework for structured peer-to-peer systems. It captures the live status of a peer-to-peer network in an exhaustive statistical representation. Using principles of autonomic computing, a preset system state is approached through automated system re-configuration when a quality deviation is detected. Evaluation shows that the monitoring is very precise and lightweight and that preset quality goals are reached and kept automatically.
Edinburgh Data-Intensive Research Data-intensive refers to huge volumes of data, complex patterns of data integration and analysis, and intricate interactions between data and users. Current methods and tools are failing to address data-intensive challenges effectively. They fail for several reasons, all of which are aspects of scalability. The deluge of computational methods and plethora of computational systems prevent effective and efficient use of resources; user interfaces are not adopted at a sufficient rate to satisfy demand for scientific computing; and data and knowledge are created outside suitable contexts for collaborative research to be effective. The Edinburgh Data-Intensive Research group addresses these scalability issues by providing mappings from abstract formulations to concrete and optimised executions of research challenges, by developing intuitive interfaces to enable access to steer these executions, and by developing systems to aid in creating new research challenges. In this talk I will present several exemplars where we have dealt with scalability issues in scientific scenarios.
In collaboration with the Province of Brescia, Italy, we aim to redesign the relationship between four elements: information, the urban space, people and institutions. First, we will innovatively imagine new forms of communication and services to foster learning, knowledge and social inclusion. In particular, we will investigate the use of new media and communication technologies to promote social sustainability and cultural enrichment for location-based communities. Second, we will explore innovative designs for embedding electronics into the urban fabric, as well as into the public transportation system, so that they may promote ubiquitous accessibility to information, culture and knowledge. The ultimate goal of the project is to imagine how new media and mobile technologies can increase the younger population's awareness of environmental problems, foster learning and civic engagement.
Study: The Future of VR, AR and Self-Driving Cars – LinkedIn
We asked LinkedIn members worldwide about their levels of interest in the latest wave of technology: whether they’re using wearables, and whether they intend to buy self-driving cars and VR headsets as they become available. We also asked them about their attitudes to technology and to the growing role of Artificial Intelligence (AI) in the devices that they use. The answers were fascinating – and in many cases, surprising.
This SlideShare explores the full results of this study, including detailed market-by-market breakdowns of intention levels for each technology – and how attitudes change with age, location and seniority level. If you’re marketing a tech brand – or planning to use VR and wearables to reach a professional audience – then these are insights you won’t want to miss.
Presentation of the status of my PhD in 2012 done to ABLE group at Carnegie Mellon.
Years later, from that work appeared:
https://github.com/iTransformers/netTransformer
Cloud computing is facing some serious latency issues due to huge volumes of data that need to be transferred from the place where data is generated to the cloud. For some types of applications, this is not acceptable.
One of the possible solutions to this problem is the idea of bringing cloud services closer to the edge of the network, where data originates. This idea is called edge computing, and it promises to dramatically reduce network latency by acting as a bridge between users and the clouds; as such, it forms the foundation for future interconnected applications.
Edge computing is a relatively new area of research and still faces many challenges, such as geo-organization and a clear separation of concerns, but also remote configuration, a well-defined native applications model, and limited node capacity. Because of these issues, edge computing is hard to offer as a service for future real-time user-centric applications.
This thesis presents the dynamic organization of geo-distributed edge nodes into micro data-centers, which form micro-clouds that can cover any arbitrary area and expand capacity, availability, and reliability. We take cloud organization as a model, adapting it to this different environment with a clear separation of concerns and a native applications model that can leverage the newly formed system.
We argue that the presented model can be integrated into existing solutions or used as a base for the development of future systems.
Furthermore, we give a clear separation of concerns for the proposed model. With this separation of concerns, an edge-native applications model, and a unified node organization in place, we move toward the idea of edge computing as a service, like any other utility in cloud computing.
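As a rough illustration of the kind of geo-organization described above, the following sketch groups edge nodes into micro data-centers by coarse location cells. The function name and the grid-cell scheme are illustrative assumptions, not the thesis's actual algorithm:

```python
def form_micro_clouds(nodes, cell_size=1.0):
    """Hypothetical sketch: group geo-distributed edge nodes into micro
    data-centers by snapping their (lat, lon) to a coarse grid cell, so the
    nodes in each cell pool capacity and cover that area together."""
    clouds = {}
    for name, lat, lon in nodes:
        cell = (int(lat // cell_size), int(lon // cell_size))
        clouds.setdefault(cell, []).append(name)
    return clouds


if __name__ == "__main__":
    nodes = [("a", 0.1, 0.2), ("b", 0.4, 0.9), ("c", 5.0, 5.1)]
    print(form_micro_clouds(nodes))  # nodes a and b share a micro-cloud
```

A real system would of course use network proximity rather than raw coordinates, but the shape of the organization step is the same.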
As the network softwarization trend started by SDN and NFV keeps evolving, the hardware/software continuum becomes more relevant than ever, offering new offloading/acceleration opportunities at node and network-wide scales. This talk will review the evolving transformations behind network softwarization, with a special focus on network refactoring and offloading trends leading to “fluid network planes”, characterized by multiple candidate options for the specific HW/SW embodiment and the location of chained network functions: from the edge to the core, from one administrative provider to another, from programmable silicon to portable lightweight virtualized containers. The talk will overview concrete examples from the literature with a special focus on the role of Machine Learning to assist key (automated) decision-making steps. Lastly, the talk will conclude with a glimpse of ongoing ML work applied to YouTube video QoE prediction in live 5G networks.
Big Data Beyond Hadoop*: Research Directions for the Future – Odinot Stanislas
Michael Wrinn
Research Program Director, University Research Office,
Intel Corporation
Jason Dai
Engineering Director and Principal Engineer,
Intel Corporation
Cloud services are midway through migrating from the traditional topic of unit VM/container/app migration to the management of populations of VMs/containers/apps. Cloud Providers (CPs) today do not yet cooperate with Service Providers (SPs) by providing information on local performance and/or local execution context. On the other hand, CPs attempting to optimize performance for all their services would run into an acute complexity problem; this presentation makes the case by discussing the Virtual Network Embedding (VNE) problem scaled to many concurrent services. One way to resolve the complexity problem is to allow services to manage their own populations, in a Do-It-Yourself (DiY) manner. The recently proposed Cloud Probing technology [1] does just that -- services actively probe e2e network performance of the cloud and optimize themselves. Another recent proposal in [2] discusses the next step -- the Local Hardware Awareness (LHA) technology through which CPs would cooperate with SPs in local discovery and would thereby improve DiY decisions. Among several examples, this presentation will discuss a cloud-based CDN [3] where SPs manage populations while clients aggregate content from several sources via concurrent streams -- referred to as the substream method in the literature on P2P streaming. This author realizes that the LHA technology breaks a fundamental rule in clouds -- blackboxing -- but hopes that the case delivered in this presentation is sufficiently appealing to warrant a standardization effort in this direction.
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex... – Paolo Nesi
Abstract—The recent rapid growth of the World Wide Web and the number of resources available online represent a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, largely represented in unstructured natural language formats. In order to automatically ingest and process such huge amounts of data, single-machine, non-distributed architectures are proving to be inefficient for tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity, and computational power needs have increased significantly, requiring solutions such as distributed frameworks and parallel computing programming paradigms. This paper presents a distributed framework for executing NLP-related tasks in a parallel environment. This has been achieved by integrating the APIs of the widespread GATE open source NLP platform in a multi-node cluster, built upon the open source Apache Hadoop file system. The proposed framework has been evaluated against a real corpus of web pages and documents.
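The paper's Hadoop/GATE pipeline is not reproduced here, but the map-and-merge shape of distributed keyword extraction it relies on can be sketched in plain Python. The tokenizer and stopword list below are simplified stand-ins, not GATE's actual components:

```python
import re
from collections import Counter
from multiprocessing import Pool

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "for"}

def extract_keywords(doc):
    """Map step: tokenize one document and count candidate keywords."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)

def keyword_frequencies(corpus, workers=4):
    """Reduce step: merge per-document counts into corpus-wide totals."""
    with Pool(workers) as pool:
        partials = pool.map(extract_keywords, corpus)
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    corpus = ["Big data mining of the web", "Text mining and NLP of web pages"]
    print(keyword_frequencies(corpus).most_common(3))
```

On a real cluster, the map step would run as Hadoop tasks over HDFS splits rather than local processes, but the per-document independence that makes the problem parallelizable is the same.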
Application-Aware Big Data Deduplication in Cloud Environment – Safayet Hossain
Here I present a paper on Application-Aware Big Data Deduplication in Cloud Environment. It was published by IEEE on 31 May 2017.
Abstract of this paper:
Deduplication has become a widely deployed technology in cloud data centers to improve IT resource efficiency. However, traditional techniques face a great challenge in big data deduplication to strike a sensible tradeoff between the conflicting goals of scalable deduplication throughput and high duplicate elimination ratio. We propose AppDedupe, an application-aware scalable inline distributed deduplication framework in cloud environment, to meet this challenge by exploiting application awareness, data similarity and locality to optimize distributed deduplication with inter-node two-tiered data routing and intra-node application-aware deduplication. It first dispenses application data at file level with an application-aware routing to keep application locality, then assigns similar application data to the same storage node at the super-chunk granularity using a handprinting-based stateful data routing scheme to maintain high global deduplication efficiency, meanwhile balancing the workload across nodes. AppDedupe builds application-aware similarity indices with super-chunk handprints to speed up the intra-node deduplication process with high efficiency. Our experimental evaluation of AppDedupe against the state of the art, driven by real-world datasets, demonstrates that AppDedupe achieves the highest global deduplication efficiency, with a higher global deduplication effectiveness than the high-overhead and poorly scalable traditional scheme, but at an overhead only slightly higher than that of the scalable but low duplicate-elimination-ratio approaches.
Link of this paper:
https://ieeexplore.ieee.org/document/7936577
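The handprinting idea in the abstract can be illustrated with a toy sketch: a super-chunk's handprint is a small set of its chunk fingerprints, and routing sends similar super-chunks to the same node. All names, parameters, and the tie-breaking rule here are illustrative assumptions, not the paper's actual design:

```python
import hashlib

def chunk_fingerprint(chunk):
    """Content-defined fingerprint of a single chunk."""
    return hashlib.sha1(chunk).hexdigest()

def handprint(super_chunk, k=4):
    """The k smallest chunk fingerprints serve as the super-chunk's handprint."""
    return set(sorted(chunk_fingerprint(c) for c in super_chunk)[:k])

def route(super_chunk, node_indices):
    """Stateful routing sketch: pick the node whose similarity index overlaps
    the handprint most, breaking ties toward the smallest index (a crude
    stand-in for load balancing), then update that node's index."""
    hp = handprint(super_chunk)
    best = max(range(len(node_indices)),
               key=lambda i: (len(hp & node_indices[i]), -len(node_indices[i])))
    node_indices[best] |= hp
    return best
```

The key property the sketch preserves is that duplicate or near-duplicate super-chunks are routed to the same node, so deduplication stays local while distinct data spreads across nodes.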
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... – UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Transcript: Selling digital books in 2024: Insights from industry leaders - T... – BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Elevating Tactical DDD Patterns Through Object Calisthenics – Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... – Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes hard work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and Sales – Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GraphRAG is All You need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... – DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Dame ivoa interop_brescia_naples2011
1. searching for KDD in MDS standards…
…the DAME experience
Marianna Annunziatella, Massimo Brescia, Stefano Cavuoti, Raffaele D’Abrusco, George
S. Djorgovski, Ciro Donalek, Mauro Garofalo , Marisa Guglielmo, Omar Laurino,
Giuseppe Longo, Ashish Mahabal, Ettore Mancini, Francesco Manna, Amata Mercurio,
Alfonso Nocella, Maurizio Paolillo, Luca Pellecchia, Sandro Riccardi, Giovanni Vebber,
Civita Vellucci.
Department of Physics – University Federico II – Napoli
INAF – National Institute of Astrophysics – Capodimonte Astronomical Observatory – Napoli
CALTECH – California Institute of Technology - Pasadena
2. Data Mining (KDD) as the Fourth
Paradigm Of Science
Definition
DM is the exploration and analysis of large quantities of data in order
to discover meaningful patterns and rules
M. Brescia et al. – IVOA Interop Meeting – Napoli, May 2011
3. The BoK’s Problem
Limited number of problems due to limited number of reliable BoKs
So far
• Limited number of BoKs (and of limited scope) available
• Painstaking work for each application (e.g. spectroscopic redshifts for photometric redshift training)
• Fine tuning on specific data sets needed (e.g., if you add a band you need to re-train the methods)
• There is a need for standardization and interoperability between data and DM applications
The community believes AI/DM methods are black boxes:
you feed in something, and obtain patterns, trends, i.e. knowledge…
4. What DAME is
DAME Program is a joint effort between University Federico II, INAF-OACN, and Caltech aimed at
implementing (as web applications and services) a scientific gateway for massive data analysis,
exploration and mining, on top of a virtualized distributed computing environment.
http://dame.dsf.unina.it/
Technical and management info
Documents
Science cases
Newsletters
http://dame.dsf.unina.it/beta_info.html
DAMEWARE Web application Beta Version
5. DM 4-rule virtuous cycle
• Finding patterns is not enough
• Science business must:
– Respond to patterns by taking action
– Turn:
• Data into Information
• Information into Action
• Action into Value
• Hence, the Virtuous Cycle of DM:
1. Identify the problem
2. Mine data to transform it into actionable information
3. Act on the information
4. Measure the results
• Virtuous cycle implementation steps:
– Transforming data into information via: hypothesis testing, profiling, predictive modeling
– Taking action via: model deployment, scoring
– Measurement: assessing a model’s stability & effectiveness before it is used
6. DM: 11-step Methodology
The four rules expand into an 11-step strategy, at the base of the DAME concept
1. Translate any opportunity (science case) into DM opportunity (problem)
2. Select appropriate data
3. Get to know the data
4. Create a model set
5. Fix problems with the data
6. Transform data to bring information
7. Build models
8. Assess models
9. Deploy models
10. Assess results
11. Begin again (GOTO 1)
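Condensed back to the four rules, the iterative strategy above can be sketched as a generic driver loop. The stage callables are placeholders supplied by the caller, not DAME APIs:

```python
def run_dm_cycle(identify, mine, act, measure, science_case, rounds=2):
    """Sketch of the virtuous cycle: each round translates the science case
    into a DM problem (informed by the previous outcome), mines it into
    actionable information, acts on it, and measures the result, which feeds
    the next round (step 11: begin again)."""
    outcome = None
    for _ in range(rounds):
        problem = identify(science_case, outcome)   # steps 1-3
        info = mine(problem)                        # steps 4-8
        action = act(info)                          # step 9
        outcome = measure(action)                   # step 10
    return outcome
```

The essential point the loop captures is that measurement is not the end of the process: its output becomes input to the next problem-identification step.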
7. Effective DM process break-down
8. The Black box Infrastructure
In this scenario the DAME (Data Mining & Exploration) project, starting from the
requirements of the astrophysics domain, has investigated the exploration of Massive
Data Sets (MDS) by producing a taxonomy of data mining applications (hereinafter
called functionalities) and collecting a set of machine learning algorithms
(hereinafter called models).
Each functionality-model association defines what we call a "use case", easily
configurable by the user through specific tutorials. At low level, any experiment
launched on the DAME framework, externally configurable through dynamic interactive
web pages, is treated in a standard way, making the specific computing infrastructure
and the specific input data format completely transparent to the user.
So the user does not need to know anything about the computing infrastructure, and
almost nothing about the internal mechanisms of the chosen machine learning model.
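As an illustration of the functionality-model association, a small Python sketch (the taxonomy entries mirror models listed later in this talk, but the data structure and function names are invented, not the actual DAME interface):

```python
# Hypothetical taxonomy: each functionality (DM task) maps to the set of
# models (ML algorithms) that can solve it; a "use case" pairs the two.
FUNCTIONALITIES = {
    "regression": {"MLP+BP", "MLP+QNA", "SVM"},
    "classification": {"MLP+BP", "MLP+GA", "SVM"},
    "clustering": {"K-Means", "SOM", "PPS"},
}

def make_use_case(functionality, model):
    """Return a use-case descriptor, validating the functionality-model pair."""
    models = FUNCTIONALITIES.get(functionality)
    if models is None or model not in models:
        raise ValueError(f"{model!r} is not available for {functionality!r}")
    return {"functionality": functionality, "model": model}

uc = make_use_case("classification", "SVM")
```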
9. DAME Infrastructure
• DR Storage (GRID SE): User & Data Archives (300 TB dedicated);
• DR Execution (GRID UI + GRID CE): DM Models Job Execution (300 multi-core processors);
• Cloud facilities: 16 TB storage, 15 processors.
10. DAME SW Architecture
11. The Available Services
DAMEWARE Web Application Resource
Main service providing via browser a list of algorithms and tools to configure and launch
experiments as complete workflows (dataset creation, model setup and run, graphical/text
output):
• Functionalities: Regression, Classification, Image Segmentation, Multi-layer Clustering;
• Models: MLP+BP, MLP+GA, SVM, MLP+QNA, K-Means (through KNIME), PPS, SOM, NEXT-II;
VOGCLUSTERS
Web Application for data and text mining on globular clusters;
STraDiWA (Sky Transient Discovery Web Application)
Web application to detect variable objects from real or simulated images (under R&D);
WFXT (Wide Field X-Ray Telescope) Transient Calculator
Web service to estimate the number of transient and variable sources that can be detected by
WFXT within the 3 main planned extragalactic surveys, at a given significance threshold;
SDSS (Sloan Digital Sky Survey)
Local mirror website hosting a complete SDSS Data Archive and Exploration System;
12. K-Means (through KNIME)
Diagram: the K-Means KNIME workflow and the DM plug-in component are created
offline; the DMM API component handles their execution in the cloud
execution/storage environment, which produces the output.
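For reference, the model itself is plain K-Means; a minimal pure-Python version is shown below (illustrative only — in DAME the algorithm runs inside the offline-built KNIME workflow, not this code):

```python
# Minimal K-Means for 2-D points: alternate assignment and update steps.
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # initialize from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

data = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids, labels = kmeans(data, k=2)
```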
13. Web 2.0 Features in DAME
Web 2.0? It is a system that breaks with the old model of centralized Web sites
and moves the power of the Web/Internet to the desktop. [J. Robb]
the Web becomes a universal, standards-based integration platform. [S. Dietzen]
14. VO Interoperability scenarios
• DA1 ↔ DA2 (via SAMP):
  – Full interoperability between DAs (Desktop Applications);
  – Local user desktop fully involved (requires computing power).
• DA ↔ WA (via WSC):
  – Full WA → DA interoperability;
  – Partial DA → WA interoperability (such as remote file storing);
  – Massive Data Sets (MDS) must be moved between local and remote apps;
  – Local user desktop partially involved (requires minor computing and storage power).
• WA1 ↔ WA2 (via URI?):
  – Except for URI exchange, no standard interoperability;
  – Different accounting policies;
  – MDS must be moved between remote apps (but with larger bandwidth);
  – No local computing power required.
15. Our vision: improving aspects
• DAs have to become WAs;
• Unique accounting policy (Google/Microsoft like);
• WA1 ↔ WA2 (via plugins):
  – To overcome the MDS flow, apps must be plug&play (e.g. any WAx feature should be pluggable into WAy on demand);
  – No local computing power required: even smartphones can run VO apps.
Requirements
• Standard accounting system;
• No more MDS moving on the web, but just moving Apps, structured as plugin repositories and execution environments;
• Standard modeling of WAs and components to obtain the maximum level of granularity;
• Evolution of the SAMP architecture to extend web interoperability (in particular for the migration of the plugins).
16. Our vision: plugin granularity flow
Diagram: WAx hosts plugins Px-1 … Px-n; WAy hosts plugins Py-1 … Py-n.
WAy obtains a copy of plugin Px-3 from WAx and runs it locally
(step 3: WAy executes Px-3).
This scheme could be iterated and extended involving all standardized web apps.
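A toy Python sketch of this flow, with invented class and method names (DAME defines no such API; the point is only that the plugin moves between web apps instead of the data):

```python
# WAy requests a plugin from WAx, installs a copy, and executes it on
# its own local data — so the massive data sets never leave WAy.

class WebApp:
    def __init__(self, name, plugins):
        self.name = name
        self.plugins = dict(plugins)   # plugin name -> callable

    def request_plugin(self, other, plugin_name):
        """Install a copy of another web app's plugin."""
        self.plugins[plugin_name] = other.plugins[plugin_name]

    def execute(self, plugin_name, data):
        """Run a (possibly imported) plugin on local data."""
        return self.plugins[plugin_name](data)

wax = WebApp("WAx", {"Px-3": lambda data: sum(data) / len(data)})
way = WebApp("WAy", {"Py-1": lambda data: max(data)})

way.request_plugin(wax, "Px-3")                  # only the plugin moves
result = way.execute("Px-3", [1.0, 2.0, 3.0])    # → 2.0
```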
17. The Lernaean Hydra VO KDD App
18. The Lernaean Hydra VO KDD App
After a certain number of such iterations, the VO KDD App scenario:
• WAx and WAy will both host the full plugin set (Px-1 … Px-n and Py-1 … Py-n);
• No different WAs, but simply one WA with several sites (eventually with different GUIs and computing environments);
• All WA sites can become a mirror site of all the others;
• The synchronization of plugin releases between WAs is performed at request time;
• Minimization of the data exchange flow (just a few plugins in case of synchronization between mirrors).
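The request-time synchronization between mirrors can be sketched as follows (invented names and release-number scheme; it only illustrates that a mirror pulls the few plugins it lacks or holds in an older release, rather than moving any data):

```python
# Request-time sync: before serving a request, pull from the remote mirror
# only the plugins whose local release is missing or older.

def sync_at_request(local, remote, requested):
    """Update `local` (plugin -> release) from `remote` for `requested` plugins.

    Returns the list of plugins actually transferred.
    """
    transferred = []
    for name in requested:
        if remote.get(name, -1) > local.get(name, -1):
            local[name] = remote[name]
            transferred.append(name)
    return transferred

way = {"Py-1": 1, "Px-3": 1}                  # local mirror state
wax = {"Px-1": 2, "Px-3": 2}                  # remote mirror state
moved = sync_at_request(way, wax, requested=["Px-3", "Py-1"])  # → ["Px-3"]
```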
19. Conclusions
DAME was not originally conceived to be interoperable with the VO (due to the
lack of suitable standards at the time), but it offers a good benchmark for
planning future developments of KDD on MDS in a VO environment.
1. DAME is just an example of what new ICT (Web 2.0) can do for A&A
KDD problems.
2. A new vision of the KDD App approach, suitable for the VO, must be based
on the minimization of data transfer and the maximization of
interoperability within the VO community.
3. If implemented, the new scheme can reach a wider science
community by giving the opportunity to share data and apps worldwide,
without any particular infrastructure requirements (e.g. by using a
simple smartphone with a low-band connection).
The DAME group is currently involved in the definition of standards and rules, and is
working to modify and adapt the present infrastructure to make it VO compliant.