The 3TU.Datacentrum repository of research data hosts datasets as well as other objects representing measuring devices, locations, time periods and the like. Virtually all metadata is in rdf so the repository can be approached as an rdf graph. We will show how this is implemented with Fedora Commons, heavily leaning on rdf queries and xslt2.0. As a result of this architecture, it is relatively easy to make the repository linked-data-enabled by generating OAI/ORE resource maps.
While most of the metadata is rdf, most of the data is in NetCDF. Although not very well known in the library world, this is a very popular format in various fields of science and engineering. It comes with its own data server, OPeNDAP, which offers a rich API to interact with the data. Our repository is therefore a hybrid Fedora + OPeNDAP setup, and we will show how the two are integrated into a unified view and how they are kept in sync on ingest.
This was presented at the ELAG conference, Palma de Mallorca 2012.
Elag 2012 - Under the hood of 3TU.Datacentrum.
1. Under the hood of 3TU.Datacentrum,
a repository for research data.
abstract
Egbert Gramsbergen
TU Delft Library /
3TU.Datacentrum
e.f.gramsbergen@tudelft.nl
ELAG, 2012-05-17
2. 3TU.Datacentrum
• 3 Dutch TU’s: Delft, Eindhoven, Twente
• Project 2008-2011, going concern 2012-
• Data archive
– 2008-
– “finished” data
– preserve but do not forget usability
– metadata harvestable (OAI-PMH)
– metadata crawlable (OAI-ORE linked data)
– data citable (by DataCite DOI’s)
• Data labs
– Just starting
– Unfinished data + software/scripts
4. Fedora digital objects
XML container with “datastreams” containing /
pointing to (meta)data
•3 special RDF datastreams
indexed in triple store
-> query with REST API / SPARQL
•Any number of content datastreams
xml datastreams may be inline,
other datastreams are on a location managed by Fedora
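A minimal sketch of how a SPARQL query could be sent to the triple store through the REST interface. The base URL, the example PID and the choice of the `isPartOf` relation are assumptions made for illustration. The `risearch` parameters (`type`, `lang`, `format`, `query`, `limit`) follow the Fedora 3 resource index search interface. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real installation has its own host and path.
FEDORA_BASE = "http://localhost:8080/fedora"

def risearch_url(sparql, base=FEDORA_BASE, limit=None):
    """Build a GET URL that asks the resource index for tuples matching a SPARQL query."""
    params = {
        "type": "tuples",    # ask for variable bindings rather than triples
        "lang": "sparql",    # query language (ITQL is the other option named on the slide)
        "format": "Sparql",  # SPARQL XML results, convenient as xslt input
        "query": sparql,
    }
    if limit is not None:
        params["limit"] = str(limit)
    return base + "/risearch?" + urlencode(params)

# Hypothetical query: which objects are part of the object example:1?
query = (
    "SELECT ?part WHERE { ?part "
    "<info:fedora/fedora-system:def/relations-external#isPartOf> "
    "<info:fedora/example:1> }"
)
url = risearch_url(query, limit=10)
print(url)
```

Because the result is plain xml behind a url, a stylesheet can open it as a secondary document.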
5. Fedora Content Model Architecture
Content Model object: links to Service Definition(s)
optionally defines datastreams + mime-types
Service Definition object: defines operations (methods) on data objects
incl parameters + validity constraints
Service Deployment object: implements the methods
Requests are handled by some service whose location is known to the Service Deployment
URL: /objects/<data object pid>/methods/<service definition pid>/<method name>[?<params>]
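The URL pattern above can be filled in mechanically. The sketch below assumes a hypothetical base URL, hypothetical PIDs and a hypothetical method name; only the path layout comes from the slide.

```python
from urllib.parse import quote, urlencode

def dissemination_url(base, data_pid, sdef_pid, method, params=None):
    """Fill in /objects/<data object pid>/methods/<service definition pid>/<method name>[?<params>]."""
    url = "{}/objects/{}/methods/{}/{}".format(
        base.rstrip("/"),
        quote(data_pid, safe=":"),   # keep the colon of the PID readable
        quote(sdef_pid, safe=":"),
        quote(method, safe=""),
    )
    if params:
        url += "?" + urlencode(params)
    return url

# All identifiers below are made up for illustration.
example = dissemination_url(
    "http://localhost:8080/fedora",
    "example:dataset1",
    "example:sdef-view",
    "html",
    {"lang": "en"},
)
print(example)
```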
6. Fedora API & Saxon xslt2 service
API’s for viewing and manipulating objects
View API (REST, GET method)
– findObjects
– getDissemination
– getObjectHistory
– listDatastreams
– risearch (query triple store (ITQL, SPARQL))
– …
So everything has a url and returns xml
All methods so far have to return xml or (x)html
xslt is a natural fit
(remember: you can easily open secondary documents aka use the REST API)
xslt2.0 is much more powerful than xslt1.0
With Saxon, you can use Java classes/methods from within xslt
(rarely needed, in 3TU.DC only for spherical trigonometry in geographical calculations)
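The slides do not say which spherical-trigonometry calculation is delegated to Java. As an illustration of the kind of computation involved, here is a sketch of the haversine great-circle distance, written in Python rather than Java. The function name and the use of a mean Earth radius are assumptions, not the 3TU.DC code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# A quarter of the equator: from (0, 0) to (0, 90 degrees east).
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1))
```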
7. 3TU.DC architecture
Saxon for:
•html pages
•rdf for linked data (OAI-ORE)
•KML for maps
•Faceted search forms
•csv, cdl, Excel for datasets
•xml for indexing by SOLR
•xml for Datacite
•xml for PROAI
•… and more
Not in picture:
•PROAI (OAI-PMH service provider)
•DOI registration (Datacite)
8. 3TU.DC architecture [2]
Content Model Architecture and xslt’s in detail
•10 content models
•7 service definition objects with 19 methods
•14 service deployment objects using 32 xslt’s
Left to right: content models, service deployments, methods aka xslt’s, service definitions
Lines: CMA, xslt imports, xml includes. All xslt’s are datastreams of one special xslt object.
9. rdf relations in 3TU.DC
Example relations (namespaces are omitted for brevity)
10. UI as rdf / linked data viewer
This dataset has some metadata and is part of this dataset (with these metadata).
It was calculated from this dataset (with these metadata), measured by this instrument (with these metadata).
11. UI as rdf / linked data viewer [2]
Dilemmas - how far will you go?
•Which relations must be expanded?
•How many levels deep?
•Which inverse relations will you show?
•Show repetitions?
Answer: trial and error
Set of rules for each type of relation
Show enough for context but not too much… it’s a delicate balance
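One way to encode a "set of rules for each type of relation" is a table giving, per relation, how many levels deep it may be followed, combined with a set of already shown objects to suppress repetitions. The sketch below is an assumption about how such rules could look; the relation names, the rule values and the toy graph are all hypothetical.

```python
# Hypothetical rules: how many levels deep each relation may be followed.
RULES = {"isPartOf": 2, "wasDerivedFrom": 1, "measuredBy": 2}

# Toy graph: object -> list of (relation, target) pairs.
GRAPH = {
    "dataset:A": [("isPartOf", "dataset:B"), ("wasDerivedFrom", "dataset:C")],
    "dataset:B": [("isPartOf", "dataset:D")],
    "dataset:C": [("measuredBy", "instrument:X"), ("wasDerivedFrom", "dataset:E")],
    "dataset:D": [("isPartOf", "dataset:F")],
}

def expand(graph, node, rules, depth=0, seen=None):
    """Return the nested relations to display for `node`, honouring the depth rules."""
    seen = set() if seen is None else seen
    seen.add(node)
    view = {}
    for relation, target in graph.get(node, []):
        too_deep = depth >= rules.get(relation, 0)  # unknown relations are never expanded
        if too_deep or target in seen:              # skip repetitions
            continue
        view.setdefault(relation, {})[target] = expand(graph, target, rules, depth + 1, seen)
    return view

tree = expand(GRAPH, "dataset:A", RULES)
print(tree)
```

In this toy example `dataset:E` and `dataset:F` are left out because their relations exceed the allowed depth. Tuning the numbers in the rule table is where the trial and error comes in.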
13. NetCDF
NetCDF: data format + data model
•Developed by UCAR (University Corporation for Atmospheric Research, USA),
roots at NASA, 1987.
•Comes with set of software tools / interfaces for programming
languages.
•Binary format, but data can be dumped in ASCII or xml
•Used mainly in geosciences (e.g. climate forecast models)
•BUT: fit for almost any type of numeric data + metadata
•Core data type: multidimensional array
>90% of 3TU.DC data is in NetCDF
14. NetCDF [2]
Example: T(x,y,z,t) - what can we say in NetCDF?
Variable T (4D array)
Variables x,y,z,t (1D arrays)
Dimensions x,y,z,t
Attributes: creator=‘me’
Attributes: x.units=‘m’, y.units=‘m’, z.units=‘m’, t.units=‘s’, T.units=‘deg_C’
T.name=‘Temperature’, T.error=0.1, etc…
You may invent your own attributes or use conventions (e.g. CF)
newer NetCDF versions:
•More complex / irregular / nested structures
•built-in compression by variable
boost compression with “leastSignificantDigit=n”
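The T(x,y,z,t) example can be written down in CDL, the text notation used when a NetCDF header is dumped as ASCII. The sketch below renders that header from plain Python dictionaries. The dimension sizes, the data types and the dataset name are assumptions, while the attribute names and values are the ones from the slide.

```python
def _literal(value):
    """Quote strings, leave numbers as they are."""
    return '"{}"'.format(value) if isinstance(value, str) else repr(value)

def to_cdl(name, dims, variables, global_attrs):
    """Render dimensions, variables and attributes as a CDL header."""
    lines = ["netcdf {} {{".format(name), "dimensions:"]
    for dim, size in dims.items():
        lines.append("    {} = {} ;".format(dim, size))
    lines.append("variables:")
    for var, (dtype, var_dims, attrs) in variables.items():
        lines.append("    {} {}({}) ;".format(dtype, var, ", ".join(var_dims)))
        for key, value in attrs.items():
            lines.append("        {}:{} = {} ;".format(var, key, _literal(value)))
    lines.append("// global attributes:")
    for key, value in global_attrs.items():
        lines.append("        :{} = {} ;".format(key, _literal(value)))
    lines.append("}")
    return "\n".join(lines)

# Dimension sizes and types are made up; attributes follow the slide.
dims = {"x": 10, "y": 10, "z": 5, "t": 100}
variables = {
    "x": ("double", ["x"], {"units": "m"}),
    "y": ("double", ["y"], {"units": "m"}),
    "z": ("double", ["z"], {"units": "m"}),
    "t": ("double", ["t"], {"units": "s"}),
    "T": ("float", ["t", "z", "y", "x"],
          {"units": "deg_C", "name": "Temperature", "error": 0.1}),
}
cdl = to_cdl("example", dims, variables, {"creator": "me"})
print(cdl)
```

Each coordinate variable shares its name with the dimension it describes, which is how the 1D arrays x, y, z, t become the axes of the 4D array T.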
15. OPeNDAP
OPeNDAP: protocol to talk to NetCDF (and similar) data over internet
THREDDS: server that speaks OPeNDAP
•Internal metadata directly visible on site
•APIs for all main programming languages
•Queries to obtain:
– cross-sections (slices, blocks)
– samples (take only 1 in n points)
– aggregated datasets (e.g. glue together consecutive time series)
Queries are handled server-side
(Datafiles in 3TU.DC are up to 100GB)
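Slices and samples are requested by appending a constraint expression to the dataset URL, so that only the selected part travels over the network. The sketch below builds such a URL using the DAP2 index syntax `[start:stride:stop]`, in which the stop index is inclusive. The helper names are hypothetical, the dataset URL and variable name are taken from the python example in this transcript, and no request is sent.

```python
def hyperslab(start, stop, stride=1):
    """DAP2 index range [start:stride:stop]; note that stop is inclusive."""
    return "[{}:{}:{}]".format(start, stride, stop)

def subset_url(dataset_url, variable, ranges, response="ascii"):
    """Append a response suffix and a constraint expression to a dataset URL."""
    constraint = variable + "".join(hyperslab(*r) for r in ranges)
    return "{}.{}?{}".format(dataset_url, response, constraint)

base = ("http://opendap.tudelft.nl/thredds/dodsC/data2/darelux/maisbich/"
        "Tcalibrated/2008/08/Tcalibrated2008_08.nc")

# Block: first 2000 time steps, first 150 positions (like T[:2000, :150]).
block = subset_url(base, "temperature", [(0, 1999), (0, 149)])
# Sample: only 1 in 10 time steps of the same block.
sample = subset_url(base, "temperature", [(0, 1999, 10), (0, 149)])
print(block)
print(sample)
```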
16. OPeNDAP python example
import matplotlib.pyplot as plt
from pydap.client import open_url
year = '2008'
month = '08'
myurl = ('http://opendap.tudelft.nl/thredds/dodsC/data2/darelux/maisbich/Tcalibrated/'
         + year + '/' + month + '/Tcalibrated' + year + '_' + month + '.nc')
dataset = open_url(myurl) # make connection
print(list(dataset.keys())) # inspect dataset
T = dataset['temperature'] # choose a variable
print(T.shape) # inspect the dimensions of this variable
T_red = T[:2000,:150] # take only a part (subset is made server-side)
T_temp = T_red.array # the temperature values
T_time = T_red.time # the matching time axis
T_dist = T_red.distance # the matching distance axis
mesh = plt.pcolormesh(T_dist[:],T_time[:],T_temp[:]) # let’s make a nice plot
mesh.axes.set_title('water temperature Maisbich [deg C]')
mesh.axes.set_xlabel('distance [m]')
mesh.axes.set_ylabel('time [days since '+year+'-'+month+'-01T00:00:00]')
mesh.figure.colorbar(mesh)
mesh.figure.savefig('maisbich-'+year+'-'+month+'.png')
mesh.figure.clf() # clear the figure for the next plot
17. OPeNDAP catalogs
Datasets are organized in catalogs (catalog.xml)
•Usually (not necessarily) maps to folder
•Contains location, size, date, available services of datasets
Catalogs are our hook to Fedora
catalog.xml Fedora object
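To use a catalog as a hook, the location, size and date of each dataset have to be read from catalog.xml. The sketch below does this with the standard library for a small, invented catalog in the THREDDS InvCatalog 1.0 namespace. The sample content and the function name are assumptions, and the step that turns each entry into a Fedora object is not shown.

```python
import xml.etree.ElementTree as ET

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
NS = {"t": THREDDS_NS}

# Invented sample catalog, for illustration only.
SAMPLE = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="example">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="2008/08">
    <dataset name="Tcalibrated2008_08.nc" urlPath="data2/example/Tcalibrated2008_08.nc">
      <dataSize units="Mbytes">12.5</dataSize>
      <date type="modified">2012-01-01T00:00:00Z</date>
    </dataset>
  </dataset>
</catalog>"""

def list_datasets(catalog_xml):
    """Return name, location, size and date of every dataset that has a urlPath."""
    root = ET.fromstring(catalog_xml)
    found = []
    for ds in root.iter("{%s}dataset" % THREDDS_NS):
        url_path = ds.get("urlPath")
        if url_path is None:
            continue  # a container (folder-like) dataset without a data file
        size = ds.find("t:dataSize", NS)
        date = ds.find("t:date", NS)
        found.append({
            "name": ds.get("name"),
            "urlPath": url_path,
            "size": None if size is None else float(size.text),
            "units": None if size is None else size.get("units"),
            "date": None if date is None else date.text,
        })
    return found

entries = list_datasets(SAMPLE)
print(entries)
```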