The document discusses making science more reproducible through provenance. It introduces the W3C PROV standard for representing provenance which describes entities, activities, and agents. Python libraries like prov can be used to capture provenance which can be stored in graph databases like Neo4j that are suitable for provenance graphs. Capturing provenance allows researchers to understand the origins and process that led to results and to verify or reproduce scientific findings.
Microsoft R server for distributed computing โดย กฤษฏิ์ คำตื้อ Technical Evangelist Microsoft (Thailand) Limited ในงาน THE FIRST NIDA BUSINESS ANALYTICS AND DATA SCIENCES CONTEST/CONFERENCE จัดโดย คณะสถิติประยุกต์และ DATA SCIENCES THAILAND
Mining and Managing Large-scale Linked Open DataMOVING Project
Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not provide a fixed, pre-defined schema. Rather, RDF allows for flexibly modeling the data schema by attaching RDF types and properties to the entities. Our schema-level index called SchemEX allows for searching in large-scale RDF graph data. The index can be efficiently computed with reasonable accuracy over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. SchemEX is highly needed as the size of the LOD cloud quickly increases. Due to the evolution of the LOD cloud, one observes frequent changes of the data. We show that also the data schema changes in terms of combinations of RDF types and properties. As changes cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud with about 100 million triples per week for more than three years.
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)Attila Szucs
An introduction to Apache Spark, designed for C# developers. It introduces Mobius, the open-source C# Spark API maintained by Microsoft, and also comes with a real-world use case.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2UkZRIC.
Monal Daxini presents a blueprint for streaming data architectures and a review of desirable features of a streaming engine. He also talks about streaming application patterns and anti-patterns, and use cases and concrete examples using Apache Flink. Filmed at qconsf.com.
Monal Daxini is the Tech Lead for Stream Processing platform for business insights at Netflix. He helped build the petabyte scale Keystone pipeline running on the Flink powered platform. He introduced Flink to Netflix, and also helped define the vision for this platform. He has over 17 years of experience building scalable distributed systems.
Microsoft R server for distributed computing โดย กฤษฏิ์ คำตื้อ Technical Evangelist Microsoft (Thailand) Limited ในงาน THE FIRST NIDA BUSINESS ANALYTICS AND DATA SCIENCES CONTEST/CONFERENCE จัดโดย คณะสถิติประยุกต์และ DATA SCIENCES THAILAND
Mining and Managing Large-scale Linked Open DataMOVING Project
Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not provide a fixed, pre-defined schema. Rather, RDF allows for flexibly modeling the data schema by attaching RDF types and properties to the entities. Our schema-level index called SchemEX allows for searching in large-scale RDF graph data. The index can be efficiently computed with reasonable accuracy over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. SchemEX is highly needed as the size of the LOD cloud quickly increases. Due to the evolution of the LOD cloud, one observes frequent changes of the data. We show that also the data schema changes in terms of combinations of RDF types and properties. As changes cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud with about 100 million triples per week for more than three years.
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)Attila Szucs
An introduction to Apache Spark, designed for C# developers. It introduces Mobius, the open-source C# Spark API maintained by Microsoft, and also comes with a real-world use case.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2UkZRIC.
Monal Daxini presents a blueprint for streaming data architectures and a review of desirable features of a streaming engine. He also talks about streaming application patterns and anti-patterns, and use cases and concrete examples using Apache Flink. Filmed at qconsf.com.
Monal Daxini is the Tech Lead for Stream Processing platform for business insights at Netflix. He helped build the petabyte scale Keystone pipeline running on the Flink powered platform. He introduced Flink to Netflix, and also helped define the vision for this platform. He has over 17 years of experience building scalable distributed systems.
These slides will show how to approach a multi-class (classification) problem using H2O. The data that is being used is an aggregated log of multiple systems that are constantly providing information about their status, connections and traffic. In large organizations, these log datasets can be very huge and unidentifiable due to the number of sources, legacy systems etc. In our example, we use a created response for each source. The use H2O to classify the source of data.
Author Bio: Ashrith Barthur is a Security Scientist at H2O currently working on algorithms that detect anomalous behaviour in user activities, network traffic, attacks, financial fraud and global money movement. He has a PhD from Purdue University in the field of information security, specialized in Anomalous behaviour in DNS protocol.
Don’t forget to download H2O!
http://www.h2o.ai/download/
MySQL performance monitoring using Statsd and Graphite (PLUK2013)spil-engineering
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
Note: this is a placeholder for the presentation next Tuesday at the Percona Live London
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Revolution Analytics
[Presentation by Skylar Lyon at DataWeek 2014, September 17 2014.]
I recently faced the task of how to scale out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in database. What did I do? I offered my favorite type of solution - quick and dirty.
At the outset, I wasn't sure how easy it would be. Nor was I certain of realized performance gains. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata.
This presentation outlines my approach in leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionalize the workflow?
Write Graph Algorithms Like a Boss Andrew RayDatabricks
Graph-parallel algorithms such as PageRank operate on an entire graph at once. Efficient distributed implementations of these algorithms are important at scale. This session will introduce the two main abstractions for these types of algorithms: Pregel and PowerGraph.
Explore how GraphX combines the best of both abstractions and walk through multiple example algorithms. Note: Familiarity with Apache Spark and basic Graph concepts is expected.
Talk I gave at StratHadoop in Barcelona on November 21, 2014.
In this talk I discuss the experience we made with realtime analysis on high volume event data streams.
Case Studies in advanced analytics with RWit Jakuczun
A talk I gave at SQLDay 2017:
About 1,5 years ago Microsoft finalised acquisition of Revolution Analytics – a provider of software and services for R. In my opinion this was one of the most important event for R community. Now it is crucial to present its capabilities to SQL Server community. It will be beneficial for both parties. I will present three case studies: cash optimisation in Deutsche Bank, midterm model for energy prices forecasting, workforce demand optimising. The case studies were implemented with our analytical workflow R Suite that will be also shortly presented.
Seminario "Análisis de Big Data con Tidyverse y Spark: uso en estadística pública". Dentro de curso de postgrado: "Data Science. Applications to Biology and Medicine with Python and R". Universidad de Barcelona. 2020
A talk about data workflow tools in Metrics Monday Helsinki.
Both Custobar (https://custobar.com) and ŌURA (https://ouraring.com) are hiring talented developers. Contact me if you are interested in joining either of companies.
Scaling collaborative data science with Globus and JupyterIan Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
Can you handle The TRUTH ,..? Missing page history of JESUS and Hidden TRUTHHeri kusrianto
Discovered TRUTH and hidden Treasure, let it be known that JESUS was actually a MUSLIM and he taught ISLAM, on the other hand CHRISTIANITY was created by PAUL of Tarsus backed by ROMAN EMPIRE
These slides will show how to approach a multi-class (classification) problem using H2O. The data that is being used is an aggregated log of multiple systems that are constantly providing information about their status, connections and traffic. In large organizations, these log datasets can be very huge and unidentifiable due to the number of sources, legacy systems etc. In our example, we use a created response for each source. The use H2O to classify the source of data.
Author Bio: Ashrith Barthur is a Security Scientist at H2O currently working on algorithms that detect anomalous behaviour in user activities, network traffic, attacks, financial fraud and global money movement. He has a PhD from Purdue University in the field of information security, specialized in Anomalous behaviour in DNS protocol.
Don’t forget to download H2O!
http://www.h2o.ai/download/
MySQL performance monitoring using Statsd and Graphite (PLUK2013)spil-engineering
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
Note: this is a placeholder for the presentation next Tuesday at the Percona Live London
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Revolution Analytics
[Presentation by Skylar Lyon at DataWeek 2014, September 17 2014.]
I recently faced the task of how to scale out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in database. What did I do? I offered my favorite type of solution - quick and dirty.
At the outset, I wasn't sure how easy it would be. Nor was I certain of realized performance gains. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata.
This presentation outlines my approach in leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionalize the workflow?
Write Graph Algorithms Like a Boss Andrew RayDatabricks
Graph-parallel algorithms such as PageRank operate on an entire graph at once. Efficient distributed implementations of these algorithms are important at scale. This session will introduce the two main abstractions for these types of algorithms: Pregel and PowerGraph.
Explore how GraphX combines the best of both abstractions and walk through multiple example algorithms. Note: Familiarity with Apache Spark and basic Graph concepts is expected.
Talk I gave at StratHadoop in Barcelona on November 21, 2014.
In this talk I discuss the experience we made with realtime analysis on high volume event data streams.
Case Studies in advanced analytics with RWit Jakuczun
A talk I gave at SQLDay 2017:
About 1,5 years ago Microsoft finalised acquisition of Revolution Analytics – a provider of software and services for R. In my opinion this was one of the most important event for R community. Now it is crucial to present its capabilities to SQL Server community. It will be beneficial for both parties. I will present three case studies: cash optimisation in Deutsche Bank, midterm model for energy prices forecasting, workforce demand optimising. The case studies were implemented with our analytical workflow R Suite that will be also shortly presented.
Seminario "Análisis de Big Data con Tidyverse y Spark: uso en estadística pública". Dentro de curso de postgrado: "Data Science. Applications to Biology and Medicine with Python and R". Universidad de Barcelona. 2020
A talk about data workflow tools in Metrics Monday Helsinki.
Both Custobar (https://custobar.com) and ŌURA (https://ouraring.com) are hiring talented developers. Contact me if you are interested in joining either of companies.
Scaling collaborative data science with Globus and JupyterIan Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
Can you handle The TRUTH ,..? Missing page history of JESUS and Hidden TRUTHHeri kusrianto
Discovered TRUTH and hidden Treasure, let it be known that JESUS was actually a MUSLIM and he taught ISLAM, on the other hand CHRISTIANITY was created by PAUL of Tarsus backed by ROMAN EMPIRE
AWS re:Invent 2016: Deploying Scalable SAP Hybris Clusters using Docker (CON312)Amazon Web Services
Rent-A-Center’s challenge was to architect, deploy, and manage a mission-critical SAP Hybris ecommerce platform that could scale to 2 million users a month. Together with Flux7, an AWS Advanced Consulting Partner, Rent-A-Center created an AWS-based approach that would help deliver the solution to market faster, in a secure, highly available, PCI-compliant fashion. In this session, we walk through the implementation details of this solution and its challenges, and explore how Rent-A-Center is now able to achieve ROI through agility, scalability, security, and cost savings.
When a new security vulnerability is identified or during a large-scale attack, accurate and fast coordination is critical. While runbooks exist for many of the technical challenges, executing them in concert and filling the gaps between them requires creativity and quick thinking as well as discipline, a strong ability to read situations, and a willingness to make tough decisions.
As a content delivery network, Fastly operates a large internetwork and a global application environment, which face many security threats. Recognizing the impact security events can have, Fastly developed its Incident Command protocol, which it uses to deal with large-scale events. Maarten Van Horenbeeck, a lead on Fastly’s security team, and experienced incident commanders Lisa Phillips and Tom Daly explore how Incident Command was conceived and the protocols that were developed within Fastly to make it work. The three share a number of war stories that illustrate how Incident Command contributes to protecting Fastly, its customers, and the many end users relying on the service. Examples include a major software vulnerability that affected a Linux component in common use across Fastly and a large attack. Maarten, Lisa, and Tom cover in detail the typical struggles a company Fastly’s size runs into when building around-the-clock incident operations and the things Fastly has put in place to make dealing with security incidents easier and more effective.
Using NLP to find contextual relationships between fashion housesSushant Shankar
Understanding how companies and products
are compared online can provide insights for
analyzing a market. This paper proposes a
method to bootstrap entities (‘players’) in a
market and their relationship to each other.
Our corpus was high quality fashion blogs
on eyewear. Our robust system takes the
dataset and discovers competitive associations
between different fashion houses and returns
weights for each association and phrases that
describe them. This is done in three steps:
a) looking for patterns in sentences that are
of a competitive nature, b) using Parts of
Speech tagging and Named Entity Recognizer
to extract singleton and paired entities, and c)
traversing the Typed Dependency Graph to extract
contextual phrases that describe the competitive
association. Out of the top 10 fashion
houses that we extracted, all the relations we
caught were actual competitive relationships
between fashion houses. Of all long contextual
phrases that we extracted, 92.1% were accurate
and informative.
Respond to and troubleshoot production incidents like an saTom Cudd
So it's 4 AM and you just got a call from a panicked executive that the system is down! Oh noes! What do you do? Troubleshoot LIKE AN SA. I know "Systems Administrator" is not the cool industry term anymore, but that mentality for fixing the big live problem, like RIGHT NOW can still help today.
You're probably in the job you're in because you're AWESOME at figuring out what's wrong and fixing problems. But your projects have grown, your team has grown, and the expectations grow with them. How do you deal with these new found responsibilities? LIKE AN SA. There are some simple processes you can put in place to help make your life easier. We'll discuss a framework for incident response, a step-by-step guide for troubleshooting production issues, and how to then learn from these outages to prevent problems from happening again.
Geld verdienen is in 2014 bijna onmogelijk geworden. De concurrentie van internet is alom, prijzen zijn transparant geworden, een langdurende zakelijke relatie is weinig meer waard. Klanten sluiten zich steeds vaker af voor reclame en laten zich bij keuzes adviseren door onbekenden op internet. Hoe slaag je er als bedrijf in om jouw klanten en prospects te bereiken? En hoe kun je je meerwaarde overtuigend brengen?
Of toch?
Het jaar 2014 biedt enorme kansen. Met de nieuwste communicatiemiddelen kun je duizenden mensen in één keer bereiken. Je kunt ze ontroeren, laten lachen en ze betrekken bij wat jóu bezig houdt. Communicatie is meer dan ooit tweerichtingsverkeer.
De manier waarop bedrijven communiceren zal daardoor wel moeten veranderen. Geen monoloog, maar een dialoog. Het vergt een andere aanpak, een andere visie, waarbij openheid en transparantie centraal staan.
360 Inspiratieborrel: trends en kansen voor 2014
Wil jij weten wat jij kunt doen in 2014 om jouw business succesvoller te maken? Download dan deze presentatie. Trendwatcher Bob van Leeuwen en Harald van Engelen nemen je mee in de nieuwste ontwikkelingen in techniek en communicatie.
Zij geven je inspiratie en inzicht over de volgende thema’s en vraagstukken:
wat is het belang van verbindende kernwaarden voor een bedrijf?
welke nieuwe businessmodellen gaan er ontstaan?
welke trends zullen de komende 5 jaar onze wereld vormgeven?
wat betekent dat voor organisaties?
‘In de communicatie van vandaag, groeit de kiem voor de oplossing voor morgen’
vanEngelen: Inspireren + Creëren = Groeien
Next-gen Network Telemetry is Within Your Packets: In-band OAMFrank Brockners
While troubleshooting or planning, did you ever wish to get full insight into which paths *all* your packets take in your network or were you ever asked to prove that your traffic really follows the path you specified by service chaining or traffic engineering? We approach this problem by adding meta-data to *all* packets - "In-band OAM for IPv6" and "path/service-chain verification" are the associated technologies. In-band OAM adds forwarding path information and other information/stats to every data packet - as opposed to relying on probe packets, which is the traditional method that tools like ping or traceroute use. In-band OAM information can either be accessed directly on the router or be available via Netflow. The presentation introduces in-band OAM as a technology and discuss a series of use-cases and deployment scenarios, ranging from proving that all packets traverse a specific path and troubleshooting forwarding issues in networks which use ECMP, over simple approaches to deriving the network traffic matrix, or trend analysis on network parameters such as delay or packet loss, to using iOAM as a tool to optimize forwarding in your network. The technology discussion is complemented references to demos (using Cisco IOS, FD.io/VPP, OpenDaylight Controller etc.) which showcase this new technology at work.
Learn more about Cohesive Networks' virtual networking device with our handy comparison guide. See how VNS3 outshines the rest with enhanced capabilities, functionality and interoperability for any public, private or hybrid cloud.
Interact Differently: Get More From Your Tools Through Exposed APIsKevin Fealey
Most tools are designed with a single function in mind, but can often be leveraged for additional workflows as well. For example, SonarQube is a marketed as a platform to manage code quality, but custom plugins can also enable security testing. At its heart, Jenkins is a build tool, but it can be used for security testing as well. Don't like the security dashboard applications bundled with your commercial security tools? Either Jenkins or SonarQube can replace them! Can't stand viewing your tool exports in Excel? A simple XML transformation can allow you to see your data in a way that makes sense to you.
Welcome to the age of APIs. Nearly every major software product now exposes RESTful APIs, an SDK, command line, or plugin interface. Don't limit yourself to out-of-the-box functionality - customize your tools to work best for you.
This talk will describe some ways exposed APIs and plugin interfaces in commercial and open source products have been used to make work more effective and efficient. We'll demonstrate simple tools that interact with common software packages, like Jenkins, SonarQube, and OWASP ZAP, to streamline workflows and provide better visibility on what matters most to us. And we'll tell you how to get started getting more out of the tools you already use.
Dashboards: Using data to find out what's really going onrouanw
How many people are using your website right now? Which features are their favourite? Are they experiencing errors? Getting stuck? How are your servers performing? Is your code easy to work with? Are you making money? Dashboards are a way to have the answers to these questions all around you, all the time.
Join Rouan as he shows you why building dashboards will change the way you look at software. He’ll share concrete examples of the kinds of dashboards you could build and will show you the tools with which you can build them. He’ll introduce you to principles that will guide you as you make decisions about what you dashboard, how you treat data and how you use data to make decisions.
Provenance as a building block for an open science infrastructureAndreas Schreiber
International Symposium on Grids & Clouds 2018 (ISGC 2018)
Taipei, Taiwan
March 23, 2018
http://indico4.twgrid.org/indico/event/4/session/17/contribution/46
Presentation of the paper "Primers or Reminders? The Effects of Existing Review Comments on Code Review" published at ICSE 2020.
Authors:
Davide Spadini, Gül Calikli, Alberto Bacchelli
Link to the paper: https://research.tudelft.nl/en/publications/primers-or-reminders-the-effects-of-existing-review-comments-on-c
Reveal's Advanced Analytics: Using R & PythonPoojitha B
Learn how you can use Reveal’s R & Python scripting capability to bring advanced data preparation, deeper analytics, and richer visualizations to your users!
Class-oriented programming, as supported by Java, C++ and C#, helps you develop classes for your customer. Object-oriented programming, on the other hand, lets you focus on networks of cooperating objects that work together to create business value.
This talk describes the trygve open-source programming language and its support for real object-oriented programming the way it was envisioned by those who shaped it in its early days. Learn about trygve and maybe even join the community to help evolve it. And if you’re a working developer, some of the ideas carry over into C# and C++.
More on the philosophy and so forth:
* User manual (the intro might help)
* Original “white paper”
* More academic paper
* fulloo web site
* Past version of a similar talk
About the speaker
Jim Coplien is a Certified Scrum Trainer in Denmark and best-selling author, lecturer, and consultant in the areas of software design, object-oriented programming, lean software development process, and agile development. His earlier work was one of the foundations of Scrum and of XP and he is one of the founders of the software pattern discipline. He helps enterprises solve architectural and organisational problems together and challenges people to question practices they do out of habit or popularity, exhorting people to establish empirical and otherwise provable justifications for their practices.
Is there a free lunch for cloud-based evolutionary algorithms?Mario Garcia Valdez
Presented at 2013 IEEE Congress on Evolutionary Computation, Cancún México. June 20th
We present a distributed evolutionary algorithm that uses exclusively cloud services. This presents certain advantages, such as avoiding the acquisition of expensive resources, but at the same time presents the problem of choice between different services at different levels (infrastructure, platform, software) and, finally the actual scalability that can be achieved in a real distributed evolutionary algorithm. These issues
are addressed by creating a pure-cloud version of EvoSpace, a pool-based evolutionary algorithm previously presented by the authors. EvoSpace is tested using the free tier of two services (one for the pool and other for the clients) and also the paying tier, and speedup is measured and its limits assessed. In general, this paper proves that a low-cost distributed evolutionary algorithm system can be created using cloud services that can be set up in very short time, but that major efficiency improvements can be
obtained by switching to the non-free tier, giving another twist to the famous phrase “there is no free lunch”. We also show that using a pool-based algorithm allows to use cloud services more efficiently (and dynamically) than a static or synchronous service.
Introduction to interactive data visualisation using R Shinyanamarisaguedes
Shiny is an R library for building interactive webapps. Shiny allows rapid prototyping and quick production of dashboards and interactive data visualisations. This is especially important in situations where putting a real data-driven prototype in the hands of the end user allows for better refining of requirements before passing off to a web development team. This allows to speed up the delivery process and reducing the dependencies on other teams.
Code and solution to exercises available on github: https://github.com/amguedes/ShinySeminar
A short introduction to reproducible research, reproducibility with R, Docker, and all together for reproducible research using R and Docker containers. Includes demos of Rocker and containerit.
Basic introduction to "R", a free and open source statistical programming language designed to help users analyze data sets by creating scripts to increase automation. The program can also be used as a free substitute for Microsoft Excel.
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...Andreas Schreiber
https://iitdbgroup.github.io/ProvenanceWeek2021/virtual.html
Software repositories contain information about source code, software development processes, and team interactions. We combine the provenance of development processes with code security analysis results to provide fast feedback on the software’s design and security issues. Results from queries of the provenance graph drives the security analysis, which are conducted on certain events—such as commits or pull requests by external contributors. We evaluate our method on Open Source projects that are developed under time pressure and use Germany’s COVID-19 contact tracing app ‘Corona-Warn-App’ as a case study.
https://link.springer.com/chapter/10.1007/978-3-030-80960-7_6
Tracking after Stroke: Doctors, Dogs and All The RestAndreas Schreiber
After having a stroke, I started tracking my vitals signs and weight. I'll share how my data helped me to understand my personal habits and helped my doctors to improve my treatments.
(Show & Tell Talk, 2015 Quantified Europe Conference, Amsterdam)
Space Debris are defunct objects in space, including old space vehicles or fragments from collisions. Space debris can cause great damage to functional space ships and satellites. Thus detection of space debris and prediction of their orbital paths are essential. The talk shows a Python based infrastructure for storing space debris data from sensors and high-throughput processing of that data.
PyData Seattle (26. Juli 2015)
http://seattle.pydata.org/schedule/presentation/35/
Wissenschaft im Rathaus, Köln (02.03.2015)
"Gesundheitsmanagement aus der Ferne ist heute nicht mehr ungewöhnlich. Inzwischen kommunizieren Ärzte mit Patienten, mit Ärzten und mit Betreuungseinrichtungen – ohne dass sie sich von Angesicht zu Angesicht gegenüberstehen. Befunde und Bilddaten werden drahtlos übermittelt. Wir sprechen von Telemedizin. Mehr und mehr machen die Möglichkeiten des Überwachens bestimmter eigener Körperfunktionen (Self-Tracking) von sich reden.
Andreas Schreiber zeigt, welche „Self-Tracking-Systeme“ bereits genutzt werden und an welchen neuen Entwicklungen derzeit gearbeitet wird."
(http://www.koelner-wissenschaftsrunde.de/wissenschaft-erleben/aktuell-koelner-themenjahr-wissenschaft-erleben/2015-gesellschaft-im-wandel/wir-vortrag-4/)
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is highly conserved process of posttranscriptional gene silencing by which double stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) is reported in a wide range of eukaryotes ranging from worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non- coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the
development of the worm C. elegans.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi ( non coding RNA)
MiRNA
Length (23-25 nt)
Trans acting
Binds with target MRNA in mismatch
Translation inhibition
Si RNA
Length 21 nt.
Cis acting
Bind with target Mrna in perfect complementary sequence
Piwi-RNA
Length ; 25 to 36 nt.
Expressed in Germ Cells
Regulates trnasposomes activity
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is large(>500kD) RNA multi- protein Binding complex which triggers MRNA degradation in response to MRNA
Unwinding of double stranded Si RNA by ATP independent Helicase
Active component of RISC is Ago proteins( ENDONUCLEASE) which cleave target MRNA.
DICER: endonuclease (RNase Family III)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN :
1.PAZ(PIWI/Argonaute/ Zwille)- Recognition of target MRNA
2.PIWI (p-element induced wimpy Testis)- breaks Phosphodiester bond of mRNA.)RNAse H activity.
MiRNA:
The Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression .
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for the ultra-fast high-resolution imaging of cellular processes over time and space and were studied in its natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provide insights into the progression of disease, response to treatments or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enables researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allows for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancements of novel therapeutic strategies.
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that flls the heliosphere originates from multiple
sources in the solar corona and is highly structured. It is often described
as high-speed, relatively homogeneous, plasma streams from coronal
holes and slow-speed, highly variable, streams whose source regions are
under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify
solar wind sources and understand what drives the complexity seen in the
heliosphere. By combining magnetic feld modelling and spectroscopic
techniques with high-resolution observations and measurements, we show
that the solar wind variability detected in situ by Solar Orbiter in March
2022 is driven by spatio-temporal changes in the magnetic connectivity to
multiple sources in the solar atmosphere. The magnetic feld footpoints
connected to the spacecraft moved from the boundaries of a coronal hole
to one active region (12961) and then across to another region (12957). This
is refected in the in situ measurements, which show the transition from fast
to highly Alfvénic then to slow solar wind that is disrupted by the arrival of
a coronal mass ejection. Our results describe solar wind variability at 0.5 au
but are applicable to near-Earth observatories.
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Richard's entangled aventures in wonderlandRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink on data can be made machine and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available into standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and those of others in the field, creates a baseline for building trustworthy and easy to deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest
imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters
spanning 0.4−0.9µm) and novel JWST images with 14 filters spanning 0.8−5µm, including 7 mediumband filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data
at > 2.3µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and
30.3-31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric
redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts
z = 11.5 − 15. These objects show compact half-light radii of R1/2 ∼ 50 − 200pc, stellar masses of
M⋆ ∼ 107−108M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr−1
. Our search finds no candidates
at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to
infer the properties of the evolving luminosity function without binning in redshift or luminosity that
marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the
impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results,
and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5
from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical
models for evolution of the dark matter halo mass function.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing the study of interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflect spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light received by the analyte.
1. Reproducible Science with Python
Andreas Schreiber
Department for Intelligent and Distributed Systems
German Aerospace Center (DLR), Cologne/Berlin
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 1
2. Simulations, experiments, data analytics, …
Science
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 2
SIMULATION FAILED
3. Reproducing results relies on
• Open Source codes
• Code reviews
• Code repositories
• Publications with code
• Computational environment
captured (Docker etc.)
• Workflows
• Open Data formats
• Data management
• (Electronics) laboratory notebooks
• Provenance
Reproducible Science
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 3
4. Provenance is information about entities, activities, and
people involved in producing a piece of data or thing,
which can be used to form assessments about its
quality, reliability or trustworthiness.
PROV W3C Working Group
https://www.w3.org/TR/prov-overview
Provenance
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 4
5. W3C Provenance Working Group:
https://www.w3.org/2011/prov
PROV
• The goal of PROV is to enable the wide publication and
interchange of provenance on the Web and other information
systems
• PROV enables one to represent and interchange provenance
information using widely available formats such as RDF and XML
PROV
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 5
6. Entities
• Physical, digital, conceptual, or other kinds of things
• For example, documents, web sites, graphics, or data sets
Activities
• Activities generate new entities or
make use of existing entities
• Activities could be actions or processes
Agents
• Agents takes a role in an activity and
have the responsibility for the activity
• For example, persons, pieces of software, or organizations
Overview of PROV
Key Concepts
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 6
7. PROV Data Model
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 7
Activity
Entity
Agent
wasGeneratedBy
used
wasDerivedFrom
wasAttributedTo
wasAssociatedWith
8. Example Provenance
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 8
https://provenance.ecs.soton.ac.uk/store/documents/113794/
9. Storing Provenance
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 9
Provenance
Store
Provenance recording
during runtime
Applications
/ Workflow
Data (results)
10. Some Storage Technologies
• Relational databases and SQL
• XML and Xpath
• RDF and SPARQL
• Graph databases and Gremlin/Cypher
Services
• REST APIs
• ProvStore (University of Southampton)
Storing and Retrieving Provenance
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 10
11. ProvStore
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 11
https://provenance.ecs.soton.ac.uk/store/
12. Provenance is a directed acyclic graph (DAG)
Graphs
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 12
A
B
E
F
G
D
C
13. Naturally, graph databases are a good technology for storing
(Provenance) graphs
Many graph databases are available
• Neo4J
• Titan
• ArangoDB
• ...
Query languages
• Cypher
• Gremlin (TinkerPop)
• GraphQL
Graph Databases
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 13
14. • Open-Source
• Implemented in Java
• Stores property graphs
(key-value-based, directed)
http://neo4j.com
Neo4j
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 14
15. Depends on your application (tools, languages, etc.)
Libraries for Python
• prov
• provneo4j
• NoWorkflow
• …
Tools
• Git2PROV
• Prov-sty for LaTeX
• …
Gathering Provenance
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 15
16. Python Library prov
https://github.com/trungdong/prov
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 16
from prov.model import ProvDocument
# Create a new provenance document
d1 = ProvDocument()
# Entity: now:employment-article-v1.html
e1 = d1.entity('now:employment-article-v1.html')
# Agent: nowpeople:Bob
d1.agent('nowpeople:Bob')
# Attributing the article to the agent
d1.wasAttributedTo(e1, 'nowpeople:Bob')
d1.entity('govftp:oesm11st.zip',
{'prov:label': 'employment-stats-2011',
'prov:type': 'void:Dataset'})
d1.wasDerivedFrom('now:employment-article-v1.html',
'govftp:oesm11st.zip')
# Adding an activity
d1.activity('is:writeArticle')
d1.used('is:writeArticle', 'govftp:oesm11st.zip')
d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')
23. > PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 23
24. LaTeX Annotations
prov-sty for LaTeX
https://github.com/prov-suite/prov-sty
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 24
begin{document}
provAuthor{Andreas Schreiber}{http://orcid.org/0000-0001-5750-5649}
provOrganization{German Aerospace Center (DLR)}{http://www.dlr.de}
provTitle{A Provenance Model for Quantified Self Data}
provProject
{PROV-SPEC (FS12016)}
{http://www.dlr.de/sc/desktopdefault.aspx/tabid-8073/}
{http://www.bmwi.de/}
25. Electronic Laboratory Notebook
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 25
26. Query „Who worked on experiment X?“
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 26
$experiment = g.key($_g, 'identifier', X)
$user = $experiment/inE/inV/outE[@label = controlled_by]
27. Practical Use Case – Archive all Data of a Paper
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 27
https://github.com/DLR-SC/DataFinder
28. New PROV Library for Python in development
• https://github.com/DLR-SC/prov-db-connector
• Connectors for Neo4j implemented, ArangoDB planned
• APIs in REST, ZeroMQ, MQTT
Trusted Provenance
• Storing Provenance in Blockchains
Provenance for people
• New approaches for visualization
• For example, PROV Comics
Current Research and Development
> PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 28
29. > PyCon DE 2016 > Andreas Schreiber • Reproducible Science with Python > 29.10.2016DLR.de • Chart 29
Thank You!
Questions?
Andreas.Schreiber@dlr.de
www.DLR.de/sc | @onyame