Talk for first-year PhD students at the CRG. The goal of the talk was to present scenarios that students will likely face and that can compromise reproducibility and efficiency in the analysis of data in the life sciences. Importantly, asking the questions is probably more important than the answers given.
Reproducibility and automation of the machine learning process - Denis Dus
A talk about organizing the machine learning process in practice. Conceptual and technical aspects are discussed, along with an introduction to the Luigi framework and a short story about fitting neural networks at Flo, a top mobile tracker of women's health.
Presented at OECD Workshop on Systematic Reviews in the Scope of the Endocrine Disrupter Testing and Assessment (EDTA) Conceptual Framework Level 1 in Paris, France
Scaling Security Threat Detection with Apache Spark and Databricks - Databricks
Apple must detect a wide variety of security threats, and rises to the challenge using Apache Spark across a diverse pool of telemetry. This talk covers some of the home-grown solutions we’ve built to address complications of scale.
Tom DeMarco states that “You can’t control what you can’t measure”, but how much can we change and control (with) what we measure? This talk investigates the opportunities and limits of data-driven software engineering: it shows which opportunities lie ahead of us when we mine and analyze software engineering process data, but it also highlights important factors that influence the success and adaptability of data-based improvement approaches.
Applying soft computing techniques to corporate mobile security systems - Paloma De Las Cuevas
Corporate workers increasingly use their own devices for work purposes, in a trend that has come to be called the "Bring Your Own Device" (BYOD) philosophy, and companies are starting to include it in their policies. For this reason, corporate security systems need to be redefined and adapted to these emerging behaviours by the corporate Information Technology (IT) department. This work proposes applying soft-computing techniques in order to help the Chief Security Officer (CSO) of a company (in charge of the IT department) improve the security policies.
The actions performed by company workers in a BYOD situation will be treated as events: an action or set of actions yielding a response. Some of those events might cause non-compliance with some corporate policies, making it necessary to define a set of security rules (action, consequence). Furthermore, the processing of the extracted knowledge will allow the rules to be adapted.
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez
This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian-based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.
Performance Issue? Machine Learning to the rescue! - Maarten Smeets
It can be difficult to determine how to improve the performance of microservices. There are many factors you can vary, but which factor will have the most impact? During this presentation, a method using the random forest machine learning algorithm will be applied to help improve the performance of a microservice running inside a JVM. Several measures are taken, such as throughput and response times. Java version, JVM supplier, heap size, garbage collection algorithm and microservice framework are all varied. Which factor is most important in determining the response time and throughput of the services? The random forest algorithm will be introduced to solve this challenge. Not only will this presentation give some useful suggestions for improving the performance of microservices, but it will also introduce a novel way to take on the challenge of performance tuning which can be applied to other use cases. This presentation is especially interesting to developers and architects.
Determining the root cause of performance issues is a critical task for Operations. In this webinar, we'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
Provenance for Data Munging Environments - Paul Groth
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
A talk I gave on what Hadoop does for the data scientist. I talk about data exploration, NLP, classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
Code Review Checklist: How far does a code review go? "Metrics measure the design of code after it has been written, a review proves it, and refactoring improves it."
This paper shows a document structure and gives tips for a code review.
Some checks fit with your existing tools and simply raise a hand when the quality or security of your codebase is impaired.
ChatGPT
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It involves applying various techniques and methods to extract insights from data sets, often with the goal of uncovering patterns, trends, relationships, or making predictions.
Here's an overview of the key steps and techniques involved in data analysis:
Data Collection: The first step in data analysis is gathering relevant data from various sources. This can include structured data from databases, spreadsheets, or surveys, as well as unstructured data such as text documents, social media posts, or sensor readings.
Data Cleaning and Preprocessing: Once the data is collected, it often needs to be cleaned and preprocessed to ensure its quality and suitability for analysis. This involves handling missing values, removing duplicates, addressing inconsistencies, and transforming data into a suitable format for analysis.
Exploratory Data Analysis (EDA): EDA involves examining and understanding the data through summary statistics, visualizations, and statistical techniques. It helps identify patterns, distributions, outliers, and potential relationships between variables. EDA also helps in formulating hypotheses and guiding further analysis.
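As a minimal illustration of the cleaning and exploratory steps just described, here is a sketch using pandas; the file name and column names are hypothetical:

```python
import pandas as pd

# Load a (hypothetical) data set, clean it, and explore it.
df = pd.read_csv("sales.csv")
df = df.drop_duplicates()                    # remove duplicates
df = df.dropna(subset=["revenue"])           # handle missing values
df["date"] = pd.to_datetime(df["date"])      # fix inconsistent types

print(df.describe())                                       # summary statistics
print(df.groupby(df["date"].dt.month)["revenue"].mean())   # monthly trend
```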
Data Modeling and Statistical Analysis: In this step, various statistical techniques and models are applied to the data to gain deeper insights. This can include descriptive statistics, inferential statistics, hypothesis testing, regression analysis, time series analysis, clustering, classification, and more. The choice of techniques depends on the nature of the data and the research questions being addressed.
Data Visualization: Data visualization plays a crucial role in data analysis. It involves creating meaningful and visually appealing representations of data through charts, graphs, plots, and interactive dashboards. Visualizations help in communicating insights effectively and spotting trends or patterns that may be difficult to identify in raw data.
Interpretation and Conclusion: Once the analysis is performed, the findings need to be interpreted in the context of the problem or research objectives. Conclusions are drawn based on the results, and recommendations or insights are provided to stakeholders or decision-makers.
Reporting and Communication: The final step is to present the results and findings of the data analysis in a clear and concise manner. This can be in the form of reports, presentations, or interactive visualizations. Effective communication of the analysis results is crucial for stakeholders to understand and make informed decisions based on the insights gained.
Data analysis is widely used in various fields, including business, finance, marketing, healthcare, social sciences, and more. It plays a crucial role in extracting value from data, supporting evidence-based decision-making, and driving actionable insights.
Good practices (and challenges) for reproducibility
1. Good practices (and challenges) for reproducibility
“Give your samples a decent life”
Javier Quilez
2. Outline
● Make groups of 3 (ideally 2 wet-lab + 1 dry-lab)
● I will present several scenarios/challenges sequentially
● You will have a few minutes to think about how you would tackle them
● I will propose approaches that worked for me
4. The life of your sample
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
5. What is your sample?
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
6. What is your sample?
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
This is NOT enough
7. ● Initial processing of the data
● Quality control
● Downstream analysis
● Reproducibility
● Data sharing and publication
Is all the information needed available?
8. ● What information (aka. metadata) will describe your experiment?
● How will you collect metadata?
● Who will have access to metadata?
● Will metadata be future-proof?
Think
9. Collect the metadata of the experiments systematically
● Do it before processing the data
● Short and easy to complete
● Instantly accessible by authorized members of the team
● Easy to parse for humans and computers (a sketch follows below)
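As one way to implement this, a minimal sketch of a tab-separated sample sheet maintained from Python; the field names and the example values are hypothetical:

```python
import csv
import os

# Hypothetical fields describing one experiment; short and easy to complete.
FIELDS = ["sample_id", "date", "user", "treatment", "time_min", "sequencing_run"]

def append_sample(sheet_path, record):
    """Append one sample's metadata; write the header if the sheet is new.

    A plain TSV is instantly readable by team members, spreadsheets and
    scripts alike, i.e. easy to parse for humans and computers.
    """
    new_file = not os.path.exists(sheet_path)
    with open(sheet_path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
        if new_file:
            writer.writeheader()
        writer.writerow(record)

append_sample("samples.tsv", {
    "sample_id": "sample001", "date": "2017-03-01", "user": "user1",
    "treatment": "none", "time_min": "0", "sequencing_run": "run_2017_03_01",
})
```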
11. Experiments will happen over time
[Timeline: Exp. 1 produces Untreated (ctrl.txt) and Treated (t60.txt); later, Exp. 2 produces Treated (T60.txt)]
12. Which is your sample (and other issues)?
[Diagram: Untreated (ctrl.txt), Treated (t60.txt), Treated (T60.txt), with unclear correspondences]
● Which "*60.txt" file corresponds to each treated experiment?
● What "*60" and "ctrl" mean may not be obvious, and interpreting them requires human judgement
● Are both treated samples to be used with the same untreated sample?
● The variable use of lower/upper case complicates computer searches (see the snippet below)
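A quick illustration of the last point, assuming a case-sensitive filesystem and the file names from the example above:

```python
import fnmatch

files = ["ctrl.txt", "t60.txt", "T60.txt"]

# A naive pattern silently misses the inconsistently cased file...
print(fnmatch.filter(files, "t60*"))     # ['t60.txt'] -- T60.txt is missed
# ...and the workaround is fragile and easy to forget.
print(fnmatch.filter(files, "[tT]60*"))  # ['t60.txt', 'T60.txt']
```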
13. ● How will you name your samples?
● Will the name really be unique?
● Will it provide any information about the sample and/or group similar samples?
● Is it future-proof (i.e. consider that more samples will come)?
● What will you label with the sample name (e.g. tubes, files)?
Think
14. Establish a system: each sample a unique identifier
● Simplest way: an auto-incremental identifier (ID) (e.g. sample001, sample002, …)
● More complex options (sample ID based on metadata)
● Whichever you choose…
○ Unique
○ Computer-friendly (fixed length and pattern, all upper or lower case)
○ Anticipate the number of samples that can be reached
● Trace your sample with its ID through its life: from the tube to the files (a sketch follows below)
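A minimal sketch of such an auto-incremental ID scheme; the prefix and width are arbitrary choices:

```python
def next_sample_id(existing_ids, prefix="sample", width=3):
    """Return the next auto-incremental ID, e.g. sample001 -> sample002.

    A fixed prefix, fixed zero-padded width and consistent lower case keep
    the IDs computer-friendly; pick a width that anticipates how many
    samples you may reach (3 digits caps the scheme at 999 samples).
    """
    numbers = [int(i[len(prefix):]) for i in existing_ids if i.startswith(prefix)]
    n = max(numbers, default=0) + 1
    if n >= 10 ** width:
        raise ValueError("ID space exhausted; widen the scheme")
    return f"{prefix}{n:0{width}d}"

print(next_sample_id(["sample001", "sample002"]))  # sample003
```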
19. ● How will you organize your raw data?
● How will you organize your processed data?
● How will you organize your analysis results?
● Will human and computer searches be easy?
Think
20. The life of your sample
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
21. The life of your sample
[Diagram: Experiment (wet-lab domain) → (1) Raw data → (2) Processed data → (3) Analysis results]
22. (1) Raw data - 1 directory per instrument run
● Files exactly as produced by the instrument
● Do not store modified, subsetted or merged files
● Quality control of raw files
23. (2) Processed data - 1 directory per sample
● Several subdirectories
○ Steps of the analysis pipeline
○ Logs of the programs used
○ File integrity verifications
● Subdirectories accommodate variations in the analysis pipelines (a sketch follows below)
○ sample1/step1/program_a/sample1.txt
○ sample1/step1/program_b/sample1.txt
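A sketch that creates this layout with Python's standard library; the step and program names are hypothetical:

```python
from pathlib import Path

def make_sample_dirs(root, sample_id, steps):
    """Create one directory per sample, with subdirectories per pipeline
    step and per program, so that variant pipelines can coexist, e.g.
    sample001/step1/program_a/ alongside sample001/step1/program_b/."""
    for step, programs in steps.items():
        for program in programs:
            d = Path(root) / sample_id / step / program
            d.mkdir(parents=True, exist_ok=True)
            (d / "logs").mkdir(exist_ok=True)  # logs of the programs used

make_sample_dirs("processed", "sample001",
                 {"step1": ["program_a", "program_b"], "step2": ["program_c"]})
```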
26. Data analysis is hardly ever a one-time task
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
27. Can you process multiple samples seamlessly?
[Diagram: over time, more and more datasets arrive, and each must be turned into results]
28. ● Imagine you write code to process/analyze 1 sample:
○ How will it handle 100 samples?
○ Will 100 samples be processed in a reasonable time?
○ Will you have to manually configure sample-specific parameters?
○ Will you be able to run specific parts of your code? (one approach is sketched below)
Think
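One possible approach, sketched under the assumption that per-sample metadata lives in a tab-separated sample sheet and that run_pipeline stands in for the real pipeline: parameters come from the sheet rather than from manual edits, pipeline steps are selectable, and samples run in parallel:

```python
import csv
from concurrent.futures import ProcessPoolExecutor

def run_pipeline(sample, steps):
    """Placeholder for the real per-sample pipeline: every parameter it
    needs comes from the sample's metadata, not from manual editing."""
    for step in steps:
        print(f"{sample['sample_id']}: running {step}")

def process_all(sheet="samples.tsv", steps=("trim", "align", "qc"), workers=4):
    """Process 1 or 100 samples with the same call, in parallel."""
    with open(sheet) as fh:
        samples = list(csv.DictReader(fh, delimiter="\t"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_pipeline, s, steps) for s in samples]
        for f in futures:
            f.result()  # surface any per-sample failure

if __name__ == "__main__":
    process_all(steps=("align",))  # run only a specific part of the code
```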
42. Data go through many procedures to generate results
[Diagram: over time, each dataset goes through many procedures to produce its results]
43. Can you or anybody else reproduce your results?
[Diagram: results whose generating procedures are unknown, marked with question marks]
Little understanding, irreproducibility, and harder identification of errors
44. ● How will you document your procedures?
● How will you store your code?
● How will others get access to your documentation?
Think
45. Document, document and document
● Write in README files how and when software and accessory files were obtained (e.g. genome reference sequence, annotation)
● Allocate a directory for any task (even one as simple as sharing files)
● Code the core analysis pipeline to log the output of the programs and verify file integrity (a sketch follows below)
● Document procedures using Markdown, Jupyter Notebooks, RStudio or the like
● Specify non-default variable values
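A minimal sketch of the logging and file-integrity points; the command being run is a placeholder:

```python
import hashlib
import subprocess

def run_and_log(cmd, log_path):
    """Run one pipeline command, keeping its full output and the exact
    command line so the step can be rerun and audited later."""
    with open(log_path, "w") as log:
        log.write("command: " + " ".join(cmd) + "\n")
        log.flush()
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)

def md5sum(path, chunk_size=1 << 20):
    """Checksum used to verify that a file is intact after copies/transfers."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Example with a placeholder command and file:
# run_and_log(["echo", "hello"], "step1.log")
# print(md5sum("step1.log"))
```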
47-52. Take home message
● What is your sample? → Collect the metadata of the experiments systematically
● Which is your sample? → Establish a system: each sample a unique identifier
● Where are data and results? → Structured and hierarchical organization of the data
● Can you process multiple samples seamlessly? → Scalability, parallelization, automatic configuration and modularity of the code
● Can you or anybody else reproduce your results? → Document, document and document!
53. In case you forget the take home message…
The human factor is the greatest hurdle for reproducibility
Limit or control human intervention by automating every step of the data analysis as much as possible
It’s not you, it’s the lab culture
55-56. Your involvement in the data analysis is a choice
The data analysis itself is not
[Chart: your autonomy and your dependence on bioinformaticians, each plotted against your involvement in the data analysis]