Presented as an invited talk to the Workshop for Natural Language Processing Open Source Software (NLP-OSS), co-located with ACL 2018 (Melbourne, Australia, July 2018).
This document discusses how AI can be used to assist with software testing and development. It suggests that AI may one day be able to understand complex problems, automatically generate error-free code, and even replace testers. However, AI tools also need to be intuitive, easy to use, and help improve workflows. The document promotes using AI to enhance activities like automated regression testing and program compilation. Overall, the document explores the potential for AI to both automate certain testing tasks and help testers work more efficiently.
It is widely accepted that AI is the future of testing. However, because a fault lies in the eye of the beholder, it is far from clear how to apply AI to testing. This is known as the oracle problem.
There are literally thousands of UI test automation tools. But the high effort of creating and maintaining tests, together with the brittleness and unreliability of the resulting tests, means testing often remains a manual task (confirming the testing pyramid). Meanwhile, software testing accounts for 30-40% of the budget of a typical software project.
However, there is a way to circumvent the oracle problem and use AI not only to find technical errors (e.g. crashes) but also to generate tests for business functionality: autonomous automation. See how AI can be trained to generate tests that are optimized towards several goals and are even better than manually created ones.
Visit the future of testing and see how AI can help us create better software!
CrowdTruth: Machine-Human Computation for Harnessing Disagreement in Semantic... (Lora Aroyo)
CrowdTruth is a machine-human computation platform that harnesses disagreement in semantic interpretation to help machines learn. It presents tasks to human workers to annotate examples for interpretation and analyzes levels of disagreement, which can indicate ambiguity, low quality work, or the clarity of examples. The open source CrowdTruth software includes components for machine preprocessing, reusable microtasks, and analytics on disagreement and provenance tracking to collect and evaluate ground truth data.
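The disagreement analytics described above can be sketched in miniature: CrowdTruth-style metrics represent each worker's annotations for an example as a vector and score clarity by how well workers agree with the aggregate. The snippet below is an illustrative simplification, not the actual CrowdTruth implementation; the example data is invented.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two annotation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_clarity(worker_vectors):
    """Mean similarity of each worker's vector to the aggregate
    (summed) vector; low values signal ambiguity or unclear examples."""
    aggregate = [sum(col) for col in zip(*worker_vectors)]
    return sum(cosine(w, aggregate) for w in worker_vectors) / len(worker_vectors)

# Three workers annotating one sentence over four candidate relations:
agreeing = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
split = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
clear_score = sentence_clarity(agreeing)      # 1.0: full agreement
ambiguous_score = sentence_clarity(split)     # lower: workers disagree
```

In this simplified scoring, a low clarity score does not immediately mean low-quality work; as the abstract notes, it can equally indicate a genuinely ambiguous example, which is exactly the signal CrowdTruth sets out to capture.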
QA Fest 2017. Jeremias Rößler. Applying AI to testing (QAFest)
If the future of testing is AI, I will show you how.
Various automation tools are available, but due to the high effort of creating and maintaining tests, GUI testing is still mainly a manual task. Meanwhile, overall testing effort has risen to make up 30% of a typical software budget. Is crowd-testing the answer? What if automated test cases could be created automatically?
Monkey testing is not a new idea. But combined with a manually trainable AI and an innovative new testing approach (dubbed "difference testing"), we can now not only have the monkey search for technical bugs (e.g. crashes) but also generate functional test cases that are optimized towards several goals and are even better than manually created ones.
Visit the future of testing and see how AI can help us create better software!
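The combination this abstract describes, random "monkey" input generation plus comparison against a recorded golden master, can be illustrated in a few lines. This is a hypothetical toy, not the speaker's actual tool; the app, the action names, and the counter example are all invented for illustration.

```python
import random

def monkey_test(app, actions, steps=30, seed=7, golden=None):
    """Drive the app with random actions. Crashes are reported as bugs;
    if a golden master (the state log of a previous run) is given, any
    divergence from it is reported as a difference to review."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    states, findings = [], []
    for i in range(steps):
        action = rng.choice(actions)
        try:
            state = app(action)
        except Exception as exc:
            findings.append(f"step {i}: crash on {action!r}: {exc}")
            break
        states.append(state)
        if golden is not None and i < len(golden) and golden[i] != state:
            findings.append(f"step {i}: differs from golden master")
    return states, findings

# Hypothetical app under test: a counter that must never go below zero.
counter = {"n": 0}
def app(action):
    counter["n"] += 1 if action == "inc" else -1
    if counter["n"] < 0:
        raise RuntimeError("underflow")
    return counter["n"]

golden, findings = monkey_test(app, ["inc", "inc", "dec"])
```

A first run records the golden master; after a code change, replaying the same seeded run and diffing against it surfaces behavioral differences without any hand-written assertions, which is the essence of difference testing.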
Software Testing Terms Defined. Answering the FAQ "What is Regression Testing?"
- What is Regression Testing?
- How to do Regression Testing?
- Why do we do Regression Testing?
- How to re-think Regression Testing in terms of Risk?
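As a concrete illustration of the first two questions above: a regression test pins down behavior that was already verified as correct, so a later change that breaks it fails fast. The `discount` function and the bug cases below are hypothetical.

```python
# Hypothetical function whose past-verified behavior we want to pin down.
def discount(price, percent):
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return price * (100 - percent) / 100

def test_discount_regressions():
    # Cases captured when the behavior was last verified as correct;
    # a future change that alters any of them fails immediately.
    assert discount(100.0, 0) == 100.0
    assert discount(100.0, 15) == 85.0
    assert discount(200.0, 25) == 150.0
    # Invalid input must keep being rejected, not silently accepted.
    try:
        discount(10.0, 150)
        assert False, "expected ValueError for an out-of-range percent"
    except ValueError:
        pass

test_discount_regressions()
```

Thinking in terms of risk, the cases worth keeping in such a suite are the ones guarding behavior that is expensive to break, not an exhaustive enumeration of inputs.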
Meetup of test mini conference on AI in testing (Kai Lepler)
This presentation is intended to start thinking about the benefits AI can bring to the testing profession.
Just randomly replacing human brains with machines will not do the trick...
CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session (Lora Aroyo)
The document discusses harnessing disagreement in crowdsourcing to create gold standards for training cognitive systems. It describes representing disagreement through vectors for sentences, workers, and relations to understand different interpretations. Scoring systems can consider where their outputs fall within the space of possible interpretations represented by disagreement vectors. The goal is to capture and exploit disagreement rather than treating it as noise.
The recording at https://eviltester.com/talks includes:
- longer practice session recording
- live recording (the local recording has better quality)
- 8 bonus recordings with an extra hour of material
- will automation take over
- impact of buzzwords
- how to cope with trends
- contextual problem solving
- information about the references
- exercises
- behind the scenes look at how the talk was prepared and tools used
- transcripts
- subtitles
Data excellence: Better data for better AI (Lora Aroyo)
The document discusses the importance of data quality and a data lifecycle approach for artificial intelligence. Some key points made include:
- A data lifecycle is needed to guide best practices for data research and development, similar to how a software lifecycle guides software engineering.
- Data quality must be addressed through practices and standards to help avoid unintended AI behaviors that can result from low quality data.
- Disagreement in annotation tasks can provide valuable signals about ambiguity and diversity rather than just being considered noise.
- Achieving high quality, reliable data requires consideration of aspects like validity, fidelity, reproducibility and maintaining data over time - an approach toward "data excellence".
Open IoT Made Easy - Introduction to OGC SensorThings API (SensorUp)
These are the slides presented at the FOSS4G N.A. conference.
This presentation will introduce and demonstrate the OGC SensorThings API. The OGC SensorThings API is a new Open Geospatial Consortium standard that provides an open and unified way to interconnect Internet of Things (IoT) devices, data, and applications over the Web. Unlike the traditional OGC standards, the SensorThings API is very simple and efficient, while also being comprehensive and designed to handle complex use cases. It builds on a rich set of proven, widely adopted open standards, such as the OGC Sensor Web Enablement (SWE) standards, including the ISO/OGC Observations and Measurements (O&M) model and the Sensor Observation Service (SOS). The main difference between the SensorThings API and the OGC SOS is that the SensorThings API is designed specifically for resource-constrained IoT devices and the Web developer community. As a result, the SensorThings API follows REST principles, uses an efficient JSON encoding, and adopts the flexible OASIS OData protocol and URL conventions. We will also demonstrate several real-world applications of the OGC SensorThings API in emergency management and smart cities. In addition to introducing the specification, this talk will demonstrate an end-to-end IoT application based on the open source libraries of the SensorUp SensorThings platform.
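To make the REST/OData conventions concrete, the sketch below builds an Observations query URL in the SensorThings style and parses the shape of a typical response. The service root is a made-up placeholder, the sample payload is abridged and invented, and no live request is performed.

```python
import json
from urllib.parse import urlencode

# Hypothetical service root; a real deployment exposes entity sets such
# as Things, Datastreams, and Observations under a path like this.
BASE = "https://example.org/SensorThings/v1.0"

def observations_url(datastream_id, top=5):
    """URL for the latest observations of one Datastream, using the
    OData-style $top and $orderby query options the API adopts.
    (urlencode percent-encodes the '$', which servers decode.)"""
    params = urlencode({"$top": top, "$orderby": "phenomenonTime desc"})
    return f"{BASE}/Datastreams({datastream_id})/Observations?{params}"

url = observations_url(42)

# Abridged shape of a typical JSON response:
sample = json.loads("""
{"value": [
   {"phenomenonTime": "2018-06-01T10:00:00Z", "result": 21.5},
   {"phenomenonTime": "2018-06-01T09:00:00Z", "result": 21.1}
]}
""")
readings = [obs["result"] for obs in sample["value"]]
```

The navigation-path style (`Datastreams(42)/Observations`) and the `value` array wrapper follow the OData conventions the abstract mentions, which is what makes the API approachable from any HTTP client.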
In 2011, the website edge.org asked the question, 'What scientific concept would improve everybody's cognitive toolkit?'. There were answers from renowned intellectuals. I have listed some of the concepts which are relevant to software development.
An Ontology Engineering Approach to Gamify Collaborative Learning Scenarios (Geiser Chalco)
This document proposes an ontological engineering approach to gamify collaborative learning scenarios. It introduces the concept of gamification and identifies the problem of applying gamification to collaborative learning. The approach involves developing an ontological structure to describe gamified collaborative learning scenarios. The structure would define the elements needed to apply game design concepts to collaborative learning activities. The goal is to engage students and satisfy psychological needs through problem-solving in a playful context.
Artificial Intelligence PowerPoint presentation document (David Raj Kanthi)
This document provides a certificate for a seminar report on the topic of artificial intelligence. It was completed by a student in partial fulfillment of an M.C.A. degree program in 2016-2017. The document includes an acknowledgment, declaration, abstract, and index sections that provide information about the student, guide, and overall content covered in the seminar report on artificial intelligence.
Trusted, Transparent and Fair AI using Open Source (Animesh Singh)
The document discusses IBM's efforts to bring trust and transparency to AI through open source. It outlines IBM's work on several open source projects focused on different aspects of trusted AI, including robustness (Adversarial Robustness Toolbox), fairness (AI Fairness 360), and explainability (AI Explainability 360). It provides examples of how bias can arise in AI systems and the importance of detecting and mitigating bias. The overall goal is to leverage open source to help ensure AI systems are fair, robust, and understandable through contributions to tools that can evaluate and improve trusted AI.
The document is a transcript from a Splunk presentation about using Splunk for IT operations. It discusses using Splunk to correlate machine data from different sources like servers, applications, and databases to gain visibility into IT services and their components. It provides a live demonstration of how Splunk can be used to monitor system performance, create tickets or alerts when issues arise, and troubleshoot issues by searching through logs and events. The presentation emphasizes how the common information model in Splunk allows mapping these components like hosts, applications, and services for improved IT operations and issue resolution.
This session is meant for intermediate to advanced learners who want to expand their knowledge of artificial intelligence (AI) and machine learning (ML). In this session, you will learn about cutting-edge AI and machine learning techniques, as well as real-world applications, and get the skills required to create your own intelligent systems.
How Four Statistical Rules Forecast Who Wins a Competitive Bid (IntelCollab.com)
Can Bayesian statistics really determine in advance if the bid you are offering will be the winner or just another loser? And, if the metrics forecast a loss, can the same algorithm tell you what to change in order to win instead?
Competitive bidding is where big-money sales opportunities are won or lost, and there are four rules that can help you turn a losing situation into a winning sale.
These four rules help you better understand what the customer wants, examine what competitors might do in response and how to beat them, while helping you to offer the best bid, optimized for yours and your prospective customer’s intended outcome. Statistical metrics evaluate your probability of success against the competition and help you more objectively determine how to win. But how can you get at the foundational issues that will determine who will win?
Learning objectives:
Learn the Four Rules that help you understand what will actually determine the customer’s decision.
Visualize your bid head-to-head against the competition and employ objective metrics to determine if you will win.
Identify weaknesses in your offer that must be improved for your bid to beat the competition.
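The four rules themselves are not spelled out in this abstract, but the basic Bayesian machinery behind such forecasts can be sketched with a simple Beta-Binomial model: update a prior belief about the win rate with comparable past bids, then compare scenarios. All numbers below are invented for illustration.

```python
def posterior_win_rate(wins, losses, alpha=1.0, beta=1.0):
    """Mean of the Beta posterior over the win probability, starting
    from a Beta(alpha, beta) prior (alpha = beta = 1 is a flat prior)."""
    return (alpha + wins) / (alpha + beta + wins + losses)

# 3 wins in 12 comparable head-to-head bids:
p_now = posterior_win_rate(3, 9)       # (1 + 3) / (2 + 12), about 0.29
# Scenario: fixing a weak part of the offer would have flipped 2 losses:
p_improved = posterior_win_rate(5, 7)  # (1 + 5) / (2 + 12), about 0.43
```

The "what to change in order to win" question then becomes comparing posteriors across such what-if scenarios, which is a far cruder version of what the webinar's rules aim to do with customer and competitor information.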
Bill Zangwill is a Professor Emeritus at the University of Chicago, Booth School of Business. He has authored four published books, one of which was selected by the Library Journal as “One of the Best Business Books of the Year,” and has published over 50 papers in academic journals. In addition, he has had three articles published in the Wall Street Journal. His consulting engagements include top firms such as IBM, AT&T, and Motorola, many smaller firms, and the US government. He has also taught at the University of Illinois and the University of California, Berkeley. He is considered one of the most innovative thinkers in his field.
Bill will present for 30 minutes on how the four rules can help you turn a losing situation into a winning sale, and will be joined by webinar moderator Arik Johnson, Founder & Chairman at Aurora WDC.
Mobile devices, games and education: exploring an untapped potential through ... (Ruth S. Contreras Espinosa)
Why is game testing important? The game development process is long and complex, with game testing done at different points of the process. The Open Device Lab community is an important alternative to help the software development community. EJML presentation, Porto, Portugal, June 2018.
1. Introduction and how to get into Data
2. Data Engineering and skills needed
3. Comparison of Data Analytics for static and real-time streaming data
4. Bayesian Reasoning for Data
The document is a summary of the 2015 Data Science Salary Survey conducted by O'Reilly Media. Over 600 respondents from various industries completed the anonymous online survey about their demographics, tasks, tools used, and compensation. Key findings include that the top four tools - SQL, Excel, R, and Python - remained the same as the previous year. Use of Spark and Scala has grown significantly compared to last year, and their users tend to earn more. Even when controlling for other factors, women are paid less than men. About 40% of the variation in salaries can be explained by the data provided in the survey.
Architecting a Platform for Enterprise Use - Strata London 2018 (Mark Madsen)
The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. This session will discuss hidden design assumptions, review design principles to apply when building multi-use data infrastructure, and provide a reference architecture to use as you work to unify your analytics infrastructure.
The focus in our market has been on acquiring technology, but that ignores the more important part: the larger IT landscape within which this technology lives and the data architecture that lies at its core. If one expects longevity from a platform, then it should be a designed rather than an accidental architecture.
Architecture is more than just software. It starts from use and includes the data, technology, methods of building and maintaining, and organization of people. What are the design principles that lead to good design and a functional data architecture? What are the assumptions that limit older approaches? How can one integrate with, migrate from or modernize an existing data environment? How will this affect an organization's data management practices? This tutorial will help you answer these questions.
Topics covered:
* A brief history of data infrastructure and past design assumptions
* Categories of data and data use in organizations
* Analytic workload characteristics and constraints
* Data architecture
* Functional architecture
* Tradeoffs between different classes of technology
* Technology planning assumptions and guidance
#strataconf
UX STRAT Online 2020: Dr. Martin Tingley, Netflix (UX STRAT)
Over the years, the Netflix UI has evolved from a sparse and static webpage into an immersive, video-centric experience tailored to a variety of platforms. In this talk, I’ll describe the simple but powerful framework that Netflix uses to evolve the product experience: we ask our members, through online A/B tests, which of several possible experiences resonate with them. I’ll also describe the steps we are taking to democratize access to experimentation across the company so that we can explore more ideas and identify those that deliver more value to our members.
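The online A/B testing framework described here ultimately comes down to comparing engagement rates between experiences. A minimal sketch of that comparison, a pooled two-proportion z-test, is below; the sample counts are invented, and Netflix's actual statistical methodology is certainly more sophisticated.

```python
from math import sqrt, erf

def ab_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in engagement rates between
    control A and treatment B (pooled two-proportion z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal tail probability via the error function (no SciPy needed).
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

# Invented counts: 1200/10000 members engaged with the control experience,
# 1320/10000 with the candidate experience.
p_value = ab_test_pvalue(1200, 10000, 1320, 10000)  # below 0.05 here
```

A small p-value says the observed lift is unlikely under "no difference"; democratizing experimentation, as the talk describes, is largely about making this kind of readout safe and routine for non-statisticians.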
The State of AI in Insights and Research 2024: Results and Findings (Ray Poynter)
This presentation from Ray Poynter of NewMR shares the findings from the latest NewMR study.
The study looks at the use of Artificial Intelligence in insights and research, including a deeper dive into the topic of Synthetic Data.
Find out how many people are using AI, how many are testing it, what they are doing, and what people think about Synthetic Data.
And hear about where AI and research might be in two years' time.
Listen to the recording of this presentation via NewMR.org/Play-Again, and also access other presentations from NewMR on all things market research, insight, and AI.
This document provides an introduction to data science, including definitions, key concepts, and applications. It discusses what data science is, the differences between data science, big data, and artificial intelligence. It also outlines several applications of data science like internet search, recommendation systems, image/speech recognition, gaming, and price comparison websites. Finally, it discusses the data science life cycle and some popular tools used in data science like Python, NumPy, Pandas, Matplotlib, and Scikit-learn.
Programmers love Python because of how fast and easy it is to use. Python cuts development time in half with its simple-to-read syntax, and debugging programs is a breeze with its built-in debugger. Using Python makes programmers more productive and their programs ultimately better. Python continues to be a favourite option for data scientists, who use it for building machine learning applications and other scientific computations.
Python runs on Windows, Linux/Unix, Mac OS and has been ported to Java and .NET virtual machines. Python is free to use, even for the commercial products, because of its OSI-approved open source license.
Python has evolved into the most preferred language for data analytics, and increasing search trends on Python also indicate that Python is the next "Big Thing" and a must for professionals in the data analytics domain.
6 Open Source Data Science Projects To Impress Your Interviewer (PrachiVarshney7)
This document discusses 6 open source data science projects that could impress an interviewer: 1) Facebook AI's DETR model for computer vision, 2) A real-time image animation project using OpenCV, 3) OpenAI's massive GPT-3 natural language processing model, 4) A Python audio analysis library called PyAudio, 5) A Python tool called TextShot for extracting text from screenshots, and 6) A collaborative effort called ML Visuals for communicating data science work through visuals and templates.
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite (Google)
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Similar to Open-Source Software's Responsibility to Science
Topics covered:
* A brief history of data infrastructure and past design assumptions
* Categories of data and data use in organizations
* Analytic workload characteristics and constraints
* Data architecture
* Functional architecture
* Tradeoffs between different classes of technology
* Technology planning assumptions and guidance
#strataconf
UX STRAT Online 2020: Dr. Martin Tingley, NetflixUX STRAT
Over the years, the Netflix UI has evolved from a sparse and static webpage into an immersive, video-centric experience tailored to a variety of platforms. In this talk, I’ll describe the simple but powerful framework that Netflix uses to evolve the product experience: we ask our members, through online A/B tests, which of several possible experiences resonate with them. I’ll also describe the steps we are taking to democratize access to experimentation across the company so that we can explore more ideas and identify those that deliver more value to our members.
The State of AI in Insights and Research 2024: Results and FindingsRay Poynter
This presentation from Ray Poynter of NewMR shares the findings from the latest NewMR study.
The study looks at the use of Artificial Intelligence in insights and research, including a deeper dive into the topic of Synthetic Data.
Find out how many people are using AI, how many are testing it, what they are doing, and what people think about Synthetic Data.
And, hear about where AI and research might be in two year’s time.
Listen to the recording of this presentation via NewMR.org/Play-Again, and also access other presentations from NewMR on all things market research, insight, and AI.
This document provides an introduction to data science, including definitions, key concepts, and applications. It discusses what data science is, the differences between data science, big data, and artificial intelligence. It also outlines several applications of data science like internet search, recommendation systems, image/speech recognition, gaming, and price comparison websites. Finally, it discusses the data science life cycle and some popular tools used in data science like Python, NumPy, Pandas, Matplotlib, and Scikit-learn.
Programmers love Python because of how fast and easy it is to use. Python cuts development time in half with its simple to read syntax and easy compilation feature. Debugging your programs is a breeze in Python with its built in debugger. Using Python makes Programmers more productive and their programs ultimately better. Python is continued to be a favourite option for data scientists who use it for building and using Machine learning applications and other scientific computations.
Python runs on Windows, Linux/Unix, Mac OS and has been ported to Java and .NET virtual machines. Python is free to use, even for the commercial products, because of its OSI-approved open source license.
Python has evolved as the most preferred Language for Data Analytics and the increasing search trends on python also indicates that Python is the next "Big Thing" and a must for Professionals in the Data Analytics domain.
6 Open Source Data Science Projects To Impress Your InterviewerPrachiVarshney7
This document discusses 6 open source data science projects that could impress an interviewer: 1) Facebook AI's DETR model for computer vision, 2) A real-time image animation project using OpenCV, 3) OpenAI's massive GPT-3 natural language processing model, 4) A Python audio analysis library called PyAudio, 5) A Python tool called TextShot for extracting text from screenshots, and 6) A collaborative effort called ML Visuals for communicating data science work through visuals and templates.
Similar to Open-Source Software's Responsibility to Science (20)
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Do you want Software for your Business? Visit Deuglo
Deuglo has top Software Developers in India. They are experts in software development and help design and create custom Software solutions.
Deuglo follows seven steps methods for delivering their services to their customers. They called it the Software development life cycle process (SDLC).
Requirement — Collecting the Requirements is the first Phase in the SSLC process.
Feasibility Study — after completing the requirement process they move to the design phase.
Design — in this phase, they start designing the software.
Coding — when designing is completed, the developers start coding for the software.
Testing — in this phase when the coding of the software is done the testing team will start testing.
Installation — after completion of testing, the application opens to the live server and launches!
Maintenance — after completing the software development, customers start using the software.
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Crescat
Crescat is industry-trusted event management software, built by event professionals for event professionals. Founded in 2017, we have three key products tailored for the live event industry.
Crescat Event for concert promoters and event agencies. Crescat Venue for music venues, conference centers, wedding venues, concert halls and more. And Crescat Festival for festivals, conferences and complex events.
With a wide range of popular features such as event scheduling, shift management, volunteer and crew coordination, artist booking and much more, Crescat is designed for customisation and ease-of-use.
Over 125,000 events have been planned in Crescat and with hundreds of customers of all shapes and sizes, from boutique event agencies through to international concert promoters, Crescat is rigged for success. What's more, we highly value feedback from our users and we are constantly improving our software with updates, new features and improvements.
If you plan events, run a venue or produce festivals and you're looking for ways to make your life easier, then we have a solution for you. Try our software for free or schedule a no-obligation demo with one of our product specialists today at crescat.io
Artificia Intellicence and XPath Extension FunctionsOctavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
E-commerce Application Development Company.pdfHornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
SOCRadar's Aviation Industry Q1 Incident Report is out now!
The aviation industry has always been a prime target for cybercriminals due to its critical infrastructure and high stakes. In the first quarter of 2024, the sector faced an alarming surge in cybersecurity threats, revealing its vulnerabilities and the relentless sophistication of cyber attackers.
SOCRadar’s Aviation Industry, Quarterly Incident Report, provides an in-depth analysis of these threats, detected and examined through our extensive monitoring of hacker forums, Telegram channels, and dark web platforms.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
OpenMetadata Community Meeting - 5th June 2024OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed about the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
DDS Security Version 1.2 was adopted in 2024. This revision strengthens support for long runnings systems adding new cryptographic algorithms, certificate revocation, and hardness against DoS attacks.
Utilocate offers a comprehensive solution for locate ticket management by automating and streamlining the entire process. By integrating with Geospatial Information Systems (GIS), it provides accurate mapping and visualization of utility locations, enhancing decision-making and reducing the risk of errors. The system's advanced data analytics tools help identify trends, predict potential issues, and optimize resource allocation, making the locate ticket management process smarter and more efficient. Additionally, automated ticket management ensures consistency and reduces human error, while real-time notifications keep all relevant personnel informed and ready to respond promptly.
The system's ability to streamline workflows and automate ticket routing significantly reduces the time taken to process each ticket, making the process faster and more efficient. Mobile access allows field technicians to update ticket information on the go, ensuring that the latest information is always available and accelerating the locate process. Overall, Utilocate not only enhances the efficiency and accuracy of locate ticket management but also improves safety by minimizing the risk of utility damage through precise and timely locates.
GraphSummit Paris - The art of the possible with Graph TechnologyNeo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
2. Gatekeepers Names Surprises Sensibility Misdirection Collaborate 2
Me in open source
Mostly contributed to popular Scientific Python libraries:
scikit-learn, nltk, scipy.sparse, pandas, ipython, numpydoc
Also information extraction evaluation (neleval), etc.
Community service
“Volunteer software development”
With thanks to our financial sponsors
Caretakers aren’t always founders
Founders aren’t always caretakers
Joel Nothman Open-Source Software’s Responsibility to Science 20th July 2018
3. Overheard at ICML
Don’t worry about how tricky it is to implement . . .
Someone will put it in Scikit-learn and you can just use it.
5. Thoughts on an arrogant ML researcher
Scientists think software maintenance is no big deal
Science and engineering rely heavily on open-source infrastructure
Popular tools become de-facto standards
Most users are uncomfortable building their own tools
Many will only use what’s provided in a popular library
Many will not inspect how it works on the inside
Volunteer maintainers act as gatekeepers
7. The power of the gatekeeper
decides which algorithms are available
decides how to ensure correctness and stability
decides how to name or describe the algorithm
decides whether to be faithful to a published description
decides on an API that may facilitate good science/engineering
OSS maintainers can enable or inhibit scientific best practices
8. But you can’t blame the gatekeeper
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.
9. This presentation is a series of examples
Risks to good science and engineering related to software design
Some things we do to help science
Some things we have changed to help science
Some things we have yet to solve
There’s not a great deal of NLP in here
but there’s a lot of ML and software engineering in NLP
(so I hope it is relevant, interesting and accessible)
10. Scikit-learn preliminaries
An ecosystem of estimators
Fit an estimator on some data, so that it can:
describe the training data
transform unseen data
predict a target for unseen data
Data is usually a numeric matrix X (samples × features)
May provide a target vector or matrix y at training time
real valued for regression
categories for multiclass classification
multiple columns of binary targets for multilabel classification
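The fit/describe/transform contract above can be sketched with a hand-rolled standardiser. This is illustrative only, not scikit-learn code; the class name and attributes merely mirror the library's naming convention.

```python
# Minimal sketch of the scikit-learn estimator contract: fit on training
# data, then transform (or predict for) unseen data. Hand-rolled, not
# library code.

class StandardScalerSketch:
    def fit(self, X):
        # Describe the training data: per-feature mean and std.
        n = len(X)
        self.mean_ = [sum(col) / n for col in zip(*X)]
        self.std_ = [
            (sum((v - m) ** 2 for v in col) / n) ** 0.5 or 1.0
            for col, m in zip(zip(*X), self.mean_)
        ]
        return self  # returning self enables fit(...).transform(...)

    def transform(self, X):
        # Apply *training* statistics to possibly unseen data.
        return [
            [(v - m) / s for v, m, s in zip(row, self.mean_, self.std_)]
            for row in X
        ]

X_train = [[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]]
scaler = StandardScalerSketch().fit(X_train)
X_new = scaler.transform([[2.0, 14.0]])  # the training mean maps to zero
```

Keeping the learned state on the fitted object (rather than baking it into the data) is what later makes pipelines and cross validation composable.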
11. Methods and results are indecipherable if researchers publish an
inappropriate, or underspecified, algorithm name
13. A simple example of a bad name
sklearn.covariance.GraphLasso for sparse inverse covariance estimation
but Graph Lasso is sparse regression where the features lie on a graph
the paper for covariance estimation named it Graphical Lasso
Solution deprecate GraphLasso and rename it GraphicalLasso
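The deprecate-and-rename step can be sketched as follows. `GraphicalLasso` here is a stub standing in for the real estimator, and the warning text is illustrative; the pattern is what matters: the old name keeps working for a release cycle but steers users to the correct term.

```python
# Sketch of the rename-with-deprecation pattern used for
# GraphLasso -> GraphicalLasso.
import warnings

class GraphicalLasso:
    """Stand-in stub for the renamed estimator."""
    def fit(self, X):
        return self

class GraphLasso(GraphicalLasso):
    """Deprecated alias: still works, but warns on construction."""
    def __init__(self):
        warnings.warn(
            "GraphLasso was renamed to GraphicalLasso and will be "
            "removed in a future release.",
            FutureWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model = GraphLasso().fit([[0.0]])  # old code keeps running, with a nudge
```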
15. Tripping over hidden parameters
precision_score(['a','a','b','b','c'], ['a','b','b','c','c'])
For multi-class or multi-label, how should you average across classes?
Pa = 1/1, Pb = 1/2, Pc = 1/2
average='micro' ⇒ 3/5
average='macro' ⇒ (1 + 1/2 + 1/2)/3 = 2/3
average='weighted' ⇒ (2 × 1 + 2 × 1/2 + 1 × 1/2)/5 = 7/10
for a long time, prevalence-weighted macro average was the default
∴ papers say “We achieved a precision of . . . ”
Solution precision_score raises an error if the data is not binary,
unless the user specifies average.
Use the API to force literacy & awareness
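The three averages can be reproduced by hand in plain Python. This is a sketch that mirrors, rather than calls, scikit-learn's `precision_score`, using the example from the slide.

```python
# Hand-computed multiclass precision, showing why `average` must be
# chosen explicitly: the three conventions give three different numbers.
from collections import Counter

def precision_per_class(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    return {c: tp[c] / predicted[c] for c in labels if predicted[c]}

y_true = ['a', 'a', 'b', 'b', 'c']
y_pred = ['a', 'b', 'b', 'c', 'c']
P = precision_per_class(y_true, y_pred)  # {'a': 1.0, 'b': 0.5, 'c': 0.5}

micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 3/5
macro = sum(P.values()) / len(P)                                   # 2/3
support = Counter(y_true)                                          # prevalence
weighted = sum(P[c] * support[c] for c in P) / len(y_true)         # 7/10
```

A paper reporting only "precision" leaves the reader to guess which of 0.6, 0.67 or 0.7 was meant.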
16. What’s in a name?
What makes an implementation of some named algorithm correct?
Faithful to a published research paper?
Faithful to a reference implementation?
Faithful to some community of practice?
Consistent with other components of our software library?
Consistent with previous versions of the library?
17. Experimenters report sub-optimal results
because they assume our implementation is nicely behaved
19. The fit you thought was finished
Many optimisations are iterative
and have criteria to test whether they have converged on an optimum
Predictions and inferences may be poor if parameters did not converge
Solution Warn if we did not detect convergence
but if we have too many warnings, users ignore them. . .
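The warn-on-non-convergence pattern might look like this in miniature. The solver is an illustrative fixed-point iteration, and a plain `UserWarning` stands in for scikit-learn's `ConvergenceWarning`.

```python
# Sketch of "warn if the stopping criterion was never met": an iterative
# solver that gives up at max_iter and warns rather than failing silently.
import warnings

def fixed_point_sqrt2(max_iter, tol=1e-12):
    # Babylonian iteration for sqrt(2); converges in a handful of steps.
    x = 1.0
    for _ in range(max_iter):
        x_new = 0.5 * (x + 2.0 / x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    warnings.warn(
        f"did not converge within {max_iter} iterations; "
        "results may be poor. Consider increasing max_iter.",
        UserWarning,
    )
    return x

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fixed_point_sqrt2(max_iter=2)   # too few iterations: warns
    fixed_point_sqrt2(max_iter=50)  # converges: silent
```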
22. The words you didn’t mean to stop
CountVectorizer turns text into a term-document matrix
can choose stop words: None, ‘english’ or BYO
‘english’ will remove system (and used to remove computer)
‘english’ will remove five, six, eight but not seven
‘english’ will remove we have but treat we’ve as ve
This is documented nowhere.
See my NLP-OSS paper with Hanmin Qin and Roman Yurchak
Solution Deprecate ‘english’ and add another perfect stop list . . .
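The clitic problem can be reproduced in pure Python with a word-boundary token pattern like scikit-learn's documented default, and a two-word excerpt standing in for the 'english' list (an illustrative sketch, not `CountVectorizer` itself):

```python
# The tokeniser and the stop list disagree: "we've" tokenises as
# ["we", "ve"], and a stop list containing "we" but not "ve" lets the
# fragment "ve" through as a feature.
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default pattern
stop_words = {"we", "have"}  # tiny excerpt standing in for 'english'

def analyse(text):
    return [t for t in TOKEN_PATTERN.findall(text.lower())
            if t not in stop_words]

tokens_full = analyse("We have seven computers")   # "we have" removed cleanly
tokens_clitic = analyse("We've seven computers")   # but "we've" leaves "ve"
```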
25. The intercept you didn’t mean to regularise
In Logistic Regression, we learn a weight vector β
and a bias term β0 which corresponds to a feature x0 of all-1s
Regularisation: minimise Σᵢ βᵢ² to ensure small weights as well as small loss
liblinear regularises β0. You probably never want to do this.
All our other linear estimators do not regularise the intercept.
Sol’n 1 intercept_scaling: also need to optimise x0’s fill value
. . . but most users don’t see/do this
Sol’n 2 Implement alternative optimisers, and deprecate liblinear as the default LogisticRegression solver
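The effect can be shown numerically. The following sketch (not liblinear) fits a one-feature least-squares model by gradient descent, once penalising only the weight and once also penalising the intercept; the data, learning rate, and penalty strength are arbitrary illustrative choices. The data has a large true intercept, and penalising β0 biases the fit toward zero.

```python
# Fit y = b0 + b1*x by gradient descent on mean squared error plus an
# L2 penalty, toggling whether the intercept b0 is also penalised.

def fit(xs, ys, penalize_intercept, alpha=1.0, lr=0.01, steps=20000):
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = sum(2 * (b0 + b1 * x - y) for x, y in zip(xs, ys)) / n
        g1 = (sum(2 * (b0 + b1 * x - y) * x for x, y in zip(xs, ys)) / n
              + 2 * alpha * b1)          # weight is always penalised
        if penalize_intercept:
            g0 += 2 * alpha * b0         # the usually-unwanted extra term
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [10.0, 11.0, 12.0, 13.0]            # exact fit: y = 10 + 1*x
b0_plain, _ = fit(xs, ys, penalize_intercept=False)  # stays near 10
b0_pen, _ = fit(xs, ys, penalize_intercept=True)     # shrunk toward 0
```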
26. Analysis of code on GitHub shows that people use default
parameters when they shouldn’t
Andreas Müller
27. Most users are lazy
Users don’t explore alternatives
alternative parameter values
alternative software libraries
My students tend to use a CountVectorizer even
when they’re counting non-words (e.g. synsets)
We try to provide sensible default parameters
28. Sensible default fail
Ten-tree forests
Three-fold cross validation
?? A tokeniser that splits on word-internal punctuation
29. What makes a default value good?
Good defaults should give good predictive models and reliable statistics
? Good defaults should behave how users expect
but different communities of practice
Good defaults should be invariant to:
sample size (for stability in cross validation)
number of features (for stability in model selection)
? feature scaling (for stability in different tasks/datasets)
Example: finding a good default γ for an RBF kernel (#779, #10331)
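One scaling-invariant choice, computed by hand here as a sketch, is γ = 1 / (n_features × Var(X)): multiplying every feature by a constant rescales the squared distances and 1/γ by the same factor, leaving the RBF kernel value unchanged. This is the idea behind the `gamma='scale'` default that scikit-learn later adopted; the code below is an illustration, not the library implementation.

```python
# A default RBF gamma that is invariant to uniform feature scaling.
import math

def scale_gamma(X):
    # gamma = 1 / (n_features * variance over all entries of X)
    flat = [v for row in X for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return 1.0 / (len(X[0]) * var)

def rbf(x, z, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
X_scaled = [[10.0 * v for v in row] for row in X]  # all features times 10

k = rbf(X[0], X[1], scale_gamma(X))
k_scaled = rbf(X_scaled[0], X_scaled[1], scale_gamma(X_scaled))
# rescaling the data leaves the kernel value (essentially) unchanged
```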
30. Good parametrisation
We choose the defaults, but also how parameters are expressed
(and whether they can be changed at all)
Should the number of nearest neighbours be specified as:
an absolute value (e.g. 10)?
a proportion of training samples (e.g. 2%)?
an arbitrary function of training data shape?
Algorithms and optimisation research often don’t report on this
31. Scientific software should make it easy for users to do good science.
32. Have you ever tried to de-tokenise the Penn TreeBank?
PTB is delivered with each token POS tagged or bracketed
We wanted to know how paragraph structure, etc. informed parsing
The source text before tokenisation is available
An easy case:
Tagged:  [ Imports/NNS ] were/VBD at/IN [ $/$ 50.38/CD billion/CD ] ,/, up/RB [ 19/CD %/NN ] ./.
Source:  Imports were at $50.38 billion, up 19%.
Offsets: Imports 1–7, were 9–12, at 14–15, $ 16–16, 50.38 17–21
33. Have you ever tried to de-tokenise the Penn TreeBank?
PTB is delivered with each token POS tagged or bracketed
We wanted to know how paragraph structure, etc. informed parsing
The source text before tokenisation is available
⇒ alignment hell due to typos corrected/inserted, reorderings, missed text, etc.
Hindsight: delivering annotations on tokenised text is a bad idea
It restricts what you can do with it later
NLP software should always provide stand-off markup
Or store the whitespace/non-token data as spaCy does
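Stand-off markup can be sketched in a few lines: annotations are stored as (start, end) character offsets into the untouched source string, so the original text, its spacing and punctuation included, always round-trips. The sentence and (0-based, end-exclusive) spans below are illustrative.

```python
# Stand-off markup: tokens as spans over the raw source, never as
# copied strings, so detokenisation is a slice rather than alignment hell.

source = "Imports were at $50.38 billion, up 19%."

# Each token is a (start, end) span into `source`.
tokens = [(0, 7), (8, 12), (13, 15), (16, 17), (17, 22), (23, 30),
          (30, 31), (32, 34), (35, 37), (37, 38), (38, 39)]

surface = [source[s:e] for s, e in tokens]

# The source text round-trips exactly: each segment covers the gap
# (whitespace) before a token plus the token itself.
prev_ends = [0] + [e for _, e in tokens[:-1]]
reconstructed = "".join(source[p:e] for (_, e), p in zip(tokens, prev_ends))
```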
35. Avoiding leakage in cross validation
Bad:  X_preprocessed = preprocessor.fit_transform(X)
      result = cross_validate(classifier, X_preprocessed, y)
Test data statistics leak into preprocessing
⇒ inflated cross validation results
Good: pipeline = make_pipeline(preprocessor, classifier)
      result = cross_validate(pipeline, X, y)
Solution De-emphasise fit_transform.
And make sure Pipeline works with everything;
and make sure cross_validate works with everything.
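The leak is easy to show numerically in a toy sketch (no scikit-learn needed): preprocessing statistics computed on all the data are shaped by the held-out point, so the training-fold representation itself changes.

```python
# Centring with full-data statistics lets the held-out fold influence
# the transformation applied to the training fold: test-set leakage.

def mean(xs):
    return sum(xs) / len(xs)

X = [1.0, 2.0, 3.0, 100.0]      # the last sample is the held-out fold
train, test = X[:3], X[3:]

leaky_center = mean(X)          # 26.5, shaped by the test point
proper_center = mean(train)     #  2.0, training data only

leaky_train = [x - leaky_center for x in train]
proper_train = [x - proper_center for x in train]
# The two training representations differ only because the test point
# was visible at preprocessing time; fitting the preprocessor inside the
# CV loop (as a pipeline arranges) removes the leak.
```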
36. Maintainers of large projects
can’t be experts in all the things they maintain
37. Scientists can (and do) help us:
make sure the implementation matches the name
make users aware of or avoid unexpected behaviour
parametrise algorithms and set defaults helpfully
understand how our design choices lead to flawed experiments
Users trust popular OSS.
Thank you for helping us make OSS trustworthy.