In this lecture, we give a practical guide to detecting ambiguities in natural language requirements using GATE and Python. A brief guide to Python is also included.
The previous lecture introduces the problem of ambiguity in requirements engineering. Find it here: https://www.slideshare.net/alessio_ferrari/requirements-engineering-focus-on-natural-language-processing-lecture-1
Extracting Archival-Quality Information from Software-Related Chats - Preetha Chatterjee
Software developers are increasingly having conversations about software development via online chat services. Many of those chat communications contain valuable information, such as code descriptions, good programming practices, and causes of common errors/exceptions. However, the nature of chat community content is transient, as opposed to the archival nature of other developer communications such as email, bug reports and Q&A forums. As a result, important information and advice are lost over time.
The focus of this dissertation is extracting archival information from software-related chats, specifically to (1) automatically identify conversations that contain archival-quality information, (2) accurately reduce the granularity of the information reported as archival, and (3) conduct a case study investigating how archival-quality information extracted from chats compares to related posts in Q&A forums. Knowledge archived from developer chats could potentially be used in several applications, such as creating a new archival mechanism for a given chat community, augmenting Q&A forums, or facilitating the mining of specific information to improve software maintenance tools.
DeepPavlov is an open-source framework for developing production-ready chatbots and complex conversational systems, as well as for NLP and dialog systems research.
Finding Help with Programming Errors: An Exploratory Study of Novice Software... - Preetha Chatterjee
Monthly, 50 million users visit Stack Overflow, a popular Q&A forum used by software developers, to share and gather knowledge and help with coding problems. Although Q&A forums serve as a good resource for seeking help from developers beyond the local team, the abundance of information can cause developers, especially novice software engineers, to spend considerable time in identifying relevant answers and suitable suggested fixes.
This exploratory study aims to understand how novice software engineers direct their efforts and what kinds of information they focus on within a post selected from the results returned in response to a search query on Stack Overflow. The results can be leveraged to improve the Q&A forum interface, guide tools for mining forums, and potentially improve the granularity of traceability mappings involving forum posts. We qualitatively analyze the novice software engineers' perceptions from a survey as well as their annotations of a set of Stack Overflow posts. Our results indicate that novice software engineers pay attention to only 27% of the code and 15-21% of the text in a Stack Overflow post to understand and determine how to apply the relevant information to their context. Our results also identify the kinds of information prominent in that focus.
https://imatge-upc.github.io/synthref/
Integrating computer vision with natural language processing has achieved significant progress over the last years owing to the continuous evolution of deep learning. A novel vision-and-language task, tackled in this Master's thesis, is referring video object segmentation, in which a language query defines which instance to segment from a video sequence. One of the biggest challenges for this task is the lack of relatively large annotated datasets, since a tremendous amount of time and human effort is required for annotation. Moreover, existing datasets suffer from poor-quality annotations, in the sense that approximately one out of ten language expressions fails to uniquely describe the target object.
The purpose of this thesis is to address these challenges by proposing a novel method for generating synthetic referring expressions for an image (video frame). The method produces synthetic referring expressions using only the ground-truth annotations of the objects together with their attributes, which are detected by a state-of-the-art object detection deep neural network. An advantage of the proposed method is that its formulation allows it to be applied to any object detection or segmentation dataset.
Using the proposed method, the first large-scale dataset with synthetic referring expressions for video object segmentation is created, based on an existing large benchmark dataset for video instance segmentation. The thesis also provides a statistical analysis and a comparison of the created synthetic dataset with existing ones.
Experiments on three different datasets used for referring video object segmentation demonstrate the usefulness of the generated synthetic data. More specifically, the results show that pre-training a deep neural network with the proposed synthetic dataset improves the network's ability to generalize across different datasets, without any additional annotation cost.
Peter Muschick MSc thesis
Universitat Politècnica de Catalunya, 2020
Sign language recognition and translation have been an active research field in recent years, with most approaches using deep neural networks to extract information from sign language data. This work investigates the mostly disregarded approach of using human keypoint estimation from image and video data with OpenPose in combination with a transformer network architecture. First, it was shown that it is possible to recognize individual signs (4.5% word error rate (WER)). Continuous sign language recognition, however, was more error-prone (77.3% WER), and sign language translation was not possible using the proposed methods, which might be due to the low accuracy of human keypoint estimation by OpenPose and the accompanying loss of information, or to insufficient capacity of the transformer model used. Results may improve with datasets containing higher repetition rates of individual signs or with a more precise focus on keypoint extraction of the hands.
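The word error rates quoted above are standard edit-distance-based scores between the recognized word sequence and the reference transcript. As a minimal sketch (not the thesis's actual evaluation code), WER can be computed in Python like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    i.e. word-level Levenshtein distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the sign means hello", "the sign is hello"))  # 0.25
```

A 4.5% WER thus means that roughly one word in twenty needs to be corrected; at 77.3% most of the output is wrong.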
Introduction to Prolog (PROgramming in LOGic) - Ahmed Gad
As part of an artificial intelligence course given in the Faculty of Computers and Information, Prolog was the first tool used to make intelligent decisions, such as establishing relations between different objects.
Prolog has a strong history in AI, starting in 1972 as a language that solves problems through logical inference.
Prolog is a general-purpose logic programming language associated with artificial intelligence and computational linguistics. Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is declarative: the program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations. The language was first conceived by a group around Alain Colmerauer in Marseille, France, in the early 1970s and the first Prolog system was developed in 1972 by Colmerauer with Philippe Roussel. Prolog was one of the first logic programming languages, and remains the most popular among such languages today, with several free and commercial implementations available. The language has been used for theorem proving, expert systems, type inference systems, and automated planning, as well as its original intended field of use, natural language processing. Modern Prolog environments support creating graphical user interfaces, as well as administrative and networked applications. Prolog is well-suited for specific tasks that benefit from rule-based logical queries such as searching databases, voice control systems, and filling templates.
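To make the facts-rules-query model described above concrete, here is a small Python sketch that mimics how a Prolog query over family relations is answered (the relation and constant names are illustrative, not taken from the slides; real Prolog would express this far more directly):

```python
# Facts: parent(Parent, Child), as a Prolog program would state them.
parent = {("tom", "bob"), ("tom", "liz"), ("bob", "ann")}

# Rule: grandparent(G, C) :- parent(G, P), parent(P, C).
def grandparent(g, c):
    people = {x for pair in parent for x in pair}
    return any((g, p) in parent and (p, c) in parent for p in people)

# Query: ?- grandparent(tom, ann).
print(grandparent("tom", "ann"))  # True
```

In Prolog itself, the computation is driven by the query: the engine searches the facts and rules for bindings that satisfy it, rather than executing explicit loops as above.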
Find me on:
AFCIT
http://www.afcit.xyz
YouTube
https://www.youtube.com/channel/UCuewOYbBXH5gwhfOrQOZOdw
Google Plus
https://plus.google.com/u/0/+AhmedGadIT
SlideShare
https://www.slideshare.net/AhmedGadFCIT
LinkedIn
https://www.linkedin.com/in/ahmedfgad/
ResearchGate
https://www.researchgate.net/profile/Ahmed_Gad13
Academia
https://www.academia.edu/
Google Scholar
https://scholar.google.com.eg/citations?user=r07tjocAAAAJ&hl=en
Mendeley
https://www.mendeley.com/profiles/ahmed-gad12/
ORCID
https://orcid.org/0000-0003-1978-8574
Stack Overflow
http://stackoverflow.com/users/5426539/ahmed-gad
Twitter
https://twitter.com/ahmedfgad
Facebook
https://www.facebook.com/ahmed.f.gadd
Pinterest
https://www.pinterest.com/ahmedfgad/
What's Spain's Paris? Mining Analogical Libraries from Q&A Discussions - Chunyang Chen
Best paper award at SANER'16. Third-party libraries are an integral part of many software projects. It often happens that developers need to find analogical libraries that provide features comparable to the libraries they are already familiar with. Existing methods for finding analogical libraries are limited to community-curated lists of libraries, blogs, or Q&A posts, which often contain overwhelming or out-of-date information. In this paper, we present a new approach to recommending analogical libraries based on a knowledge base of analogical libraries mined from the tags of millions of Stack Overflow questions. The novelty of our approach is to answer analogical-library questions by combining a state-of-the-art word embedding technique with domain-specific relational and categorical knowledge mined from Stack Overflow. We implement our approach in a proof-of-concept web application (https://graphofknowledge.appspot.com/similartech). The evaluation results show that our approach makes accurate recommendations of analogical libraries (Precision@1=0.81 and Precision@5=0.67). Google Analytics of the website traffic provides initial evidence of the potential usefulness of our web application for software developers.
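The word-embedding side of such an approach boils down to nearest-neighbour search under cosine similarity over tag vectors. A toy sketch with invented 3-dimensional "tag embeddings" (the real system learns embeddings from millions of Stack Overflow posts, and combines them with relational knowledge):

```python
import math

# Hypothetical low-dimensional tag embeddings; real ones are learned from data.
vectors = {
    "numpy":  [0.9, 0.1, 0.0],
    "nd4j":   [0.8, 0.2, 0.1],
    "django": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(tag):
    # Recommend the nearest other tag by cosine similarity.
    return max((t for t in vectors if t != tag),
               key=lambda t: cosine(vectors[tag], vectors[t]))

print(most_similar("numpy"))  # nd4j
```

The filtering by relational and categorical knowledge (e.g. keeping only tags that are libraries in the analogous ecosystem) is what distinguishes the paper's method from plain embedding similarity.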
Developing software often requires using a number of tools and languages. In embedded software, for example, it is common to use C and its IDE, Matlab/Simulink, a number of custom XML files, a requirements management tool such as DOORS and possibly a UML tool and a variant management tool. The integration of such a zoo of tools is often a major source of (accidental) complexity in development projects.
Contrast that with the "good old days" when everything was text files and command-line executables running in the Unix shell. This approach had two important properties: the infrastructure was extremely generic (Unix shell, pipes, text editors) and the actual contents were easily extensible and composable (new text file formats/languages and new command-line tools); a productive environment for a given project or domain could easily be built from the generic infrastructure plus a few custom extensions.
In this talk I want to propose that creating (domain-specific) development environments based on language workbenches yields many of the same advantages we all valued in the Unix shell world. A language workbench is an extremely generic infrastructure that is easily extensible with new languages. It is easy to create domain-specific development tools that address different aspects of the system with suitable abstractions, yet are very well integrated in terms of semantics and tooling.
Human Evaluation: Why do we need it? - Dr. Sheila Castilho (Sebastian Ruder)
Talk at the 8th NLP Dublin meetup (https://www.meetup.com/NLP-Dublin/events/241198412/) by Dr. Sheila Castilho, postdoc at ADAPT Centre, Dublin City University.
Named Entity Recognition for Twitter Microposts (only) using Distributed Word... - fgodin
As part of the Named Entity Recognition for Twitter microposts shared task at ACL2015, we propose a solution which only uses word embeddings. The word embeddings model is trained on 400 million tweets and is available at http://www.fredericgodin.com/software/.
Public Administration, Laws Requirements, Natural Language - ProjectLearnPAd
This presentation is based on LearnPAd EU Project: model-based learning for Public Administrations (www.learnpad.eu). It is about:
- Requirements in Learn PAd
- Natural language pragmatic ambiguities
PECB Webinar: Challenges in string search in Digital Computer Forensics - PECB
This webinar will give you a better understanding of fundamental computer forensics, based on the best practices used to implement forensic evidence recovery and analytical processes. The CLFE certification focuses on the core skills required to collect and analyze data from Windows, Mac OS X, and Linux computer systems, as well as from mobile devices.
About Presenter
This webinar will be presented by Fong Choong Fook, Director of LE Global Services. In 2013 he was awarded the IDG ASEAN CSO of the Year. Mr. Fong has considerable experience in the IT industry and is a 15-year veteran in the specialized and highly demanding area of information security; his special focus is delivering IT security training and consultations that match strategic business objectives.
Natural language processing for requirements engineering: ICSE 2021 Technical... - alessio_ferrari
These are the slides for the technical briefing at ICSE 2021, given by Alessio Ferrari, Liping Zhao, and Waad Alhoshan.
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
Please find the links to the colab notebooks here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/drive/1B_5ow3rvS0Qz1y-KyJtlMNnmgmx9w3kJ?usp=sharing
https://colab.research.google.com/drive/1Xrm0gNaa41YwlM5g2CRYYXcRvpbDnTRT?usp=sharing
Challenges for Speech Translation Business Applications - The "Factory" Case. Michael Paul, ATR-TREK.
We will describe our efforts to integrate speech translation technologies into the daily routines at overseas factories. We present a rubber industry use-case that enables company employees to interact with local workers and supports daily life in foreign countries.
Getting started on your natural language processing project? First you'll need to extract some features from your corpus. Frequency counts, syntax parses, and word vectors are good ones to start with.
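The frequency features mentioned above can be extracted with nothing beyond the standard library. A minimal bag-of-words sketch (the toy corpus is invented for illustration):

```python
from collections import Counter

corpus = [
    "requirements engineering uses natural language",
    "natural language is often ambiguous",
]

def term_frequencies(doc: str) -> Counter:
    # Lowercased whitespace tokenization; a real project would use a
    # proper tokenizer that handles punctuation, contractions, etc.
    return Counter(doc.lower().split())

for doc in corpus:
    print(term_frequencies(doc).most_common(2))
```

From here, the counts can be assembled into document-term vectors, which is the usual input to weighting schemes such as tf-idf and to the word-vector models mentioned above.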
PROGRAMMING AND LANGUAGES
Describe the six steps of programming
Discuss design tools
Describe program testing
Describe CASE tools & object-oriented software development
Explain the five generations of programming languages
The time spent looking for and not finding information costs an organization a total of $6 million a year, not including opportunity costs or the cost of reworking existing information that could not be located. Only 41% of localization-mature organizations have some terminology management policy in place, and those are almost solely translation-oriented. Today we will talk about how terminology management works and demonstrate its power through controlled languages, ontologies, search engine applications, content and knowledge management applications, and e-learning systems.
Systematic Literature Reviews and Systematic Mapping Studies - alessio_ferrari
Lecture slides on Systematic Literature Reviews and Systematic Mapping Studies in software engineering. It describes the different steps, discusses differences between the two methods, and gives guidelines on how to conduct these types of study.
Lecture on case study design and reporting in empirical software engineering. The lecture touches on units of analysis, data collection, data analysis, validity procedures, and collaboration with industry.
Qualitative Studies in Software Engineering - Interviews, Observation, Ground... - alessio_ferrari
This lecture covers qualitative data collection methods and qualitative data analysis in software engineering. Topics covered are:
1. Sampling
2. Interviews
3. Observation and Participant Observation
4. Archival Data Collection
5. Grounded theory, Coding, Thematic Analysis
6. Threats to validity in qualitative studies
Find the videos at: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
Controlled experiments, Hypothesis Testing, Test Selection, Threats to Validity - alessio_ferrari
Complete lecture on controlled experiments in software engineering. It provides practical guidelines on conducting controlled experiments and describes the concepts of dependent, independent, and control variables, significance, and p-value. It also explains how to select the appropriate statistical test for a hypothesis and gives examples of data for different typical tests.
Finally, it discusses threats to validity in controlled experiments and gives indications for reporting.
Find the video lectures here: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
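When the distributional assumptions behind parametric tests are doubtful, one defensible choice is a permutation test on the difference in means, which directly estimates a p-value by reshuffling group labels. A small sketch (the effort measurements below are invented for illustration, not taken from the lecture):

```python
import random

def permutation_test(a, b, n_iter=10000, seed=0):
    """Two-sided permutation test for a difference in means between groups
    a and b. Returns an estimated p-value: the fraction of label shuffles
    whose mean difference is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iter

# Invented task-completion times (minutes) under two treatments.
control   = [30, 35, 28, 40, 33]
treatment = [22, 25, 27, 24, 26]
print(permutation_test(control, treatment))  # small p-value for this sample
```

Here the independent variable is the treatment, the dependent variable is completion time, and the p-value is read off the permutation distribution rather than from a t-table.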
Requirements Engineering: focus on Natural Language Processing, Lecture 1 - alessio_ferrari
This is an introduction to requirements engineering, with a focus on ambiguity problems related to the use of natural language. It is the first of two lectures. In this lecture we give an overview of the requirements engineering problem, and an overview of ambiguity issues in requirements elicitation interviews and in requirements documents.
This presentation introduces the problem of ambiguity in software engineering in general and requirements engineering in particular. It discusses the various ways in which ambiguity can be classified according to the literature and presents work on detecting pragmatic ambiguity by means of collective intelligence (https://doi.org/10.1109/RE.2012.6345803).
Empirical Methods in Software Engineering - an Overview - alessio_ferrari
A first introductory lecture on empirical methods in software engineering. It includes:
1) Motivation for empirical software engineering studies
2) How to define research questions
3) Measures and data collection methods
4) Formulating theories in software engineering
5) Software engineering research strategies
Find the videos at: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
Operation “Blue Star” is the only event in the history of independent India in which the state went to war with its own people. Even after about 40 years it is not clear whether it was the culmination of the state's anger at the people of the region, a political power game, or the start of a dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from the mainstream due to the denial of their just demands during a long democratic struggle since independence. As happens all over the world, this led to a militant struggle with great loss of life among military, police, and civilian personnel. The killing of Indira Gandhi and the massacre of innocent Sikhs in Delhi and other Indian cities were also associated with this movement.
Welcome to TechSoup New Member Orientation and Q&A (May 2024) - TechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit www.vavaclasses.com
Embracing GenAI - A Strategic Imperative - Peter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
Requirements Engineering: focus on Natural Language Processing, Lecture 2
1. Requirements Engineering:
Focus on Natural Language
Requirements Processing
Alessio Ferrari1
1ISTI-CNR, Pisa, Italy
January 15, 2016
A. Ferrari (ISTI) Requirements Engineering 1 / 88
cf. https://docplayer.net/16051752-Introduction-to-programming-languages-and-techniques-xkcd-com-full-python-tutorial.html
2. Objectives of the Course
Objectives: lesson 1
Understand what requirements, specifications,
and domain assertions are
Understand the problems behind requirements elicitation
and SRS definition
Learn how to write a decent SRS
Objectives: lesson 2
Learn the basics of Natural Language Processing (NLP)
Learn the relation between NLP and Requirements
Learn the basics of GATE to detect NL ambiguities in
requirements
Learn the basics of Python NLTK (for further manipulation of NL
requirements)
3. NLP and Requirements Engineering
5. Natural Language Processing
Language is inherently ambiguous and hard to handle
- Imagine a well-defined syntax and semantics for an input message:
  you can write software to process it
- With NL you do not have a well-defined syntax and semantics:
  there is always some exception
Rule-based approaches are hard to build
Luckily, there's machine learning! ...but
- :( Large amounts of data are needed for training/test/validation
- :( Tagging data requires domain expertise
- :( Requirements use domain-specific jargon (features based on
  terms might work only for a specific company)
- :( Requirements structure varies from company to company (features
  based on syntax might need adjustments)
- :( Some tasks (e.g., ambiguity detection) have too many negative cases
  (i.e., well-written requirements) and too few positive ones (i.e., ambiguous)
6. 4 Dimensions of Natural Language Requirements Processing
Discipline: requirements have to be unambiguous yet readable,
with no equivalent (duplicated) requirements
Dynamism: requirements evolve, and have to be traced and
categorised
Domain Knowledge: domain knowledge has to be gathered to
understand requirements, and the proper stakeholders have to be
identified
Data-sets: tagged data-sets are required, but requirements are
often private
7. 4-D in the Requirements Process
[Diagram: the four dimensions (DISCIPLINE, DYNAMISM, DOMAIN KNOWLEDGE,
DATA-SETS) mapped onto the phases of the requirements process
(Elicitation, Analysis, Management, Validation), with the associated
tasks: Ambiguity, Readability, Compliance, Categorisation, Equivalence,
Traceability, Domain Term Identification, Stakeholder Identification,
Domain Scoping, Retrieval of Internal Requirements, Market Analysis.]
8. Focus on Discipline
An ambiguity in a requirement might cause severe consequences
during development
False negatives are NOT tolerable (i.e., ALL potential
ambiguities have to be discovered)
It is preferable to have a certain amount of false positives,
which the engineer can easily discard
Rec = | true ambiguities retrieved | / | true ambiguities |
Prec = | true ambiguities retrieved | / | items retrieved |
Hence, a system for ambiguity detection has to provide
100% recall!
Rule-based approaches are preferred
However, for some tasks, rule-based approaches give far too many
false positives.
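The Rec and Prec formulas above can be made concrete with a short sketch (Python 3 syntax; the requirement IDs R1-R5 are invented for illustration):

```python
def recall_precision(retrieved, true_ambiguous):
    """Recall and precision of an ambiguity detector, from the
    definitions above (both arguments are sets of requirement IDs)."""
    true_retrieved = retrieved & true_ambiguous
    recall = len(true_retrieved) / len(true_ambiguous)
    precision = len(true_retrieved) / len(retrieved)
    return recall, precision

# Hypothetical run: the detector flags R1-R4, but R1, R2 and R5 are
# the truly ambiguous requirements (it misses R5 entirely).
rec, prec = recall_precision({"R1", "R2", "R3", "R4"}, {"R1", "R2", "R5"})
print(rec)   # about 0.67 -- not 100% recall, so unacceptable here
print(prec)  # 0.5
```

A detector that misses R5, as in this run, already violates the 100% recall requirement, however good its precision.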
9. What you’ll learn
Rule-based approaches
Basic tools that you can use to implement rule-based approaches
These tools can be used as a basis if you want to move to
machine learning
10. NLP Minimal Concepts and Terminology
11. NLP Minimal Concepts
Every task in NLP needs to perform Text Normalization:
Segmenting/Tokenizing words in the text
Normalizing word formats (Stemming and Lemmatization)
Segmenting/Tokenizing sentences
Besides that, many tasks need Part-Of-Speech (POS) Tagging
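As a rough illustration, the normalization steps above can be chained in a few lines of plain Python (a naive regex sketch; it is not what NLTK or GATE do internally):

```python
import re

def normalize(text):
    """Naive text normalization: sentence segmentation, word
    tokenization, and lower-casing. Real pipelines use proper
    tokenizers; this only illustrates the order of the steps."""
    # 1. Sentence segmentation (split on ., ! and ?)
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    # 2. Word tokenization + case normalization, per sentence
    return [[w.lower() for w in re.findall(r"\w+", s)] for s in sentences]

print(normalize("The system shall log events. It shall alert the nurse."))
# [['the', 'system', 'shall', 'log', 'events'],
#  ['it', 'shall', 'alert', 'the', 'nurse']]
```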
12. NLP Minimal Concepts - Tokenization (or Word Segmentation)
A token is an occurrence of a WORD in a stream of text
Tokenization splits a stream of text into a list of WORDS
Example (from Dan Jurafsky): how many words in
"they lay back on the San Francisco grass and looked at the stars and their"?
- Type: an element of the vocabulary.
- Token: an instance of that type in running text.
- How many? 15 tokens (or 14), and 13 types (or 12) (or 11?)
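The token/type count for the example sentence can be verified directly (plain Python, whitespace tokenization):

```python
sentence = ("they lay back on the San Francisco grass "
            "and looked at the stars and their")
tokens = sentence.split()   # one token per whitespace-separated word
types = set(tokens)         # distinct vocabulary items
print(len(tokens))  # 15
print(len(types))   # 13 ("the" and "and" each occur twice)
```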
13. NLP Minimal Concepts - Multi-word Terms
Multi-word Terms
Example: “The wayside equipment shall respond within 5
seconds”
wayside equipment is a multi-word term like San Francisco
We do not discuss multi-word terms further, but keep them in mind
because they are frequent in requirements
17. NLP Minimal Concepts - POS Tagging
Open class (lexical) words:
- Nouns: proper (IBM, Italy) and common (cat/cats, snow)
- Main verbs: see, registered
- Adjectives: old, older, oldest
- Adverbs: slowly
- Numbers: 122,312, one
Closed class (functional) words:
- Modals: can, had
- Prepositions: to, with
- Particles: off, up
- Determiners: the, some
- Conjunctions: and, or
- Pronouns: he, its
- Interjections: Ow, Eh
... and more in each class
20. IR vs IE
Information Retrieval (IR): pulls documents from large corpora
Information Extraction (IE): retrieves facts and structured
information from large corpora
IR returns documents containing the relevant information
somewhere (normally fast)
IE returns precise and structured information (can be slow)
GATE (General Architecture for Text Engineering) supports IE
Potential Usage
Entity, event, and relation extraction
Enrich features for retrieval
Annotate documents for ML
Ambiguity identification with rule-based approaches
21. Document, Corpus, Annotation, Feature
Document = text + annotations + features
Corpus = collection of documents
Linguistic information in documents is encoded in the form of
annotations (like colored mark-ups)
Annotations have features, each with a type and a value
Example
Annotation: Sentence Length
Feature 1: Length in Characters, Value 1 = 100
Feature 2: Length in Tokens, Value 2 = 15
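Outside GATE, the same document/annotation/feature model can be mimicked with ordinary data structures. A hypothetical sketch, not GATE's actual API:

```python
# An annotation covers a character span of the text and carries
# typed features (feature name -> value), as in the example above.
doc_text = "The wayside equipment shall respond within 5 seconds"

annotation = {
    "type": "Sentence Length",       # annotation type
    "start": 0,                      # character span it covers
    "end": len(doc_text),
    "features": {
        "Length in Characters": len(doc_text),
        "Length in Tokens": len(doc_text.split()),
    },
}
print(annotation["features"]["Length in Tokens"])  # 8
```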
23. Processing Resources (PR)
PRs are algorithms that perform the actual NLP steps
Document Reset PR - deletes annotations
ANNIE English Tokeniser - identifies tokens (words, numbers, etc.)
ANNIE Sentence Splitter - identifies sentence boundaries
ANNIE Gazetteer - identifies specific tokens from a list
ANNIE POS Tagger - identifies parts of speech (POS), like noun,
adjective, etc.
JAPE Transducers - user-defined annotations based on regular
expressions over annotations
ANNIE collects all these algorithms and runs them in a PIPELINE
24. Using Processing Resources in GATE
Open GATE
Language Resources > New > GATE Document
Load Source Document: exercises/requirements.txt
Right-click> New Corpus from this document
Right-click>Processing Resources (then select the resources
reported in the next slides)
28. ANNIE POS Tagger
Modifies the Token annotations with a new feature category = NN, JJ,
VB, etc.
29. JAPE Transducers
Gazetteer lists are designed for annotating simple, regular
features
Even identifying simple patterns like e-mails is impossible with a
Gazetteer
What is JAPE?
JAPE provides pattern matching in GATE
Each JAPE rule consists of:
- a LHS (left-hand side), which contains the patterns to match
- a RHS (right-hand side), which details the annotations (and
  optionally features) to be created
30. Example
I want to find all the occurrences of the term “level” followed by a
number (level 1, level 2, etc.)
33. GATE for Ambiguity Detection
34. Steps - Detecting Vagueness
Open GATE
Language Resources > New > GATE Document
Load Source Document: exercises/requirements.txt
Right-click> New Corpus from this document
Copy file exercises/vagueness.lst to
/GATE_Developer_8.0/plugins/ANNIE/resources/gazetteer
Open file lists.def (this file holds the pointers to all the lists
checked by the gazetteer Look-up)
Add the line: vagueness.lst:vague_term
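The gazetteer look-up itself is conceptually simple; here is a Python sketch of the same idea (the vague terms below are a small invented sample, not the actual contents of vagueness.lst):

```python
# Hypothetical vague-term list; in GATE these entries would live in
# the vagueness.lst gazetteer file, one term per line.
VAGUE_TERMS = {"user-friendly", "adequate", "as appropriate", "flexible"}

def find_vague_terms(requirement):
    """Return, sorted, the vague terms that occur in a requirement
    (case-insensitive, mirroring 'case sensitive = false')."""
    text = requirement.lower()
    return sorted(term for term in VAGUE_TERMS if term in text)

req = "The interface shall be User-Friendly and provide adequate feedback."
print(find_vague_terms(req))  # ['adequate', 'user-friendly']
```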
35. Steps - Detecting Vagueness (continue)
Go back to GATE
Processing Resources>New>ANNIE Gazetteer, and give it a
name
Check: case sensitive = false
Click on the Gazetteer, and check the presence of vagueness.lst
Applications>ANNIE
Move the Gazetteer to the right panel
Run!
36. Steps - Annotating Vagueness
Processing Resources>New>Jape Transducer
Load annotate_vagueness.jape
Applications>ANNIE
Move the JAPE rule to the right panel
Run!
37. Coordination Ambiguities
Detect all sentences that include a potential coordination
ambiguity
Coordination ambiguities are of two types:
- Example: The system shall produce the plot or the data-log and the
  legend.
- Example: New employees and directors shall be provided with a
  user-name and password.
38. Coordination Ambiguities - Jape Rule 0
Retrieve all occurrences of And / Or
Phase: MatchAndOr
Input: Token
Options: control = appelt
//Operator ==~ returns only whole string matches
//Operator (?i) tells that the matching is case insensitive
//Operator | is the logical "or" operator
Rule: checkAndOr
Priority: 1
(
{Token.string ==~ "(?i)and"} |
{Token.string ==~ "(?i)or"}
):coord
-->
:coord.AndOr = {}
39. Coordination Ambiguities - Jape Rule 1
Annotate all the sequences of And/Or in the same sentence
Phase: MatchCoordinationSequences
Input: Split AndOr
Options: control = appelt
//Note that, having Split among the input Annotations allows
//us to identify sequences of And/Or in the same sentence
//The '+' operator means "one or more occurrences"
//The '*' operator means "zero or more occurrences"
//The '?' operator means "zero or one occurrence"
Rule: checkCoordinationSequences
Priority: 1
(
{AndOr}
({AndOr})+
):coordSequence
-->
:coordSequence.AndOrSequence = {}
40. Coordination Ambiguities - Jape Rule 2
Annotate all the sentences with And/Or sequences
Phase: MatchCoordinationSentences
Input: Sentence AndOrSequence
Options: control = appelt
//A "contains" B: searches for
//annotations A that contain annotations B
Rule: checkCoordinationSentences
Priority: 1
(
{Sentence contains AndOrSequence}
):coordSentence
-->
:coordSentence.AndOrSentence = {}
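For comparison, the three JAPE phases above roughly amount to the following Python sketch: flag any sentence that contains a sequence of two or more and/or coordinations.

```python
import re

def has_coordination_sequence(sentence):
    """True if the sentence contains two or more and/or tokens,
    loosely mirroring the AndOr -> AndOrSequence -> AndOrSentence
    JAPE phases (one sentence per call)."""
    matches = re.findall(r"\b(?:and|or)\b", sentence, flags=re.IGNORECASE)
    return len(matches) >= 2

print(has_coordination_sequence(
    "The system shall produce the plot or the data-log and the legend."))
# True
print(has_coordination_sequence(
    "The system shall produce the plot and the legend."))
# False
```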
41. Your Turn! - Detect Coordination Ambiguities of the Second Type
Add the example: "New employees and directors shall be
provided with a user-name and password"
Remember that ANNIE POS Tagger gives you the category of the
Token
JJ = Adjective, NN = Noun, NNS = Plural Noun
More information on the tag-set, and many other things, can be
found in: https://gate.ac.uk/sale/tao/tao.pdf
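One possible starting point for the exercise, sketched in Python: scan already POS-tagged tokens (as the ANNIE POS Tagger or nltk.pos_tag would produce) for noun and/or noun coordinations. The tags below are written by hand so the sketch runs stand-alone:

```python
def noun_coordinations(tagged):
    """Given (word, POS-tag) pairs, return the (noun, noun) pairs that
    are joined by and/or. Tags starting with NN mark nouns in the
    Penn tag-set."""
    pairs = []
    for i in range(1, len(tagged) - 1):
        word, _ = tagged[i]
        if word.lower() in ("and", "or"):
            (prev, ptag), (nxt, ntag) = tagged[i - 1], tagged[i + 1]
            if ptag.startswith("NN") and ntag.startswith("NN"):
                pairs.append((prev, nxt))
    return pairs

# Hand-written tags for: "New employees and directors shall be
# provided with a user-name and password."
tagged = [("New", "JJ"), ("employees", "NNS"), ("and", "CC"),
          ("directors", "NNS"), ("shall", "MD"), ("be", "VB"),
          ("provided", "VBN"), ("with", "IN"), ("a", "DT"),
          ("user-name", "NN"), ("and", "CC"), ("password", "NN")]
print(noun_coordinations(tagged))
# [('employees', 'directors'), ('user-name', 'password')]
```

A sentence with two such coordinations is a candidate for the second ambiguity type.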
43. Python Basic Concepts
Python Online! http://www.learnpython.org/
Tutorial: https://docs.python.org/2/tutorial/
44. Python - Main Features
Interpreted language
Dynamically typed (variables do not have a predefined type!)
Rich, built-in collection types:
- Lists
- Tuples
- Dictionaries
- Sets
Concise
My Humble Opinion
It is perfect if you want to develop something quickly
It is a mess if you want to develop structured software
45. Which Python?
Python 2.7: that's the one we'll use here
- Last stable release before version 3
- Implements part of the version 3 features
- Fully backwards compatible
Python 3.0: that's the one you should use in the future
- Released on Dec. 3, 2008
- Many changes (some of them incompatible)
- ...but: a few important third-party libraries are not yet compatible
  with Python 3
- Versions compatible with NLTK 3.0: either 2.7, or 3.2+
Hint: if you want to use NLTK, go to
http://www.nltk.org/install.html and check the links
to the additional libraries and recommended Python releases
before downloading Python
46. Python - Example
A Code Sample (in IDLE)
x = 34 - 23        # A comment.
y = "Hello"        # Another one.
z = 3.45
if z == 3.45 or y == "Hello":
    x = x + 1
    y = y + " World"   # String concatenation.
print x
print y
47. Python - Enough to Understand the Code
Indentation matters to the meaning of the code
The first assignment to a variable creates it (x = 3)
I Variable types do not need to be declared
I Python figures out the variable types on its own
Logical operators are words (and, or, not), NOT symbols
Simple printing can be done with print
Use a newline to end a line of code
Use a backslash (\) when a statement must continue on the next line
48. Python - Assignment
Binding a variable in Python means setting a name to hold a
reference to some object
Hence, assignment creates references, not copies (like in Java)
An object is deleted (by the garbage collector) once it becomes
unreachable
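A quick demonstration of reference semantics (Python 3 print syntax; with Python 2 only the print lines differ):

```python
a = [1, 2, 3]
b = a            # no copy: b now references the SAME list object
b.append(4)
print(a)         # [1, 2, 3, 4] -- the change is visible through both names
print(a is b)    # True: one object, two names
c = a[:]         # slicing DOES create a (shallow) copy
c.append(5)
print(a)         # still [1, 2, 3, 4]
print(c)         # [1, 2, 3, 4, 5]
```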
49. Python - Language Features
Indentation instead of braces
Several sequence types:
- Strings (immutable):
  'This is a string', and "Also this is a string"
- Tuples (immutable):
  ('the', 'system', 'shall', 'measure', 'the', 'heartbeat')
- Lists (mutable):
  ['the', 'system', 'shall', 'measure', 'the', 'heartbeat']
Lists and tuples can contain anything:
- (('REQ', 1), 'the system shall measure the heartbeat')
- [(('REQ', 1), 'the system shall measure the heartbeat'), (('REQ', 2),
  'the system shall alert the nurse')]
50. Python - Sequence Types
Tuples are defined using parentheses (and commas):
tu = (23, 'abc', 4.56, (2,3), 'def')
Lists are defined using square brackets (and commas):
li = ["abc", 34, 4.34, 23]
Strings are defined using quotes (", ', or """):
st = "Hello World"
st = 'Hello World'
st = """Hello World"""
51. Python - Sequence Types
We can access individual members of a tuple, list, or string using
square bracket "array" notation.
Note that all are 0-based...
>>> tu = (23, 'abc', 4.56, (2,3), 'def')
>>> tu[1] # Second item in the tuple.
'abc'
>>> li = ["abc", 34, 4.34, 23]
>>> li[1] # Second item in the list.
34
>>> st = "Hello World"
>>> st[1] # Second character in string.
'e'
52. Python - Negative Indices
>>> t = (23, 'abc', 4.56, (2,3), 'def')
Positive index: count from the left, starting with 0.
>>> t[1]
'abc'
Negative index: count from the right, starting with -1.
>>> t[-3]
4.56
53. Python - Slicing: Return a Copy of a Subset
>>> t = (23, 'abc', 4.56, (2,3), 'def')
Return a copy of the container with a subset of the original
members. Start copying at the first index, and stop copying
before the second index.
>>> t[1:4]
('abc', 4.56, (2,3))
You can also use negative indices when slicing.
>>> t[1:-1]
('abc', 4.56, (2,3))
An optional third argument allows selection of every nth item.
>>> t[1:-1:2]
('abc', (2,3))
54. Python - The 'in' Operator
Boolean test of whether a value is inside a collection (often
called a container in Python):
>>> t = [1, 2, 4, 5]
>>> 3 in t
False
>>> 4 in t
True
>>> 4 not in t
False
For strings, it tests for substrings:
>>> a = 'abcde'
>>> 'c' in a
True
>>> 'cd' in a
True
>>> 'ac' in a
False
Be careful: the in keyword is also used in the syntax of
for loops and list comprehensions.
55. Python - The '+' Operator
The + operator produces a new tuple, list, or string whose
value is the concatenation of its arguments.
It extends concatenation from strings to other sequence types:
>>> (1, 2, 3) + (4, 5, 6)
(1, 2, 3, 4, 5, 6)
>>> [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
>>> "Hello" + " " + "World"
'Hello World'
56. Lists - Mutable
>>> li = ['abc', 23, 4.34, 23]
>>> li[1] = 45
>>> li
['abc', 45, 4.34, 23]
We can change lists in place.
The name li still points to the same object when
we're done.
57. Tuples - Immutable
>>> t = (23, 'abc', 4.56, (2,3), 'def')
>>> t[2] = 3.14
Traceback (most recent call last):
  File "<pyshell#75>", line 1, in -toplevel-
    t[2] = 3.14
TypeError: object doesn't support item assignment
You can't change a tuple.
You can, however, make a fresh tuple and assign its reference to a
previously used name.
>>> t = (23, 'abc', 3.14, (2,3), 'def')
The immutability of tuples means they're faster than lists.
58. Operations on Lists
>>> li = [1, 11, 3, 4, 5]
>>> li.append('a') # Note the method syntax
>>> li
[1, 11, 3, 4, 5, 'a']
>>> li.insert(2, 'i')
>>> li
[1, 11, 'i', 3, 4, 5, 'a']
59. Dictionaries
Dictionaries store a mapping between a set of keys and a set of
values
Keys can be any immutable type
Values can be any type
60. Dictionaries - Creating and Accessing
>>> d = {'user':'bozo', 'pswd':1234}
>>> d['user']
'bozo'
>>> d['pswd']
1234
>>> d['bozo']
Traceback (innermost last):
  File '<interactive input>' line 1, in ?
KeyError: bozo
61. Dictionaries - Updating
>>> d = {'user':'bozo', 'pswd':1234}
>>> d['user'] = 'clown'
>>> d
{'user':'clown', 'pswd':1234}
Keys must be unique.
Assigning to an existing key replaces its value.
>>> d['id'] = 45
>>> d
{'user':'clown', 'id':45, 'pswd':1234}
Dictionaries are unordered:
a new entry might appear anywhere in the output.
(Dictionaries work by hashing.)
62. Dictionaries - Removing
Removing dictionary entries
>>> d = {'user':'bozo', 'p':1234, 'i':34}
>>> del d['user'] # Remove one entry. Note that del is
                  # a statement, not a function.
>>> d
{'p':1234, 'i':34}
>>> d.clear() # Remove all entries.
>>> d
{}
>>> a = [1, 2]
>>> del a[1] # (del also works on lists)
>>> a
[1]
63. Dictionaries - Useful Methods
Useful accessor methods
>>> d = {'user':'bozo', 'p':1234, 'i':34}
>>> d.keys() # List of current keys.
['user', 'p', 'i']
>>> d.values() # List of current values.
['bozo', 1234, 34]
>>> d.items() # List of (key, value) tuples.
[('user','bozo'), ('p',1234), ('i',34)]
64. If Statements (as expected)
if x == 3:
    print "X equals 3."
elif x == 2:
    print "X equals 2."
else:
    print "X equals something else."
print "This is outside the 'if'."
Note:
- Use of indentation for blocks
- Colon (:) after the boolean expression
65. While Loops (as expected)
>>> x = 3
>>> while x < 5:
        print x, "still in the loop"
        x = x + 1
3 still in the loop
4 still in the loop
>>> x = 6
>>> while x < 5:
        print x, "still in the loop"
>>>
66. For Loops (very cool)
For-each is Python's only form of for loop
A for loop steps through each of the items in a collection
type, or any other type of object which is "iterable":
for <item> in <collection>:
    <statements>
If <collection> is a list or a tuple, then the loop steps
through each element of the sequence.
If <collection> is a string, then the loop steps through each
character of the string:
for someChar in "Hello World":
    print someChar
67. For Loops
<item> can be more complex than a single variable name.
If the elements of <collection> are themselves collections,
then <item> can match the structure of the elements. (We
saw something similar with list comprehensions and with
ordinary assignments.)
for (x, y) in [(a,1), (b,2), (c,3), (d,4)]:
    print x
68. For Loops and the range() Function
We often want to write a loop where the variable ranges
over some sequence of numbers. The range() function
returns a list of numbers from 0 up to, but not including, the
number we pass to it.
range(5) returns [0,1,2,3,4]
So we can say:
for x in range(5):
    print x
(There are several other forms of range() that provide
variants of this functionality...)
xrange() returns an iterator that provides the same
functionality more efficiently
69. For Loops and the range() Function
Abuse of the range() function
Don't use range() to iterate over a sequence solely to have
the index and the elements available at the same time.
Avoid:
for i in range(len(mylist)):
    print i, mylist[i]
Instead:
for (i, item) in enumerate(mylist):
    print i, item
This is an example of an anti-pattern in Python.
For more, see:
http://www.seas.upenn.edu/~lignos/py_antipatterns.html
http://stackoverflow.com/questions/576988/python-specific-antipatterns-and-bad-practices
70. List Comprehensions (remarkably cool)
A powerful feature of the Python language:
generate a new list by applying a function to every member
of an original list.
Python programmers use list comprehensions extensively.
You'll see many of them in real code.
[ expression for name in list ]
71. List Comprehensions (remarkably cool)
>>> li = [3, 6, 2, 7]
>>> [elem*2 for elem in li]
[6, 12, 4, 14]
[ expression for name in list ]
Where expression is some calculation or operation
acting upon the variable name.
For each member of the list, the list comprehension
1. sets name equal to that member, and
2. calculates a new value using expression.
It then collects these new values into a list, which is the
return value of the list comprehension.
72. List Comprehensions (remarkably cool)
If the elements of list are other collections, then
name can be replaced by a collection of names
that match the "shape" of the list members.
>>> li = [('a', 1), ('b', 2), ('c', 7)]
>>> [ n * 3 for (x, n) in li ]
[3, 6, 21]
[ expression for name in list ]
73. Filtered List Comprehensions (remarkably cool)
The filter determines whether expression is performed
on each member of the list.
When processing each element of list, first check whether
it satisfies the filter condition.
If the filter condition returns False, that element is
omitted from the list before the list comprehension
is evaluated.
[ expression for name in list if filter ]
74. Filtered List Comprehensions (remarkably cool)
>>> li = [3, 6, 2, 7, 1, 9]
>>> [elem * 2 for elem in li if elem > 4]
[12, 14, 18]
Only 6, 7, and 9 satisfy the filter condition,
so only 12, 14, and 18 are produced.
[ expression for name in list if filter ]
75. Functions
First line with less
indentation is considered to be
outside of the function definition.
Defining Functions
No declaration of types of arguments or result
def get_final_answer(filename):
"""Documentation String"""
line1
line2
return total_counter
...
Function definition begins with def Function name and its arguments.
‘return’ indicates the
value to be sent back to the caller.
Colon.
72
76. Function Call
Calling a Function
>>> def myfun(x, y):
        return x * y
>>> myfun(3, 4)
12
77. Import
import somefile
Everything in somefile.py can be referred to by:
somefile.className.method("abc")
somefile.myFunction(34)
from somefile import *
Everything in somefile.py can be referred to by:
className.method("abc")
myFunction(34)
Careful! This can overwrite the definition of an existing
function or variable!
78. Python NLTK Basic Concepts
http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk
79. Python NLTK Main Features
Sentence and word tokenization
POS tagging
Chunking and Named Entity Recognition (NER)
Text Classification
Many included corpora
80. Python NLTK - sentence tokenization (splitting)
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("hello this is nltk")
['hello this is nltk']
>>> sent_tokenize("hello. this is nltk")
['hello.', 'this is nltk']
81. Python NLTK - word tokenization
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("this is nltk")
['this', 'is', 'nltk']
82. Python NLTK - word tokenization with punctuation
>>> word_tokenize("The woman's car")
['The', 'woman', "'s", 'car']
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize("The woman's car")
['The', 'woman', "'", 's', 'car']
83. Python NLTK - stemming
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('requirements')
'requir'
85. Python NLTK - POS tagging
>>> t = word_tokenize("The system shall be nice")
>>> from nltk import pos_tag
>>> pos_tag(t)
[('The', 'DT'), ('system', 'NN'),
('shall', 'MD'), ('be', 'VB'), ('nice', 'JJ')]
86. Python NLTK - POS tagging
How can I know the meaning of the tags?
>>> import nltk
>>> nltk.help.upenn_tagset('DT')
DT: determiner
all an another any both del each either every half la many much nary
neither no some such that the them these this those
>>> nltk.help.upenn_tagset('NN')
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
>>> nltk.help.upenn_tagset('NNP')
NNP: noun, proper, singular
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool ...
87. Python NLTK - stopword removal
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> sentence = 'The system shall be usable'
>>> print [i for i in word_tokenize(sentence) if i not in stoplist]
['The', 'system', 'shall', 'usable']
>>> print [i for i in word_tokenize(sentence.lower())
        if i not in stoplist]
['system', 'shall', 'usable']
88. When to use What
GATE is good to annotate documents, and find patterns
Python NLTK is good for fast prototyping of analysis tools
If you have to extract information from text: use GATE
If you have to do more fine-grained analyses, and you do not need
to build a well-structured program: use Python NLTK
If you have to build a structured program: study the OpenNLP
library (Java)