This document provides an overview of knowledge representation in natural language processing. It discusses part-of-speech tagging using various taggers like the default tagger, regular expression tagger, and lookup tagger. It also covers n-gram tagging using a unigram tagger. The document compares the performance of these taggers on test data from the Brown corpus and finds that the lookup tagger and unigram tagger perform best with accuracies of around 58% and higher.
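To make the tagger comparison concrete, here is a minimal pure-Python sketch of a lookup (unigram-style) tagger that falls back to a default tag for unseen words. The function names, the default tag "NN", and the tiny hand-made training data are assumptions for illustration; they are not the slides' code, and the sketch does not use the Brown corpus or NLTK's tagger classes.

```python
from collections import Counter, defaultdict


def train_lookup_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, lookup, default_tag="NN"):
    """Tag known words from the lookup table; unseen words get the default tag."""
    return [(word, lookup.get(word, default_tag)) for word in words]


def accuracy(lookup, gold_sentences, default_tag="NN"):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tag(words, lookup, default_tag)
        correct += sum(p == g for p, g in zip(predicted, sentence))
        total += len(sentence)
    return correct / total if total else 0.0


# Hypothetical toy data, only to exercise the functions.
train = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("run", "NN"), ("ends", "VBZ")],
]
test = [[("the", "AT"), ("cat", "NN"), ("runs", "VBZ")]]

lookup = train_lookup_tagger(train)
print(tag(["the", "cat", "runs"], lookup))
print(accuracy(lookup, test))
```

The default tag plays the role of a backoff: the lookup table answers for words seen in training, and everything else receives the single most plausible guess.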
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur... by Alessandro Suglia
Presentation for "Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks" at the 7th Italian Information Retrieval Workshop.
See paper: http://ceur-ws.org/Vol-1653/paper_11.pdf
Using NP Problems to Share Keys in Secret-Key Cryptography, by iosrjce
Public-key cryptography has become an important means of providing confidentiality through key distribution, in which users communicate privately with the help of encryption keys. It also provides digital signatures, which allow users to sign keys to verify their identities. However, public-key cryptography has its own shortcomings: the high cost of key distribution and the excessive computation required for encoding and decoding.
Private-key cryptography avoids these problems, but only if there is a way to share the private key confidentially.
This research presents an approach that uses so-called NP problems to send or share keys with the receiver without the need for a third party. The sender and receiver can share any key any number of times to encrypt data confidentially, which also helps to counter brute-force attacks.
This document provides an overview of a Ph.D. viva presentation on using game theory to model and address network attacks involving multiple compromised nodes. It discusses how hide-and-seek games can model such attacks as two-sided search problems. An empirical game theoretic analysis approach is proposed to study richer hide-and-seek models and derive strategies for both attackers and defenders. The methodology involves defining computational models, enumerating strategies, simulating strategy matchups, and analyzing results to determine optimal strategies.
The document discusses a workshop on folksonomies held in Singapore. It provides an agenda for the workshop which includes several lectures on topics related to folksonomies, such as how they can be used for indexing and knowledge representation. It also discusses how folksonomies can be used in information retrieval and enhanced through techniques such as tag gardening.
The document discusses different types of knowledge that may need to be represented in AI systems, including objects, events, performance, and meta-knowledge. It also discusses representing knowledge at two levels: the knowledge level containing facts, and the symbol level containing representations of objects defined in terms of symbols. Common ways of representing knowledge mentioned include using English, logic, relations, semantic networks, frames, and rules. The document also discusses using knowledge for applications like learning, reasoning, and different approaches to machine learning such as skill refinement, knowledge acquisition, taking advice, problem solving, induction, discovery, and analogy.
This document provides guidance on how to make, confirm, cancel, and reschedule appointments in English. It includes sample dialogues for requesting or making an appointment, responding to a request, confirming or agreeing on details, disagreeing and proposing alternatives, and canceling or changing an appointment. Examples are provided for each case. The document concludes with instructions for a role play activity where the reader takes on the role of a sales representative scheduling meetings with managers in Europe.
This document contains a lecture on knowledge representation in digital humanities. It discusses using strings to represent text in Python programming. The lecture includes exercises on defining functions to print prime numbers under 100 and exploring string indices. It also covers functions, data types like integers and strings, and using strings to access individual characters and slices of text.
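The exercises mentioned above can be sketched as follows. This is one possible solution, not the lecture's own; the function name `primes_below` and the sample string are assumptions chosen for illustration.

```python
def primes_below(limit):
    """Return all primes smaller than limit, using trial division by earlier primes."""
    primes = []
    for n in range(2, limit):
        # n is prime if no smaller prime up to sqrt(n) divides it.
        if all(n % p != 0 for p in primes if p * p <= n):
            primes.append(n)
    return primes


print(primes_below(100))

# Strings are sequences: indices pick single characters, slices pick substrings.
text = "humanities"
first = text[0]    # first character
last = text[-1]    # negative indices count from the end
head = text[:5]    # slice from the start up to, but not including, index 5
print(first, last, head)
```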
This document summarizes a lecture on knowledge representation in digital humanities. It discusses formalizing the modeling of real-world domains and representing complex objects. The lecture covers more complex data types in Python like lists, tuples, and dictionaries. It explains accessing, modifying, and deleting items from these data types. The document also discusses object-oriented programming concepts like classes, objects, attributes and methods for modeling domains.
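As a small illustration of how lists, tuples, dictionaries, and classes can model a domain together, consider the sketch below. The `Author` class and all the data in it are hypothetical and are not taken from the lecture.

```python
class Author:
    """A domain object with attributes (name, works) and methods."""

    def __init__(self, name):
        self.name = name
        self.works = []                      # list: ordered and mutable

    def add_work(self, title, year):
        self.works.append((title, year))     # tuple: a fixed (title, year) record

    def titles(self):
        return [title for title, _ in self.works]


# Hypothetical data for illustration only.
author = Author("Example Author")
author.add_work("First Book", 1900)
author.add_work("Second Book", 1905)

catalogue = {author.name: author}            # dict: look up objects by key

catalogue["Example Author"].works[0] = ("First Book (rev.)", 1901)  # modify an item
del catalogue["Example Author"].works[1]                            # delete an item
print(author.titles())
```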
This document summarizes a lecture on knowledge representation in digital humanities. It discusses:
- The contents and objectives of the lecture, which trains problem-solving skills through algorithm formalization.
- A discussion of the last assignment to consolidate concepts and discuss specific project solutions.
- Chapter 3 which covers fundamentals of programming, including designing algorithms, elements of a program, and the programming process.
This document provides an overview of a lecture on knowledge representation in digital humanities. It begins with an introduction to the course, its justification and goals, including explaining why knowledge representation and skills like modeling, programming, and natural language processing are important for digital humanities. It then discusses what digital humanities encompasses and provides some definitions of the field from various scholars. Examples are given of digital humanities projects, including the Sylva Project, which involves modeling, knowledge representation, data visualization, and collaboration.
This document discusses a lecture on knowledge representation in digital humanities. It covers:
1. An introduction to the lecture, which teaches Python programming and develops programming skills for knowledge representation and modeling.
2. A discussion of the previous assignment to consolidate concepts from readings and discuss specific solutions.
3. An overview of Chapter 4 on the Python programming language, covering features of Python, programming in Python using variables, expressions, conditionals and iterations.
The document discusses privacy in social networks and the design of a social media simulator called MCAS. MCAS aims to predict information cascades across platforms using endogenous and exogenous signals. Scenario 1 uses only endogenous Reddit data to predict discussion thread growth, evaluating against baselines. Scenario 2 predicts Twitter activity using both endogenous social media discussions and exogenous news articles. The goal is to generate realistic simulations for applications like disaster response and trend analysis.
The document explores linguistic complexity in two text types (news and letters) from 1750-1990 in English. It analyzes subjects and objects of sentences using several metrics related to length, structure, and processing efficiency. Preliminary results show objects are generally more complex than subjects, and news may be more complex than letters based on longer constituents and intermediate nodes. Further analysis is needed to better understand complexity differences over time and between text types.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks, by Leonardo Di Donato
Experimental work on the use of Topic Modeling to implement and improve some common tasks in Information Retrieval and Word Sense Disambiguation.
It first describes the scenario, the pre-processing pipeline that was built, and the framework used. It then discusses an investigation of several hyperparameter configurations for the LDA algorithm.
The work continues with the retrieval of relevant documents, mainly through two approaches: inferring the topic distribution of the held-out document (or query) and comparing it against the collection to retrieve similar documents, or using an approach driven by probabilistic querying. The last part of the work is devoted to the word sense disambiguation task.
This study examined how 18 high school students responded to and authored poetry using multimodal approaches. For responding, students identified keywords in poems and found online images to represent tone and negotiate meaning. When authoring, students composed extended metaphor poems using presentation software and Internet images. Data sources included student work, observations, and materials. Thematic analysis identified 3 networks: 1) Meaning-making is an active negotiation process over time and space; 2) Multimodal, non-verbal approaches increased engagement; 3) Technology allows expression of identity and agency in authorship. The study found multimodal approaches supported meaning-making and student agency more than constrained activities. It provides implications for practice incorporating images, design literacy, and avenues for student
Digital Humanities: A brief introduction to the field, by aelang
This document summarizes a presentation on digital humanities. It discusses working with both structured and unstructured data, challenges around data collection and representation, and examples of textual, spatial and network analysis projects. Resources mentioned include summer schools and tutorials for learning tools and methods in the field.
The document summarizes two research articles about international adoption. It discusses how the first article by Russell (2005) examines the relationship between students and teachers in ethnographic research and the challenges of gaining trust. The second article by Brown et al. (2005) studies factors that influence adjustment of internationally adopted children. The document then provides a critical evaluation of the research methods used in both articles.
This document outlines the key topics and objectives covered in Chapter 3 of a Principles of Communication course. It discusses different models of the communication process, including Shannon's model, the interactive model, the gatekeeper model, and the trans-active model. It also covers signals, systems, communication systems types, noise and its impact on communication, and a brief history of communication systems. Students are assigned homework to read Chapter 4 and complete exercises to help summarize their understanding.
This document discusses representing computing concepts like Turing machines, programming patterns, and virtual machines using semantic networks and RDF graphs. It describes how instructions, data structures, objects, and software patterns can be modeled as nodes and relationships in a graph. It also introduces RDF as a standardized data model for semantic networks and triplestores for efficiently storing and querying large semantic graphs.
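The idea of storing a semantic network as subject-predicate-object triples can be sketched in a few lines. This is a toy in-memory store, not a real RDF library or triplestore; the class name, the predicate names, and the sample facts are assumptions for illustration.

```python
class TripleStore:
    """A toy semantic network: a set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


# Hypothetical facts about a stack machine, modeled as nodes and labeled edges.
store = TripleStore()
store.add("push", "type", "Instruction")
store.add("pop", "type", "Instruction")
store.add("push", "operatesOn", "Stack")

print(store.match(predicate="type", obj="Instruction"))
print(store.match(subject="push"))
```

A production triplestore would add indexes over the three positions so that pattern queries do not scan every triple; the linear scan here only keeps the sketch short.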
Neural Text Embeddings for Information Retrieval (WSDM 2017), by Bhaskar Mitra
The document describes a tutorial on using neural networks for information retrieval. It discusses an agenda for the tutorial that includes fundamentals of IR, word embeddings, using word embeddings for IR, deep neural networks, and applications of neural networks to IR problems. It provides context on the increasing use of neural methods in IR applications and research.
Statistics is the collection, organization, analysis, interpretation and presentation of data. It deals with both descriptive statistics, which summarize and describe data, and inferential statistics, which are used to draw conclusions about populations based on sample data. The key aspects of statistics discussed in the document are:
- Populations and samples
- Parameters and statistics
- Quantitative and qualitative variables
- Levels of measurement including nominal, ordinal, interval and ratio scales
- Types of data including primary and secondary data
TrustArc Webinar - 2024 Global Privacy Survey, by TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Digital Marketing Trends in 2024 | Guide for Staying Ahead, by Wask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Generating privacy-protected synthetic data using Secludy and Milvus, by Zilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Dandelion Hashtable: beyond billion requests per second on a commodity server, by Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open addressing and adopts a fully featured and memory-aware closed-addressing design based on bounded cache-line chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing scheme. On a commodity server with a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
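For readers unfamiliar with the terminology, the sketch below shows closed addressing (separate chaining) with a fixed cap on chain length, in single-threaded Python. It is only a conceptual illustration and is not DLHT: it has no lock-free operations, cache-line layout, prefetching, or non-blocking resize, and the class name and parameters are assumptions.

```python
class BoundedChainTable:
    """Toy closed-addressing hashtable whose buckets hold at most max_chain entries."""

    def __init__(self, num_buckets=8, max_chain=4):
        self.max_chain = max_chain
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def put(self, key, value):
        # Assumes no more than max_chain keys share one full hash value;
        # otherwise resizing could never make room.
        while True:
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)   # overwrite an existing key
                    return
            if len(bucket) < self.max_chain:
                bucket.append((key, value))
                return
            self._resize()                     # chain is full: grow and retry

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]                  # the slot is reusable immediately
                return True
        return False

    def _resize(self):
        # Doubling splits every bucket in two, so no chain grows past the bound.
        items = [item for bucket in self.buckets for item in bucket]
        self.buckets = [[] for _ in range(len(self.buckets) * 2)]
        for k, v in items:
            self._bucket(k).append((k, v))


table = BoundedChainTable(num_buckets=2, max_chain=2)
for i in range(100):
    table.put(i, i * i)
print(table.get(12), table.delete(7), table.get(7))
```

Unlike open addressing, a delete here simply removes the entry from its chain, so no tombstones are needed; that property is what the slide deck's point about freeing slots on deletes refers to, although DLHT achieves it concurrently.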
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe, by Precisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Programming Foundation Models with DSPy - Meetup Slides, by Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
FREE A4 Cyber Security Awareness Posters - Social Engineering part 3, by Data Hops
Free A4 downloadable and printable cyber security and social engineering safety and security training posters. Promote security awareness in the home or workplace. Lock them out. From training providers datahops.com.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
2. Lecture 9
Knowledge Representation in Digital Humanities
Antonio Jiménez Mavillard
* Contents:
1. Why this lecture?
2. Discussion
3. Chapter 9
4. Assignment
5. Bibliography
3. Why this lecture?
* This lecture...
· teaches some NLP techniques that can
be applied to real problems
· presents another example of how DH brings
together various disciplines (Linguistics,
Artificial Intelligence, Information
Science, Statistics...) to solve problems
4. Last assignment discussion
* Time to...
· consolidate ideas and
concepts dealt with in the readings
5. Chapter 9
Natural Language Processing
in Python
1. Preliminary theory
2. Word tagging and categorization
3. Text classification
4. Text information extraction
6. Chapter 9
1 Preliminary theory
1.1 Linguistics
1.2 Statistics
1.3 Artificial Intelligence
7. Chapter 9
2 Word tagging and categorization
2.1 Tagger
2.2 Automatic tagging
2.3 n-gram tagging
8. Chapter 9
3 Text classification
3.1 Supervised classification
3.2 Document classification
9. Chapter 9
4 Text information extraction
4.1 Information extraction
4.2 Entity recognition
4.3 Relation extraction
13. Linguistics
* These word classes are also known as
parts of speech
* They arise from simple analysis of the
distribution of words in text
14. Statistics
* Frequency distribution
· Arrangement of the values that one or
more variables take in a sample
15. Statistics
* Frequency distribution
· Example: vocabulary in a text
+ how many times does each word appear in
the text?
+ it is a “distribution” since it tells us
how the total number of word tokens
in the text is distributed across the
vocabulary items
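A minimal sketch of the vocabulary example, using only the Python standard library rather than NLTK. The sample sentence and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample text; split() is a naive stand-in for a real tokenizer.
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()

# The frequency distribution maps each vocabulary item to its token count.
freq_dist = Counter(tokens)

# The counts add up to the total number of tokens, which is why it is
# called a "distribution" of the tokens across the vocabulary.
total_tokens = sum(freq_dist.values())
most_common_word, most_common_count = freq_dist.most_common(1)[0]
```

NLTK's FreqDist plays the same role for real corpora; the Counter here only shows the underlying idea.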
17. Statistics
* Conditional frequency distribution
· A collection of frequency distributions,
each one for a different condition
18. Statistics
* Conditional frequency distribution
· Example: vocabulary in a text
+ when the texts of a corpus are
divided into several categories we can
maintain separate frequency
distributions for each category
+ the condition will often be the
category of the text
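As a rough illustration, a conditional frequency distribution can be sketched as one counter per condition. The toy corpus and its category names below are invented for the example; only the standard library is used.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (category, text) pairs.
corpus = [
    ("news", "the council voted on the budget"),
    ("news", "the mayor spoke"),
    ("romance", "she smiled and he smiled"),
]

# condition (text category) -> frequency distribution of its words
cond_freq_dist = defaultdict(Counter)
for category, text in corpus:
    cond_freq_dist[category].update(text.split())
```

Looking up cond_freq_dist["news"] then gives the ordinary frequency distribution for that category alone.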
19. Statistics
* Conditional frequency distribution
· Example: vocabulary in a text
20. Artificial Intelligence
* Supervised vs unsupervised learning
· Supervised learning:
+ Possible results are known
+ Data is labeled
· Unsupervised learning:
+ Results are unknown
+ Data is clustered
21. Artificial Intelligence
* Decision trees
· Flowchart that selects labels for input
values
· Formed by decision and leaf nodes
· Decision nodes: check feature values
· Leaf nodes: assign labels
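The structure described above can be sketched with nested dictionaries. The features ("weather", "busy") and the labels are hypothetical, chosen only to show how decision nodes and leaf nodes interact.

```python
# Decision nodes are dicts naming a feature to check and one branch per value;
# leaf nodes are plain strings holding the label to assign.
tree = {
    "feature": "weather",
    "branches": {
        "rainy": "stay in",
        "sunny": {
            "feature": "busy",
            "branches": {True: "stay in", False: "go out"},
        },
    },
}

def classify(node, features):
    """Walk from the root to a leaf, following the branch for each feature value."""
    while isinstance(node, dict):  # still at a decision node
        node = node["branches"][features[node["feature"]]]
    return node  # reached a leaf: return its label
```

A learned tree (for example from nltk.DecisionTreeClassifier) has the same shape; here the tree is written by hand instead of being induced from data.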
22. Artificial Intelligence
* Decision trees
· Example: “Going out?”
23. Artificial Intelligence
* Naive Bayes classifiers
1. Begins by calculating the prior
probability of each label, determined by
checking the frequency of each label in
the training set
24. Artificial Intelligence
* Naive Bayes classifiers
2. The contribution from each feature is
combined with this prior probability, to
arrive at a likelihood estimate for each
label
3. The label whose likelihood estimate is
the highest is then assigned to the input
value
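The three steps can be sketched from scratch as follows. This is not NLTK's implementation: the add-one smoothing and the tiny training set are assumptions made to keep the example short.

```python
import math
from collections import Counter, defaultdict

def train(labeled_featuresets):
    """Count label frequencies and (feature, value) frequencies per label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # label -> Counter of (name, value)
    values = defaultdict(set)              # feature name -> values seen
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            feature_counts[label][(name, value)] += 1
            values[name].add(value)
    return label_counts, feature_counts, values

def classify(model, features):
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)           # step 1: prior probability
        for name, value in features.items():      # step 2: feature contributions
            seen = feature_counts[label][(name, value)]
            # add-one smoothing (an assumption) keeps unseen values from
            # driving the estimate to zero
            score += math.log((seen + 1) / (count + len(values[name])))
        scores[label] = score
    return max(scores, key=scores.get)            # step 3: highest estimate wins

# Hypothetical training data: last letter of a name -> gender label.
train_data = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "e"}, "female"),
    ({"last_letter": "k"}, "male"),
    ({"last_letter": "o"}, "male"),
]
model = train(train_data)
```

Working in log space avoids underflow when many features are combined; the ranking of the labels is unchanged.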
25. Artificial Intelligence
* Naive Bayes classifiers
· Example: document classification
Prior probability: the classifier starts out close to the “Automotive” label
26. References
“Frequency Distribution.” Wikipedia, the free encyclopedia 7 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
Mitchell, Tom M. “Chapter 3: Decision Tree Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.
Mitchell, Tom M. “Chapter 6: Bayesian Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.
“Part of Speech.” Wikipedia, the free encyclopedia 5 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
Steven Bird, Ewan Klein, and Edward Loper. “Conditional Frequency Distributions.” Natural Language Processing with
Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
Steven Bird, Ewan Klein, and Edward Loper. “Frequency Distributions.” Natural Language Processing with Python. O’Reilly
Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
27. Word tagging and classification
28. Tagger
* Processes a sequence of words, and
attaches a part of speech tag to each
word
* Procedure:
1. Tokenization
2. Tagging
29. Tagger
* Example 1:
In [1]: text = 'And now for something completely different'
In [2]: tokens = nltk.word_tokenize(text)
In [3]: nltk.pos_tag(tokens)
Out[3]:
[('And', 'CC'),
('now', 'RB'),
('for', 'IN'),
('something', 'NN'),
('completely', 'RB'),
('different', 'JJ')]
30. Tagger
* Example 2:
In [1]: text = 'They refuse to permit us to obtain the refuse permit'
In [2]: tokens = nltk.word_tokenize(text)
In [3]: nltk.pos_tag(tokens)
Out[3]:
[('They', 'PRP'),
('refuse', 'VBP'),
('to', 'TO'),
('permit', 'VB'),
('us', 'PRP'),
('to', 'TO'),
('obtain', 'VB'),
('the', 'DT'),
('refuse', 'NN'),
('permit', 'NN')]
31. Automatic tagging
* The tag of a word depends on the word
itself and its context within a sentence
* Working with data at the level of tagged
sentences rather than tagged words
32. Automatic tagging
* Loading data
· Example: tagged and non-tagged
sentences of “news” category
In [1]: from nltk.corpus import brown
In [2]: brown_tagged_sents = brown.tagged_sents(categories='news')
In [3]: brown_sents = brown.sents(categories='news')
33. Automatic tagging
* Default tagger
· Choose the most likely tag
In [4]: tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
In [4]: nltk.FreqDist(tags).max()
Out[4]: 'NN'
34. Automatic tagging
* Default tagger
· Assign the most likely tag to each token
In [5]: text = 'I do not like green eggs and ham, I do not like them Sam I am!'
In [6]: tokens = nltk.word_tokenize(text)
In [7]: default_tagger = nltk.DefaultTagger('NN')
35. Automatic tagging
* Default tagger
In [8]: default_tagger.tag(tokens)
Out[8]:
[('I', 'NN'),
('do', 'NN'),
('not', 'NN'),
('like', 'NN'),
('green', 'NN'),
('eggs', 'NN'),
('and', 'NN'),
('ham', 'NN'),
(',', 'NN'),
...]
37. Automatic tagging
* Default tagger
· This method performs rather poorly
· Unknown words will be nouns (as it
happens, most new words are nouns)
In [9]: default_tagger.evaluate(brown_tagged_sents)
Out[9]: 0.13089484257215028
38. Automatic tagging
* Regular expression tagger
· Assigns tags to tokens on the basis of
matching patterns
In [10]: patterns = [
    ...: (r'.*ing$', 'VBG'),               # gerunds
    ...: (r'.*ed$', 'VBD'),                # simple past
    ...: (r'.*es$', 'VBZ'),                # 3rd singular present
    ...: (r'.*ould$', 'MD'),               # modals
    ...: (r'.*\'s$', 'NN$'),               # possessive nouns
    ...: (r'.*s$', 'NNS'),                 # plural nouns
    ...: (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    ...: (r'.*', 'NN'),                    # nouns (default)
    ...: ]
In [11]: regexp_tagger = nltk.RegexpTagger(patterns)
39. Automatic tagging
* Regular expression tagger
In [12]: regexp_tagger.tag(brown_sents[3])
Out[12]:
[('``', 'NN'),
('Only', 'NN'),
('a', 'NN'),
('relative', 'NN'),
('handful', 'NN'),
('of', 'NN'),
('such', 'NN'),
('reports', 'NNS'),
('was', 'NNS'),
('received', 'VBD'),
...]
40. Automatic tagging
* Regular expression tagger
· This method is correct about a fifth of
the time
· The final regular expression «.*» is a
catch-all that tags everything as a noun
In [13]: regexp_tagger.evaluate(brown_tagged_sents)
Out[13]: 0.20326391789486245
41. Automatic tagging
* Lookup tagger
· Problem: a lot of high-frequency words
do not have the NN tag
42. Automatic tagging
* Lookup tagger
· Solution:
+ Find the hundred most frequent words
and store their most likely tag
+ Use this information as model for a
lookup tagger (NLTK UnigramTagger)
+ Tag everything else as a noun
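The solution above can be sketched without NLTK. The tagged sample and the cutoff k = 3 (instead of one hundred) are assumptions chosen so the example fits in a few lines.

```python
from collections import Counter, defaultdict

# Hypothetical tagged sample standing in for a tagged corpus.
tagged_words = [
    ("the", "AT"), ("dog", "NN"), ("saw", "VBD"), ("the", "AT"),
    ("cat", "NN"), ("and", "CC"), ("the", "AT"), ("saw", "NN"),
    ("saw", "VBD"), ("and", "CC"),
]

word_freq = Counter(word for word, _ in tagged_words)
tags_by_word = defaultdict(Counter)  # word -> counts of the tags it received
for word, tag in tagged_words:
    tags_by_word[word][tag] += 1

# Model: the most likely tag for each of the k most frequent words.
k = 3
model = {word: tags_by_word[word].most_common(1)[0][0]
         for word, _ in word_freq.most_common(k)}

def lookup_tag(tokens, default="NN"):
    """Tag known words from the model and everything else as a noun."""
    return [(tok, model.get(tok, default)) for tok in tokens]
```

For instance, lookup_tag(["the", "bird", "saw"]) tags "bird" with the default because it is not among the k most frequent words.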
43. Automatic tagging
* Lookup tagger
In [14]: fd = nltk.FreqDist(brown.words(categories='news'))
# cfd counts how many times each word occurs with each tag
In [15]: cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# in NLTK 2, keys() are sorted by decreasing frequency
In [16]: most_freq_words = fd.keys()[:100]
# for each word, take its most frequent tag
In [17]: likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
In [18]: baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
In [19]: baseline_tagger.evaluate(brown_tagged_sents)
Out[19]: 0.5817769556656125
44. Automatic tagging
* Lookup tagger
· The tagger accuracy increases as the model size grows
45. n-gram tagging
* Unigram tagger
· Like the lookup tagger, it assigns the most
likely tag to each token
· Unlike the default tagger, it must be
trained before it can be used
· Training: initialize the tagger with
tagged sentence data as a parameter
46. n-gram tagging
* Unigram tagger
· Separate the data in:
+ Training data (90%)
+ Testing data (10%)
47. n-gram tagging
* Unigram tagger
In [20]: size = int(len(brown_tagged_sents) * 0.9)
In [21]: train_sents = brown_tagged_sents[:size]
In [22]: test_sents = brown_tagged_sents[size:]
In [23]: unigram_tagger = nltk.UnigramTagger(train_sents)
48. n-gram tagging
* Unigram tagger
In [24]: unigram_tagger.tag(brown_sents[2007])
Out[24]:
[('Various', 'JJ'),
('of', 'IN'),
('the', 'AT'),
('apartments', 'NNS'),
('are', 'BER'),
('of', 'IN'),
('the', 'AT'),
('terrace', 'NN'),
('type', 'NN'),
(',', ','),
...
50. n-gram tagging
* An n-gram tagger picks the tag that is
most likely in the given context
* Unigram (1-gram) tagger
· Context:
+ current token in isolation
51. n-gram tagging
* Bigram (2-gram) tagger
· Context:
+ current token
+ POS tag of the 1 preceding token
* Trigram (3-gram) tagger
· Context:
+ current token
+ POS tag of the 2 preceding tokens
52. n-gram tagging
* n-gram tagger
· Context:
+ current token
+ POS tag of the n-1 preceding tokens
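A small sketch of how that context could be computed. The function name ngram_context and its parameters are hypothetical; the idea is only that the tagger looks at the current token plus the tags already assigned to the n-1 tokens before it.

```python
def ngram_context(n, tokens, index, history):
    """Return the context an n-gram tagger conditions on at position `index`.

    `history` holds the tags already assigned to tokens[:index].
    """
    start = max(0, index - (n - 1))
    previous_tags = tuple(history[start:index])  # at most n-1 preceding tags
    return previous_tags, tokens[index]

tokens = ["They", "refuse", "to", "permit"]
history = ["PRP", "VBP", "TO"]  # tags chosen so far for the first three tokens

bigram_ctx = ngram_context(2, tokens, 3, history)
trigram_ctx = ngram_context(3, tokens, 3, history)
```

With n = 1 the tuple of previous tags is empty, which matches the unigram case of looking at the current token in isolation.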
53. n-gram tagging
* n-gram tagger
· Example: bigram
In [22]: bigram_tagger = nltk.BigramTagger(train_sents)
In [23]: bigram_tagger.evaluate(train_sents)
Out[23]: 0.7853094861965731
In [24]: bigram_tagger.evaluate(test_sents)
Out[24]: 0.10216286255357321
54. n-gram tagging
* n-gram tagger
· Example: bigram
+ Problem: it manages to tag words in
sentences of training data but
- it is unable to tag a new word
(assigns None)
55. n-gram tagging
* n-gram tagger
· Example: bigram
+ Problem: it manages to tag words in
sentences of training data but
- it cannot tag the following word
(even if it is not new) because it
never saw it during training with
a None tag on the previous word
56. n-gram tagging
* n-gram tagger
· Example: bigram
+ Name: sparse data problem
+ Reason: the more specific the context,
the less likely it was seen in training,
and there is no tagger to fall back on
+ Solution: trade-off between accuracy
and coverage
57. n-gram tagging
* Combining taggers
· Trade-off between accuracy and
coverage
58. n-gram tagging
* Combining taggers
1. Try tagging with the n-gram tagger
2. If unable, try the (n-1)-gram tagger
3. If unable, try the (n-2)-gram tagger
...
59. n-gram tagging
* Combining taggers
...
n-2. If unable, try the trigram tagger
n-1. If unable, try the bigram tagger
n. If unable, try the unigram tagger
n+1. If unable, use the default tagger
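The backoff chain can be sketched in plain Python as below. The class names mirror NLTK's, but this is a simplified stand-in written for illustration, not NLTK's code, and the two training sentences are invented.

```python
from collections import Counter, defaultdict

class NgramTagger:
    """Minimal n-gram tagger with backoff (illustrative, not NLTK's class)."""

    def __init__(self, n, train_sents=(), backoff=None):
        self.n, self.backoff = n, backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            tags = [tag for _, tag in sent]
            for i, (word, tag) in enumerate(sent):
                counts[self.context(word, tags[:i])][tag] += 1
        # keep only the most likely tag for each context seen in training
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def context(self, word, history):
        prev = history[-(self.n - 1):] if self.n > 1 else []
        return (tuple(prev), word)

    def choose(self, word, history):
        tag = self.table.get(self.context(word, history))
        if tag is None and self.backoff is not None:
            tag = self.backoff.choose(word, history)  # fall back one level
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.choose(word, history))
        return list(zip(words, history))

class DefaultTagger(NgramTagger):
    """End of the chain: always answers with the same tag."""

    def __init__(self, tag):
        self.default = tag

    def choose(self, word, history):
        return self.default

# Hypothetical training data.
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
t0 = DefaultTagger("NN")
t1 = NgramTagger(1, train, backoff=t0)
t2 = NgramTagger(2, train, backoff=t1)
with_backoff = t2.tag(["the", "bird", "barks"])
without_backoff = NgramTagger(2, train).tag(["the", "bird", "barks"])
```

Without backoff, the unseen word "bird" gets None, and so does the known word "barks" that follows it, because the context (None, "barks") never occurred in training. With the chain, the default tagger supplies NN for "bird" and the bigram tagger can carry on.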
60. n-gram tagging
* Combining taggers
· Example:
In [25]: t0 = nltk.DefaultTagger('NN')
In [26]: t1 = nltk.UnigramTagger(train_sents, backoff=t0)
In [27]: t2 = nltk.BigramTagger(train_sents, backoff=t1)
In [28]: t2.evaluate(test_sents)
Out[28]: 0.8447124489185687
61. n-gram tagging
* Exercise 1
· Build a tagger by combining
a trigram, a bigram, a unigram
and a regular expression tagger (in the
default case)
· Use it to tag a sentence
· Evaluate its performance
62. n-gram tagging
* Exercise 1 (solution)
import nltk
import re
from nltk.corpus import brown
64. n-gram tagging
* Exercise 1 (solution)
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
65. n-gram tagging
* Exercise 1 (solution)
# patterns: the list of regular expressions defined for the regular expression tagger
t0 = nltk.RegexpTagger(patterns)
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
66. n-gram tagging
* Exercise 1 (solution)
brown_sents = brown.sents(categories='news')
sent = brown_sents[2007]
t3.tag(sent)
t3.evaluate(test_sents)
67. References
Steven Bird, Ewan Klein, and Edward Loper. “Chapter 5: Categorizing and Tagging Words.” Natural Language Processing
with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
70. Supervised classification
* Process
1. Features
2. Encode
3. Feature extractor
71. Supervised classification
* The process involves important skills:
· Abstraction
· Modelling
· Programming
72. Supervised classification
* Features
· Abstraction: decide the relevant
information of the data set
* Encode
· Modelling: choose a sound representation
(data structure)
73. Supervised classification
* Feature extractor
· Programming: program a function that
extracts the features in the chosen
representation
74. Supervised classification
* Applications:
· Deciding the lexical category of words:
POS tagging
· Deciding the topic of a document from
a list of topics (“sports”, “technology”,
etc.): document classification
75. Document classification
* Example 1: gender identification
(solved by Naive Bayesian Classifier)
· Evidence
+ Names ending in a, e, i => female
+ Names ending in k, o, r, s, t => male
· Features: last letter
· Encode: dictionary
· Feature extractor: “name => {last letter}”
76. Document classification
* Example 1: gender identification
· Data
In [1]: from nltk.corpus import names
In [2]: import random
In [3]: all_names = (
   ...:     [(name, 'male') for name in names.words('male.txt')] +
   ...:     [(name, 'female') for name in names.words('female.txt')])
In [4]: random.shuffle(all_names)
77. Document classification
* Example 1: gender identification
· Feature extractor
In [5]: def gender_features(word):
   ...:     return {'last_letter': word[-1]}
# Example
In [6]: gender_features('Shrek')
Out[6]: {'last_letter': 'k'}
78. Document classification
* Example 1: gender identification
· Classification
In [7]: featuresets = [(gender_features(n), g) for (n,g) in all_names]
In [8]: train_set = featuresets[500:]
In [9]: test_set = featuresets[:500]
In [10]: classifier = nltk.NaiveBayesClassifier.train(train_set)
In [11]: nltk.classify.accuracy(classifier, test_set)
Out[11]: 0.778
79. Document classification
* Example 2: POS tagging
(solved by Decision Tree Classifier)
· Results: POS tag
· Features: Suffixes
80. Document classification
* Example 2: POS tagging
· Data
In [1]: from nltk.corpus import brown
In [2]: suffix_fdist = nltk.FreqDist()
In [3]: for word in brown.words():
   ...:     word = word.lower()
   ...:     suffix_fdist.inc(word[-1:])
   ...:     suffix_fdist.inc(word[-2:])
   ...:     suffix_fdist.inc(word[-3:])
In [4]: common_suffixes = suffix_fdist.keys()[:100]
81. Document classification
* Example 2: POS tagging
· Feature extractor
In [5]: def pos_features(word):
   ...:     features = {}
   ...:     for suffix in common_suffixes:
   ...:         features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
   ...:     return features
82. Document classification
* Example 2: POS tagging
· Classification
In [6]: tagged_words = brown.tagged_words(categories='news')
In [7]: featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
In [8]: size = int(len(featuresets) * 0.1)
In [9]: train_set, test_set = featuresets[size:], featuresets[:size]
83. Document classification
* Example 2: POS tagging
· Classification
In [10]: classifier = nltk.DecisionTreeClassifier.train(train_set)
In [11]: classifier.classify(pos_features('cats'))
Out[11]: 'NNS'
In [12]: nltk.classify.accuracy(classifier, test_set)
Out[12]: 0.62705121829935351
84. Document classification
* Example 3: document classification
(solved by Naive Bayesian Classifier)
· Corpus: Movie Reviews Corpus
· Results: Positive or negative review
· Features: Indicate whether or not the
2000 most frequent words are present in
each review
85. Document classification
* Example 3: document classification
· Data
In [1]: from nltk.corpus import movie_reviews
In [2]: import random
In [3]: documents = [(list(movie_reviews.words(fileid)), category)
   ...:              for category in movie_reviews.categories()
   ...:              for fileid in movie_reviews.fileids(category)]
In [4]: random.shuffle(documents)
86. Document classification
* Example 3: document classification
· Feature extractor
In [5]: all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
In [6]: word_features = all_words.keys()[:2000]
In [7]: def document_features(document):
   ...:     document_words = set(document)
   ...:     features = {}
   ...:     for word in word_features:
   ...:         features['contains(%s)' % word] = (word in document_words)
   ...:     return features
87. Document classification
* Example 3: document classification
· Classification
In [7]: featuresets = [(document_features(d), c) for (d,c) in documents]
In [8]: train_set = featuresets[100:]
In [9]: test_set = featuresets[:100]
In [10]: classifier = nltk.NaiveBayesClassifier.train(train_set)
In [11]: nltk.classify.accuracy(classifier, test_set)
Out[11]: 0.84
88. Document classification
* Example 3: document classification
· 5 most informative features
In [12]: classifier.show_most_informative_features(5)
Most Informative Features
contains(outstanding) = True pos : neg = 10.7 : 1.0
contains(mulan) = True pos : neg = 9.0 : 1.0
contains(seagal) = True neg : pos = 8.2 : 1.0
contains(wonderfully) = True pos : neg = 6.4 : 1.0
contains(damon) = True pos : neg = 6.4 : 1.0
89. Document classification
* Exercise 2
· “Reuters-21578 benchmark corpus /
ApteMod version” is a collection of 10,788
documents from the Reuters financial
newswire service
90. Document classification
* Exercise 2
· Train a naive Bayes classifier with
ApteMod corpus
· Use it to classify a document
· Evaluate its performance
91. Document classification
* Exercise 2 (solution)
import nltk
import random
from nltk.corpus import reuters
92. Document classification
* Exercise 2 (solution)
documents = [(list(reuters.words(fileid)), category)
for category in reuters.categories()
for fileid in reuters.fileids(category)]
random.shuffle(documents)
93. Document classification
* Exercise 2 (solution)
all_words = nltk.FreqDist(w.lower() for w in reuters.words())
word_features = all_words.keys()[:2000]
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
94. Document classification
* Exercise 2 (solution)
featuresets = [(document_features(d), c) for (d,c) in documents]
size = int(len(featuresets) * 0.9)
# train on the first 90% and hold out the last 10% for testing
train_set = featuresets[:size]
test_set = featuresets[size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
95. Document classification
* Exercise 2 (solution)
document = reuters.words('test/14826')
classifier.classify(document_features(document))
nltk.classify.accuracy(classifier, test_set)
96. References
Steven Bird, Ewan Klein, and Edward Loper. “Chapter 6: Learning to Classify Text.” Natural Language Processing with
Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
98. Information extraction
* Definition:
· Convert the unstructured data of natural
language text into structured, tabular data
· Get information from the tabulated data
100. Entity recognition
* Chunking
· Segments and labels multitoken sequences
· Selects a subset of the tokens (chunks)
· Chunks do not overlap in the source text
101. Entity recognition
* Chunking
· Entities are mostly nouns
· Let us search for the noun phrase chunks
(NP-chunks)
· Grammar: set of rules that indicate how
sentences should be chunked
102. Entity recognition
* NP-chunker
In [1]: import nltk, re, pprint
In [2]: grammar = r"""
  # chunk optional determiner/possessive, adjectives and nouns
  NP: {<DT|PP\$>?<JJ>*<NN>}
  # chunk sequences of proper nouns
      {<NNP>+}
"""
In [3]: cp = nltk.RegexpParser(grammar)
103. Entity recognition
* NP-chunker
In [4]: sentence1 = [("the", "DT"), ("little", "JJ"),
("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at",
"IN"), ("the", "DT"), ("cat", "NN")]
In [5]: sentence2 = [("Rapunzel", "NNP"), ("let", "VBD"),
("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden",
"JJ"), ("hair", "NN")]
In [6]: result1 = cp.parse(sentence1)
In [7]: result2 = cp.parse(sentence2)
104. Entity recognition
* NP-chunker
In [8]: print(result1)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
In [9]: print(result2)
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
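The idea behind tag patterns can be sketched with the standard re module: encode the tag sequence as a string and match the rules against it. This is a simplified illustration, not how NLTK's RegexpParser is implemented; the function name np_chunk is an assumption, and the two rules are merged into one alternation, which gives the same chunks here because NN and NNP tags never compete.

```python
import re

def np_chunk(tagged_sentence):
    """Return non-overlapping NP chunks as lists of (word, tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tags = "".join("<%s>" % tag for _, tag in tagged_sentence)
    # Rule 1: optional determiner/possessive, any adjectives, one noun.
    # Rule 2: one or more proper nouns.
    rule = re.compile(r"(?:<(?:DT|PP\$)>)?(?:<JJ>)*<NN>|(?:<NNP>)+")
    chunks = []
    for match in rule.finditer(tags):
        # Convert character offsets back to token indices by counting "<".
        start = tags.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append(tagged_sentence[start:end])
    return chunks

sentence1 = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
sentence2 = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
             ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
chunks1 = np_chunk(sentence1)
chunks2 = np_chunk(sentence2)
```

Note that the dollar sign in PP$ has to be escaped, since an unescaped $ is a regular expression anchor.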
106. Entity recognition
* Chunking text corpora
In [10]: from nltk.corpus import brown
nps = []  # noun-phrase subtrees found so far
In [11]: for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            nps.append(subtree)
In [12]: for np in nps[:10]:
    print(np)
(NP investigation/NN)
(NP widespread/JJ interest/NN)
(NP this/DT city/NN)
(NP new/JJ multi-million-dollar/JJ airport/NN)
(NP his/PP$ wife/NN)
(NP His/PP$ political/JJ career/NN)
...
107. Entity recognition
* Named entities
· Are definite noun phrases
· Refer to specific types of individuals, such as organizations, persons and dates
108. Entity recognition
* Named entity recognition
· A task well suited to the classifier-based approach used for noun phrase chunking
109. Entity recognition
* Named entity recognition
· Example:
In [1]: sent = nltk.corpus.treebank.tagged_sents()[22]
In [2]: print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)
110. Relation extraction
* Extraction of the relations that exist between the recognized named entities
* Approach: initially look for all triples of the form (X, α, Y)
· X and Y are named entities of specific types
· α is the relation (the string of words between X and Y)
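The triple search can be sketched over a sentence whose entities are already marked. The input format (tuples for entities, plain strings for the words in between) and the function name extract_triples are assumptions for illustration, not NLTK's data structures.

```python
import re

# Hypothetical pre-chunked text: entities are (type, text) tuples,
# the words between them are plain strings.
chunked = [
    ("ORG", "WHYY"), "in", ("LOC", "Philadelphia"), "and",
    ("ORG", "Open Text"), ", based in", ("LOC", "Waterloo"),
]

def extract_triples(items, x_type, y_type, alpha_pattern):
    """Return (X, alpha, Y) for each X entity followed by filler text and a Y entity."""
    triples = []
    for i in range(len(items) - 2):
        x, alpha, y = items[i], items[i + 1], items[i + 2]
        if (isinstance(x, tuple) and isinstance(alpha, str)
                and isinstance(y, tuple)
                and x[0] == x_type and y[0] == y_type
                and alpha_pattern.match(alpha)):
            triples.append((x[1], alpha, y[1]))
    return triples

IN = re.compile(r".*\bin\b")
triples = extract_triples(chunked, "ORG", "LOC", IN)
```

Restricting the entity types and filtering α with a pattern is what narrows "all triples" down to one specific relation.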
111. Relation extraction
* Example:
In [1]: import nltk
In [2]: import re
In [3]: IN = re.compile(r'.*\bin\b(?!\b.+ing)')
In [4]: for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                     corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
112. Relation extraction
* Example:
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
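The pattern used to filter these results can be tried on its own. It accepts any filler text that contains the word "in", and its negative lookahead rejects fillers where "in" is later followed by a word ending in "ing". The example strings below are illustrative.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    "in",
    ", based in",
    "firm in",
    "success in supervising the transition of",  # "in" followed by a gerund
    "near",                                      # no "in" at all
]

# Keep only the fillers that the pattern accepts.
kept = [filler for filler in fillers if IN.match(filler)]
```

The word boundaries (\b) keep the pattern from matching "in" inside longer words such as "supervising".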
113. Relation extraction
* Exercise 3
· From the ieer corpus, extract all the relations of type “people were born in a location”
115. Relation extraction
* Exercise 3 (solution)
import nltk
import re

BORN = re.compile(r'.*\bborn\b')
# Iterate over every document file in the ieer corpus
for f in nltk.corpus.ieer.fileids():
    for doc in nltk.corpus.ieer.parsed_docs(f):
        for rel in nltk.sem.extract_rels('PER', 'LOC', doc,
                                         corpus='ieer', pattern=BORN):
            print(nltk.sem.rtuple(rel))
116. References
Bird, Steven, Ewan Klein, and Edward Loper. “Chapter 7: Extracting Information from Text.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
117. Assignment
* Assignment 9
· Readings
+ Supervised classification (Natural
Language Processing with Python)
+ Decision Tree Learning (Machine
Learning)
118. References
Mitchell, Tom M. “Chapter 3: Decision Tree Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.
Bird, Steven, Ewan Klein, and Edward Loper. “Chapter 6: Learning to Classify Text - Supervised Classification.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
119. Bibliography
“Frequency Distribution.” Wikipedia, the free encyclopedia 7 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
Mitchell, Tom M. Machine Learning. New York: McGraw-Hill, 1997. Print.
“Part of Speech.” Wikipedia, the free encyclopedia 5 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.
Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.