This document summarizes a presentation on different approaches to solving natural language processing problems. It discusses rule-based, counting-based, and deep learning-based approaches. For each approach, it provides examples including English tokenization, error correction, dialogue systems, and spam filtering. The presentation covers key stages of an NLP project including domain analysis, data preparation and collection, iterating on solutions, and gathering feedback. It emphasizes the importance of data and discusses techniques for data acquisition such as scraping, annotation, and crowdsourcing.
"Ning's ""Your Own Social Network"" application is 160,000 lines of PHP that powers hundreds of thousands of social networks, each different than the others. This talk discusses the static and dynamic analysis techniques that we use at Ning to understand and optimize our platform, including the PHP tokenizer, regular expressions, the vld and xdebug extensions, and the PHP DTrace provider.
"
"Ning's ""Your Own Social Network"" application is 160,000 lines of PHP that powers hundreds of thousands of social networks, each different than the others. This talk discusses the static and dynamic analysis techniques that we use at Ning to understand and optimize our platform, including the PHP tokenizer, regular expressions, the vld and xdebug extensions, and the PHP DTrace provider.
"
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Steven Francia
Object Oriented (OO) programming has dominated software engineering for the last two decades. The paradigm built on powerful concepts such as Encapsulation, Inheritance, and Polymoprhism has been internalized by the majority of software engineers. Although Go is not OO in the strict sense, we can continue to leverage the skills we’ve honed as OO engineers to come up with simple and solid designs.
Gopher Steve Francia, Author of
[Hugo](http://hugo.spf13.com), [Cobra](http://github.com/spf13/cobra), and many
other popular Go packages makes these difficult concepts accessible for everyone.
If you’re a OO programmer, especially one with a background with dynamic languages and are curious about go then this talk is for you. We will cover everything you need to know to leverage your existing skills and quickly start coding in go including:
How to use our Object Oriented programming fundamentals in go
Static and pseudo dynamic typing in go
Building fluent interfaces in go
Using go interfaces and duck typing to simplify architecture
Common mistakes made by those coming to go from other OO languages (Ruby, Python, Javascript, etc.),
Principles of good design in go.
** Python Certification Training: https://www.edureka.co/python **
This Edureka PPT on 'Introduction To Python' will help you establish a strong hold on all the fundamentals in the Python programming language. Below are the topics covered in this PPT:
Introduction To Python
Keywords And Identifiers
Variables And Data Types
Operators
Loops In Python
Functions
Classes And Objects
OOPS Concepts
File Handling
YouTube Video: https://youtu.be/uYjRzbP5aZs
Python Tutorial Playlist: https://goo.gl/WsBpKe
Blog Series: http://bit.ly/2sqmP4s
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Augmenting and structuring user queries to support efficient free-form code s...Dongsun Kim
Presented at the 40th International Conference on Software Engineering, May 27 - 3 June 2018, Gothenburg, Sweden (ICSE 2018) as Journal first publication.
A Recovering Java Developer Learns to GoMatt Stine
As presented at OSCON 2014.
The Go programming language has emerged as a favorite tool of DevOps and cloud practitioners alike. In many ways, Go is more famous for what it doesn’t include than what it does, and co-author Rob Pike has said that Go represents a “less is more” approach to language design.
The Cloud Foundry engineering teams have steadily increased their use of Go for building components, starting with the Router, and progressing through Loggregator, the CLI, and more recently the Health Manager. As a “recovering-Java-developer-turned-DevOps-junkie” focused on helping our customers and community succeed with Cloud Foundry, it became very clear to me that I needed to add Go to my knowledge portfolio.
This talk will introduce Go and its distinctives to Java developers looking to add Go to their toolkits. We’ll cover Go vs. Java in terms of:
* type systems
* modularity
* programming idioms
* object-oriented constructs
* concurrency
Another day, another buzzword in the world of software development! ‘Microservices’ is a new approach to structuring server-side software. But is it really new? In this talk I’ll walk you through the birth and ‘raison d’etre’ of microservices and tell about pro’s and con’s of the approach.
Having laid the foundation, we will take a look at best-practices and patterns for building micro service architectures and combine this with a tour of current technologies and development tools.
Finally, I will take a quick look at the future and discuss some of the remaining challenges. All parts of the presentation will be accompanied by structural examples based on a real ecommerse system.
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Steven Francia
Object Oriented (OO) programming has dominated software engineering for the last two decades. The paradigm built on powerful concepts such as Encapsulation, Inheritance, and Polymoprhism has been internalized by the majority of software engineers. Although Go is not OO in the strict sense, we can continue to leverage the skills we’ve honed as OO engineers to come up with simple and solid designs.
Gopher Steve Francia, Author of
[Hugo](http://hugo.spf13.com), [Cobra](http://github.com/spf13/cobra), and many
other popular Go packages makes these difficult concepts accessible for everyone.
If you’re a OO programmer, especially one with a background with dynamic languages and are curious about go then this talk is for you. We will cover everything you need to know to leverage your existing skills and quickly start coding in go including:
How to use our Object Oriented programming fundamentals in go
Static and pseudo dynamic typing in go
Building fluent interfaces in go
Using go interfaces and duck typing to simplify architecture
Common mistakes made by those coming to go from other OO languages (Ruby, Python, Javascript, etc.),
Principles of good design in go.
** Python Certification Training: https://www.edureka.co/python **
This Edureka PPT on 'Introduction To Python' will help you establish a strong hold on all the fundamentals in the Python programming language. Below are the topics covered in this PPT:
Introduction To Python
Keywords And Identifiers
Variables And Data Types
Operators
Loops In Python
Functions
Classes And Objects
OOPS Concepts
File Handling
YouTube Video: https://youtu.be/uYjRzbP5aZs
Python Tutorial Playlist: https://goo.gl/WsBpKe
Blog Series: http://bit.ly/2sqmP4s
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Augmenting and structuring user queries to support efficient free-form code s...Dongsun Kim
Presented at the 40th International Conference on Software Engineering, May 27 - 3 June 2018, Gothenburg, Sweden (ICSE 2018) as Journal first publication.
A Recovering Java Developer Learns to GoMatt Stine
As presented at OSCON 2014.
The Go programming language has emerged as a favorite tool of DevOps and cloud practitioners alike. In many ways, Go is more famous for what it doesn’t include than what it does, and co-author Rob Pike has said that Go represents a “less is more” approach to language design.
The Cloud Foundry engineering teams have steadily increased their use of Go for building components, starting with the Router, and progressing through Loggregator, the CLI, and more recently the Health Manager. As a “recovering-Java-developer-turned-DevOps-junkie” focused on helping our customers and community succeed with Cloud Foundry, it became very clear to me that I needed to add Go to my knowledge portfolio.
This talk will introduce Go and its distinctives to Java developers looking to add Go to their toolkits. We’ll cover Go vs. Java in terms of:
* type systems
* modularity
* programming idioms
* object-oriented constructs
* concurrency
Another day, another buzzword in the world of software development! ‘Microservices’ is a new approach to structuring server-side software. But is it really new? In this talk I’ll walk you through the birth and ‘raison d’etre’ of microservices and tell about pro’s and con’s of the approach.
Having laid the foundation, we will take a look at best-practices and patterns for building micro service architectures and combine this with a tour of current technologies and development tools.
Finally, I will take a quick look at the future and discuss some of the remaining challenges. All parts of the presentation will be accompanied by structural examples based on a real ecommerse system.
Interactive Questions and Answers - London Information Retrieval MeetupSease
Answers to some questions about Natural Language Search, Language Modelling (Google Bert, OpenAI GPT-3), Neural Search and Learning to Rank made during our London Information Retrieval Meetup (December).
Redis is an open source advanced key-value store, created by antirez. Here is a quick overview of this awesome NoSql DB.
Like a swiss knife, Redis will help you by many ways : LRU cache, high scores, UID generator, queues, social feeds, autocomplete …
How sitecore depends on mongo db for scalability and performance, and what it...Antonios Giannopoulos
Percona Live 2017 - How sitecore depends on mongo db for scalability and performance, and what it can teach you by Antonios Giannopoulos and Grant Killian
This presentation discusses the most neglected quality axis : code documentation. See good and bad examples, best practices on documenting code and why you should try not to ignore it :)
Presentation date: 2014-11-28
Place: Thessaloniki Java Meetup Group (SKG)
It tells about how dom really used in javascript & html.And it tells about its levels and its w3c standards. And some Dom example programs with source code and screenshots.
Data Con LA 2022 - Open Source Large Knowledge Graph FactoryData Con LA
Russell Jurney, Founder, Graphlet AI
The knowledge graph and graph database markets have long asked themselves: why aren't we larger? The vision of the semantic web was that many datasets could be cross-referenced between independent graph databases to map all knowledge on the web from myriad disparate datasets into one or more authoritative ontologies which could be accessed by writing SPARQL queries to work across knowledge graphs. The reality of dirty data made this vision impossible. Most time is spent cleaning data which isn't in the format you need to solve your business problems. Multiple datasets in different formats each have quirks. Deduplicate data using entity resolution is an unsolved problem for large graphs. Once you merge duplicate nodes and edges, you rarely have the edge types you need to make a problem easy to solve. It turns out the most likely type of edge in a knowledge graph that solves your problem easily is defined by the output of a Python program using the machine learning. For large graphs, this program needs to run on a horizontally scalable platform PySpark and extend rather than be isolated inside a graph databases. The quality of developer's experience is critical. In this talk I will review an approach to an Open Source Large Knowledge Graph Factory built on top of Spark that follows the ingest / build / refine / public / query model that open source big data is based upon.
Cloudera Data Science Challenge 3 Solution by Doug NeedhamDoug Needham
This is my solution for the Cloudera Data Science Challenge 3. I use Spark MLLib for problem1, and Spark GraphX for problem3. Problem2 is "simple" streaming map-reduce.
What one needs to know to work in Natural Language Processing field and the aspects of developing an NLP project using the example of a system to identify text language
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
4. NLP Project Stages
1) Domain analysis
2) Data preparation
3) Iterating the solution
4) Productizing the result
5) Gathering feedback
and returning to stages 1/2/3
5. 1. Domain Analysis
1) Problem formulation
2) Existing solutions
3) Possible datasets
4) Identifying Challenges
5) Metrics, baseline & SOTA
6) Selecting possible
approaches and devising
a plan of attack
16. Test Data
* Twitter evaluation dataset
* Datasets with fewer langs
* Extract from Wikipedia
+ smoke test
17. Challenges
* Linguistic challenges
- know next to nothing about
90% of the languages
- languages and scripts
http://www.omniglot.com/writing/langalph.htm
- language distributions
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites
- word segmentation?
* Engineering challenges
19. NLP Evaluation
Intrinsic: use a gold-
standard dataset and some
metric to measure the system’s
performance directly.
(in-domain & out-of-domain)
Extrinsic: use a system as
part of an upstream task(s)
and measure performance change
there.
27. Baseline
Quality that can be achieved
using some reasonable “basic”
approach
(on the same data).
2 ways to measure improvement:
* quality improvement
* error reduction
32. The Unreasonable
Effectiveness of Data
“Data is ten times more powerful than algorithms.”
-- Peter Norvig, http://youtu.be/yvDCzhbjYWs
https://twitter.com/shivon/status/864889085697024000
33. Uses of Data in NLP
* understanding the problem
* statistical analysis
* connecting separate domains
* evaluation data set
* training data set
* real-time feedback
* marketing/PR
* external competitions
* what else?
43. Raw Text Pros & Cons
+ Can collect stats
=> build LMs, word vectors…
+ Can have a variety of domains
- … but hard to control the
distribution of domains
- Web artifacts
- Web noise
- Huge processing effort
44. Creating Own Data
* Scraping
* Annotation
* Crowdsourcing
* Getting from users
* Generating
45. Data Scraping
* Full-page web-scraping
(see: readability)
* Custom web-scraping
(see: scrapy, crawlik)
* Extracting from non-HTML
formats (.pdf, .doc…)
* Getting from API
(see: twitter, webhose)
46. Scraping Pros & Cons
+ possible to get needed domain
+ may be the only “fast” way
+ it’s the “google” way
- licensing issues
- getting volume requires time
- need to deal with antirobot
protection
- other engineering challenges
47. Corpus Annotation
* Initial data gathering
* Annotation guidelines
* Annotator selection
* Annotation tools
* Processing of raw annotations
* Result analysis & feedback
49. Crowdsourcing
- Like annotation but harder
requires a custom solution
With gamification
+ If successful, can get
volumes and diversity
50. User Feedback Data
+ Your product, your domain
+ Real-time, allows adaptation
+ Ties into customer support
- Chicken & egg problem
- Requires anonymization
Approach: “lean startup”
51. Generating Data
+ Potentially unlimited volume
+ Control of the parameters
- Artificial (if you can code
a generation function,
why not code a reverse one?)
52. Data Best Practices
* Proper ML dataset handling
* Domain adequacy, diversity
* Inter-annotator agreement,
reasonable baselines
* Error analysis
* Real-time tracking
54. 3. Possible Approaches
1) Symbolical/rule-based
“The (Weak) Empire of Reason”
2) Counting-based
“The Empiricist Invasion”
3) Deep Learning-based
“The Revenge of the Spherical Cows”
http://www.earningmyturns.org/2017/06/
a-computational-linguistic-farce-in.ht
ml
56. English Tokenization
(small problem)
* Simplest regex:
[^s]+
* More advanced regex:
w+|[!"#$%&'*+,./:;<=>?@^`~…() {}[|] «»“”‘’-]⟨⟩ ‒–—―
* Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[w@](?:[w'’`@-][w']|[w'][w@'’`-])*[w']?
|["#$%&*+,/:;<=>@^`~…() {}[|] «»“”‘’']⟨⟩ ‒–—―
|[.!?]+
|-+
Add post-processing:
* concatenate abbreviations and decimals
* split contractions with regexes
- 2-character abbreviations regex:
i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$
- 3-character abbreviations regex:
(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokeni
ze_uk.py
57. Error Correction
(inherently rule-based)
LanguageTool:<category name=" " id="STYLE" type="style">Стиль
<rulegroup id="BILSH_WITH_ADJ" name=" ">Більш з прикметниками
<rule>
<pattern>
<token> </token>більш
<token postag_regexp="yes" postag="ad[jv]:.*compc.*">
<exception> </exception>менш
</token>
</pattern>
<message> « » </message>Після більш не може стояти вища форма прикметника
<!-- TODO: when we can bind comparative forms togher again
<suggestion><match no="2"/></suggestion>
<suggestion><match no="1"/> <match no="2" postag_regexp="yes"
postag="(.*):compc(.*)" postag_replace="$1:compb$2"/></suggestion>
<example correction=" | "><marker>Світліший Більш світлий Більш
</marker>.</example>світліший
-->
<example correction=""><marker> </marker>.</example>Більш світліший
<example> </example>все закінчилось більш менш
</rule>
...
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modu
les/uk/src/main/resources/org/languagetool/rules/uk/
58. Dialog Systems
(inherently scripted)
ELIZA:
(defparameter *eliza-rules*
'((((?* ?x) hello (?* ?y))
(How do you do. Please state your problem.))
(((?* ?x) I want (?* ?y))
(What would it mean if you got ?y)
(Why do you want ?y) (Suppose you got ?y soon))
(((?* ?x) if (?* ?y))
(Do you really think its likely that ?y) (Do you wish that ?y)
(What do you think about ?y) (Really-- if ?y))
(((?* ?x) no (?* ?y))
(Why not?) (You are being a bit negative)
(Are you saying "NO" just to be negative?))
(((?* ?x) I was (?* ?y))
(Were you really?) (Perhaps I already knew you were ?y)
(Why do you tell me you were ?y now?))
(((?* ?x) I feel (?* ?y))
(Do you often feel ?y ?))
(((?* ?x) I felt (?* ?y))
(What other feelings do you have?))))
https://norvig.com/paip/eliza1.lisp
59.
60. Fighting Spam
(just a bad idea :)
SpamAssasin:
# bodyn
# example: postacie greet! Chcesz porozmawiac Co prawda tu rzadko bywam,
zwykle pisze tu - http://hanna.3xa.info
uri __LOCAL_LINK_INFO /http://w{3,8}.www.info/
header __LOCAL_FROM_O2 From =~ /@o2.pl/
meta LOCAL_LINK_INFO __LOCAL_LINK_INFO && __LOCAL_FROM_O2
describe LOCAL_LINK_INFO Link postaci http://cos.cos.info i z o2.pl
score LOCAL_LINK_INFO 5
https://wiki.apache.org/spamassassin/CustomRulesets
66. Rules Pros & Cons
+ compact & fast
+ full control
+ arbitrary recall
+ iterative
+ best interpetability
+ perfectly accommodate domain experts
- precision ceiling
- non-optimal weights
- require a lot of human labor
- hard to interpret/calibrate score
71. BiLSTM+attention as SOTA
“Basically, if you want to do an
NLP task, no matter what it is,
what you should do is throw your
data into a Bi-directional
long-short term memory network,
and augment its information flow
with the attention mechanism.”
– Chris Manning
https://twitter.com/mayurbhangale
/status/988332845708886016
73. Bias
* Domain bias - systematic statistical
differences between “domains”, usually
in the form of vocab, class priors and
conditional probabilities
* Dataset bias - sampling bias in a
given dataset, giving rise to (often
unintentional) content bias
* Model bias - bias in the output of a
model wrt gold standard
* Social bias - model bias
towards/against particular socio-
demographic groups of users
74. Counteracting Bias
* Rule-based post-processing
* Mix multiple datasets
+ special feature-selection
* Train multiple models on
different datasets and
create an ensemble
* Use domain adaptation
techniques
83. Adaptation
(manual vs automatic)
* for RB systems: add more
rules :D
* for ML systems: online
learning algorithms
* for DL systems: recent
“universal” models
* Lambda architecture
84. The Pipeline
“Getting a data pipeline into production for the
first time entails dealing with many different
moving parts in these cases, spend more time—
thinking about the pipeline itself, and how it
will enable experimentation, and less about its
first algorithm.”
https://medium.com/@neal_lathia/five-lessons-from
-building-machine-learning-systems-d703162846ad