LT-Accelerate 2016: Between Custom and Off-the-shelf NLP

This presentation explores the changing landscape of NLP software solutions and their application to the popular task of sentiment classification.

Between Custom and Off-the-shelf NLP
Yves Peirsman

NLP LANDSCAPE
Cloud APIs: train your own models
Libraries: pre-trained models
Cloud APIs: pre-trained models
Libraries: train your own models

SENTIMENT ANALYSIS
“The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.”

SENTIMENT ANALYSIS: APPLICATIONS
http://tarsier.monkeylearn.com
http://varianceexplained.org/r/trump-tweets/
http://www.stockfluence.com/

OVERVIEW
Off-the-shelf NLP
DIY NLP
Conclusions

DATA SETS
Domain              | Source             | Categories                                        | Baseline
Movie reviews       | rottentomatoes.com | positive, negative                                | 50.0%
Baby products       | Amazon             | positive (****, *****), negative (*, **)          | 81.5%
Android apps        | Amazon             | positive (****, *****), negative (*, **)          | 72.7%
Android apps        | Amazon, Wikipedia  | positive (****, *****), negative (*, **), neutral | 51.7%
Hotels, restaurants | Yelp               | positive (****, *****), negative (*, **)          | 70.8%
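
The slide does not define the Baseline column. A common choice, assumed here, is the accuracy of always predicting the most frequent class, which would be consistent with the 50.0% figure for a balanced two-class set. A minimal Python sketch under that assumption, using made-up labels rather than the data from the slides:

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical toy label sets, not the data sets from the slides.
balanced = ["positive"] * 50 + ["negative"] * 50
skewed = ["positive"] * 8 + ["negative"] * 2

print(f"{majority_baseline(balanced):.1%}")  # 50.0%
print(f"{majority_baseline(skewed):.1%}")    # 80.0%
```

Any model worth using should beat this number on the same data, so a score only means something next to its baseline.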

DATA SETS: EXAMPLES
Positive: If you're looking for something scary, this is the first great horror film of the spooky season.
Neutral: Avernum is a series of demoware role-playing video games by Jeff Vogel of Spiderweb Software available for Macintosh and Windows-based computers. Several are available for iPad and Android tablet.
Negative: It's Starbucks only with bad customer service. Baristas with attitude that don't know their own product. If I'm paying $7.00 for a coffee at least drop the 'tude.

1. OFF-THE-SHELF NLP
Is it OK to be lazy?

OFF-THE-SHELF MODELS: MOVIE REVIEWS
Indico 76.8%
IBM AlchemyAPI 73.4%
Stanford CoreNLP 71.9%

OFF-THE-SHELF MODELS: BABY PRODUCTS
Indico 92.6%
MonkeyLearn (Product) 87.5%
TextBlob Pattern 82.5%

OFF-THE-SHELF MODELS: YELP REVIEWS
Indico 92.9%
Google 91.0%
IBM AlchemyAPI 90.4%

OFF-THE-SHELF MODELS: ANDROID APPS (2-WAY)
Indico 90.6%
Google 90.5%
MonkeyLearn (Product) 87.1%

OFF-THE-SHELF MODELS: ANDROID APPS (3-WAY)
Indico 80.0%
HavenOnDemand 79.1%
Google 77.8%

OFF-THE-SHELF MODELS: CONCLUSIONS
There is enormous variation between and within off-the-shelf solutions.
High quality is possible, but not guaranteed.
Comparing available solutions on your data is crucial.
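
Comparing solutions on your own data can be as simple as scoring each one against the same labelled sample. The Python sketch below is illustrative only: the two classifiers are hypothetical stand-ins for calls to real services or libraries, and the four reviews are invented.

```python
from typing import Callable, Dict, List, Tuple

Classifier = Callable[[str], str]


def accuracy(classify: Classifier, sample: List[Tuple[str, str]]) -> float:
    """Share of texts for which the predicted label equals the gold label."""
    correct = sum(1 for text, gold in sample if classify(text) == gold)
    return correct / len(sample)


def rank_solutions(solutions: Dict[str, Classifier], sample):
    """Score every solution on the same sample, best first."""
    scores = {name: accuracy(fn, sample) for name, fn in solutions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical stand-ins; a real comparison would wrap each API or library here.
def always_positive(text: str) -> str:
    return "positive"


def keyword_rule(text: str) -> str:
    negative_cues = ("bad", "terrible", "broken")
    return "negative" if any(cue in text.lower() for cue in negative_cues) else "positive"


# Invented labelled sample standing in for your own data.
sample = [
    ("Great stroller, easy to fold.", "positive"),
    ("Terrible battery life.", "negative"),
    ("The app is broken after the update.", "negative"),
    ("Works as advertised.", "positive"),
]

ranking = rank_solutions(
    {"always_positive": always_positive, "keyword_rule": keyword_rule}, sample
)
for name, score in ranking:
    print(f"{name}: {score:.1%}")
```

Keeping the sample fixed across all candidates is the point of the design: the scores are only comparable when every solution sees exactly the same texts.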

2. DIY MODELS
Are you better off building your own custom models?

DIY MODELS: PROCESS
Data → Library → Model
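
The three steps on this slide (data, library, model) can be sketched end to end. The slides do not say which library was used; in practice the library step would typically be an existing toolkit such as scikit-learn. To stay dependency-free, the sketch below instead spells out a small multinomial Naive Bayes classifier in plain Python. The class name, tokenizer, and training texts are all hypothetical, and this is not the model behind the reported results.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the smoothed log likelihood of each known word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1, data: invented toy reviews; real training needs far more examples.
train_texts = [
    "love this great product",
    "works great and easy to use",
    "terrible quality broke in a week",
    "bad service and a waste of money",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Steps 2 and 3, library and model: fit the classifier, then query it.
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great and easy"))   # positive
print(model.predict("terrible waste"))   # negative
```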

DIY MODELS: BABY PRODUCTS
DIY SVM 93.4%
Best off-the-shelf 92.6%
DIY Naive Bayes 86.3%

DIY MODELS: ANDROID APPS
DIY SVM 93.1%
Best off-the-shelf 90.6%
DIY Naive Bayes 90.3%

DIY MODELS: YELP REVIEWS
DIY SVM 95.6%
Best off-the-shelf 92.9%
DIY Naive Bayes 89.2%

DIY MODELS: CONCLUSIONS
You need sufficient relevant data to build a good model.
DIY models built with sufficient data will typically outperform off-the-shelf solutions.
DIY models may or may not be worth the effort.

CONCLUSIONS
Off-the-shelf
no data
little effort
good quality possible, but not guaranteed
no control
DIY
lots of data
more effort
superior quality
full control

ALL MODELS ARE WRONG BUT SOME ARE USEFUL
- George Box

THANKS!
Any questions?
You can find me at:
» @yvespeirsman
» yves@nlp.town

The document discusses the traits of highly skilled "10x" programmers. It notes that programming is a creative profession requiring logic-based creativity. It argues that strong programmers derive strength from principles, values, behaviors and practices that are not always visible, like an iceberg with most of its mass underwater. These include qualities like purpose, autonomy, mastery, communication, simplicity, flexibility, testing principles, learning, knowledge sharing and enjoying the work. Clean code results from craftsmanship rather than following rules, and programmers should focus on producing value rather than just code.

Software Development in the Brave New world

David Leip

The document discusses the agile software development methodology of Extreme Programming (XP). It provides an overview of XP, including its values, practices, and roles. It notes that XP focuses on communication, simplicity, feedback, and courage. Key practices include pair programming, user stories, planning iterations based on velocity, and daily stand-up meetings. The document also covers challenges and lessons learned with adopting XP.

introduction to software enginering

prasanna chitra

Software engineering is an engineering discipline that applies scientific principles and methods to the development of software. It aims to deliver reliable and efficient software on time and within budget by defining processes and procedures. The need for software engineering has arisen due to factors like increasing changes in user requirements and environments, large and complex software projects, ensuring scalability and cost-effectiveness, and managing quality.

Startups & the Product Management Perspective

Amarpreet Kalkat

Engineers tend to start most of the technology startups. While this gives them an inherent advantage as far as engineering the product goes, it also tends to put them at a disadvantage when it comes to designing (non-technically) and commercializing the product. This slide deck takes up the key concepts from PdM that apply to startup-mode products. This is not a case for having Product Managers onboard, 80% of the startups don’t need a dedicated PM. Towards the end, it introduces the funky concept of Product Entropy.

Pair programming demystified

Marek Kirejczyk

Introduction to software Engineering

Mohamed Gaafar

Software is ubiquitous in modern society and can have huge impacts, both positive and negative. However, simply programming a software is not enough - software engineering principles must be followed to develop reliable, high-quality software that meets customer needs. Some common software development issues include not fulfilling customer requirements, being difficult to improve or extend, and lacking documentation. Following a systematic process involving requirements analysis, design, implementation, testing, and maintenance can help address these issues and produce software delivered on time and budget that works as intended.

Building a New Product vs. Iterating on the Old

Product School

An important skill for a Product Manager is the ability to parachute into an existing product and land on your feet. In this workshop Paul Yokota, the Director of Product for Animoto, talked about how to immerse yourself in your new company and learn as much as you can as quickly as you can. The important thing is understanding your product's problem domain, knowing how to collect good qualitative data, learning how to set expectations with an new team and learning how to become the product diplomat.

Lean Software Development by DeKnowledge.net ----------------------------------------------------------------------------- DeKnowledge is the leading provider of project management certifications training workshops and consultancy. In addition to our open enrollment certifications training workshops, we also offer a wide range of management, leadership and technical based courses that can be tailored to fit your organization's needs. With offices in the USA, The Netherlands and India, we work with clients in USA, Europe, South Africa and Asia. Our mission is to help companies manage their projects/programs more effortlessly and efficiently. We do this by collaborating with our clients in the areas of portfolio/program and project management training workshops and consultancy.

Ruby codebases in an entropic universe

Niranjan Paranjape

The document discusses how the entropy of Ruby codebases increases over time if changes are not limited, making future changes more difficult. It advocates for writing specs to establish confidence in code and observing trends in metrics like code coverage, complexity, and churn to catch signs of rising entropy early. Sticking to conventions but knowing when to deviate, and focusing on principles over mechanics can help limit a codebase's entropy.

Django in the Real World

Jacob Kaplan-Moss

Infrastructure is development

stahnma

Fixing the program my computer learned: End-user debugging of machine-learned...

City University London

This document summarizes Dr. Simone Stumpf's research into enabling end users to debug machine-learned programs. It discusses how machine-learned programs work and the challenges end users face in debugging programs they can't see the source code of. It describes formative studies exploring different explanation approaches and the types of feedback users provide. It also covers integrating user feedback to change the machine's reasoning, identifying unpredictable user-provided features, and directions for future work.

PHP, AWS, and Sleep - Hampton Roads DevFest 2016

Guillermo A. Fisher

The document discusses implementing continuous delivery for PHP applications deployed to AWS. It covers topics like using the latest stable version of PHP, solid object-oriented design principles, automated testing tools for PHP, build automation with Jenkins and Phing, application monitoring, and infrastructure automation with AWS services like EC2, RDS, and Elastic Beanstalk. Continuous delivery is presented as a solution to dysfunctional code deployments and lack of sleep by establishing automated, reliable deployment processes.

Recommender Systems at Scale

Eoin Hurrell, PhD

The document discusses recommender systems and scaling machine learning models using Apache Spark. It introduces recommender systems and collaborative filtering using matrix factorization. It then explains how to implement alternating least squares in Spark to scale recommender systems. The document provides code examples in Python using Spark and the MovieLens dataset to demonstrate an alternating least squares model for movie recommendations.

API Athens Meetup - API standards 25-6-2014

openi_ict

The document discusses different API description formats including API Blueprint, RAML, and Swagger. It provides an overview of each format's key features such as how they model REST, available tooling, community size, and licensing. The conclusion is that while Swagger has the largest adoption, both RAML and API Blueprint offer some advanced features but lack tooling and adoption. The best choice depends on needs and technologies. Examples of other formats like WADL, Discovery Docs, and Hydra are also briefly mentioned.

API Athens Meetup - API standards 25-6-2014

Michael Petychakis

The document discusses different API description formats including API Blueprint, RAML, and Swagger. It provides an overview of each format's key features such as how they model REST, available tooling, community size, and licensing. The conclusion is that while Swagger has the largest adoption, both RAML and API Blueprint offer some advanced features but lack tooling and community. The best choice depends on needs and technologies. Examples of other formats like WADL, Discovery Docs, and Hydra are also briefly mentioned.

Mining apps for anomalies

Ahmed Kamel Taha

The document discusses mining apps to detect abnormal behavior. It describes how app mining leverages common patterns across thousands of apps to learn what normal behavior is and identify anomalies. The document introduces CHABADA, a tool that detects mismatches between an app's behavior and description by analyzing APIs and clustering apps by topic. While app stores provide a treasure trove of data, obstacles include limited access to apps, metadata, and developer information.

Quality of Bug Reports in Open Source

Thomas Zimmermann

The document discusses research into what makes a good bug report based on a survey of over 150 developers. It finds that the most helpful items for fixing bugs are steps to reproduce, stack traces, and observed behavior. The biggest problems causing delays are incomplete information, wrong steps to reproduce, and wrong expected behavior. The research also measured bug report quality and found stack traces and readability correlated with shorter fix times.

Atmosphere Conference 2015: The 10 Myths of DevOps

PROIDEA

Speaker: Seth Vargo Language: English Although not officially coined until 2009, DevOps ideals have been explicitly discussed since at least 2006. Recently, however, the term "DevOps" has gained increasing popularity across a variety of fields and industries. DevOps is not a development methodology or technology; DevOps is an ideology. It is a way to facilitate organizational prosperity and growth while increasing each individual employee's happiness along the way. As DevOps has gained in prominence, a gap has been created between the original definition of DevOps and this new "enterprise-ready" buzzword. For organizations beginning DevOps practices, this talk will provide a 10,000ft view of DevOps and how you can properly implement DevOps practices in your organization. For organizations that are currently practicing DevOps, this talk will cover common pitfalls, ways to sustain a happy culture, and new tips to foster organizational prosperity. Visit our website: http://atmosphere-conference.com/

APIdays Paris 2019 Backend is the new frontend by Antoine Cheron

apidays

Bob and Alice are building a new app together, with Bob working on the backend API and Alice on the frontend. As they work, the API requirements change frequently, requiring changes to parameters, operations, and data models. This can lead to errors when the frontend code is not updated to match. The presentation proposes a solution using OpenAPI specifications enriched with semantics and hypermedia controls to advertise available operations and parameters at runtime. This would allow the frontend code to be written in a more business-focused way and automatically integrate API changes without breaking. Limitations include the difficulty of deeply linking operations to data and the work required to implement and maintain the vocabulary.

Machine Learning Model for Gender Detection

TecnoIncentive

Defend against adversarial AI using Adversarial Robustness Toolbox

Animesh Singh

With great power comes great responsibility. Adversarial examples in AI pose an asymmetrical challenge with respect to attackers and defenders. AI developers must be empowered to defend deep neural networks against adversarial attacks and allow rapid crafting and analysis of attack and defense methods for machine learning models. Animesh Singh and Tommy Li explain how to implement state-of-the-art methods for attacking and defending classifiers using the open source Adversarial Robustness Toolbox. The library provides AI developers with interfaces that support the composition of comprehensive defense systems using individual methods as building blocks. Animesh and Tommy then demonstrate how to use a Jupyter notebook to leverage attack methods from the Adversarial Robustness Toolbox (ART) into a model training pipeline. This notebook trains a CNN model on the Fashion MNIST dataset, and the generated adversarial samples are used to evaluate the robustness of the trained model.

Leveraging Open Source Automated Data Science Tools

Domino Data Lab

The data science process seeks to transform and empower organizations by finding and exploiting market inefficiencies and potentially hidden opportunities, but this is often an expensive, tedious process. However, many steps can be automated to provide a streamlined experience for data scientists. Eduardo Arino de la Rubia explores the tools being created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation and impact validation. The promise of the automated statistician is almost as old as statistics itself. From the creations of vast tables, which saved the labor of calculation, to modern tools which automatically mine datasets for correlations, there has been a considerable amount of advancement in this field. Eduardo compares and contrasts a number of open source tools, including TPOT and auto-sklearn for automated model generation and scikit-feature for feature generation and other aspects of the data science workflow, evaluates their results, and discusses their place in the modern data science workflow. Along the way, Eduardo outlines the pitfalls of automated data science and applications of the “no free lunch” theorem and dives into alternate approaches, such as end-to-end deep learning, which seek to leverage massive-scale computing and architectures to handle automatic generation of features and advanced models.

Cinci ug-january2011-anti-patterns

Steven Smith

The document discusses anti-patterns and worst practices in software development. Some examples covered include static cling pattern, flags over objects, premature optimization, copy-paste-compile, and reinventing the wheel. It also shares lessons learned from experiences, such as being mindful of date times across time zones, avoiding building SQL from untrusted inputs, and not being too cute with test data. Overall, the document aims to help developers learn from the mistakes of others and adopt better practices.

Web And App Design

Outreach Digital

Web & Mobile App Design for Non-Coders with Bubble.is

James Eckhardt

Walter api

Nicholas Schiller

Nicholas Schiller presented on using APIs to customize library services. He demonstrated how to build a web application using the WorldCat Search API that automatically adds Boolean search terms to a user's query and formats the results. The application was built with PHP for server-side scripting, HTML5 for interface design, and jQuery Mobile to optimize for different devices. The presentation provided examples of APIs, guidelines for API projects, and resources for further learning about APIs and programming.

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024

Neo4j

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU

panagenda

Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/ DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen! Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell. Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten. Diese Themen werden behandelt - Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten - Wie funktionieren CCB- und CCX-Lizenzen wirklich? - Verstehen des DLAU-Tools und wie man es am besten nutzt - Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw. - Praxisbeispiele und Best Practices zum sofortigen Umsetzen

Similar to LT-Accelerate 2016: Between Custom and Off-the-shelf NLP

DeKnowledge - Try us

Bob Pinto

Ruby codebases in an entropic universe

Niranjan Paranjape

Django in the Real World

Jacob Kaplan-Moss

Infrastructure is development

stahnma

Fixing the program my computer learned: End-user debugging of machine-learned...

City University London

PHP, AWS, and Sleep - Hampton Roads DevFest 2016

Guillermo A. Fisher

Recommender Systems at Scale

Eoin Hurrell, PhD

API Athens Meetup - API standards 25-6-2014

openi_ict

API Athens Meetup - API standards 25-6-2014

Michael Petychakis

The document discusses different API description formats including API Blueprint, RAML, and Swagger. It provides an overview of each format's key features such as how they model REST, available tooling, community size, and licensing. The conclusion is that while Swagger has the largest adoption, both RAML and API Blueprint offer some advanced features but lack tooling and community. The best choice depends on needs and technologies. Examples of other formats like WADL, Discovery Docs, and Hydra are also briefly mentioned.

Mining apps for anomalies

Ahmed Kamel Taha

Quality of Bug Reports in Open Source

Thomas Zimmermann

Atmosphere Conference 2015: The 10 Myths of DevOps

PROIDEA

APIdays Paris 2019 Backend is the new frontend by Antoine Cheron

apidays

Machine Learning Model for Gender Detection

TecnoIncentive

Defend against adversarial AI using Adversarial Robustness Toolbox

Animesh Singh

Leveraging Open Source Automated Data Science Tools

Domino Data Lab

Cinci ug-january2011-anti-patterns

Steven Smith

Web And App Design

Outreach Digital

Web & Mobile App Design for Non-Coders with Bubble.is

James Eckhardt

Walter api

Nicholas Schiller

Similar to LT-Accelerate 2016: Between Custom and Off-the-shelf NLP (20)

DeKnowledge - Try us

Ruby codebases in an entropic universe

Django in the Real World

Infrastructure is development

Fixing the program my computer learned: End-user debugging of machine-learned...

PHP, AWS, and Sleep - Hampton Roads DevFest 2016

Recommender Systems at Scale

API Athens Meetup - API standards 25-6-2014

Mining apps for anomalies

Quality of Bug Reports in Open Source

Atmosphere Conference 2015: The 10 Myths of DevOps

APIdays Paris 2019 Backend is the new frontend by Antoine Cheron

Machine Learning Model for Gender Detection

Defend against adversarial AI using Adversarial Robustness Toolbox

Leveraging Open Source Automated Data Science Tools

Cinci ug-january2011-anti-patterns

Web And App Design

Web & Mobile App Design for Non-Coders with Bubble.is

Walter api

Recently uploaded

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024

Neo4j

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU

panagenda

UiPath Test Automation using UiPath Test Suite series, part 6

DianaGray10

Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI. UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities. Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes. What will you get from this session? 1. Insights into integrating generative AI. 2. Understanding how this integration enhances test automation within the UiPath platform 3. Practical demonstrations 4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath Topics covered: What is generative AI Test Automation with generative AI and Open AI. UiPath integration with generative AI Speaker: Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP

Uni Systems Copilot event_05062024_C.Vlachos.pdf

Uni Systems S.M.S.A.

Programming Foundation Models with DSPy - Meetup Slides

Zilliz

HCL Notes and Domino License Cost Reduction in the World of DLAU

panagenda

Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/ The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this! We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model. Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward. These topics will be covered - Reducing license cost by finding and fixing misconfigurations and superfluous accounts - How do CCB and CCX licenses really work? - Understanding the DLAU tool and how to best utilize it - Tips for common problem areas, like team mailboxes, functional/test users, etc - Practical examples and best practices to implement right away

Communications Mining Series - Zero to Hero - Session 1

DianaGray10

This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered: • Communication Mining Overview • Why is it important? • How can it help today’s business and the benefits • Phases in Communication Mining • Demo on Platform overview • Q/A

GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...

Neo4j

Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.

Cosa hanno in comune un mattoncino Lego e la backdoor XZ?

Speck&Tech

ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune. Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile. BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).

Removing Uninteresting Bytes in Software Fuzzing

Aftab Hussain

LT-Accelerate 2016: Between Custom and Off-the-shelf NLP

1. Between Custom and Off-the-shelf NLP. Yves Peirsman

2. NLP LANDSCAPE: Cloud APIs (train your own models); Libraries (pre-trained models); Cloud APIs (pre-trained models); Libraries (train your own models)

3. SENTIMENT ANALYSIS: “The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.”

4. SENTIMENT ANALYSIS: APPLICATIONS http://tarsier.monkeylearn.com

5. SENTIMENT ANALYSIS: APPLICATIONS http://varianceexplained.org/r/trump-tweets/

6. SENTIMENT ANALYSIS: APPLICATIONS http://www.stockfluence.com/

7. OVERVIEW: Off-the-shelf NLP; DIY NLP; Conclusions

8. NLP LANDSCAPE: Cloud APIs (train your own models); Libraries (pre-trained models); Cloud APIs (pre-trained models); Libraries (train your own models)

9. DATA SETS
   Domain | Source | Categories | Baseline
   Movie reviews | rottentomatoes.com | positive, negative | 50.0%
   Baby products | Amazon | positive (****, *****), negative (*, **) | 81.5%
   Android apps | Amazon | positive (****, *****), negative (*, **) | 72.7%
   Android apps | Amazon, Wikipedia | positive (****, *****), negative (*, **), neutral | 51.7%
   Hotels, restaurants | Yelp | positive (****, *****), negative (*, **) | 70.8%

10. DATA SETS: EXAMPLES
    Positive: If you're looking for something scary, this is the first great horror film of the spooky season.
    Neutral: Avernum is a series of demoware role-playing video games by Jeff Vogel of Spiderweb Software available for Macintosh and Windows-based computers. Several are available for iPad and Android tablet.
    Negative: It's Starbucks only with bad customer service. Baristas with attitude that don't know their own product. If I'm paying $7.00 for a coffee at least drop the 'tude.
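The slides do not say how the "Baseline" column in the data set table was computed. A common reading, and purely an assumption here, is the accuracy of always predicting the most frequent category, which is consistent with the 50.0% figure for a balanced two-class set. A minimal Python sketch under that assumption (the function name and the toy labels are illustrative, not the talk's data):

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    # most_common(1) returns a list with one (label, count) pair.
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)


# Hypothetical label distribution, invented for illustration only.
toy_labels = ["positive"] * 8 + ["negative"] * 2
print(f"{majority_baseline(toy_labels):.1%}")  # prints 80.0%
```

Any classifier worth using on a data set should clear this number, which is why skewed sets such as the review collections have a much higher bar than balanced ones.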

11. Part 1: OFF-THE-SHELF NLP. Is it OK to be lazy?

12. OFF-THE-SHELF MODELS: Variables; Data

13. OFF-THE-SHELF MODELS: MOVIE REVIEWS. Indico 76.8%; IBM AlchemyAPI 73.4%; Stanford CoreNLP 71.9%

14. OFF-THE-SHELF MODELS: BABY PRODUCTS. Indico 92.6%; MonkeyLearn (Product) 87.5%; TextBlob Pattern 82.5%

15. OFF-THE-SHELF MODELS: YELP REVIEWS. Indico 92.9%; Google 91.0%; IBM AlchemyAPI 90.4%

16. OFF-THE-SHELF MODELS: ANDROID APPS (2-WAY). Indico 90.6%; Google 90.5%; MonkeyLearn (Product) 87.1%

17. OFF-THE-SHELF MODELS: ANDROID APPS (3-WAY). Indico 80.0%; HavenOnDemand 79.1%; Google 77.8%

18. OFF-THE-SHELF MODELS: CONCLUSIONS. There is enormous variation between and within off-the-shelf solutions. High quality is possible, but not guaranteed. Comparing available solutions on your data is crucial.
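Comparing candidate solutions on your own data can be sketched in a few lines of Python. This is only an illustration: `always_positive` and `keyword_rule` are hypothetical stand-ins for wrappers around real services or libraries (no vendor API is called here), and the last two labelled texts are invented; the first two are shortened from the example slide.

```python
def accuracy(predict, examples):
    """Share of (text, gold_label) pairs for which predict(text) == gold_label."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for text, gold in examples if predict(text) == gold)
    return correct / len(examples)


def rank_candidates(candidates, examples):
    """Return (name, accuracy) pairs sorted from best to worst."""
    scores = [(name, accuracy(fn, examples)) for name, fn in candidates.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins: in practice each function would wrap one
# off-the-shelf API or library and map its output to your label set.
def always_positive(text):
    return "positive"


def keyword_rule(text):
    return "negative" if "bad" in text.lower() else "positive"


labelled = [
    ("This is the first great horror film of the spooky season.", "positive"),
    ("It's Starbucks only with bad customer service.", "negative"),
    ("Bad packaging and it broke after a week.", "negative"),  # invented
    ("Works well and my baby loves it.", "positive"),          # invented
]

ranking = rank_candidates(
    {"always_positive": always_positive, "keyword_rule": keyword_rule},
    labelled,
)
for name, score in ranking:
    print(f"{name}: {score:.1%}")
```

The point of the design is that every candidate is scored on the same labelled sample from your own domain, so the ranking reflects your data rather than a vendor's benchmark.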

19. Part 2: DIY MODELS. Are you better off building your own custom models?

20. DIY MODELS: PROCESS. Data; Library; Model

21. DIY MODELS: PROCESS

22. DIY MODELS: BABY PRODUCTS. DIY SVM 93.4%; Best off-the-shelf 92.6%; DIY Naive Bayes 86.3%

23. DIY MODELS: ANDROID APPS. DIY SVM 93.1%; Best off-the-shelf 90.6%; DIY Naive Bayes 90.3%

24. DIY MODELS: YELP REVIEWS. DIY SVM 95.6%; Best off-the-shelf 92.9%; DIY Naive Bayes 89.2%

25. DIY MODELS: BABY PRODUCTS

26. DIY MODELS: ANDROID APPS

27. DIY MODELS: CONCLUSIONS. You need sufficient relevant data to build a good model. DIY models built with sufficient data will typically outperform off-the-shelf solutions. DIY models may or may not be worth the effort.
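To give a feel for what a DIY model involves, here is a minimal multinomial Naive Bayes text classifier in plain Python. It is a sketch, not the setup used in the talk: the slides do not state which library, features, or training sizes were used, and the class name, whitespace tokenizer, add-one smoothing, and four training texts below are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus summed log likelihoods of the known words.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are skipped
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data, far too small for a real model.
train_texts = [
    "great product love it",
    "works great",
    "terrible waste of money",
    "broke after one day terrible",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great value"))
print(model.predict("terrible"))
```

With only four training texts the model knows thirteen words, which illustrates the first conclusion above: without sufficient relevant data, a DIY model has little to work with.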

28. Part 3: Conclusions

29. CONCLUSIONS
    Off-the-shelf: no data; little effort; good quality possible, but not guaranteed; no control
    DIY: lots of data; more effort; superior quality; full control

30. ALL MODELS ARE WRONG, BUT SOME ARE USEFUL - George Box

32. THANKS! Any questions? You can find me at: @yvespeirsman, yves@nlp.town
