SiLCC is a cloud-based service for parsing text and extracting relevant keywords. To use it, first apply for an API key, add the key to your application, and then push content to our server. As we receive your content, we parse it, extract relevant 'tags', and send them back to your app. From there, user interaction with those tags (editing or removal) helps to improve our algorithms.
SiLCC also features robust glossaries for Twitter pico-formats and SMS txtSpeak. It specializes in the semantic tagging of content that's 280 characters or less.
4. SWIFTRIVER IS FOR...
Improving information findability
Surfacing content you didn't know you were looking for
Understanding media from other parts of the world (translation)
Making urgent data more discoverable (structured, published and accessible)
Verifying eyewitness accounts
Using location as context
Expanding the grassroots reporting network
Preserving information (archiving)
7. WHAT IS SILCC?
•Swift Language Computation Component
•One of the SwiftRiver Web Services
•Open Web API
•Semantic Tagging of Short Text
•Multilingual
•Multiple sources (Twitter, email, SMS, blogs, etc.)
•Active Learning capability
•Open Source
•Easy to Deploy, Modify and Run
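Since SiLCC is exposed as an open web API, a client simply POSTs text plus its API key to a deployment. The sketch below shows what such a call might look like; the endpoint URL and parameter names ("apikey", "text") are illustrative assumptions, not the documented interface.

```python
# Hypothetical client call to a SiLCC deployment's tag endpoint.
# The URL and form-field names below are assumptions for illustration.
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "http://example.org/silcc/tag"  # placeholder deployment URL

def build_tag_request(api_key: str, text: str) -> Request:
    """Prepare a POST request sending text to the tagging service."""
    body = urlencode({"apikey": api_key, "text": text}).encode("utf-8")
    return Request(API_URL, data=body, method="POST")

req = build_tag_request("MY-KEY", "PBS addresses mental health needs in Haiti")
# req is ready to be sent with urllib.request.urlopen(req) against a live server.
```

The API key travels with every request, matching the dataflow described later in the deck, where the key prevents malicious open use of the service.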
8. Swiftriver SiLCC Dataflow
SiSLS (Swiftriver Source Library Service)
Content Items coming from the SiSLS have global trust values added to the object model where SiSLS integration is enabled. The text of the content is then sent to the SiLCC.
SiLCC (Swiftriver Language Computational Core)
An API key is sent along with the text to ensure that the SiLCC is not open to any malicious usage. Using NLP, the SiLCC extracts nouns and other keywords from the text. (There is still a bit of ambiguity around what the NLP should extract from the text, but at its most simple, all the nouns would be a good start.) The SiLCC sends back a list of tags that are added to the Content Item, along with any that were extracted from the source data by the parser.
SLISa (Swiftriver Language Improvement Service)
Although the NLP tags have now been applied, the SLISa is responsible for applying instance-specific tagging corrections.
9. OUR GOALS
•Simple Tagging of short snippets of text
•Rapid tagging for high volume environments
•Simple API, easy to use
•Learns from user feedback
•Routing of messages to upstream services
•Semantic Classification
•Sorts rapid streams into buckets
•Clusters similar messages
•Visual effects
•Cross-referencing
10. WHAT IT’S NOT
•Does not do deep analysis of text
•Only identifies words within original text
11. HOW DOES IT WORK?
•Step 1: Lexical Analysis
•Step 2: Parsing into constituent parts
•Step 3: Part of Speech tagging
•Step 4: Feature extraction
•Step 5: Compute using feature weights
•Let's examine each one in turn...
12. STEP 1: LEXICAL ANALYSIS
•For news headlines and email subjects this is trivial: just split on spaces.
•For Twitter this is more complex...
13. TWEET ANALYSIS
•Tweets are surprisingly complex
•Only 140 characters but many features
•Emergent features from community (e.g. hashtags)
•Let's take a look at a typical tweet...
14. TWEET ANALYSIS
The typical Tweet: “RT @directrelief: RT @PIH: PBS @NewsHour addresses mental health needs in the aftermath of the #Haiti earthquake #health #earthquake... http://bit.ly/bNhyK6”
•RT indicates a “re-tweet”
•@name indicates who the original tweeter was
•Multiple embedded retweets
•Hashtags (e.g. #Haiti) can play two roles, as a tag and as part of the sentence
15. TWEET ANALYSIS 2
•Two or more hashtags within a tweet (e.g. #health and #earthquake)
•Continuation dots “...” indicate that there was more text that didn’t fit into the 140-character limit somewhere in its history
•URLs: many tweets contain one or more URLs
As we can see, this simple tweet contains no less than 7 different features, and that’s not all!
16. TWEET ANALYSIS 3
We want to break up the tweet into the following parts:
{
  'text': ['PBS addresses mental health needs in the aftermath of the Haiti earthquake'],
  'hashtags': ['#Haiti', '#health', '#earthquake'],
  'names': ['@directrelief', '@PIH', '@NewsHour'],
  'urls': ['http://bit.ly/bNhyK6'],
}
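A rough regex-based sketch of this split is shown below. It is not the project's actual TweetParser; note also that, unlike the example above (where #Haiti plays a dual role as tag and sentence word), this naive version strips every hashtag out of the text portion.

```python
import re

def parse_tweet(tweet: str) -> dict:
    """Split a tweet into plain text, hashtags, @names and URLs (rough sketch)."""
    hashtags = re.findall(r"#\w+", tweet)
    names = re.findall(r"@\w+", tweet)
    urls = re.findall(r"https?://\S+", tweet)
    # Remove the extracted features plus RT markers and continuation dots,
    # leaving (approximately) the grammatical text portion.
    text = tweet
    for token in hashtags + names + urls + ["RT", "..."]:
        text = text.replace(token, "")
    text = re.sub(r"[:\s]+", " ", text).strip()
    return {"text": [text], "hashtags": hashtags, "names": names, "urls": urls}

tweet = ("RT @directrelief: RT @PIH: PBS @NewsHour addresses mental health "
         "needs in the aftermath of the #Haiti earthquake #health "
         "#earthquake... http://bit.ly/bNhyK6")
parsed = parse_tweet(tweet)
```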
17. TWEET ANALYSIS 4
Why do we want to break up the tweet into parts (parsing)?
•Because we want to further process the grammatically correct English text
•Part of speech tagging would otherwise be corrupted by words it cannot recognize (e.g. URLs, hashtags, @names etc.)
•We want to save the hashtags for later use
•Many of the features are irrelevant to the task of identifying tags (e.g. dots, punctuation, @names, RT)
18. TWEET ANALYSIS 5
•We now take the “text” portion of the tweet and perform part of speech tagging on it
•After part of speech tagging, we perform feature extraction
•Features are now passed through the keyword classifier, which returns a list of keywords / tags
•Finally we combine these tags with the hashtags we saved earlier to give the complete tag set
19. HEADLINE AND EMAIL SUBJECT ANALYSIS
•This is much simpler to do
•It's a subset of the steps in Tweet Analysis
•There is no parsing since there are no hashtags, @names etc.
20. FEATURE EXTRACTION
• For the active learning algorithm we need to extract features to use in classification
• These features should be subject/domain independent
• We therefore never use the actual words as features
• Using words would, for example, give artificially high weights to words such as “earthquake”
• We don't want these artificial weights because we can’t foresee future disasters and we want to be as generic with classification as possible
• The use of training sets does allow for domain customization where necessary
21. FEATURE EXTRACTION
• Capitalization of individual words: either first caps or all caps; this is an important indicator of proper nouns or other important words that make good tag candidates
• Position in text: tags seem to have a greater preponderance near the beginning of text
• Part of speech: nouns and proper nouns are particularly important, but so are some adjectives and adverbs
• Capitalization of entire text: sometimes the whole text is capitalized, and this should reduce the overall weighting of other features
• Length of the text: in shorter texts the words are more likely to be tags
• The parts of speech of the previous and next words (effectively this means we are using trigrams, or a window of 3)
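The feature classes above can be sketched as a single extraction function. The exact encoding here (feature names, the normalized-position value, the START/END sentinels) is an illustrative assumption, not the project's real feature set.

```python
def word_features(words, i, pos_tags):
    """Domain-independent features for the word at index i (illustrative sketch)."""
    w = words[i]
    return {
        "first_cap": w[:1].isupper() and not w.isupper(),  # First-caps word
        "all_caps": w.isupper() and len(w) > 1,            # Fully capitalized word
        "position": i / max(len(words) - 1, 1),            # 0.0 = start, 1.0 = end
        "pos": pos_tags[i],                                # Part of speech of word
        "prev_pos": pos_tags[i - 1] if i > 0 else "START", # POS of previous word
        "next_pos": pos_tags[i + 1] if i < len(words) - 1 else "END",
        "text_len": len(words),                            # Length of the whole text
        "text_all_caps": all(t.isupper() for t in words),  # Entire text capitalized?
    }

words = ["PBS", "addresses", "mental", "health", "needs"]
tags = ["NNP", "VBZ", "JJ", "NN", "NNS"]  # assumed part-of-speech tags
feats = word_features(words, 0, tags)
```

Note that no actual word ever appears as a feature value, which keeps the classifier domain-independent as the slide requires.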
22. TRAINING
• Requires user-reviewed examples
• Lexical analysis, parsing and feature extraction are performed on the examples
• Multinomial naïve Bayes algorithm
• NB: the granularity we are classifying at is the word level
• For each word in the text, we classify it as either a keyword or not
• This has the pleasant side effect of providing several training examples from each user-reviewed text
• Even with fewer than 50 reviewed texts the results are comparable to the simple approach of using nouns only
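The word-level scheme above can be sketched as a tiny naïve Bayes keyword classifier. This is a toy with Laplace smoothing over categorical features, standing in for, not reproducing, the project's implementation.

```python
# Toy word-level naive Bayes: each training item is (feature_dict, is_keyword).
from collections import Counter, defaultdict
import math

def train(examples):
    """Count class priors and per-class (feature, value) occurrences."""
    priors = Counter(label for _, label in examples)
    likelihood = defaultdict(Counter)
    for feats, label in examples:
        for f, v in feats.items():
            likelihood[label][(f, v)] += 1
    return priors, likelihood

def is_keyword(feats, priors, likelihood):
    """Compare smoothed log-posteriors for keyword vs. non-keyword."""
    def score(label):
        total = priors[label]
        s = math.log(total / sum(priors.values()))
        for f, v in feats.items():
            s += math.log((likelihood[label][(f, v)] + 1) / (total + 2))  # Laplace
        return s
    return score(True) > score(False)

examples = [  # one training example per reviewed word, as the slide notes
    ({"pos": "NNP", "first_cap": True}, True),
    ({"pos": "NN", "first_cap": False}, True),
    ({"pos": "DT", "first_cap": False}, False),
    ({"pos": "IN", "first_cap": False}, False),
]
priors, likelihood = train(examples)
```

Because every word of a reviewed text yields one training pair, even a few dozen reviewed texts produce hundreds of examples, which is why results become usable so quickly.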
23. ACTIVE LEARNING
•The API also provides a method for users to send back corrected text
•The corrected text is saved and then used in the next iteration of training
•Users may optionally specify a corpus for the example to go into
•Training can be performed using any combination of corpora
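The feedback loop can be sketched as follows; the storage shape and the default corpus name are assumptions for illustration, not the service's real persistence format.

```python
def record_correction(store, text, tags, corpus="default"):
    """Append a user-corrected example to the named corpus; the next
    training run can draw on any combination of these corpora."""
    store.setdefault(corpus, []).append({"text": text, "tags": tags})

store = {}
record_correction(store, "PBS addresses the Haiti earthquake",
                  ["Haiti", "earthquake"])
record_correction(store, "Clinic reopens in Port-au-Prince",
                  ["Clinic", "Port-au-Prince"], corpus="health")
```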
24. DEVELOPER FRIENDLY
•Two levels of API: the web API and the internal Python API
•Either one may be used, but most users will use the web API
•Design is highly modular and maintainable
•For very rapid backend processing the native Python API can be used
25. PYTHON CLASSES
Most of the classes that make up the library are divided into three types:
1) Tokenizers
2) Parsers
3) Taggers
All three types have consistent APIs and are interchangeable.
26. PYTHON API
•A tagger calls a parser
•A parser calls a tokenizer
•Output of the tokenizer goes into the parser
•Output of the parser goes into the tagger
•Output of the tagger goes into the user!
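The call chain above can be sketched with three minimal classes. These are simplified stand-ins for the library's real classes, and the capitalization rule at the end is a crude placeholder for the actual POS-based tagging.

```python
class BasicTokenizer:
    def tokenize(self, text):
        return text.split()                     # plain text: split on spaces

class BasicParser:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    def parse(self, text):
        # For plain text there is little to parse; just wrap the tokens.
        return {"text": self.tokenizer.tokenize(text)}

class BasicTagger:
    def __init__(self, parser):
        self.parser = parser
    def tag(self, text):
        words = self.parser.parse(text)["text"]
        # Crude keyword rule standing in for the POS-based tagger:
        # keep capitalized words that are not sentence-initial.
        return [w for i, w in enumerate(words) if w[:1].isupper() and i > 0]

# tagger calls parser, parser calls tokenizer, output flows back up to the user
tagger = BasicTagger(BasicParser(BasicTokenizer()))
tags = tagger.tag("PBS addresses the Haiti earthquake aftermath")
```

Because the three layers share consistent interfaces, a TweetTokenizer or BayesTagger could be swapped in without touching the rest of the chain.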
27. CLASSES
• BasicTokenizer – used for splitting basic (non-tweet) text into individual words
• TweetTokenizer – used to tokenize a tweet; it may also be used to tokenize plain text, since plain text is a subset of tweets
• TweetParser – calls the TweetTokenizer and then parses the output (see previous example)
• TweetTagger – calls the TweetTokenizer, then tags the output of the text part and adds the hashtags
• BasicTagger – calls the BasicTokenizer and then tags the text; should only be used for non-tweet text; uses simple part of speech to identify tags
• BayesTagger – same as BasicTagger but uses weights from the naïve Bayes training algorithm
28. DEPENDENCIES
•Part of speech tagging is currently performed by the Python NLTK
•The Web API uses the Pylons web framework
29. CURRENT STATUS
•The Tag method of the API is ready for use; individual deployments can choose between using the BasicTagger or the BayesTagger
•The Tell method (for user feedback) will be ready by the time you read this!
•Training is possible on corpora of tagged data in .csv format (see examples in distribution)
30. CURRENT LIMITATIONS
•Only English text is supported at the moment
•Tags are always one of the words in the supplied text, i.e. they can never be a word not in the supplied text
•Very few training examples exist at the moment
31. FUTURE WORK
•Multilingual: use non-English part of speech taggers
•UTF-8 compatibility
•Experiment with different learning algorithms (e.g. neural networks)
•Perform external text analysis (e.g. if there is a URL, analyze the text at the URL as well as in the tweet)
•Allow users to specify the required density of tags
32. SWIFT RIVER
jon@ushahidi.com
http://swift.ushahidi.com
http://github.com/appfrica/silcc
An Ushahidi Initiative
by Neville Newey and Jon Gosier