This document summarizes a study that compares different feature selection methods for text categorization. The study aims to reduce the number of dimensions to address issues with high dimensionality. It evaluates several statistical classification methods and feature selection techniques, including document frequency thresholding, information gain, mutual information, CHI statistic, and term strength. The experiments apply k-nearest neighbors and linear least squares fitting classifiers to Reuters and OHSUMED corpora. The results show the best performance with a vocabulary size of around 2,000 terms and that information gain and CHI are the most aggressive at term removal.
Simple fuzzy name matching in Elasticsearch, Paris meetup - Basis Technology
These are the slides presented at the Elasticsearch meetup in Paris on July 29th.
Normalization is crucial to high quality search results -- who wants irrelevant variations between queries and documents leading to missed hits (e.g., “celebrity” v. “celebrities”)? Normalizing dictionary words works, but what if your application focuses on names? Whether you’re tackling log analysis, e-commerce, watch list screening or other applications, names are often the key. Can you find “Abdul Jabbar, Karim” if you search for “Kareem AbdalJabar” or “كريم عبد الجبار”?
Applications using Elasticsearch provide some fuzziness by mixing its built-in edit-distance matching and phonetic analysis with more generic analyzers and filters. We’ve tried to go beyond that to provide both better matching and a simpler integration. We use a custom Mapper and Score Function so that linguistic nuances can be handled behind the scenes. We’ll talk about how we built this sort of plug-in for Rosette, its customization, and its connection to the broader trend of entity-centric search.
This presentation by CADE (Brazilian Competition Authority) was made during a workshop on “Cartel screening in the digital era” held by the OECD in Paris on 30 January 2018. More papers and presentations on the topic can be found at oe.cd/wcsde.
#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of... - epamspb
"Neural networks for extracting structured information from documents"
Printed documents are an important part of our lives. In my talk you will learn how to apply deep neural networks to extract structured information from documents with varying templates. I will also explain how to solve problems in projects that use machine learning. The talk will be useful not only to those who already use machine learning successfully, but also to anyone who is simply interested in the topic.
How can text-mining leverage developments in Deep Learning? Presentation at ... - jcscholtes
How can text-mining leverage developments in Deep Learning?
Text-mining focuses primarily on extracting complex patterns from unstructured electronic data sets and applying machine learning for document classification. During the last decade, a generation of efficient and successful algorithms has been developed using bag-of-words models to represent document content, together with statistical and geometrical machine learning algorithms such as Conditional Random Fields and Support Vector Machines. These algorithms require relatively little training data and are fast on modern hardware. However, performance seems to be stuck at F1 values of around 90%.
In computer vision, deep learning has shown great success, and the 90% barrier has been broken in many applications. In addition, deep learning also shows new successes for transfer learning and self-learning such as reinforcement learning. Dedicated hardware helped us to overcome computational challenges, and methods such as training data augmentation solved the need for unrealistically large data sets.
So, it would make sense to apply deep learning to textual data as well. But how do we represent textual data? There are many different methods for word embeddings and as many deep learning architectures. Training data augmentation, transfer learning and reinforcement learning are not fully defined for textual data.
Data Science Keys to Open Up OpenNASA Datasets - PyData
By Noemi Derzsy
PyData New York City 2017
Open source data has enabled society to engage in community-based research, and has provided government agencies with more visibility and trust from individuals. I will briefly introduce the openNASA platform with over 32,000 open NASA datasets, and I will present open NASA metadata analysis, and tools for applying NLP/topic modeling techniques to understand open government dataset associations.
Ever been stuck in a data science use case where any approach seems too hard? Graph theory, describing a system just in terms of nodes and links, could be your answer! In the practical example we’ll show, we’ll try to find data science communities and their leaders in LinkedIn. Challenge accepted?
Aurélia Nègre & Alberto Guggiola - Quantmetry
https://dataxday.fr/
Near duplicate detection method based on random projection.
This presentation gives an overview of the existing categories of NDD methods and introduces WSH (Weighted SimHash). It also presents some results comparing the original SimHash with WSH and a cosine-similarity-based method.
A Practical Use of Artificial Intelligence in the Fight Against Cancer by Bri... - Data Con LA
Abstract: Artificial Intelligence is an important topic in the fight against cancer. Clinical trials are at the frontier of innovation. I will discuss techniques, data sets and platforms we use at Deep 6 to bring patients to clinical trials. The focus will be on practical, repeatable methods I've developed at MySpace, Greenplum, UCLA and the US Intelligence Community.
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014) - Konstantinos Zagoris
H-KWS 2014 is the Handwritten Keyword Spotting Competition organized in conjunction with the ICFHR 2014 conference. The main objective of the competition is to record current advances in keyword spotting algorithms using established performance evaluation measures frequently encountered in the information retrieval literature. The competition comprises two distinct tracks, namely, a segmentation-based and a segmentation-free track. Five (5) distinct research groups have participated in the competition with three (3) methods for the segmentation-based track and four (4) methods for the segmentation-free track. The benchmarking datasets that were used in the contest contain both historical and modern documents from multiple writers. In this paper, the contest details are reported including the evaluation measures and the performance of the submitted methods along with a short description of each method.
Similar to A Comparative Study On Featuree Selection In Text2 (20)
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
JMeter webinar - integration with InfluxDB and Grafana - RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
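8. χ² Statistic (CHI)
• Measures the lack of independence between a term t and a category c.
• A: number of documents in which t and c co-occur; B: number of documents in which t occurs without c.
• C: number of documents in which c occurs without t; D: number of documents in which neither t nor c occurs.
• N: total number of documents.
If t and c are independent, the value is 0.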
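In terms of these counts, the χ² statistic is commonly written as χ²(t, c) = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D)). The following is a minimal Python sketch of that computation, assuming documents are represented as (set of terms, category label) pairs. The function names `contingency` and `chi_square` and the toy corpus are hypothetical and are not taken from the study.

```python
from typing import Iterable, Set, Tuple


def contingency(docs: Iterable[Tuple[Set[str], str]], term: str, category: str):
    """Count A, B, C, D for one (term, category) pair.

    Assumed input format: each document is (set_of_terms, category_label).
    """
    a = b = c = d = 0
    for tokens, label in docs:
        has_t = term in tokens
        in_c = label == category
        if has_t and in_c:
            a += 1          # t and c co-occur
        elif has_t:
            b += 1          # t occurs without c
        elif in_c:
            c += 1          # c occurs without t
        else:
            d += 1          # neither t nor c occurs
    return a, b, c, d


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """chi^2(t, c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat the score as 0.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ({"wheat", "price"}, "grain"),
    ({"wheat", "export"}, "grain"),
    ({"bank", "rate"}, "money"),
    ({"rate", "price"}, "money"),
]
score = chi_square(*contingency(docs, "wheat", "grain"))
print(score)
```

When AD equals CB the numerator vanishes and the score is 0, which matches the independence case stated on the slide. Terms whose score falls below a chosen threshold would be the candidates for removal.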
14. Creative commons license
You are free:
• to copy, distribute, display, and perform the work
• to make derivative works
Under the following conditions:
• Attribution. You must give the original author credit.
• Non-Commercial. You may not use this work for commercial purposes.
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.