Information extraction (IE), the task of extracting structured information from unstructured or semi-structured data, is increasingly important to a wide array of enterprise applications, ranging from Business Intelligence to Data-as-a-Service.
In the first part of the talk, we give an overview of SystemT, a declarative IE system designed and developed to address the requirements driven by modern applications: scalability, expressivity, and transparency. SystemT is based on the basic principle underlying relational database technology: complete separation of specification from execution. SystemT uses AQL, a declarative language for expressing NLP algorithms, and an optimizer that generates high-performance algebraic execution plans for AQL rules. It makes IE orders of magnitude more scalable and easier to use, maintain, and customize. Today, SystemT ships with multiple products across 4 IBM Software Brands and is being taught in universities. Our ongoing research and development efforts focus on making SystemT more usable for both technical and business users, and on continuing to enhance its core functionalities based on natural language processing, machine learning, and database technology.
In the second part of the talk, we present POLYGLOT, a multilingual semantic role labeling (SRL) system capable of semantically parsing sentences in 9 different languages from 4 different language groups. The key feature of the system is that it treats the semantic labels of the English Proposition Bank as “universal semantic labels”: given a sentence in any of the supported languages, POLYGLOT predicts the appropriate English PropBank frame and role annotations. We illustrate how these universal semantic labels can be used within SystemT to create information extractors that immediately work across different languages. In addition, we illustrate how we automatically generate Proposition Banks for new languages in order to enable multilingual SRL, and we discuss some challenges of crosslingual semantics.
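The idea of writing an extractor once against universal semantic labels can be sketched in a few lines. This is a minimal illustration in Python, not AQL and not POLYGLOT output: the `Frame` class, the `extract_purchases` function, and the hand-written parses are all hypothetical stand-ins for what a semantic role labeler might produce, using the English PropBank frame `buy.01` with `A0` (buyer) and `A1` (thing bought).

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """One predicate and its role fillers, labeled with English PropBank names."""
    sentence: str
    language: str
    frame: str   # English PropBank frame, e.g. "buy.01"
    roles: dict  # role label -> text span in the original sentence


# Hand-written example parses (assumption: an SRL system would supply these).
PARSES = [
    Frame("Anna bought a bicycle.", "en", "buy.01",
          {"A0": "Anna", "A1": "a bicycle"}),
    Frame("Anna kaufte ein Fahrrad.", "de", "buy.01",
          {"A0": "Anna", "A1": "ein Fahrrad"}),
    Frame("Anna sold a bicycle.", "en", "sell.01",
          {"A0": "Anna", "A1": "a bicycle"}),
]


def extract_purchases(frames):
    """Return (language, buyer, item) for every buy.01 frame with both roles."""
    return [
        (f.language, f.roles["A0"], f.roles["A1"])
        for f in frames
        if f.frame == "buy.01" and "A0" in f.roles and "A1" in f.roles
    ]


purchases = extract_purchases(PARSES)
for language, buyer, item in purchases:
    print(language, buyer, item)
```

The rule mentions only the frame and role labels, never the surface words, so the same rule matches the English and the German sentence without change.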
SystemT: Declarative Information Extraction | Yunyao Li
Slides used for my talk "SystemT: Declarative Information Extraction" at the event "University of Oregon Big Opportunities with Big Data Meeting" on August 8, 2014 (http://bigdata.uoregon.edu).
Pattern based approach for Natural Language Interface to Database | IJERA Editor
Natural Language Interface to Database (NLIDB) is an interesting and widely applicable research field. As the name suggests, an NLIDB allows a naive user to pose queries to a database in natural language. This paper presents an NLIDB, the Pattern based Natural Language Interface to Database (PBNLIDB), in which patterns are defined for simple queries, aggregate functions, relational operators, short-circuit logical operators, and joins. The patterns are categorized as valid or invalid. Valid patterns are used directly to translate a natural language query into a Structured Query Language (SQL) query, whereas an invalid pattern assists the query authoring service in generating options for the user so that the query can be framed correctly. The system takes an English language query as input, recognizes the pattern in the query, selects one of the aforementioned SQL features based on the pattern, prepares an SQL statement, runs it against the database, and displays the result.
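The recognize-pattern, prepare-SQL, run-query loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PBNLIDB implementation: the three regular-expression patterns, the `to_sql` function, and the `employees` table are hypothetical, and a question that matches no pattern simply returns `None` where the paper's system would offer reformulation options.

```python
import re
import sqlite3

# Hypothetical patterns: each pairs a regular expression with an SQL template.
# Identifiers are restricted to \w+ so that only simple names reach the SQL text.
PATTERNS = [
    (re.compile(r"^show (\w+) where (\w+) is greater than (\d+)$"),
     "SELECT * FROM {0} WHERE {1} > {2}"),
    (re.compile(r"^how many (\w+)$"), "SELECT COUNT(*) FROM {0}"),
    (re.compile(r"^show all (\w+)$"), "SELECT * FROM {0}"),
]


def to_sql(question):
    """Translate a question into SQL, or return None if no pattern matches."""
    text = question.strip().rstrip("?.").lower()
    for pattern, template in PATTERNS:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups())
    return None


# Small in-memory database to run the generated statements against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", 70000), ("Bob", 40000)])

filter_sql = to_sql("Show employees where salary is greater than 50000")
filter_rows = conn.execute(filter_sql).fetchall()
count_sql = to_sql("How many employees?")
count_rows = conn.execute(count_sql).fetchall()
unmatched = to_sql("Who earns the most?")

print(filter_sql, filter_rows)
print(count_sql, count_rows)
print(unmatched)
```

Ordering the patterns from most to least specific keeps a longer question from being captured by a shorter pattern.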
Anonymization techniques are used to preserve the privacy of data owners, especially for personal and sensitive data. While in most cases data reside inside the database management system (DBMS), most of the proposed anonymization techniques operate on and anonymize isolated datasets stored outside the DBMS. Hence, most of the desired functionalities of the DBMS are lost, e.g., consistency, recoverability, and efficient querying. In this paper, we address the challenges involved in enforcing data privacy inside the DBMS. We implement the k-anonymity algorithm as a relational operator that interacts with other query operators to apply the privacy requirements while querying the data. We study anonymizing a single table, multiple tables, and complex queries that involve multiple predicates. We propose several algorithms to implement the anonymization operator that allow efficient non-blocking and pipelined execution of the query plan. We introduce the concept of a k-anonymity view as an abstraction to treat k-anonymity (possibly with multiple k preferences) as a relational view over the base table(s). For non-static datasets, we introduce materialized k-anonymity views to preserve privacy under incremental updates. A prototype system is realized based on PostgreSQL with extended SQL and new relational operators to support anonymity views. The prototype system demonstrates how anonymity views integrate with other privacy-preserving components, e.g., limited retention, limited disclosure, and privacy policy management. Our experiments, on both synthetic and real datasets, illustrate the performance gain from the anonymity views as well as the proposed query optimization techniques under various scenarios.
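For readers unfamiliar with the property itself, the following minimal sketch checks k-anonymity on a small table held in memory. It is not the relational operator or the anonymity view described above: the functions `is_k_anonymous` and `generalize_age`, the sample records, and the choice of `age` and `zip` as quasi-identifiers are all assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


def generalize_age(row, width=10):
    """Replace an exact age with a range of the given width, e.g. 23 -> '20-29'."""
    low = (row["age"] // width) * width
    return {**row, "age": f"{low}-{low + width - 1}"}


# Hypothetical records; 'diagnosis' is the sensitive attribute.
RECORDS = [
    {"age": 23, "zip": "97401", "diagnosis": "flu"},
    {"age": 27, "zip": "97401", "diagnosis": "asthma"},
    {"age": 34, "zip": "97403", "diagnosis": "flu"},
    {"age": 38, "zip": "97403", "diagnosis": "diabetes"},
]

raw_ok = is_k_anonymous(RECORDS, ["age", "zip"], k=2)
generalized = [generalize_age(r) for r in RECORDS]
generalized_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)

print(raw_ok, generalized_ok)
```

With exact ages every record is unique on (age, zip), so the raw table fails the check for k = 2; after generalizing ages into ten-year ranges each group contains two records and the check passes.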
A practical introduction to SADI semantic Web services and HYDRA query tool | Alexandre Riazanov
This sequel to the talk "Comprehensive Self-Service Life Science Data Federation with SADI semantic Web services and HYDRA" is more technical, although relatively self-contained, and tailored to people interested in practical aspects of SADI: I go step-by-step through an example of SADI service creation, deployment and use.
Exploiting Structure in Representation of Named Entities using Active Learning | Yunyao Li
Slides for our COLING'18 paper: http://aclweb.org/anthology/C18-1058
Fundamental to several knowledge-centric applications is the need to identify named entities from their textual mentions. However, entities lack a unique representation and their mentions can differ greatly. These variations arise in complex ways that cannot be captured using textual similarity metrics. However, entities have underlying structures, typically shared by entities of the same entity type, that can help reason over their name variations. Discovering, learning and manipulating these structures typically requires high manual effort in the form of large amounts of labeled training data and handwritten transformation programs. In this work, we propose an active-learning based framework that drastically reduces the labeled data required to learn the structures of entities. We show that programs for mapping entity mentions to their structures can be automatically generated using human-comprehensible labels. Our experiments show that our framework consistently outperforms both handwritten programs and supervised learning models. We also demonstrate the utility of our framework in relation extraction and entity resolution tasks.
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products and services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work over the past few years in addressing these challenges. We will also showcase how a universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g., compliance) and show ongoing efforts towards seamlessly scaling existing NLP capabilities across languages with minimal effort.
Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-RuleML
UML class diagrams (UCDs) are a widely adopted formalism for modeling the intensional structure of a software system. Although UCDs typically guide the implementation of a system, it is common in practice that developers need to recover the class diagram from an implemented system. This process is known as reverse engineering. A fundamental property of reverse engineered (or simply re-engineered) UCDs is consistency, showing that the system is realizable in practice. In this work, we investigate the consistency of re-engineered UCDs, and we show that it is PSPACE-complete. The upper bound is obtained by exploiting algorithmic techniques developed for conjunctive query answering under guarded Datalog+/-, a key member of the Datalog+/- family of KR languages, while the lower bound is obtained by simulating the behavior of a polynomial space Turing machine.
This talk will feature some of my recent research into the alternative uses for Solr facets and facet metadata. I will develop the idea that facets can be used to discover similarities between items and attributes in a search index, and show some interesting applications of this idea. A common takeaway is that using facets and facet metadata in non-conventional ways enables the semantic context of a query to be automatically tuned. This has important implications for user-centric and semantically focused relevance.
Demonstration of a significant bias of some French court of appeal judges in decisions about the rights of asylum.
www.supralegem.fr
Presentation meetup ML Paris Feb 17th, 2016
Human in the Loop AI for Building Knowledge Bases | Yunyao Li
The ability to build large-scale domain-specific knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the creation, representation and consumption of such domain-specific knowledge bases. This approach relies on several well-known building blocks: natural language processing, entity resolution, data transformation and fusion. I will present several human-in-the-loop tools that target domain experts (rather than programmers) to extract the domain knowledge from the human expert and map it into the "right" models or algorithms. I will also share successful use cases in several domains, including Compliance, Finance, and Healthcare: by using these tools we can match the level of accuracy achieved by manual efforts, but at a significantly lower cost and with much higher scale and automation.
Auckland Grammar School: New Zealand's leading secondary school for boys | Baydepdulich
Auckland Grammar School (AGS) is one of the most prestigious public secondary schools for boys in New Zealand, teaching Years 9 to 13 and A-levels. At AGS, students develop in every respect: physically, mentally, academically, and artistically.
Impressive figures for AGS:
- Consistently among the top 5 schools in New Zealand for students winning outstanding scholarships;
- 100% of students pass the A-Level certificate;
- More than 90% of students are admitted directly to university.
I. Why students choose to study at AGS:
- Top-quality teaching: recognized by NCEA and by educational organizations in New Zealand and worldwide;
- Located near the centre of Auckland: students can walk to the University of Auckland and AUT in about 30 minutes, and take a bus or train to UNITEC and Massey University;
- Affordable costs: living costs and tuition in New Zealand are relatively low compared with the UK, the US, and Australia, owing to the lower exchange rate of the NZ dollar;
- Modern facilities, including a large assembly hall, an indoor sports court, a heated swimming pool, a multi-purpose gym, music rooms, a library, an information technology building, and an outdoor education centre in Ohakune;
- Small class sizes: students receive more guidance and support from teachers and can develop their abilities to the fullest;
- Accommodation: the school places students with pre-selected local host families, managed by the Director of International Students;
- Very good student support services, including student counselling and regular meetings and exchanges among international students and with host families. The international student staff are always ready to help and support students;
- A masculine ethos: many students are proud that here they are truly trained to become the gentlemen of the future;
- The school regularly organizes extracurricular and sports activities to help students get used to the new environment and develop fully;
- After finishing secondary school, students go straight on to college or university. After graduating from college or university, they may stay to work for 2-3 years and apply for skilled-migrant residency when eligible.
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL) | Laura Chiticariu
Invited talk at MIT CSAIL, March 8 2016
Information extraction (IE), the task of extracting structured information from unstructured or semi-structured data, is increasingly important to a wide array of enterprise applications, ranging from Business Intelligence to Data-as-a-Service. Such applications drive the following main requirements for IE systems: accuracy, scalability, expressivity, transparency, and customizability.
SystemT, a declarative IE system, has been designed and developed to address these requirements. It is based on the basic principle underlying relational database technology: complete separation of specification from execution. SystemT uses AQL, a declarative language for expressing NLP algorithms, and an optimizer that generates high-performance algebraic execution plans for AQL rules. It makes IE orders of magnitude more scalable and easier to use, maintain, and customize.
SystemT ships today with multiple products across 4 IBM Software Brands. Furthermore, SystemT is used in multiple ongoing research projects and is being taught in universities. Our ongoing research and development efforts focus on making SystemT more usable for both technical and business users, and on continuing to enhance its core functionalities based on natural language processing, machine learning, and database technology.
The student João Vitor da Silva concluded the Social Media Workshop (Oficina Mídias Sociais) of the EducaJovem Project by producing a piece of work presenting information and his opinions on the following topics:
Fundação Educandário "Cel. Quito Junqueira";
The EducaJovem Project;
The Social Media Workshop;
Communication;
Blogs;
Posts from his blog carried over into the work.
The EducaJovem Project is an activity of the Fundação Educandário "Cel. Quito Junqueira" in Ribeirão Preto, in the interior of São Paulo state.
More information: www.educandariorp.com.br
The whistleblowing channel is a guarantee of ethical compliance in the company. What its objectives are. The characteristics a compliance channel should have. How to set it up. Advantages of an external whistleblowing channel. Prevention of corporate criminal liability.
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil | Databricks
Equipment maintenance logs for the global fleet are traditionally maintained using legacy infrastructure and data models, which limit the ability to extract insights at scale. However, to impact the bottom line, it is critical to ingest and enrich global fleet data to generate data-driven guidance for operations. The impact of such insights is projected to be millions of dollars per annum.
To this end, we leverage Databricks to perform machine learning at scale, including ingesting structured and unstructured data from legacy systems, and then sifting through millions of nonlinearly growing records to extract insights using NLP. The insights enable outlier identification, capacity planning, prioritization of cost reduction opportunities, and the discovery process for cross-functional teams.
Get full visibility and find hidden security issues | Elasticsearch
When even basic threats can be multi-staged and complex, limited visibility into your security data just doesn’t cut it. Whether you’re performing investigations or hunting for threats, you need all security-relevant context. Learn key practices in data collection and normalisation and see how you can use Elastic Security to quickly and accurately triage, verify, and scope issues.
Use Case Patterns for LLM Applications | M Waleed Kadous
What are the "use case patterns" for deploying LLMs into production? Understanding these will allow you to spot "LLM-shaped" problems in your own industry.
Resilience Engineering: A field of study, a community, and some perspective s... | John Allspaw
These are slides from my talk on March 28, 2018 at the LA SCALE tech Meetup, graciously hosted at TicketMaster's office. (https://www.meetup.com/scalela/events/248904126/)
Performance Analysis of Leading Application Lifecycle Management Systems for... | Daniel van den Hoven
The performance of three leading application lifecycle management (ALM) systems (Rally by Rally Software, VersionOne by VersionOne, and JIRA+GreenHopper by Atlassian) was assessed to draw comparative performance observations when customer data exceeds a 500,000-artifact threshold. The focus of this performance testing was how each system handles a simulated “large” customer (i.e., a customer with half a million artifacts). A near-identical representative data set of 512,000 objects was constructed and populated in each system in order to simulate identical use cases as closely as possible. Timed browser testing was performed to gauge the performance of common usage scenarios, and comparisons were then made. Nine tests were performed based on measurable, single-operation events
Introduction talk at the University of Strathclyde (Scotland) Algorithms Workshop, providing a quick overview of the fundamental and practical reasons why algorithms are/are not technical black boxes. (This talk does not address issues of trade secret or other business reasons for lack of transparency). The presentation was given to an audience of academics and students at the law department.
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ... | Miguel Gallardo
Frameworks are very helpful for solving common problems when developing an application. But what happens when we have to move to another framework? In this talk I will show how my company tries to stay independent of any framework by decoupling our business logic from Symfony.
Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.
The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think FaceBook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
This talk introduces the core concepts of functional programming, why it matters now and how these principles are adopted for programming.
Beyond the usage of the functional principles for programming in the small, their applicability to the design of components and further on to the architecture of software systems is also explored along with relevant examples.
The code samples used in the talk are in Haskell, Java 8, Scala, Elm, and JavaScript, but a deep understanding of any of these languages is not a prerequisite.
Building and deploying LLM applications with Apache Airflow | Kaxil Naik
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
The presentation was given at a seminar of the InfinIT interest group on high-level languages for embedded systems on 11 November 2009.
Read more about the interest group at http://www.infinit.dk/dk/interessegrupper/hoejniveau_sprog_til_indlejrede_systemer/
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/28ePA2L.
Sylvan Clebsch talks about using Pony in a fintech environment to build high-performance tools. Pony is a new actor-model language that's statically typed and ahead-of-time compiled (using LLVM), with a fully concurrent garbage collector and a data-race free type system. Filmed at qconlondon.com.
Sylvan Clebsch is the designer of the Pony programming language. He has worked in industry for 24 years, on fintech, milsims, video games, peer networking, VOIP, identity management, crypto, and embedded OSes. He is currently working on distributed Pony.
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz... | HostedbyConfluent
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenzhong XU | Current 2022
If you are a data scientist or a platform engineer, you probably can relate to the pains of working with the current explosive growth of Data/ML technologies and toolings. With many overlapping options and steep learning curves for each, it’s increasingly challenging for data science teams. Many platform teams started thinking about building an abstracted ML platform layer to support generalized ML use cases. But there are many complexities involved, especially as the underlying real-time data is shifting into the mainstream.
In this talk, we’ll discuss why ML platforms can benefit from a simple and "invisible" abstraction. We’ll offer some evidence on why you should consider leveraging streaming technologies even if your use cases are not real-time yet. We’ll share learnings (combining both ML and Infra perspectives) about some of the hard complexities involved in building such simple abstractions, the design principles behind them, and some counterintuitive decisions you may come across along the way.
By the end of the talk, I hope data scientists can walk away with some tips on how to evaluate ML platforms, and platform engineers with a few architectural and design tricks.
Similar to Declarative Multilingual Information Extraction with SystemT (20)
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... | UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
3. Case Study 1: Social Data Analysis
Customer 360º
[Figure: Social Media, together with Product catalog, Customer Master Data, …, flows through Text Analytics into a 360º Profile (Products, Interests, Relationships, Personal Attributes, Life Events), which feeds Statistical Analysis and Report Gen. Chart values: Bank 1 15%, Bank 2 15%, Bank 3 5%, Bank 4 21%, Bank 5 20%, Bank 6 23%.]
• Scale: 450M+ tweets a day, 100M+ consumers, …
• Breadth: Buzz, intent, sentiment, life events, personal attributes, …
• Complexity: Sarcasm, wishful thinking, …
4. Case Study 2: Machine Analysis
[Figure: Raw Logs (Web Server Logs, Custom Application Logs, Database Logs) → Parse → Parsed Log Records → Extract → Linkage Information → Integrate → End-to-End Application Sessions → Analyze → Graphical Models (Anomaly detection) and Correlation Analysis.]
Correlation Analysis table (columns: Cause, Effect, Delay, Correlation); recovered rows: Delay 5 min, Correlation 0.85; Delay 10 min, Correlation 0.62.
• Every customer has unique components with unique log record formats
• Need to quickly customize all stages of analysis for these custom log data sources
6. Enterprise Requirements
• Expressivity
– Need to express complex algorithms for a variety of tasks and data sources
• Scalability
– Large number of documents and large-size documents
• Transparency
– Every customer’s data and problems are unique in some way
– Need to easily comprehend, debug and enhance extractors
7. Expressivity Example: Different Kinds of Parses
Natural Language: We are raising our tablet forecast.
[Figure: dependency tree of the sentence, with nodes S (are), NP (We), S (raising), NP (forecast), NP (tablet), DET (our) and edges labeled subj, pred, obj.]
Machine Log: Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected
Parsed fields:
Time: Oct 1 04:12:24
Host: 9.1.1.3
Process: 41865
Category: %PLATFORM_ENV-1-DUAL_PWR
Message: Faulty internal power supply B detected
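The machine-log parse on this slide can be sketched with one regular expression. This is only an illustration in Python, not SystemT or AQL code: the pattern `LOG_PATTERN` and the function `parse_log_record` are hypothetical names, and the pattern is written for the single log layout shown on the slide.

```python
import re

# Hypothetical pattern covering only the log layout shown on the slide:
# "<time> <host> <process>: <category>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<time>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<process>\d+):\s+"
    r"(?P<category>%[A-Z0-9_]+-\d+-[A-Z0-9_]+):\s+"
    r"(?P<message>.*)$"
)


def parse_log_record(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


EXAMPLE = (
    "Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: "
    "Faulty internal power supply B detected"
)
record = parse_log_record(EXAMPLE)
for field in ("time", "host", "process", "category", "message"):
    print(f"{field}: {record[field]}")
```

A different log source would need a different pattern, which is the customization burden the case study on machine analysis points at.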
8. Expressivity Example: Fact Extraction (Tables)
Singapore 2012 Annual Report (136 pages PDF)
• Identify note breaking down Operating expenses line item, and extract opex components
• Identify line item for Operating expenses from Income statement (financial table in pdf document)
9. Scalability Example
• Social Media: Twitter alone has 500M+ messages / day; 1TB+ per day
• Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs
• Machine Data: One application server under moderate load at medium logging level produces 1GB of logs per day
10. Transparency Example: Need to incorporate in-situ data
• Intel's 2013 capex is elevated at 23% of sales, above average of 16%
• FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
• I'm still hearing from clients that Merrill's website is better.
• I need to go back to Walmart, Toys R Us has the same toy $10 cheaper!
Callouts on the examples: Entity of interest; Good or bad?; Customer or competitor?
12. SystemT Architecture
• AQL Language: Specify extractor semantics declaratively
• Optimizer: Choose efficient execution plan that implements semantics
• Operator Runtime
Example AQL Extractor
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N, Sentence S
where
Follows(P.name, N.number, 0, 30)
and Contains(S.sentence, P.name)
and Contains(S.sentence, N.number)
and ContainsRegex(/\b(phone|at)\b/,
SpanBetween(P.name, N.number));
[Figure: the rule matches <Person> followed by <PhoneNum> within 0-30 chars, within a single sentence, where the text in between contains “phone” or “at”.]
Fundamental Results & Theorems
• Expressivity: The class of extraction tasks expressible in AQL is a strict superset of that expressible through cascaded regular automata.
• Performance: For any acyclic token-based finite state transducer T, there exists an operator graph G such that evaluating T and G has the same computational complexity.
SystemT: IBM Research project started in 2006
More than 35 papers across ACL, EMNLP, VLDB, PODS and SIGMOD
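To make the semantics of the PersonPhone rule concrete, here is a rough Python sketch of the same conditions. It is not how SystemT evaluates AQL: `person_phone` and `span_of` are hypothetical helpers, the input is assumed to be one sentence already (so the two `Contains` predicates hold trivially), and `Follows` is read as "the phone span begins 0 to 30 characters after the person span ends".

```python
import re
from itertools import product

KEYWORD = re.compile(r"\b(phone|at)\b")


def person_phone(sentence, persons, phones, max_gap=30):
    """Pair each person span with each phone span that follows it closely.

    persons and phones are lists of (begin, end) character offsets into
    `sentence`, assumed to come from upstream extractors.
    """
    pairs = []
    for (p_begin, p_end), (n_begin, n_end) in product(persons, phones):
        gap = n_begin - p_end                 # Follows(P.name, N.number, 0, 30)
        between = sentence[p_end:n_begin]     # SpanBetween(P.name, N.number)
        if 0 <= gap <= max_gap and KEYWORD.search(between):
            pairs.append((sentence[p_begin:p_end], sentence[n_begin:n_end]))
    return pairs


text = "Anna at James St. office (555-5555)."


def span_of(substring):
    begin = text.index(substring)
    return (begin, begin + len(substring))


persons = [span_of("Anna"), span_of("James")]
phones = [span_of("555-5555")]
pairs = person_phone(text, persons, phones)
print(pairs)  # [('Anna', '555-5555')]
```

In this example "James" is within 30 characters of the number, but the text between them contains neither "phone" nor "at", so the regex predicate drops that pair.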
13. Expressivity: Rich Set of Operators
AQL: Language to express NLP Algorithms
Operators shown on the slide:
• Core Operators
• Tokenization, Parts of Speech, Dictionaries
• Regular Expressions, HTML/XML Detagging
• Span Operations, Relational Operations, Aggregation Operations
• Semantic Role Labels
• Deep Syntactic Parsing
• ML Training & Scoring
14. Expressivity: AQL vs. Custom Java Code
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("\n", " "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %d\n",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* part of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("\n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
Expressivity: AQL vs. Custom Java Code
create dictionary AbbrevDict from file
'abbreviation.dict';
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+\s)|(\n\s*\n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1), R.match)));
Equivalent AQL Implementation
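For readers who do not know AQL, the logic of the SentenceBoundary view can be approximated in a few lines of Python. This is a sketch under assumptions, not a port of either listing: `sentence_boundaries` is a hypothetical function, the abbreviation dictionary is passed in as a plain set whose entries include the trailing period, and "one token of left context" is simplified to the run of word characters directly before the match.

```python
import re

# Candidate boundaries: sentence-ending punctuation followed by whitespace,
# or a blank line (mirrors the regex in the AQL view).
BOUNDARY = re.compile(r"[.?!]+\s|\n\s*\n")
LEFT_TOKEN = re.compile(r"\w+$")


def sentence_boundaries(text, abbreviations):
    """Return (begin, end) offsets of boundaries not explained by an abbreviation."""
    boundaries = []
    for match in BOUNDARY.finditer(text):
        # Simplified stand-in for LeftContextTok(R.match, 1).
        left = LEFT_TOKEN.search(text[:match.start()])
        # Simplified stand-in for CombineSpans(left context, match).
        combined = (left.group() if left else "") + match.group()
        if combined.strip() in abbreviations:
            continue  # e.g. "Mr." is an abbreviation, not a sentence end
        boundaries.append((match.start(), match.end()))
    return boundaries


sample = "Mr. Smith called. He left."
found = sentence_boundaries(sample, {"Mr.", "Dr."})
print(found)  # [(16, 18)]
```

The final period of the sample is not reported because, as in the AQL regex, a boundary requires trailing whitespace.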
15. Transparency in SystemT
• Clear and simple provenance [Liu et al., VLDB’10]
• Explains output data in terms of input data, intermediate data, and the transformation
• For predicate-based rule languages (e.g., SQL), can be computed automatically
Ease of comprehension
Ease of debugging
Ease of enhancement
Example input: Anna at James St. office (555-5555) .
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N
where Follows(P.name, N.number, 0, 30);
Person (name): Anna (t1), James (t2)
PhoneNumber (number): 555-5555 (t3)
PersonPhone (person, phone), with provenance:
Anna, 555-5555: t1 ∧ t3
James, 555-5555: t2 ∧ t3
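The provenance in this example can be reproduced with a small Python sketch that carries tuple identifiers through the join. The function name `person_phone_with_provenance`, the dictionary layout, and the identifiers `t1` to `t3` as dictionary keys are illustrative assumptions; SystemT computes provenance from the AQL rules themselves.

```python
def follows(first, second, min_gap, max_gap):
    """True if `second` begins min_gap..max_gap characters after `first` ends."""
    return min_gap <= second[0] - first[1] <= max_gap


def person_phone_with_provenance(text, persons, phones):
    """Join persons and phones; record which input tuples produced each output.

    persons and phones map a tuple id (e.g. "t1") to a (begin, end) span.
    """
    rows = []
    for person_id, person_span in persons.items():
        for phone_id, phone_span in phones.items():
            if follows(person_span, phone_span, 0, 30):
                rows.append({
                    "person": text[person_span[0]:person_span[1]],
                    "phone": text[phone_span[0]:phone_span[1]],
                    # Conjunction of contributing input tuples, e.g. t1 AND t3.
                    "provenance": (person_id, phone_id),
                })
    return rows


doc = "Anna at James St. office (555-5555) ."


def span(substring):
    begin = doc.index(substring)
    return (begin, begin + len(substring))


rows = person_phone_with_provenance(
    doc,
    persons={"t1": span("Anna"), "t2": span("James")},
    phones={"t3": span("555-5555")},
)
for row in rows:
    print(row["person"], row["phone"], " AND ".join(row["provenance"]))
```

Seeing that the wrong result (James, 555-5555) derives from t2 tells the developer which upstream view to refine, here the Person extractor that matched a street name.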
16. Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts:
– Embeddable Models in AQL
– Learning using AQL as target language
17. Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts
– Embeddable Models in AQL
• Model Training and Scoring integration with SystemML
• Semantic Role Labeling [Akbik et al., ACL’15,’16, EMNLP’16, COLING’16]
• Text Normalization [Zhang et al., ACL’13; Li & Baldwin, NAACL’15]
– Learning using AQL as target language
18. Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts
– Embeddable Models in AQL
– Learning using AQL as target language
• Low-level features: regular expressions [Li et al., EMNLP’08], dictionaries [Li et al., CIKM’11; Roy et al., SIGMOD’13]
• Rule induction [Nagesh et al., EMNLP’12]
• Rule refinement [Liu et al., VLDB’10]
• Ongoing research
21. SystemT in Universities
[Figure: timeline with year markers 2013, 2014, 2015, 2016, 2017 and the SystemT MOOC. Entries in slide order:]
• UC Santa Cruz (full Graduate class)
• U. Washington (Grad)
• U. Oregon (Undergrad)
• U. Aalborg, Denmark (Grad)
• UIUC (Grad)
• U. Maryland Baltimore County (Undergrad)
• UC Irvine (Grad)
• NYU Abu-Dhabi (Undergrad)
• U. Washington (Grad)
• U. Oregon (Undergrad)
• U. Maryland Baltimore County (Undergrad)
• UC Santa Cruz, 3 lectures in one Grad class
• SystemT MOOC
22. SystemT in Universities: Testimonials
Alon Halevy, November 2015
"The tutorial of System T was an important hands-on component of a Ph.D course on Structured Data on the Web at the University of Aalborg in Denmark taught by Alon Halevy. The tutorial did a great job giving the students the feeling for the challenges involved in extracting structured data from text."
Shimei Pan, University of Maryland, Baltimore County, May 2016
"The Information Extraction with SystemT class, which was offered for the first time at UMBC, has been a wonderful experience for me and my students. I really enjoyed teaching this course. Students were also very enthusiastic. Based on the feedback from my students, they have learned a lot. Some of my students even want me to offer this class every semester."
23. SystemT in Universities: Summary
• Professors like teaching it
• Students do interesting projects and have fun!
– Identify character relationships in Game of Thrones
– Determine the cause and treatment for PTSD patients
– Extract job experiences and skills from resumes
– …
24. SystemT MOOC
• Bring SystemT to 100,000+ students
• Target Audience: NLP Engineer, Data Scientist, e.g., University students (grad, undergrad), industry professionals
• Learning Path with two classes
TA0105: Text Analytics: Getting Results with SystemT
– Basics of IE with SystemT, learning to write extractors using the web tool, learn AQL
– 5 lectures + 1 hands-on lab + final exam
TA0106EN: Advanced Text Analytics: Getting Results with SystemT
– Advanced material on declarative approach and SystemT optimization
– 2 lectures + final exam
26. The World’s Most Spoken Languages
• 7,102 known languages
• The 23 most spoken languages: > 4.1 billion people
https://walizahid.com/2015/06/the-worlds-most-spoken-languages/
• Our goal: Enable Text Analytics over Multilingual Text
27. Proposed Approach: Crosslingual Text Analytics
Problem: High costs to handle multilingual data
Current Status:
- One language at a time
- Query language and data must be in the same language
[Figure: English, German, French and Chinese text data each pass through their own NLP stack (English NLP, German NLP, French NLP, Chinese NLP) into their own text analytics (English, German, French, Chinese Text Analytics).]
• Separate parsers / features for each language
• Separate text analytics must be developed
Crosslingual Text Analytics: Parse and develop against unified abstraction
[Figure: English, Chinese, French and German text data all pass through Unified NLP into Crosslingual Text Analytics.]
Semantic parser
• arbitrary languages
• unified feature set
Universal extractors
• Develop once
• Run for all languages
28. Semantic Role Labeling
SRL focuses on semantics, not syntax
Useful for many applications (Shen and Lapata, 2007; Maqsud et al., 2014)
Examples:
Dirk broke the window with a hammer. (A0, Break.01, A1, A2)
The window was broken by Dirk. (A1, Break.01, A0)
The window broke. (A1, Break.01)
Break.01
A0 – Breaker
A1 – Thing broken
A2 – Instrument
A3 – Pieces
Break.15
A0 – Journalist, exposer
A1 – Story, thing exposed
Frame semantics crosslingually valid?
• Problem: What type of labels are valid across languages?
• Lexical, morphological and syntactic labels differ greatly
• But shallow semantic labels remain stable
29. SRL Resources
• English SRL
– FrameNet
– PropBank
Frame set
Buy.01
A0 – Buyer
A1 – Thing bought
A2 – Seller
A3 – Price paid
A4 – Benefactive
Pay.01
A0 – Payer
A1 – Money
A2 – Being paid
A3 – Commodity
Corpus of annotated text data
The station bought Cosby reruns for record prices (A0, Buy.01, A1, A3)
spurred dollar buying by Japanese institutions. (Spur.01, A1, Buy.01, A0)
The company paid 2 million US$ to buy . (A0, Pay.01, A1, Buy.01)
30. SRL for other languages?
• English SRL
– FrameNet
– PropBank
• Other languages
– Chinese Proposition Bank
– Hindi Proposition Bank
– German FrameNet
– French? Spanish? Russian? Arabic? …
1. Limited coverage
2. Language-specific formalisms
订购
A0: buyer
A1: commodity
A2: seller
order.02
A0: orderer
A1: thing ordered
A2: benefactive, ordered-for
A3: source
We want different languages to share the same semantic labels
31. Shared Frames Across Languages
Multilingual input text
WhatsApp was bought by Facebook (A1, Buy.01, A0)
Facebook hat WhatsApp gekauft (A0, A1, Buy.01)
Facebook a acheté WhatsApp (A0, Buy.01, A1)
Crosslingual representation
buy.01 (A0: Facebook, A1: WhatsApp)
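The benefit of shared frames is that an extractor written once against frame and role labels applies to every language the parser supports. The Python sketch below illustrates the idea only: the list `srl_output` is hand-written stand-in data in a made-up format, not actual POLYGLOT output, and `extract_acquisitions` is a hypothetical function.

```python
# Hand-written stand-in for what a multilingual SRL system might return.
# The dictionary layout is an assumption made for this illustration.
srl_output = [
    {"lang": "en", "sentence": "WhatsApp was bought by Facebook",
     "frame": "buy.01", "roles": {"A0": "Facebook", "A1": "WhatsApp"}},
    {"lang": "de", "sentence": "Facebook hat WhatsApp gekauft",
     "frame": "buy.01", "roles": {"A0": "Facebook", "A1": "WhatsApp"}},
    {"lang": "fr", "sentence": "Facebook a acheté WhatsApp",
     "frame": "buy.01", "roles": {"A0": "Facebook", "A1": "WhatsApp"}},
]


def extract_acquisitions(parses):
    """Language-independent rule: a buy.01 frame with a buyer (A0) and a thing bought (A1)."""
    results = []
    for parse in parses:
        roles = parse["roles"]
        if parse["frame"] == "buy.01" and "A0" in roles and "A1" in roles:
            results.append((parse["lang"], roles["A0"], roles["A1"]))
    return results


acquisitions = extract_acquisitions(srl_output)
for lang, buyer, bought in acquisitions:
    print(f"[{lang}] {buyer} acquired {bought}")
```

The rule never inspects word order or surface forms, which is why the passive English sentence and the verb-final German sentence yield the same result.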
32. Research
1. Generation of Proposition Banks for Arbitrary Languages
2. Development of a SotA Semantic Role Labeler
3. Application to Use Cases
33. Our Goal
• Generate SRL resources for many other languages
• Shared frame set
• Minimal effort
Frame set
Buy.01
A0 – Buyer
A1 – Thing bought
A2 – Seller
A3 – Price paid
A4 – Benefactive
Pay.01
A0 – Payer
A1 – Money
A2 – Being paid
A3 – Commodity
Corpus of annotated text data
Il faut qu’il y ait des responsables
Labels: Need.01, A0
Je suis responsable pour le chaos
Labels: Be.01, A1, A2, AM-PRD
Les services postaux ont acheté des
Labels: Be.01, A2, A1; Buy.01, A0
34. Generating Universal Proposition Banks
Current practice:
• Annotator training (months)
• Annotation (years)
• Must be repeated for each language!
Idea: Annotation projection in parallel corpora
Example: TV subtitles
English subtitles: I would buy that for a dollar! (labels: BUYER, ITEM, PRICE)
German subtitles: Das würde ich für einen Dollar kaufen (projected labels: ITEM, BUYER, PRICE)
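Annotation projection can be sketched as copying each source-token label onto the target token it is aligned with. In the Python sketch below the word alignment is written by hand for the subtitle pair on the slide; in practice it would come from an automatic word aligner. `project_labels` and the token-index format are assumptions made for illustration, not the authors' implementation.

```python
def project_labels(source_labels, alignment):
    """Copy labels from source token indices to aligned target token indices.

    source_labels: {source_index: label}
    alignment: {source_index: target_index} (hand-written here)
    Unaligned labeled tokens are simply dropped in this sketch.
    """
    return {
        alignment[src]: label
        for src, label in source_labels.items()
        if src in alignment
    }


english = ["I", "would", "buy", "that", "for", "a", "dollar", "!"]
german = ["Das", "würde", "ich", "für", "einen", "Dollar", "kaufen"]

# Labels on the English side, keyed by token index.
english_labels = {0: "BUYER", 3: "ITEM", 6: "PRICE"}

# Hypothetical word alignment, English index -> German index.
alignment = {0: 2, 1: 1, 2: 6, 3: 0, 4: 3, 5: 4, 6: 5}

german_labels = project_labels(english_labels, alignment)
for index in sorted(german_labels):
    print(german[index], german_labels[index])
```

Running this labels "Das" as ITEM, "ich" as BUYER and "Dollar" as PRICE, matching the projected order on the slide.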
35. However: Projections Not Always Possible
English sentence: We need to hold people responsible
Target sentence: Il faut qu’il y ait des responsables (There need to be those responsible)
[Figure: the English sentence carries Need.01 and Hold.01 with roles A0, A1, A3; projecting them places Hold.01 and A1 on the target sentence. Incorrect projection!]
Main error sources:
• Translation shift
• Source-language SRL errors
36. Improving Projection Quality
• Two-step process of filtering and bootstrapping (ACL ’15)
– Filters to detect translation shift, block projections (more precision at cost of recall)
– Bootstrap learning to increase recall
– Generate 7 universal proposition banks from 3 language groups
37. Train Multilingual SRL
• Online demo for SRL in 9 languages (ACL ’16)
38. Low-resource Languages & Crowdsourcing
• Apply approach to low-resource languages (EMNLP ’16)
– Bengali, Malayalam, Tamil
– Fewer sources of parallel data
– Almost no NLP: No syntactic parsing, lemmatization etc.
• Crowdsourcing for data curation
40. Multilingual Aliasing: Example
Predicate: drehen
Original
• drehen.01 → TURN.01 (rotation). A0: turner; A1: thing turning; AM: direction
• drehen.02 → TURN.02 (transformation). A0: causer of transformation; A1: thing changing; A2: end state; A3: start state. Incorrect frame, filtered
• drehen.03 → FILM.01 (record on film). A0: recorder; A1: thing filmed
• drehen.04 → SHOOT.03 (record on film). A0: videographer; A1: subject filmed; A2: medium
• drehen.03 and drehen.04: synonyms, merged
Filtered
• drehen.01 → TURN.01
• drehen.03 → FILM.01
• drehen.04 → SHOOT.03
Merged
• drehen.01 → TURN.01
• drehen.03 → FILM.01, SHOOT.03
41. Multilingual Aliasing
• Expert curation of frame mappings (COLING ’16)
– 60 person hours per language
– Chinese, French and German
– More salient senses, improved quality
42. Automatic Generation of Proposition Banks
• Project English PropBank frame and role annotation onto target language sentences
• A variety of strategies to increase quality:
– Filtering of translation shift
– Crowdsourced data curation
– Expert frame alignments
• Result: Generated universal proposition banks for 8 languages
• Currently preparing open source release
43. Research
1. Generation of Proposition Banks for Arbitrary Languages
2. Development of a SotA Semantic Role Labeler
3. Application to Use Cases
44. • ProblemProblemProblemProblem: Current SRL error-prone
• Strong local bias, many lowStrong local bias, many lowStrong local bias, many lowStrong local bias, many low----frequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalize
Instance-based Semantic Role Labeling
• Problem: Current SRL error-prone
• Strong local bias
• Many low-frequency exceptions, difficult to generalize
Instance-based Semantic Role Labeling
• K-SRL: Instance-based learning
– K-nearest neighbors
• Composite features
• Significantly outperforms all other current systems
[Chart: results, panels "In-domain" and "Out-of-domain"]
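A generic k-nearest-neighbors role classifier over composite features, sketched in Python. This is not the K-SRL algorithm itself: the feature names, the choice of "single features plus all pairs" as composite features, the overlap-count similarity, and the toy training set are all assumptions made for illustration.

```python
from collections import Counter


def composite(features):
    """Build composite features: each atomic feature plus every pair of them."""
    atoms = sorted(features.items())
    combos = {(a,) for a in atoms}
    combos |= {(a, b) for i, a in enumerate(atoms) for b in atoms[i + 1:]}
    return combos


def knn_role(train, query, k=3):
    """Predict a role by majority vote among the k most similar instances.

    Similarity is the number of composite features shared with the query.
    """
    q = composite(query)
    ranked = sorted(train, key=lambda ex: -len(q & composite(ex[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training instances: (argument features, role label)
train = [
    ({"dep": "nsubj", "voice": "active", "pred": "film.01"}, "A0"),
    ({"dep": "nsubj", "voice": "active", "pred": "turn.01"}, "A0"),
    ({"dep": "dobj", "voice": "active", "pred": "film.01"}, "A1"),
    ({"dep": "nsubj", "voice": "passive", "pred": "film.01"}, "A1"),
    ({"dep": "dobj", "voice": "active", "pred": "turn.01"}, "A1"),
]

# Argument of a predicate that never occurs in the training data
query = {"dep": "nsubj", "voice": "active", "pred": "shoot.03"}
print(knn_role(train, query, k=3))
```

Because prediction relies on the most similar stored instances rather than on one global model, a rare pattern can still be labeled correctly as long as a few close neighbors exist.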
46. • 1. Generation of Proposition Banks for Arbitrary Languages
• 2. Development of a SotA Semantic Role Labeler
• 3. Application to Use Cases
Research
48. Thank you!
• For more information…
– Visit our website:
http://ibm.co/1Cdm1Mj
– Visit BigInsights Knowledge Center:
http://ibm.co/1DIouEv
– Learning at your own pace: SystemT MOOC
– Contact us
• akbika@us.ibm.com
• chiti@us.ibm.com
SystemT Team:
Alan Akbik, Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy,
Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu
49. SystemT: Declarative Information Extraction
[Architecture diagram; components: Development Environment (declarative AQL language, discovery tools for AQL development), AQL Extractor, Cost-based optimization, SystemT Runtime (Input Documents in, Extracted Objects out)]
AQL Extractor:
create view ProductMention as
select ...
from ...
where ...
create view IntentToBuy as
select ...
from ...
where ...
...
Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive,
efficient, transparent and usable. Existing solutions are either rule-based solutions built on cascading grammars, which have
expressivity and efficiency issues, or black-box solutions based on machine learning, which lack transparency.
Our Solution: A declarative information extraction system with cost-based optimization, a high-performance runtime and
novel development tooling, based on a solid theoretical foundation [PODS’13, PODS’14], shipping with over 10 IBM products.
• AQL: a declarative language that can be used to build extractors that outperform the state of the art [ACL’10]
• Multilingual SRL-enabled [ACL’15, ACL’16, EMNLP’16, COLING’16]
• A suite of novel development tools leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13, SIGMOD’13, ACL’13, VLDB’15, NAACL’15]
• Cost-based optimization for text-centric operations [ICDE’08, ICDE’11, FPL’13, FPL’14]
• Highly embeddable runtime with high throughput and a small memory footprint [SIGMOD Record’09, SIGMOD’09]
For details and the online class, visit:
https://ibm.biz/BdF4GQ