Invited talk at MIT CSAIL, March 8 2016
Information extraction (IE), the task of extracting structured information from unstructured or semi-structured data, is increasingly important to a wide array of enterprise applications, ranging from Business Intelligence to Data-as-a-Service. Such applications drive the following main requirements for IE systems: accuracy, scalability, expressivity, transparency, and customizability.
SystemT, a declarative IE system, has been designed and developed to address these requirements. It is based on the basic principle underlying relational database technology: complete separation of specification from execution. SystemT uses a declarative language for expressing NLP algorithms called AQL, and an optimizer that generates high-performance algebraic execution plans for AQL rules. It makes IE orders of magnitude more scalable and easy to use, maintain and customize.
SystemT ships today with multiple products across 4 IBM Software Brands. Furthermore, SystemT is used in multiple ongoing research projects and is taught in universities. Our ongoing research and development efforts focus on making SystemT more usable for both technical and business users, and on continuing to enhance its core functionality based on natural language processing, machine learning, and database technology.
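The separation of specification from execution described above can be illustrated with a minimal sketch. This is not AQL and not SystemT's optimizer: it is plain Python with regular expressions, and the names `Rule`, `Span`, `compile_plan`, and `execute` are hypothetical, chosen only for illustration. The rules are declared as data, and a separate step decides how to run them.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Declarative specification: what to extract, not how to run it."""
    name: str
    pattern: str


@dataclass(frozen=True)
class Span:
    """One extracted match, with its character offsets in the document."""
    rule: str
    begin: int
    end: int
    text: str


def compile_plan(rules):
    """Stand-in for an optimizer: compile every rule once, before execution."""
    return [(rule.name, re.compile(rule.pattern)) for rule in rules]


def execute(plan, document):
    """Run a compiled plan over one document and return spans in text order."""
    spans = [
        Span(name, match.start(), match.end(), match.group())
        for name, regex in plan
        for match in regex.finditer(document)
    ]
    return sorted(spans, key=lambda span: (span.begin, span.end))


# Hypothetical rules, written without any reference to how they are executed.
RULES = [
    Rule("Phone", r"\b\d{3}-\d{4}\b"),
    Rule("Email", r"\b[\w.]+@[\w.]+\.\w+\b"),
]

plan = compile_plan(RULES)
spans = execute(plan, "Call 555-1234 or write to ann@example.com today.")
```

Because the rules are plain data, the execution step could be swapped for a faster strategy without touching the rule definitions, which is the property the abstract attributes to the declarative approach.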
The document discusses cloud computing and provides a case study on moving a financial company's data service to the cloud. It outlines two options: 1) migrating the existing service to Amazon EC2, which would require addressing various technical and operational challenges, or 2) redesigning the service to leverage cloud databases and services, which could reduce costs but would require significant development work. The document cautions that moving workloads to the cloud requires careful analysis and that the hype around cloud computing should be taken in perspective, as integration and other business issues are more significant than technical challenges. It emphasizes establishing metrics to measure the benefits of cloud initiatives.
Giantsnet is a group of young student professionals. We are here for you, with tutorials in all areas: Systems & Network Administration (Virtualization, Monitoring ..), Security, Linux, CISCO, Microsoft ... and plenty of other things. Our goal is to share information with you; just follow us to find out more.
Facebook: https://www.facebook.com/souhaib.es
Page: https://www.facebook.com/GNetworksTv
Group: https://www.facebook.com/groups/Giants.Networks
Twitter: https://www.twitter.com/GNetworksTv
YouTube: https://www.youtube.com/user/GNetworksTv
Network Operation Centre Highlights and Practices
Telecom operators and IT organizations running complex networks can use the report for high-level planning and operations.
Fighting Financial Crime with Artificial Intelligence (DataWorks Summit)
How can we take the state of the art in deep learning and AI research, and transplant it into a large bank to deliver useful results which impact the general public? To answer this broad-reaching question, we take the viewer through a solution Think Big Analytics recently deployed at a major European bank for fraud detection, using state-of-the-art AI techniques and a near-real-time open-source architecture. We show how financial transactions can be transposed into a form where the latest AI techniques in image recognition can be leveraged, in surprisingly novel ways. We have been able to detect fraud more accurately and reduce financial crime, cutting losses and improving customer experience. We describe some architectures which can be used to do this in production, at scale, in global financial institutions.
Speaker:
Tim Seears, Director of Data Science, Think Big Analytics, a Teradata Company
1. The document discusses key metrics like velocity and productivity that are commonly used to track agile team performance but questions whether these truly measure business value.
2. Productivity is defined as accomplishing more with the same resources and may be a more meaningful metric than velocity, which can fluctuate. Productivity remains relatively constant as a team gains experience.
3. Velocity measures the amount of work a team can complete in a sprint but is not a good measure for comparing teams or evaluating performance, as story point sizes vary between teams. Business value, defined at the intersection of what can be successfully implemented, what customers want, and what excites the team, should be the key indicator used.
Dimension Data provides cloud services including compute, hosting, integration, and consulting delivered on its Managed Cloud Platform. The platform offers automation, security, and flexibility to move between public, private and hybrid cloud models. It addresses concerns around cloud adoption through management of the infrastructure and applications.
A network operations center, also known as a "network management center", is one or more locations from which network monitoring and control, or network management, is exercised over a computer, telecommunication or satellite network.
Network Management Fundamentals - Back to the Basics (SolarWinds)
Let's get Back to the Basics of Network Management. In this slideshare, we will walk you through the Fundamental Protocols of Network Management, Windows Management Protocols, Flow Based Protocols, Cisco IP Service Level Agreements and the Network Management Framework.
Santosh Rau, Engineering Manager, Software Infrastructure, Netflix, talks about the infrastructure Netflix built on AWS to power its website and its movies on demand.
The growing adoption of cloud computing has introduced many risks to preserving the confidentiality, integrity, and availability of data. The cloud now faces numerous compliance requirements because of the sensitivity of the data being stored. View this presentation to understand the cloud compliance requirements, risks, audit processes, and methodologies involved in providing assurance.
This presentation was given by CA Anand Prakash Jangid at the Conference on Cloud Computing conducted by the Committee on Information Technology of the Institute of Chartered Accountants of India on 11th January 2014.
You will also learn how to:
• Build products and features faster using a complete suite of connectors and stream management tools, and connect your environments to data pipelines
• Protect your most critical data and workloads with built-in guarantees for security, governance, and resilience
• Deploy Kafka at scale in minutes while reducing the associated costs and operational burden
System models for distributed and cloud computing (purplesea)
This document discusses different types of distributed computing systems including clusters, peer-to-peer networks, grids, and clouds. It describes key characteristics of each type such as configuration, control structure, scale, and usage. The document also covers performance metrics, scalability analysis using Amdahl's Law, system efficiency considerations, and techniques for achieving fault tolerance and high system availability in distributed environments.
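As a brief illustration of the scalability analysis mentioned above, Amdahl's Law bounds the speedup of a workload when only part of it can run in parallel. The sketch below uses the standard formula, speedup = 1 / ((1 - p) + p / n), and the usual definition of efficiency as speedup divided by the number of workers. The function and parameter names are assumptions for illustration and are not taken from the slides.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law for a fixed-size workload.

    parallel_fraction: share of the runtime that can be parallelized (0..1).
    workers: number of processors or nodes applied to the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def efficiency(parallel_fraction: float, workers: int) -> float:
    """System efficiency: achieved speedup divided by the workers used."""
    return amdahl_speedup(parallel_fraction, workers) / workers


# With 90% of the work parallelizable, 10 workers give roughly a 5.26x
# speedup, and no number of workers can exceed 1 / (1 - 0.9) = 10x.
speedup_10 = amdahl_speedup(0.9, 10)
efficiency_10 = efficiency(0.9, 10)
```

The falling efficiency as workers are added is why the serial fraction, rather than the cluster size, often dominates scalability planning.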
Security As A Service In Cloud (SECaaS) (أحلام انصارى)
This document discusses security as a service (SECaaS) in cloud computing. It begins by explaining other common cloud service models like SaaS, PaaS, IaaS, and STaaS. It then defines SECaaS as a business model where large service providers integrate security services like authentication, antivirus, intrusion detection, and security event management into a corporate infrastructure on a subscription basis. The document lists the top 10 cloud service providers and reasons why cloud-based security is required. It outlines common areas covered by SECaaS like identity and access management, data loss prevention, and network security. Finally, it provides examples of specific SECaaS products and services offered by vendors.
Server load balancing (SLB) distributes network traffic across multiple servers to optimize resource utilization and maximize throughput. It intercepts traffic destined for a website and redirects requests to various backend servers using techniques like network address translation. SLB aims to improve performance, increase scalability, and maintain high availability by monitoring servers and routing traffic around failures to keep applications running if servers go down. Both hardware and software-based solutions exist, with hardware providing higher performance but at greater cost than software-based options.
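The behavior described above, spreading requests across backends and routing around failed servers, can be sketched with a simple round-robin selector. This is a minimal illustration only: real SLB products also rewrite addresses (for example via NAT) and run their own health probes. The class name `RoundRobinBalancer`, its methods, and the backend addresses are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Pick backends in rotation, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._ring = cycle(self._backends)

    def mark_down(self, backend):
        """Record a failed health check so traffic is routed around it."""
        self._healthy.discard(backend)

    def mark_up(self, backend):
        """Return a known backend to rotation after it recovers."""
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is available."""
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


# Hypothetical backend addresses, used only for illustration.
balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
before_failure = [balancer.pick() for _ in range(3)]
balancer.mark_down("10.0.0.2")
after_failure = [balancer.pick() for _ in range(3)]
```

Round robin is the simplest policy; weighted or least-connections selection would replace only the `pick` method while the health tracking stays the same.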
Cloud computing is a model for enabling network access to configurable computing resources that can be rapidly provisioned with minimal management effort. There are differing definitions from NIST, Wikipedia, and others. Cloud computing provides utility computing, service-oriented architecture, and service level agreements. Key characteristics include scalability, availability, manageability, accessibility, performance, and enabling techniques like virtualization. The three main cloud models are Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Cloud deployment models include public, private, hybrid, and community clouds. Cloud computing provides advantages like cost savings and scalability but also risks like reliance on internet and potential security issues.
Tungsten Fabric is an open source network virtualization solution for providing connectivity and security for virtual, containerized or bare-metal workloads. Savannah will cover the overall architecture of Tungsten Fabric and the DPDK vRouter, which performs packet forwarding and enforces network and security policies.
Cloud & Security: What are the risks and what are the important questions to ... (Microsoft Décideurs IT)
Migrating to the cloud, whatever the deployment model envisaged (IaaS for infrastructure, PaaS for application migration, or SaaS for using cloud services), raises new challenges for security managers. Drawing in particular on the work and thinking of the Cloud Security Alliance (CSA), this session covers the various aspects to consider, whether technical, organizational, legal, or related to the maturity of the company. After highlighting the most significant risks, the second part addresses the important questions to ask when choosing a cloud provider and the value of standards such as ISO 27001 and of certifications. The last part of the session focuses on the Microsoft Office 365 SaaS offering and the answers it provides from a security standpoint, with reference to the criteria defined in the CSA's CCM (Cloud Controls Matrix).
In this session we’ll treat the need for performance as a foregone conclusion and take a whirlwind tour through the complexity of modern Internet architectures. These complexities lead to evil optimization problems and make it significantly harder to troubleshoot production issues to a speedy and successful end.
Starting with the simple facts that you can’t fix what you can’t see and you can’t improve what you can’t measure, we’ll discuss what needs monitoring and why. We’ll talk about unlikely allies in the fight for time and budget to instrument systems, applications and processes for observability.
You’ll leave the session with a better understanding of what it looks like to troubleshoot the storm of a malfunctioning large architecture and some tools and techniques you can use to not be swallowed by the Kraken.
The Next Wave of Reliability Engineering (Michael Kehoe)
In 2018, Site Reliability Engineering (SRE) will turn 15 years old. Since Google's inception of the term SRE, companies across the world have adopted a new operations mindset along with automation, deployment and monitoring principles. Most of what SRE does now is well established throughout the industry, so what is the next wave of reliability principles and automation frameworks?
This session will dive into what the future holds for reliability engineering as a field and what will be the next areas of investment and improvement for reliability teams.
The document discusses the principles and practices of DevSecOps. It begins with an agenda that covers DevSecOps prerequisites, foundations, roles and responsibilities, and practical tips. It discusses concepts like shifting security left, continuous integration/delivery pipelines, and the importance of collaboration across roles. It provides overviews of risk management, static and dynamic testing, feature toggles, and recommends DevSecOps training and tools from Cprime. The presentation aims to help organizations adopt DevSecOps practices to improve security and deployment processes.
The document presents the concepts and services of AWS cloud computing. It discusses what cloud computing is, the types of cloud, the pillars, layered models, virtualization, AWS infrastructure, security, and various services such as EC2, S3, DynamoDB, and Redshift, among others. The speaker also presents their credentials and contact channels.
The success of an application deployment on the cloud depends heavily on the architecture style, which in turn depends on your business needs. This presentation covers commonly used architectures and business use cases.
This is the first presentation I created for training the IBM EBIS community on cloud computing and the approach to cloud sales and projects. All the materials come from IBM internal documentation and slides from previous classes.
The document outlines the key tools, processes, and operations used to efficiently run a Network Operations Center (NOC). It discusses (1) tools like ticketing systems, knowledge bases, monitoring, and automation that help manage incidents and tasks. It also discusses (2) important processes like escalation, prioritization, incident handling, change management, and maintenance. Finally, it provides examples of (3) daily operational processes for fault notification, resolution, client provisioning and decommissioning, and datacenter access control. The goal is to deliver proper service, meet business goals, and optimize NOC management through best practices and appropriate resources.
Roger S. Barga discusses his experience in data science and predictive analytics projects across multiple industries. He provides examples of predictive models built for customer segmentation, predictive maintenance, customer targeting, and network intrusion prevention. Barga also outlines a sample predictive analytics project for a real estate client to predict whether they can charge above or below market rates. The presentation emphasizes best practices for building predictive models such as starting small, leveraging third-party tools, and focusing on proxy metrics that drive business outcomes.
Doing Analytics Right - Building the Analytics Environment (Tasktop)
Implementing analytics for development processes is challenging. As discussed in the previous webinars, the right analytics are determined by the goals of the organization, not by the available data. So implementing your analytics solutions will require an efficient analytics and data architecture, including the ability to combine and stage data from heterogeneous sources. An architecture that excludes the ability to gain access to the necessary data will create a barrier to deploying your newly designed analytics program, and will force you back into the “light is brighter here” anti-pattern.
This webinar will describe the technical considerations of implementing the data architecture for your analytics program, and explain how Tasktop can help.
The Sky’s the Limit – The Rise of Machine Learning (Inside Analysis)
The Briefing Room with Analyst Dr. Robin Bloor and SkyTree
Live Webcast on June 24, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=1da2b498fc39b8b331a5bbb8dea2660f
With data growing more complex these days, many organizations are looking for ways to make sense of new information sources. The goal? Sprint ahead of the competition by exploiting fast-moving opportunities. The challenge? The data volumes, variety and velocity call for significantly greater horsepower than ever before. That’s where machine learning comes into play, and it’s already fundamentally changing the Big Data Analytics landscape.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he explains how advanced analytics technology can transform the enterprise. He’ll be briefed by Martin Hack, CEO of Skytree, who will tout his company’s machine learning solution for big data. Hack will discuss the critical challenges facing today’s data professionals, and present use cases to show how machine learning can help organizations leverage big data as a capital asset. He’ll specifically address the power of predictive analytics, which can help companies seize opportunities and prevent serious problems.
Visit InsideAnalysis.com for more information.
Time Difference: How Tomorrow's Companies Will Outpace Today's (Inside Analysis)
The Briefing Room with Mark Madsen and WebAction
Live Webcast Feb. 10, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=fa83c6283de99dfb6f38b9d7199cb452
In our increasingly interconnected world, the windows of opportunity for meaningful action are shrinking. Where hours once sufficed, minutes are now the norm. For some transactions, seconds make all the difference, even sub-seconds. Meeting these demands requires a new approach to information architecture, one that embraces the many innovations that are fundamentally changing the data-driven economy.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature as he explains how a confluence of advances are changing the nature of data management. He'll be briefed by Sami Akbay of WebAction, who will showcase his company's real-time data platform, designed from the ground up to meet the challenges of leveraging Big Data in concert with all manner of operational enterprise systems.
Visit InsideAnalysis.com for more information.
Algorithm Marketplace and the new "Algorithm Economy"Diego Oppenheimer
Diego Oppenheimer discusses the rise of algorithm marketplaces and the new "algorithm economy". Key points include:
- Advances in machine learning, computer vision, speech recognition and natural language processing are enabling algorithms to interpret unstructured data at scale.
- Algorithm marketplaces allow algorithms to be hosted, discovered, monetized and composed modularly to address a wide range of use cases across many industries.
- The algorithm economy will lower barriers to applying machine intelligence and foster innovation as algorithms become reusable assets that creators and users can both benefit from.
This document provides an overview of data science and its applications. It discusses:
1) Industries that are being disrupted by data science like telecom, banking, retail, and healthcare.
2) How companies like Amazon, Netflix, and Google were able to disrupt their industries through their ability to analyze patterns in data faster than competitors.
3) The factors driving more companies to adopt data science including competitive advantages, revenue growth, and cost optimization.
This document discusses opportunities for using big data in private wealth management. It begins by defining big data and describing how data volumes have increased exponentially. It then outlines several potential use cases for big data in areas like real-time performance metrics, portfolio optimization, and leveraging customer data. For each use case, it describes current limitations and how a big data approach could enable new capabilities. Finally, it proposes a phased approach for wealth managers to identify use cases, prioritize them, implement proofs of concept, and incrementally automate analysis and reporting. The overall message is that big data can enhance analytics and open up new opportunities previously only available to investment banks.
A step towards machine learning at accionlabsChetan Khatri
This document provides an overview of machine learning including definitions of common techniques like supervised learning, unsupervised learning, and reinforcement learning. It discusses applications of machine learning across various domains like vision, natural language processing, and speech recognition. Additionally, it outlines machine learning life cycles and lists tools, technologies, and resources for learning and practicing machine learning.
[Data Meetup] Data Science in Finance - Building a Quant ML pipelineData Science Society
Georgi Kirov shares a common market-neutral statistical arbitrage framework. It will help showcase the many different ways to structure a systematic research project. From data reconciliation and signal backtesting to optimization and execution, what are some principled ways to evaluate and compare ML ideas? This process inevitably depends on the characteristics of a specific strategy, for instance, if it is liquidity-taking or liquidity-making.
Technology Office challenges especially better use of information within a banking environment. Provided tangible examples of transactional analysis, text sentiment analysis, power consumption analysis
Pangea Machine Translation platform from Pangeanic. A product presentation by Manuel Herranz, Elia Yuste, Andi Frank showcasing the best of automated cleaning cycles, automated engine retraining, machine translation engine creation.
predictive analysis and usage in procurement ppt 2017Prashant Bhatmule
Predictive analytics can help reduce volatility and improve decision making in procurement processes. It allows understanding of future costs, demand, and supply to overcome challenges. Predictive models analyze past data and behaviors to forecast trends and outcomes. As data sources like IoT sensors expand, predictive analytics is increasingly used for applications like manufacturing process improvement, predictive maintenance of equipment, and optimizing building energy usage.
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...Dell World
Becoming data-driven requires analytics to be embedded throughout the organization in different functional areas and different operational processes. But how do you provide more and more people with the ability to run any analytics on any data anywhere– without breaking the bank? In this session, you’ll see real-world examples of Dell customers who have successfully embedded analytics across processes and operations to drive innovation.We will also demonstrate how embedding analytics enables faster innovation and improves collaboration between data scientists, business analysts, and business stakeholders, leading to a competitive advantage.
Oracle is a leading technology company focused on database software and cloud computing. It generates revenue from software licenses and cloud services. While Oracle faces competition from other large tech companies, its strengths include consulting services, global sales channels, and expertise in data storage and applications. The rise of big data presents both opportunities and challenges for Oracle to leverage new types and volumes of customer information through its products.
This document discusses how big data and analytics are moving from on-premises data warehouses to hybrid cloud environments that leverage technologies like Hadoop, Spark, and machine learning. It provides examples of how Oracle is helping customers with this transition by offering big data cloud services that give them flexibility to run workloads both on-premises and in the cloud while simplifying data management and enabling new types of advanced analytics.
Data mining and Machine learning expained in jargon free & lucid languageq-Maxim
Data mining and Machine learning explained in jargon free & lucid language.
By reading one can get some intuition about what data mining and machine learning is all about
APPLY IT IN THEIR OWN WORK
Transformacion del Negocio Financiero por medio de Tecnologias CloudRaul Goycoolea Seoane
This document discusses how cloud technologies can transform businesses. It provides an overview of Xertica, a leading cloud consulting firm in Latin America. The document then discusses how various industries like financial services, retail, and manufacturing are using technologies like cloud platforms, machine learning, and analytics to improve areas such as customer service, risk management, and modernizing legacy infrastructure. Specific customer examples are provided for each area to illustrate the business benefits seen such as reduced costs, improved efficiencies and increased innovation.
Keynote presentation from ECBS conference. The talk is about how to use machine learning and AI in improving software engineering. Experiences from our project in Software Center (www.software-center.se).
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Amazon Web Services Korea
This document discusses the democratization of data science and machine learning using automated machine learning tools. It provides examples of how DataRobot has helped customers in various industries build predictive models faster and with less coding than traditional approaches. Specifically, it summarizes how DataRobot has helped customers in banking, insurance, retail, and other industries with use cases like predictive maintenance, sales forecasting, fraud detection, customer churn prediction, and insurance underwriting.
Artificial Intelligence using Machine Learning techniques like Churn and Recommender models can help Relationship Managers connect with dormant clients and help recommend stocks and MFs using existing applications via different devices
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
1. SystemT: an Algebraic Approach to
Declarative Information Extraction
Laura Chiticariu
IBM Research | Almaden
Talk at MIT Computer Science and Artificial Intelligence Laboratory
03/08/2016
2. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
3. Case Study 1: Sentiment Analysis
Customer 360º
[Figure labels: Social Media; Product catalog, Customer Master Data, …; Text Analytics; 360º Profile (Products, Interests, Relationships, Personal Attributes, Life Events); Statistical Analysis, Report Gen.]
[Chart: Bank 1 15%, Bank 2 15%, Bank 3 5%, Bank 4 21%, Bank 5 20%, Bank 6 23%]
• Scale: 450M+ tweets a day, 100M+ consumers, …
• Breadth: Buzz, intent, sentiment, life events, personal attributes, …
• Complexity: Sarcasm, wishful thinking, …
4. Case Study 2: Machine Analysis
Operations Analysis
[Figure: Web Servers, Application Servers, Transaction Processing Monitors, Database #1 and Database #2, each writing its own log files]
• Web site with multi-tier architecture
• Every component produces its own system logs
• An error shows up in the log for Database #2
• What sequence of events led to this error?
DB #2 Log File: 12:34:56 SQL ERROR 43251: Table CUST.ORDERWZ is not
5. Case Study 2: Machine Analysis
Operations Analysis
[Figure: log analysis pipeline]
Raw Logs: Web Server Logs, Custom Application Logs, Database Logs
Stages: Parse, Extract, Integrate, Analyze
Intermediate data: Parsed Log Records, Linkage Information, End-to-End Application Sessions
Analyses: Graphical Models (Anomaly detection), Correlation Analysis
Correlation Analysis table (Cause and Effect cells not recoverable):
Cause | Effect | Delay | Correlation
… | … | 5 min | 0.85
… | … | 10 min | 0.62
• Every customer has unique components with unique log record formats
• Need to quickly customize all stages of analysis for these custom log data sources
6. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
7. Text Analytics vs Information Extraction
[Figure: text analytics tasks, with examples]
• POS Tagging: "… hotel close to Seaworld in Orlando?" tagged NNS JJ IN NNP IN NNP
• Text Normalization: "Oh wow. Holy s***. Welp. Ima go see Godzilla in 3 months. F*** yes" normalized to "I am going to go see Godzilla in 3 months."
• Example sentence: "We are neutral on Malaysian banks."
• Dependency Parsing
• Information Extraction
• Clustering
• Classification
• Regression
• Applications
8. Enterprise Requirements
• Expressivity
– Need to express complex NLP algorithms for a variety of tasks and data sources
• Scalability
– Large data volumes, often orders of magnitude larger than classical NLP corpora
• Social Media: Twitter alone has 500M+ messages / day; 1TB+ per day
• Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs
• Machine Data: One application server under moderate load at medium logging level generates 1GB of logs per day
• Transparency
– Every customer’s data and problems are unique in some way
– Need to easily comprehend, debug and enhance extractors
9. Expressivity Example: Different Kinds of Parses
Natural Language (Dependency Tree)
We are raising our tablet forecast.
[Figure: dependency tree over the sentence, with node labels S, NP, DET and edge labels subj, obj, pred]
Machine Log
Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected
Time: Oct 1 04:12:24
Host: 9.1.1.3
Process: 41865
Category: %PLATFORM_ENV-1-DUAL_PWR
Message: Faulty internal power supply B detected
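As an illustration only (not SystemT or AQL code), the machine-log parse on this slide can be sketched with a plain Python regular expression. The pattern, the field names, and the function name `parse_log_line` are assumptions written for this one sample line; real log formats differ per component.

```python
import re

# Hypothetical pattern fitted to the single sample line from the slide.
LOG_PATTERN = re.compile(
    r"^(?P<time>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"   # e.g. "Oct 1 04:12:24"
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})\s+"                # dotted IPv4 address
    r"(?P<process>\d+):\s+"                                # numeric process id
    r"(?P<category>%[A-Z0-9_]+-\d+-[A-Z0-9_]+):\s+"        # e.g. "%FACILITY-1-CODE"
    r"(?P<message>.*)$"                                    # free-text remainder
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


record = parse_log_line(
    "Oct 1 04:12:24 9.1.1.3 41865: "
    "%PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected"
)
for field, value in record.items():
    print(f"{field}: {value}")
```

A hand-written pattern like this has to be redone for every new log format, which is the customization burden the case study describes.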
10. Expressivity Example: Fact Extraction (Tables)
Singapore 2012 Annual Report (136 pages PDF)
• Identify note breaking down Operating expenses line item, and extract opex components
• Identify line item for Operating expenses from Income statement (financial table in PDF document)
11. Expressivity Example: Sentiment Analysis
Mcdonalds mcnuggets are fake as shit but they so delicious.
We should do something cool like go to ZZZZ (kidding).
Makin chicken fries at home bc everyone sucks!
You are never too old for Disney movies.
Bank X got me ****ed up today!
Not a pleasant client experience. Please fix ASAP.
I'm still hearing from clients that Company A's website is better.
X... fixing something that wasn't broken
Intel's 2013 capex is elevated at 23% of sales, above average of 16%
IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62 per share, an increase of 11 percent
We continue to rate shares of MSFT neutral.
Sell EUR/CHF at market for a decline to 1.31000…
FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
Customer Surveys
Social Media
Analyst Research Reports
12. Transparency Example: Need to incorporate in-situ data
Intel's 2013 capex is elevated at 23% of sales, above average of 16%
FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
I'm still hearing from clients that Merrill's website is better.
I need to go back to Walmart, Toys R Us has the same toy $10 cheaper!
Callouts: Customer or competitor? Good or bad? Entity of interest
13. A Brief History of IE
Rule-Based
• 1978-1997: MUC (Message Understanding Conference) – DARPA competition 1987 to 1997
– FRUMP [DeJong82]
– FASTUS [Appelt93]
– TextPro, PROTEUS
• 1998: Common Pattern Specification Language (CPSL) standard [Appelt98]
– Standard for subsequent rule-based systems
• 1999-present: Commercial products, GATE
Machine Learning
• At first: Simple techniques like Naive Bayes
• 1990’s: Learning Rules
– AUTOSLOG [Riloff93]
– CRYSTAL [Soderland98]
– SRV [Freitag98]
• 2000’s: More specialized models
– Maximum Entropy Models [Berger96]
– Hidden Markov Models [Leek97]
– Maximum Entropy Markov Models [McCallum00]
– Conditional Random Fields [Lafferty01]
– Automatic feature expansion
14. Real Systems: A Practical Perspective
• Entity extraction
• EMNLP, ACL, NAACL, 2003-2012
• 54 industrial vendors (Who’s Who in Text Analytics, 2012)
[Chiticariu, Li, Reiss, EMNLP’13]
Fast development, fast adaptation, better results in limited time, sophistication
Easy to comprehend, maintain, debug & optimize performance; lower reliance on labeled data
15. Background: The SystemT Project
• Early 2000’s: NLP group starts at IBM Research – Almaden
• Initial focus: Collection-level machine learning problems
• Observation: Most time spent on feature extraction
– Technology used: Java, then Cascading Finite State Automata
16. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
17. Finite-state Automata
• Common formalism underlying most rule-based IE systems
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over the lexical features of these tokens
• Several levels of processing → Cascading Automata
– Typically, at higher levels of the grammar, larger segments of text are analyzed and annotated
• Common Pattern Specification Language (CPSL)
– Standard to specify and represent cascading finite state automata
– Each automaton accepts a sequence of annotations and outputs a sequence of annotations
18. Cascading Automata Example: Person Extractor
Sample text (embedded in filler text): … Tomorrow, we will meet Mark Scott, Howard Smith and …
Tokenization (preprocessing step)
Level 1
〈Gazetteer〉[type = LastGaz] → 〈Last〉
〈Gazetteer〉[type = FirstGaz] → 〈First〉
〈Token〉[~ “[A-Z]\w+”] → 〈Caps〉
• Rule priority used to prefer First over Caps
• Lossy Sequencing: annotations dropped because input to next stage must be a sequence
– First preferred over Last since it was declared earlier
19. Cascading Automata Example: Person Extractor
[Sample document: filler text containing the sentence “Tomorrow, we will meet Mark Scott, Howard Smith and …”, repeated three times on the slide]
Tokenization
(preprocessing step)
Level 1
〈Gazetteer〉[type = LastGaz] → 〈Last〉
〈Gazetteer〉[type = FirstGaz] → 〈First〉
〈Token〉[~ “[A-Z]\w+”] → 〈Caps〉
Level 2
〈First〉 〈Last〉 → 〈Person〉
〈First〉 〈Caps〉 → 〈Person〉
〈First〉 → 〈Person〉
Rigid Rule Priority and Lossy Sequencing in Level 1 caused partial results
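The effect of rigid rule priority and lossy sequencing can be sketched in a few lines of Python. This is only an illustration, not the CPSL semantics in full: the gazetteer contents, the token list, and the function names `level1` and `level2` are hypothetical. Because Level 1 keeps exactly one label per token, "Scott" is labeled First and its Last and Caps readings are lost, so Level 2 can no longer assemble "Mark Scott".

```python
import re

# Hypothetical gazetteers; "Scott" is both a first and a last name.
FIRST_GAZ = {"Mark", "Scott", "Howard"}
LAST_GAZ = {"Scott", "Smith"}
CAPS = re.compile(r"[A-Z]\w+")


def level1(tokens):
    """Keep ONE label per token: the highest-priority rule wins (lossy)."""
    labels = []
    for tok in tokens:
        if tok in FIRST_GAZ:            # First outranks Last and Caps
            labels.append("First")
        elif tok in LAST_GAZ:
            labels.append("Last")
        elif CAPS.fullmatch(tok):
            labels.append("Caps")
        else:
            labels.append(None)
    return labels


def level2(tokens, labels):
    """Apply <First><Last>, <First><Caps>, <First> over the label sequence."""
    persons, i = [], 0
    while i < len(tokens):
        if labels[i] == "First":
            if i + 1 < len(tokens) and labels[i + 1] in ("Last", "Caps"):
                persons.append(" ".join(tokens[i:i + 2]))
                i += 2
                continue
            persons.append(tokens[i])
        i += 1
    return persons


tokens = ["we", "will", "meet", "Mark", "Scott", ",", "Howard", "Smith"]
persons = level2(tokens, level1(tokens))
print(persons)
```

With these assumed gazetteers the cascade returns the fragments "Mark" and "Scott" instead of the full name, while "Howard Smith" is found only because "Smith" has no competing First reading.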
20. Problems with Cascading Automata
• Scalability: Redundant passes over document
• Expressivity: Frequent use of custom code
• Transparency
• Ease of comprehension
• Ease of debugging
• Ease of enhancement
Operational semantics + custom code = no provenance
21. Bringing Transparency to Feature Extraction
• Our approach: Use a declarative language
– Decouple meaning of extraction rules from execution plan
• Our language: AQL (Annotator Query Language)
– Semantics based on relational calculus
– Syntax based on SQL
22. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
23. AQL Data Model (Simplified)
Document
text: String
Person
first: Span, last: Span, fullname: Span
• Relational data model: data is organized in tuples; tuples have a schema
• Special data types necessary for text processing:
– Document consists of a single text attribute
– Annotations are represented by a type called Span, which consists of begin, end and document attributes
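As a rough illustration of this data model, the tuples can be mirrored with Python dataclasses. The class layout follows the slide; the exclusive `end` offset, the `covered_text` method, and the `span_of` helper are assumptions made for the sketch, not part of AQL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str  # the single attribute of a document tuple


@dataclass(frozen=True)
class Span:
    document: Document
    begin: int  # offset of the first character
    end: int    # offset one past the last character (assumed convention)

    def covered_text(self) -> str:
        return self.document.text[self.begin:self.end]


@dataclass(frozen=True)
class Person:
    first: Span
    last: Span
    fullname: Span


def span_of(document: Document, phrase: str) -> Span:
    """Hypothetical helper: span of the first occurrence of phrase."""
    begin = document.text.index(phrase)
    return Span(document, begin, begin + len(phrase))


doc = Document("Tomorrow, we will meet Mark Scott")
person = Person(
    first=span_of(doc, "Mark"),
    last=span_of(doc, "Scott"),
    fullname=span_of(doc, "Mark Scott"),
)
print(person.fullname.covered_text())
```

A span stores only offsets plus a reference to its document, so overlapping annotations over the same text can coexist as ordinary tuples.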
24. AQL By Example
Pattern: <First> <Caps>, 0 tokens apart
create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);
• Declarative: Specify logical conditions that input tuples should satisfy in order to generate an output tuple
• Choice of SQL-like syntax for AQL motivated by wider adoption of SQL
• Compiles into SystemT algebra
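The logic of the FirstCaps view can be approximated in Python as a filtered cross product. This is a sketch under stated assumptions: spans are plain `(begin, end)` tuples, the `TOKEN` regex is a hypothetical tokenizer, and `follows_tok` and `combine_spans` only imitate the AQL built-ins of the same name.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")  # hypothetical tokenizer


def follows_tok(text, a, b, min_tok, max_tok):
    """True if b starts after a ends with min_tok..max_tok tokens in between."""
    if b[0] < a[1]:
        return False
    gap = len(TOKEN.findall(text[a[1]:b[0]]))
    return min_tok <= gap <= max_tok


def combine_spans(a, b):
    """Smallest span covering both a and b (a is assumed to come first)."""
    return (a[0], b[1])


def first_caps(text, first, caps):
    # from First F, Caps C where FollowsTok(F.name, C.name, 0, 0)
    return [combine_spans(f, c)
            for f in first
            for c in caps
            if follows_tok(text, f, c, 0, 0)]


text = "we will meet Mark Scott"
first = [(13, 17)]               # "Mark"
caps = [(13, 17), (18, 23)]      # "Mark", "Scott"
result = first_caps(text, first, caps)
print([text[b:e] for b, e in result])
```

The nested comprehension states only which pairs qualify; nothing in it fixes the order in which First and Caps are evaluated, which is what leaves room for an optimizer.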
25. Revisiting the Person Example
create view Person as
select S.name as name
from (
-- <First><Caps>
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0))
union all
-- <First><Last>
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0))
union all
-- <First>
( select *
from First F )
) S
consolidate on name;
26. Revisiting the Person Example (same Person view as on slide 25, with callouts)
• consolidate on name: Explicit clause for resolving ambiguity (No Rigid Priority problem)
• Input may contain overlapping annotations (No Lossy Sequencing problem)
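A minimal Python sketch of the union-then-consolidate idea follows. It assumes a containment-based policy (discard any span that lies inside another candidate); the slide does not say which consolidation policy the view uses, and `span_of` and `consolidate` are hypothetical helpers.

```python
def span_of(text, phrase):
    """Hypothetical helper: (begin, end) of the first occurrence of phrase."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))


def consolidate(spans):
    """Drop duplicates and spans contained in another span (assumed policy)."""
    unique = set(spans)
    return sorted(
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    )


text = "we will meet Mark Scott, Howard Smith"

# Outputs of the three sub-queries; overlapping candidates are all kept.
first_caps = [span_of(text, "Mark Scott"), span_of(text, "Howard Smith")]
first_last = [span_of(text, "Mark Scott"), span_of(text, "Howard Smith")]
first = [span_of(text, "Mark"), span_of(text, "Scott"), span_of(text, "Howard")]

candidates = first_caps + first_last + first      # union all
persons = [text[b:e] for b, e in consolidate(candidates)]
print(persons)
```

Unlike the cascade, no candidate is discarded before the final step, so the ambiguity is resolved once, explicitly, over the complete set of overlapping spans.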
27. Compiling and Executing AQL
AQL Language → Optimizer → Operator Graph
• AQL Language: Specify extractor semantics declaratively (express logic of computation, not control flow)
• Optimizer: Choose efficient execution plan that implements semantics
• Operator Graph: Optimized execution plan executed at runtime
28. Regular Expression Extraction Operator
Regex: [A-Z][a-z]+
Input Tuple: Document (“we will meet Mark Scott and”)
Output Tuple 1: Document, Span 1
Output Tuple 2: Document, Span 2
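The operator on this slide can be imitated with Python's `re` module: one input tuple goes in, and one output tuple per match comes out, each carrying the document plus a span. The dictionary layout and the function name `regex_extract` are assumptions for the sketch.

```python
import re


def regex_extract(doc_text, pattern):
    """Yield one output tuple (document, span) per regex match."""
    for m in re.finditer(pattern, doc_text):
        yield {"document": doc_text, "span": (m.start(), m.end())}


doc_text = "we will meet Mark Scott and"
outputs = list(regex_extract(doc_text, r"[A-Z][a-z]+"))
matched = [t["document"][t["span"][0]:t["span"][1]] for t in outputs]
print(matched)
```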
29. Expressivity: Rich Set of Operators
AQL: Language to express NLP Algorithms
Core Operators: Tokenization, Parts of Speech, Dictionaries, Regular Expressions, Span Operations, Relational Operations, Aggregation Operations
Deep Syntactic Parsing, Semantic Role Labels, ML Training & Scoring
30. package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("\n", " "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %d\n",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* part of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("\n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
Expressivity: AQL vs. Custom Java Code
31. Expressivity: AQL vs. Custom Java Code
Java Implementation of Sentence Boundary Detection (same SentenceChunker class as on slide 30)
Equivalent AQL Implementation
create dictionary AbbrevDict from file 'abbreviation.dict';
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+\s)|(\n\s*\n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1), R.match)));
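To make the logic of the AQL version concrete, here is a hedged Python approximation: find candidate boundaries with a regex, then drop those whose left context plus match contains an abbreviation. It simplifies two things that are assumptions of this sketch: the left-context token is taken as the run of non-whitespace characters before the match, and the dictionary lookup is a plain substring test.

```python
import re

# Same two alternatives as the AQL regex: end punctuation followed by
# whitespace, or a blank line.
BOUNDARY = re.compile(r"[.?!]+\s|\n\s*\n")


def sentence_boundaries(text, abbreviations):
    """Return (begin, end) spans of boundary matches that survive the filter."""
    kept = []
    for m in BOUNDARY.finditer(text):
        # Rough stand-in for LeftContextTok(R.match, 1): the run of
        # non-whitespace characters that ends where the match begins.
        left = re.search(r"\S+\Z", text[:m.start()])
        begin = left.start() if left else m.start()
        combined = text[begin:m.end()]  # CombineSpans(left context, match)
        # Rough stand-in for Not(ContainsDict(...)): plain substring test.
        if not any(abbr in combined for abbr in abbreviations):
            kept.append((m.start(), m.end()))
    return kept


text = "Mr. Smith arrived. He left."
spans = sentence_boundaries(text, {"Mr.", "Dr."})
print(spans)
```

The period after "Mr" is rejected because the combined span "Mr. " contains a dictionary entry; the period after "arrived" is kept. The final period is not a candidate at all, since the regex requires trailing whitespace.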
32. Expressivity: AQL vs CPSL
• [Fagin et al., PODS ’13]
– Formal modeling of core subset of code-free CPSL (regular spanners) and AQL (core spanners)
– Core spanners strictly more expressive than regular spanners
• [Fagin et al., PODS ’14]
– Unified framework for declarative cleaning (i.e., resolving overlap)
– Cleaning increases expressiveness in general
• But not for SystemT consolidate policies, nor for controls in JAPE (CPSL implementation in GATE)
33. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
34. Scalability
• Performance issues with cascading automata
– Complete pass through tokens for each cascade
– Many of these passes are wasted work
• Dominant approach: Make each pass go faster
– Doesn’t solve root problem!
• Algebraic approach: Build a query optimizer!
35. Scalability: Class of Optimizations in SystemT
• Rewrite-based: rewrite algebraic operator graph
– Shared Dictionary Matching
– Shared Regular Expression Evaluation
– On-demand tokenization
• Cost-based: relies on novel selectivity estimation for text-specific operators
– Standard transformations
• E.g., push down selections
– Restricted Span Evaluation
• Evaluate expensive operators on restricted regions of the document
Tokenization overhead is paid only once
[Figure: three execution plans for “First followed within 0 tokens by Caps”]
Plan A: Join First and Caps
Plan B (Restricted Span Evaluation): for each First, extract text to the right and identify Caps starting within 0 tokens
Plan C (Restricted Span Evaluation): for each Caps, extract text to the left and identify First ending within 0 tokens
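The difference between a plain join and Restricted Span Evaluation can be sketched in Python. This is an illustration, not SystemT's operator implementation: the `CAPS` regex with a word boundary, the tuple representation of spans, and the function names `plan_a` and `plan_b` are assumptions. Plan A runs the Caps regex over the whole document and then joins; Plan B evaluates it only at the position just to the right of each First span.

```python
import re

CAPS = re.compile(r"\b[A-Z]\w+")


def span_of(text, phrase):
    """Hypothetical helper: (begin, end) of the first occurrence of phrase."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))


def plan_a(text, first_spans):
    """Extract Caps over the full document, then join with First."""
    caps = [(m.start(), m.end()) for m in CAPS.finditer(text)]
    return [(f[0], c[1])
            for f in first_spans
            for c in caps
            if c[0] >= f[1] and text[f[1]:c[0]].strip() == ""]


def plan_b(text, first_spans):
    """For each First, evaluate the Caps regex only right after it."""
    out = []
    for f in first_spans:
        pos = f[1]
        while pos < len(text) and text[pos].isspace():
            pos += 1
        m = CAPS.match(text, pos)   # anchored at pos, no document scan
        if m:
            out.append((f[0], m.end()))
    return out


text = "meet Mark Scott, Howard Smith"
first = [span_of(text, n) for n in ("Mark", "Scott", "Howard")]
result_a = plan_a(text, first)
result_b = plan_b(text, first)
print([text[b:e] for b, e in result_b])
```

Both plans return the same spans on this input, but Plan B never scans text that is not adjacent to a First match. Which plan is cheaper depends on how selective First is, which is what a cost-based optimizer has to estimate.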
36. Performance Comparison (with ANNIE)
[Figure: throughput (KB/sec, axis 0 to 700) against average document size (KB, axis 0 to 100). Legend: Open Source Entity Tagger (ANNIE), SystemT]
Task: Named Entity
Dataset: Different document collections from the Enron corpus obtained by randomly sampling 1000 documents for each size
10~50x faster
[Chiticariu et al., ACL’10]
37. Performance Comparison on Larger Documents [Chiticariu et al., ACL’10]
Dataset | Document Size Range | Document Size Average | Throughput (KB/sec) ANNIE | Throughput (KB/sec) SystemT | Average Memory (MB) ANNIE | Average Memory (MB) SystemT
Web Crawl | 68 B – 388 KB | 8.8 KB | 42.8 | 498.8 | 201.8 | 77.2
Medium SEC Filings | 240 KB – 0.9 MB | 401 KB | 26.3 | 703.5 | 601.8 | 143.7
Large SEC Filings | 1 MB – 3.4 MB | 1.54 MB | 21.1 | 954.5 | 2683.5 | 189.6
Datasets: Web crawl and filings from the Securities and Exchange Commission (SEC)
Throughput benefits carry over for a wide variety of document sizes
Much lower memory footprint
38. Performance Comparison on Larger Documents (same table as on slide 37) [Chiticariu et al., ACL’10]
Theorem: For any acyclic token-based FST T, there exists an operator graph G such that evaluating T and G has the same computational complexity
40. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
41. How AQL Solved our Problems
• Expressivity: Complex tasks, no custom code
• Scalability: Cost-based query optimization
• Transparency
• Ease of comprehension
• Ease of debugging
• Ease of enhancement
Clear and Simple Provenance
42. Computing Provenance
Provenance: Explains output data in terms of the input data, the intermediate data, and the transformation (e.g., SQL query, ETL, workflow)
– Surveys: [Davidson & Freire, SIGMOD 2008] [Cheney et al., Found. Databases 2009]
For predicate-based rule languages (e.g., SQL), can be computed automatically! [Liu et al., VLDB’10]
PersonPhone rule:
insert into PersonPhone
select Merge(F.match, P.match) as match
from Person F, Phone P
where Follows(F.match, P.match, 0, 60);
Input text, annotated with two Person spans and one Phone span:
Anna at James St. office (555-5555) .
PersonPhone
match
Anna at James St. office (555-5555
James St. office (555-5555
43. Computing Provenance
Provenance: Explains output data in terms of the input data, the intermediate data, and the transformation (e.g., SQL query, ETL, workflow)
– Surveys: [Davidson & Freire, SIGMOD 2008] [Cheney et al., Found. Databases 2009]
For predicate-based rule languages (e.g., SQL), can be computed automatically! [Liu et al., VLDB’10]
Rewritten PersonPhone rule:
insert into PersonPhone
select Merge(F.match, P.match) as match,
GenerateID() as ID,
F.id as nameProv, P.id as numberProv,
'AND' as how
from Person F, Phone P
where Follows(F.match, P.match, 0, 60);
Input text, annotated with Person (ID: 1), Person (ID: 2) and Phone (ID: 3):
Anna at James St. office (555-5555) .
PersonPhone
match | Provenance
Anna at James St. office (555-5555 | 1 AND 3
James St. office (555-5555 | 2 AND 3
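The rewritten rule can be mimicked in Python to show how provenance falls out of a predicate-based join. This is a sketch only: the dictionary layout, the `span_of` helper, and the arbitrary starting value for generated IDs are assumptions, and `Follows` is read as "the phone starts between 0 and 60 characters after the person ends".

```python
from itertools import count


def span_of(text, phrase):
    """Hypothetical helper: (begin, end) of the first occurrence of phrase."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))


def person_phone(persons, phones, max_gap=60, ids=None):
    """Join Person and Phone, recording which inputs produced each output."""
    ids = ids or count(100)   # arbitrary start for generated output IDs
    rows = []
    for f in persons:
        for p in phones:
            gap = p["match"][0] - f["match"][1]
            if 0 <= gap <= max_gap:                   # Follows(F, P, 0, 60)
                rows.append({
                    "match": (f["match"][0], p["match"][1]),  # Merge(F, P)
                    "id": next(ids),                  # GenerateID()
                    "nameProv": f["id"],
                    "numberProv": p["id"],
                    "how": "AND",
                })
    return rows


text = "Anna at James St. office (555-5555) ."
persons = [{"id": 1, "match": span_of(text, "Anna")},
           {"id": 2, "match": span_of(text, "James")}]
phones = [{"id": 3, "match": span_of(text, "555-5555")}]

rows = person_phone(persons, phones)
for r in rows:
    b, e = r["match"]
    print(text[b:e], "|", r["nameProv"], r["how"], r["numberProv"])
```

Each output row names the exact input tuples that satisfied the predicate, so a wrong result such as the one starting at "James" can be traced back to the Person annotation with ID 2.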
44. AQL: Going beyond Feature Extraction
Dataset | Entity Type | System | Precision | Recall | F-measure
CoNLL 2003 | Location | SystemT | 93.11 | 91.61 | 92.35
CoNLL 2003 | Location | Florian | 90.59 | 91.73 | 91.15
CoNLL 2003 | Organization | SystemT | 92.25 | 85.31 | 88.65
CoNLL 2003 | Organization | Florian | 85.93 | 83.44 | 84.67
CoNLL 2003 | Person | SystemT | 96.32 | 92.39 | 94.32
CoNLL 2003 | Person | Florian | 92.49 | 95.24 | 93.85
Enron | Person | SystemT | 87.27 | 81.82 | 84.46
Enron | Person | Minkov | 81.1 | 74.9 | 77.9
Extraction Task: Named-entity extraction
Systems compared: SystemT (customized) vs. [Florian et al.’03] [Minkov et al.’05]
[Chiticariu et al., EMNLP’10]
Transparency without machine learning outperforms machine learning without transparency.
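For readers checking the table, the F-measure column is consistent with the balanced F1 score, the harmonic mean of precision and recall. That reading is an assumption, since the slide does not define the measure. Recomputing from the rounded precision and recall values can differ from the published figure in the last digit.

```python
def f_measure(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# SystemT, CoNLL 2003, Location row of the table above.
location_f = f_measure(93.11, 91.61)
print(round(location_f, 2))
```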
45. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
46. Multilingual Support
Tokenization
• 26+ languages
• All other languages supported via whitespace/punctuation tokenizer
Part of Speech and Lemmatization
• 18+ languages, including English and Western languages, Arabic, Russian, CJK
Semantic Role Labeling
• English
• Ongoing work on Multilingual SRL [Akbik et al., ACL’15]
Annotator Libraries
• Out-of-the-box customizable libraries for multiple applications and data sources in multiple languages
47. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
48. Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts:
– Embeddable Models in AQL
– Learning using AQL as target language
49. Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts:
– Embeddable Models in AQL
• Model Training and Scoring integration with SystemML
• Deep Parsing & Semantic Role Labeling [Akbik et al., ACL’15]
• Text Normalization [Zhang et al., ACL’13, Li & Baldwin, NAACL’15]
– Learning using AQL as target language
51. SystemML in a Nutshell
• Provides a language for data scientists to implement machine learning algorithms
– Declarative, high-level language with R-like syntax (also Python)
– Also comes with approx. 20 algorithms pre-implemented
• Compiles execution plans ranging from single node (scale-up, multi-threaded) to scale-out (MapReduce, Spark)
– Cost-based optimizer to generate execution plans, parallelize
• Based on data and system characteristics
– Operators for in-memory single node and cluster execution
• Runs in embeddable, standalone, and cluster mode
• Supports various APIs
• Apache SystemML Incubator project: http://systemml.apache.org
• Ongoing research effort at IBM Research - Almaden
52. Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts:
– Embeddable Models in AQL
– Learning using AQL as target language
• Low-level features: regular expressions [Li et al., EMNLP’08], dictionaries [Li et al., CIKM’11, Roy et al., SIGMOD’13]
• Rule refinement [Liu et al., VLDB’10]
• Rule induction [Nagesh et al., EMNLP’12]
• Ongoing research
53. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Grammar-based IE systems
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
54. Tooling Research for Productivity
Development: Develop, Test, Analyze
Maintenance: Deploy, Refine, Test
Task Analysis
[ACL’11,12,13, CHI’13]
• Concordance Viewer and Labeling Tool [ACL’13]
• Extraction plan [CHI’13]
• Track provenance [VLDB’10]
• Contextual clue discovery [CIKM’11]
• Regex learning [EMNLP’08]
• Rule induction & refinement [EMNLP’12, VLDB’10]
• Dictionary refinement [SIGMOD’13]
• Visual Programming [VLDB’15]
• NE Interface [EMNLP’10]
55. Eclipse Tools Overview
Goals: Ease of Programming, Performance Tuning, Automatic Discovery
[Screenshot labels: AQL Editor, Explain, Pattern Discovery, Result Viewer, Regex Learner]
AQL Editor: syntax highlighting, auto-complete, hyperlink navigation
Result Viewer: visualize/compare/evaluate
Explain: show how each result was generated
Workflow UI: end-to-end development wizard
Regex Generator: generate regular expressions from examples
Pattern Discovery: identify patterns in the data
Profiler: identify performance bottlenecks to be hand tuned
[Chiticariu et al., SIGMOD’11, Li et al., ACL’12]
56. Web Tools Overview
Goals: Ease of Programming, Ease of Sharing
Canvas: Visual construction of extractors, customization of existing extractors
Result Viewer: visualize/compare/evaluate
Concept catalog: share concepts
Project: share extractor development
[Li et al., VLDB’15]
57. Tooling for Productivity
[Figure: layered stack of customizable extractor libraries developed in AQL]
AQL Language
Core Operations: Tokenization, Span Operators, Dictionary, Regular Expressions, Relational & Aggregate Operators, ML Training & Scoring, Deep Parsing & SRL, …
Higher-level Tasks: Named Entities, Parts-of-speech, Financial Primitives, Drug Extraction, Sentiment Analysis, …
Application Tasks: Corporate Intelligence, Life Sciences, …
Users: NLP Engineers (Eclipse Tools), SME (Web Tools)
58. Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Grammar-based IE systems
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary
59. Summary
A declarative information extraction system with cost-based optimization, high-performance runtime and novel development tooling based on solid theoretical foundation [PODS’13, 14]. Shipping with 10+ IBM products, and installed on 400+ clients’ sites.
• AQL: a declarative language that can be used to build extractors outperforming the state of the art [ACL’10, EMNLP’10]
• A suite of novel development tooling leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, SIGMOD’12, CHI’13, ACL’13, VLDB’15]
• Cost-based optimization for text-centric operations [ICDE’08, ICDE’11]
• Highly embeddable runtime with high throughput and small memory footprint [SIGMOD Record’09, SIGMOD’09, FPL’13,’14]
[Figure: Development Environment (declarative AQL language, discovery tools for AQL development), cost-based optimization, and SystemT Runtime (Input Documents in, Extracted Objects out), embedded in InfoSphere Streams, InfoSphere BigInsights, UIMA, IBM Engines, …]
60. Find Out More about SystemT!
https://ibm.biz/BdF4GQ
• Try out SystemT
• Watch a demo
• Learn about using SystemT in university courses
62. Thank you!
• For more information…
– Visit our website: http://ibm.co/1Cdm1Mj
– Visit BigInsights Knowledge Center: http://ibm.co/1DIouEv
– Learning at your own pace: With the lab materials
– Contact me
• chiti@us.ibm.com
SystemT Team:
Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li,
Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu