Getting Started with Unstructured Data
1. Getting Started with Unstructured Data
Christine Connors & Kevin Lynch
TriviumRLG LLC
Semantic Tech & Business, Washington D.C.
November 29, 2011
Tuesday, November 29, 2011
2. Meta
✤ Presenter: Christine Connors
✤ @cjmconnors
✤ Presenter: Kevin Lynch
✤ @kevinjohnlynch
✤ Principals at www.triviumrlg.com
3. Agenda
✤ What is unstructured data?
✤ Where do we find it?
✤ How important is it?
✤ How do we visualize it?
✤ Machine processing for actionable data
✤ Tools
4. What is unstructured data?
✤ Data which is
✤ Not in a database
✤ Does not adhere to a formal data model
✤ Content
5. Isn’t that a misnomer?
✤ Problematic term
✤ The presence of object metadata or aesthetic markup does not alone give ‘structure’ in this sense of the word
✤ Object metadata = machine or applied properties
✤ Aesthetic markup = stylesheets; rendering information
✤ Semi-structured data is typically treated as unstructured for the purposes of machine processing and analysis
6. Types of ‘un’structured data
✤ Text-based documents
✤ Word processing, presentations, email, blogs, wikis, tweets, web pages, web components (read/write web)
✤ Audio/video files
7. Where do we find it?
✤ Office productivity suites
✤ Content management systems
✤ Digital asset management systems
✤ Web content management systems
✤ Wikis, blogs, comment & discussion threads
✤ Social networking tools
✤ Twitter, Yammer, instant messengers
8. Is it really that important?
✤ Structured: 15%
✤ Unstructured: 85%
9. What’s in that 80-85%?
✤ Progress reports - created in a word processor
10. What’s in that 80-85%?
✤ Dashboards - created in presentation software
11. What’s in that 80-85%?
✤ Progress reports - color coded text in a spreadsheet
12. What’s in that 80-85%?
✤ Brainstorming - in messaging systems
✤ Decision making - in email
13. What’s in that 80-85%?
✤ Business intelligence - on the web and more
14. How can we make the data more actionable?
✤ Identify it
✤ Convert to a format you can work with
✤ Add structure, meaning:
✤ information extraction
✤ annotation
✤ content analytics
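As a rough illustration of the "convert to a format you can work with" step, here is a minimal Python sketch that reduces an HTML page to plain text using only the standard library. The names `TextExtractor` and `html_to_text` are made up for this example and do not come from the presentation; a real pipeline would also need to skip scripts and styles and handle many other document types.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-empty text fragments, trimmed of surrounding whitespace.
        # Note: script/style contents are NOT filtered out in this sketch.
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<html><body><h1>Status</h1><p>Project is <b>on track</b></p></body></html>"
print(html_to_text(page))
```

Once the content is plain text, the later steps (information extraction, annotation, content analytics) have a uniform input to work on.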
15. What about enterprise search?
✤ First line of defense
✤ Points you at the highest relevancy ranked data via pattern matching and statistical analysis
✤ Does not assist in other visualizations or transformations without further machine processing
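One common form of the "statistical analysis" behind relevancy ranking is TF-IDF scoring. The sketch below is an assumption about what such ranking might look like in miniature, not a description of any particular enterprise search product; the function `rank`, the smoothing in the IDF term, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return document IDs ordered by a simple TF-IDF score (zero scores omitted)."""
    tokenized = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in tokenized.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term at all.
            df = sum(1 for c in tokenized.values() if term in c)
            if df:
                # Term frequency times a smoothed inverse document frequency.
                score += counts[term] * math.log(1 + n_docs / df)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "quarterly progress report for the search project",
    "b": "search search tips",
    "c": "lunch menu",
}
print(rank("search tips", docs))
```

The result is an ordered list of pointers to documents, which is why further machine processing is needed before the content itself can be visualized or transformed.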
16. Machine Processing
[Diagram. Labels: Unstructured Data; Natural Language Processing; Statistical Analysis; Rules-based Classification; Semantic Analysis; Machine Processing Platform; API; Federated Search; Index; Visualizations; Data Stores]
17. Let’s go a little deeper...
18. Good News, Bad News
✤ Good: Basic text analysis tools are widely available; cheap or free
✤ Good: The range of information you can now consider has broadened; the intelligence you can bring to bear on that information has increased
✤ Bad: Skillsets not widely available (but they are available!)
✤ Good: You can get started right here, understanding, identifying the sources, and possible approaches
19. What Data Doesn’t Do
✤ From Coco Krumme in “Beautiful Data”
✤ Data doesn’t drive everything.
✤ Note: “narrative fallacy,” “confirmation bias,” “paradox of choice”
✤ Data doesn’t: scale (cognitively), alone explain, predict
✤ The real world doesn’t create random variables
✤ Data doesn’t stand alone
20. Integrating Unstructured Data
Images
From Oracle 11g presentation at www.nmoug.org/papers/11g_High_Level_April08.ppt
21. The Goal: Usable Knowledge
✤ Information extraction is NOT the goal
✤ Information extraction is a means to an end
✤ Knowledge discovery is the goal
✤ To this end, we will perform lots of processing to move from bits to usable meaning
22. So many <near> synonyms
✤ Text analytics
✤ Content analytics
✤ Text mining
✤ Data mining
✤ Information extraction
✤ And then there’s Natural Language Processing
23. What’s the same?
✤ Moving from bits to meaning requires processing, and a lot of that processing is the same, no matter what you call it
✤ We will focus primarily on textual information today
24. Natural Language
✤ From Peter Norvig’s “Natural Language Corpus Data” chapter in “Beautiful Data”
✤ Google’s 1 trillion-word corpus, used to investigate probabilistic language models
✤ 13 million types (unique words, punctuation)
✤ 100k types cover 98% of the corpus
✤ For: word segmentation, spelling correction, language identification, spam detection, author identification
✤ %? = “chooses pain” ; “in sufficient numbers”
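The observation that a small share of types covers most of a corpus can be checked on any text. The following is a minimal sketch, assuming a simple regex tokenizer that treats each word and each punctuation mark as a token; the function `type_coverage` and the sample sentence are illustrative only and have nothing to do with the Google corpus itself.

```python
import re
from collections import Counter


def type_coverage(text, top_n):
    """Return (number of types, number of tokens, share of tokens covered
    by the top_n most frequent types)."""
    # A token is a run of word characters or a single punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    if not tokens:
        return 0, 0, 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return len(counts), len(tokens), covered / len(tokens)


sample = "the cat sat on the mat. the dog sat on the log."
types, tokens, share = type_coverage(sample, 3)
print(types, tokens, round(share, 2))
```

Even in this tiny sample, 3 of the 8 types account for more than half of the 14 tokens; on a large corpus the skew is far stronger, as the figures above indicate.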
26. Information Extraction
✤ Cluster analysis - group related information, where relationship may not be known
✤ Classification - mapping to specific categories
✤ Dependency identification / Rule generation
✤ Relationship detection - e.g. “Joe” “is CEO” at “IBM”
✤ Coreference resolution (anaphoric reference resolution)
✤ e.g., “Joe is CEO at IBM. He is an IEEE member.”
✤ Summarization - key concepts or key sentences
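To make relationship detection concrete, here is a deliberately naive, pattern-based sketch that turns sentences of the form "Name is Role at Organization" into (subject, predicate, object) triples. The pattern and the function `extract_relations` are assumptions for illustration; real systems rely on linguistic analysis rather than a single regular expression, and this sketch does not attempt coreference resolution (it cannot tell that "He" refers to Joe).

```python
import re

# Hypothetical pattern: "<Name> is <Role> at <Organization>",
# where the role and organization both start with a capital letter.
RELATION = re.compile(r"\b([A-Z][a-z]+) is ([A-Z][A-Za-z]*) at ([A-Z][A-Za-z]+)\b")


def extract_relations(text):
    """Return (subject, predicate, object) triples found in the text."""
    return [
        (m.group(1), "is " + m.group(2), m.group(3))
        for m in RELATION.finditer(text)
    ]


print(extract_relations("Joe is CEO at IBM. He is an IEEE member."))
```

The second sentence yields nothing: linking "He" back to "Joe" is exactly the job of the coreference resolution step listed above.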
27. IR and IE
✤ IR (Information Retrieval) versus IE (Information Extraction)
✤ IR retrieves documents from collections; IE retrieves facts and structured information from collections
✤ In IR, the objects of analysis are documents; in IE, the objects of analysis are facts
✤ IE returns knowledge at a deeper level than traditional IR
✤ Results may be imperfect, and linking them back to documents adds value
✤ Sound familiar? (semantic web, linked data)
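The contrast can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `retrieve` stands in for IR with plain keyword matching, `extract` stands in for IE with a single hypothetical pattern, and the three sample documents are invented. Note that each extracted fact keeps a `source` field linking it back to its document.

```python
import re

DOCS = {
    "doc1": "Joe is CEO at IBM.",
    "doc2": "IBM opened a new office.",
    "doc3": "Ann is CTO at Acme.",
}


def retrieve(query, docs):
    """IR: return the IDs of documents that contain every query term."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        words = set(re.findall(r"\w+", text.lower()))
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits


# Hypothetical pattern for "<Person> is <ROLE> at <Organization>".
ROLE = re.compile(r"\b([A-Z][a-z]+) is ([A-Z]+) at ([A-Z][A-Za-z]+)")


def extract(docs):
    """IE: return facts, each linked back to its source document."""
    facts = []
    for doc_id, text in docs.items():
        for m in ROLE.finditer(text):
            facts.append({
                "person": m.group(1),
                "role": m.group(2),
                "org": m.group(3),
                "source": doc_id,
            })
    return facts


print(retrieve("ibm", DOCS))
print(extract(DOCS))
```

The IR call answers "which documents mention IBM?", while the IE call answers "who holds which role where?" and still lets the reader check each answer against its source.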
28. Information Extraction
Two primary system types
Knowledge Engineering
✤ Rule based
✤ Developed by experienced language engineers
✤ Make use of human intuition
✤ Require only small amount of training data
✤ Development can be very time consuming
✤ Some changes may be hard to accommodate
Learning Systems
✤ Use statistics or other machine learning
✤ Developers do not need language engineering expertise
✤ Require large amounts of annotated training data
✤ Some changes may require re-annotation of the entire training corpus
From http://gate.ac.uk/sale/talks/gate-course-may11/track-1/module-2-ie/module-2-ie.pdf
29. Two views of the semantic web
[Figure labels: Text; Subject, Predicate, Object. Images from Wikipedia]
Machine learning, natural language processing, artificial intelligence and linked data
30. Named Entities
✤ What is NER?
✤ Named Entity Recognition
✤ identifying proper names in texts, and classification into a set of
predefined categories of interest
✤ Named entity recognition is the cornerstone of Information
Extraction, providing a foundation from which to build complex
information extraction systems
31. Named Entities
✤ Person names
✤ Organizations (companies, government organizations, committees)
✤ Locations (cities, countries, rivers)
✤ Date and time expressions
✤ Measures (percent, money, weight)
✤ Email addresses, web addresses, street addresses
✤ Some domain-specific entities: names of drugs, medical conditions,
names of ships, bibliographic references, etc.
32. NOT Named Entities
✤ Artifacts - Wall Street Journal
✤ Common nouns, referring to named entities
✤ e.g. the company, the committee
✤ Names of groups of people and things named after people
✤ e.g. the Tories, the Nobel Prize
✤ Adjectives derived from names
✤ e.g. Bulgarian, Chinese
✤ Numbers which are not times, dates, percentages or money amounts
http://gate.ac.uk/sale/talks/ne-tutorial.ppt
34. Open Tools
✤ GATE – General Architecture for
Text Engineering, from the
University of Sheffield, with many
users and excellent documentation.
✤ GATE has customizable document
and corpus processing pipelines.
GATE is an architecture, a
framework, and a development
environment, with a clean separation
of algorithms, data, and
visualization.
35. GATE
✤ “The Volkswagen Beetle of language processing”
✤ “...more than a decade of collecting reusable code and building a
community has lead [to] a mature ecosystem for solving language
processing problems quickly.”
✤ Hamish Cunningham 2010
36. GATE – Key Features
✤ Component-based development
✤ Automatic performance measurement
✤ Clean separation between data structures and algorithms
✤ Consistent use of standard mechanisms for components to
communicate data
✤ Insulation from data formats
✤ Provision of a baseline set of language components
37. GATE – More...
✤ Free – open source, LGPL, Java
✤ Mature, at version 6, actively supported, 15 FTEs
✤ Comprehensive, standards-based, popular
✤ Used by thousands of companies, universities, and research
laboratories
✤ Well-known, tested, researched, and very well-documented
38. GATE Overview
✤ Architectural principles
✤ Non-prescriptive, theory neutral (strength and weakness)
✤ Re-use, interoperation, not reimplementation (diverse support, lots of
plugins)
✤ (Almost) everything is a component, and component sets are user-extendable
✤ Component-based development
✤ CREOLE = modified Java Beans (Collection of REusable Objects for
Language Engineering)
✤ The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
39. GATE – Family
✤ GATE Developer – an integrated development environment for
language processing components bundled with the most widely used
Information Extraction system and a comprehensive set of plugins
✤ GATE Embedded – an object library optimized for inclusion in
diverse apps
✤ GATE Teamware – web app, a collaborative annotative environment
✤ GATE Cloud – parallel distributed processing
40. GATE – Embedded
From http://gate.ac.uk/g8/page/print/2/sale/talks/gate-apis.png
41. GATE – Teamware
✤ GATE Teamware – web app, a collaborative annotative environment
for high volume factory-style semantic annotation built with workflow
✤ Running in 5 minutes with Teamware virtual server from
GATECloud.net (itself open source):
✤ Reusable project templates
✤ Project-specific roles, users
✤ Applying GATE-based processing routines
✤ Project status, annotator activity, statistics
42. GATE – First Cousins
✤ Ontotext KIM: UIs demonstrating the multi-paradigm approach to
information management, navigation and search
✤ Ontotext Mimir: a massively scalable multi-paradigm index built on
Ontotext’s semantic repository family, GATE’s annotation structures
database, plus full-text indexing from MG4J
✤ Ontotext FactForge: ~4B Linked Data statements, query-able
43. GATE – Ontotext KIM
✤ Ontotext KIM: UIs, tools, GATE Gazetteers, including a Linked Data
gazetteer (experimental)
✤ Pre-loaded knowledge base for entities
✤ Tools to upload, query, tailor the knowledge base, algorithms, UI
✤ Can crawl web, including Linked Data, creating semantic index: your
servers, theirs, or cloud
✤ Based on GATE and OWLIM
44. GATE – Ontotext KIM
From: http://www.ontotext.com/sites/default/files/pictures/diagram.png
49. GATE – Ontotext MIMIR
✤ Ontotext Mimir: large scale indexing infrastructure supporting hybrid
search (text, annotation, meaning); massively scalable multi-paradigm
capability, combines MG4J full-text index and BigOWLIM semantic
repository; query with text, structural info, and SPARQL
✤ Integrated with GATE, customizable, scalable
✤ Open source components
✤ Can federate multiple MIMIRs
✤ Low acquisition, management cost to scale
50. GATE – Multi-paradigm
✤ Why “multi-paradigm?” Proliferation of retrieval technology options
✤ Full text, boolean, proximity, ranking; behavior mining, tag clouds;
concept indexing: taxonomic, ontological; annotation-based
✤ Choice depends principally on content volume + value:
✤ High volume, low (average) value: web search
✤ Medium volume, higher (personal) value: social networks, photo
sharing, tagging
✤ Low volume, high value: controlled vocabularies, taxonomies,
ontologies
51. GATE “Resources”
✤ Applications – groups of processes (that run on one or more
documents)
✤ Language Resources – documents or document collections (corpus,
corpora)
✤ Processing Resources – annotation tools that operate on text in
documents
✤ Applications, made up of Processing Resources, operate on Language
Resources
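The relationship between the three resource kinds can be sketched in a few lines. This is a toy Python model, not GATE's Java API: documents are plain dicts, Processing Resources are plain functions, and every name here (`new_document`, `whitespace_tokenizer`, `token_counter`, `run_application`) is hypothetical.

```python
def new_document(text):
    # Toy Language Resource: text plus the annotations and features added to it.
    return {"text": text, "annotations": [], "features": {}}

def whitespace_tokenizer(doc):
    # Toy Processing Resource: one "Token" annotation per whitespace-separated word.
    pos = 0
    for word in doc["text"].split():
        start = doc["text"].index(word, pos)
        end = start + len(word)
        doc["annotations"].append({"type": "Token", "start": start, "end": end})
        pos = end

def token_counter(doc):
    # Toy Processing Resource that relies on the tokenizer having run first.
    doc["features"]["token_count"] = sum(
        a["type"] == "Token" for a in doc["annotations"]
    )

def run_application(processing_resources, corpus):
    # Toy Application: run each Processing Resource, in order, over each document.
    for doc in corpus:
        for pr in processing_resources:
            pr(doc)
    return corpus

corpus = [new_document("Joe is CEO at IBM."), new_document("He is an IEEE member.")]
run_application([whitespace_tokenizer, token_counter], corpus)
print([doc["features"]["token_count"] for doc in corpus])
```

Order matters in this sketch: swapping the two functions would leave `token_count` at zero, which is the same kind of dependency a real pipeline has between its stages.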
52. Plugins
✤ Applications – an application consists of any number of Processing
Resources, run sequentially over documents
✤ Plugins – a plugin is a collection of one or more Processing Resources,
bundled together.
✤ Plugins, then, must be loaded in order to access the Processing
Resources they contain.
56. GATE Annotations
✤ Annotations are central to understanding GATE
✤ Annotations are associated with each document
✤ Each annotation has:
✤ start and end offsets
✤ an optional set of features
✤ each feature has a name and a value
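The annotation model above maps naturally onto a small data structure. This is a Python sketch for illustration rather than GATE's Java classes: the `Annotation` class, its `type` field, the `covered_text` helper, and the `gender` feature value are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int   # offset of the first character covered
    end: int     # offset one past the last character covered
    type: str    # label for the annotation (assumed field, e.g. "Person")
    features: dict = field(default_factory=dict)  # optional name/value pairs

def covered_text(text, annotation):
    # The annotation only points into the text; the text itself is unchanged.
    return text[annotation.start:annotation.end]

text = "Joe is CEO at IBM."
annotations = [
    Annotation(0, 3, "Person", {"gender": "male"}),  # illustrative feature
    Annotation(14, 17, "Organization"),              # no features supplied
]

for ann in annotations:
    print(ann.type, repr(covered_text(text, ann)), ann.features)
```

Keeping annotations as offsets beside the text, rather than as inline markup, lets several annotations overlap the same span.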
59. Information Extraction
✤ TE: Template Elements
✤ NE: Named Entity recognition and
typing
✤ TR: Template Relations
✤ CO: CO-reference resolution
✤ ST: Scenario Templates
✤ Example:
The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head.
Dr. Head is a staff scientist at We Build Rockets Inc.
✤ NE: Entities are “rocket,” “Tuesday,” “Dr. Head” and “We Build Rockets”
CO: “it” refers to the rocket; “Dr. Head” and “Dr. Big Head” are the same
TE: the rocket is “shiny red” and Head’s “brainchild”
TR: Dr. Head works for “We Build Rockets Inc.”
ST: a rocket launching event occurred with the various participants
From http://gate.ac.uk/sale/talks/ne-tutorial.ppt
60. ANNIE
✤ A Nearly-New Information Extraction System, packaged with GATE,
used throughout examples, and a great place to start
✤ A collection of GATE Processing Resources to perform Information
Extraction on unstructured text
✤ “Nearly new” was its name 10 years ago, and it stuck
✤ Other information extraction systems include LingPipe and
OpenNLP, which are independently developed NLP pipelines. GATE
includes wrappers for both. All three systems are provided as
pre-built applications through the GATE File menu
61. ANNIE
✤ “Processing Resources” inside ANNIE:
✤ Tokenizer, sentence splitter, part-of-speech tagger, gazetteers, named
entity tagger, and an orthomatcher
✤ Also included are noun phrase and verb phrase chunkers
✤ Each “Processing Resource” inside ANNIE can be used as part of a
pipeline you create to add annotations or modify existing ones
✤ ANNIE is a highly customizable, rule-based system, with very useful
defaults
62. ANNIE
✤ “Processing Resources” inside ANNIE:
✤ Gazetteer – lookup annotations (lists)
✤ JAPE transducer – date, person, location, organization, money,
percent annotations
✤ Orthomatcher – adds match features to named entity annotations
(coreference matching)
✤ Document Reset – removes annotations
63. IE Steps in ANNIE
✤ “Tokenizer” performs Token identification and word segmentation
✤ “Sentence splitter” identifies sentences
✤ “POS tagger” performs part-of-speech tagging (noun, verb, adverb,
adjective)
✤ Must run Tokenizer and Sentence Splitter before POS tagger
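The ordering constraint above is easier to see in code. This is a toy Python sketch and not ANNIE's implementation: the regex tokenizer, the punctuation-based sentence rule, and the tiny `TOY_LEXICON` are all assumptions chosen to keep the example short.

```python
import re

def tokenize(text):
    # (start, end, string) for each word or punctuation mark.
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def split_sentences(tokens):
    # Naive rule (an assumption): a sentence ends at ".", "!" or "?".
    # Abbreviations such as "Dr." would be split wrongly by this rule.
    sentences, start = [], None
    for tok_start, tok_end, tok in tokens:
        if start is None:
            start = tok_start
        if tok in {".", "!", "?"}:
            sentences.append((start, tok_end))
            start = None
    if start is not None:
        sentences.append((start, tokens[-1][1]))
    return sentences

# Hypothetical lookup table standing in for a real tagging model.
TOY_LEXICON = {"joe": "noun", "ceo": "noun", "ibm": "noun", "member": "noun", "is": "verb"}

def pos_tag(tokens, sentences):
    # Needs both earlier outputs: tokens to tag, sentences to group them.
    tagged = []
    for s_start, s_end in sentences:
        in_sentence = [t for t in tokens if s_start <= t[0] and t[1] <= s_end]
        tagged.append([(tok, TOY_LEXICON.get(tok.lower(), "other")) for _, _, tok in in_sentence])
    return tagged

text = "Joe is CEO at IBM. He is an IEEE member."
tokens = tokenize(text)
sentences = split_sentences(tokens)
tagged = pos_tag(tokens, sentences)
print(tagged[0])
```

The tagger cannot be called until `tokens` and `sentences` exist, which mirrors the rule that the Tokenizer and Sentence Splitter run first.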
64. IE Steps in ANNIE
✤ “Gazetteers” – lists of names (people, cities, groups); you can modify
or add lists
✤ Each list has features (majorType, minorType, language)
✤ Gazetteers generate “Lookup” annotations with features
corresponding to the matched list. When the text matches a gazetteer
entry, a Lookup annotation is created.
✤ Lookup annotations are used by ANNIE’s Named Entity transducer
for entity identification.
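A gazetteer lookup of the kind described above can be sketched as list matching. This is a minimal Python illustration, assuming exact whole-word matching and two tiny hand-made lists; the `GAZETTEER` structure and `lookup` function are hypothetical and do not reflect GATE's list file format.

```python
import re

# Hypothetical lists; each carries the features copied onto its Lookup annotations.
GAZETTEER = {
    "city": {"majorType": "location", "minorType": "city",
             "entries": ["London", "New York"]},
    "company": {"majorType": "organization", "minorType": "company",
                "entries": ["IBM"]},
}

def lookup(text):
    # Create one "Lookup" annotation for every place the text matches a list entry.
    annotations = []
    for lst in GAZETTEER.values():
        for entry in lst["entries"]:
            for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
                annotations.append({
                    "type": "Lookup",
                    "start": m.start(),
                    "end": m.end(),
                    "features": {"majorType": lst["majorType"],
                                 "minorType": lst["minorType"]},
                })
    return sorted(annotations, key=lambda a: a["start"])

text = "IBM opened an office in New York."
lookups = lookup(text)
for a in lookups:
    print(text[a["start"]:a["end"]], a["features"])
```

Adding a name to a list changes what gets a Lookup annotation without touching any code, which is what makes the lists an easy first thing to customize.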
69. IE Steps in ANNIE
✤ “NE Transducer” – Named Entity Transducer performs named entity
recognition (NER)
✤ Once we have built up the processing resource pipeline with the
previous steps (tokeniser, sentence splitter, POS tagger, gazetteer), we
are ready to add the transducer for named entity recognition
✤ More specific information can be added to the features now, including
the “kind” of entity, and the rules that were fired
70. IE Steps in ANNIE
✤ “OrthoMatcher” – orthographic co-reference matches proper names
and their variants.
✤ Will match previously unclassified names, based on relations with
classified entities
✤ Matches “Kevin Lynch” with “Dr. Lynch”
✤ Matches acronyms with expansions
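The two kinds of match listed above can be approximated with simple string rules. This is a Python sketch of the idea only, not the OrthoMatcher's actual rule set: the `TITLES` list, the surname-based test in `same_person`, and the initials test in `is_acronym_of` are assumptions.

```python
TITLES = {"dr", "mr", "mrs", "ms", "prof"}  # assumed list of titles to ignore

def name_tokens(name):
    # Split a name into words, dropping trailing periods and known titles.
    tokens = [t.strip(".") for t in name.split()]
    return [t for t in tokens if t.lower() not in TITLES]

def same_person(a, b):
    # Heuristic: same final word, and the shorter name adds no new words.
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return False
    shorter, longer = sorted((ta, tb), key=len)
    return shorter[-1] == longer[-1] and set(shorter) <= set(longer)

def is_acronym_of(acronym, expansion):
    # Heuristic: the acronym equals the initials of the capitalized words.
    initials = "".join(word[0] for word in expansion.split() if word[0].isupper())
    return acronym.replace(".", "") == initials

print(same_person("Kevin Lynch", "Dr. Lynch"))
print(is_acronym_of("IBM", "International Business Machines"))
```

Rules this simple will also link two different people who share a surname, so a real matcher needs more context than the names alone.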
71. IE Steps in ANNIE
✤ Tokenizer, sentence splitter, and OrthoMatcher are language-, domain-,
and application-independent
✤ Part-of-speech tagger is language-dependent and
application-independent
✤ Gazetteer lists are starting points (60K entries)
✤ ANNIE is a way to get started, with a framework for identifying the
kinds of elements that matter to your work, and for quickly testing
your ideas against existing data
73. Rules-based Classification
✤ Once a stand-alone project, now often part of annotation services
✤ Regex, Boolean and naive Bayesian algorithms executed on tokens
✤ And, Or, Not, Near (x), Multi, Stem, Exact, Phrase, et al (vendor or
source dependent)
✤ Assigns documents to a taxonomic category
✤ Allows for greater control over depth and breadth of categories
✤ Human aided, machine processed
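Boolean operators over tokens, as listed above, compose into category rules. This is a small Python sketch, assuming four operators (And, Or, Not, Near) and two invented taxonomy categories; the operator names and `TAXONOMY_RULES` are hypothetical and do not follow any particular vendor's syntax.

```python
import re

def tokens_of(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def has(term):
    return lambda toks: term in toks

def all_of(*rules):   # And
    return lambda toks: all(rule(toks) for rule in rules)

def any_of(*rules):   # Or
    return lambda toks: any(rule(toks) for rule in rules)

def none_of(*rules):  # Not
    return lambda toks: not any(rule(toks) for rule in rules)

def near(a, b, x):    # Near (x): both terms occur within x tokens of each other
    def rule(toks):
        pos_a = [i for i, t in enumerate(toks) if t == a]
        pos_b = [i for i, t in enumerate(toks) if t == b]
        return any(abs(i - j) <= x for i in pos_a for j in pos_b)
    return rule

# Invented categories, written by a person and applied by the machine.
TAXONOMY_RULES = {
    "Finance/Mergers": all_of(any_of(has("acquired"), has("merger")),
                              none_of(has("rumor"))),
    "Technology/Search": near("search", "index", 3),
}

def classify(text):
    toks = tokens_of(text)
    return [category for category, rule in TAXONOMY_RULES.items() if rule(toks)]

print(classify("HP acquired Autonomy"))
print(classify("A full text index for semantic search"))
```

Because each rule is explicit, a taxonomy editor can see exactly why a document landed in a category and adjust the rule accordingly.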
85. Quick!
1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of
parliament, and so on and so forth) -- call this your corpus
2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy,
or something from the Linked Data cloud) -- call this your ontology
3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to
the ontology (2.)
4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and
measure performance against the gold standard
5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems
using GATE Embedded)
6. Use GATE Mimir to store the annotations relative to the ontology in a multiparadigm index server. (For
techies: this sits in the backroom as a RESTful web service.)
7. Use Ontotext KIM to add semantic search, knowledge facet search, ontology browsing, entity popularity
graphing, time series graphing, annotation structure search and (last but not least) boolean full text search.
(More techy stuff: mash up these types of search with your existing UIs.)
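Measuring a pipeline against the gold standard usually comes down to precision, recall, and F1. This is a minimal Python sketch, assuming annotations are `(start, end, type)` tuples and that only exact matches count; the `score` function and the sample annotations are invented for illustration and are not GATE's evaluation tooling.

```python
def score(gold, predicted):
    # Strict matching (an assumption): offsets and type must all agree.
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented example: human gold standard versus pipeline output.
gold = {(0, 3, "Person"), (14, 17, "Organization"), (19, 21, "Person")}
predicted = {(0, 3, "Person"), (14, 17, "Location")}

precision, recall, f1 = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the pipeline found the right span for the organization but gave it the wrong type, so under strict matching it counts as both a miss and a false alarm.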
86. Data Warehousing /
Business Intelligence
✤ Perspective
✤ Process
✤ Use cases
✤ Implications with unstructured data
87. DW/BI Perspective
✤ Structured data is an incomplete version of the “truth”
✤ Until information is quantified, it is not very useful
✤ Discover facts, and give them structure
✤ Complement structured data with unstructured data; try to complete
the picture (of the business, the customer, performance)
88. DW/BI Process
✤ Extract, then formalize
✤ Give information structure, then associations
✤ Map to existing structures in the data warehouse
89. DW/BI Use Cases
✤ Report indexing (of metadata, of instances)
✤ Report sections become possible
✤ Self-service for consumers
✤ “BI Search” (of those reports)
✤ Include in portal
✤ As range of reports and users increases, unstructured data approaches
have more value
90. DW/BI Use Case Ideas
✤ For customers, products, complaints, locations:
✤ Voice recognition indexing
✤ RSS feeds
✤ Wikis, blogs (internal and external)
✤ Instant messages
91. DW/BI Implications
✤ Have to store these results
✤ Have to model these results
✤ Have to map these results to something meaningful
✤ Have to include the results in a useful way (Where? Use taxonomies?
Which ones?)
✤ Quality, cost, and complexity matter; extracted entities don’t relate
directly to performance
✤ Not a replacement, an addition to the technology
92. Some Technical Issues
✤ Quality
✤ Integration
✤ Concurrency
✤ Security
✤ Skills
93. Additional Open Tools
✤ UIMA – Unstructured Information
Management Architecture (IBM’s
Watson uses this), originated at
IBM, now an Apache project.
✤ Component software architecture
with a document processing
pipeline similar to GATE. Focus on
performance and scalability, with
distributed processing (web
services).
94. UIMA
UIMA’s Basic Building Blocks are Annotators. They iterate over an artifact to discover new
types based on existing ones and update the Common Analysis Structure (CAS) for
upstream processing.
[Figure: UIMA CAS representation, now aligned with the XMI standard. The artifact (e.g., a document) reads “Fred Center is the CEO of Center Micros”. Layered above it are the analysis results (i.e., artifact metadata): Parser (NP, VP, PP), Named Entity (Person, Organization), and Relationship CeoOf (Arg1: Person, Arg2: Org). Chart by IBM]
96. Commercial Tools
✤ Oracle Data Mining (Text Mining)
✤ IBM SPSS
✤ SAS Text Miner
✤ Smartlogic
✤ Lots of acquisitions going on in the “big data” space
✤ HP acquired Autonomy
✤ Oracle acquired Endeca
97. A Note on Tools
✤ UIMA and GATE – comprehensive suites of capabilities, with learning
curves.
✤ Commercial tools range from unstructured capabilities inside DBMSs
like Oracle, to Business Objects business intelligence tools (which
acquired Inxight from Xerox PARC).
✤ Your mileage will vary. The biggest differentiator is your knowledge
of your data.