The Pullenti NER Engine in the
Tasks of Semantic Similarity
Establishment
Kuznetsov K.I. (k.smith@mail.ru),
Kozerenko E.B. (kozerenko@mail.ru),
Romanov D.A. DRomanov@it.ru
FRC “Computer Science and Control”
NRU HSE
AINL FRUCT 2016
суббота, 12 ноября 16 г.
Semantics-Oriented Linguistic
Processor (SOLP)
• Establishing text segments containing
the similar semantic units for the tasks
of analytical text processing within the
semantic technology platform.
• The methods and instruments
presented provide the discovery of
relevant content based on users'
focused interests within a certain
domain.
2
суббота, 12 ноября 16 г.
Basic Ideas
• The hybrid approach comprising
linguistic rules and example-based
learning techniques is employed.
• The Pullenti-based engine is specified,
the two-step Semantic Expansion
Algorithm is presented.
• The legal and mass media texts are
considered. 3
суббота, 12 ноября 16 г.
Pullenti Engine Features
• Pullenti is used as the engine for
selecting entities and structuring
information.
• The feature set comprises common
morphological algorithms and parsing.
Modules are focused on the
identification of specific entities.
суббота, 12 ноября 16 г.
Linguistic Processing
• The system integrates the seed feature-
value unification grammar which is
extensible by creating new rule-based
modules and incremental machine-
learning components extracting phrase
structures from texts under study.
суббота, 12 ноября 16 г.
Pullenti Implementation
• The system is implemented as a
software development kit (SDK) for
information systems development
dealing with unstructured data, i.e.
texts in natural languages, for use in
the systems developed in .NET
environment. For Mac and Linux
systems it is running on Mono
platform.
суббота, 12 ноября 16 г.
Software Development Kit
• The SDK is completely written in C #
and .NET, it is a set of assemblies
of .NET Framework (2.0 and above).
Third-party extensions are not used.
• If necessary, one can take the form of a
standard service for unstructured
information processing UIMA
(Unstructured Information
Management Architecture).
суббота, 12 ноября 16 г.
Semantics Expansion
Algorithm
• The two steps of the Semantics Expansion
Algorithm (SEA):
• a text of a newly received application to
the Constitutional Court (CC) is matched
to a list of links to the profile of the
applications traced by links to Normative
Legal Acts (NLA) and legal norms (LN)
(meaningful terms and bigrams).
суббота, 12 ноября 16 г.
The Pullenti-based technology
• The Pullenti-based technology is a
novel development designed from
scratch four years ago:
• http://semantick.ru/
• http://pullenti.ru/DemoPage.aspx
• http://semantick.ru/Ner.aspx
суббота, 12 ноября 16 г.
Performance
• The work speed can be very
approximately estimated as
60.000-80.000 characters per second
on a serial computer model. The
maximum amount of text on a 32-bit
computer that can be processed at one
time is no more than 20 MB. On a 64-
bit processor the result depends on the
size of the operational memory.
суббота, 12 ноября 16 г.
Evaluation
• The Precision (P), Recall (R) and F-
measure demonstrated by Pullenti for
the NER tasks are as follows:
• P = 0.7957-0.9663;
• R = 0.7091-0.8675;
• F-measure = 0.8083-0.8570.
•
суббота, 12 ноября 16 г.
Competition Results
• The Pullenti-based engine participated
in the NER competition organized for
the Dialog 2016 Conference
(nickname Pink):
• http://www.dialog-21.ru/media/3430/
starostinaetal.pdf
суббота, 12 ноября 16 г.
Terms of Use
• The system is free for non-commercial
use and it can be downloaded. For
commercial use the SDK is shipped
without restrictions on the number of
end users and installations:
http://semantick.ru/Technology.aspx
суббота, 12 ноября 16 г.
AINL Prototyping Environment
•www.ipiranlogos.com
суббота, 12 ноября 16 г.
• Thank you!
суббота, 12 ноября 16 г.

AINL 2016: Kozerenko

  • 1.
    The Pullenti NEREngine in the Tasks of Semantic Similarity Establishment Kuznetsov K.I. (k.smith@mail.ru), Kozerenko E.B. (kozerenko@mail.ru), Romanov D.A. DRomanov@it.ru FRC “Computer Science and Control” NRU HSE AINL FRUCT 2016 суббота, 12 ноября 16 г.
  • 2.
    Semantics-Oriented Linguistic Processor (SOLP) •Establishing text segments containing the similar semantic units for the tasks of analytical text processing within the semantic technology platform. • The methods and instruments presented provide the discovery of relevant content based on users' focused interests within a certain domain. 2 суббота, 12 ноября 16 г.
  • 3.
    Basic Ideas • Thehybrid approach comprising linguistic rules and example-based learning techniques is employed. • The Pullenti-based engine is specified, the two-step Semantic Expansion Algorithm is presented. • The legal and mass media texts are considered. 3 суббота, 12 ноября 16 г.
  • 4.
    Pullenti Engine Features •Pullenti is used as the engine for selecting entities and structuring information. • The feature set comprises common morphological algorithms and parsing. Modules are focused on the identification of specific entities. суббота, 12 ноября 16 г.
  • 5.
    Linguistic Processing • Thesystem integrates the seed feature- value unification grammar which is extensible by creating new rule-based modules and incremental machine- learning components extracting phrase structures from texts under study. суббота, 12 ноября 16 г.
  • 6.
    Pullenti Implementation • Thesystem is implemented as a software development kit (SDK) for information systems development dealing with unstructured data, i.e. texts in natural languages, for use in the systems developed in .NET environment. For Mac and Linux systems it is running on Mono platform. суббота, 12 ноября 16 г.
  • 7.
    Software Development Kit •The SDK is completely written in C # and .NET, it is a set of assemblies of .NET Framework (2.0 and above). Third-party extensions are not used. • If necessary, one can take the form of a standard service for unstructured information processing UIMA (Unstructured Information Management Architecture). суббота, 12 ноября 16 г.
  • 8.
    Semantics Expansion Algorithm • Thetwo steps of the Semantics Expansion Algorithm (SEA): • a text of a newly received application to the Constitutional Court (CC) is matched to a list of links to the profile of the applications traced by links to Normative Legal Acts (NLA) and legal norms (LN) (meaningful terms and bigrams). суббота, 12 ноября 16 г.
  • 9.
    The Pullenti-based technology •The Pullenti-based technology is a novel development designed from scratch four years ago: • http://semantick.ru/ • http://pullenti.ru/DemoPage.aspx • http://semantick.ru/Ner.aspx суббота, 12 ноября 16 г.
  • 10.
    Performance • The workspeed can be very approximately estimated as 60.000-80.000 characters per second on a serial computer model. The maximum amount of text on a 32-bit computer that can be processed at one time is no more than 20 MB. On a 64- bit processor the result depends on the size of the operational memory. суббота, 12 ноября 16 г.
  • 11.
    Evaluation • The Precision(P), Recall (R) and F- measure demonstrated by Pullenti for the NER tasks are as follows: • P = 0.7957-0.9663; • R = 0.7091-0.8675; • F-measure = 0.8083-0.8570. • суббота, 12 ноября 16 г.
  • 12.
    Competition Results • ThePullenti-based engine participated in the NER competition organized for the Dialog 2016 Conference (nickname Pink): • http://www.dialog-21.ru/media/3430/ starostinaetal.pdf суббота, 12 ноября 16 г.
  • 13.
    Terms of Use •The system is free for non-commercial use and it can be downloaded. For commercial use the SDK is shipped without restrictions on the number of end users and installations: http://semantick.ru/Technology.aspx суббота, 12 ноября 16 г.
  • 14.
  • 15.
    • Thank you! суббота,12 ноября 16 г.