Linguistic component Tokenizer for the Russian language


A low-level component that extracts individual tokens (words) from an input text. Both generic texts and Twitter messages are supported.


Technical description
SemanticAnalyzer Group, 2013-08-29
www.semanticanalyzer.info

This document describes the technical details of a tokenizer for the Russian language. The component has two modes of operation:

  • Processing of generic texts: news, technical articles, etc.
  • Processing of Twitter messages

The demo package, sent upon request, contains the following:

  • A Java library of the tokenizer in binary form
  • The run_tokenizer.sh script for quickly checking the functionality of the module
  • The messages_to_tokenize.txt file, containing examples of generic text and tweets to tokenize with the run_tokenizer.sh script

The algorithm is based on a set of rules, implemented using Flex (JFlex), which extract individual tokens from a text stream.

Speed of processing

Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04, Java 1.7.0_21 64-bit server

    38497 characters/ms
    5158 tokens/ms

Tests were conducted in a single thread.

Format of the messages_to_tokenize.txt file

This file describes input data for the tokenizer module for demo purposes. Format:

    Text\tText type

  • Text: textual data in Russian for tokenization
  • \t: tab symbol
  • Text type: supported values are GENERAL_TEXT and TWITTER

Examples of tokenization

The run_tokenizer.sh script generates the file messages_to_tokenize.out. For the following input file messages_to_tokenize.txt (a tab separates the text from the text type):

    :)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER

the following output is generated:
    :)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER
    emopostkn, type: ALPHANUM
    this, type: ALPHANUM
    is, type: ALPHANUM
    it, type: ALPHANUM
    !, type: PUNCT
    #По_русски, type: TWITTER_HASHTAG
    @dm, type: TWITTER_USERNAME
    emopostkn, type: ALPHANUM
    www.test.com/x?y, type: HYPERLINK

Examples of using the library from the Java code

    import java.io.StringReader;

    // Tokenize a tweet; the same Token object is reused across iterations.
    Tokenizer twitterTokenizer = new TwitterFlexTokenizer(new StringReader("#ht. done!"), true);
    Token reusableToken = Token.newReusableToken();
    // getNextToken returns null once the input is exhausted.
    while ((reusableToken = twitterTokenizer.getNextToken(reusableToken)) != null) {
        System.out.println(reusableToken);
    }

Output:

    Token[text=#ht,type=TWITTER_HASHTAG]
    Token[text=done,type=ALPHANUM]
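The input format of messages_to_tokenize.txt described above (text, a tab, then the text type) can be read with a few lines of plain Java. The following is only a minimal sketch: the class DemoInputParser, its nested DemoLine and TextType types, and the parseLine method are hypothetical names introduced for illustration and are not part of the tokenizer library. It also assumes that the text type is the last tab-separated field on the line.

```java
// Hypothetical helper, NOT part of the tokenizer library: parses demo input
// lines of the form "Text<TAB>Text type".
public class DemoInputParser {

    // The two text types named in the format description.
    public enum TextType { GENERAL_TEXT, TWITTER }

    // One parsed line: the text to tokenize and its declared type.
    public static final class DemoLine {
        public final String text;
        public final TextType type;

        DemoLine(String text, TextType type) {
            this.text = text;
            this.type = type;
        }
    }

    // Splits on the last tab (an assumption), so tabs inside the text survive.
    public static DemoLine parseLine(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Missing tab separator: " + line);
        }
        String text = line.substring(0, tab);
        // Enum.valueOf throws IllegalArgumentException for an unsupported type.
        TextType type = TextType.valueOf(line.substring(tab + 1).trim());
        return new DemoLine(text, type);
    }

    public static void main(String[] args) {
        DemoLine parsed = parseLine("#ht. done!\tTWITTER");
        System.out.println(parsed.type + " | " + parsed.text);
    }
}
```

A line parsed this way yields the text to hand to the tokenizer and a type that could be used to choose between the generic and Twitter modes; how the library itself makes that choice is not described in this document.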
