Named Entity Recognition for Hindi-English Code-Mixed Twitter Text
1. BHUVANESH KACHAVE (23)
AMOGH KAWLE (25)
SAGAR TIVREKAR (74)
2. Introduction
Speakers often switch back and forth between languages when
speaking or writing, mostly in informal settings. This interchange
of languages involves complex grammar, and the terms “code
switching” and “code mixing” are used to describe it.
Code-mixing refers to the use of linguistic units from different
languages in a single utterance or sentence, whereas code
switching refers to the co-occurrence of speech extracts
belonging to two different grammatical systems.
3. Problem Statement
The problem of code-mixed entity extraction comprises two
sub-problems: entity extraction and entity classification.
Mathematically the problem of code-mixing entity extraction can be
described
4. Scope
It is used to analyze Twitter data (tweets), and this
analysis is useful during elections (for the government).
You can generate exit polls: the project takes tweets from
Twitter and presents them as charts, polls and
election-related data.
Using this project, you can find and represent any type of
event trending on Twitter from user tweets.
5. How It Works
Extract information from tweets on Twitter, or let the user
provide a dataset of tweets.
Take that dataset of tweets and filter out tweets in
other languages.
Identify the persons, locations, etc. in the tweets and tag them
in the graphical representation of the application.
Display the results in graphical format.
7. Algorithm
Start
Get tweets from Twitter or load a dataset of tweets.
Filter important tweets using machine learning
algorithms.
Tag tweets with locations, names, etc.
Display the graphical representation of the analyzed
dataset as the result.
End
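The tagging and charting steps above can be sketched as follows. This is only an illustrative sketch, assuming the tagger emits BIO-style tags such as B-PER, I-PER and B-LOC; the function names extract_entities and count_entity_types, the LOC tag and the sample tweet are hypothetical and not taken from the project.

```python
from collections import Counter


def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


def count_entity_types(entities):
    """Count entities per type; these counts are what a chart would plot."""
    return Counter(entity_type for _, entity_type in entities)


# Hypothetical tagged tweet (Romanized Hindi-English).
tokens = ["narendra", "modi", "ne", "delhi", "mein", "rally", "ki"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O", "O"]
entities = extract_entities(tokens, tags)
counts = count_entity_types(entities)
```

The resulting counts could then be passed to a plotting library such as matplotlib to produce the graphical output.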
9. Conditional Random Field (CRF)
For sequence labeling tasks, it is beneficial to consider the
correlations between neighboring labels and jointly
decode the best chain of labels for a given input sentence.
For example, in POS tagging an adjective is more likely to be
followed by a noun than a verb, and in NER with standard BIO2
annotation I-ORG cannot follow I-PER.
Therefore, we model the label sequence jointly using a conditional
random field (CRF) instead of decoding each label independently.
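To illustrate why joint decoding matters, here is a minimal sketch of Viterbi decoding over BIO2 labels. It is a simplification, not the project's model: the per-token scores are made-up numbers, and transitions are scored 0 when allowed and excluded when forbidden, whereas a trained CRF learns real-valued transition weights. The names LABELS, allowed and viterbi are hypothetical.

```python
import math

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]


def allowed(prev, curr):
    """BIO2 constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in ("B-" + entity, "I-" + entity)
    return True


def viterbi(emissions, labels=LABELS):
    """Return the highest-scoring valid label chain.

    emissions: one dict per token mapping label -> score
    (labels missing from a dict score 0.0).
    """
    # A sentence cannot start with an I- label.
    best = {}
    for label in labels:
        score = emissions[0].get(label, 0.0)
        if label.startswith("I-"):
            score = -math.inf
        best[label] = (score, [label])
    for scores in emissions[1:]:
        new_best = {}
        for curr in labels:
            # Best previous label among those allowed before curr.
            prev = max(
                (p for p in labels if allowed(p, curr)),
                key=lambda p: best[p][0],
            )
            prev_score, prev_path = best[prev]
            new_best[curr] = (prev_score + scores.get(curr, 0.0), prev_path + [curr])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Made-up scores for a two-token sentence.
emissions = [
    {"B-PER": 2.0, "O": 0.5},
    {"I-ORG": 1.5, "I-PER": 1.2, "O": 0.3},
]
independent = [max(s, key=s.get) for s in emissions]  # per-token argmax
joint = viterbi(emissions)                            # constrained chain
```

Here the per-token argmax picks I-ORG directly after B-PER, which is invalid under BIO2, while the joint decoder returns a chain that respects the constraint.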
11. Pre-processing
This step makes the data uniform, which benefits
our system. The preprocessing step consists of:
Removing noisy tweets
Separating links from tweets
Tokenization
Separating words that are joined together
(e.g. Modi.ji.Ke.Liye becomes 'Modi ji Ke Liye')
Converting to lowercase
Token encoding (mapping of tokens to their tags)
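Some of the steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: tweets are Romanized text, links start with http(s):// or www., and joined words are separated by dots. The function name preprocess and the regular expressions are illustrative choices, not the project's actual rules; noisy-tweet removal and token encoding are left out because the slides do not define them.

```python
import re

# Assumed link pattern: anything starting with http://, https:// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def preprocess(tweet):
    """Return (tokens, links) for a single tweet."""
    # Separate links from the tweet text.
    links = URL_RE.findall(tweet)
    text = URL_RE.sub(" ", tweet)
    # Split words joined by dots, e.g. "Modi.ji.Ke.Liye" -> "Modi ji Ke Liye".
    # Caveat: this also splits decimal numbers such as "2.6".
    text = re.sub(r"(?<=\w)\.(?=\w)", " ", text)
    # Convert to lowercase.
    text = text.lower()
    # Tokenize: keep hashtags and mentions whole, split off punctuation.
    tokens = re.findall(r"[#@]?\w+|[^\w\s]", text)
    return tokens, links


tokens, links = preprocess("Modi.ji.Ke.Liye vote karo https://t.co/abc")
```

For the sample tweet, the link is returned separately and the dotted phrase is split into four lowercase tokens.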
12. Technology To Be Used
This project will be a desktop-based application developed
using Python and machine learning, running on a Windows PC.
Front End: Java, Python and Machine Learning
Back End: Solr Database (Banana)
13. Hardware And Software Requirements
This project will be a desktop-based application
developed using Python, running on a Windows PC.
Hardware Requirement:
64-bit operating system (Windows, Linux, etc.)
4 GB RAM minimum (8 GB preferred)
Intel i3 3200k and above with more than 2.6 GHz
Display with at least 60 Hz refresh rate
14. Hardware And Software Requirements
Software Requirement:
Programming language: Python 3.5
Machine learning library: scikit-learn (0.19.1)
Python packages: pandas (0.20.0) for data
processing, numpy (1.14.3) for data manipulation,
matplotlib (2.2.2) and seaborn (0.8.1) for
visualisation
IDE: Spyder, Jupyter Notebook, Google Colab
Database: Solr Database
Other: Anaconda 4.5.4