Communicating Knowledge · Sentiment Analysis Symposium
Lessons Learned from a VOC Analysis System for a Big Korean Telecommunication Company
Ivan Berlocher, SALTLUX
Sentiment Analysis Symposium, Nov. 9th 2011

Introduction
• Saltlux Inc. is located in Seoul, Korea; established in 1979 and renovated in 2003.
• Domain expertise: Information Retrieval and Text/Data/Web/Graph Mining solutions and services based on Semantic Web technology.
• Main languages supported: Korean, Japanese, English. For other languages, external solutions are used.
• 70 employees in Seoul, one development center in Vietnam (12 employees), and one sales office in Japan (3 employees).
• Several partnerships with other companies/institutes:
– Ontoprise in Germany
– Franz in California
– DERI in Ireland
• Many R&D partnerships (ETRI, KAIST, universities…)

Table of Contents
• Project & Environment Description
– Needs of Customer
– System (Main) Requirements
• VOC Data
– Sample Data
– Data Analysis
• System Overview
• Korean Linguistics
• Sentiment Analysis
• Lessons Learned
• Future Work

Project & Environment Description
• Needs of Customer
– Customer: Korean telecommunication corporation
– Department: Voice of Customer (VOC) Analysis
– Mission: analyze (human-typed) memos from all call centers to identify major problems and produce reports for decision makers, in order to improve quality of service and increase customer satisfaction.
– Data: human-typed notes covering any kind of customer question
  • Information about subscriptions
  • Inquiries or complaints about devices (phones), services, or dealerships
  • Complaints about quality of communication
  • etc.
Number of notes: ~200 thousand a day (~5 million a month).
Notes must remain searchable for 1 year (~60 million).

Project & Environment Description
• System (Main) Requirements
– Distinguish between simple inquiries and complaints
– Classify notes into categories/departments of services
– Monitor trends of topics in real time, daily, weekly, and monthly
– Compare trends/tendencies between time slices
– Find related topics
– Manage personal vocabulary
– Anonymize personal data (people's names, telephone numbers, social IDs, addresses, etc.)
The project started in October 2010 with a 3-month POC (~10 man-months).
After acceptance (success), integration with the real system took another 3 months (~10 man-months).
2 phases: ~$200,000

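The anonymization requirement can be illustrated for the two most regular fields, phone numbers and social IDs. This is a minimal sketch, not the system's actual rules: the two patterns (Korean mobile numbers starting with 01X, and 13-digit IDs written as 6 + 7 digits) and the placeholder strings are assumptions made for the example. Names and addresses are not covered here, since they need named entity recognition rather than patterns.

```python
import re

# Hypothetical patterns, chosen for illustration only:
# 13-digit social IDs (6 digits, optional separator, 7 digits) and
# mobile numbers of the form 01X-XXXX-XXXX with optional separators.
SOCIAL_ID = re.compile(r"(?<!\d)\d{6}[- ]?[1-4]\d{6}(?!\d)")
MOBILE_PHONE = re.compile(r"(?<!\d)01[016789][-. ]?\d{3,4}[-. ]?\d{4}(?!\d)")


def anonymize(note: str) -> str:
    """Mask social IDs first (the longer pattern), then phone numbers."""
    note = SOCIAL_ID.sub("<ID>", note)
    return MOBILE_PHONE.sub("<PHONE>", note)


print(anonymize("연락처: 010-1234-5678, 주민번호 800101-1234567"))
```

Digit lookarounds are used instead of `\b` because Hangul characters count as word characters, so a number typed directly after Korean text has no word boundary in front of it.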
VOC Data Sample
[Figure: sample VOC data]

VOC Data Sample
• Data often contains some structured information (metadata), but without any standard.
• Most of the time there is no particular mark/metadata, which makes Named Entity Recognition more complex.
[Figure: the same information entered in many different ways (연락처: phone number)]

VOC Data Analysis
• Data contains lots of named entities (products, services, people, social IDs, phone numbers), often privacy related
• Data contains lots of technical (domain) terms
• The real content to analyze is mostly very short (tweet-like) but sometimes very long
• Lots of misspelling/mistyping
• Korean (Asian) segmentation problem, amplified by the speed constraint
• Lots of (non-standard) abbreviations

System Overview
• Overall Architecture (diagram components)
– Text Segmentation
– Morphological Analyzer
– Chunk/Phrase Identification
– Named Entities Recognition
– Synonyms & Normalization
– Indexing / Distributed Indexes
– Classifier (Hybrid SVM & Rules)
– Complaint Detector
– Analysis Phase
– Searching/Clustering (TopicRank)
– Timelines Dumper
– DFS timelines: 20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df
– Scheduler
– Merger & Ranker
– Trend (Top N) DB
– Web Server (Web UI)
In the real system, indexing has been parallelized across 18 Linux machines for speed.

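One way to picture the Merger & Ranker step is as summing per-shard keyword counts for a time slot and keeping the top N. The sketch below rests on assumptions: the file-name pattern `<YYYYMMDD>_<HHMM>_<shard>.df` is inferred from the names on the slide, the content of the `.df` files is not described there, and the in-memory counts and function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def parse_dump_name(name: str):
    """Split e.g. '20110713_0700_1.df' into (time slot, shard number).

    Assumes the pattern <YYYYMMDD>_<HHMM>_<shard>.df, inferred from the slide.
    """
    stem = name.rsplit(".", 1)[0]
    day, clock, shard = stem.split("_")
    return datetime.strptime(day + clock, "%Y%m%d%H%M"), int(shard)


def merge_top_n(shard_counts, n):
    """Sum keyword counts from all shards of one time slot and rank them."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)  # adds counts keyword by keyword
    return total.most_common(n)


# Hypothetical keyword counts for the three shards of the 07:00 slot.
shards = [
    {"billing": 4, "roaming": 1},
    {"billing": 2, "signal": 5},
    {"signal": 2},
]
print(parse_dump_name("20110713_0700_1.df"))
print(merge_top_n(shards, 2))
```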
System Overview
• Home page [screenshot]

System Overview
• Top N Keywords Extraction [screenshot]

System Overview
• Related Keywords (Word Clustering) [screenshot]

Korean Linguistics
• Brief introduction
Korean is written with an alphabet of consonants and vowels, composed into blocks of consonant/vowel or consonant/vowel/consonant.
"나는 학생입니다."
=> 나 = ㄴ (N) + ㅏ (A) = NA
=> 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK
One consonant/vowel or consonant/vowel/consonant unit is a syllable, and words are composed of several syllables. (The space-delimited unit of text is called an "eojeol".)
Basic grammar:
A word is composed of one root (noun, adjective/verb) followed by an inflection marking either the grammatical role (subject/object/location etc.) for nouns (called "josa") or the aspect/mood (tense, honorific form etc.) for verbs/adjectives (called "eomi").

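The composition shown above can be reproduced from Unicode alone, because precomposed Hangul syllables are laid out arithmetically starting at U+AC00. The following is a minimal sketch; the function name `decompose` is ours, and it returns compatibility jamo (the standalone letters used on this slide).

```python
HANGUL_BASE = 0xAC00          # code point of the first syllable, 가
SYLLABLE_COUNT = 11172        # 19 initials x 21 vowels x 28 finals
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
# Index 0 means "no final consonant".
FINALS = ["", *"ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"]


def decompose(syllable: str):
    """Return (initial consonant, vowel, final consonant or '')."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


print(decompose("나"))  # consonant/vowel
print(decompose("학"))  # consonant/vowel/consonant
```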
Korean Linguistics
• Examples:
"나는 학생입니다."
=> "나는" = "나" (na: I/me) + "는" (neun: topic marker)
=> "학생입니다" = "학생" (hak-saeng: student) + "입니다" (im-ni-da: am)
=> I'm (a) student.
Lots of (composite) inflectional forms:
학생 + 입니다 = Noun + Be
학생 + 인/이예요/이다/입니까?/인데/인데요 etc. (was, will be …) (eomi)
학생 + syntactic role (이: subject / 에게: to / 한테: from / 을: object) etc. (josa)
Korean is highly agglutinative (it concatenates prefixes/nouns/josa/eomi):
Search engine: 검색엔진
High performance search engine: 고성능검색엔진
But the use of spaces is free/arbitrary. One can equivalently write 검색엔진 or 검색 엔진.
Especially on SNS, on space-limited devices, or under speed constraints (such as real-time transcription of conversations), spaces are increasingly omitted or misused.
=> Need for Automatic Segmentation Correction.

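To make the root + josa/eomi structure concrete, here is a deliberately naive longest-suffix splitter. It is only a sketch: the ending lists contain just the forms quoted on this slide (plus 는), the function name is hypothetical, and a real morphological analyzer needs a lexicon to avoid false splits (for example, nouns that happen to end in 이).

```python
# Endings taken from the examples on the slide; far from complete.
JOSA = ["는", "이", "에게", "한테", "을"]
EOMI = ["입니다", "입니까", "이예요", "이다", "인데요", "인데", "인"]


def split_ending(eojeol: str):
    """Return (root, ending, kind) using longest-suffix matching."""
    candidates = [(e, "eomi") for e in EOMI] + [(j, "josa") for j in JOSA]
    # Try longer endings first so that 인데요 wins over 인데 and 인.
    candidates.sort(key=lambda pair: len(pair[0]), reverse=True)
    for ending, kind in candidates:
        if eojeol.endswith(ending) and len(eojeol) > len(ending):
            return eojeol[: -len(ending)], ending, kind
    return eojeol, "", "none"


for unit in ["나는", "학생입니다", "학생에게", "학생인데요"]:
    print(unit, "=>", split_ending(unit))
```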
Korean Linguistics
• Automatic Segmentation Correction Implementation
Binary classification approach: tag each syllable as preceded by a space or not.
Any kind of classifier can be used. Here we use a CRF model (could be an SVM) with the following set of features.
Example input: 프랑스의 세계적인 디자이너 …
• Features
– 1-gram, 2-gram, 3-gram, 4-gram of characters (syllables)
– Korean or not, contains number
• Evaluation
– Accuracy at character level: 96.25%
– Precision at word level: 95.58% (# correctly spaced words / # words produced by system)
• Very simple to train (easy to get huge amounts of data)
• No need for a lexicon or any lexical information
• Performs surprisingly well

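The tagging formulation and the two evaluation measures can be sketched without any CRF library. Everything below is illustrative: the exact feature templates (left/right n-grams around each position) are an assumption, since the slide only lists the n-gram sizes and the two flags, and the function names are ours. The CRF training itself is left out.

```python
def to_labels(spaced: str):
    """Turn spaced text into syllables plus a 0/1 'space before' label each."""
    chars, labels, space_pending = [], [], False
    for ch in spaced:
        if ch == " ":
            space_pending = True
            continue
        chars.append(ch)
        labels.append(1 if space_pending and len(chars) > 1 else 0)
        space_pending = False
    return chars, labels


def apply_labels(chars, labels):
    """Rebuild a spaced string from syllables and (predicted) labels."""
    return "".join((" " if lab else "") + ch for ch, lab in zip(chars, labels))


def features(chars, i):
    """Character n-grams (n = 1..4) left and right of position i, plus flags."""
    feats = {}
    for n in range(1, 5):
        feats[f"left{n}"] = "".join(chars[max(0, i - n):i])
        feats[f"right{n}"] = "".join(chars[i:i + n])
    feats["is_korean"] = "\uac00" <= chars[i] <= "\ud7a3"
    feats["has_digit"] = chars[i].isdigit()
    return feats


def word_spans(labels):
    """Set of (start, end) syllable offsets of the words implied by labels."""
    starts = [0] + [i for i, lab in enumerate(labels) if lab and i > 0]
    return set(zip(starts, starts[1:] + [len(labels)]))


def char_accuracy(gold, pred):
    """Share of syllables whose space/no-space tag is correct."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def word_precision(gold, pred):
    """# correctly spaced words / # words produced by the system."""
    produced = word_spans(pred)
    return len(produced & word_spans(gold)) / len(produced)


chars, gold = to_labels("프랑스의 세계적인 디자이너")
_, pred = to_labels("프랑스의 세계 적인 디자이너")  # one wrongly inserted space
print(char_accuracy(gold, pred), word_precision(gold, pred))
```

Training data comes for free with this setup: any correctly spaced corpus yields labels through `to_labels`, which is why no lexicon is needed.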
Korean Linguistics
• Transliteration
– Korean uses more and more English-derived words, transliterated phonetically into the Korean alphabet (the reverse of "Romanization"), especially for foreign names (companies, products, people, technical/domain terms).
– Transcription is non-unique and non-standard.
Examples:
tablet: 태블릿, 타블렛, 테블릿
Hitachi: 히타치, 히타찌, 히다찌
iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스

Korean Linguistics
• Automatic transliteration recognition
– Rule-based matching built on phonetic transliteration, acting similarly to Soundex and adapted for Korean pronunciation.
tablet, 태블릿
T => ㅌ/ㄸ/ㄷ
A => ㅏ/ㅓ/ㅔ/ㅐ
etc.
– This method has high recall but low precision and needs post-processing filters (remove known Korean words found in lexicons, remove too-short nouns, etc.).
– Results have to be corrected by humans, so an efficient workbench is needed for productivity.
– Gathered a dictionary of 130 thousand entries, mainly IT oriented.
– More academic research is still needed to solve this problem.

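A Soundex-like key can show why such matching has high recall and low precision. The sketch below works in the Korean-to-key direction and is not the rule set used in the project: only the ㅌ/ㄸ/ㄷ grouping comes from the slide, while the other consonant groups, the choice to collapse every vowel into a single symbol, and the function name are assumptions.

```python
HANGUL_BASE = 0xAC00
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
FINALS = ["", *"ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"]

# Consonants that transliteration variants tend to confuse share one code.
# Only the "T" group is from the slide; the rest is hypothetical.
GROUPS = {"T": "ㅌㄸㄷ", "K": "ㄱㄲㅋ", "P": "ㅂㅃㅍ", "J": "ㅈㅉㅊ", "S": "ㅅㅆ"}
CONSONANT_CLASS = {jamo: code for code, group in GROUPS.items() for jamo in group}


def phonetic_key(word: str) -> str:
    """Map each Hangul syllable to consonant classes; every vowel becomes 'V'."""
    key = []
    for ch in word:
        offset = ord(ch) - HANGUL_BASE
        if not 0 <= offset < 11172:
            continue  # skip spaces, digits, Latin letters
        initial = INITIALS[offset // 588]
        final = FINALS[offset % 28]
        if initial != "ㅇ":  # initial ㅇ is silent
            key.append(CONSONANT_CLASS.get(initial, initial))
        key.append("V")
        if final:
            key.append(CONSONANT_CLASS.get(final, final))
    return "".join(key)


for variants in (["태블릿", "타블렛", "테블릿"], ["히타치", "히타찌", "히다찌"]):
    print({v: phonetic_key(v) for v in variants})
```

Because all vowels collapse into one symbol, unrelated native words can share a key with a loanword, which is where the lexicon-based filtering and human correction described above come in.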
Sentiment Analysis
• Complaint Detection
A problem similar to standard subjectivity detection (detecting whether a sentence is sentiment bearing or not).
Simple approach: binary classification using an SVM, with manually tagged training/test corpora (more than 20 thousand notes).
Feature space: n-grams of characters (syllables) + n-grams of words; using 2- to 4-grams gave the best results.
Feature selection is important to reduce the size of the feature space; Chi-square/Information Gain gave the best results.

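The feature side of this approach, character 2- to 4-grams scored by chi-square, fits in a few lines of plain Python. This is a sketch under assumptions: the four toy notes are invented English stand-ins for the Korean memos, the function names are ours, and the SVM training step is omitted.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Set of character n-grams of length n_min..n_max found in text."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}


def chi_square(a, b, c, d):
    """Chi-square of a 2x2 table.

    a/b: documents containing the feature (complaints / others),
    c/d: documents without the feature (complaints / others).
    """
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return total * (a * d - b * c) ** 2 / denom if denom else 0.0


def chi_square_scores(docs, labels):
    """Score every n-gram by its association with the complaint label (1)."""
    present = {0: Counter(), 1: Counter()}
    for text, label in zip(docs, labels):
        present[label].update(char_ngrams(text))  # document frequency
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for gram in set(present[0]) | set(present[1]):
        a, b = present[1][gram], present[0][gram]
        scores[gram] = chi_square(a, b, n_pos - a, n_neg - b)
    return scores


# Invented toy corpus: 1 = complaint, 0 = simple inquiry.
docs = [
    "signal keeps dropping, very angry",
    "bill is wrong again, angry",
    "how do I change my plan",
    "what is the price of the plan",
]
labels = [1, 1, 0, 0]
scores = chi_square_scores(docs, labels)
print(sorted(scores, key=scores.get, reverse=True)[:5])
```

Keeping only the highest-scoring n-grams shrinks the feature space before training; n-grams that occur in every note, such as "an" here, score zero and are dropped.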
Sentiment Analysis
Problem: no freely available resource such as SentiWordNet. We need to build it!
– Built our own general-domain dictionary as a baseline: 20,000 verbs/adjectives classified as positive/negative/neutral.
– The result is a lexicon of ~5,000 entries (positive/negative only).
– Enriched with manually extracted features from n-grams.
Precision oriented (92%) but recall is still quite low (75%). Overall accuracy: 85%.
=> Still working on ways to improve recall without sacrificing precision.
Basic ideas:
– Bagging/Boosting (combining several classifiers)
– Hybrid models combining (linguistic: semantic/syntactic) rules and machine learning (statistics)

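The precision/recall trade-off quoted above can be folded into a single F1 figure. F1 is not reported on the slide; the snippet only derives it from the two stated numbers.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision 92% and recall 75%, as reported on the slide.
print(round(f1_score(0.92, 0.75), 3))
```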
Lessons Learned
– There is still quite a big gap between customer expectations and reality. We need to explain this and keep customers involved in the assessment process and in knowledge/domain vocabulary acquisition.
– Need to acquire a lot of lexicons: named entities, synonyms, stopwords, sentiment words.
– The quality and quantity of these lexicons is a real asset for the company. Acquiring lexicons requires workbenches for efficient semi-supervised methods (manually filtering the output of automatic methods) to reduce costs.
– Tuning classifier parameters, feature extraction, linguistic knowledge etc. consumes time and expertise.
– Simple academic methods work quite well (even if they need a lot of tuning).
– Beyond a simple search engine, the quality of NLP components becomes more and more important, especially for Sentiment Analysis.

Lessons Learned
– Customers are more and more interested in "Big Data", "Listening Platform", "Cloud", "Social Network/Intelligence"…
– More and more customers want to get data/opinions from outside their own systems (blogs, communities (BBS), tweets etc.). Typical questions:
  • How many crawlers are needed to crawl all Korean tweets/blogs?
  • How about crawling Facebook?
  • How to identify "anti communities" (like "Anti-Samsung")? Who are the power bloggers?
The solutions required go far beyond Sentiment Analysis, but customers often can't afford, or don't want, crawling infrastructure and maintenance fees.
This opens new opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS (Software/Platform/Infrastructure as a Service).
Even in the enterprise, a distributed framework is required (not only for web-scale services).
– Customers (at least in Korea) love knowing the technology and are increasingly high-level users. They buy not only solutions but also consulting/expertise.
– Projects are more and more expensive, and many require benchmarks or a POC.

Future Work & Plan
• Future Work (ongoing)
– Acquire more entries for the sentiment dictionary
– Make a framework for handling linguistic rules and statistical models (SVM/Rocchio)
– Coupling with antonyms and/or hints
– Better handling of negation
– Better workbench for faster acquisition / (re-)training
– Co-reference resolution
– (Full/semi) parsing?
– More complex models than binary classification?
– Building/maintaining a platform for PaaS/SaaS
A long, long way to go…