VOC real world enterprise needs

VOC sentiment analysis korean language processing morphological analysis CRF

Transcript

  • 1. Communicating Knowledge
      Sentiment Analysis Symposium
      Lessons Learned from a VOC Analysis System for a Big Korean Telecommunication Company
      Ivan Berlocher, SALTLUX
      Sentiment Analysis Symposium, Nov. 9th 2011
  • 2. Introduction
      • Saltlux Inc. is located in Seoul, Korea; established in 1979 and renovated in 2003.
      • Domain of expertise: information retrieval and text/data/web/graph mining solutions and services based on Semantic Web technology.
      • Main languages supported: Korean, Japanese, English. Other languages are handled with external solutions.
      • 70 employees in Seoul, one development center in Vietnam (12 employees), one sales office in Japan (3 employees).
      • Several partnerships with other companies and institutes:
          – Ontoprise in Germany
          – Franz in California
          – DERI in Ireland
      • Many R&D partnerships (ETRI, KAIST, universities…)
  • 3. Table of Contents
      • Project & Environment Description
          – Needs of Customer
          – System (Main) Requirements
      • VOC Data
          – Sample Data
          – Data Analysis
      • System Overview
      • Korean Linguistic
      • Sentiment Analysis
      • Lessons Learned
      • Future Work
  • 4. Project & Environment Description
      • Needs of Customer
          – Customer: Korean telecommunication corporation
          – Department of Voice of Customer (VOC) Analysis
          – Mission: analyze human-typed memos from all call centers to identify major problems and produce reports for decision makers, in order to improve quality of service and increase customer satisfaction.
          – Data: human-typed notes covering any kind of question from customers:
              • Information about subscriptions
              • Inquiries or complaints about devices (phones), services, or dealerships
              • Complaints about quality of communication
              • etc.
      Number of notes: ~200 thousand a day (~5 million a month).
      Notes must remain searchable for 1 year (~60 million).
  • 5. Project & Environment Description
      • System (Main) Requirements
          – Distinguish between simple inquiries and complaints
          – Classify notes into service categories/departments
          – Monitor topic trends in real time, daily, weekly, and monthly
          – Compare trends between time slices
          – Find related topics
          – Manage personal vocabulary
          – Anonymize personal data (names, telephone numbers, social IDs, addresses, etc.)
      The project started in October 2010 with a 3-month POC (~10 MM).
      After acceptance (success), integration with the real system took another 3 months (~10 MM).
      Two phases: ~$200,000
  • 6. VOC Data Sample
  • 7. VOC Data Sample
      • Data often contain some structured information (metadata), but without any standard.
      • Most of the time, however, there is no particular marker or metadata, which makes Named Entity Recognition more complex.
      Figure annotation: many different ways of entering the same information (연락처: phone number).
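The point about one field being typed in many different ways can be illustrated with a minimal sketch of phone-number masking for the anonymization requirement. The regular expression, the `<PHONE>` placeholder, and the sample strings are assumptions made for illustration; the slides do not describe how the real system recognized or masked numbers.

```python
import re

# Hypothetical pattern for Korean mobile numbers (010/011/016/017/018/019),
# tolerating hyphens, dots, spaces, or no separator at all.
PHONE_RE = re.compile(r"(?<!\d)01[016789][-. ]?\d{3,4}[-. ]?\d{4}(?!\d)")


def mask_phones(note: str, placeholder: str = "<PHONE>") -> str:
    """Replace every phone-like span in a note with a placeholder."""
    return PHONE_RE.sub(placeholder, note)


if __name__ == "__main__":
    # Invented sample notes, not taken from the customer data.
    for raw in ["연락처: 010-1234-5678", "연락처 01012345678", "tel 010 123 4567"]:
        print(mask_phones(raw))
```

A single pattern like this only covers one entity type; names and addresses would need lexicons or a trained recognizer, as the Named Entity Recognition remarks suggest.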
  • 8. VOC Data Analysis
      • Data contain lots of named entities (products, services, people, social IDs, phone numbers), often privacy-related
      • Data contain lots of technical (domain) terms
      • The real content to analyze is mostly very short (tweet-like) but sometimes very long
      • Lots of misspellings and typing errors
      • Korean (Asian) segmentation problem, amplified by the speed constraint
      • Lots of (non-standard) abbreviations
  • 9. System Overview
      • Overall Architecture
        [Architecture diagram. Components shown: Text Segmentation, Morphological Analyzer, Chunk/Phrase Identification, Named Entities Recognition, Synonyms & Normalization, Indexing, Distributed Indexes, Classifier (Hybrid SVM & Rules), Complaint Detector, Analysis Phase, Searching/Clustering (TopicRank), Timelines Dumper, DFS, Timelines (20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df), Scheduler, Merger & Ranker, Trend (TopN) DB, Web Server (Web UI).]
      In the real system, indexing was parallelized across 18 Linux machines for speed.
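The "Merger & Ranker" step that feeds the Trend (TopN) store could look roughly like the sketch below. It assumes each timeline dump has already been parsed into keyword counts; the internal format of the `.df` files is not given in the slides, and the keywords and counts here are invented.

```python
from collections import Counter


def merge_top_n(slices, n=3):
    """Sum keyword counts across timeline slices and return the n most frequent."""
    total = Counter()
    for counts in slices.values():
        total.update(counts)  # adds counts keyword by keyword
    return total.most_common(n)


if __name__ == "__main__":
    # Hypothetical parsed dumps, keyed by file names of the kind shown in the diagram.
    slices = {
        "20110713_0700_1.df": {"billing": 4, "roaming": 1},
        "20110713_0700_2.df": {"billing": 2, "handset": 3},
        "20110713_0710_1.df": {"roaming": 1, "handset": 1},
    }
    print(merge_top_n(slices, n=2))
```

Because the partial counts are additive, the same merge works whether the slices come from different machines or from different time windows.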
  • 10. System Overview
      • Home page
  • 11. System Overview
      • Top N Keywords Extraction
  • 12. System Overview
      • Related Keywords (Word Clustering)
  • 13. System Overview
      • Trend (Timeline) view
  • 14. System Overview
      • Tweets view
  • 15. Korean Linguistic
      • Brief introduction
        Korean writing is alphabet-based: consonants and vowels are composed into blocks of consonant + vowel or consonant + vowel + consonant.
        "나는 학생입니다."
          => 나 = ㄴ (N) + ㅏ (A) = NA
          => 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK
        One such block of consonant + vowel or consonant + vowel + consonant is a syllable ("eumjeol"), and words are composed of several syllables.
        Basic grammar: a word is composed of one root (noun, adjective/verb) followed by an inflection that marks the grammatical role (subject, object, location, etc.) for nouns (called "josa") or the aspect/mood (tense, honorific form, etc.) for verbs/adjectives (called "eomi").
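The syllable composition described above can be reproduced with Unicode arithmetic: precomposed Hangul syllables start at U+AC00 and are laid out as 19 initial consonants × 21 vowels × 28 finals (including "no final"). The sketch below is only an illustration of that layout, and the function name `decompose` is made up for it.

```python
# Jamo tables in Unicode order for precomposed Hangul syllables.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

HANGUL_BASE = 0xAC00


def decompose(syllable: str):
    """Split one precomposed Hangul syllable into (initial, vowel, final)."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < len(LEADS) * len(VOWELS) * len(TAILS):
        raise ValueError(f"not a Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


if __name__ == "__main__":
    print(decompose("나"))  # initial ㄴ, vowel ㅏ, no final
    print(decompose("학"))  # initial ㅎ, vowel ㅏ, final ㄱ
```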
  • 16. Korean Linguistic
      • Examples:
        "나는 학생입니다."
          => "나는" = "나" (na: I/me) + "는" (neun: topic marker)
          => "학생입니다" = "학생" (hak-seng: student) + "입니다" (im-ni-da: am)
          => I'm (a) student.
        Lots of (composite) inflectional forms:
          – 학생 + 입니다 = noun + "be"
          – 학생 + 인/이예요/이다/입니까?/인데/인데요 etc. (was, will be …) (eomi)
          – 학생 + syntactic role (이: subject / 에게: to / 한테: from / 을: object) etc. (josa)
        Korean is highly agglutinative (it concatenates prefixes, nouns, josa, and eomi).
          – Search engine: 검색엔진
          – High-performance search engine: 고성능검색엔진
        But the use of spaces is free/arbitrary: 검색엔진 and 검색 엔진 are equivalent.
        Especially on SNS, on space-limited devices, or under speed constraints (such as real-time transcription of conversations), spaces are more and more often omitted or misused.
        => Need for automatic segmentation correction.
  • 17. Project & Environment Description
      • Automatic Segmentation Correction Illustration
  • 18. Korean Linguistic
      • Automatic Segmentation Correction Implementation
        Binary classification approach: tag each syllable according to whether or not a space precedes it. Any kind of classifier can be used; here we use a CRF model (it could be an SVM).
        Example input to the CRF: 프랑스의 세계적인 디자이너 …
      • Features
          – 1-gram, 2-gram, 3-gram, 4-gram of characters (syllables)
          – Korean or not, contains a number
      • Evaluation
          – Accuracy at character level: 96.25%
          – Precision at word level: 95.58%, defined as (# correctly spaced words) / (# words produced by the system)
      • Very simple to train (it is easy to get huge amounts of data)
      • No need for a lexicon or any lexical information
      • Performs surprisingly well
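The labeling scheme, the character n-gram features, and the two evaluation measures can be sketched as follows. This is an illustration only: the window layout of the n-grams and all function names are assumptions, and no CRF is trained here (the resulting feature dictionaries would be handed to whatever sequence classifier is used).

```python
def space_labels(spaced: str):
    """Return (chars, labels); labels[i] is 1 if a space precedes chars[i]."""
    chars, labels, pending = [], [], 0
    for ch in spaced:
        if ch == " ":
            pending = 1
        else:
            chars.append(ch)
            labels.append(pending)
            pending = 0
    return chars, labels


def char_features(chars, i, max_n=4):
    """Character 1- to 4-grams covering position i, plus two surface flags."""
    feats = {}
    for n in range(1, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(chars):
                feats[f"{n}g[{start - i}]"] = "".join(chars[start:start + n])
    feats["is_hangul"] = "\uac00" <= chars[i] <= "\ud7a3"
    feats["is_digit"] = chars[i].isdigit()
    return feats


def char_accuracy(gold, pred):
    """Share of characters whose space/no-space label is correct."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def word_spans(labels):
    """Turn space labels into a set of (start, end) word spans."""
    starts = [i for i, lab in enumerate(labels) if lab == 1 or i == 0]
    return set(zip(starts, starts[1:] + [len(labels)]))


def word_precision(gold, pred):
    """# correctly spaced words / # words produced by the system."""
    produced = word_spans(pred)
    return len(produced & word_spans(gold)) / len(produced)


if __name__ == "__main__":
    chars, gold = space_labels("프랑스의 세계적인 디자이너")
    _, pred = space_labels("프랑스의 세계 적인 디자이너")  # hypothetical system output
    print(char_features(chars, 4))
    print(char_accuracy(gold, pred), word_precision(gold, pred))
```

Training data for such a tagger comes for free from any well-spaced text, since the labels are read directly off the existing spaces, which matches the remark that training is simple and needs no lexicon.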
  • 19. Korean Linguistic
      • Transliteration
          – Korean uses more and more English-derived words, transliterated phonetically into the Korean alphabet (the reverse of "Romanization"), especially for foreign names (companies, products, people, technical/domain terms).
          – The transcription is neither unique nor standardized.
        Examples:
          – tablet: 태블릿, 타블렛, 테블릿
          – Hitachi: 히타치, 히타찌, 히다찌
          – iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스
  • 20. Korean Linguistic
      • Automatic transliteration recognition
          – Build a rule-based phonetic transliteration that acts similarly to Soundex, adapted to Korean pronunciation.
            tablet, 태블릿
            T => ㅌ/ㄸ/ㄷ
            A => ㅏ/ㅓ/ㅔ/ㅐ
            etc.
        This method has high recall but low precision and needs post-processing filters (remove known Korean words found in lexicons, remove nouns that are too short, etc.).
        The result has to be corrected by humans, so an efficient workbench is needed for productivity.
        We gathered a dictionary of 130 thousand entries, mainly IT-oriented.
        More academic research is still needed to solve this problem.
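A Soundex-like key for Hangul spellings might be sketched as below. It is a deliberate simplification and not the rule set from the slides: it drops vowels entirely (as classic Soundex does) instead of mapping vowel classes such as A => ㅏ/ㅓ/ㅔ/ㅐ, and the consonant classes in `CLASSES` are assumptions chosen for illustration.

```python
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Hypothetical consonant classes: plain, tense, and aspirated variants share a code.
CLASSES = {
    "ㄱ": "K", "ㄲ": "K", "ㅋ": "K",
    "ㄷ": "T", "ㄸ": "T", "ㅌ": "T",
    "ㅂ": "P", "ㅃ": "P", "ㅍ": "P",
    "ㅈ": "J", "ㅉ": "J", "ㅊ": "J",
    "ㅅ": "S", "ㅆ": "S",
    "ㄴ": "N", "ㅁ": "M", "ㄹ": "L", "ㅎ": "H",
}


def phonetic_key(word: str) -> str:
    """Consonant skeleton of a Hangul string, with adjacent duplicates merged."""
    key = []
    for ch in word:
        index = ord(ch) - 0xAC00
        if not 0 <= index < 19 * 21 * 28:
            continue  # skip spaces, digits, Latin letters
        lead, rest = divmod(index, 21 * 28)
        tail = rest % 28
        for jamo in (LEADS[lead], TAILS[tail]):
            code = CLASSES.get(jamo)  # ㅇ and compound finals are ignored
            if code and (not key or key[-1] != code):
                key.append(code)
    return "".join(key)


if __name__ == "__main__":
    for variants in (["태블릿", "타블렛", "테블릿"], ["히타치", "히타찌", "히다찌"]):
        print({v: phonetic_key(v) for v in variants})
```

Keys this coarse collapse many unrelated words as well, which is consistent with the high-recall, low-precision behavior and the need for lexicon-based filtering noted on the slide.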
  • 21. Sentiment Analysis
      • Complaint Detection
        The problem is similar to standard subjectivity detection (deciding whether or not a sentence is sentiment-bearing).
        Simple approach: binary classification using an SVM, with manually tagged training/test corpora (more than 20 thousand notes).
        Feature space: n-grams of characters (syllables) + n-grams of words; 2- to 4-grams gave the best results.
        Feature selection is important to reduce the feature space; chi-square and information gain gave the best results.
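Chi-square selection over character 2- to 4-grams can be sketched in plain Python as follows. The four toy notes are invented English stand-ins for the tagged corpus, the function names are illustrative, and the SVM itself is left out; only the feature-scoring step is shown.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Set of character n-grams of length n_min..n_max found in text."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}


def chi_square(a, b, c, d):
    """Chi-square of a 2x2 table: a/b = complaints/others WITH the feature,
    c/d = complaints/others WITHOUT it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, k):
    """Return the k n-grams most associated with either class."""
    n_pos = sum(label for _, label in docs)
    n_neg = len(docs) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, label in docs:
        (pos_df if label else neg_df).update(char_ngrams(text))
    scores = {
        f: chi_square(pos_df[f], neg_df[f], n_pos - pos_df[f], n_neg - neg_df[f])
        for f in set(pos_df) | set(neg_df)
    }
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


if __name__ == "__main__":
    # Toy data: label 1 = complaint, 0 = simple inquiry.
    docs = [("bad call", 1), ("bad line", 1), ("new plan", 0), ("new phone", 0)]
    print(select_features(docs, k=14))
```

Only the selected n-grams would then be kept as dimensions of the classifier's input vectors, which is what keeps the feature space manageable.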
  • 22. Sentiment Analysis
      Problem: no freely available resource such as SentiWordNet for Korean. We need to build it!
      We built our own general-domain dictionary as a baseline:
        – 20,000 verbs/adjectives classified as positive/negative/neutral
        – The result is a lexicon of ~5,000 entries (positive/negative only)
        – Enriched with features manually extracted from n-grams
      The system is precision-oriented (92%) but recall is still quite low (75%). Overall accuracy: 85%.
      => We are still working on ways to improve recall without sacrificing precision.
      Basic ideas:
        – Bagging/boosting (combining several classifiers)
        – Hybrid models combining (linguistic: semantic/syntactic) rules and machine learning (statistics)
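To put the precision/recall trade-off into a single number, the harmonic mean (F1) can be computed from the two reported figures. F1 is not reported on the slides; this is a derived value shown only for orientation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Reported figures: precision 92%, recall 75%.
    print(round(f1_score(0.92, 0.75), 3))  # → 0.826
```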
  • 23. Lessons Learned
      – There is still quite a big gap between customer expectations and reality. We need to explain this and keep the customer involved in the assessment process and in acquiring knowledge/domain vocabulary.
      – We need to acquire a lot of lexicons: named entities, synonyms, stopwords, sentiment words.
      – The quality and quantity of these lexicons are a real asset for the company. Acquiring lexicons requires workbenches for efficient semi-supervised methods (manually filtering the output of automatic methods) to reduce costs.
      – Tuning classifier parameters, feature extraction, linguistic knowledge, etc. consumes time and expertise.
      – Simple academic methods work quite well (even if they need a lot of tuning).
      – Beyond a simple search engine, the quality of NLP components is becoming more and more important, especially for sentiment analysis.
  • 24. Lessons Learned
      – Customers are more and more interested in "Big Data", "Listening Platform", "Cloud", "Social Network/Intelligence"…
      – More and more customers want to get data/opinions from outside their in-house systems (blogs, communities (BBS), tweets, etc.). Typical questions:
          • How many crawlers are needed to crawl all Korean tweets/blogs?
          • How about crawling Facebook?
      – How do we identify "anti communities" (like "Anti-Samsung")? Who are the power bloggers?
        The solutions required go far beyond sentiment analysis, but customers often cannot afford, or do not want, crawling infrastructure and maintenance fees.
        This creates new opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS (Software/Platform/Infrastructure as a Service).
        Even in the enterprise, a distributed framework is required (not only for web-scale services).
      – Customers (at least in Korea) love to know the technology and are more and more advanced users. They buy not only solutions but also consulting/expertise.
      – Projects are more and more expensive, and many require benchmarks or a POC.
  • 25. Future Work & Plan
      • Future Work (ongoing)
          – Acquire more entries for the sentiment dictionary
          – Build a framework for handling linguistic rules together with statistical models (SVM/Rocchio)
          – Coupling with antonyms and/or hints
          – Better handling of negation
          – Better workbench for faster acquisition and (re-)training
          – Co-reference resolution
          – (Full/semi) parsing?
          – More complex models than binary classification?
          – Building/maintaining a platform for PaaS/SaaS
      A long, long way to go…
  • 26. Questions?
      Thank you.