+
        Mining Textual Significant
    Expressions Reflecting Opinions in
           Natural Languages

    Jan Žižka
    František Dařena

    Department of Informatics
    Faculty of Business and Economics
    Mendel University in Brno
    Czech Republic
+
    Introduction


     Many companies collect opinions expressed
      by their customers.
     These opinions can hide valuable knowledge.
     Discovering this knowledge manually can be
      a very demanding task because
      the opinion database can be very large,
      the customers can use different languages,
      people can handle the opinions subjectively,
      additional resources (such as lists of positive
       and negative words) might sometimes be needed.
+
    Introduction


     Text mining can reveal units of text
     (words, phrases, sentences, etc.) that can
     represent the meaning or sentiment
     Individual words usually do not carry
     enough information
     Phrases can provide more information, but
     their extraction, based on linguistic
     analysis, requires additional knowledge
     that is specific to each language
+
    Objective


    The objective is to find a way for a
    computer to reveal phrases that
    express a certain opinion, without the
    demanding and time-consuming linguistic
    analysis, which differs between
    natural languages.
+
    Data description


     The processed data consisted of hotel guest reviews
     collected from publicly available sources
     The reviews were labeled as positive or negative
     Review characteristics:
      more than 5,000,000 reviews
      written in more than 25 natural languages
      written only by real customers, based on real
       experience
      written relatively carefully but still containing errors that
       are typical of natural-language text
+
    Review examples

       Positive
           The breakfast and the very clean rooms stood out as the best
            features of this hotel.
           Clean and moden, the great loation near station. Friendly
            reception!
           The rooms are new. The breakfast is also great. We had a really
            nice stay.
           Good location - very quiet and good breakfast.

       Negative
           High price charged for internet access which actual cost now
            is extreamly low.
           water in the shower did not flow away
           The room was noisy and the room temperature was higher
            than normal.
           The air conditioning wasn't working
+
    Data preparation


     Data collection, cleaning (removing tags and
      non-letter characters), converting to upper case
     Transforming into the bag-of-words
      representation, with term frequencies (TF) used as
      attribute values
     Removing words with a global frequency < 2
     Stemming, stopword removal, spell
      checking, diacritics removal, etc. were not
      carried out
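
The preparation steps above can be sketched roughly as follows. This is only an illustration, not the authors' code: the function names `clean` and `bag_of_words` are hypothetical, "global frequency" is assumed to mean the total number of occurrences across the whole collection, and the dense list-of-lists output is chosen for readability rather than for a collection of millions of reviews.

```python
import re
from collections import Counter


def clean(review):
    """Remove tags and non-letter characters, upper-case, split into words."""
    no_tags = re.sub(r"<[^>]+>", " ", review)
    # [\W\d_] matches anything that is not a letter; letters with
    # diacritics are kept because diacritics removal was not carried out.
    letters_only = re.sub(r"[\W\d_]+", " ", no_tags)
    return letters_only.upper().split()


def bag_of_words(reviews, min_global_freq=2):
    """Return (vocabulary, term-frequency vectors) for a list of raw reviews."""
    tokenized = [clean(r) for r in reviews]
    # Assumption: global frequency = total occurrences in the collection.
    global_freq = Counter(w for words in tokenized for w in words)
    vocabulary = sorted(w for w, c in global_freq.items() if c >= min_global_freq)
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        vectors.append([tf[w] for w in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words([
    "Good location - very quiet and good breakfast.",
    "The breakfast is also <b>great</b>!",
])
print(vocabulary)  # ['BREAKFAST', 'GOOD']
print(vectors)     # [[1, 2], [1, 0]]
```

No stemming, stopword removal, or spell checking is applied, in line with the slide.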
+
    Data characteristics – number of
    reviews

    [Figure: bar chart of the number of positive and negative reviews
     (vertical axis: number of reviews, 0 to 1,200,000) for English,
     French, Spanish, German, Italian, and Czech]
+
    Data characteristics – dictionary
    sizes

    [Figure: bar chart of the number of unique words
     (vertical axis: 0 to 250,000) for MinTF=1 and MinTF=2 in English,
     German, French, Spanish, Italian, and Czech]
+
    Finding significant words

     Thanks to a large collection of labeled
     examples, a classifier that separates positive and
     negative reviews could be created
     To reveal significant attributes (words), a decision
      tree was built using the tree-generating algorithm
      c5, which is based on entropy minimization
     The goal was not to achieve the best classification
      accuracy but to find relevant attributes that
      contribute to assigning a text to a given class
     The significant words appeared in the nodes of the
      decision tree
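
To illustrate what entropy minimization means when choosing a word for a tree node, the sketch below ranks words by information gain on a toy data set. It is not c5 and does not build a full tree: it only scores a single split, and it tests for the presence of a word rather than a term-frequency threshold. The names `entropy`, `information_gain`, and `rank_words` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """Entropy reduction obtained by splitting on whether `word` occurs."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without_word) if part)
    return entropy(labels) - remainder


def rank_words(docs, labels):
    """Return (word, gain) pairs, most informative word first."""
    vocabulary = set().union(*docs)
    scored = [(w, information_gain(docs, labels, w)) for w in vocabulary]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy data: each review is reduced to the set of words it contains.
docs = [{"CLEAN", "ROOM"}, {"GOOD", "ROOM"}, {"NOISY", "ROOM"}, {"NOISY", "STREET"}]
labels = ["pos", "pos", "neg", "neg"]
print(rank_words(docs, labels)[0][0])  # NOISY
```

In this toy example NOISY separates the two classes perfectly, so it would be placed at the root of the tree; a tree-building algorithm repeats the same selection recursively on each branch.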
+
    Finding the significant words

     The classification accuracy, which is proportional to
     the relevancy of the words, was between 89.5% and 92.5%

     The decision tree provided a list of about 200–300
     words significant for classification from the
     sentiment perspective

     These words are used as the basis for extracting
     significant expressions, which avoids
     considering all possible combinations of words
+
    Extracting significant expressions

     Extraction of significant expressions starts from
     the list of significant words; the reviews are
     searched in the proximity of these words
     Parameters of the significant-expression
     extraction algorithm:
      D – the distance from a significant word within
       which the search is carried out
      N – the number of words forming a significant
       expression
      M – the minimal number of occurrences of a
       specific group of words
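
One possible reading of the parameters D, N, and M is sketched below. The slides do not spell out the algorithm, so several choices here are assumptions: an expression consists of the significant word plus N - 1 other words found at most D positions away, the words keep their original order, and each expression is counted at most once per review. The function name `extract_expressions` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_expressions(reviews, significant_words, D=3, N=3, M=2):
    """Count N-word groups found within distance D of a significant word.

    `reviews` is a list of tokenized reviews (lists of words). Only groups
    occurring in at least M reviews are returned.
    """
    counts = Counter()
    for words in reviews:
        found = set()
        for i, word in enumerate(words):
            if word not in significant_words:
                continue
            # Positions at most D words before or after the significant word.
            lo, hi = max(0, i - D), min(len(words), i + D + 1)
            neighbours = [j for j in range(lo, hi) if j != i]
            for others in combinations(neighbours, N - 1):
                positions = sorted(others + (i,))
                found.add(tuple(words[j] for j in positions))
        counts.update(found)  # assumption: count once per review
    return {expr: c for expr, c in counts.items() if c >= M}


reviews = [
    "GOOD LOCATION VERY QUIET".split(),
    "GOOD AND QUIET LOCATION".split(),
    "THE LOCATION WAS GOOD".split(),
]
print(extract_expressions(reviews, {"GOOD"}, D=3, N=2, M=2))
# {('GOOD', 'LOCATION'): 2, ('GOOD', 'QUIET'): 2}
```

Because word order is preserved in this sketch, "LOCATION ... GOOD" in the third review is counted separately from "GOOD ... LOCATION"; treating both orders as the same expression would be an equally plausible design.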
+
    An example

     Searching for significant expressions in a review
     with the algorithm parameters D = 3, N = 3.
+
    Results

     Lists of significant expressions extracted from the
     original text reviews were obtained.

     The expressions still need to be evaluated by people.
+
    Significant expressions for English
+
    Significant expressions for German
+
    Significant expressions for Spanish
+
    Significant expressions for Spanish
+
    Discussion

     Some of the significant expressions were very similar

     The significant expressions were mostly quite
      meaningful and potentially useful for the target
      audience

     Some of the expressions were naturally not useful at all

     It is necessary to find a trade-off between the size of
      expressions, the length of the text span where the search is
      carried out, and the informative value of the expressions
+
    Discussion

       Examples of different distances of words forming the same
        significant expression "good location"
+
    Discussion

       However, the same expression can be formed from words that
        come from different contexts:



        “... Breakfast was really good. The location is a
        little out of the center ...”
        or
        “Good service. Convenient location”
        or
        “It is a quiet location for a good nights sleep”
+
    Handling large collections

     For languages with a large number of reviews, the
     datasets were randomly split into subsets
     of 50,000 reviews because of memory
     requirements, and a decision tree was created for
     each subset

     Each of the 50,000-sample subsets gave almost the
     same list of words

     The relevancies of the extracted words were averaged
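
The splitting and averaging could look roughly like the sketch below. The per-subset relevancies would come from the decision tree built on each subset; here they are simply given as dictionaries. The names `split_into_subsets` and `average_relevance` are hypothetical, and treating a word that is missing from a subset's list as having relevance 0 is an assumption, since the slides do not say how such words were handled.

```python
import random
from collections import defaultdict


def split_into_subsets(reviews, subset_size=50_000, seed=0):
    """Shuffle the reviews and cut them into chunks of `subset_size`."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + subset_size]
            for i in range(0, len(shuffled), subset_size)]


def average_relevance(per_subset_scores):
    """Average {word: relevance} dicts obtained from the individual subsets.

    Assumption: a word absent from a subset's list contributes 0 there.
    """
    totals = defaultdict(float)
    for scores in per_subset_scores:
        for word, score in scores.items():
            totals[word] += score
    n = len(per_subset_scores)
    return {word: total / n for word, total in totals.items()}


subsets = split_into_subsets(range(10), subset_size=4)
print([len(s) for s in subsets])  # [4, 4, 2]
```

The last subset may be smaller than the others when the collection size is not a multiple of the subset size.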
+
    Conclusions

     A procedure that applies computers, machine learning, and
      natural language processing to automatically find
      significant expressions was presented
     Of the total number of words (80,000–200,000), only about
      200–300 were identified as significant and used as the basis
      for expression extraction
     The simple, unified procedure worked well for many
      languages
     Follow-up research focuses on the preprocessing phase (e.g.
      eliminating meaningless words)
     The procedure might be used in marketing
      research or marketing intelligence, for filtering reviews,
      generating lists of keywords, etc.
Thank you for your attention
Vielen Dank für Ihre Aufmerksamkeit
    Gracias por vuestra atención
      Merci de votre attention
   Grazie per la vostra attenzione
      Děkuji za vaši pozornost
