Analysis and Modeling of Complex Data in Behavioral and Social Sciences
                          Joint meeting of Japanese and Italian Classification Societies
                          Anacapri (Capri Island, Italy), 3-4 September 2012	




A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web
Kei Kurakawa (1), Yuan Sun (1), Nagayoshi Yamashita (2), Yasumasa Baba (3)
1. National Institute of Informatics
2. GMO Research (ex- Japan Society for the Promotion of Science)
3. The Institute of Statistical Mathematics
U-I-G relations	
•  University-industry-government (U-I-G) relations are an important aspect to investigate when making policy for science and technology research and development (Leydesdorff and Meyer, 2003).
•  Web documents are one of the research targets for clarifying the state of these relations.
•  The first requirement in that clarification process is to obtain accurate resources on U-I-G relations.
[Figure: U, I, and G diagram]
Objective	
•  The objective is to automatically extract resources on U-I relations from the web.
[Figure: U, I, and G diagram]
•  We target the “press release articles” of organizations and build a framework that automatically crawls them and decides which ones concern U-I relations.
Automatic extraction framework for U-I relations documents on the web
Input: press release articles published on university or company web sites
1. Crawling Web Documents → Crawled Documents
2. Extracting Text From the Documents → Extracted Texts
3. Learning to Classify the Document → Learned Model File
4. Classifying the Document
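Support Vector Machine (1) (Vapnik, 1995)
•  Two-class classifier
   $$y(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b$$
   where $\phi(\mathbf{x})$ is a fixed feature space transformation and $b$ is the bias parameter.
•  N input vectors
    –  Input vectors: $\mathbf{x}_1, \ldots, \mathbf{x}_N$
    –  Target values: $t_1, \ldots, t_N$ where $t_n \in \{-1, 1\}$
•  For all input vectors, $t_n y(\mathbf{x}_n) > 0$
•  Maximize the margin between the hyperplanes $y(\mathbf{x}) = 1$ and $y(\mathbf{x}) = -1$
[Figure: decision boundary $y = 0$ with margin boundaries $y = 1$ and $y = -1$; labels: margin, support vector]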
Support Vector Machine (2)	
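•  Optimization problem
   $$\arg\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^{2}$$
   subject to the constraints
   $$t_n\left(\mathbf{w}^{T}\phi(\mathbf{x}_n) + b\right) \geq 1, \quad n = 1, \ldots, N$$
•  By means of the Lagrangian method,
   $$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$
   where the kernel function is defined by $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{T}\phi(\mathbf{x}')$ and $a_n \geq 0$ are the Lagrange multipliers (only the support vectors have $a_n > 0$).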
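As a minimal sketch of the kernel form of the decision function, $y(\mathbf{x}) = \sum_n a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$, the Python below evaluates it on a toy problem solved by hand. The function names, the two training points, and the multiplier values are illustrative assumptions; the experiment in these slides used SVM light, not this code.

```python
import math


def linear_kernel(x, z):
    """k(x, z) = x^T z, i.e. phi is the identity map."""
    return sum(xi * zi for xi, zi in zip(x, z))


def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2); gamma is a free parameter."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def decision_function(x, vectors, targets, multipliers, bias, kernel=linear_kernel):
    """y(x) = sum_n a_n * t_n * k(x, x_n) + b."""
    return sum(a * t * kernel(x, xn)
               for xn, t, a in zip(vectors, targets, multipliers)) + bias


# Toy problem (assumed, not from the slides): one training point per class.
# Solving the hard-margin problem by hand gives a_1 = a_2 = 0.25 and b = 0.
vectors = [(1.0, 1.0), (-1.0, -1.0)]
targets = [1, -1]
multipliers = [0.25, 0.25]
bias = 0.0

for x in [(1.0, 1.0), (-1.0, -1.0), (2.0, 0.0)]:
    print(x, decision_function(x, vectors, targets, multipliers, bias))
```

Both training points land exactly on the margin hyperplanes $y = 1$ and $y = -1$, which is what makes them support vectors. Passing `kernel=rbf_kernel` swaps the kernel without changing the rest of the function.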
U-I relations documents on the web	
•  Texts extracted from web documents are very noisy for content analysis.
   –  Irrelevant text, e.g. menu labels, page headers and footers, and ads, still remains.
•  In our observation,
   –  irrelevant text tends to consist of isolated terms rather than sentences,
   –  for detecting U-I relations, the direct evidence of relevance occurs in two or three consecutive formal sentences.
      •  For example, “the MIT researchers and scientists from MicroCHIPS Inc. reported that... ”,
      •  and in the Japanese target text, “東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く....” (“Through joint research, the University of Tokyo and OMRON Corporation ... robust to overlap and occlusion ....”)
•  It is enough to keep only text that contains punctuation marks, which indicate fully formed sentences.
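The punctuation filter described on this slide could look roughly like the following. This is a sketch under assumptions: the slides do not say which punctuation marks were checked, so the pattern here (Japanese comma and full stop, or ASCII sentence-ending punctuation) and the sample lines are invented for illustration.

```python
import re

# Assumed marks: Japanese comma or full stop anywhere in the line, or ASCII
# sentence-ending punctuation followed by whitespace or the end of the line.
SENTENCE_PUNCT = re.compile(r"[、。]|[.!?](?:\s|$)")


def keep_sentence_like(lines):
    """Keep only the lines that contain sentence punctuation."""
    return [line for line in lines if SENTENCE_PUNCT.search(line)]


# Made-up sample of extracted text lines (not from the experiment data).
extracted = [
    "ホーム",
    "お問い合わせ",
    "東京大学とオムロン株式会社は、共同研究を開始しました。",
    "Press Release",
    "The MIT researchers reported that the device works.",
]

kept = keep_sentence_like(extracted)
print(kept)
```

Menu labels such as "ホーム" carry no punctuation and are dropped, while the two full sentences are kept.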
Feature selection	
•  tf-idf (Term Frequency – Inverse Document Frequency)
•  tf-idf is defined by
   $$\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
   where $t$ is a term, $d$ is a document, and $D$ is the set of all documents.
•  The feature is defined by
   $$\mathbf{x}_d = (x_{t_1,d}, x_{t_2,d}, \cdots, x_{t_M,d}), \qquad x_{t,d} = \text{tf-idf}(t, d, D) \times b_{t,d}$$
   $$b_{t,d} = \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{if } t \notin d \end{cases}$$
•  The term can be a term in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of external tools in our experiment.
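A small Python sketch of the feature definition above may help make it concrete. The slides do not give the exact tf and idf formulas, so the raw-count tf and the log(|D| / document frequency) idf used here are assumptions, as are the function names and the tiny made-up corpus.

```python
import math


def tf(term, doc):
    """Raw count of the term in a tokenized document (assumed tf definition)."""
    return doc.count(term)


def idf(term, docs):
    """log(|D| / document frequency) (assumed idf definition)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc, docs):
    """tf-idf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, doc) * idf(term, docs)


def feature_vector(terms, doc, docs):
    """x_d = (x_{t_1,d}, ..., x_{t_M,d}) with x_{t,d} = tf-idf(t, d, D) * b_{t,d}."""
    return [tf_idf(t, doc, docs) * (1 if t in doc else 0) for t in terms]


# Made-up tokenized documents (not from the experiment data).
docs = [
    ["共同", "研究", "開発"],
    ["研究", "発表"],
    ["採用", "情報"],
]
terms = ["共同", "研究", "協力"]

x = feature_vector(terms, docs[0], docs)
print(x)
```

A term that occurs in fewer documents gets the larger weight, and a term absent from the document contributes 0.0 through $b_{t,d}$.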
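Mapping a document into a feature vector
A document
     東北大学は、NECとの共同研究によりCPU内で使用される電子回路(CAM:連想メモリプロセッサ)において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。
     (Translation: Through joint research with NEC, Tohoku University has developed and demonstrated, for the first time in the world, a technology for electronic circuits used inside CPUs (CAM: associative memory processor) that combines high-speed operation equivalent to existing circuits with non-volatile operation that retains data on the circuit even if the power is cut during processing.)

Feature selection
     x = (tf-idf(産官学, d, D), tf-idf(協力, d, D),
          tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D),
          tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D),
          tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D),
          tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D),
          tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D))
     (POS labels: 動詞 = verb; 名詞,サ変接続 = noun, Sahen-connection)

A feature vector
     x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)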
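Term labels such as 研究+動詞 on this slide appear to pair a keyword with the part of speech of the morpheme that follows it (the K(18)+NM feature in the feature list). Under that reading, a rough sketch of how such terms might be generated is shown below. The `(surface, POS)` input format, the function name, and the sample morphemes are hypothetical stand-ins for the output of a morphological analyzer; they are not taken from the slides.

```python
# POS prefixes allowed for the next morpheme: verb, auxiliary verb, Sahen-noun.
ALLOWED_NEXT_POS = ("動詞", "助動詞", "名詞,サ変接続")


def keyword_nm_terms(morphemes, keywords):
    """Build 'keyword+POS' terms from a list of (surface, pos) pairs.

    A term is emitted when a keyword is directly followed by a morpheme
    whose POS string starts with one of the allowed prefixes.
    """
    terms = []
    for (surface, _), (_, next_pos) in zip(morphemes, morphemes[1:]):
        if surface not in keywords:
            continue
        for pos in ALLOWED_NEXT_POS:
            if next_pos.startswith(pos):
                terms.append(f"{surface}+{pos}")
                break
    return terms


# Hand-written (surface, POS) pairs standing in for analyzer output.
morphemes = [
    ("共同", "名詞,サ変接続"),
    ("研究", "名詞,サ変接続"),
    ("し", "動詞,自立"),
    ("た", "助動詞"),
]

nm_terms = keyword_nm_terms(morphemes, {"共同", "研究"})
print(nm_terms)
```

Each resulting term would then be weighted with tf-idf like any other term. Plain keyword terms such as 共同 are not produced by this sketch.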
Features (1)	
1) BoW
      –  Bag of Words. Full output of MeCab (Japanese morphological analyzer). The tf-idf values of the words make up the feature vector x_n.
2) BoW(N)
      –  Only nouns are chosen.
3) BoW(N-3)
      –  The words are restricted to proper nouns, general nouns, and Sahen-nouns (nouns that form a verb when “する” ([suru], do) is added).
4) K(14)
      –  Fourteen keywords related to U-I relations. The keywords are “研究” ([kenkyuu], research), “開発” ([kaihatsu], development), “実験” ([jikken], experiment), “成功” ([seikou], success), “発見” ([hakken], discover), “開始” ([kaishi], start), “受賞” ([jushou], award), “表彰” ([hyoushou], honor), “共同” ([kyoudou], collaboration), “協同” ([kyoudou], cooperation), “協力” ([kyouryoku], join forces), “産学” ([sangaku], UI relationship), “産官学” ([sankangaku], UIG (University-Industry-Government) relations), and “連携” ([renkei], coordination).
5) K(18)
      –  K(14) + 4 keywords: “受託” ([jutaku], entrusted with), “委託” ([itaku], consignment), “締結” ([teiketsu], conclusion), and “研究員” ([kenkyuuin], researcher).
Features (2)	
6) K(18)+NM
      –  The keywords and the POS (part of speech) of the next morpheme in the text sequence are checked; the grammatical connections of those keywords are restricted to verb, auxiliary verb, and Sahen-noun.
7) Corp.
      –  Corporation marks.
      –  “株式会社” ([kabushikigaisha], Incorporated), ㈱ (a single Unicode character, U+3231), （株）, or (株).
8) Univ.
      –  The university name is checked.
      –  “大学” ([daigaku], university), or “大” ([dai], an abbreviated form of university).
9) C.+U.
      –  Both a corporation mark and a university name occur in the same sentence.
10) ORG
      –  The presence of an organization, detected with CaboCha’s Japanese named entity extraction function.
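The Corp., Univ., and C.+U. checks listed above could be sketched as simple substring tests per sentence. This is only an illustration under assumptions: splitting sentences on “。”, the function names, and the sample text with the fictitious names A大学 and B株式会社 are not from the slides, and the slides do not say how the detected marks are turned into feature values.

```python
# Marks listed on the slide: corporation marks and university marks.
CORP_MARKS = ("株式会社", "\u3231", "（株）", "(株)")
UNIV_MARKS = ("大学", "大")


def split_sentences(text):
    """Split on the Japanese full stop (an assumed sentence boundary)."""
    return [s + "。" for s in text.split("。") if s.strip()]


def sentence_flags(sentence):
    """Report which marks occur in one sentence."""
    corp = any(mark in sentence for mark in CORP_MARKS)
    univ = any(mark in sentence for mark in UNIV_MARKS)
    return {"Corp.": corp, "Univ.": univ, "C.+U.": corp and univ}


# Made-up text with fictitious organization names.
text = "A大学は、B株式会社と共同研究を開始しました。新製品を発売しました。"

flags = [sentence_flags(s) for s in split_sentences(text)]
for sentence, flag in zip(split_sentences(text), flags):
    print(sentence, flag)
```

Matching the single character “大” as a substring is crude, since it also occurs in many unrelated words; a real implementation would likely need a stricter test.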
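Feature selection and SVM kernel functions
TF-IDF feature elements used in each test (✔ = used, - = not used)

Test ID  (1) BoW  (2) BoW(N)  (3) BoW(N-3)  (4) K(14)  (5) K(18)  (6) K(18)+NM  (7) Corp.  (8) Univ.  (9) C.+U.  (10) ORG  Kernel function
1-1      ✔        -           -             -          -          -             -          -          -          -         Linear
1-2      -        ✔           -             -          -          -             -          -          -          -         Linear
1-3      -        -           ✔             -          -          -             -          -          -          -         Linear
2-1      -        -           -             ✔          -          -             -          -          -          -         Linear
2-2      -        -           -             ✔          -          -             -          -          -          -         Polynomial
2-3      -        -           -             ✔          -          -             -          -          -          -         RBF
3-1      -        -           -             -          ✔          -             -          -          -          -         Linear
3-2      -        -           -             -          ✔          -             -          -          -          -         Polynomial
3-3      -        -           -             -          ✔          -             -          -          -          -         RBF
4-1      -        -           -             -          -          ✔             -          -          -          -         Linear
4-2      -        -           -             -          -          ✔             -          -          -          -         Polynomial
4-3      -        -           -             -          -          ✔             -          -          -          -         RBF
5-1      -        -           -             -          -          ✔             -          -          -          ✔         Linear
5-2      -        -           -             -          -          ✔             -          -          -          ✔         Polynomial
5-3      -        -           -             -          -          ✔             -          -          -          ✔         RBF
6-1      -        -           -             -          -          ✔             ✔          ✔          -          ✔         Linear
6-2      -        -           -             -          -          ✔             ✔          ✔          -          ✔         Polynomial
6-3      -        -           -             -          -          ✔             ✔          ✔          -          ✔         RBF
7-1      -        -           -             -          -          ✔             ✔          ✔          ✔          -         Linear
7-2      -        -           -             -          -          ✔             ✔          ✔          ✔          -         Polynomial
7-3      -        -           -             -          -          ✔             ✔          ✔          ✔          -         RBF
7-4      -        -           -             -          -          ✔             ✔          ✔          ✔          -         RBF (γ tuned)
8-1      -        -           -             -          -          ✔             ✔          ✔          ✔          ✔         Linear
8-2      -        -           -             -          -          ✔             ✔          ✔          ✔          ✔         Polynomial
8-3      -        -           -             -          -          ✔             ✔          ✔          ✔          ✔         RBF
8-4      -        -           -             -          -          ✔             ✔          ✔          ✔          ✔         RBF (γ tuned)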
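For reference, the three kernel types named in the table have the following standard forms. The parameter symbols $s$, $c$, $d$, and $\gamma$ follow common convention and are an assumption here; the slides do not state the parameter values used, only that $\gamma$ was tuned in tests 7-4 and 8-4.

```latex
\begin{align*}
\text{Linear:} \quad & k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{T}\mathbf{x}' \\
\text{Polynomial:} \quad & k(\mathbf{x}, \mathbf{x}') = \left(s\,\mathbf{x}^{T}\mathbf{x}' + c\right)^{d} \\
\text{RBF:} \quad & k(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^{2}\right)
\end{align*}
```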
Data set for experiment	
Number of articles per organization: crawled articles, and articles used for the experiment

Organization          Crawled positive  Crawled negative  Experiment positive  Experiment negative
Tohoku Univ.          44                499               44                   44
The Univ. of Tokyo    106               848               106                  106
Kyoto Univ.           40                329               40                   40
Tokyo Inst. of Tech.  37                343               37                   37
Hitachi Corp.         103               450               103                  103
Total                 330               2469              330                  330
Classification results (SVM light (Joachims))
Average points in 10-fold cross validation

Test ID  Accuracy  Precision  Recall  F-measure  Features
1-1      61.21     64.04      42.12   47.28      BoW
1-2      60.61     63.75      40.00   45.54      BoW
1-3      61.52     67.44      40.00   46.72      BoW
2-1      67.58     72.02      61.52   63.70      K(14)
2-2      58.03     69.76      23.33   34.45      K(14)
2-3      66.51     62.53      86.37   71.89      K(14)
3-1      68.18     72.02      63.33   64.78      K(18)
3-2      57.88     69.00      23.03   34.08      K(18)
3-3      66.67     62.22      88.18   72.43      K(18)
4-1      70.61     74.66      63.64   67.40      K(18)+NM
4-2      -         -          -       -          K(18)+NM
4-3      70.76     65.49      90.30   75.66      K(18)+NM
5-1      70.61     74.61      63.64   67.31      K(18)+NM, ORG
5-2      -         -          -       -          K(18)+NM, ORG
5-3      70.76     65.49      90.30   75.66      K(18)+NM, ORG
6-1      -         -          -       -          K(18)+NM, Corp, Univ., ORG
6-2      -         -          -       -          K(18)+NM, Corp, Univ., ORG
6-3      70.15     64.64      93.64   76.09      K(18)+NM, Corp, Univ., ORG
7-1      78.79     85.01      71.52   76.99      K(18)+NM, Corp, Univ., C+U
7-2      -         -          -       -          K(18)+NM, Corp, Univ., C+U
7-3      72.27     66.07      94.85   77.61      K(18)+NM, Corp, Univ., C+U
7-4      80.15     78.81      83.94   81.05      K(18)+NM, Corp, Univ., C+U
8-1      78.94     85.03      71.82   77.16      K(18)+NM, Corp, Univ., C+U, ORG
8-2      -         -          -       -          K(18)+NM, Corp, Univ., C+U, ORG
8-3      71.82     65.73      94.85   77.35      K(18)+NM, Corp, Univ., C+U, ORG
8-4      79.85     78.51      83.94   80.86      K(18)+NM, Corp, Univ., C+U, ORG

-: not calculated because precision was zero or the learning optimization failed
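The four scores in the table can be computed from per-fold confusion counts roughly as follows. This is a sketch using the standard definitions; the function names and the counts are made up, and returning `None` when a score is undefined is an assumption that loosely mirrors the "-" entries rather than a description of what SVM light does.

```python
def scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f_measure) for one fold.

    tp/fp/fn/tn are true positive, false positive, false negative, and
    true negative counts. Undefined scores are returned as None.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    # F-measure is undefined when precision or recall is missing or zero.
    if not precision or not recall:
        f_measure = None
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


def average(values):
    """Mean over folds, or None if any fold value is undefined."""
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)


# Made-up confusion counts for two folds (not from the experiment).
folds = [scores(8, 2, 4, 6), scores(9, 3, 1, 7)]
mean_f = average([f for _, _, _, f in folds])
print(folds, mean_f)
```

Averaging the per-fold F-measures is generally not the same as computing F from the averaged precision and recall, which may be why the F-measure column cannot be reproduced exactly from the other two columns.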
Findings and discussion (1)	
•  In test IDs 1-1, 1-2, and 1-3, the feature elements are BoW, numbering over 15800, 13000, and 12000 respectively. Their f-measures are worse than those of the other features with the same linear kernel function. Learning appears to have failed.
•  A likely reason for the failure is that the training data is far too small to learn from. With enough training data, the number of training examples would exceed the feature vector size, i.e. the number of basis functions of the SVM, so that learning could be done without over-fitting.
Findings and discussion (2)	
•  In test IDs 2-1 to 8-3, the feature element size is about 14 to 33.
•  Accuracy and f-measure gradually increase as feature elements are added and the feature set becomes more complex.
Findings and discussion (3)	
•  Test IDs 7-* and 8-* involve the occurrence of university and company symbols. In ID 7-3 in particular, recall (94.85) and f-measure (77.61) are the highest among the tests without γ tuning. This suggests that the co-occurrence of the two symbols in a sentence is a sensitive indicator of U-I relations.
•  Scores depend strongly on the kernel function type.
•  The parameters of the kernel function and the efficiency of the loss function affect the balance between precision and recall. γ of the Radial Basis Function is chosen to give the highest F-value under cross validation for this experiment.
Conclusion and future work	
•  To automatically extract resources on U-I relations from the web,
    –  we targeted the “press release articles” of organizations,
    –  a classification technique, i.e. a support vector machine (SVM), was applied to the decision.
•  We conducted an experiment on several combinations of feature vector elements and SVM kernel function types.
•  The combinations reveal that
    –  U-I relations keywords,
    –  university and company symbols in a sentence
    are effective elements for features.
•  The SVM parameters were tuned to obtain a higher f-measure, which also affects the balance between precision and recall.
•  Finally, we obtained an accuracy of 80.15 and an f-measure of 81.05 for classifying U-I relations documents on the web.

•  In future work, we will build the classifier into a context crawler that automatically crawls the press release web sites of organizations to obtain more resources.

More Related Content

What's hot

Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 
similarity measure
similarity measure similarity measure
similarity measure ZHAO Sam
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modelingHiroyuki Kuromiya
 
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...cscpconf
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Non-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksNon-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksGiuseppe Broccolo
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
2007bai7604.doc.doc
2007bai7604.doc.doc2007bai7604.doc.doc
2007bai7604.doc.docbutest
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Latent Structured Ranking
Latent Structured RankingLatent Structured Ranking
Latent Structured RankingSunny Kr
 
Algorithm
AlgorithmAlgorithm
Algorithmseobear
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...IRJET Journal
 
Bayesian Nonparametrics, Applications to biology, ecology, and marketing
Bayesian Nonparametrics, Applications to biology, ecology, and marketingBayesian Nonparametrics, Applications to biology, ecology, and marketing
Bayesian Nonparametrics, Applications to biology, ecology, and marketingJulyan Arbel
 
IRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET Journal
 
Olivier Cappé's talk at BigMC March 2011
Olivier Cappé's talk at BigMC March 2011Olivier Cappé's talk at BigMC March 2011
Olivier Cappé's talk at BigMC March 2011BigMC
 

What's hot (19)

Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
similarity measure
similarity measure similarity measure
similarity measure
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
Julia text mining_inmobi
Julia text mining_inmobiJulia text mining_inmobi
Julia text mining_inmobi
 
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
 
Opthlt
OpthltOpthlt
Opthlt
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Non-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksNon-parametric regressions & Neural Networks
Non-parametric regressions & Neural Networks
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
2007bai7604.doc.doc
2007bai7604.doc.doc2007bai7604.doc.doc
2007bai7604.doc.doc
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Latent Structured Ranking
Latent Structured RankingLatent Structured Ranking
Latent Structured Ranking
 
Algorithm
AlgorithmAlgorithm
Algorithm
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
 
Bayesian Nonparametrics, Applications to biology, ecology, and marketing
Bayesian Nonparametrics, Applications to biology, ecology, and marketingBayesian Nonparametrics, Applications to biology, ecology, and marketing
Bayesian Nonparametrics, Applications to biology, ecology, and marketing
 
IRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF Metric
 
Olivier Cappé's talk at BigMC March 2011
Olivier Cappé's talk at BigMC March 2011Olivier Cappé's talk at BigMC March 2011
Olivier Cappé's talk at BigMC March 2011
 
LDA on social bookmarking systems
LDA on social bookmarking systemsLDA on social bookmarking systems
LDA on social bookmarking systems
 

Similar to A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web

Microsoft PowerPoint - ml4textweb00
Microsoft PowerPoint - ml4textweb00Microsoft PowerPoint - ml4textweb00
Microsoft PowerPoint - ml4textweb00butest
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Kira
 
Simple semantics in topic detection and tracking
Simple semantics in topic detection and trackingSimple semantics in topic detection and tracking
Simple semantics in topic detection and trackingGeorge Ang
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
Machine Learning ebook.pdf
Machine Learning ebook.pdfMachine Learning ebook.pdf
Machine Learning ebook.pdfHODIT12
 
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 11_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1MostafaHazemMostafaa
 
Lecture14 xing fei-fei
Lecture14 xing fei-feiLecture14 xing fei-fei
Lecture14 xing fei-feiTianlu Wang
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Claudio Greco
 
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Alessandro Suglia
 
know Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfknow Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfhemangppatel
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationPaul Houle
 
A hierarchical approach for semi structured document indexing and
A hierarchical approach for semi structured document indexing andA hierarchical approach for semi structured document indexing and
A hierarchical approach for semi structured document indexing andIbrahim Bounhas
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?Frank van Harmelen
 






A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web

  • 1. Analysis and Modeling of Complex Data in Behavioral and Social Sciences
      Joint meeting of Japanese and Italian Classification Societies, Anacapri (Capri Island, Italy), 3-4 September 2012
      A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web
      Kei Kurakawa (1), Yuan Sun (1), Nagayoshi Yamashita (2), Yasumasa Baba (3)
      (1) National Institute of Informatics
      (2) GMO Research (ex-Japan Society for the Promotion of Science)
      (3) The Institute of Statistical Mathematics
  • 2. U-I-G relations
      •  For making science and technology research and development policy, university-industry-government (U-I-G) relations are an important aspect to investigate (Leydesdorff and Meyer, 2003).
      •  Web documents are one of the research targets for clarifying the state of these relations.
      •  In the clarification process, obtaining the exact resources on U-I-G relations is the first requirement.
      [Diagram: U, I, and G nodes]
  • 3. Objective
      •  The objective is to automatically extract resources on U-I relations from the web.
      •  We target "press release articles" of organizations, and build a framework that automatically crawls them and decides which ones concern U-I relations.
      [Diagram: U, I, and G nodes]
  • 4. Automatic extraction framework for U-I relations documents on the web
      Input: press release articles published on university or company web sites
      1.  Crawling Web Documents → Crawled Documents
      2.  Extracting Text From the Documents → Extracted Texts
      3.  Learning to Classify the Document → Learned Model File
      4.  Classifying the Document
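  • 5. Support Vector Machine (1) (Vapnik, 1995)
      •  Two-class classifier
          $$ y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b $$
          where $b$ is the bias parameter and $\phi$ is a fixed feature space transformation.
      •  N input vectors
          –  Input vectors: $\mathbf{x}_1, \ldots, \mathbf{x}_N$
          –  Target values: $t_1, \ldots, t_N$ where $t_n \in \{-1, 1\}$
      •  For all input vectors, $t_n y(\mathbf{x}_n) > 0$
      •  Maximize the margin between the hyperplanes $y(\mathbf{x}) = 1$ and $y(\mathbf{x}) = -1$
      [Figure: hyperplanes y = 1, y = 0, and y = -1, with the margin and a support vector labeled]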
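  • 6. Support Vector Machine (2)
      •  Optimization problem
          $$ \arg\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 $$
          subject to the constraints
          $$ t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \ge 1, \quad n = 1, \ldots, N $$
      •  By means of the Lagrangian method,
          $$ y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b $$
          where the kernel function is defined by $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$, and $a_n \ge 0$ are the Lagrange multipliers (only the support vectors have $a_n > 0$).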
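The kernel form of y(x) above can be evaluated directly once the multipliers are known. The following is a minimal sketch, not the SVM light implementation used in the experiment: the function names, the toy support vectors, and the values of a_n and b are all made up for illustration.

```python
import math


def linear_kernel(x, z):
    """k(x, z) = x^T z, i.e. phi is the identity map."""
    return sum(xi * zi for xi, zi in zip(x, z))


def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def decision_function(x, support_vectors, targets, multipliers, bias, kernel):
    """y(x) = sum_n a_n * t_n * k(x, x_n) + b."""
    total = sum(
        a * t * kernel(x, xn)
        for a, t, xn in zip(multipliers, targets, support_vectors)
    )
    return total + bias


# Toy model (made-up values): one support vector per class.
SUPPORT_VECTORS = [(1.0, 1.0), (-1.0, -1.0)]
TARGETS = [1, -1]
MULTIPLIERS = [0.25, 0.25]
BIAS = 0.0


def classify(x, kernel=linear_kernel):
    """Return the predicted class label, +1 or -1."""
    y = decision_function(x, SUPPORT_VECTORS, TARGETS, MULTIPLIERS, BIAS, kernel)
    return 1 if y > 0 else -1
```

With the linear kernel, the two toy support vectors evaluate to y = 1 and y = -1, so they sit exactly on the two margin hyperplanes.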
  • 7. U-I relations documents on the web
      •  Texts extracted from web documents are very noisy for content analysis.
          –  Irrelevant text, e.g. menu labels, page headers and footers, and ads, still remains.
      •  In our observation,
          –  irrelevant text tends to be an isolated term rather than part of a sentence,
          –  for detecting U-I relations, the exact evidence of relevance occurs in two or three consecutive formal sentences.
              •  For example, "the MIT researchers and scientists from MicroCHIPS Inc. reported that...",
              •  or, for the Japanese target, "東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く...." ("The University of Tokyo and OMRON Corporation, through joint research, ... robust to overlap and occlusion ...").
      •  It is therefore enough to keep only text that includes punctuation marks, which indicates a fully formed sentence.
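The punctuation filter described above might be sketched as follows. The slides do not say which punctuation marks were checked, so the mark sets and the function names here are assumptions chosen for illustration.

```python
# Assumed mark sets; the slides do not list the exact marks used.
JAPANESE_MARKS = ("。", "、")
ENGLISH_ENDINGS = (".", "!", "?")


def looks_like_sentence(line):
    """True if the line carries sentence punctuation (hypothetical rule)."""
    text = line.strip()
    if not text:
        return False
    has_japanese_mark = any(mark in text for mark in JAPANESE_MARKS)
    return has_japanese_mark or text.endswith(ENGLISH_ENDINGS)


def filter_sentences(lines):
    """Keep only lines that look like full sentences, stripped of padding."""
    return [line.strip() for line in lines if looks_like_sentence(line)]


extracted = [
    "Home",
    "Contact us",
    "the MIT researchers and scientists from MicroCHIPS Inc. reported that...",
    "東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く....",
]
kept = filter_sentences(extracted)  # drops the two menu labels
```

A rule this simple will still let through some noise, such as a footer line that happens to end with a period, so it is only a first-pass filter.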
  • 8. Feature selection
      •  tf-idf (Term Frequency – Inverse Document Frequency)
      •  tf-idf is defined by
          $$ \text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D) $$
          where $t$ is a term, $d$ is a document, and $D$ is the set of all documents.
      •  The feature is defined by
          $$ x_{t,d} = \text{tf-idf}(t, d, D) \times b_{t,d} $$
          $$ \mathbf{x}_d = (x_{t_1,d}, x_{t_2,d}, \cdots, x_{t_M,d}) $$
          $$ b_{t,d} = \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{if } t \notin d \end{cases} $$
      •  In our experiment, the term can be a term in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of external tools.
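The feature definition above can be sketched in a few lines. The slides do not give the exact tf and idf formulas, so this sketch assumes a raw count for tf and log(|D| / df) for idf; the toy corpus is made up, and the values will not match the numbers on the slides.

```python
import math
from collections import Counter


def idf(term, corpus):
    """Assumed idf: log(|D| / df); 0.0 when the term occurs in no document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0


def feature_vector(doc, corpus, terms):
    """x_d = (x_{t1,d}, ..., x_{tM,d}) with x_{t,d} = tf-idf(t, d, D) * b_{t,d}."""
    counts = Counter(doc)  # assumed tf: raw term count in the document
    vector = []
    for term in terms:
        b = 1 if term in counts else 0  # indicator b_{t,d}
        vector.append(counts[term] * idf(term, corpus) * b)
    return vector


# Made-up tokenized corpus; real input would come from a morphological analyzer.
corpus = [
    ["共同", "研究", "開発"],
    ["研究", "発表"],
    ["採用", "情報"],
]
terms = ["産官学", "共同", "研究"]
x = feature_vector(corpus[0], corpus, terms)
```

With a raw-count tf the indicator b is redundant, since tf is already zero for absent terms; it is kept here only to mirror the formula.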
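  • 9. Mapping a document into a feature vector
      •  A document
          東北大学は、NECとの共同研究によりCPU内で使用される電子回路(CAM:連想メモリプロセッサ)において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。
          (Translation: "Tohoku University, in joint research with NEC, has developed and demonstrated, for the first time in the world, a technology for electronic circuits used inside CPUs (CAM: associative memory processor) that achieves both high-speed operation equivalent to existing circuits and non-volatile operation, in which data is retained on the circuit even if the power is cut during processing.")
      •  Feature selection (動詞 = verb; 名詞,サ変接続 = Sahen noun)
          x = ( tf-idf(産官学, d, D), tf-idf(協力, d, D),
                tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D),
                tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D),
                tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D),
                tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D),
                tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D) )
      •  A feature vector
          x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)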
  • 10. Features (1)
      1)  BoW
          –  Bag of Words. Full output of MeCab (a Japanese morphological analyzer). The tf-idf values of the words make up the feature vector x_n.
      2)  BoW(N)
          –  Only nouns are chosen.
      3)  BoW(N-3)
          –  Words are restricted to proper nouns, general nouns, and Sahen nouns (nouns that form a verb when "する" ([suru], do) is added).
      4)  K(14)
          –  Fourteen keywords related to U-I relations. The keywords are "研究" ([kennkyu], research), "開発" ([kaihatsu], development), "実験" ([jikken], experiment), "成功" ([seikou], success), "発見" ([hakken], discover), "開始" ([kaisi], start), "受賞" ([jushou], award), "表彰" ([hyoushou], honor), "共同" ([kyoudou], collaboration), "協同" ([kyoudou], cooperation), "協力" ([kyouryoku], join forces), "産学" ([sangaku], UI relationship), "産官学" ([sankangaku], UIG (University-Industry-Government) relations), and "連携" ([renkei], coordination).
      5)  K(18)
          –  K(14) + 4 keywords: "受託" ([jutaku], entrusted with), "委託" ([itaku], consignment), "締結" ([teiketsu], conclusion), and "研究員" ([kennkyuin], researcher).
  • 11. Features (2)
      6)  K(18)+NM
          –  Each keyword is checked together with the POS (part of speech) of the next morpheme in the text; the grammatical connections of the keywords are restricted to verb, auxiliary verb, and Sahen noun.
      7)  Corp.
          –  Corporation marks.
          –  "株式会社" ([kabushikigaisha], Incorporated), ㈱ (the single Unicode character U+3231), or 株 enclosed in full-width or half-width parentheses: (株), (株).
      8)  Univ.
          –  The university name is checked.
          –  "大学" ([daigaku], university), or "大" ([dai], a shortened representation of university).
      9)  C.+U.
          –  Both a corporation mark and a university name occur in the same sentence.
      10)  ORG
          –  The presence of an organization name, detected with CaboCha's Japanese named entity extraction function.
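The Corp., Univ., and C.+U. checks can be illustrated with plain substring tests per sentence. This is only a sketch under several assumptions: sentences are split on "。", the bare abbreviation "大" is replaced by a short hypothetical list of abbreviated university names (a naive match on "大" alone would fire on many unrelated words), and the result is a raw sentence count rather than the tf-idf weighting used in the experiment. The sample text is made up.

```python
# Corporation marks from the slide: 株式会社, U+3231, and 株 in
# full-width or half-width parentheses.
CORP_MARKS = ("株式会社", "\u3231", "\uff08株\uff09", "(株)")

# "大学" plus a hypothetical list of abbreviated names (an assumption).
UNIV_MARKS = ("大学", "東大", "京大", "東工大", "東北大")


def split_sentences(text):
    """Split on the Japanese full stop (assumed sentence boundary)."""
    return [s for s in text.split("。") if s.strip()]


def has_any(sentence, marks):
    """True if any of the marks occurs as a substring of the sentence."""
    return any(mark in sentence for mark in marks)


def corp_univ_features(text):
    """Count sentences containing a corporation mark, a university name, or both."""
    sentences = split_sentences(text)
    corp = sum(has_any(s, CORP_MARKS) for s in sentences)
    univ = sum(has_any(s, UNIV_MARKS) for s in sentences)
    both = sum(
        has_any(s, CORP_MARKS) and has_any(s, UNIV_MARKS) for s in sentences
    )
    return {"Corp.": corp, "Univ.": univ, "C.+U.": both}


# Made-up sample text for illustration only.
sample = (
    "東京大学とオムロン株式会社は、共同研究を開始した。"
    "東北大は(株)日立製作所と協力した。"
    "京都大学は新しい講座を開設した。"
    "採用情報を更新した。"
)
features = corp_univ_features(sample)
```

Counting per sentence, rather than per document, keeps the C.+U. feature tied to the observation that both symbols appear within the same sentence.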
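  • 12. Feature selection and SVM kernel functions
      TF-IDF feature elements: (1) BoW, (2) BoW(N), (3) BoW(N-3), (4) K(14), (5) K(18), (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG

      Test ID  Feature elements                       Kernel function
      1-1      BoW                                    Linear
      1-2      BoW(N)                                 Linear
      1-3      BoW(N-3)                               Linear
      2-1      K(14)                                  Linear
      2-2      K(14)                                  Polynomial
      2-3      K(14)                                  RBF
      3-1      K(18)                                  Linear
      3-2      K(18)                                  Polynomial
      3-3      K(18)                                  RBF
      4-1      K(18)+NM                               Linear
      4-2      K(18)+NM                               Polynomial
      4-3      K(18)+NM                               RBF
      5-1      K(18)+NM, ORG                          Linear
      5-2      K(18)+NM, ORG                          Polynomial
      5-3      K(18)+NM, ORG                          RBF
      6-1      K(18)+NM, Corp., Univ., ORG            Linear
      6-2      K(18)+NM, Corp., Univ., ORG            Polynomial
      6-3      K(18)+NM, Corp., Univ., ORG            RBF
      7-1      K(18)+NM, Corp., Univ., C.+U.          Linear
      7-2      K(18)+NM, Corp., Univ., C.+U.          Polynomial
      7-3      K(18)+NM, Corp., Univ., C.+U.          RBF
      7-4      K(18)+NM, Corp., Univ., C.+U.          RBF (γ tuned)
      8-1      K(18)+NM, Corp., Univ., C.+U., ORG     Linear
      8-2      K(18)+NM, Corp., Univ., C.+U., ORG     Polynomial
      8-3      K(18)+NM, Corp., Univ., C.+U., ORG     RBF
      8-4      K(18)+NM, Corp., Univ., C.+U., ORG     RBF (γ tuned)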
  • 13. Data set for experiment
                              Crawled articles       Articles for experiment
      Organization            Positive   Negative    Positive   Negative
      Tohoku Univ.                  44        499          44         44
      The Univ. of Tokyo           106        848         106        106
      Kyoto Univ.                   40        329          40         40
      Tokyo Inst. of Tech.          37        343          37         37
      Hitachi Corp.                103        450         103        103
      Total                        330       2469         330        330
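The table shows that, for each organization, the negative articles used in the experiment are reduced to the number of positive articles. The slides do not say how the negatives were chosen; the sketch below assumes uniform random sampling per organization, and the article identifiers and the function name are hypothetical.

```python
import random

# (positive, negative) crawled article counts per organization, from the table.
CRAWLED = {
    "Tohoku Univ.": (44, 499),
    "The Univ. of Tokyo": (106, 848),
    "Kyoto Univ.": (40, 329),
    "Tokyo Inst. of Tech.": (37, 343),
    "Hitachi Corp.": (103, 450),
}


def build_balanced_set(crawled, seed=0):
    """Return (article_id, label) pairs with equal positives and negatives.

    Assumption: negatives are drawn uniformly at random per organization.
    """
    rng = random.Random(seed)
    data = []
    for org, (n_pos, n_neg) in crawled.items():
        pos_ids = [(org, "pos", i) for i in range(n_pos)]
        neg_ids = [(org, "neg", i) for i in range(n_neg)]
        k = min(n_pos, n_neg)
        data += [(article, 1) for article in rng.sample(pos_ids, k)]
        data += [(article, -1) for article in rng.sample(neg_ids, k)]
    return data


balanced = build_balanced_set(CRAWLED)
n_positive = sum(1 for _, label in balanced if label == 1)
n_negative = sum(1 for _, label in balanced if label == -1)
```

Balancing the classes means that a classifier which always predicts one class scores 50% accuracy, which gives the accuracy figures in the results a clear baseline.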
  • 14. Classification results (SVM light (Joachims))
      Average scores in 10-fold cross-validation

      Features                              Test ID  Accuracy  Precision  Recall  F-measure
      BoW                                   1-1         61.21      64.04   42.12      47.28
                                            1-2         60.61      63.75   40.00      45.54
                                            1-3         61.52      67.44   40.00      46.72
      K(14)                                 2-1         67.58      72.02   61.52      63.70
                                            2-2         58.03      69.76   23.33      34.45
                                            2-3         66.51      62.53   86.37      71.89
      K(18)                                 3-1         68.18      72.02   63.33      64.78
                                            3-2         57.88      69.00   23.03      34.08
                                            3-3         66.67      62.22   88.18      72.43
      K(18)+NM                              4-1         70.61      74.66   63.64      67.40
                                            4-2             -          -       -          -
                                            4-3         70.76      65.49   90.30      75.66
      K(18)+NM, ORG                         5-1         70.61      74.61   63.64      67.31
                                            5-2             -          -       -          -
                                            5-3         70.76      65.49   90.30      75.66
      K(18)+NM, Corp, Univ., ORG            6-1             -          -       -          -
                                            6-2             -          -       -          -
                                            6-3         70.15      64.64   93.64      76.09
      K(18)+NM, Corp, Univ., C+U            7-1         78.79      85.01   71.52      76.99
                                            7-2             -          -       -          -
                                            7-3         72.27      66.07   94.85      77.61
                                            7-4         80.15      78.81   83.94      81.05
      K(18)+NM, Corp, Univ., C+U, ORG       8-1         78.94      85.03   71.82      77.16
                                            8-2             -          -       -          -
                                            8-3         71.82      65.73   94.85      77.35
                                            8-4         79.85      78.51   83.94      80.86

      "-": not calculated because of zero precision or a learning optimization fault.
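The four scores in the table can be computed per fold and then averaged. The sketch below assumes labels in {1, -1} as on the SVM slides; the function names and the toy labels are made up, and returning 0.0 for an undefined precision or recall is a choice made here (the slides instead mark such cases as not calculated).

```python
def scores(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure for labels in {1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 when undefined (assumption)
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def average_over_folds(fold_scores):
    """Average each score over the cross-validation folds."""
    n = len(fold_scores)
    return {key: sum(s[key] for s in fold_scores) / n for key in fold_scores[0]}


# Two toy folds with made-up labels.
fold_a = scores([1, 1, -1, -1], [1, -1, -1, 1])
fold_b = scores([1, 1, -1, -1], [1, 1, -1, -1])
averaged = average_over_folds([fold_a, fold_b])
```

Averaging the per-fold F-measures is not the same as taking the harmonic mean of the averaged precision and recall, which may be why the F-measure column of the table does not equal 2PR / (P + R) of the other two columns.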
  • 15. Findings and discussion (1)
      •  In test IDs 1-1, 1-2, and 1-3, the feature elements are BoW terms, numbering over 15800, 13000, and 12000 respectively. Their F-measures are worse than those of the other feature sets with the same linear kernel function; learning seems to have failed.
      •  A likely reason for the failure is that the training data is far too small to learn from. With enough training data, the number of training examples would exceed the feature vector size, that is, the number of basis functions of the SVM, so that learning could proceed without over-fitting.
  • 16. Findings and discussion (2)
      •  In test IDs 2-1 to 8-3, the number of feature elements is about 14 to 33.
      •  Accuracy and F-measure gradually increase as more complex feature elements are added.
  • 17. Findings and discussion (3)
      •  Test IDs 7-* and 8-* involve the occurrence of university and company symbols. In ID 7-3 in particular, recall is the highest (94.85, tied with ID 8-3) and the F-measure (77.61) is the highest among the tests without γ tuning. This suggests that the co-occurrence of the two symbols in a sentence is a sensitive indicator of U-I relations.
      •  The scores strongly depend on the kernel function type.
      •  The parameters of the kernel function and the efficiency of the loss function affect the balance between precision and recall. The γ parameter of the Radial Basis Function is chosen to obtain the highest F-value under cross-validation in this experiment.
  • 18. Conclusion and future work
      •  To automatically extract resources on U-I relations from the web,
          –  we target "press release articles" of organizations,
          –  a classification technique, namely the support vector machine (SVM), is applied to the decision.
      •  We conducted an experiment on several combinations of feature vector elements and SVM kernel function types.
      •  The combinations reveal that
          –  U-I relations keywords, and
          –  university and company symbols in a sentence
         are effective feature elements.
      •  The SVM parameters are tuned to obtain a higher F-measure, which also affects the balance between precision and recall.
      •  Finally, we obtain an accuracy of 80.15 and an F-measure of 81.05 for classifying U-I relations documents on the web.
      •  In future work, we will build the classifier into a context crawler that automatically crawls the press release web sites of organizations and gathers more resources.