Does Size Matter?


Published in: Technology, Education
  • Comment on why the blt sub is different for each person. Stress that the lightweight ontologies that emerge are personal ontologies: the way in which a user refers to or characterises a set of entities. Go more slowly through the evaluation. Say what you
  • In this work we explored how the size of documents influences the accuracy of a given text processing task. Our hypothesis was that results obtained from longer texts can be approximated by results obtained from microblog-size texts, that is, texts of at most 140 characters. To test this hypothesis we generated comparable corpora of truncated emails, starting at 140 characters and growing in successive multiples thereof. The text processing task we evaluated was topic extraction and topic classification.
  • We were particularly interested in studying this hypothesis on email-based corpora because previous research has shown that, although micropost services are also used in formal environments such as the workplace, these services tend to be perceived negatively because they seem to reduce productivity. In environments where such services are restricted, email tends to be used as an alternative, showing the same communication patterns, which favour brevity.
  • Within the email-based corpus we wanted to determine whether email is indeed used as a short messaging service, and to evaluate to what extent the knowledge content of a truncated message is comparable to that of the full message. We also wanted to determine whether the knowledge content of short emails can be used to obtain useful information about, e.g., topics of interest or expertise within an organisation, as a basis for tasks such as expert finding or content-based social network analysis (SNA).
  • Our data set is the Oak mailing list: 659 emails sent between July 2010 and January 2011. As the graph shows, the corpus is already characterised by a high frequency of short messages, between zero and 280 characters long.
  • To assess to what degree the knowledge content of a shorter message is comparable to that of the full message, we chose a text classification task on non-predefined topics. First we partitioned the emails to obtain several fixed-size corpora. Then we performed corpus-driven topic extraction, in which each topic consists of a weighted vector of keywords. Finally, each document was labelled with the topic it is most similar to.
  • Building on the definition of Naaman, Wagner and Strohmaier introduce a notation for characterising these streams. They refer to this notation as a Tweetonomy. We'll talk about it later on. They also introduce the term Personal Awareness Stream for referring to the c

    Does Size Matter? When Small is Good Enough
    A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi and N. Ireson
    The Oak Group, Department of Computer Science, The University of Sheffield

    Outline
      • Introduction
      • Email Corpus
      • Dynamic Topic Classification of Short Texts
      • Experiments
      • Conclusions

    Introduction: Settings
      • Main goal: observe the influence of the size of documents on the accuracy of a defined text processing task
      • Hypothesis: results obtained using longer texts may be approximated by short texts of micropost size, i.e., maximum length 140 characters
      • Dataset: artificially generated comparable corpora, consisting of emails truncated to micropost size (140 characters) and successive multiples thereof
      • Methodology: corpus-driven topic extraction and document topic classification

    Introduction: Micropost Services
      • Mainly used for social information exchange, but also used in more formal (working) environments [Herbsleb et al., 2002, Isaacs et al., 2002]
      • Sometimes perceived negatively in the workplace, as they may be seen to reduce productivity [TNS US Group, 2009] and/or pose threats to security and privacy
      • Where restrictions on use are in place, alternatives are sought that obtain the same benefits
      • The same communication patterns appear in alternative media: email as a short message service for communication via, e.g., mailing lists

    Introduction: Research Questions
      • Statistical analysis of emails exchanged via the Oak mailing list (over a period of six months), to determine whether email is indeed used as a short messaging service
      • Content analysis of emails as microposts, to evaluate to what degree the knowledge content of truncated or abbreviated messages can be compared to that of the complete message

    Email Corpus: Oak Mailing List
      • Six-month period (July 2010 - January 2011), 659 emails
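
The statistical analysis in the first research question can be sketched as counting how many emails fit within one, two or more micropost-sized blocks. This is only a minimal illustration: the function names and the toy message bodies are hypothetical, and it assumes length is measured in characters of the message body, which the slides do not spell out.

```python
from collections import Counter

MICROPOST = 140  # micropost size in characters, as stated in the slides


def block_count(text, block=MICROPOST):
    """Number of micropost-sized blocks needed to hold the text (at least 1)."""
    return max(1, -(-len(text) // block))  # ceiling division


def cumulative_share(bodies, max_blocks=3):
    """Map n -> fraction of emails that fit within n micropost blocks."""
    total = len(bodies)
    if total == 0:
        return {}
    counts = Counter(block_count(b) for b in bodies)
    shares, running = {}, 0
    for n in range(1, max_blocks + 1):
        running += counts.get(n, 0)
        shares[n] = running / total
    return shares


# Toy bodies invented for illustration only (not the Oak mailing list data)
toy = ["x" * 20, "x" * 140, "x" * 141, "x" * 500]
print(cumulative_share(toy))  # {1: 0.5, 2: 0.75, 3: 0.75}
```

Run on a real corpus, the values for n = 1 and n = 2 would correspond to the "single micropost" and "up to two microposts" shares discussed in the talk.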

    Topic Classification
      • Goal: evaluate to what degree the knowledge content of a shorter message can be compared to that of a full message
      • Chosen task: text classification on non-predefined topics
      • Test bed: generated by preprocessing the Oak email corpus to obtain several fixed-size corpora
      • Method:
        - Corpus-driven topic extraction: a number of topics are automatically extracted from a document collection; each topic is represented as a weighted vector of terms
        - Document topic classification: each document is labelled with the topic it is most similar to, and classified into the corresponding cluster
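
The document topic classification step can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes documents and topics are sparse term-to-weight dictionaries and that similarity is cosine similarity (the measure named on the Email Topic Classification slide). The names `cosine` and `label_doc` and the toy topics are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term -> weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def label_doc(doc, topics):
    """Return the name of the topic whose term vector is most similar to doc."""
    return max(topics, key=lambda name: cosine(doc, topics[name]))


# Toy topic vectors and document, invented for illustration only
topics = {
    "semantic_web": {"ontology": 0.9, "rdf": 0.7},
    "meetings": {"meeting": 0.8, "room": 0.6},
}
doc = {"meeting": 1.0, "ontology": 0.2}
print(label_doc(doc, topics))  # meetings
```

When two topics tie, `max` returns the first one in dictionary order; a real system would need an explicit tie-breaking rule.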

    Topic Classification: Topic Extraction by Proximity-based Clustering
      • Document corpus D = {d1,...,dk}
      • Each document di = {t1,...,tv} is a vector of weighted terms
      • Term clusters C = {c1,...,ck} (clustering performed using the inverted index of D as feature space)
      • Each cluster ci = {t1,...,tn} is a vector of weighted terms
      • Each cluster ideally represents a topic in the document collection

    Topic Classification: Email Topic Classification
      • sim(d,ci): similarity between a document and a cluster (cosine similarity)
      • labelDoc(D): each document d is mapped to the topic ci that maximises the similarity sim(d,ci)

    Experiments: Dataset Preparation (Comparable Corpora)
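
One way the comparable corpora might be generated is sketched below. It assumes plain character-based truncation from the start of each message body; the slides do not say whether word boundaries, headers or quoted text were treated specially. The names `truncate` and `comparable_corpora` are hypothetical.

```python
MICROPOST = 140  # micropost size in characters, as stated in the slides


def truncate(text, n_blocks, block=MICROPOST):
    """Keep only the first n_blocks * block characters of the text."""
    return text[: n_blocks * block]


def comparable_corpora(main_corpus, max_blocks):
    """Return {n: corpus truncated to n micropost blocks} for n = 1..max_blocks."""
    return {
        n: [truncate(doc, n) for doc in main_corpus]
        for n in range(1, max_blocks + 1)
    }


# Toy corpus invented for illustration only
main_corpus = ["a" * 300, "short"]
corpora = comparable_corpora(main_corpus, 3)
print([len(d) for d in corpora[1]])  # [140, 5]
print([len(d) for d in corpora[3]])  # [300, 5]
```

Messages shorter than the cut-off are left unchanged, so every truncated corpus contains the same number of documents as the main corpus and the labels can be compared document by document.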

    Experiments: Experimental Approach
      • Corpus-driven topic extraction: using mainCorpus
      • Email classification: apply the labelDoc procedure to all the comparable corpora, including mainCorpus

    Experiments: Corpus Topics

    Experiments: Results
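
A comparison across corpus sizes could be computed along these lines. This sketch assumes that the labels assigned on the full-text mainCorpus serve as the reference and that accuracy is the fraction of truncated documents receiving the same label. The slides do not state the exact measure, so treat this as one plausible reading; the function names and labels are hypothetical.

```python
def agreement(reference, predicted):
    """Fraction of documents whose predicted label equals the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not reference:
        return 0.0
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def agreement_by_size(reference, labels_by_blocks):
    """Map each corpus size (in micropost blocks) to its agreement score."""
    return {
        n: agreement(reference, labels)
        for n, labels in sorted(labels_by_blocks.items())
    }


# Toy labels invented for illustration only (not the reported results)
reference = ["a", "b", "a", "c"]
labels_by_blocks = {1: ["a", "b", "c", "c"], 2: ["a", "b", "a", "c"]}
print(agreement_by_size(reference, labels_by_blocks))  # {1: 0.75, 2: 1.0}
```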

    Conclusions
      • A fair portion of the emails exchanged, for the corpus generated from a mailing list, are very short, with approximately 40% falling within the single micropost size, and 65% within up to two microposts.
      • For the text classification task described, the accuracy of classification for micropost-size texts is an acceptable approximation of classification performed on longer texts, with a decrease of only ∼5% for up to the second micropost block within a long e-mail.

    Conclusions: Future Work
      • Enriching the micro-emails with semantic information (e.g., concepts extracted from domain and standard ontologies) would improve the results obtained using unannotated text.
      • Investigate the influence of other similarity measures.
      • Application to expert finding tasks, exploiting dynamic topic extraction as a means to determine authors' and recipients' areas of expertise.
      • Formal evaluation of topic validity will be required, including the human (expert) annotator in the loop.

    References
      [1] Herbsleb, J. D., Atkins, D. L., Boyer, D. G., Handel, M., and Finholt, T. A. (2002). Introducing instant messaging and chat in the workplace. In Proc., SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pages 171–178.
      [2] Isaacs, E., Walendowski, A., Whittaker, S., Schiano, D. J., and Kamm, C. (2002). The character, functions, and styles of instant messaging in the workplace. In Proc., ACM conference on Computer supported cooperative work, pages 11–20.
      [3] TNS US Group (2009). Social media exploding: More than 40% use online social networks. http://www.tns-