Sketch engine presentation

Published in: Education, Technology
  • It is a corpus query tool which takes as input a corpus of any language (with an appropriate level of linguistic mark-up) and corresponding grammar patterns, and which generates, amongst other things, word sketches for the words of that language. Those other things include a corpus-based thesaurus and ‘sketch differences’, which specify, for two semantically related words, what behaviour they share and how they differ. We anticipate that sketch differences will be particularly useful for lexicographers interested in near-synonym differentiation. Word sketches were first used in the production of the Macmillan English Dictionary (Rundell 2002) and were presented at Euralex 2002 (Kilgarriff and Rundell 2002). Following that presentation, the most-asked question was “can I have them for my language?” In response, we have now developed the Sketch Engine.
  • The Sketch Engine has a number of language-analysis functions, the core ones being:
    The Concordancer: a program which displays all occurrences in the corpus for a given query. It is very powerful, with a wide variety of query types and many different ways of displaying and organising the results (concordancing, sorting, sampling, word lists, collocation lists).
    The Word Sketch program: provides a corpus-based summary of a word's grammatical and collocational behaviour.
  • With Corpus Architect, you can build your own corpora from documents in various formats: TXT, PDF, PS, DOC, HTML, VERT. Once the documents are processed, you can search and query them within Sketch Engine.
  • Concordance: for querying a corpus and obtaining concordances, which you can then further refine, filter and use for generating frequency information and collocation lists.
    Word List: for obtaining word lists for an entire corpus, or a specified subcorpus.
    Word Sketch: this allows you to explore the grammatical and collocational behaviour of a word.
    Thesaurus: this allows you to find other words that have similar grammatical and collocational behaviour to a given word. Note that this thesaurus is produced automatically from statistics on word co-occurrences. It is not a manually constructed thesaurus and will list words for each entry which are distributionally related but not necessarily synonyms.
    Sketch-Diff: this allows you to compare the behaviour of two words.
  • Main Sketch Engine Links: https://www.sketchengine.co.uk/documentation/wiki/SkE/Help/MainLinkHelp
  • Concordance Query: https://www.sketchengine.co.uk/documentation/wiki/SkE/Help/PageSpecificHelp/ConcordanceQuery
    Query Types: using Query Type, you can refine the type of query you wish to make in the main panel.
    Context: if Context is selected in the left-hand side menu, you can specify criteria on the context for your query in the main panel, in terms of surrounding lemma(s) and/or PoS tag(s).
    Text Types: here you can select a subcorpus or create a new subcorpus from a subset of the current corpus. You can also stipulate constraints on the text types of the documents that will be searched for your query.
  • CQL: https://www.sketchengine.co.uk/documentation/wiki/SkE/CorpusQuerying#1.
  • Ex1: Lemma filter: Window: right, 1 token; Lemma(s): عن; none
  • Concordance Menu options: https://www.sketchengine.co.uk/documentation/wiki/SkE/Help/PageSpecificHelp/Concordance
    Menu options: note that the options in the left-hand side panel are all available when you are viewing the concordance. Some of the options will not be shown if you have already selected from this menu. If so, you can click View Concordance to get back to the concordance.
    View Options: lets you alter how the concordance looks and select which attributes of the words in the concordance you see.
    KWIC/Sentence: toggles between KWIC mode, where the queried text (node) is in a central column with context displayed on either side, and Sentence mode, where the queried text (node) is shown in the context of the sentence in which it occurs.
    Save: shows options for saving the concordance in the main panel (or the frequency list or collocation candidates).
    Sort: shows complex sorting options. If the concordance is sorted by context, a "Jump to" option appears for jumping to a page whose context starts with a certain letter. Alternatively, you can click on:
        Left (Right): to sort by the text to the left (right) of the node.
        Node: to sort by the text in the central column (referred to as the node or KWIC).
        References: to sort by the document references at the left-hand side of the concordance.
        Shuffle: to jumble the concordance, avoiding the bias that comes from only looking at the first portion.
    Sample: selects a random sample of the concordance lines.
    Filter: lets you specify further contextual features to filter the concordance, for example by words to the left or right of the node word, or by text type.
    Frequency: shows a variety of complex methods for obtaining frequency lists. Alternatively, you can click on:
        Node tags: to get a frequency list over the part-of-speech tags of the node word(s) in the central column.
        Node forms: to get a frequency list over the node word forms in the central column.
        Doc IDs: to get a frequency list over the document IDs for the node word(s) in the central column.
        Text Types: to get a frequency list over all the text types of the node word(s) in the central column.
    Collocations: lets you specify criteria and build collocation lists for the node word(s) in the central column.
    ConcDesc: shows the query in detail (for technical people) and lets you go back in the history if the query consists of several subsequent actions.
    Visualize: shows the distributional graph of the concordance within the corpus. The x-axis shows concordance positions (by default 100 columns for 100 slices of the corpus; you can change the granularity with the slider and then click the Redraw button). The y-axis shows the relative frequency of the query hits within each concordance part (column). Columns are clickable: clicking on a column filters the concordance so that you see only the corresponding part.
  • Word List Options:
    Left-hand side options:
        Select All words to generate a list of words in the corpus ranked by frequency.
        Select All lemmas to generate a list of lemmas in the corpus ranked by frequency. A lemma is the base (stem) form of a word.
    In the main panel of the interface you have further options:
        Subcorpus: where you can specify a subcorpus for the source data, or create a new one.
        Search Attribute: you can specify word, lemma, tag (part-of-speech tag) etc., depending on the attributes defined for the corpus, or you can specify one of the text types defined for the corpus. The default attribute is word.
        Filter Options: you can either build the list for all words (or lemmas, or whichever attribute you specify) or filter the list.
        Output Options: you can select different types of output list.
  • Choose a corpus and click on Word List in the left-hand side menu.
    Choose lemma at Search attribute.
    Type the lemma (e.g. حار) into the RE pattern box.
    Tick the box that says change output attribute(s).
    In the first two levels, select "lemma" and "Tag".
    Click on Make Word List.
  • Wordlist  search Attr: lemma, Change Attr: gender
  • Sketch engine presentation

    1. Introduction to Sketch Engine – http://www.sketchengine.co.uk/
    2. Outline: Basic Terminology; Introduction; How to Use Sketch Engine?; Research Issues
    3. Basic Terminology (English terms): Corpus / Corpora (≠ Blog); Parallel Corpora; Comparable Corpus; Written Corpora; Spoken Corpora
    4. Basic Terminology (English terms): Collocation; Concordances; Lemma
    5. Basic Terminology (English terms): Part-of-Speech (PoS) Tagging (code, tag); Thesaurus
    6. What is Sketch Engine? It is a corpus query tool which takes as input a corpus of any language and corresponding grammar patterns, and which generates, amongst other things, word sketches for the words of that language. The Sketch Engine is designed for anyone wanting to research how words behave. (Diagram labels: SkE, Corpus, Word Sketches)
    7. What is Sketch Engine? Upload your own corpus; access to public corpora; advanced search options
    8. Sketch Engine Features: (1) Web-based tool, no installation; (2) Supports Arabic corpora; (3) The Concordancer with advanced options; (4) The Word Sketches
    9. Sketch Engine Features: (5) The Thesaurus (find similar words); (6) Support for parallel corpora, virtual sub- and super-corpora; (7) Full regular-expression searching using CQL; (8) Corpus Architect: user corpora, uploaded by users or created by WebBootCaT
    10. Who Uses Sketch Engine? Language learners, writers, linguists, researchers
    11. Sketch Engine usage: common words/collocations, synonyms, grammar, word behaviour
    12. Available corpora: 200+ corpora in 60+ languages
    13. Available Arabic corpora
    14. How to create your corpus using SkE?
    15. Steps to create a corpus in SkE: Raw text → Tokenization → Lemmatization → POS tagging → Sketch Grammar → SkE features (Word Sketches, Sketch Diff, Thesaurus)
    16. 1- Upload your text: Sketch Engine accepts file types such as .xml, .doc, .docx, .htm, .html, .pdf, .txt, …
    17. 2- Tokenization: the process of splitting the text into words and adding structure tags (<s>, <doc>, <p>). The output is a vertical file (one token per line).
    18. 3- Lemmatization (optional): the process of annotating each word with its lemma.
    19. 4- POS tagging (mandatory for word sketches): the process of annotating each word with its part-of-speech tag. An SkE Arabic tagger is not available. Example tags: V, PN, N
    20. 5- Uploading a Sketch Grammar: a file describing the grammatical relations in a language. Example: 1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N"
    21. Vertical file with annotations
    22. Adding data to the corpus by uploading a file
    23. Adding data to the corpus using WebBootCaT: Seeds/URLs → WebBootCaT → Your corpus
    24. How to Use Sketch Engine? As a corpus user (querying corpora): Concordance, Word Lists, Word Sketches, Sketch Diff, Thesaurus
    25. Concordance
    26. Concordance: What is a Concordancer? A concordancer looks through the whole corpus and finds every example of a particular word or phrase, then displays it with its immediate context.
    28. Query Types, Context, Text Types
    29. Concordance Query Types: Simple, Lemma, Phrase, Word, Character, CQL
    30. Concordance Query Types, Simple: will match the lemma (the stemmed form) as well as the word; also works for phrases.
    31. Concordance Query Types, Lemma: will match any lemma; you can also select the PoS (not for Arabic corpora). This option will not work for phrases.
    32. Concordance Query Types, Phrase: will match a phrase and any capitalized variant (not for Arabic corpora), but will not match the lemma.
    33. Concordance Query Types, Word: will match any word form exactly. You can also select the PoS (not for Arabic corpora) and "match case" (not for Arabic corpora).
    34. Concordance Query Types, Character: matches a character string.
    35. Concordance Query Types, CQL: for inputting complex queries using Corpus Query Language.
    36. Concordance, Corpus Query Language (Basics): The general form is [attr="value"]. "Match any character" operator: *. Or and And operators: | and &.
    37. Concordance, Corpus Query Language (Basics): "Match any token" operator: []. Specifying the number of tokens: {} (for example, 0-3 tokens).
    38. Concordance Exercises (CQL): Ex1: «»; Ex2:
    39. Concordance Exercises (CQL): Ex1: "" [] ""; Ex2: "" [] {0,3} "|"
    40. Context
    41. Concordance Context: here you can specify criteria on the context for your query. Ex1: «» «»; Ex2: «» «»
    42. Concordance Context (Exercise)
    43. Concordance Context (Exercise)
    44. Text Types
    45. Concordance Text Types: here you can select a sub-corpus, or create a new sub-corpus from a subset of the current corpus. You can also set constraints on the text types of the documents that will be searched for your query.
    46. Concordance Text Types
    47. Concordance Menu Options: Save, View Options, Sort, Sample, Filter, Frequency, Collocations, ConcDesc, Visualize
    48. Concordance Exercises: Ex1: Filter; Ex2: Collocation «»; Ex3: Frequency – Node Tags «»; Ex4: CQL – Frequency – Node Forms: «» «»
    49. Concordance Exercises: Ex1: Concordance → Make Concordance → Filter → select negative, Simple query. Ex2: Concordance → Make Concordance → Collocation → Attribute: word → Make Candidate List. Ex3: Concordance → Make Concordance → click Node Tags. Ex4: Concordance → CQL: «» «|»
    50. Word List
    51. Word List: What is the Word List? Word List: for obtaining word lists ranked by frequency for an entire corpus, or a specified sub-corpus. It can be useful for investigating whether a word is used most frequently in its verb or noun form, for instance.
    52. Word List: Input: RE pattern or any attribute (word, tag, lemma, …). Output: filtered list of lemmas and/or words with frequencies.
    54. Word List Exercises: Ex1: «» «»
    55. Choose lemma at Search attribute. Type the lemma (e.g. حار) into the RE pattern box. Tick the box that says change output attribute(s). In the first two levels, select "lemma" and "Tag".
    57. Word List Exercises: Ex1: «»
    58. Word List Exercises
    59. Word List Exercises
    60. Word Sketch
    61. Word Sketch: What is a Word Sketch? Word Sketch: this allows you to explore the grammatical and collocational behaviour of a word. The Word Sketch function doesn't just tell you what words are commonly found in the company of your search word, but also tells you what their grammatical relationship is to the search word.
    62. Word Sketch: Input: lemma. Output: collocations in grammatical relations.
    63. Word Sketch Example
    64. Word Sketch Example
    65. Word Sketch Exercises: Ex1: «»; Ex2: «»
    66. Thesaurus
    67. Thesaurus: What is the Thesaurus? Thesaurus: this allows you to find other words that have similar grammatical and collocational behaviour to a given word. Note that this thesaurus is produced automatically from statistics on word co-occurrences. It is not a manually constructed thesaurus and will list words for each entry which are distributionally related but not necessarily synonyms.
    68. Thesaurus: Input: lemma + POS tag. Output: similar lemmas.
    69. Thesaurus Example
    70. Thesaurus Example
    71. Thesaurus Example
    72. Word Sketch Difference
    73. Sketch-Diff: What is Word Sketch Difference? Sketch-Diff: this allows you to compare the behaviour of two words. This function is also very useful for comparing, or deciding between, two possible translations of an item.
    74. Sketch-Diff: Input: two words or lemmas. Output: the different and common collocations of the two lemmas.
    75. Sketch-Diff Example
    76. Sketch-Diff Example
    77. Sketch-Diff Exercises: Ex1: /; Ex2: /
    78. Compare corpora
    80. Research Issues! Limitations! Usage! Please visit: http://goo.gl/HqhUir
    81. References: http://www.sketchengine.co.uk/; http://lisan1.com/wordpress/?p=146; Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). ITRI-04-08: The Sketch Engine. Information Technology, 105, 116.
    82. Thank You
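
The CQL basics on slides 36, 37 and 39 (the general form [attr="value"], the match-any-token operator [], the token-count operator {}, and the | and & operators) can be sketched as a small query-string builder. The helper names `token`, `any_tokens` and `one_of` are hypothetical, and the English words are placeholders rather than the original exercise terms. The code only assembles a query string; it does not contact Sketch Engine.

```python
def token(**conditions):
    """Build one CQL token such as [lemma="make" & tag="V"].

    With no conditions this returns [], the match-any-token form."""
    parts = [f'{attr}="{value}"' for attr, value in conditions.items()]
    return "[" + " & ".join(parts) + "]"


def any_tokens(min_count, max_count):
    """Match between min_count and max_count arbitrary tokens: []{m,n}."""
    return f"[]{{{min_count},{max_count}}}"


def one_of(*values):
    """Join alternative values with the | (or) operator."""
    return "|".join(values)


# Placeholder query: the lemma "make", then up to three arbitrary tokens,
# then the word form "decision" or "decisions".
query = " ".join([
    token(lemma="make"),
    any_tokens(0, 3),
    token(word=one_of("decision", "decisions")),
])
print(query)
```

The resulting string has the same shape as the Ex2 pattern on slide 39 and could be pasted into the CQL query box.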
