Text-mining practical
 


    • unix primer
    • the command line
    • some useful commands
    • cat
    • less
    • head -10
    • tail -10
    • grep 'needle'
    • cut -f 2
    • sort
    • sort -nr
    • uniq -c
    • redirecting output
    • write to file
    • command > filename
    • using pipes
    • command1 | command2
    • putting it all together
    • cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile
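
A minimal sketch of the combined pipeline above, run on a tiny made-up input so the effect of each stage is visible. The four-column layout and the sample values are assumptions for illustration only; the slides do not say what `infile` contains.

```shell
# Create a small tab-delimited sample file (hypothetical data, 4 columns).
work=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  doc1 0 2 AR \
  doc1 10 14 TP53 \
  doc2 5 7 AR \
  doc3 1 6 FOLH1 \
  doc3 20 24 TP53 \
  doc4 3 5 AR > "$work/infile"

# Take column 4, group identical values, count them,
# sort by count (largest first) and keep at most 100 rows.
cut -f 4 "$work/infile" | sort | uniq -c | sort -nr | head -100 > "$work/outfile"

# Show the result: one line per distinct value, preceded by its count.
cat "$work/outfile"
```

The first `sort` is needed because `uniq -c` only merges adjacent identical lines; the second `sort -nr` orders the counted lines numerically, highest count first.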
    • the task
    • disease gene finding
    • named entity recognition
    • human genes
    • gene prioritization
    • what I have done
    • information retrieval
    • two diseases
    • prostate cancer
    • schizophrenia
    • two sets of documents
    • 62,755 abstracts
    • 65,588 abstracts
    • one directory with each set
    • one file with each abstract
    • dictionary
    • tab-delimited file
    • human genes
    • 22,523 entities
    • synonyms
    • from many databases
    • orthographic variation
    • prefixes and postfixes
    • automatically generated
    • 2,726,495 names
    • tagdir program
    • flexible matching
    • upper- and lower-case
    • spaces and hyphens
    • tab-delimited output
    • what you will do
    • named entity recognition
    • find unfortunate names
    • create “black list”
    • information extraction
    • co-mentioning
    • within abstracts
    • rank genes for each disease
    • find shared gene
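
One way the ranking and comparison steps could look, sketched with the commands from the unix primer plus `comm`. Everything concrete here is an assumption: the two-column layout (abstract identifier, gene) stands in for the real tagger output, whose columns the slides do not specify, and the file names and sample rows are made up.

```shell
work=$(mktemp -d)

# Hypothetical tagger output per disease: abstract id <TAB> gene.
printf '%s\t%s\n' p1 FOLH1 p1 FOLH1 p2 FOLH1 p2 AR p3 AR p4 AR \
  > "$work/prostate.tsv"
printf '%s\t%s\n' s1 DISC1 s2 DISC1 s2 FOLH1 s3 COMT \
  > "$work/schizophrenia.tsv"

for d in prostate schizophrenia; do
  # Count each gene once per abstract (sort -u drops repeated mentions),
  # then rank genes by the number of abstracts that mention them.
  sort -u "$work/$d.tsv" | cut -f 2 | sort | uniq -c | sort -nr \
    > "$work/$d.ranked"
  # Sorted list of distinct genes, used for the comparison below.
  cut -f 2 "$work/$d.tsv" | sort -u > "$work/$d.genes"
done

# Genes that appear in both lists (comm needs sorted input).
comm -12 "$work/prostate.genes" "$work/schizophrenia.genes" \
  > "$work/shared.genes"

cat "$work/prostate.ranked" "$work/schizophrenia.ranked" "$work/shared.genes"
```

Counting abstracts rather than raw mentions is a design choice in this sketch: it keeps a single abstract that repeats one name many times from dominating the ranking.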
    • a helping hand
    • “black list”
    • 100+ matches
    • 10+ matches
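
A possible sketch of building and applying a "black list". It assumes a two-column matches file (abstract identifier, matched name) and uses a small demonstration threshold in place of the 100+ mentioned above; the file names, the column layout, and the sample rows are all hypothetical. Names above the threshold are only candidates and would still need manual review before being black-listed.

```shell
work=$(mktemp -d)

# Hypothetical matches: abstract id <TAB> matched name.
printf '%s\t%s\n' \
  a1 was a1 AR a2 was a2 FOLH1 a3 was a3 AR a4 was > "$work/matches.tsv"

# Demonstration threshold; on real data this might be 100.
threshold=3

# Count matches per name and keep the names at or above the threshold.
# The sub() call strips the count that uniq -c puts in front of each name.
cut -f 2 "$work/matches.tsv" | sort | uniq -c | sort -nr |
  awk -v min="$threshold" '$1 >= min { sub(/^ *[0-9]+ /, ""); print }' \
  > "$work/candidates.txt"

# After manual review, the accepted candidates become the black list.
# Here every candidate is accepted, purely for illustration.
cp "$work/candidates.txt" "$work/blacklist.txt"

# Drop every match whose name (column 2) is on the black list.
awk -F '\t' 'FILENAME == ARGV[1] { bad[$0] = 1; next } !($2 in bad)' \
  "$work/blacklist.txt" "$work/matches.tsv" > "$work/filtered.tsv"

cat "$work/filtered.tsv"
```

The filter compares the whole second column against the list, so a black-listed name is removed only on an exact match and not when it merely occurs inside a longer name.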
    • wrap up
    • prostate cancer
    • FOLH1
    • schizophrenia
    • Glutamate carboxypeptidase II
    • same protein
    • synonyms matter
    • “black list” is crucial
    • text mining is useful
    • not black magic
    • EMBO Practical Course, Computational Biology: Genomes to Systems, Puerto Varas, 3-9 April 2014
    • Thank you!