Text-mining practical
Presentation Transcript

  • unix primer
  • the command line
  • some useful commands
  • cat
  • less
  • head -10
  • tail -10
  • grep 'needle'
  • cut -f 2
  • sort
  • sort -nr
  • uniq -c
  • redirecting output
  • write to file
  • command > filename
  • using pipes
  • command1 | command2
  • putting it all together
  • cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile
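The pipeline above can be tried on a small file. This is a minimal sketch: the `demo` directory, the file contents, and the names `geneA`, `geneB`, and `geneC` are made up for illustration, and the only assumption about `infile` is that it is tab-delimited with the value of interest in column 4.

```shell
# Hypothetical tab-delimited input; column 4 holds the value to count.
mkdir -p demo
printf '%s\t%s\t%s\t%s\n' \
  d1 0 5 geneA \
  d1 9 14 geneB \
  d2 4 9 geneA \
  d3 1 6 geneA \
  d3 8 13 geneB \
  d4 2 7 geneC > demo/infile

# cut     : keep only column 4
# sort    : bring identical values next to each other
# uniq -c : collapse runs of identical lines and prefix each with its count
# sort -nr: order numerically by count, largest first
# head    : keep at most the first 100 lines
# >       : write the result to a file instead of the screen
cut -f 4 demo/infile | sort | uniq -c | sort -nr | head -100 > demo/outfile

cat demo/outfile
```

The first `sort` is needed because `uniq` only merges adjacent identical lines.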
  • the task
  • disease gene finding
  • named entity recognition
  • human genes
  • gene prioritization
  • what I have done
  • information retrieval
  • two diseases
  • prostate cancer
  • schizophrenia
  • two sets of documents
  • 62,755 abstracts
  • 65,588 abstracts
  • one directory with each set
  • one file with each abstract
  • dictionary
  • tab-delimited file
  • human genes
  • 22,523 entities
  • synonyms
  • from many databases
  • orthographic variation
  • prefixes and postfixes
  • automatically generated
  • 2,726,495 names
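A tab-delimited dictionary of this kind can be inspected with the commands from the primer. The sketch below assumes a layout that the slides do not specify: column 1 is an entity identifier and column 2 is one name for that entity, one name per line. The file name `demo/dict.tsv` and its contents are hypothetical.

```shell
# Hypothetical dictionary: column 1 = entity identifier, column 2 = one name.
mkdir -p demo
printf '%s\t%s\n' \
  E1 'gene a' \
  E1 'gene-a' \
  E1 'GA1' \
  E2 'gene b' > demo/dict.tsv

# Number of distinct entities (distinct values in column 1).
cut -f 1 demo/dict.tsv | sort | uniq | wc -l

# Number of names (one per line).
wc -l < demo/dict.tsv

# Names per entity, entities with the most names first.
cut -f 1 demo/dict.tsv | sort | uniq -c | sort -nr
```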
  • tagdir program
  • flexible matching
  • upper- and lower-case
  • spaces and hyphens
  • tab-delimited output
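The idea behind flexible matching can be illustrated by reducing each name to a comparison key. This is only a sketch of the concept, not how tagdir is implemented; the function name `normalize` and the example names are hypothetical.

```shell
# Reduce a name to a key: lower-case it, then delete spaces and hyphens.
normalize() {
  tr '[:upper:]' '[:lower:]' | tr -d ' -'
}

# Three spellings that differ only in case, spaces, and hyphens
# collapse to a single key.
printf 'ABC-1\nabc 1\nAbc1\n' | normalize | sort | uniq -c
```

Names that share a key would be treated as the same name under this scheme.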
  • what you will do
  • named entity recognition
  • find unfortunate names
  • create “black list”
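One possible way to build and apply a black list is sketched below. It assumes, without confirmation from the slides, that the tagger output is tab-delimited with the matched name in column 4. The file names and contents are hypothetical, and `awk` is used for the filtering step although it is not among the commands in the primer.

```shell
# Hypothetical tagger output: document, start, end, matched name.
mkdir -p demo
printf '%s\t%s\t%s\t%s\n' \
  d1 0 3 was \
  d2 5 8 was \
  d3 2 5 was \
  d1 9 14 geneA \
  d2 0 4 rest \
  d4 7 11 rest \
  d3 9 14 geneB > demo/matches.tsv

# Step 1: list the most frequently matched names as candidates for review.
cut -f 4 demo/matches.tsv | sort | uniq -c | sort -nr | head -10

# Step 2: after inspecting the list by hand, record the unfortunate names.
printf 'was\nrest\n' > demo/blacklist.txt

# Step 3: keep only matches whose name (column 4) is not on the black list.
# The first file fills the lookup table; the second file is filtered.
awk -F '\t' 'NR == FNR { bad[$1]; next } !($4 in bad)' \
  demo/blacklist.txt demo/matches.tsv > demo/filtered.tsv

cat demo/filtered.tsv
```

Step 2 is a manual judgment: a frequent name is not necessarily a wrong one, so the counts only suggest where to look.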
  • information extraction
  • co-mentioning
  • within abstracts
  • rank genes for each disease
  • find shared gene
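The ranking and comparison steps might look like the sketch below. It assumes that every abstract in a document set is about that set's disease, so a gene found in an abstract counts as a co-mention, and that the filtered matches are available as a two-column file (document, gene identifier). The file names, the function `rank_genes`, the cutoff of two genes, and all data are hypothetical.

```shell
# Hypothetical filtered matches per disease: document, gene identifier.
mkdir -p demo
printf '%s\t%s\n' p1 G1 p1 G1 p2 G1 p3 G1 p2 G2 p3 G2 p4 G3 > demo/prostate.tsv
printf '%s\t%s\n' s1 G4 s2 G4 s3 G4 s1 G1 s2 G1 s3 G5 > demo/schizo.tsv

# Rank genes by the number of distinct abstracts that mention them.
# The first sort | uniq removes repeated mentions within one abstract.
rank_genes() {
  cut -f 1,2 "$1" | sort | uniq | cut -f 2 | sort | uniq -c | sort -nr
}

rank_genes demo/prostate.tsv > demo/prostate.rank
rank_genes demo/schizo.tsv > demo/schizo.rank

# Keep the identifiers of the top 2 genes for each disease.
head -2 demo/prostate.rank | awk '{print $2}' > demo/prostate.top
head -2 demo/schizo.rank | awk '{print $2}' > demo/schizo.top

# A gene that appears in both top lists occurs twice after sorting;
# uniq -d prints only such repeated lines.
cat demo/prostate.top demo/schizo.top | sort | uniq -d
```

Counting distinct abstracts rather than raw mentions keeps a single abstract that repeats a name many times from dominating the ranking.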
  • a helping hand
  • “black list”
  • 100+ matches
  • 10+ matches
  • wrap up
  • prostate cancer
  • FOLH1
  • schizophrenia
  • Glutamate carboxypeptidase II
  • same protein
  • synonyms matter
  • “black list” is crucial
  • text mining is useful
  • not black magic
  • EMBO Practical Course, Computational Biology: Genomes to Systems, Puerto Varas, 3-9 April 2014. Thank you!