Information Retrieval and Extraction
Slides from a talk given at a meeting of the NY Perl Mongers on 5/21/13.

Transcript

  • 1. Christopher M. Frenz
  • 2.
      ◦ Information is being generated at a faster rate than ever before
      ◦ The speed at which information can be generated is continually increasing
      ◦ Continuous improvements in computers, storage, and networking make much of this information readily available to individuals
  • 3. 54,000 hits
  • 4.
      ◦ Most search engines use a keyword-based approach
      ◦ If a document contains all of the keywords specified, it is returned as a match
      ◦ Ranking algorithms (e.g. PageRank) are used to put the most relevant results at the top of the list and the least relevant at the bottom
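The all-keywords rule on slide 4 can be sketched in a few lines of Perl. This is only a minimal illustration, not how any particular search engine is implemented: the sub name `matches_all_keywords` and the sample documents are made up for the example, and ranking is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $doc contains every keyword
# (case-insensitive, whole-word match).
sub matches_all_keywords {
    my ($doc, @keywords) = @_;
    for my $kw (@keywords) {
        return 0 unless $doc =~ /\b\Q$kw\E\b/i;
    }
    return 1;
}

# Made-up sample documents, for illustration only.
my @docs = (
    'Perl makes regular expression matching easy',
    'Regular exercise is healthy',
);

# Print only the documents that contain both keywords.
for my $doc (@docs) {
    print "$doc\n" if matches_all_keywords($doc, qw(perl regular));
}
```

The `\Q...\E` quoting keeps a keyword such as `c++` from being read as regex syntax.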
  • 5.
      ◦ Not everything can be easily expressed as a keyword
      ◦ Suppose you want to search for unknown phone numbers. How can you do this with keywords?
      ◦ How do we recognize a phone number when we see one?
  • 6.
      ◦ We recognize a phone number by recognizing the pattern of digits
          ◦ (XXX) XXX-XXXX
      ◦ While it is hard to express such a pattern in the form of a keyword, it is easy to express it in the form of a regular expression
          ◦ (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})
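  • 7. #!/usr/bin/perl
       use strict;
       use warnings;

       # sample text: the first four lines fit the pattern, the last does not
       (my $string=<<LIST);
       John (555) 555-5555 fits pattern
       Bob 234 567-8901
       Mary 734-234-9873
       Tom 999 999-9999
       Harry 111 111 1111 does not fit pattern
       LIST

       # prints every phone number found in the text
       while($string=~/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g){
           print "$1\n";
       }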
  • 8.
      ◦ Conduct a broad keyword search using an existing search engine
      ◦ Use your custom-coded application to take the returned search results and perform regular expression based pattern matching
      ◦ The results that match your regular expression are your refined search results
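The refinement step on slide 8 can be sketched as a filter over results that have already been fetched. This is a minimal sketch under stated assumptions: `refine_results` is a hypothetical helper name, and the snippets are made-up stand-ins for what a keyword search returned. Only the phone-number pattern comes from the talk.

```perl
use strict;
use warnings;

# Phone-number pattern from the earlier slides.
my $phone = qr/\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}/;

# Hypothetical helper: keep only the results whose text matches $pattern.
sub refine_results {
    my ($pattern, @results) = @_;
    return grep { $_ =~ $pattern } @results;
}

# Made-up snippets standing in for the output of a keyword search.
my @hits = (
    'Contact the front desk at (555) 555-5555',
    'Our office is open Monday through Friday',
    'Fax: 734-234-9873',
);

# Prints the snippets that contain something shaped like a phone number.
print "$_\n" for refine_results($phone, @hits);
```

Passing the pattern as a `qr//` object keeps the filter generic, so the same helper works for any other pattern.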
  • 9. General Search APIs
      ◦ Bing
      ◦ Yahoo BOSS
      ◦ Blekko
      ◦ Yandex
      ◦ Twitter
       Specialized Search APIs
      ◦ Medicine – PubMed
      ◦ Physics – arXiv
      ◦ Government – GovTrack
      ◦ Finance – Yahoo Finance
      ◦ etc.
  • 10. Seeking to extract: DFN [A-Z]\d+
        Script described in: http://www.biomedcentral.com/1472-6947/7/32
  • 11. #!/usr/bin/perl
        use LWP;
        use strict;
        use warnings;

        #sets query and congress session
        my $query='fracking';
        my $congress=112;
        my $ua = LWP::UserAgent->new;
        my $url="http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
        my $response=$ua->get($url);
        my $result=$response->content;
        print $result;

        Returns JSON-formatted output
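Once a search API hands back JSON, the text has to be decoded before any pattern matching can be applied to individual fields. The sketch below uses the core `JSON::PP` module on a hand-written sample string. The field names `objects` and `title` and the helper `bill_titles` are assumptions made for illustration, not confirmed GovTrack output.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hand-written sample; the structure and field names are assumptions,
# not a real API response.
my $result = '{"objects":[{"title":"Example Bill A"},{"title":"Example Bill B"}]}';

# Hypothetical helper: decode the JSON text and return each record's title.
sub bill_titles {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return map { $_->{title} } @{ $data->{objects} || [] };
}

print "$_\n" for bill_titles($result);
```

Each decoded field is then an ordinary Perl string, ready for regular expression matching.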
  • 12. #!/usr/bin/perl
        use LWP;
        use XML::LibXML;
        use strict;
        use warnings;

        my $ua=LWP::UserAgent->new();
        my $query='perl programming';
        my $url="http://blekko.com/ws/?q=$query+/rss";
        my $response=$ua->get($url);
        my $results=$response->content;
        die unless $response->is_success;
        # parses the RSS feed and collects its <item> records
        my $parser=XML::LibXML->new;
        my $domtree=$parser->parse_string($results);
        my @Records=$domtree->getElementsByTagName("item");
        my $i=0;
        # prints the link and description of each search result
        foreach(@Records){
            my $link=$Records[$i]->getChildrenByTagName("link");
            print "$i $link\n";
            my $description=$Records[$i]->getChildrenByTagName("description");
            print "$description\n\n";
            $i++;
        }
  • 13.
      ◦ Allows programmers to extract code samples pertaining to a set of keywords
      ◦ Recognizes the patterns associated with C/C++ functions and C/C++ control structures

        int myfunc ( ){
        //code here
        }

        while ( ) {
        //code here
        }
  • 14. use Text::Balanced qw(extract_codeblock);

        #delimiter used to distinguish code blocks for use with Text::Balanced
        $delim='{}';
        #regex used to match keywords/patterns that precede code blocks
        my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';
        foreach $link(@links){
            $response=$request->get("$link"); # gets Web page
            $results=$response->content;
            while($results=~s/<script.*?>.*?<\/script>//gsi){}; # filters out Javascript
            pos($results)=0;
            while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s){
                $code=$1 . extract_codeblock($results,$delim);
                print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
                print OFile "$code" . "\n" . "\n";
            }
        }
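The idea behind slides 13 and 14 (find a function signature or control keyword, then grab the brace-balanced block that follows) can be tried out on a local string with no network access. The sketch below swaps in a different technique from the slide: a recursive regular expression (Perl 5.10 or later) instead of `Text::Balanced`. The helper name `extract_blocks` and the simplified head pattern are assumptions for illustration. The sketch ignores braces inside strings or comments.

```perl
use strict;
use warnings;

# Keywords/patterns that precede a code block (simplified from the slide).
my $head = qr/(?:(?:int|long|double|float|void)\s+\w{1,25}|if|while|for)/;

# Hypothetical helper: return each "head ( ... ) { ... }" construct found
# in $text. The named group recurses into itself to balance nested braces.
sub extract_blocks {
    my ($text) = @_;
    my @found;
    while ($text =~ /\b($head\s*\([^()]*\)\s*)(?<block>\{(?:[^{}]++|(?&block))*+\})/g) {
        push @found, $1 . $+{block};
    }
    return @found;
}

# Sample input modeled on the slide's examples.
my $page = <<'TEXT';
Some prose around the code.
int myfunc ( ){
//code here
}
More prose.
while ( ) {
//code here
}
TEXT

print "$_\n\n" for extract_blocks($page);
```

Nested blocks are returned as part of their enclosing block, because matching resumes after the outer closing brace.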
  • 15.
      ◦ A common challenge to performing information extraction and text mining on many Web pages or parts of Web pages is that the content is served up by JavaScript
      ◦ This can be dealt with by putting the JavaScript that serves up the content through a JavaScript engine like V8
  • 16. <title>Contact XYZ inc</title>
        <H1>Contact XYZ inc</H1>
        <br>
        <p>For more information about XYZ inc, please contact us at the following Email address</p>
        <script type="text/javascript" language="javascript">
        <!--
        // Email obfuscator script 2.1 by Tim Williams, University of Arizona
        // Random encryption key feature by Andrew Moulden, Site Engineering Ltd
        // This code is freeware provided these four comment lines remain intact
        // A wizard to generate this code is at http://www.jottings.com/obfuscator/
        { coded = "OKUxkq@KwtoO2K.0ko"
          key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
          shift=coded.length
          link=""
          for (i=0; i<coded.length; i++) {
            if (key.indexOf(coded.charAt(i))==-1) {
              ltr = coded.charAt(i)
              link += (ltr)
            }
            else {
              ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
              link += (key.charAt(ltr))
            }
          }
        document.write("<a href=mailto:"+link+">"+link+"</a>")
        }
        //-->
        </script><noscript>Sorry, you need Javascript on to email me.</noscript>
  • 17. #!/usr/bin/perl
        use JavaScript::V8;
        use LWP;
        use Text::Balanced qw(extract_codeblock);
        use strict;
        use warnings;

        #delimiter used to distinguish code blocks for use with Text::Balanced
        my $delim='{}';
        #downloads Web page
        my $ua=LWP::UserAgent->new;
        my $response=$ua->get('http://localhost/email.html');
        my $result=$response->content;
        #print "$result\n\n";
        #extracts JavaScript
        my $js;
        if($result=~s/.*?http:\/\/www.jottings.com\/obfuscator\/\s*\{/{/s){
            $js=extract_codeblock($result,$delim);
        }
        #modified JS to make it processable by V8 module
        $js=~s/document.write/write/;
        $js=~s///g;
        #print "$js\n\n";
        #processes JS
        my $context = JavaScript::V8::Context->new();
        $context->bind_function(write => sub { print @_ });
        my $mail=$context->eval("$js");
        print "$mail\n\n";
  • 18.  cfrenz@gmail.com http://www.linkedin.com/in/christopherfrenz/