Information Retrieval and Extraction

Slides from a talk given at a meeting of the NY Perl Mongers on 5/21/13.

Published in: Technology, Design
  1. Christopher M. Frenz
  2. Information is being generated at a faster rate than ever before. The speed at which information can be generated is continually increasing. Continuous improvements in computers, storage, and networking make much of this information readily available to individuals.
  3. 54,000 hits
  4. Most search engines use a keyword-based approach. If a document contains all of the specified keywords, it is returned as a match. Ranking algorithms (e.g., PageRank) are used to put the most relevant results at the top of the list and the least relevant at the bottom.
  5. Not everything can be easily expressed as a keyword. Suppose you want to search for unknown phone numbers. How can you do this with keywords? How do we recognize a phone number when we see one?
  6. We recognize a phone number by recognizing the pattern of digits: (XXX) XXX-XXXX. While it is hard to express such a pattern in the form of a keyword, it is easy to express it in the form of a regular expression: (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})
  7. #!/usr/bin/perl
     use strict;
     use warnings;

     (my $string=<<LIST);
     John (555) 555-5555 fits pattern
     Bob 234 567-8901
     Mary 734-234-9873
     Tom 999 999-9999
     Harry 111 111 1111 does not fit pattern
     LIST

     # prints every substring that matches the phone number pattern
     while($string=~/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g){
         print "$1\n";
     }
  8. Conduct a broad keyword search using an existing search engine. Use your custom-coded application to take the returned search results and perform regular-expression-based pattern matching. The results that match your regular expression are your refined search results.
  9. General Search APIs: Bing, Yahoo BOSS, Blekko, Yandex, Twitter. Specialized Search APIs: Medicine (PubMed), Physics (arXiv), Government (GovTrack), Finance (Yahoo Finance), etc.
  10. Seeking to extract: DFN [A-Z]\d+ (script described in: http://www.biomedcentral.com/1472-6947/7/32)
  11. #!/usr/bin/perl
      use LWP;
      use strict;
      use warnings;

      #sets query and congress session
      my $query="fracking";
      my $congress=112;

      my $ua = LWP::UserAgent->new;
      my $url="http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
      my $response=$ua->get($url);
      my $result=$response->content;
      print $result;

      Returns JSON-formatted output.
  12. #!/usr/bin/perl
      use LWP;
      use XML::LibXML;
      use strict;
      use warnings;

      my $ua=LWP::UserAgent->new();
      my $query="perl programming";
      my $url="http://blekko.com/ws/?q=$query+/rss";
      my $response=$ua->get($url);
      my $results=$response->content;
      die unless $response->is_success;

      # parses the RSS feed and prints the link and description of each item
      my $parser=XML::LibXML->new;
      my $domtree=$parser->parse_string($results);
      my @Records=$domtree->getElementsByTagName("item");
      my $i=0;
      foreach(@Records){
          my $link=$Records[$i]->getChildrenByTagName("link");
          print "$i $link\n";
          my $description=$Records[$i]->getChildrenByTagName("description");
          print "$description\n\n";
          $i++;
      }
  13. Allows programmers to extract code samples pertaining to a set of keywords. Recognizes the patterns associated with C/C++ functions and C/C++ control structures:
      int myfunc ( ){
          //code here
      }
      while ( ) {
          //code here
      }
  14. use Text::Balanced qw(extract_codeblock);

      #delimiter used to distinguish code blocks for use with Text::Balanced
      $delim='{}';

      #regex used to match keywords/patterns that precede code blocks
      my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

      foreach $link(@links){
          $response=$request->get("$link"); # gets Web page
          $results=$response->content;
          while($results=~s/<script.*?>.*?<\/script>//gsi){}; # filters out Javascript
          pos($results)=0;
          while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s){
              $code=$1 . extract_codeblock($results,$delim);
              print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
              print OFile "$code" . "\n" . "\n";
          }
      }
  15. A common challenge in performing information extraction and text mining on many Web pages, or parts of Web pages, is that the content is served up by JavaScript. This can be dealt with by running the JavaScript that serves up the content through a JavaScript engine like V8.
  16. <title>Contact XYZ inc</title>
      <H1>Contact XYZ inc</H1>
      <br>
      <p>For more information about XYZ inc, please contact us at the following Email address</p>
      <script type="text/javascript" language="javascript">
      <!--
      // Email obfuscator script 2.1 by Tim Williams, University of Arizona
      // Random encryption key feature by Andrew Moulden, Site Engineering Ltd
      // This code is freeware provided these four comment lines remain intact
      // A wizard to generate this code is at http://www.jottings.com/obfuscator/
      { coded = "OKUxkq@KwtoO2K.0ko"
        key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
        shift=coded.length
        link=""
        for (i=0; i<coded.length; i++) {
          if (key.indexOf(coded.charAt(i))==-1) {
            ltr = coded.charAt(i)
            link += (ltr)
          }
          else {
            ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
            link += (key.charAt(ltr))
          }
        }
        document.write("<a href=mailto:"+link+">"+link+"</a>")
      }
      //-->
      </script><noscript>Sorry, you need Javascript on to email me.</noscript>
  17. #!/usr/bin/perl
      use JavaScript::V8;
      use LWP;
      use Text::Balanced qw(extract_codeblock);
      use strict;
      use warnings;

      #delimiter used to distinguish code blocks for use with Text::Balanced
      my $delim='{}';

      #downloads Web page
      my $ua=LWP::UserAgent->new;
      my $response=$ua->get("http://localhost/email.html");
      my $result=$response->content;
      #print "$result\n\n";

      #extracts JavaScript
      my $js;
      if($result=~s/.*?http:\/\/www.jottings.com\/obfuscator\/\s*\{/{/s){
          $js=extract_codeblock($result,$delim);
      }

      #modified JS to make it processable by V8 module
      $js=~s/document.write/write/;
      $js=~s///g;
      #print "$js\n\n";

      #processes JS
      my $context = JavaScript::V8::Context->new();
      $context->bind_function(write => sub { print @_ });
      my $mail=$context->eval("$js");
      print "$mail\n\n";
  18. cfrenz@gmail.com, http://www.linkedin.com/in/christopherfrenz/
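
The two-step approach from the slides (run a broad keyword search, then keep only the results that match a regular expression) can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: the `@results` records are hypothetical hard-coded stand-ins for whatever a search API returns, `refine_results` is a helper name invented here, and the pattern is the phone number expression from the slides with its escapes written out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Phone number pattern from the slides, with the escapes written out.
my $phone_re = qr/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/;

# Hypothetical stand-in for records returned by a search API:
# each record carries a link and a description.
my @results = (
    { link => 'http://example.com/a', description => 'Call John at (555) 555-5555 today' },
    { link => 'http://example.com/b', description => 'No contact details listed here' },
    { link => 'http://example.com/c', description => 'Mary: 734-234-9873, Tom: 999 999-9999' },
);

# Returns one [link, match] pair for every pattern match found
# in the descriptions of the supplied records.
sub refine_results {
    my ($regex, @records) = @_;
    my @hits;
    foreach my $record (@records) {
        while ($record->{description} =~ /$regex/g) {
            (my $match = $1) =~ s/^\s+//;   # drop the optional leading space
            push @hits, [ $record->{link}, $match ];
        }
    }
    return @hits;
}

# Prints the refined results: link, tab, matched text.
foreach my $hit (refine_results($phone_re, @results)) {
    print "$hit->[0]\t$hit->[1]\n";
}
```

With the sample data above, the record without a phone number is dropped and the other two yield three link/number pairs. In a real application the hard-coded records would presumably be replaced by the parsed output of one of the search API calls shown in the slides.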
