Information Retrieval and Extraction
Slides from a talk given at a meeting of the NY Perl Mongers on 5/21/13.

Published in: Technology, Design

  1. Christopher M. Frenz
  2. • Information is being generated at a faster rate than ever before
     • The speed at which information can be generated is continually increasing
     • Continuous improvements in computers, storage, and networking make much of this information readily available to individuals
  3. 54,000 hits
  4. • Most search engines use a keyword-based approach
     • If a document contains all of the specified keywords, it is returned as a match
     • Ranking algorithms (e.g. PageRank) are used to put the most relevant results at the top of the list and the least relevant at the bottom
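
The keyword-matching rule on slide 4 can be sketched in a few lines of Perl. This is only an illustration of the "contains all keywords" test, not how any particular search engine is implemented; the subroutine name `matches_all_keywords` and the sample documents are invented for the example, and ranking is left out entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 when $text contains every keyword as a whole word
# (case-insensitive), and 0 as soon as one keyword is missing.
sub matches_all_keywords {
    my ($text, @keywords) = @_;
    for my $keyword (@keywords) {
        return 0 unless $text =~ /\b\Q$keyword\E\b/i;
    }
    return 1;
}

# Hypothetical documents, invented for illustration only.
my %documents = (
    doc1 => 'Perl makes regular expression based text mining straightforward.',
    doc2 => 'Regular exercise is good for you.',
    doc3 => 'Text mining in Perl often starts with a regular expression.',
);

my @keywords = ('perl', 'mining');
for my $name (sort keys %documents) {
    print "$name matches\n"
        if matches_all_keywords($documents{$name}, @keywords);
}
```

The `\Q...\E` quoting keeps any regex metacharacters in a keyword from being interpreted as part of the pattern.
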
  5. • Not everything can be easily expressed as a keyword
     • Suppose you want to search for unknown phone numbers. How can you do this with keywords?
     • How do we recognize a phone number when we see one?
  6. • We recognize a phone number by recognizing the pattern of digits
       ◦ (XXX) XXX-XXXX
     • While it is hard to express such a pattern in the form of a keyword, it is really easy to express it in the form of a regular expression

      (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})

  7. Script that applies the pattern to a list of sample lines:

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Sample data; the LIST terminator must start in column 0 in the saved script
      (my $string=<<LIST);
      John (555) 555-5555 fits pattern
      Bob 234 567-8901
      Mary 734-234-9873
      Tom 999 999-9999
      Harry 111 111 1111 does not fit pattern
      LIST

      # Prints every phone number found in the sample data
      while($string=~/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g){
         print "$1\n";
      }

  8. • Conduct a broad keyword search using an existing search engine
     • Use your custom-coded application to take the returned search results and perform regular expression based pattern matching
     • The results that match your regular expression are your refined search results
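
The refinement step on slide 8 can be sketched as a filter over search results. This is a minimal sketch under stated assumptions: the results are hard-coded hypothetical records with `url` and `text` fields standing in for whatever a search API returns, and `refine_results` is a name invented here. The pattern is the phone-number regular expression from the earlier slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Phone-number pattern from the earlier slides, compiled once.
my $phone = qr/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/;

# Keeps only the results whose text matches $pattern and returns
# one hash per hit holding the URL and the first matched string.
sub refine_results {
    my ($pattern, @results) = @_;
    my @refined;
    for my $result (@results) {
        if ($result->{text} =~ $pattern) {
            (my $match = $1) =~ s/^\s+//;   # drop the optional leading whitespace
            push @refined, { url => $result->{url}, match => $match };
        }
    }
    return @refined;
}

# Hypothetical search results, invented for illustration only.
my @results = (
    { url => 'http://example.com/contact', text => 'Call us at (555) 555-5555 today' },
    { url => 'http://example.com/about',   text => 'We have been in business since 1999' },
    { url => 'http://example.com/staff',   text => 'Front desk: 234 567-8901' },
);

for my $hit (refine_results($phone, @results)) {
    print "$hit->{url}: $hit->{match}\n";
}
```

Passing the pattern as a parameter keeps the filter reusable: swapping in a different compiled regular expression changes what counts as a refined result without touching the loop.
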
  9. General Search APIs
     • Bing
     • Yahoo BOSS
     • Blekko
     • Yandex
     • Twitter

     Specialized Search APIs
     • Medicine – PubMed
     • Physics – arXiv
     • Government – GovTrack
     • Finance – Yahoo Finance
     • etc.
  10. Seeking to Extract: DFN [A-Z]\d+
      Script described in: http://www.biomedcentral.com/1472-6947/7/32
  11. Querying the GovTrack API (returns JSON formatted output):

      #!/usr/bin/perl
      use LWP;
      use strict;
      use warnings;

      #sets query and congress session
      my $query="fracking";
      my $congress=112;

      #requests the matching bills and prints the raw response
      my $ua = LWP::UserAgent->new;
      my $url="http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
      my $response=$ua->get($url);
      my $result=$response->content;
      print $result;
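
Once a JSON response such as the one from slide 11 has been retrieved, it has to be decoded before any pattern matching can be applied to individual fields. The sketch below uses the core `JSON::PP` module on a hard-coded sample so that it runs without network access. The sample body and its field names (`objects`, `title`, `introduced_date`) are assumptions made for illustration and are not taken from the GovTrack documentation; `bill_titles` is likewise a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical response body: the structure and field names are
# assumptions for illustration, not the documented API schema.
my $result = <<'JSON';
{"objects": [
  {"title": "Example bill about fracking", "introduced_date": "2011-03-15"},
  {"title": "Another example bill", "introduced_date": "2012-06-01"}
]}
JSON

# Decodes the JSON text and returns the title of every listed object.
sub bill_titles {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return map { $_->{title} } @{ $data->{objects} || [] };
}

print "$_\n" for bill_titles($result);
```

In a real script, `$result` would be the content of the HTTP response, and the decoded fields could then be passed through a regular expression in the same way as any other search result.
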
  12. Querying Blekko and parsing the RSS results:

      #!/usr/bin/perl
      use LWP;
      use XML::LibXML;
      use strict;
      use warnings;

      #runs the search and stops if the request failed
      my $ua=LWP::UserAgent->new();
      my $query="perl programming";
      my $url="http://blekko.com/ws/?q=$query+/rss";
      my $response=$ua->get($url);
      die unless $response->is_success;
      my $results=$response->content;

      #parses the RSS and prints the link and description of each item
      my $parser=XML::LibXML->new;
      my $domtree=$parser->parse_string($results);
      my @Records=$domtree->getElementsByTagName("item");
      my $i=0;
      foreach(@Records){
         my $link=$Records[$i]->getChildrenByTagName("link");
         print "$i $link\n";
         my $description=$Records[$i]->getChildrenByTagName("description");
         print "$description\n\n";
         $i++;
      }

  13. • Allows programmers to extract code samples pertaining to a set of keywords
      • Recognizes the patterns associated with C/C++ functions and C/C++ control structures

      int myfunc ( ){
         //code here
      }
      while ( ) {
         //code here
      }

  14. Excerpt of the extraction loop (@links, $request, and the OFile handle are set up elsewhere in the script):

      use Text::Balanced qw(extract_codeblock);

      #delimiter used to distinguish code blocks for use with Text::Balanced
      $delim='{}';
      #regex used to match keywords/patterns that precede code blocks
      my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

      foreach $link(@links){
         $response=$request->get("$link"); # gets Web page
         $results=$response->content;
         while($results=~s/<script.*?>.*?<\/script>//gsi){}; # filters out Javascript
         pos($results)=0;
         #trims the text up to each matched header, then extracts the balanced block
         while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s){
            $code=$1 . extract_codeblock($results,$delim);
            print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
            print OFile "$code" . "\n" . "\n";
         }
      }
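
The technique on slide 14 can be tried on a plain string instead of downloaded Web pages. The sketch below is a simplified, self-contained variant rather than the author's script: `extract_c_blocks`, the header pattern (which requires whitespace between the type and the function name), and the sample text are assumptions made for illustration. It relies on the documented behavior of `extract_codeblock` in scalar context, which returns the balanced block and removes it from the input string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Returns each C-style function or control structure found in $text:
# the matched header followed by its balanced { ... } body.
sub extract_c_blocks {
    my ($text) = @_;
    my $header = qr/(?:(?:int|long|double|float|void)\s+\w{1,25}|if|while|for)/;
    my @blocks;
    # Discard everything before the next header so the text starts at "{".
    while ($text =~ s/.*?($header\s*\(.*?\)\s*)\{/{/s) {
        my $head = $1;
        # Scalar context: returns the block and removes it from $text.
        my $body = extract_codeblock($text, '{}');
        last unless defined $body;
        push @blocks, $head . $body;
    }
    return @blocks;
}

# Hypothetical page text, invented for illustration only.
my $page = <<'TEXT';
Some prose about adding numbers.
int add(int a, int b) { return a + b; }
More prose that should be ignored.
TEXT

print "$_\n" for extract_c_blocks($page);
```

The regular expression only locates where a block starts; matching the closing brace is left to Text::Balanced because nested braces cannot be balanced reliably with a simple pattern.
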
  15. • A common challenge when performing information extraction and text mining on many Web pages or parts of Web pages is that the content is served up by JavaScript
      • This can be dealt with by putting the JavaScript that serves up the content through a JavaScript engine like V8
  16. Example page whose email address is written out by JavaScript:

      <title>Contact XYZ inc</title>
      <H1>Contact XYZ inc</H1>
      <br>
      <p>For more information about XYZ inc, please contact us at the following Email address</p>
      <script type="text/javascript" language="javascript">
      <!--
      // Email obfuscator script 2.1 by Tim Williams, University of Arizona
      // Random encryption key feature by Andrew Moulden, Site Engineering Ltd
      // This code is freeware provided these four comment lines remain intact
      // A wizard to generate this code is at http://www.jottings.com/obfuscator/
      { coded = "OKUxkq@KwtoO2K.0ko"
        key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
        shift=coded.length
        link=""
        for (i=0; i<coded.length; i++) {
          if (key.indexOf(coded.charAt(i))==-1) {
            ltr = coded.charAt(i)
            link += (ltr)
          }
          else {
            ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
            link += (key.charAt(ltr))
          }
        }
      document.write("<a href=mailto:"+link+">"+link+"</a>")
      }
      //-->
      </script><noscript>Sorry, you need Javascript on to email me.</noscript>

  17. Script that extracts the JavaScript from the page and runs it through V8:

      #!/usr/bin/perl
      use JavaScript::V8;
      use LWP;
      use Text::Balanced qw(extract_codeblock);
      use strict;
      use warnings;

      #delimiter used to distinguish code blocks for use with Text::Balanced
      my $delim='{}';

      #downloads Web page
      my $ua=LWP::UserAgent->new;
      my $response=$ua->get('http://localhost/email.html');
      my $result=$response->content;
      #print "$result\n\n";

      #extracts JavaScript
      my $js;
      if($result=~s/.*?http:\/\/www\.jottings\.com\/obfuscator\/\s*\{/{/s){
         $js=extract_codeblock($result,$delim);
      }

      #modified JS to make it processable by V8 module
      $js=~s/document\.write/write/;
      $js=~s///g;
      #print "$js\n\n";

      #processes JS
      my $context = JavaScript::V8::Context->new();
      $context->bind_function(write => sub { print @_ });
      my $mail=$context->eval("$js");
      print "$mail\n\n";

  18. cfrenz@gmail.com http://www.linkedin.com/in/christopherfrenz/
