Christopher M. Frenz
 Information is being generated at a faster
rate than ever before
 The speed at which information can be
generated is continually increasing
 Continuous improvements in computers,
storage, and networking make much of this
information readily available to individuals
54,000 hits
 Most search engines use a keyword based
approach
 If a document contains all of the keywords
specified it is returned as a match
 Ranking algorithms (e.g. PageRank) are used
to put the most relevant results at the top of
the list and the least relevant at the bottom
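As a rough illustration of the keyword-based approach described above, the sketch below returns every document that contains all of the query keywords. The document names, their text, and the helper matches_all_keywords are hypothetical; a real search engine would work from an index and apply a ranking step rather than scan each document in turn.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains every keyword (case-insensitive).
sub matches_all_keywords {
    my ($text, @keywords) = @_;
    for my $kw (@keywords) {
        return 0 unless $text =~ /\b\Q$kw\E\b/i;
    }
    return 1;
}

# Toy document collection (illustrative data only).
my %docs = (
    doc1 => 'Perl makes regular expressions easy to use',
    doc2 => 'Regular expressions are available in many languages',
    doc3 => 'Perl is often used for text processing',
);

# Only doc1 contains both keywords, so only doc1 is printed.
my @query = ('perl', 'regular');
for my $name (sort keys %docs) {
    print "$name\n" if matches_all_keywords($docs{$name}, @query);
}
```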
 Not everything can be easily expressed as a
keyword
 Suppose you want to search for unknown
phone numbers? How can you do this with
keywords?
 How do we recognize a phone number when
we see one?
 We recognize a phone number by recognizing
the pattern of digits
◦ (XXX) XXX-XXXX
 While it is hard to express such a pattern in
the form of a keyword, it is really easy to
express it in the form of a regular expression
 (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})
#!/usr/bin/perl
use strict;
use warnings;

# sample text to search
(my $string=<<'LIST');
John (555) 555-5555 fits pattern
Bob 234 567-8901
Mary 734-234-9873
Tom 999 999-9999
Harry 111 111 1111 does not fit pattern
LIST

# prints every substring that fits the phone number pattern
while($string=~/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g){
    print "$1\n";
}
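To make the individual parts of the phone number pattern easier to follow, the sketch below writes the same expression with backslash escapes spelled out and Perl's /x modifier, which permits whitespace and comments inside the pattern. The helper find_phones and the sample text are hypothetical additions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The phone number pattern, one component per line (the /x modifier
# makes Perl ignore the layout whitespace and the comments).
my $phone = qr/
    (               # capture the whole number
      \s?           # optional leading whitespace
      \(?           # optional opening parenthesis
      \d{3}         # three-digit area code
      \)?           # optional closing parenthesis
      [-\s.]?       # optional separator: hyphen, whitespace, or dot
      \d{3}         # three-digit prefix
      [-.]          # required separator: hyphen or dot
      \d{4}         # four-digit line number
    )
/x;

# Hypothetical helper: return every phone-like substring found in $text.
sub find_phones {
    my ($text) = @_;
    my @found;
    while ($text =~ /$phone/g) {
        (my $number = $1) =~ s/^\s+//;   # trim the optional leading space
        push @found, $number;
    }
    return @found;
}

print "$_\n" for find_phones("Mary 734-234-9873, Harry 111 111 1111");
```

Only Mary's number is printed here, because Harry's number uses a space where the pattern requires a hyphen or a dot.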
 Conduct a broad keyword search using an
existing search engine
 Use your custom coded application to take
the returned search results and perform
regular expression based pattern matching
 The results that match your regular
expression are your refined search results
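The refinement step in this workflow can be sketched as a simple filter. In the example below the @results array stands in for whatever a search API returned; its URLs, its text, and the helper refine_results are hypothetical and exist only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the results whose text matches $pattern.
sub refine_results {
    my ($pattern, @results) = @_;
    return grep { $_->{text} =~ $pattern } @results;
}

# Stand-in for results returned by a broad keyword search (illustrative data).
my @results = (
    { url => 'http://example.com/a', text => 'Call our office at 555-123-4567 today' },
    { url => 'http://example.com/b', text => 'Our office is open Monday to Friday' },
);

# Refine the keyword results with a phone number pattern.
my $phone = qr/\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}/;
print "$_->{url}\n" for refine_results($phone, @results);
```

Keeping the pattern as a parameter means the same filter can be reused for any other pattern of interest.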
General and Specialized Search APIs
 Bing
 Yahoo BOSS
 Blekko
 Yandex
 Twitter
 Medicine – PubMed
 Physics – arXiv
 Government – GovTrack
 Finance – Yahoo Finance
 etc.
Seeking to Extract: DFN [A-Z]\d+
Script described in:
http://www.biomedcentral.com/1472-6947/7/32
#!/usr/bin/perl
use LWP;
use strict;
use warnings;

# sets query and congress session
my $query='fracking';
my $congress=112;

# queries the GovTrack API and prints the raw response
my $ua = LWP::UserAgent->new;
my $url="http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
my $response=$ua->get($url);
die $response->status_line unless $response->is_success;
my $result=$response->content;
print $result;

Returns JSON-formatted output
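Because the response is JSON, it can be decoded into Perl data structures before any pattern matching is applied. The sketch below uses the core JSON::PP module on a small hand-written sample. The field names (objects, title), the sample titles, and the helper bill_titles are assumptions made for illustration, so check the actual API response before relying on them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hand-written sample standing in for an API response (structure is assumed).
my $result = '{"objects":[{"title":"Sample bill about fracking"},'
           . '{"title":"Another sample bill"}]}';

# Hypothetical helper: decode the JSON text and return each object's title.
sub bill_titles {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return map { $_->{title} } @{ $data->{objects} || [] };
}

print "$_\n" for bill_titles($result);
```

Once the titles (or any other text fields) are plain Perl strings, they can be passed through a regular expression in the same way as any other search result.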
#!/usr/bin/perl

use LWP;
use XML::LibXML;
use strict;
use warnings;

# queries Blekko and requests the results as RSS
my $ua=LWP::UserAgent->new();
my $query='perl programming';
my $url="http://blekko.com/ws/?q=$query+/rss";
my $response=$ua->get($url);
die unless $response->is_success;
my $results=$response->content;

# parses the RSS and prints the link and description of each item
my $parser=XML::LibXML->new;
my $domtree=$parser->parse_string($results);
my @Records=$domtree->getElementsByTagName("item");
my $i=0;
foreach my $record (@Records){
    my $link=$record->getChildrenByTagName("link");
    print "$i $link\n";
    my $description=$record->getChildrenByTagName("description");
    print "$description\n\n";
    $i++;
}
 Allows programmers to extract code samples
pertaining to a set of keywords
 Recognizes the patterns associated with
C/C++ functions and C/C++ control
structures
int myfunc ( ){
//code here
}
while ( ) {
//code here
}
use Text::Balanced qw(extract_codeblock);

# excerpt: @links, $request (an LWP::UserAgent object), and the OFile
# filehandle are assumed to be set up earlier in the script

# delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';

# regex used to match keywords/patterns that precede code blocks
my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

foreach my $link (@links){
    my $response=$request->get("$link"); # gets Web page
    my $results=$response->content;
    $results=~s/<script.*?>.*?<\/script>//gsi; # filters out JavaScript
    pos($results)=0;
    while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s){
        my $code=$1 . extract_codeblock($results,$delim);
        print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
        print OFile "$code" . "\n" . "\n";
    }
}
 A common challenge to performing
information extraction and text mining on
many Web pages or parts of Web pages is
that the content is served up by JavaScript
 This can be dealt with by putting the
JavaScript that serves up the content through
a JavaScript Engine like V8
<title>Contact XYZ inc</title>
<H1>Contact XYZ inc</H1><br>
<p>For more information about XYZ inc, please contact us at the following Email address</p>
<script type="text/javascript" language="javascript">
<!--
// Email obfuscator script 2.1 by Tim Williams, University of Arizona
// Random encryption key feature by Andrew Moulden, Site Engineering Ltd
// This code is freeware provided these four comment lines remain intact
// A wizard to generate this code is at http://www.jottings.com/obfuscator/
{ coded = "OKUxkq@KwtoO2K.0ko"
key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
shift=coded.length
link=""
for (i=0; i<coded.length; i++) {
if (key.indexOf(coded.charAt(i))==-1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
}
//-->
</script><noscript>Sorry, you need Javascript on to email me.</noscript>
#!/usr/bin/perl
use JavaScript::V8;
use LWP;
use Text::Balanced qw(extract_codeblock);
use strict;
use warnings;

# delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';

# downloads Web page
my $ua=LWP::UserAgent->new;
my $response=$ua->get('http://localhost/email.html');
my $result=$response->content;
#print "$result\n\n";

# extracts JavaScript
my $js;
if($result=~s/.*?http:\/\/www\.jottings\.com\/obfuscator\/\s*\{/{/s){
    $js=extract_codeblock($result,$delim);
}

# modifies JS to make it processable by V8 module
$js=~s/document\.write/write/;
$js=~s/'/\'/g;
#print "$js\n\n";

# processes JS
my $context = JavaScript::V8::Context->new();
$context->bind_function(write => sub { print @_ });
my $mail=$context->eval("$js");
print "$mail\n\n";
 cfrenz@gmail.com
 http://www.linkedin.com/in/christopherfrenz/

 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 

Information Retrieval and Extraction

  • 2. Information is being generated at a faster rate than ever before. The speed at which information can be generated is continually increasing. Continuous improvements in computers, storage, and networking make much of this information readily available to individuals.
  • 4.
  • 5. Most search engines use a keyword-based approach. If a document contains all of the specified keywords, it is returned as a match. Ranking algorithms (e.g. PageRank) are used to put the most relevant results at the top of the list and the least relevant at the bottom.
  • 6. Not everything can be easily expressed as a keyword. Suppose you want to search for unknown phone numbers. How can you do this with keywords? How do we recognize a phone number when we see one?
  • 7.
  • 8. We recognize a phone number by recognizing the pattern of digits, e.g. (XXX) XXX-XXXX. While it is hard to express such a pattern in the form of a keyword, it is easy to express it in the form of a regular expression: (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})
  • 9.
      #!/usr/bin/perl
      use strict;
      use warnings;

      (my $string=<<'LIST');
      John (555) 555-5555 fits pattern
      Bob 234 567-8901
      Mary 734-234-9873
      Tom 999 999-9999
      Harry 111 111 1111 does not fit pattern
      LIST

      while($string=~/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g){
        print "$1\n";
      }
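To make the pattern easier to read, it can be written with Perl's `/x` modifier so that each part carries a comment. This is only a sketch: the helper name `find_phone_numbers` and the trimming of leading whitespace are illustrative additions and are not on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The slide's pattern, spelled out with the x modifier so each part is commented.
my $phone_re = qr/
    (               # capture the whole number
      \s?           # optional leading whitespace
      \(?\d{3}\)?   # area code, parentheses optional
      [-\s.]?       # optional separator: hyphen, whitespace, or dot
      \d{3}         # next three digits
      [-.]          # required separator: hyphen or dot
      \d{4}         # last four digits
    )
/x;

# Hypothetical helper: returns every phone-like substring found in $text.
sub find_phone_numbers {
    my ($text) = @_;
    my @found;
    while ($text =~ /$phone_re/g) {
        (my $number = $1) =~ s/^\s+//;   # drop the optional leading whitespace
        push @found, $number;
    }
    return @found;
}

my $sample = "John (555) 555-5555\nHarry 111 111 1111\nMary 734-234-9873\n";
print "$_\n" for find_phone_numbers($sample);
```

The line for Harry is skipped because the pattern requires a hyphen or dot before the last four digits.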
  • 10. Conduct a broad keyword search using an existing search engine. Use your custom-coded application to take the returned search results and perform regular-expression-based pattern matching. The results that match your regular expression are your refined search results.
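The refinement step can be sketched as a small filter over whatever the keyword search returned. The helper name `refine_results`, the `url`/`text` hash layout, and the sample results below are all made up for illustration; a real application would fill the list from a search API.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keeps only the search results whose text matches
# $pattern and records what was matched in each one.
# Assumes $pattern contains no capture groups of its own.
sub refine_results {
    my ($pattern, @results) = @_;
    my @refined;
    for my $result (@results) {
        my @matches = $result->{text} =~ /($pattern)/g;
        push @refined, { url => $result->{url}, matches => \@matches } if @matches;
    }
    return @refined;
}

# Stand-in for the output of a broad keyword search (made-up data).
my @search_results = (
    { url => 'http://example.com/a', text => 'Call the office at 734-234-9873.' },
    { url => 'http://example.com/b', text => 'No contact details on this page.' },
);

my $phone = qr/\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}/;
for my $hit (refine_results($phone, @search_results)) {
    print "$hit->{url}: @{ $hit->{matches} }\n";
}
```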
  • 11. General Search APIs: Bing, Yahoo BOSS, Blekko, Yandex, Twitter. Specialized Search APIs: Medicine – PubMed; Physics – arXiv; Government – GovTrack; Finance – Yahoo Finance; etc.
  • 12. Seeking to Extract: DFN [A-Z]\d+ Script described in: http://www.biomedcentral.com/1472-6947/7/32
  • 13.
      #!/usr/bin/perl
      use LWP;
      use strict;
      use warnings;

      #sets query and congress session
      my $query='fracking';
      my $congress=112;

      my $ua = LWP::UserAgent->new;
      my $url="http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
      my $response=$ua->get($url);
      my $result=$response->content;
      print $result;

    Returns JSON-formatted output.
  • 14.
      #!/usr/bin/perl

      use LWP;
      use XML::LibXML;
      use strict;
      use warnings;

      my $ua=LWP::UserAgent->new();
      my $query='perl programming';
      my $url="http://blekko.com/ws/?q=$query+/rss";
      my $response=$ua->get($url);
      my $results=$response->content;
      die unless $response->is_success;

      my $parser=XML::LibXML->new;
      my $domtree=$parser->parse_string($results);
      my @Records=$domtree->getElementsByTagName("item");
      my $i=0;
      foreach(@Records){
        my $link=$Records[$i]->getChildrenByTagName("link");
        print "$i $link\n";
        my $description=$Records[$i]->getChildrenByTagName("description");
        print "$description\n\n";
        $i++;
      }
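Once the JSON has been downloaded, it can be decoded and filtered with a regular expression. The sketch below uses the core JSON::PP module on a hard-coded string instead of a live request. The `objects` and `title` field names are assumptions about the response layout, the sample titles are made up, and `matching_titles` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: decodes a JSON response and returns the titles
# that match $pattern. Assumes a top-level "objects" array whose
# entries each carry a "title" field.
sub matching_titles {
    my ($json_text, $pattern) = @_;
    my $data   = decode_json($json_text);
    my @titles = map { $_->{title} } @{ $data->{objects} || [] };
    return grep { defined($_) && /$pattern/ } @titles;
}

# Made-up sample in the assumed layout, standing in for a real response.
my $sample_json = '{"objects":[{"title":"H.R. 1: Example Act"},'
                . '{"title":"S. 2: Another Example Act"},'
                . '{"title":"A resolution with no bill number"}]}';

# Keep only titles that start with a House or Senate bill number.
print "$_\n" for matching_titles($sample_json, qr/^(?:H\.R\.|S\.)\s?\d+/);
```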
  • 15.
  • 16. Allows programmers to extract code samples pertaining to a set of keywords. Recognizes the patterns associated with C/C++ functions and C/C++ control structures:
      int myfunc ( ){
        //code here
      }
      while ( ) {
        //code here
      }
  • 17.
      use Text::Balanced qw(extract_codeblock);

      #delimiter used to distinguish code blocks for use with Text::Balanced
      $delim='{}';

      #regex used to match keywords/patterns that precede code blocks
      my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

      foreach $link(@links){
        $response=$request->get("$link"); # gets Web page
        $results=$response->content;
        while($results=~s/<script.*?>.*?<\/script>//gsi){}; # filters out Javascript
        pos($results)=0;
        while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s){
          $code=$1 . extract_codeblock($results,$delim);
          print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
          print OFile "$code" . "\n" . "\n";
        }
      }
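The extraction loop can be tried without any network access by running it over a string. This is a minimal sketch of the same idea: it reuses the slide's header regex and Text::Balanced's `extract_codeblock`, but the helper name `extract_c_blocks` and the sample text are illustrative. Note that `extract_codeblock` parses with Perl-like rules, so C code containing preprocessor lines or certain quoting may not extract cleanly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Keywords or patterns that precede a code block, as on the slide.
my $regex = '(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

# Hypothetical helper: returns each "header { ... }" block found in $text.
sub extract_c_blocks {
    my ($text) = @_;
    my @blocks;
    # Cut everything up to and including the header so that the text
    # starts at the opening brace; the header itself is captured.
    while ($text =~ s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s) {
        my $header = $1;
        my ($body, $remainder) = extract_codeblock($text, '{}');
        last unless defined $body && length $body;   # braces did not balance
        push @blocks, $header . $body;
        $text = $remainder;                          # continue after the block
    }
    return @blocks;
}

my $page = <<'TEXT';
Example:
int add(int a, int b) {
    if (a > 0) {
        return a + b;
    }
    return b;
}
Done.
TEXT

print "$_\n---\n" for extract_c_blocks($page);
```

Because the whole balanced block is removed before the next search, the nested `if` is returned as part of the function rather than as a separate hit.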
  • 18.
  • 19. A common challenge to performing information extraction and text mining on many Web pages, or parts of Web pages, is that the content is served up by JavaScript. This can be dealt with by putting the JavaScript that serves up the content through a JavaScript engine like V8.
  • 20.
      <title>Contact XYZ inc</title>
      <H1>Contact XYZ inc</H1><br>
      <p>For more information about XYZ inc, please contact us at the following Email address</p>
      <script type="text/javascript" language="javascript">
      <!--
      // Email obfuscator script 2.1 by Tim Williams, University of Arizona
      // Random encryption key feature by Andrew Moulden, Site Engineering Ltd
      // This code is freeware provided these four comment lines remain intact
      // A wizard to generate this code is at http://www.jottings.com/obfuscator/
      { coded = "OKUxkq@KwtoO2K.0ko"
        key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
        shift=coded.length
        link=""
        for (i=0; i<coded.length; i++) {
          if (key.indexOf(coded.charAt(i))==-1) {
            ltr = coded.charAt(i)
            link += (ltr)
          }
          else {
            ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
            link += (key.charAt(ltr))
          }
        }
      document.write("<a href='mailto:"+link+"'>"+link+"</a>")
      }
      //-->
      </script><noscript>Sorry, you need Javascript on to email me.</noscript>
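To see what the obfuscator script computes, its decoding loop can be ported to Perl directly. This is a hand-written port for illustration, not the approach taken in this deck, which runs the original JavaScript through an engine; `decode_address` is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Perl port of the decoding loop in the obfuscator script: every
# character found in the key is shifted back by the length of the
# coded string, and characters outside the key pass through unchanged.
sub decode_address {
    my ($coded, $key) = @_;
    my $shift   = length $coded;
    my $key_len = length $key;
    my $plain   = '';
    for my $char (split //, $coded) {
        my $pos = index($key, $char);
        if ($pos == -1) {
            $plain .= $char;
        }
        else {
            my $shifted = ($pos - $shift + $key_len) % $key_len;
            $plain .= substr($key, $shifted, 1);
        }
    }
    return $plain;
}

my $coded = 'OKUxkq@KwtoO2K.0ko';
my $key   = 'l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn';
print decode_address($coded, $key), "\n";   # prints person@example.com
```

A port like this only works when the obfuscation scheme is known in advance; running the page's own JavaScript avoids having to reimplement each scheme.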
  • 21.
  • 22.
      #!/usr/bin/perl
      use JavaScript::V8;
      use LWP;
      use Text::Balanced qw(extract_codeblock);
      use strict;
      use warnings;

      #delimiter used to distinguish code blocks for use with Text::Balanced
      my $delim='{}';

      #downloads Web page
      my $ua=LWP::UserAgent->new;
      my $response=$ua->get('http://localhost/email.html');
      my $result=$response->content;
      #print "$result\n\n";

      #extracts JavaScript
      my $js;
      if($result=~s/.*?http:\/\/www.jottings.com\/obfuscator\/\s*\{/{/s){
        $js=extract_codeblock($result,$delim);
      }

      #modified JS to make it processable by V8 module
      $js=~s/document.write/write/;
      $js=~s/'/\'/g;
      #print "$js\n\n";

      #processes JS
      my $context = JavaScript::V8::Context->new();
      $context->bind_function(write => sub { print @_ });
      my $mail=$context->eval("$js");
      print "$mail\n\n";
  • 23.