The document discusses building a basic search engine in Ruby by creating an inverted index from a set of documents and demonstrating basic querying of the index. It covers splitting documents into words, building a hash mapping each unique word to the documents it appears in, and performing set intersections on word indexes to return document matches for queries. It also raises questions about additional features needed for a more complete search engine implementation.
Building A Mini Google High Performance Computing In Ruby
1. Building Mini-Google in Ruby
Ilya Grigorik
@igrigorik
Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
2. postrank.com/topic/ruby
The slides… Twitter My blog
3. Ruby + Math
PageRank
Optimization
Examples Indexing
Misc Fun
4. PageRank PageRank + Ruby
Tools
+ Examples Indexing
Optimization
5. Consume with care…
everything that follows is based on released / public domain info
6. Search-engine graveyard
Google did pretty well…
7. Query: Ruby
Results
1. Crawl 2. Index 3. Rank
Search pipeline
50,000-foot view
8. Query: Ruby
Results
1. Crawl (bah) 2. Index (interesting) 3. Rank (fun)
9. circa 1997-1998:
CPU Speed: 333 MHz
RAM: 32-64 MB
Index: 27,000,000 documents
Index refresh: once a month~ish
PageRank computation: several days
For comparison:
Laptop CPU: 2.1 GHz
VM RAM: 1 GB
1-million-page web: ~10 minutes
10. Creating & Maintaining an Inverted Index
DIY and the gotchas within
11. Building an Inverted Index
require 'set'

pages = {
  "1" => "it is what it is",
  "2" => "what is it",
  "3" => "it is a banana"
}

index = {}

pages.each do |page, content|
  # split the content on whitespace
  content.split(/\s/).each do |word|
    if index[word]
      index[word] << page
    else
      # Set.new takes an enumerable, so wrap the page id in an array
      index[word] = Set.new([page])
    end
  end
end

# resulting index:
# {
#   "it"     => #<Set: {"1", "2", "3"}>,
#   "a"      => #<Set: {"3"}>,
#   "banana" => #<Set: {"3"}>,
#   "what"   => #<Set: {"1", "2"}>,
#   "is"     => #<Set: {"1", "2", "3"}>
# }
13. Building an Inverted Index
Word => [Document]
14. Querying the index
# query: "what is banana"
p index["what"] & index["is"] & index["banana"]
# > #<Set: {}>

# query: "a banana"
p index["a"] & index["banana"]
# > #<Set: {"3"}>

# query: "what is"
p index["what"] & index["is"]
# > #<Set: {"1", "2"}>

# the index being queried:
# {
#   "it"     => #<Set: {"1", "2", "3"}>,
#   "a"      => #<Set: {"3"}>,
#   "banana" => #<Set: {"3"}>,
#   "what"   => #<Set: {"1", "2"}>,
#   "is"     => #<Set: {"1", "2", "3"}>
# }
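Each query above follows the same pattern: split the query into words, look up the set of pages for each word, and intersect the sets. The sketch below wraps that pattern in a helper. The `search` method, the `SAMPLE_INDEX` constant, and the choice to treat an unknown word as an empty set are assumptions for illustration, not code from the talk.

```ruby
require 'set'

# The index from the slide, written out by hand.
SAMPLE_INDEX = {
  "it"     => Set["1", "2", "3"],
  "a"      => Set["3"],
  "banana" => Set["3"],
  "what"   => Set["1", "2"],
  "is"     => Set["1", "2", "3"]
}

# Hypothetical helper: return the set of pages that contain every query word.
def search(index, query)
  words = query.split
  return Set.new if words.empty?

  # A word that is not in the index matches no page, so use an empty set for it.
  words.map { |word| index.fetch(word, Set.new) }.reduce(:&)
end

["what is", "a banana", "what is banana", "what is pear"].each do |query|
  puts "#{query}: #{search(SAMPLE_INDEX, query).to_a.join(', ')}"
end
```

A set intersection says which pages match, but nothing about the order in which to show them.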
17. Querying the index
# query: "what is"
p index["what"] & index["is"]
# > #<Set: {"1", "2"}>
What order? [1, 2] or [2, 1]
18. Building an Inverted Index
require 'set'

pages = {
  "1" => "it is what it is",
  "2" => "what is it",
  "3" => "it is a banana"
}

index = {}

pages.each do |page, content|
  content.split(/\s/).each do |word|
    if index[word]
      index[word] << page
    else
      index[word] = Set.new([page])
    end
  end
end

Hmmm?
PDF, HTML, RSS?
Lowercase / Upcase?
Compact Index?
Stop words?
Persistence?
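Two of the questions above, case and stop words, can be handled while tokenizing. The following is a minimal sketch under stated assumptions: the `build_index` helper, the tiny `STOP_WORDS` list, and the `[a-z0-9]+` token pattern are all made up for illustration and are not from the talk.

```ruby
require 'set'

# Assumed stop-word list, for illustration only.
STOP_WORDS = Set["a", "is", "it", "the"]

# Hypothetical helper: build word => Set of page ids,
# lowercasing every word and skipping stop words.
def build_index(pages, stop_words = STOP_WORDS)
  index = Hash.new { |hash, word| hash[word] = Set.new }

  pages.each do |page, content|
    # downcase first, then pull out runs of letters and digits (drops punctuation)
    content.downcase.scan(/[a-z0-9]+/).each do |word|
      next if stop_words.include?(word)
      index[word] << page
    end
  end

  index
end

sample_pages = {
  "1" => "It is what it is",
  "2" => "What is it?",
  "3" => "It is a banana"
}

build_index(sample_pages).each do |word, page_ids|
  puts "#{word}: #{page_ids.to_a.join(', ')}"
end
```

With this stop-word list only "what" and "banana" survive, which shows both the space saved and the cost: a query made up entirely of stop words can no longer match anything.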
20. Ferret is a high-performance, full-featured text search engine library written for Ruby
21. require 'ferret'
include Ferret

index = Index::Index.new()

index << {:title => "1", :content => "it is what it is"}
index << {:title => "2", :content => "what is it"}
index << {:title => "3", :content => "it is a banana"}

index.search_each('content:"banana"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} "
end
> Score: 1.0, 3
22. > Score: 1.0, 3
Hmmm?
23. Ferret::Analysis classes:
class Ferret::Analysis::Analyzer
class Ferret::Analysis::AsciiLetterAnalyzer
class Ferret::Analysis::AsciiLetterTokenizer
class Ferret::Analysis::AsciiLowerCaseFilter
class Ferret::Analysis::AsciiStandardAnalyzer
class Ferret::Analysis::AsciiStandardTokenizer
class Ferret::Analysis::AsciiWhiteSpaceAnalyzer
class Ferret::Analysis::AsciiWhiteSpaceTokenizer
class Ferret::Analysis::HyphenFilter
class Ferret::Analysis::LetterAnalyzer
class Ferret::Analysis::LetterTokenizer
class Ferret::Analysis::LowerCaseFilter
class Ferret::Analysis::MappingFilter
class Ferret::Analysis::PerFieldAnalyzer
class Ferret::Analysis::RegExpAnalyzer
class Ferret::Analysis::RegExpTokenizer
class Ferret::Analysis::StandardAnalyzer
class Ferret::Analysis::StandardTokenizer
class Ferret::Analysis::StemFilter
class Ferret::Analysis::StopFilter
class Ferret::Analysis::Token
class Ferret::Analysis::TokenStream
class Ferret::Analysis::WhiteSpaceAnalyzer
class Ferret::Analysis::WhiteSpaceTokenizer
Ferret::Search classes:
class Ferret::Search::BooleanQuery
class Ferret::Search::ConstantScoreQuery
class Ferret::Search::Explanation
class Ferret::Search::Filter
class Ferret::Search::FilteredQuery
class Ferret::Search::FuzzyQuery
class Ferret::Search::Hit
class Ferret::Search::MatchAllQuery
class Ferret::Search::MultiSearcher
class Ferret::Search::MultiTermQuery
class Ferret::Search::PhraseQuery
class Ferret::Search::PrefixQuery
class Ferret::Search::Query
class Ferret::Search::QueryFilter
class Ferret::Search::RangeFilter
class Ferret::Search::RangeQuery
class Ferret::Search::Searcher
class Ferret::Search::Sort
class Ferret::Search::SortField
class Ferret::Search::TermQuery
class Ferret::Search::TopDocs
class Ferret::Search::TypedRangeFilter
class Ferret::Search::TypedRangeQuery
class Ferret::Search::WildcardQuery
25. Ranking Results
0-60 with PageRank…
26. Naïve: Term Frequency
index.search_each('content:"the brown cow"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} "
end
> Score: 0.827, 3
> Score: 0.523, 5
> Score: 0.125, 4
Relevance?
Occurrences of each query word per document:
        Doc 3  Doc 5  Doc 4
the       4      3      5
brown     1      3      1
cow       1      4      1
Score     6     10      7
27. Naïve: Term Frequency
Skew: the common word "the" adds 4, 3 and 5 to the three scores.
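The naive score is just the number of times the query words occur in a document. A minimal sketch that reproduces the Score row of the table (6, 10, 7); the `naive_score` helper and the `TERM_COUNTS` constant are hypothetical names, and the counts are copied from the table rather than computed from real documents.

```ruby
# Term counts per document, copied from the table in the slides.
TERM_COUNTS = {
  3 => { "the" => 4, "brown" => 1, "cow" => 1 },
  5 => { "the" => 3, "brown" => 3, "cow" => 4 },
  4 => { "the" => 5, "brown" => 1, "cow" => 1 }
}

# Hypothetical helper: naive score = total occurrences of the query words.
def naive_score(counts, query)
  query.split.sum { |word| counts.fetch(word, 0) }
end

scores = TERM_COUNTS.map { |doc, counts| [doc, naive_score(counts, "the brown cow")] }.to_h

# Print documents from highest to lowest naive score.
scores.sort_by { |_doc, score| -score }.each do |doc, score|
  puts "Doc #{doc}: #{score}"
end
```

Document 4 outranks document 3 here only because it repeats "the" once more, which is the skew the slide points at.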
28. TF-IDF
Term Frequency * Inverse Document Frequency
        Doc 3  Doc 5  Doc 4   # of docs with word
the       4      3      5          6
brown     1      3      1          3
cow       1      4      1          4
Total # of documents: 10
Skew: "the" appears in 6 of the 10 documents.
Score = TF * IDF
TF = # occurrences / # words
IDF = ln(# docs / # docs with W)
29. TF-IDF
        Doc 3  Doc 5  Doc 4   # of docs with word
the       4      3      5          6
brown     1      3      1          3
cow       1      4      1          4
Total # of documents: 10
# words in document: 10
Doc # 3 score for ‘the’: 4/10 * ln(10/6) = 0.204
Doc # 3 score for ‘brown’: 1/10 * ln(10/3) = 0.120
Doc # 3 score for ‘cow’: 1/10 * ln(10/4) = 0.092
Score = 0.204 + 0.120 + 0.092 = 0.416
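The arithmetic above can be checked in a few lines. This is a sketch of the same calculation for document 3, using the natural logarithm as the worked example does; the `tf_idf` helper and the constant names are assumptions for illustration.

```ruby
TOTAL_DOCS     = 10
WORDS_IN_DOC   = 10
DOCS_WITH_TERM = { "the" => 6, "brown" => 3, "cow" => 4 }
DOC3_COUNTS    = { "the" => 4, "brown" => 1, "cow" => 1 }

# Hypothetical helper: TF * IDF for one word in one document.
def tf_idf(occurrences, words_in_doc, docs_with_term, total_docs)
  tf  = occurrences.to_f / words_in_doc               # term frequency
  idf = Math.log(total_docs.to_f / docs_with_term)    # inverse document frequency (natural log)
  tf * idf
end

score = DOC3_COUNTS.sum do |term, count|
  tf_idf(count, WORDS_IN_DOC, DOCS_WITH_TERM[term], TOTAL_DOCS)
end

puts score.round(3) # prints 0.416
```

Although "the" occurs four times as often as "brown" in this document, its contribution is less than twice as large, because it appears in 6 of the 10 documents.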
30. Frequency Matrix
        W1   W2   …   WN
Doc 1   15   23   …
Doc 2   24   12   …
…
Doc K
Size = N * K * size of Ruby object
Pages = N = 10,000
Words = K = 2,000
Ruby Object = 20+ bytes
Footprint = 384 MB
Ouch.
31. NArray is a Numerical N-dimensional Array class (implemented in C)
# create new NArray. initialize with 0.
NArray.new(typecode, size, ...)
# 1 byte unsigned integer
NArray.byte(size,...)
# 2 byte signed integer
NArray.sint(size,...)
# 4 byte signed integer
NArray.int(size,...)
# single precision float
NArray.sfloat(size,...)
# double precision float
NArray.float(size,...)
# single precision complex
NArray.scomplex(size,...)
# double precision complex
NArray.complex(size,...)
# Ruby object
NArray.object(size,...)
NArray
http://narray.rubyforge.org/
33. Links as votes
PageRank
the google juice
Problem: link gaming
34. Random Surfer
powerful abstraction
P = 0.85: Follow link from page he/she is currently on.
P = 0.15: Teleport to a random location on the web.
35. Follow link from page he/she is currently on.
Page K
Teleport to a random location on the web.
Page N Page M
Surfin’
rinse & repeat, ad nauseam
36. Surfin’
On Page P, clicks on link to K (P = 0.85)
On Page K, clicks on link to M (P = 0.85)
On Page M, teleports to X (P = 0.15)
…
rinse & repeat, ad nauseam
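The surfer's walk can be simulated directly. The sketch below is an illustration under stated assumptions, not code from the talk: the four-page graph in `SURFER_LINKS` is made up, `random_surfer` is a hypothetical helper, and a fixed seed is used so that runs are repeatable. It counts how often each page is visited, and the share of visits approximates the probability of finding the surfer on that page.

```ruby
# Hypothetical four-page web: page => pages it links to.
SURFER_LINKS = {
  "K" => ["M"],
  "M" => ["K", "N"],
  "N" => ["K"],
  "X" => ["K"]
}

# Hypothetical helper: walk the graph for `steps` steps and
# return each page's share of the visits.
def random_surfer(links, steps, follow = 0.85, rng = Random.new(42))
  pages  = links.keys
  visits = Hash.new(0)
  page   = pages.sample(random: rng)   # start on a random page

  steps.times do
    visits[page] += 1
    out = links[page]
    if out.empty? || rng.rand >= follow
      page = pages.sample(random: rng) # teleport (P = 0.15, or no out-links)
    else
      page = out.sample(random: rng)   # follow a link (P = 0.85)
    end
  end

  visits.map { |name, count| [name, count.to_f / steps] }.to_h
end

random_surfer(SURFER_LINKS, 50_000).sort.each do |name, share|
  puts format("%s: %.3f", name, share)
end
```

Page X has no in-links in this made-up graph, so the surfer only ever reaches it by teleporting.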
37. Analyzing the Web Graph
extracting PageRank
[Figure: web graph of pages X, N, M and K, labeled with probabilities P = 0.05, 0.20, 0.15 and 0.6]
38. What is PageRank?
It’s a scalar!
39. What is PageRank?
it’s a probability!
[Figure: web graph of pages X, N, M and K, labeled with probabilities P = 0.05, 0.20, 0.15 and 0.6]
40. What is PageRank?
it’s a probability!
Higher Pr, Higher Importance?
41. Teleportation?
sci-fi fans, … ?
42. Reasons for teleportation
enumerating edge cases
1. No in-links!
2. No out-links!
3. Isolated Web
[Figure: example graph with pages X, N, K, M and M]
44. Teleportation
probabilities
P(T) = 0.15 / # of pages
[Figure: graph of five pages (X, N, K, M, M), each labeled P(T) = 0.03]
45. PageRank: Simplified Mathematical Def’n
cause that’s how we roll
Assume the web is N pages big
Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85
Assume that the teleportation probability (E) is uniform
Assume that you start on any random page (uniform distribution L), then
\[ E = L = \begin{bmatrix} 1/N \\ \vdots \\ 1/N \end{bmatrix} \]
Then after one step, the probability you're on page X is:
\[ L \ast (sG + tE) = L \ast (0.85 \ast G + 0.15 \ast E) \]
46. G = The Link Graph
     1   2   …   …   N
1    1   0   …   …   0    (no link from 1 to N)
2    0   1   …   …   1
…    …   …   …   …   …
…    …   …   …   …   …
N    0   1   …   …   1
Huge!
ginormous and sparse
47. # Page => Links to…
{
  "1" => [25, 26],
  "2" => [1],
  "5" => [123, 2],
  "6" => [67, 1]
}
G as a dictionary
more compact…
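The dictionary form can drive the link-following part of a surfer step directly, without ever building the N x N matrix. A minimal sketch, assuming a small invented link table in which every target is also a key (ids are strings throughout, unlike the mixed keys above); `follow_links` is a hypothetical helper name:

```ruby
# Link table in the same shape as above: page id => ids of the pages it links to.
# The ids are a small invented example so that every target is also a key.
links = {
  "1" => ["2", "3"],
  "2" => ["3"],
  "3" => ["1"]
}

# Follow-link part of one surfer step, computed straight from the hash:
# each page passes its probability on in equal shares to the pages it links to.
def follow_links(links, dist)
  nxt = Hash.new(0.0)
  links.each do |page, targets|
    next if targets.empty?
    share = dist.fetch(page, 0.0) / targets.size
    targets.each { |target| nxt[target] += share }
  end
  nxt
end

uniform = links.keys.to_h { |page| [page, 1.0 / links.size] }
moved   = follow_links(links, uniform)
```

Only the links that exist are visited, which is why the hash scales to a sparse graph where the full matrix would not.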
48. From Page K, the surfer can:
- Follow a link from the page he/she is currently on.
- Teleport to a random location on the web.
Computing PageRank
the tedious way
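One way to read "the tedious way" is to repeat the surfer's step until the probabilities stop changing (power iteration). A minimal sketch in plain Ruby; the method name, tolerance, and step limit are assumptions for illustration, and `g[i][j]` is taken to be the probability of moving from page j to page i:

```ruby
# Repeat the surfer's step until the distribution stops changing (power iteration).
# g[i][j] = probability of moving from page j to page i; s = probability of following a link.
def pagerank_iterative(g, s: 0.85, tolerance: 1e-10, max_steps: 1_000)
  n = g.size
  teleport = (1.0 - s) / n              # uniform teleportation share per page
  rank = Array.new(n, 1.0 / n)          # start from the uniform distribution
  max_steps.times do
    nxt = (0...n).map { |i| s * (0...n).sum { |j| g[i][j] * rank[j] } + teleport }
    change = (0...n).sum { |i| (nxt[i] - rank[i]).abs }
    rank = nxt
    break if change < tolerance
  end
  rank
end

ranks = pagerank_iterative([[0.0, 0.0, 0.0],
                            [0.5, 0.0, 0.0],
                            [0.5, 1.0, 1.0]])
```

Each pass shrinks the remaining error by roughly a factor of s = 0.85, so this small example settles after a couple of hundred steps.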
49. Don’t trust me! Verify it yourself!
$$q = t\,(I - sG)^{-1}E = \begin{bmatrix} q_1 \\ \vdots \\ q_N \end{bmatrix}$$
where I is the identity matrix
Computing PageRank
in one swoop
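Taking up the invitation to verify: if the closed form is right, applying one more surfer step to its result must give the same vector back. A minimal sketch using Ruby's standard-library `Matrix` as a stand-in for GSL (an assumption made so the check runs without extra gems); the variable names are illustrative:

```ruby
require "matrix"  # Ruby's standard-library Matrix, standing in for GSL here

s = 0.85                                           # probability of following a link
t = 1 - s                                          # probability of teleportation
g = Matrix[[0.0, 0.0, 0.0],
           [0.5, 0.0, 0.0],
           [0.5, 1.0, 1.0]]
n = g.row_count
e = Matrix.column_vector(Array.new(n, 1.0 / n))    # uniform teleportation vector

# Closed form: q = t * (I - s*G)^-1 * E
q = (Matrix.identity(n) - g * s).inverse * e * t

# One more surfer step should leave q unchanged: s*G*q + t*E == q
residual = (g * q * s + e * t - q).to_a.flatten.map(&:abs).max
```

`residual` comes out at floating-point noise level, which is the fixed-point property the formula encodes.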
50. Enough hand-waving, dammit!
show me the code
51. Hot, Fast, Awesome
Birth of EM-Proxy
flash of the obvious
52. http://rb-gsl.rubyforge.org/
Hot, Fast, Awesome
Click there! … Give yourself a weekend.
53. http://ruby-gsl.sourceforge.net/
Click there! … Give yourself a weekend.
54. require "gsl"
include GSL
# INPUT: link structure matrix (NxN)
# OUTPUT: pagerank scores
def pagerank(g)
  raise if g.size1 != g.size2 # verify NxN
  i = Matrix.I(g.size1) # identity matrix
  p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector
  s = 0.85 # probability of following a link
  t = 1-s # probability of teleportation
  t*((i-s*g).invert)*p
end
PageRank in Ruby
6 lines, or less
55. require "gsl"
include GSL
# INPUT: link structure matrix (NxN)
# OUTPUT: pagerank scores
def pagerank(g)
  raise if g.size1 != g.size2
  # Constants…
  i = Matrix.I(g.size1) # identity matrix
  p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector
  s = 0.85 # probability of following a link
  t = 1-s # probability of teleportation
  t*((i-s*g).invert)*p
end
PageRank in Ruby
6 lines, or less
56. require "gsl"
include GSL
# INPUT: link structure matrix (NxN)
# OUTPUT: pagerank scores
def pagerank(g)
  raise if g.size1 != g.size2
  i = Matrix.I(g.size1) # identity matrix
  p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector
  s = 0.85 # probability of following a link
  t = 1-s # probability of teleportation
  t*((i-s*g).invert)*p # PageRank!
end
PageRank in Ruby
6 lines, or less
57. [Figure: pages X, N, K, each with P = 0.33]
pagerank(Matrix[[0,0,1], [1,0,0], [0,1,0]])
> [0.33, 0.33, 0.33]
Ex: Circular Web
testing intuition…
58. [Figure: pages X, N, K with P = 0.05, P = 0.07, and P = 0.87 for K]
pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
> [0.05, 0.07, 0.87]
Ex: All roads lead to K
testing intuition…
59. PageRank + Ferret
awesome search, ftw!
60. [Figure: pages 1, 2, 3 with P = 0.05, P = 0.07, P = 0.87]
require 'ferret'
include Ferret
index = Index::Index.new()
index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
index << {:title => "2", :content => "what is it", :pr => 0.07 }
index << {:title => "3", :content => "it is a banana", :pr => 0.87 }
Store PageRank
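Once the score is stored next to each document, it can order the matches. The sketch below does not use Ferret at all: it is a plain-Ruby stand-in, with a hypothetical helper `search_by_rank`, that keeps the documents containing every query word and lists the highest PageRank first:

```ruby
# Same three documents as above, as plain hashes (no Ferret involved in this sketch).
docs = [
  { title: "1", content: "it is what it is", pr: 0.05 },
  { title: "2", content: "what is it",       pr: 0.07 },
  { title: "3", content: "it is a banana",   pr: 0.87 }
]

# Hypothetical helper: keep documents containing every query word, highest PageRank first.
def search_by_rank(docs, query)
  words = query.downcase.split
  docs.select { |doc| (words - doc[:content].downcase.split).empty? }
      .sort_by { |doc| -doc[:pr] }
      .map { |doc| doc[:title] }
end

all_three = search_by_rank(docs, "it is")   # every document matches; order comes from :pr
only_what = search_by_rank(docs, "what")    # documents 1 and 2 match
```

A real engine would blend text relevance with the stored score rather than sort on PageRank alone; this only illustrates why the `:pr` field is worth keeping in the index.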
64. Search*: Graphs are ubiquitous!
PageRank is a general purpose hammer
66. Hmm…
Analyze the social graph:
- Filter messages by ‘TwitterRank’
- Suggest users by ‘TwitterRank’
-…
PageRank + Social Graph
Twitter
67. PageRank + Product Graph
E-commerce
Link items purchased in same cart… Run PR on it.
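Building that product graph is a few lines of Ruby. A minimal sketch, assuming hypothetical order data where each cart is a list of item names; every pair of items sharing a cart gets a link in both directions:

```ruby
# Hypothetical order data: each cart is the list of items bought together.
carts = [
  ["tent", "sleeping bag", "stove"],
  ["tent", "stove"],
  ["stove", "fuel"]
]

# Link every pair of items that shared a cart, in both directions.
links = Hash.new { |hash, item| hash[item] = [] }
carts.each do |cart|
  cart.uniq.combination(2).each do |a, b|
    links[a] |= [b]   # set union keeps each link only once
    links[b] |= [a]
  end
end
```

The result has the same page => links shape as the link dictionary earlier in the deck, so it can be ranked the same way; the item that co-occurs with the most others (here the stove) ends up with the most links.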
68. PageRank = Powerful Hammer
use it!
69. Personalization
how would you do it?
70. Teleportation distribution doesn’t have to be uniform!
Uniform case: $$tE = \begin{bmatrix} 0.15/N \\ \vdots \\ 0.15/N \end{bmatrix}$$
yahoo.com is my homepage!
PageRank + Personalization
customize the teleportation vector
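To see what a customized teleportation vector does, compare a uniform one with one that always teleports to a single "homepage". A minimal sketch in plain Ruby; the method name, the fixed step count, and the three-page cycle are assumptions for illustration, and `g[i][j]` is taken to be the probability of moving from page j to page i:

```ruby
# Power iteration with a caller-supplied teleportation distribution e (must sum to 1).
def personalized_pagerank(g, e, s: 0.85, steps: 200)
  n = g.size
  rank = Array.new(n, 1.0 / n)
  steps.times do
    rank = (0...n).map { |i| s * (0...n).sum { |j| g[i][j] * rank[j] } + (1.0 - s) * e[i] }
  end
  rank
end

# Hypothetical 3-page cycle: 0 -> 1 -> 2 -> 0 (column j holds page j's out-links).
cycle = [[0.0, 0.0, 1.0],
         [1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]

uniform  = personalized_pagerank(cycle, [1.0 / 3, 1.0 / 3, 1.0 / 3])
homepage = personalized_pagerank(cycle, [1.0, 0.0, 0.0])   # every teleport lands on page 0
```

With the uniform vector all three pages tie; with the homepage vector, page 0 ranks highest and the scores fall off along the cycle, even though the link structure is unchanged.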
71. Make pages with links!
Gaming PageRank
for fun and profit (I don’t endorse it)
http://bit.ly/pagerank-spam
72. Slides: http://bit.ly/railsconf-pagerank
Ferret: http://bit.ly/ferret
RB-GSL: http://bit.ly/rb-gsl
PageRank on Wikipedia: http://bit.ly/wp-pagerank
Gaming PageRank: http://bit.ly/pagerank-spam
Michael Nielsen’s lectures on PageRank:
http://michaelnielsen.org/blog
Questions?
The slides… Twitter My blog