The document discusses search implementation and ElasticSearch. It begins with an overview of how search works by indexing documents into an inverted index of tokens and associated document postings. It then provides a Ruby implementation of a basic search index and demonstrates indexing documents and searching the index. The document concludes by describing features of ElasticSearch like its use of HTTP and JSON, schema-free indexing, distributed search capabilities, and Ruby integration.
Your Data, Your Search, ElasticSearch (EURUKO 2011)
17. WHY SEARCH SUCKS?
How do you implement search?
class MyModel
  include Whatever::Search
end
MyModel.search "something"
18. WHY SEARCH SUCKS?
How do you implement search?
class MyModel
  include Whatever::Search
  # MAGIC
end
MyModel.search "whatever"
19. WHY SEARCH SUCKS?
How do you implement search?
Query Results Result
def search
  @results = MyModel.search params[:q]
  respond_with @results
end
20. WHY SEARCH SUCKS?
How do you implement search?
Query Results Result
MAGIC
def search
  @results = MyModel.search params[:q]
  respond_with @results
end
21. WHY SEARCH SUCKS?
How do you implement search?
Query Results Result
MAGIC +
def search
  @results = MyModel.search params[:q]
  respond_with @results
end
25. HOW DOES SEARCH WORK?
A collection of documents
file_1.txt
The ruby is a pink to blood-red colored gemstone ...
file_2.txt
Ruby is a dynamic, reflective, general-purpose object-oriented programming language ...
file_3.txt
"Ruby" is a song by English rock band Kaiser Chiefs ...
26. HOW DOES SEARCH WORK?
How do you search documents?
File.read('file_1.txt').include?('ruby')
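The one-liner above checks a single file. A minimal sketch of the same brute-force idea applied to the whole collection, assuming the three example documents are held in an in-memory hash (the `DOCUMENTS` constant and the `scan` method are illustrative names, not part of the talk):

```ruby
# Hypothetical in-memory stand-in for the three example files.
DOCUMENTS = {
  'file_1.txt' => 'The ruby is a pink to blood-red colored gemstone ...',
  'file_2.txt' => 'Ruby is a dynamic, reflective, general-purpose object-oriented programming language ...',
  'file_3.txt' => '"Ruby" is a song by English rock band Kaiser Chiefs ...'
}.freeze

# Brute-force search: every query looks through every document,
# so the cost grows with the total size of the collection.
def scan(term, case_sensitive: true)
  DOCUMENTS.select do |_name, text|
    if case_sensitive
      text.include?(term)
    else
      text.downcase.include?(term.downcase)
    end
  end.keys
end

p scan('ruby')
p scan('ruby', case_sensitive: false)
```

A plain substring check is case-sensitive, so the first call finds only the document that spells the word in lowercase. Normalizing the text at query time fixes that, but repeats the work on every search.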
27. HOW DOES SEARCH WORK?
The inverted index
TOKENS POSTINGS
ruby file_1.txt file_2.txt file_3.txt
pink file_1.txt
gemstone file_1.txt
dynamic file_2.txt
reflective file_2.txt
programming file_2.txt
song file_3.txt
english file_3.txt
rock file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
28. HOW DOES SEARCH WORK?
The inverted index
MySearchLib.search "ruby"
ruby file_1.txt file_2.txt file_3.txt
pink file_1.txt
gemstone file_1.txt
dynamic file_2.txt
reflective file_2.txt
programming file_2.txt
song file_3.txt
english file_3.txt
rock file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
29. HOW DOES SEARCH WORK?
The inverted index
MySearchLib.search "song"
ruby file_1.txt file_2.txt file_3.txt
pink file_1.txt
gemstone file_1.txt
dynamic file_2.txt
reflective file_2.txt
programming file_2.txt
song file_3.txt
english file_3.txt
rock file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
30. A naïve Ruby implementation
module SimpleSearch
  def index document, content
    tokens = analyze content
    store document, tokens
    puts "Indexed document #{document} with tokens:", tokens.inspect, "\n"
  end

  def analyze content
    # >>> Split content by words into "tokens"
    content.split(/\W/).
    # >>> Downcase every word
    map { |word| word.downcase }.
    # >>> Reject stop words, digits and whitespace
    reject { |word| STOPWORDS.include?(word) || word =~ /^\d+/ || word == '' }
  end

  def store document_id, tokens
    tokens.each do |token|
      # >>> Save the "posting"
      ( (INDEX[token] ||= []) << document_id ).uniq!
    end
  end

  def search token
    puts "Results for token '#{token}':"
    # >>> Print documents stored in index for this token (none if the token is unknown)
    (INDEX[token] || []).each { |document| puts " * #{document}" }
  end

  INDEX = {}
  # The stop word list is cut off in the source after "there t"; only the visible words are kept.
  STOPWORDS = %w|a an and are as at but by for if in is it no not of on or that the then there|

  extend self
end
31. HOW DOES SEARCH WORK?
Indexing documents
SimpleSearch.index "file1", "Ruby is a language. Java is also a language."
SimpleSearch.index "file2", "Ruby is a song."
SimpleSearch.index "file3", "Ruby is a stone."
SimpleSearch.index "file4", "Java is a language."
Indexed document file1 with tokens:
["ruby", "language", "java", "also", "language"]
Indexed document file2 with tokens:
["ruby", "song"]
Indexed document file3 with tokens:
["ruby", "stone"]
Indexed document file4 with tokens:
["java", "language"]
Words downcased, stopwords removed.
32. HOW DOES SEARCH WORK?
The index
puts "What's in our index?"
p SimpleSearch::INDEX
{
"ruby" => ["file1", "file2", "file3"],
"language" => ["file1", "file4"],
"java" => ["file1", "file4"],
"also" => ["file1"],
"stone" => ["file3"],
"song" => ["file2"]
}
33. HOW DOES SEARCH WORK?
Search the index
SimpleSearch.search "ruby"
Results for token 'ruby':
* file1
* file2
* file3
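Once the postings exist, a query with several tokens reduces to set operations on the posting lists. A minimal sketch, assuming the index contents printed on the previous slide; `POSTINGS`, `search_all` and `search_any` are hypothetical names and are not part of SimpleSearch:

```ruby
# Copy of the index contents shown on the previous slide.
POSTINGS = {
  'ruby'     => %w[file1 file2 file3],
  'language' => %w[file1 file4],
  'java'     => %w[file1 file4],
  'also'     => %w[file1],
  'stone'    => %w[file3],
  'song'     => %w[file2]
}.freeze

# Documents containing every token: intersect the posting lists.
def search_all(*tokens)
  lists = tokens.map { |token| POSTINGS.fetch(token, []) }
  lists.reduce { |common, list| common & list } || []
end

# Documents containing at least one token: union of the posting lists.
def search_any(*tokens)
  tokens.flat_map { |token| POSTINGS.fetch(token, []) }.uniq
end

p search_all('ruby', 'language')
p search_any('song', 'stone')
```

Neither helper reads the documents themselves; both work only on the token lookup table.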
34. HOW DOES SEARCH WORK?
The inverted index
TOKENS POSTINGS
ruby 3 file_1.txt file_2.txt file_3.txt
pink 1 file_1.txt
gemstone file_1.txt
dynamic file_2.txt
reflective file_2.txt
programming file_2.txt
song file_3.txt
english file_3.txt
rock file_3.txt
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices
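The numbers beside `ruby` and `pink` on this slide appear to be the number of documents in each posting list. Assuming that reading, a short sketch of how such counts fall out of the index; `SLIDE_INDEX` and `document_frequencies` are illustrative names:

```ruby
# Postings as listed in the table above.
SLIDE_INDEX = {
  'ruby'        => %w[file_1.txt file_2.txt file_3.txt],
  'pink'        => %w[file_1.txt],
  'gemstone'    => %w[file_1.txt],
  'dynamic'     => %w[file_2.txt],
  'reflective'  => %w[file_2.txt],
  'programming' => %w[file_2.txt],
  'song'        => %w[file_3.txt],
  'english'     => %w[file_3.txt],
  'rock'        => %w[file_3.txt]
}.freeze

# Number of documents each token occurs in.
def document_frequencies(index)
  index.transform_values(&:size)
end

frequencies = document_frequencies(SLIDE_INDEX)
puts "ruby: #{frequencies['ruby']}, pink: #{frequencies['pink']}"
```

A token that occurs in every document, such as `ruby` here, says little about which document is the best match, while a rare token such as `pink` narrows the result down to one.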
35. It is very practical to know how search works.
For instance, now you know that
the analysis step is very important.
Most of the time, it's more important than the search step.
ElasticSearch
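To illustrate why the analysis step matters so much, here is a small sketch that indexes the same documents with two different analyzers. The method names and the two-document collection are assumptions made for this example only:

```ruby
# Analyzer 1: split on whitespace only, keep case and punctuation.
def whitespace_analyze(text)
  text.split(/\s+/)
end

# Analyzer 2: split on non-word characters and downcase.
def normalizing_analyze(text)
  text.split(/\W+/).map(&:downcase).reject(&:empty?)
end

# Build a token => document ids table with the given analyzer.
def build_index(docs, analyzer)
  index = Hash.new { |hash, token| hash[token] = [] }
  docs.each do |id, text|
    analyzer.call(text).uniq.each { |token| index[token] << id }
  end
  index
end

docs = { 'file2' => 'Ruby is a song.', 'file3' => 'Ruby is a stone.' }
raw        = build_index(docs, method(:whitespace_analyze))
normalized = build_index(docs, method(:normalizing_analyze))

p raw.fetch('ruby', [])        # the raw index only knows the token "Ruby"
p raw.fetch('song', [])        # ... and "song." with the trailing dot
p normalized.fetch('ruby', [])
p normalized.fetch('song', [])
```

The lookup code is identical in both cases. Whether a query finds anything is decided entirely by how the text was turned into tokens, which is also why the query string should pass through the same analysis as the documents.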
36. module SimpleSearch
def index document, content
tokens = analyze content
store document, tokens
puts "Indexed document #{document} with tokens:", tokens.inspect, "n"
end
def analyze content
# >>> Split content by words into "tokens"
content.split(/W/).
# >>> Downcase every word
map { |word| word.downcase }.
# >>> Reject stop words, digits and whitespace
reject { |word| STOPWORDS.include?(word) || word =~ /^d+/ || word == '' }
end
def store document_id, tokens
tokens.each do |token|
# >>> Save the "posting"
( (INDEX[token] ||= []) << document_id ).uniq!
end
end
def search token
puts "Results for token '#{token}':"
# >>> Print documents stored in index for this token
(INDEX[token] || []).each { |document| puts " * #{document}" }
end
INDEX = {}
STOPWORDS = %w|a an and are as at but by for if in is it no not of on or that the then there t
extend self
end
A naïve Ruby implementation
37. HOW DOES SEARCH WORK?
The Search Engine Textbook
Search Engines
Information Retrieval in Practice
Bruce Croft, Donald Metzler and Trevor Strohman
Addison Wesley, 2009
http://search-engines-book.com
38. SEARCH IMPLEMENTATIONS
The Baseline Information Retrieval Implementation
Lucene in Action
Michael McCandless, Erik Hatcher and Otis Gospodnetic
July, 2010
http://manning.com/hatcher3
52. ELASTICSEARCH FEATURES
HTTP / JSON / Schema Free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
The “Sliding Window” problem
curl -X DELETE http://localhost:9200/logs_2010_01
logs_2010_02
logs
logs_2010_03
logs_2010_04
“We can really store only three months worth of data.”
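As a sketch of how the sliding window could be automated, the helper below picks out the monthly indices that have fallen out of a three-month retention window and prints the matching DELETE commands. The `expired_indices` name and the retention logic are assumptions for illustration; only the `logs_YYYY_MM` naming and the curl command come from the slide.

```ruby
require "date"

# Hypothetical helper: names of monthly indices older than the retention window.
def expired_indices(existing, today, keep_months = 3)
  # First day of the oldest month still kept,
  # e.g. today in 2010-04 with 3 months kept => 2010-02-01.
  cutoff = Date.new(today.year, today.month, 1) << (keep_months - 1)
  existing.select do |name|
    year, month = name.scan(/\d+/).map(&:to_i)
    Date.new(year, month, 1) < cutoff
  end
end

indices = %w[logs_2010_01 logs_2010_02 logs_2010_03 logs_2010_04]

expired_indices(indices, Date.new(2010, 4, 15)).each do |name|
  # Only prints the command; nothing is sent to a server.
  puts "curl -X DELETE http://localhost:9200/#{name}"
end
```

Because each month lives in its own index, dropping old data is a single cheap index deletion rather than a delete-by-query.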
53. ELASTICSEARCH FEATURES
Index Templates
Apply this configuration for every matching index being created:
curl -X PUT localhost:9200/_template/bookmarks_template -d '
{
"template" : "users_*",
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 3
}
},
"mappings": {
"url": {
"properties": {
"url": {
"type": "string", "analyzer": "simple", "boost": 10
},
"title": {
"type": "string", "analyzer": "snowball", "boost": 5
}
// ...
}
}
}
}
'
http://www.elasticsearch.org/guide/reference/api/admin-indices-templates.html
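To get a feel for which index names a pattern such as `users_*` would cover, the sketch below uses Ruby's own glob matcher. This is an assumption made for illustration: it treats the "template" value as a simple glob and does not reproduce ElasticSearch's actual matching code. `template_applies?` is a hypothetical helper.

```ruby
# Assumption: the "template" value behaves like a simple glob on index names.
# File.fnmatch is Ruby's glob matcher, used here only to illustrate the idea.
def template_applies?(pattern, index_name)
  File.fnmatch(pattern, index_name)
end

%w[users_2011 users_archive bookmarks user_1].each do |index_name|
  verdict = template_applies?("users_*", index_name) ? "template applied" : "template ignored"
  puts "#{index_name}: #{verdict}"
end
```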
55. ELASTICSEARCH FEATURES
Index A is split into 3 shards, and duplicated in 2 replicas.
A1 A1' A1'' Replicas
A2 A2' A2''
A3 A3' A3''
curl -XPUT 'http://localhost:9200/A/' -d '{
"settings" : {
"index" : {
"number_of_shards" : 3,
"number_of_replicas" : 2
}
}
}'
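The numbers in the diagram can be checked with a little arithmetic: every primary shard exists once, plus once per replica. The sketch below builds the same settings as a Ruby hash and counts the shard copies; `total_shard_copies` is an illustrative helper, not an ElasticSearch or Tire API.

```ruby
require "json"

# Settings copied from the curl example above.
SETTINGS = {
  "settings" => {
    "index" => { "number_of_shards" => 3, "number_of_replicas" => 2 }
  }
}

# Every primary shard exists once, plus once per configured replica.
def total_shard_copies(settings)
  index = settings["settings"]["index"]
  index["number_of_shards"] * (1 + index["number_of_replicas"])
end

# The JSON body that would be sent with the PUT request.
puts JSON.pretty_generate(SETTINGS)
puts "Shard copies in the cluster: #{total_shard_copies(SETTINGS)}"
```

With 3 shards and 2 replicas this gives 9 shard copies, matching the 3 by 3 grid (A1, A1', A1'' and so on) shown above.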
56. ELASTICSEARCH FEATURES
SHARDS: Improve indexing performance
REPLICAS: Improve search performance
57. ELASTICSEARCH FEATURES
$ curl -X GET "http://localhost:9200/_search?q=<YOUR QUERY>"
Terms       apple
            apple iphone
Phrases     "apple iphone"
Proximity   "apple safari"~5
Fuzzy       apple~0.8
Wildcards   app*
            *pp*
Boosting    apple^10 safari
Range       [2011/05/01 TO 2011/05/31]
            [java TO json]
Boolean     apple AND NOT iphone
            +apple -iphone
            (apple OR iphone) AND NOT review
Fields      title:iphone^15 OR body:iphone
            published_on:[2011/05/01 TO "2011/05/27 10:00:00"]
http://lucene.apache.org/java/3_1_0/queryparsersyntax.html
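Many of these queries contain spaces, quotes or brackets, so they need to be URL-encoded before they go into the `q` parameter. A minimal sketch using Ruby's standard library follows; `search_url` is a hypothetical helper, and the host and port are simply the defaults used in the curl examples.

```ruby
require "uri"

# Hypothetical helper: builds the search URL for a Lucene query string.
# Host and port are the defaults used in the curl examples.
def search_url(query, host = "localhost", port = 9200)
  "http://#{host}:#{port}/_search?" + URI.encode_www_form(q: query)
end

puts search_url("apple AND NOT iphone")
puts search_url('"apple safari"~5')
puts search_url("title:iphone^15 OR body:iphone")
```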
67. ELASTICSEARCH FEATURES
K R O A T I E N
Trigrams:
K R O
R O A
O A T
A T I
T I E
I E N
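The same breakdown can be reproduced in a few lines of Ruby. This is only a sketch of character trigram generation, with `trigrams` as an illustrative name; real n-gram analyzers offer more options, such as minimum and maximum gram sizes.

```ruby
# Slide a window of three characters over the word, one position at a time.
def trigrams(word)
  word.chars.each_cons(3).map(&:join)
end

p trigrams("KROATIEN")
```

Indexing trigrams instead of whole words lets a query match on fragments of a word, at the cost of a larger index.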
68. ELASTICSEARCH FEATURES
Tire.index 'articles' do
delete
create
store :title => 'One', :tags => ['ruby'], :published_on => '2011-01-01'
store :title => 'Two', :tags => ['ruby', 'python'], :published_on => '2011-01-02'
store :title => 'Three', :tags => ['java'], :published_on => '2011-01-02'
store :title => 'Four', :tags => ['ruby', 'php'], :published_on => '2011-01-03'
refresh
end
s = Tire.search 'articles' do
query { string 'title:T*' }
filter :terms, :tags => ['ruby']
sort { title 'desc' }
facet('global-tags') { terms :tags, :global => true }
facet('current-tags') { terms :tags }
end
http://github.com/karmi/tire
69. ELASTICSEARCH FEATURES
class Article < ActiveRecord::Base
include Tire::Model::Search
include Tire::Model::Callbacks
end
$ rake environment tire:import CLASS='Article'
Article.search do
query { string 'love' }
facet('timeline') { date :published_on, :interval => 'month' }
sort { published_on 'desc' }
end
http://github.com/karmi/tire
70. ELASTICSEARCH FEATURES
class Article
include Whatever::ORM
include Tire::Model::Search
include Tire::Model::Callbacks
end
$ rake environment tire:import CLASS='Article'
Article.search do
query { string 'love' }
facet('timeline') { date :published_on, :interval => 'month' }
sort { published_on 'desc' }
end
http://github.com/karmi/tire
72. Try ElasticSearch and Tire with a one-line command.
$ rails new tired -m "https://gist.github.com/raw/951343/tired.rb"
A “batteries included” installation.
Downloads and launches ElasticSearch.
Sets up a Rails application and launches it.
When you're tired of it, just delete the folder.