This document provides an overview of Lucene and how it can be used with MySQL. It discusses:
- What Lucene is and its origins as an open source information retrieval library.
- How Lucene works as a toolkit for building search applications rather than a turnkey search engine.
- Core Lucene classes like IndexWriter, Directory, Analyzer, and Document that are used for indexing data.
- Classes like IndexSearcher and Query that support basic search operations through queries and hits.
- Examples of loading data from a MySQL database into a Lucene index and performing searches on that indexed data.
3. What is Lucene?
Started in 1997 as a “self-serving project”
2001: Apache adopts Lucene
Open Source Information Retrieval (IR) Library
- available from the Apache Software Foundation
- Search and Index any textual data
- Doesn’t care about language, source and format of data
4. Lucene?
Not a turnkey search engine
Standard
- for building open-source based large-scale search applications
- a high-performance, scalable, cross-platform search toolkit
- Today: translated into C++, C#, Perl, Python, Ruby
- for embedded and customizable search
- widely adopted by OEM software vendors and enterprise IT departments
6. What types of queries it supports
Single and multi-term queries
Phrase queries
Wildcards
Result ranking
+apple -computer +pie
country:USA
country:USA AND state:CA
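In a query such as +apple -computer +pie, a leading + marks a term as required and a leading - marks it as prohibited. The following is a minimal sketch of that matching rule in plain Java. It does not use Lucene's QueryParser; the class name ToyBooleanQuery and the sample texts are made up for illustration, and fielded clauses such as country:USA are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy matcher for queries such as "+apple -computer +pie".
// Hypothetical illustration only; this is not Lucene's QueryParser.
class ToyBooleanQuery {

    // "+term" must occur, "-term" must not occur, a bare term is optional.
    // With no required terms, at least one optional term has to occur.
    static boolean matches(String query, String text) {
        Set<String> tokens = new HashSet<>(
                Arrays.asList(text.toLowerCase().split("\\W+")));
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String clause : query.trim().toLowerCase().split("\\s+")) {
            if (clause.startsWith("+")) {
                hasRequired = true;
                if (!tokens.contains(clause.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (tokens.contains(clause.substring(1))) {
                    return false; // a prohibited term is present
                }
            } else if (tokens.contains(clause)) {
                optionalHit = true;
            }
        }
        return hasRequired || optionalHit;
    }

    public static void main(String[] args) {
        String query = "+apple -computer +pie";
        System.out.println(matches(query, "Apple pie recipe"));         // true
        System.out.println(matches(query, "Apple computer pie chart")); // false
    }
}
```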
7. Cons
Need Java resources (programmers)
- JSP experience a plus
Implementation and Maintenance Cost
By default
- No installer or wizard for setup (it’s a toolkit)
- No administration or command-line tools (demo available)
- No spider
- Coding one yourself is always an option
- No complex script language support by default
- 3rd-party tools available
8. Cons 2
- No built-in support for the following (demos show how to implement them)
- HTML format
- PDF format
- Microsoft Office Documents
- Advanced XML queries
- “How-tos” available.
- No database gateway
- Integrates with MySQL with little work
- Web interface
- JSP sample available
- Missing enterprise support
9. Lucene Libraries
The Lucene libraries include core search components such as a document indexer, index searcher, query parser, and text analyzer.
10. Who is behind Lucene?
Doug Cutting (Author)
Previously at Excite
Apache Software Foundation
11. Who uses Lucene?
IBM
- IBM OmniFind Yahoo! Edition
CNET
- http://reviews.cnet.com/
- http://www.mail-archive.com/java-user@lucene.apache.org/msg02645.html
Wikipedia
Fedex
Akamai’s EdgeComputing platform
Technorati
FURL
Sun
- Open Solaris Source Browser
12. When to use Lucene?
Search applications
Search functionality for existing applications
Search-enabling database applications
13. When not to use?
Not ideal for
- Adding generic search to a site
- Enterprise systems needing support for proprietary formats
- Extremely high volume systems
- Through a better architecture this can be solved
- Investigate carefully if
- You need more than 100 QPS per system
- Highly volatile data
- Updates are actually Deletes and Additions
- Additions visible to new sessions only
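The last two points (an update is really a delete plus an addition, and changes are visible only to sessions opened afterwards) can be modelled with a small snapshot-based toy. This is a sketch under stated assumptions, not Lucene code: the class ToyVolatileIndex and its methods are hypothetical, and a plain map stands in for the index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model of a volatile index. Hypothetical names; not the Lucene API.
class ToyVolatileIndex {
    private final Map<String, String> docs = new HashMap<>();

    void add(String id, String text) {
        docs.put(id, text);
    }

    void delete(String id) {
        docs.remove(id);
    }

    // There is no in-place modification: an update is a delete
    // followed by an addition of the new version.
    void update(String id, String newText) {
        delete(id);
        add(id, newText);
    }

    // A search session works on a frozen snapshot taken when it is opened,
    // so later additions are visible to new sessions only.
    Map<String, String> openSearcher() {
        return Collections.unmodifiableMap(new HashMap<>(docs));
    }

    public static void main(String[] args) {
        ToyVolatileIndex index = new ToyVolatileIndex();
        index.add("1", "old text");
        Map<String, String> oldSession = index.openSearcher();
        index.update("1", "new text");
        Map<String, String> newSession = index.openSearcher();
        System.out.println(oldSession.get("1")); // old text
        System.out.println(newSession.get("1")); // new text
    }
}
```

For highly volatile data this means every change costs a delete and an add, and searchers have to be reopened to see it, which is why the slide advises investigating carefully.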
14. Why Lucene?
What problems does Lucene solve?
- Full text with MySQL
- Pros and Cons
Powerful features
Simple API
Scalable, cost-effective, efficient Indexing
- Powerful Searching through multiple query types
17. IndexWriter
IndexWriter
- Creates a new index
- Adds documents to an index
- Gives you “write” access but no “read” access
- Not the only class used to modify an index
- Lucene API can be used as well
18. Directory
Directory
- Represents location of the Lucene Index
- Abstract class
- Allows its subclasses to store the index as they see fit
- FSDirectory
- RAMDirectory
- Interface Identical to FSDirectory
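The idea of an abstract Directory whose subclasses decide where the index lives can be sketched in plain Java as follows. These classes are hypothetical stand-ins (ToyDirectory, ToyRamDirectory, ToyFsDirectory) and are much simpler than Lucene's real Directory, FSDirectory, and RAMDirectory; the two-method interface is an assumption made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an abstract index location.
abstract class ToyDirectory {
    abstract void writeFile(String name, String content);
    abstract String readFile(String name);
}

// Keeps every index file in memory; contents vanish when the JVM exits.
class ToyRamDirectory extends ToyDirectory {
    private final Map<String, String> files = new HashMap<>();

    void writeFile(String name, String content) {
        files.put(name, content);
    }

    String readFile(String name) {
        return files.get(name);
    }
}

// Stores every index file under a directory on disk.
class ToyFsDirectory extends ToyDirectory {
    private final Path root;

    ToyFsDirectory(Path root) {
        this.root = root;
    }

    void writeFile(String name, String content) {
        try {
            Files.write(root.resolve(name), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String readFile(String name) {
        try {
            return new String(Files.readAllBytes(root.resolve(name)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because both subclasses expose the same interface, code that indexes or searches does not need to know whether it is working against memory or disk.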
19. Analyzer
Analyzer
- Text passed through analyzer before indexing
- Specified in the IndexWriter constructor
- In charge of extracting tokens out of text to be indexed
- Rest is eliminated
- Several implementations available (stop words, lower case, etc.)
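What an analyzer does to text before indexing can be illustrated with a toy version that splits on non-alphanumeric characters, lowercases each token, and drops stop words. This is a plain-Java sketch, not one of Lucene's Analyzer implementations; the class name ToyAnalyzer and the short stop-word list are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy analyzer: tokenize, lowercase, remove stop words.
// Hypothetical illustration; not a Lucene Analyzer.
class ToyAnalyzer {
    // Small, made-up stop-word list for the example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = raw.toLowerCase();
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token); // only these tokens would reach the index
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [art, searching, with, lucene]
        System.out.println(analyze("The Art of Searching with Lucene"));
    }
}
```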
20. Document
Document
- Collection of fields (virtual document)
- Chunk of data
- Fields of a document represent the document or meta-data associated with that document
- Original source of Document data (Word, PDF) is irrelevant
- Metadata indexed and stored separately as fields of a document
- Text only: java.lang.String and java.io.Reader are the only things handled by core
21. Field 1
Field
- Document in an index contains one or more fields (in a class called Field)
- Each field represents data that is either queried against or retrieved from the index during search.
- Four different types:
- Keyword
- Isn’t analyzed
- But indexed and stored in the index
- Ideal for:
- URLs
- Paths
- SSN
- Names
- Original value is preserved in its entirety
22. Field types
- Unindexed
- Neither analyzed nor indexed
- Value stored in index as is
- Fields that need to be displayed with search results (URL etc.)
- But you won’t search based on these fields
- Because original values are stored
- Don’t store fields with very large values
- Especially if index size will be an issue
23. Field types
- Unstored
- Opposite of UnIndexed
- Field type is analyzed and indexed but isn’t stored in the index
- Suitable for indexing a large amount of text that’s not going to be needed in original form
- E.g.
- HTML of a webpage etc
24. Field types
- Text
- Analyzed and indexed
- Field of this type can be searched against
- Be careful about the field size
- If the indexed data is a String, it will be stored
- If the data comes from a Reader, it will not be stored
25. Note:
Field.Text(String, String) and Field.Text(String, Reader) are different.
- (String, String) stores the field data
- (String, Reader) does not
To index a String, but not store it, use
- Field.UnStored(String, String)
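The four field types described on the preceding slides differ only in three properties: whether the value is analyzed, indexed, and stored. The enum below summarizes them in plain Java. It is a hypothetical summary (ToyFieldTypes is not a Lucene class), and the TEXT row describes the (String, String) form; as noted above, the Reader form is not stored.

```java
// Summary of the four field types as three flags.
// Hypothetical illustration; not part of the Lucene API.
class ToyFieldTypes {

    enum FieldType {
        //        analyzed, indexed, stored
        KEYWORD  (false,    true,    true),
        UNINDEXED(false,    false,   true),
        UNSTORED (true,     true,    false),
        TEXT     (true,     true,    true); // (String, String) form

        final boolean analyzed;
        final boolean indexed;
        final boolean stored;

        FieldType(boolean analyzed, boolean indexed, boolean stored) {
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // A field can be searched only if it was indexed.
    static boolean searchable(FieldType type) {
        return type.indexed;
    }

    // A field can be shown with search results only if it was stored.
    static boolean retrievable(FieldType type) {
        return type.stored;
    }

    public static void main(String[] args) {
        for (FieldType type : FieldType.values()) {
            System.out.printf("%-9s searchable=%b retrievable=%b%n",
                    type, searchable(type), retrievable(type));
        }
    }
}
```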
26. Classes for Basic Search Operations
IndexSearcher
- Opens an index in read-only mode
- Offers a number of search methods
- Some of which are implemented in the Searcher class
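// Open the existing index at /tmp/index (false = do not create it)
IndexSearcher is = new IndexSearcher(FSDirectory.getDirectory("/tmp/index", false));
// Match documents whose "contents" field contains the term "lucene"
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);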
27. Classes for Basic Search Operations
Term
- Basic unit for searching
- Consists of a pair of string elements: name of field and value of field
- Term objects are involved in the indexing process
- Term objects can be constructed and used with TermQuery
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
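A term is a (field name, value) pair, and a term query is essentially a lookup of that pair in an inverted index. The following toy index shows the idea in plain Java. It is a sketch, not Lucene's implementation: ToyInvertedIndex and its methods are hypothetical, tokenizing is reduced to lowercasing and splitting, and the query value is looked up exactly as given.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index keyed by "field:value" terms.
// Hypothetical illustration; not how Lucene stores its index.
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Indexes one field of a new document and returns its document id.
    int addDocument(String field, String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>())
                    .add(docId);
        }
        return docId;
    }

    // Returns the ids of documents whose field contains exactly this value.
    // The value is not analyzed here, so callers pass it already lowercased.
    SortedSet<Integer> termQuery(String field, String value) {
        SortedSet<Integer> hits = postings.get(field + ":" + value);
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("contents", "Lucene in Action");
        index.addDocument("contents", "MySQL full-text search");
        index.addDocument("contents", "Indexing MySQL rows with Lucene");
        System.out.println(index.termQuery("contents", "lucene")); // [0, 2]
    }
}
```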
28. Classes for Basic Search Operations
Query
- A number of query subclasses
- BooleanQuery
- PhraseQuery
- PrefixQuery
- PhrasePrefixQuery
- RangeQuery
- FilteredQuery
- SpanQuery
29. Classes for Basic Search Operations
TermQuery
- Most basic type of query supported by Lucene
- Used for matching documents that contain fields with
specific values
Hits
- Simple container of pointers to ranked search results.
- For performance, Hits instances load only a small portion of the documents that match a query from the index, not all of them
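To make the Term and TermQuery idea concrete, here is a toy model in plain Java. It is not how Lucene is implemented and uses no Lucene classes: ToyIndex, its lowercase-and-split "analysis", and the add and search method names are all assumptions made for illustration. Each (field, word) pair plays the role of a Term, and looking one up plays the role of a TermQuery.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a toy inverted index, not Lucene's implementation.
class ToyIndex {
    // Maps "field:word" (a stand-in for a Term) to the ids of documents containing it.
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    private static String term(String field, String text) {
        return field + ":" + text;
    }

    // "Analyze" the value by lowercasing and splitting on whitespace, then index each word.
    void add(int docId, String field, String value) {
        for (String word : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String key = term(field, word);
            List<Integer> ids = postings.get(key);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                postings.put(key, ids);
            }
            if (!ids.contains(docId)) {
                ids.add(docId);
            }
        }
    }

    // Exact lookup of one (field, word) pair; returns matching document ids in insertion order.
    List<Integer> search(String field, String text) {
        List<Integer> ids = postings.get(term(field, text));
        if (ids == null) {
            return Collections.<Integer>emptyList();
        }
        return Collections.unmodifiableList(ids);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "contents", "Lucene is a search library");
        index.add(2, "contents", "MySQL stores the articles");
        index.add(3, "title", "Notes on Lucene");
        // Only document 1 has "lucene" in its "contents" field.
        System.out.println(index.search("contents", "lucene"));
    }
}
```

Note that the lookup is per field: document 3 mentions Lucene only in its title, so a search on the contents field does not return it.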
30. Indexing
Multiple type indexing
- Scalable
- High Performance
- “over 20MB/minute on Pentium M 1.5GHz”
- Incremental indexing and batch indexing have same cost
- Index Size
- Index size is roughly 20-30% of the size of the text indexed
- Compare to MySQL’s FULLTEXT index size
- Cost-effective
- Needs only a 1 MB heap (small RAM requirements)
31. Powerful Searching & Sorting
- Ranked Searching
- Multiple Powerful Query Types
- phrase queries, wildcard queries, proximity queries, range queries and more
- Fielded Searching (e.g., title, author, contents)
- Date-Range Searching
- Multiple Index Searching with Merged Results
- Sort by any field
32. How to Integrate Your Application With Lucene
Install JDK (5 or 6)
Testing Lucene Demo
33. Prerequisites: JDK
Installing JDK
- Download JDK 5 from http://java.sun.com/javase/downloads/index_jdk5.jsp
- or JDK 6 from http://java.sun.com/javase/downloads/index.jsp
- Once downloaded:
- Change Permissions
- [root@srv31 jdk-install]# chmod 755 jdk-1_5_0_09-linux-i586.bin
- Install
- [root@srv31 jdk-install]# ./jdk-1_5_0_09-linux-i586.bin
34. Testing Lucene Demo
Step 2: Testing Lucene Demo
- Set up your environment
- vi /root/.bashrc
- export PATH=/var/www/html/java/jdk1.5.0_09/bin:$PATH
- export CLASSPATH=.:/var/www/html/java/jdk1.5.0_09:/var/www/html/java/jdk1.5.0_09/lib:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-core-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-demos-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar
- Now download the following and place them in /var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/
- Lucene Java
- http://www.apache.org/dyn/closer.cgi/lucene/java/
- XMLRPC Library
- [root@srv31 lib]# wget http://mirror.candidhosting.com/pub/apache/lucene/java/lucene-2.1.0.zip
[root@srv31 lib]# unzip lucene-2.1.0.zip
[root@srv31 lib]# cp -p lucene-2.1.0/lucene-core-2.1.0.jar ../lib/
[root@srv31 lib]# cp -p lucene-2.1.0/lucene-demos-2.1.0.jar ../lib/
[root@srv31 lib]# cp -p /var/www/html/java/jdk1.5.0_06/lib/xmlrpc-3.0a1.jar /var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar
Now source ("dot") the .bashrc file edited above so the new settings take effect:
[root@srv31 lib]# . /root/.bashrc
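Before running the demo it can help to confirm that the CLASSPATH set above actually exposes the Lucene jars. A minimal sketch using only the JDK: the class name ClasspathCheck and the method isOnClasspath are invented here, while org.apache.lucene.index.IndexWriter is Lucene's core indexing class and org.apache.lucene.demo.IndexFiles is the demo class these slides run.

```java
// Illustrative helper (hypothetical name): reports whether classes are visible on the classpath.
class ClasspathCheck {
    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.lucene.index.IndexWriter", // from lucene-core
            "org.apache.lucene.demo.IndexFiles"    // from lucene-demos
        };
        for (String name : names) {
            System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
        }
    }
}
```

Because the CLASSPATH above starts with ".", the file can be compiled and run from the current directory with javac ClasspathCheck.java followed by java ClasspathCheck. A "NOT found" line usually points at a wrong jar path in .bashrc.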
35. Testing Lucene Demo 2
- Believe it or not, we are now ready to test the Lucene Demo.
- Indexing
- I just let it loose on a randomly picked directory to give you an
idea:
[root@srv31 lib]# java org.apache.lucene.demo.IndexFiles /var/www/html/java/jdk1.5.0_09/
adding /var/www/html/java/jdk1.5.0_09/include/jni.h
adding /var/www/html/java/jdk1.5.0_09/include/linux/jawt_md.h
adding /var/www/html/java/jdk1.5.0_09/include/linux/jni_md.h
adding /var/www/html/java/jdk1.5.0_09/include/jvmti.h
adding /var/www/html/java/jdk1.5.0_09/include/jvmdi.h
Optimizing...
157013 total milliseconds
37. Loading data from MySQL
…
String url = "jdbc:mysql://127.0.0.1/odp";
Connection con = DriverManager.getConnection(url, "user", "pass");
Statement Stmt = con.createStatement();
ResultSet RS = Stmt.executeQuery("SELECT * FROM articles");
38. Loading data from MySQL 2
while (RS.next()) {
// System.out.print("\"" + RS.getString(1) + "\"");
try {
final Document doc = new Document();
// create Document
doc.add(Field.Text("title", RS.getString("title")));
doc.add(Field.Text("type", "article"));
doc.add(Field.Text("author",
RS.getString("author")));
doc.add(Field.Text("body", RS.getString("body")));
doc.add(Field.Text("extended",
RS.getString("extended")));
…
39. Loading data from MySQL 3
…
doc.add(Field.Text("tags", RS.getString("tags")));
doc.add(Field.UnIndexed("permalink", RS.getString("permalink") ));
doc.add(Field.UnIndexed("id", RS.getString("id")));
doc.add(Field.UnIndexed("member_id", RS.getString("member_id")));
doc.add(Field.UnIndexed("portal_id", RS.getString("portal_id")));
//doc.add(Field.Text("id", RS.getString("id")));
writer.addDocument(doc);
}
catch (IOException e) { System.err.println("Unable to index article"); }
}
// close connection
40. Searching Data using XML RPC
public static void searchArticles( final String search, final int numberOfResults)
throws Exception
{
final Query query;
Analyzer analyzer = new StandardAnalyzer();
query = QueryParser.parse(search, "title", analyzer);
final ArrayList ids = new ArrayList();
try {
final IndexReader reader = IndexReader.open(INDEX_DIR);
final IndexSearcher searcher = new IndexSearcher(reader);
final Hits hits = searcher.search(query);
for (int i = 0; i != hits.length() && i != numberOfResults; ++i) {
final Document doc = hits.doc(i);
// id field needs to be added //ids.add(new Integer(doc.getField("id").stringValue()));
…
42. Searching Data using XML RPC 3
}
searcher.close();
reader.close();
}
catch (IOException e) {
System.out.println("Error while reading article data from index");
}
}
43. Future of Lucene
Advanced Linguistics Modules that integrate with Lucene
- Support for complex script languages
- Basis Technology’s Rosette® Linguistics Platform
- The same linguistic software that powers multilingual web search on Google, Live.com, Yahoo! and leading enterprise search engines
- “allows Lucene-based applications to index and search text in multiple languages concurrently, including complex script languages such as Arabic, Chinese, Farsi, Japanese and Korean.”
- www.basistech.com/lucene
44. What are the ports of Lucene
Lucene4c - C
CLucene - C++
MUTIS - Delphi
Lucene.Net - a straight C#/.NET port of Lucene by the
Apache Software Foundation, fully compatible with it.
Plucene - Perl
KinoSearch - Perl
PyLucene - Lucene interfaced with a Python front-end
Ferret and RubyLucene - Ruby
Zend Framework (Search) - PHP
Montezuma - Common Lisp
45. Where to get help about Lucene?
http://lucene.apache.org/java/docs/mailinglists.html
IRC