The document discusses building a custom tokenizer for Apache Solr using JFlex to provide flexible tokenization. It describes the motivation for custom tokenizers when dealing with custom metadata like IDs, types, and descriptions. It then provides examples of use cases and demonstrates how to set up a JFlex tokenizer project, define token patterns in a scanner description, and integrate it with Solr.
2. Agenda
• ME & SHI
• JFLEX Tokenizer
• Motivation
• JFlex ?!
• Solr implementation
• Demo
• Q & A
3. Markus Klose - Search Consultant
• Expertise in Solr, Lucene, Elasticsearch, Fast ESP
• Certified Apache Solr Trainer
• Speaker, Blogger, Coder
• Author of "Einführung in Apache Solr"
• @markus_klose
4. SHI GmbH & Co KG
2013
2011
Delivering mission-critical, data-driven solutions for multiple industries.
Partnering with
Partnering with LucidWorks
2000 Embracing Open Source.
1994 Foundation. Development of home-grown information retrieval platform.
2014
5. OUR MISSION
Vendor-independent IT Consulting and Software Engineering company.
Dedicated to delivering next-generation Semantic Search, Big Data and Exploratory Data Analytics solutions.
Using an Enterprise Data Hub approach for 360° data integration.
And helping customers to accelerate (e)Business through better technology adoption and data utilization.
6. Technology Drives Business
CUSTOM TOKENIZER WITH JFLEX
JFlex-based tokenizer: the idea is not new, but great
7. Motivation 1
• In customer projects we very often have to deal with custom "meta" data:
  • IDs
  • Type designation
  • Product description
• How to face that problem? PatternTokenizer?
8. Motivation 2
• Use and combine existing tools to be more flexible
• Configuration over Coding
• JFlex already used in ClassicTokenizer / StandardTokenizer
10. JFlex - The Fast Scanner Generator
• JFlex is a lexical analyzer generator (aka scanner generator)
• Current version 1.5.1
• Download - http://jflex.de/download.html
• Mailing Lists
• BSD-style license
• CLI, API & GUI
11. JFlex - The Fast Scanner Generator
• Berlin Buzzword 26.05.2014
• LETTERS -> "Berlin", "Buzzword"
• LETTERS and SPACE -> "Berlin Buzzword"
• DIGITS -> "26", "05", "2014"
• DIGITS and . -> "26.05.2014"
• LETTERS and SPACE or DIGITS and . -> "Berlin Buzzword", "26.05.2014"
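The rule combinations on this slide can be tried out in plain Java before writing any JFlex. The following is a minimal sketch that uses java.util.regex as a stand-in for a generated scanner. The class name TokenClassDemo, the method tokenize, and the regular expressions are assumptions made for illustration; they are not the patterns from the JFlex file shown in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenClassDemo {

    // Hypothetical token classes mirroring the slide (ASCII only, for brevity).
    static final String LETTERS = "[A-Za-z]+";
    static final String LETTERS_AND_SPACE = "[A-Za-z]+(?: [A-Za-z]+)*";
    static final String DIGITS = "[0-9]+";
    static final String DIGITS_AND_DOT = "[0-9]+(?:\\.[0-9]+)*";

    // Collects every non-overlapping match of the given pattern, left to right.
    static List<String> tokenize(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Berlin Buzzword 26.05.2014";
        System.out.println(tokenize(LETTERS, input));
        System.out.println(tokenize(LETTERS_AND_SPACE, input));
        System.out.println(tokenize(DIGITS, input));
        System.out.println(tokenize(DIGITS_AND_DOT, input));
        // Combining two classes with "or" yields one token per class.
        System.out.println(tokenize(LETTERS_AND_SPACE + "|" + DIGITS_AND_DOT, input));
    }
}
```

One difference to keep in mind: a JFlex scanner picks the longest match among all rules, whereas a regex alternation takes the first alternative that matches. For this input the classes do not overlap, so both approaches produce the same tokens.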
JFlex is a lexical analyzer generator (also known as scanner generator) for Java(tm), written in Java(tm). It is also a rewrite of the very useful tool JLex which was developed by Elliot Berk at Princeton University. As Vern Paxson states for his C/C++ tool flex: They do not share any code though. JFlex is designed to work together with the LALR parser generator CUP by Scott Hudson, and the Java modification of Berkeley Yacc BYacc/J by Bob Jamison. It can also be used together with other parser generators like ANTLR or as a standalone tool.
JFlex has three mailing lists: jflex-announce is low traffic and read-only for announcements of new releases, jflex-users is for help and discussions, and jflex-devel for developer discussions.
Creating Java classes based on a grammar that parses input
Show factory & solrconfig.xml
Show Tokenizer -> incrementToken
Show JFlex File + Compilation
Step 1: text only
Step 2: simple combination
Step 3: complex setup
ISBN
url protocol://subdomain.site.domain/directory
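To make the URL use case concrete, here is a minimal sketch of how a string of the shape protocol://subdomain.site.domain/directory could be broken into its parts. It uses java.util.regex in place of a JFlex scanner; the class name UrlPartsDemo, the method split, the regular expression, and the sample URL are all assumptions for illustration and not taken from the demo itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlPartsDemo {

    // Hypothetical pattern: protocol://subdomain.site.domain/directory
    // Only hosts with exactly three dot-separated labels are accepted.
    static final Pattern URL = Pattern.compile(
            "([a-z][a-z0-9+.-]*)://([^./]+)\\.([^./]+)\\.([^./]+)/(.*)");

    // Returns [protocol, subdomain, site, domain, directory],
    // or an empty list if the input does not have the expected shape.
    static List<String> split(String url) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = URL.matcher(url);
        if (matcher.matches()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                parts.add(matcher.group(i));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("http://www.example.com/docs"));
        System.out.println(split("not a url"));
    }
}
```

In a tokenizer, each of these parts would typically be emitted as its own token so that a search for the site name or the directory alone can match the document.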