Ted Sullivan has over 15 years of experience building search applications with a wide range of technologies. He watched Lucene grow from a small search engine into a major player and now works at Lucidworks. Throughout his career, Ted has advocated techniques such as autophrasing, query autofiltering, and leveraging metadata to improve search quality. He blogs as the "Search Curmudgeon," critiquing new technologies and offering pragmatic advice grounded in his extensive experience.
3. - Ted Sullivan, PhD
“(old Phuddy Duddy)”
“Senior (very much so I’m afraid) Solutions (I hope) Architect (and sometime plumber)”
4. - Ted Sullivan
When is my search app done?
“How do you get there grasshopper? Add semantic intelligence to the engine!”
5. In his own words...
For the past 15 or so years now I have been building search applications, first with Verity K2 for a project with a publishing company, H.W. Wilson, then with most of the vendor products in the search space: Ultraseek, FAST, Autonomy, Endeca, Vivisimo, MarkLogic, and Exalead. I watched Lucene grow and develop from an interesting little search engine into a major force in the search technology business. Before that, I was building collaborative battlefield planning applications for the U.S. Army, and before that I was working on Internet stuff back in the dawn of the Web (well, almost - 1994). I have been programming in Java since 1995, and professionally since 1996 or so. I was learning JavaScript when Netscape was still developing it, but only recently have I begun to truly understand its power! John Resig and Bear Bibeault's book "Secrets of the JavaScript Ninja" is a must-read for anyone who wants to follow this path. Currently, I am struggling up the AngularJS learning curve.
Before my work on the web with my friend Jim Spatz at Spatz Computer Graphics, I published some math games for kids on the original Mac OS, and before that, I did science - auditory neuroscience, to be more precise. I studied the auditory system of 'fly-by-night' critters, bats and owls, first at Washington University in St. Louis, then at Caltech and Princeton. I was pretty good at science but didn't like the writing part as much as I should have. I had much more fun writing code (C, FORTRAN, and PDP-8/11 assembler). Currently, I am enjoying becoming part of the Open Source Revolution working at Lucidworks. Back in 1995 when Linux came out, I had a bet with my boss Jim Spatz about its future - I'm happy to say now that I lost that bet. I would aspire to be an Open Source evangelist, but there are enough of those already. I'll settle for Solr Evangelist.
7. Random Rants from the Search Curmudgeon
• https://lucidworks.com/2015/03/09/random-rants-search-curmudgeon/
• Search vs. Information Access
8. Data Science for Dummies
• https://lucidworks.com/2016/09/06/data-science-for-dummies/
• "A conditional probability is like the probability that you are a moron if you text while driving (pretty high it turns out – and would be a good source of Darwin awards except for the innocent people that also suffer from this lunacy.)"
9. The Twilight of the Vengine Gods (Die Göttervenginedämmerung) or Die Hard with A Vengines!!!
• https://lucidworks.com/2016/10/18/the-twilight-of-the-vengine-gods-die-gottervenginedammerung/
• "The Curmudgeon doesn’t dispense news, he just tells you what information, new or old sucks or what pisses him off and then rants about it."
10. Where did all the
Librarians go?
• https://lucidworks.com/2017/11/21/where-did-
all-the-librarians-go/
• "You’ve probably gotten tired of me by now, that’s
OK because I’m tired of me too."
11. Search Legacy
• Blogs: as Search Curmudgeon and himself
• Lucidworks: heavy duty implementations
• Techniques: autophrasing and query autofiltering
• Presentations: Revolutions and inaugural Haystack
12. Automatic Phrase Tokenization:
Improving Lucene Search Precision
by More Precise Linguistic Analysis
• https://lucidworks.com/2014/07/02/automatic-
phrase-tokenization-improving-lucene-search-
precision-by-more-precise-linguistic-analysis/
• Takeaway: moving from bag of words towards bag
of things
13. Solution for Multi-term Synonyms in
Lucene/Solr Using the Auto
Phrasing TokenFilter
• https://lucidworks.com/2014/07/12/solution-for-
multi-term-synonyms-in-lucenesolr-using-the-auto-
phrasing-tokenfilter/
• LUCENE-2605 & Friends resolved over two years
later
• split on whitespace = false (Solr's sow=false query parameter)
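The core idea behind the autophrasing posts above is to collapse known multi-word phrases into single tokens before indexing, so a phrase like "seat belts" can't partially match unrelated text. Here is a minimal sketch of that greedy longest-match pass in Python; the phrase list and joining convention (underscore) are illustrative, and this is not the actual AutoPhrasingTokenFilter implementation:

```python
# Toy autophrasing pass: known multi-word phrases are replaced with a
# single joined token (whitespace -> '_'), moving from a "bag of words"
# toward a "bag of things".

def autophrase(tokens, phrases, join_char="_"):
    """Greedy longest-match: replace known phrase runs with one token."""
    # Index phrases by their first word for quick lookup.
    by_first = {}
    for p in phrases:
        words = tuple(p.split())
        by_first.setdefault(words[0], []).append(words)
    # Try longer phrases before shorter ones.
    for candidates in by_first.values():
        candidates.sort(key=len, reverse=True)

    out, i = [], 0
    while i < len(tokens):
        matched = False
        for words in by_first.get(tokens[i], []):
            if tuple(tokens[i:i + len(words)]) == words:
                out.append(join_char.join(words))
                i += len(words)
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return out

phrases = ["seat belts", "child safety seat"]
print(autophrase("rear seat belts and child safety seat anchors".split(), phrases))
# → ['rear', 'seat_belts', 'and', 'child_safety_seat', 'anchors']
```

In Solr this transformation runs as a token filter on both the index and query analysis chains, which is why the split-on-whitespace behavior of the query parser matters: the filter must see the whole phrase, not one word at a time.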
15. The Well Tempered Search
Application – Fugue
• https://lucidworks.com/2015/02/03/well-tempered-search-application-fugue/
• autophrasing
• "red sofa" problem
• Takeaway: ahead of its time (evolving into Solr Text Tagger and query
rewriting)
• "seed crystals of knowledge": SME tagging
16. Introducing Query
Autofiltering
• https://lucidworks.com/2015/02/17/introducing-query-autofiltering/
• "autotagging of the incoming query where the knowledge source is the
search index itself"
• we already have the information that we need to "do the right thing";
we just don't use it
• "Another approach that was suggested by Erik Hatcher, is to have a
separate collection that is specialized as a knowledge store and query it to
get the categories with which to autofilter on the content collection."
• The key is that in both cases, we are using the search index itself as a
knowledge source that we can use for intelligent query introspection
and thus powerful inferential search!!
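The mechanics of query autofiltering can be sketched in a few lines: harvest the distinct values of structured fields (here from a toy document list standing in for the index), then scan the incoming query for those values and promote them to filter queries. Field names and data are invented for illustration; the real posts combine this with autophrasing so multi-word values also match:

```python
# Toy query autofilter: the "knowledge source" is the index itself.

def build_value_map(docs, fields):
    """Map each known field value (lowercased) to its field name."""
    value_map = {}
    for doc in docs:
        for f in fields:
            if f in doc:
                value_map[doc[f].lower()] = f
    return value_map

def autofilter(query, value_map):
    """Split a query into filter queries (fq) and residual free text."""
    fqs, residual = [], []
    for term in query.lower().split():
        if term in value_map:
            fqs.append(f'{value_map[term]}:"{term}"')
        else:
            residual.append(term)
    return fqs, " ".join(residual)

docs = [
    {"color": "red", "product_type": "sofa"},
    {"color": "blue", "product_type": "socks"},
]
value_map = build_value_map(docs, ["color", "product_type"])
print(autofilter("red sofa", value_map))
# → (['color:"red"', 'product_type:"sofa"'], '')
```

This is the "red sofa" problem in miniature: instead of hoping both words hit the same document's full-text field, the query is introspected into precise field-level filters.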
17. Thoughts on
“Search vs. Discovery”
• https://lucidworks.com/2015/03/02/thoughts-search-
vs-discovery/
• "findability", facets, aboutness, relatedness
• "However if a document is not appropriately tagged, it
may become invisible..."; Data quality really matters here!
• Auto classification and manual subject matter expert
tagging
• Visualization, search driven analytics
18. Query Autofiltering Revisited
– Let's be more precise!!!
• https://lucidworks.com/2015/05/13/query-autofiltering-
revisited-can-precise/
• "blue red lion socks"
19. Query Autofiltering Extended –
On Language and Logic in Search
• https://lucidworks.com/2015/06/06/query-
autofiltering-extended-language-logic-search/
• If you've got metadata, use (autofilter) it. If you've
got known multi-word phrases, use them.
• Language usage understanding of AND vs. OR
20. Focusing on Search Quality at
Lucene/Solr Revolution 2015
• https://lucidworks.com/2015/10/19/focusing-on-
search-quality-at-lucenesolr-revolution-2015/
• "Again, the “knowledge base” ... can be the Solr/
Lucene index itself!"
• “On-The-Fly Predictive Analytics” – as we say in
the search quality biz – its ALL about context!
21. Query Autofiltering IV:
A Novel Approach to NLP
• https://lucidworks.com/2015/11/19/query-
autofiltering-chapter-4-a-novel-approach-to-
natural-language-processing/
• Verbs
• Bob Dylan cover tunes
• Query Introspection: inferring user intent
• POS mapped to query fields
22. Pivoting to the Query: Using Pivot
Facets to build a Multi-Field
Suggester
• https://lucidworks.com/2016/08/12/pivoting-to-the-
query-using-pivot-facets-to-build-a-multi-field-suggester/
• Pivot facets: "Think of it as a way of generating a facet
value “taxonomy” – on the fly."
• Facet Phrases
• Once we commit to building a special Solr collection (also
known as a ‘sidecar’ collection) just for typeahead, there
are other powerful search features that we now have to
work with. One of them is contextual metadata. [!!!]
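The "taxonomy on the fly" idea behind pivot facets can be sketched by nesting one field's value counts inside another's, then driving typeahead off the resulting pairs. This is a toy stand-in for Solr's facet.pivot and a sidecar suggester collection; field names and documents are invented:

```python
# Sketch of the pivot-facet / multi-field suggester idea.
from collections import Counter, defaultdict

def pivot_facet(docs, outer_field, inner_field):
    """Like facet.pivot=outer,inner: value counts nested per outer value."""
    pivot = defaultdict(Counter)
    for doc in docs:
        pivot[doc[outer_field]][doc[inner_field]] += 1
    return pivot

def suggest(prefix, pivot):
    """Typeahead: suggest 'outer inner' phrases matching the prefix."""
    hits = []
    for outer, inners in pivot.items():
        for inner, count in inners.items():
            phrase = f"{outer} {inner}"
            if phrase.startswith(prefix):
                hits.append((phrase, count))
    return sorted(hits, key=lambda h: -h[1])

docs = [
    {"brand": "acme", "category": "sofas"},
    {"brand": "acme", "category": "sofas"},
    {"brand": "acme", "category": "chairs"},
]
pivot = pivot_facet(docs, "brand", "category")
print(suggest("acme s", pivot))
# → [('acme sofas', 2)]
```

Because the suggestions carry their field context ("acme" is a brand, "sofas" a category), selecting one can drive precise filter queries rather than a plain text search, which is the contextual-metadata point the slide is making.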
23. Building a Subject Classifier using
Automatically Discovered Keyword
Clusters, Part I
• https://lucidworks.com/2017/02/28/building-a-
subject-classifier-using-automatically-discovered-
keyword-clusters-part-i/
• subject classifier that uses automatically discovered
key term “clusters” that can then be used to classify
documents
• autophrasing + /terms....
• relatedness(...) scoring between the discovered term clusters and documents
24. Why Facets are Even More
Fascinating than you Might Have
Thought
• https://lucidworks.com/2017/09/22/why-facets-are-even-more-
fascinating-than-you-might-have-thought/
• Context matters!
• Spatial metaphor: N-Dimensional hyperspace
• "Paul McCartney" => "John Lennon"
• contextual usage of first result to boost second
• Facets and UI
• This is “surfin’ the meta-informational universe” that is your Solr collection.
• The Facet Theorem
25. When Worlds Collide – Artificial
Intelligence Meets Search
• https://lucidworks.com/2018/04/30/when-worlds-collide-artificial-
intelligence-meets-search/
• The Search Loop: questions, answers, then more questions
• Inferring User Intent: NLP, POS, head-tail analysis, directed pattern-
based
• Information Spaces: conceptually near
• Knowledge Spaces and Semantic Reference Frames
• Word Embedded Vectors
• Knowledge Graphs: taxonomies and ontologies
27. “the Curmudgeon doesn’t dispense
news, he just tells you what
information, new or old sucks or
what pisses him off and then rants
about it. ”
28. “You may be thinking – "Who’s this
Search Curmudgeon guy? He’s a real
jerk". No argument there.”
29. “hey IT guys – Buy More Memory for
chrissake! Thanks to Moore’s Law it’s
pretty cheap now so don’t be such a
tight-ass”
30. “And the role of DBA will likely be
staffed by curmudgeons like me – so
be nice to them – they can save your
ass. We’ve seen our share of techno
cliff jumpers – it doesn’t end well.”
31. “what we old guys know is that some
of the hot things that you whiz kids
are doing now were done before, i.e.,
`back in the day`. ”
32. “You are not as smart as you think
you are kiddies – dual quad core, 3
GHz CPUs and 512 GB of RAM can
hide lots of coding sins. ”
33. “When I was your age sonny, we had
to walk three miles through snow to
submit our box of punch cards … talk
about crappy BAUD rates!)”
34. “....because in my opinion (notice that
I didn’t say ‘humble’ because that is
one thing that the Curmudgeon is
definitely NOT)...”
35. “I’m a humanist believe it or not – I
like humans even if they don’t like
me sometimes – I EARNED my
nickname of ‘curmudgeon’ you
know.”
36. “proper care and feeding of these
"analysis chains" can make you
some serious money – especially you
eCommerce guys”
37. “You’ve probably gotten tired of me
by now, that’s OK because I’m tired
of me too. Believe me, you don’t have
to live with me – I do.”
38. Ted on...
• IDOL: "should really be spelled IDLE"
• Fast vs. Solr: "One is named Fast, the other actually is fast"
• Endeca: "what took several hours in Endeca indexed in
about 10 minutes in Solr"
• elidedsearch: "The name of the company is like the material
that is used to hold up my Jockey Shorts (hint, hint)", Fruit-
of-the-Loom Finders, Tightie Whitie Quest, RubberBand
Finders, Brain Splitters, BungeeSeek
40. Ted's Big Adventure
• Semantics: bag of things, not bag of words
• synonyms, autophrasing, lemmatization
• "in text search – semantics matter"
• Linguistics: noun phrases, POS, NLP
• Facets
• autofiltering
• The Facet Theorem
• Relatedness
• Knowledge Space, Semantic Reference Frames
• Context matters
41. The Facet Theorem
• Lemma 1: Similar things tend to occur in similar
contexts
• Lemma 2: Facets are a tool for exploring meta-
informational contexts
• It therefore follows that:
• Theorem: Facets can be used to find similar things.
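The theorem can be demonstrated with a toy foreground/background comparison: take the documents matching a seed value, facet on a field over just those documents, and rank the other values by how concentrated they are in the seed's context (a crude stand-in for relatedness scoring). The field name and data are invented:

```python
# Toy Facet Theorem demo: facets over a seed's result set surface
# similar things, because similar things occur in similar contexts.
from collections import Counter

def related_values(docs, field, seed):
    """Rank co-occurring field values by foreground/background ratio."""
    fg_docs = [d for d in docs if seed in d[field]]
    fg = Counter(v for d in fg_docs for v in d[field] if v != seed)
    bg = Counter(v for d in docs for v in d[field] if v != seed)
    # Score: share of a value's occurrences falling in the seed's context.
    scores = {v: fg[v] / bg[v] for v in fg}
    return sorted(scores, key=scores.get, reverse=True)

docs = [
    {"people": ["paul mccartney", "john lennon"]},
    {"people": ["paul mccartney", "john lennon"]},
    {"people": ["paul mccartney", "wings"]},
    {"people": ["john lennon", "yoko ono"]},
    {"people": ["wings"]},
    {"people": ["elvis presley"]},
]
print(related_values(docs, "people", "paul mccartney"))
# → ['john lennon', 'wings']
```

This is the "Paul McCartney" => "John Lennon" example in miniature: no ontology is consulted, only the meta-informational contexts that the facets expose.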