Apache Lucene's next major release, 4.0, will introduce lots of flexibility into indexing, but also fundamental changes to the well-known APIs: it features a new, consistent four-dimensional iteration API on top of a low-level, pluggable codec API, giving applications full control over the postings data.
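The four dimensions of that iteration are fields, terms, documents, and positions. A toy Python sketch of enumerating a hand-built inverted index in that order (this is a conceptual illustration, not Lucene's actual API):

```python
# Toy inverted index illustrating Lucene 4's four-level enumeration:
# fields -> terms -> documents -> positions (not Lucene's actual API).
index = {
    "title": {                       # field
        "lucene": {1: [0], 3: [2]},  # term -> {doc_id: [positions]}
        "search": {1: [1]},
    },
    "body": {
        "codec": {2: [5, 9]},
    },
}

def enumerate_postings(index):
    """Walk the index in the order a flexible postings API exposes it."""
    for field in sorted(index):
        for term in sorted(index[field]):
            for doc_id in sorted(index[field][term]):
                for pos in index[field][term][doc_id]:
                    yield field, term, doc_id, pos

postings = list(enumerate_postings(index))
```

A pluggable codec decides how each of these four levels is encoded on disk, which is why the same iteration contract can sit on top of very different storage formats.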
Zoe Slattery's slides from PHPNW08:
The ability to store large quantities of local data means that many applications require some form of text search and retrieval facility. From the point of view of the application developer there are a number of choices to make; the first is whether to use a complete packaged solution or whether to use one of the available libraries to build a custom information retrieval (IR) solution. In this talk I’ll look at the options for PHP programmers who choose to embed IR facilities within their applications.
For Java programmers there is clearly a good range of options for text retrieval libraries, but options for PHP programmers are more limited. At first sight, for a PHP programmer wishing to embed indexing and search facilities in their application, the choice seems obvious - the PHP implementation of Lucene (Zend Search Lucene). There is no requirement to support another language, the code is PHP and therefore easy for PHP programmers to work with, and the license is commercially friendly. However, whilst ease of integration and support are key factors in the choice of technology, performance can also be important; the performance of the PHP implementation of Lucene is poor compared to the Java implementation.
In this talk I’ll explain the differences in performance between the PHP implementation of Lucene and the Java implementation, and examine the other options available to PHP programmers for whom performance is a critical factor.
Portable Lucene Index Format & Applications - Andrzej Bialecki
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
This talk will present a design and implementation of a flexible, version-independent serialization format for Lucene indexes and its applications in index upgrades / downgrades, in distributed document analysis, in distributed indexing, and in integration with external indexing pipelines. This format enables submitting pre-analyzed documents to Lucene/Solr, and transferring parts of indexes between nodes in a distributed setup.
Presented by Adrien Grand, Software Engineer, Elasticsearch
Although people usually come to Lucene and related solutions in order to make data searchable, they often realize that it can do much more for them. Indeed, its ability to handle high loads of complex queries makes Lucene a perfect fit for analytics applications and, for some use-cases, even a credible replacement for a primary data-store. It is important to understand the design decisions behind Lucene in order to better understand the problems it can solve and the problems it cannot solve. This talk will explain the design decisions behind Lucene, give insights into how Lucene stores data on disk and how it differs from traditional databases. Finally, there will be highlights of recent and future changes in Lucene index file formats.
Munching & crunching - Lucene index post-processing
Lucene EuroCon 10 presentation on index post-processing (splitting, merging, sorting, pruning), tiered search, bitwise search, and a few slides on MapReduce indexing models (I ran out of time to show them, but they are there...)
Finite-State Queries in Lucene:
* Background, improvement/evolution of MultiTermQuery API in 2.9 and Flex
* Implementing existing Lucene queries with NFA/DFA for better performance: Wildcard, Regex, Fuzzy
* How you can use this Query programmatically to improve relevance (I'll use an English test collection/English examples)
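To make the automaton idea concrete: a wildcard query can be compiled into a DFA and intersected with the sorted term dictionary. A simplified Python sketch, using fnmatch as a stand-in for the compiled automaton (a real DFA intersection seeks ahead through the sorted terms rather than testing every one):

```python
import fnmatch

def wildcard_terms(term_dict, pattern):
    """Return the dictionary terms accepted by the wildcard pattern.

    A real automaton intersection can skip ahead in the sorted term
    list; this sketch simply tests each term against the pattern.
    """
    return [t for t in sorted(term_dict) if fnmatch.fnmatchcase(t, pattern)]

terms = {"search", "seared", "series", "solr", "lucene"}
matches = wildcard_terms(terms, "se*r*")  # -> ["search", "seared", "series"]
```

The performance win described in the talk comes precisely from avoiding this linear scan: the automaton tells the terms enumeration the next possible matching term, so whole runs of the dictionary are skipped.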
Quick overview of other Lucene features in development, such as:
* Flexible Indexing
* "More-Flexible" Scoring: challenges/supporting BM25, more vector-space models, field-specific scoring, etc.
* Improvements to analysis
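As a reference point for the scoring work mentioned above, the textbook BM25 term score can be sketched in a few lines of Python (the k1=1.2, b=0.75 defaults are common conventions from the literature, not necessarily Lucene's shipped values):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """One term's BM25 contribution to a document's score (textbook form).

    tf: term frequency in the document; df: number of documents
    containing the term; doc_len is normalized against the average.
    """
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

score = bm25_term_score(tf=3, doc_len=100, avg_doc_len=120, df=10, num_docs=1000)
```

Unlike the classic vector-space TF-IDF, BM25's term-frequency component saturates (controlled by k1), which is one reason supporting it cleanly requires the more flexible scoring hooks the talk discusses.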
Bonus:
* Lucene / Solr merger explanation and future plans
About the presenter:
Robert Muir is a super-active Lucene developer. He works as a software developer for Abraxas Corporation. Robert received his MS in Computer Science from Johns Hopkins and BS in CS from Radford University. For the last few years Robert has been working on foreign language NLP problems - "I really enjoy working with Lucene, as it's always receptive to better int'l/language support, even though everyone seems to be a performance freak... such a weird combination!"
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
This talk describes how you can practically apply some of Lucene 4's new features (such as flexible indexing, scoring improvements, column-stride fields) to improve your search application.
The talk will give a brief description of these new features and some example use-cases: practical scenarios you can try yourself with the new features now available in Lucene 4. We'll cover how you can configure Solr to:
Set up the schema to use the Pulsing or Memory codec for a primary key field
Avoid a separate spellcheck index by controlling character-level swaps from the query processor
Sort with a different locale
Apply per-field similarity configurations, such as a non-vector-space algorithm
Presented by Fotolog. Lucene is a powerful, high-performance, full-featured text search engine library written entirely in Java, providing technology suitable for applications of all sizes that require full-text search in heterogeneous environments.
In this presentation, Frank Mash shows you how you can use Lucene with MySQL to offer powerful searching capabilities to your stakeholders. The presentation will cover installation, usage, and optimization of Lucene, and how to interface a Ruby on Rails application with Lucene using a custom Java server. This session is highly recommended for those looking to add full-text, cross-platform, database-independent search capability to their application.
Faceted search is a powerful technique to let users easily navigate the search results. It can also be used to develop rich user interfaces, which give an analyst quick insights about the document space. In this session I will introduce the Facets module, how to use it, under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.
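At its core, faceting means counting field values across the documents that matched a query. A minimal Python sketch of that idea (a hand-rolled illustration, not the Lucene Facets module's API):

```python
from collections import Counter

def facet_counts(matching_docs, facet_field):
    """Count facet values over the documents that matched a query."""
    counts = Counter()
    for doc in matching_docs:
        for value in doc.get(facet_field, []):  # a doc may have several values
            counts[value] += 1
    return counts.most_common()

docs = [
    {"title": "a", "category": ["books", "tech"]},
    {"title": "b", "category": ["tech"]},
    {"title": "c", "category": ["music"]},
]
top = facet_counts(docs, "category")  # -> [("tech", 2), ("books", 1), ("music", 1)]
```

The "optimizations" the session covers mostly concern doing this counting over millions of documents per query, where the naive per-document loop above becomes the bottleneck.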
ElasticSearch in Production: lessons learned
With Proquest Udini, we have created the world's largest online article store, and aim to be the center for researchers all over the world. We connect to a 700M Solr cluster for search, but have recently also implemented a search component with ElasticSearch. We will discuss how we did this, and how we want to use the 30M index for scientific citation recognition. We will highlight lessons learned in integrating ElasticSearch in our virtualized EC2 environments, and challenges in aligning with our continuous deployment processes.
Search is everywhere, and therefore so is Apache Lucene. While providing amazing out-of-the-box defaults, there are enough projects weird enough to require custom search scoring and ranking. In this talk, I’ll walk through how to use Lucene to implement your custom scoring and search ranking. We’ll see how you can achieve both amazing power (and responsibility) over your search results. We’ll see the flexibility of Lucene’s data structures and explore the pros/cons of custom Lucene scoring vs other methods of improving search relevancy.
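One common form of custom ranking is blending the text relevance score with a domain signal such as document recency. A hedged Python sketch of that pattern (the weight and half-life values are made-up illustrative choices, not recommendations):

```python
import math

def blended_score(text_score, age_days, half_life_days=30.0, weight=0.3):
    """Blend a relevance score with an exponential recency decay.

    The recency term halves every `half_life_days`; `weight` controls
    how much freshness matters relative to text relevance.
    """
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return (1 - weight) * text_score + weight * recency

fresh = blended_score(text_score=0.8, age_days=0)
stale = blended_score(text_score=0.8, age_days=365)
```

The pros/cons trade-off the talk mentions shows up here directly: baking a signal like this into scoring gives fine control, at the cost of scores that are harder to explain and tune than the defaults.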
Intro talk for UNC School of Information and Library Science. Covers basics of Lucene and Solr as well as info on Lucene/Solr jobs, opportunities, etc.
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14 - Marty Kaszubowski
This is a presentation given to the Hampton Roads Economic Development Alliance (HREDA) on 9-25-14. It describes the vision and goals for the new Old Dominion University (ODU) Center for Enterprise Innovation (CEI).
Apache Solr is the popular, blazing fast open source enterprise search platform; it uses Lucene as its core search engine. Solr’s major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and complex queries. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
Solr offers a rich, flexible set of features for search. To understand the extent of this flexibility, it's helpful to begin with an overview of the steps and components involved in a Solr search. http://www.lucidimagination.com/developers/whitepapers/whats-new-solr-14
Maybe you’ve heard pundits say that in the next year, humans will create more data than in all of human history. The problem with those predictions, Stephen O’Grady of Redmonk said in his keynote to Day 2 of Lucene Revolution, is that they’re true.
Blocks are a cool concept and very much needed for performance improvements and responsiveness. GCD helps run blocks effortlessly by scheduling them on a desired queue, with a chosen priority, and lots more.
A short presentation at a CSC internal workshop on the prospects of using container technologies, especially Docker, in the context of High Performance Computing (HPC).
In this keynote from Deltaware Data Solutions' 2016 Emerging Technology Summit, Stephen Foskett gives essential background on the emerging trend of containerization of enterprise applications. What are containers and how will they affect enterprise IT? Why is Docker so important? Foskett addresses both the technical and architectural questions, discussing which applications will be containerized, the benefits and costs, and what it means for IT operations.
Containers in depth – Understanding how containers work to better work with c...
Presented by: Brent Laster
Presented at All Things Open 2021
Raleigh, NC, USA
Raleigh Convention Center
Abstract: Containers are all the rage these days – from Docker to Kubernetes and everywhere in-between. But to get the most out of them it can be helpful to understand how containers are constructed, how they depend and interact with the operating system, and what the differences and interactions are between layers, images, and containers. Join R&D Director, Brent Laster as he does a quick, visual overview of how containers work and how applications such as Docker work with them.
Topics to be discussed include:
• What containers are and the benefits they provide
• How containers are constructed
• The differences between layers, images, and containers
• What immutability really means
• The core Linux functionalities that containers are based on
• How containers reuse code
• The differences between containers and VMs
• What Docker really does
• The Open Container Initiative
• A good analogy for understanding all of this
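The layer/image distinction in the topics above can be modeled as a stack of dictionaries where later layers shadow earlier ones. A toy Python sketch (the paths and the None-as-whiteout convention are illustrative, not the OCI on-disk format):

```python
def flatten_layers(layers):
    """Union-mount toy: later layers shadow earlier ones; None marks a
    'whiteout', i.e. a file deleted by an upper layer."""
    view = {}
    for layer in layers:                 # bottom-to-top
        for path, content in layer.items():
            if content is None:
                view.pop(path, None)     # file removed in this layer
            else:
                view[path] = content     # file added or overridden
    return view

base = {"/etc/os-release": "debian", "/bin/sh": "shell-v1"}
app = {"/app/run.py": "print('hi')", "/bin/sh": "shell-v2"}
cleanup = {"/etc/os-release": None}
image = flatten_layers([base, app, cleanup])
```

This also shows why layers enable reuse: the `base` dictionary is unchanged by the layers above it, so many images can share it.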
Latest (storage IO) patterns for cloud-native applications - OpenEBS
Applying microservice patterns to storage, giving each workload its own Container Attached Storage (CAS) system. This puts the DevOps persona in full control of the storage requirements and brings data agility to k8s persistent workloads. We will go over the concept and the implementation of CAS, as well as its orchestration.
The Fn project is a container-native Apache 2.0 licensed serverless platform that you can run anywhere – on any cloud or on-premise. It’s easy to use, supports every programming language, and is extensible and performant. This YourStory-Oracle Developer Meetup covers various design aspects of Serverless for polyglot programming, implementation of the Saga pattern, etc. It also covers monitoring of the Fn project using Prometheus and Grafana.
ASP.NET Core Quick Start covering Configuration, Logging, and .NET Framework versus .NET Core. Source code for the demos are on GitHub: https://github.com/ErikNoren/AspNetCoreDemos
Slides from my presentation at .NET Southeast, Brighton. More information at https://blogs.msdn.microsoft.com/webdev/2017/08/14/announcing-asp-net-core-2-0/ and https://blogs.msdn.microsoft.com/webdev/2017/08/25/asp-net-core-2-0-features-1/
Solr 4.0 dramatically improves scalability, performance, and flexibility. An overhauled Lucene underneath sports near real-time (NRT) capabilities allowing indexed documents to be rapidly visible and searchable. Lucene’s improvements also include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage. These Lucene improvements automatically make Solr much better, and Solr magnifies these advances with “SolrCloud.” SolrCloud enables highly available and fault tolerant clusters for large scale distributed indexing and searching. There are many other changes that will be surveyed as well. This talk will cover these improvements in detail, comparing and contrasting to previous versions of Solr.
TechTalk: Connext DDS 5.2 - Faster and Easier Development of Industrial Internet Systems and Applications
Watch on-demand: https://youtu.be/j1G0MHC0Vwc
Couchbase Connect 2014: Lucidworks CEO Will Hayes takes you on a fantastic voyage through the hope and the hype of big data and why the future is search-centric.
LucidWorks SiLK is an open source stack that combines Lucene/Solr with best-in-class open source data ingestion and analytics tools such as Flume, LogStash and Kibana. This webinar will explore the features of SiLK, and provide attendees with valuable information on how they can benefit from the following:
- A powerful UI to analyze time series data stored in Lucene/Solr
- Creating and sharing visualizations, dashboards and reports
- Discovery and analysis of data coming from servers, applications, devices and more
- Exploration of click, geospatial and social data in ways previously unimaginable
LucidWorks App for Splunk Enterprise is the first of its kind, specifically designed to allow companies to analyze and manage the health and availability of their Solr deployments in Splunk software. The solution integrates multi-structured data indexed by Solr directly into Splunk® Enterprise, giving system administrators the ability to look at the intersection of documents, customer records or other unstructured data sources as they relate to machine data. This enables companies to optimize their Solr applications, glean insights from search and usage patterns and spot security concerns to improve end user experiences and derive more business value from data-driven applications.
This webinar will explore the features of the App, and provide attendees with valuable information on the following key components:
Solr Monitor: Monitor the health, availability, and utilization of LucidWorks and/or Solr deployments with pre-defined data inputs, dashboards and reports
Search Analytics: Perform user behavior and click-stream analysis with pre-built search analytics reports and fields
NoSQL Lookups: Use Splunk’s lookup facility to enrich your Splunk reports with data of any structure using Solr’s fully indexed and searchable NoSQL datastore
Search Time Joins: Join Splunk data with human generated and other unstructured data sources stored in Solr at search time for developing data-driven applications
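A search-time join like the one described can be pictured as a hash join keyed on a shared field: build a lookup from one side, then enrich matching records from the other. A conceptual Python sketch (the field names are made up, and this is not Splunk's or Solr's API):

```python
def search_time_join(events, documents, key):
    """Hash join: enrich machine events with document fields sharing `key`."""
    docs_by_key = {doc[key]: doc for doc in documents}  # build side
    return [
        {**event, **docs_by_key[event[key]]}            # merge matching fields
        for event in events                             # probe side
        if event[key] in docs_by_key
    ]

events = [{"user_id": "u1", "status": 500}, {"user_id": "u9", "status": 200}]
docs = [{"user_id": "u1", "plan": "enterprise"}]
joined = search_time_join(events, docs, "user_id")
```

Doing this at search time, rather than denormalizing at index time, is what lets machine data and unstructured documents stay in their separate stores yet still be analyzed together.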
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
A tale of scale & speed: How the US Navy is enabling software delivery from l...
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
2. My Background
• I am a committer and PMC member of Apache Lucene and
Solr. My main focus is the development of Lucene Java.
• Implemented fast numerical search and maintain the new
attribute-based text analysis API. Well known as the Generics
and Sophisticated Backwards Compatibility Policeman.
• Working as a consultant and software architect for SD
DataSolutions GmbH in Bremen, Germany. The main task is
maintaining PANGAEA (Publishing Network for Geoscientific
& Environmental Data), where I implemented the portal's
geo-spatial retrieval functions with Lucene Java.
• Talks about Lucene at various international conferences like
ApacheCon EU/US, Lucene Revolution, Lucene Eurocon,
Berlin Buzzwords and various local meetups.
4. Lucene up to version 3
• Lucene started > 10 years ago
• Lucene's vInt format (around since Lucene 1.0) is old and
not as friendly to modern CPUs and their optimizers as
newer compression algorithms
• It's hard to add additional statistics for scoring to the
index
• IR researchers don't use Lucene to try out new algorithms
• Small changes to the index format are often huge patches
covering tons of files
• Example from the days of Lucene 2.4: the final omit-TFaP patch
of LUCENE-1340 is ~70 KiB, covering ~25 files
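To make the vInt concern concrete, here is a minimal, self-contained sketch of the classic vInt scheme (7 payload bits per byte, high bit as continuation flag); the data-dependent branch on every byte is what modern CPU optimizers handle poorly. This is an illustration of the idea only, not Lucene's actual DataOutput/DataInput code:

```java
import java.util.ArrayList;
import java.util.List;

public class VIntDemo {
    // Encode one int as a variable-length byte sequence:
    // low 7 bits per byte, high bit set while more bytes follow.
    static List<Integer> encode(int value) {
        List<Integer> bytes = new ArrayList<>();
        while ((value & ~0x7F) != 0) {
            bytes.add((value & 0x7F) | 0x80); // continuation bit set
            value >>>= 7;
        }
        bytes.add(value); // final byte, continuation bit clear
        return bytes;
    }

    // Decode: the branch on the continuation bit of every single byte
    // is the part that defeats pipelining/vectorization.
    static int decode(List<Integer> bytes) {
        int value = 0, shift = 0;
        for (int b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) break; // last byte reached
        }
        return value;
    }
}
```

Values below 128 take one byte, so small numbers (e.g. doc deltas of frequent terms) stay compact, at the price of branchy decoding.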
5. Why Flexible Indexing?
• Many new compression approaches are ready to
be implemented for Lucene
• Separate index encoding from the terms/postings
enumeration API
• Make innovations to the postings format easier
The goal: make Lucene extensible even at
the lowest level
7. Quick Overview
• Will be Apache Lucene ≥ 4.0 only!
• Allows you to
• store new information in the index
• change the way existing information is stored
• Under heavy development
• API is almost stable, but may break suddenly
• lots of classes are still being refactored
• Replaces a lot of existing classes and
interfaces → Lucene 4.0 will not be backwards
compatible (API-wise)
8. Quick Overview
• Pre-4.0 indexes are still readable, but
< 3.0 is no longer supported
• An index upgrade tool will be available as of
Lucene 3.2
• Supports in-place upgrade of pre-3.0 indexes →
two-step migration to 4.0 (LUCENE-3082)
11. Flex - Enum API Properties (1)
• Replaces TermEnum / TermDocs /
TermPositions
• Unified iterator-like behavior: no more strange
do…while vs. while mix
• Decoupling of term from field
• IndexReader returns an enum of fields; each field has a separate
TermsEnum
• No interning of field names needed anymore (will be removed soon)
• Improved RAM efficiency:
• efficient re-use of byte buffers via the BytesRef class
12. Flex - Enum API Properties (2)
• Terms are arbitrary byte slices, no 16-bit UTF-16 chars
anymore
• Terms now use byte[] instead of char[]
• compact representation of numeric terms
• Can be any encoding
• Default is UTF-8 → changes the sort order for supplementary
characters
• Compact representation with a low memory footprint for western
languages
• Support for e.g. BOCU-1 (LUCENE-1799) is possible, but some
queries rely on UTF-8 (especially MultiTermQuery)
• The indexer consumes a BytesRefAttribute, which
converts the good old char[] to UTF-8
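The sort-order change can be demonstrated without Lucene: comparing UTF-16 code units (Java's String.compareTo) puts surrogate pairs, and therefore supplementary characters, before some BMP characters, while comparing UTF-8 bytes yields code point order. A small self-contained check (not Lucene code):

```java
import java.nio.charset.StandardCharsets;

public class SortOrderDemo {
    // Compare two strings by their UTF-8 byte sequences; this equals
    // Unicode code point order, unlike String.compareTo (UTF-16 units).
    static int compareUtf8(String a, String b) {
        byte[] ba = a.getBytes(StandardCharsets.UTF_8);
        byte[] bb = b.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(ba.length, bb.length);
        for (int i = 0; i < len; i++) {
            int diff = (ba[i] & 0xFF) - (bb[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return ba.length - bb.length;
    }
}
```

For U+FFFF (the highest BMP code point) vs. U+10000 (the surrogate pair \uD800\uDC00), String.compareTo sorts the supplementary character first, while the UTF-8 byte comparison sorts it last, in true code point order.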
13. Flex - Enum API Properties (3)
• All flex enums make use of AttributeSource
• Custom attribute deserialization (planned)
• E.g., BoostAttribute for FuzzyQuery
• TermsEnum supports seeking (enables fast
automaton queries like FuzzyQuery)
• DocsEnum / DocsAndPositionsEnum now
support a separate skipDocs property: custom
control over excluded documents (the default
implementation from IndexReader returns deleted
docs)
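The skipDocs idea can be sketched independently of Lucene: the enum walks a postings list, and a caller-supplied bit set marks documents to exclude, deleted docs being just the default choice. The Bits interface and iterate helper below are hypothetical stand-ins for illustration, not the real DocsEnum contract:

```java
import java.util.ArrayList;
import java.util.List;

public class SkipDocsDemo {
    // Hypothetical stand-in for a per-document bit set.
    interface Bits { boolean get(int index); }

    // Walk a postings list (sorted doc IDs), skipping every doc whose
    // bit is set in skipDocs; passing null means "exclude nothing".
    static List<Integer> iterate(int[] postings, Bits skipDocs) {
        List<Integer> visible = new ArrayList<>();
        for (int doc : postings) {
            if (skipDocs != null && skipDocs.get(doc)) continue;
            visible.add(doc);
        }
        return visible;
    }
}
```

Because the bit set is an argument rather than hard-wired to deletions, callers can substitute any exclusion set, e.g. a filter, without touching the enum.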
14. FieldCache
• FieldCache consumes the flex APIs
• The terms (previously String) field cache is more
RAM-efficient, with a low GC load
• Used with SortField.STRING
• Shared byte[] blocks instead of separate String
instances
• Terms remain as byte[] in a few big byte[] slices
• PackedInts for ords & addresses (LUCENE-1990)
• RAM reduction ~40-60%
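The shared-block layout can be sketched in a few lines: all terms live in one big byte[], and an address array maps ord → offset, so the GC sees a couple of arrays instead of millions of String objects. A simplified illustration (the real FieldCache additionally compresses the address array with PackedInts):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class TermBlockDemo {
    final byte[] block;    // all term bytes concatenated into one shared slice
    final int[] address;   // address[ord] = start of term; address[ord+1] = end

    TermBlockDemo(String[] sortedTerms) {
        address = new int[sortedTerms.length + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            address[ord] = out.size();
            byte[] b = sortedTerms[ord].getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        address[sortedTerms.length] = out.size();
        block = out.toByteArray();
    }

    // Materialize a term only when actually needed, e.g. for display.
    String term(int ord) {
        return new String(block, address[ord], address[ord + 1] - address[ord],
                          StandardCharsets.UTF_8);
    }
}
```

Sorting then only shuffles small ord integers; the term bytes themselves never move.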
16. Extending Flex with Codecs
• A Codec represents the primary flex-API
extension point
• Directly passed to SegmentReader to
decode the index format
• Provides implementations of the enum
classes to SegmentReader
• Provides writers for index files to
IndexWriter
18. Components of a Codec
• A Codec provides read/write for one segment
• Unique name (String)
• FieldsConsumer (for writing)
• FieldsProducer is the 4D enum API + close
• CodecProvider creates Codec instances
• Passed to IndexWriter/IndexReader
• You can override merging
• Reusable building blocks
• Terms dict + index, postings
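As a rough mental model, the pieces relate like this. The classes below are a simplified, self-contained mirror of the structure described above, not the actual Lucene signatures (the real ones take SegmentWriteState/SegmentReadState arguments and live under o.a.l.index.codecs):

```java
import java.util.HashMap;
import java.util.Map;

public class CodecDemo {
    // One codec = read + write support for a single segment's postings data.
    abstract static class Codec {
        final String name;  // unique name, recorded per segment
        Codec(String name) { this.name = name; }
        abstract String fieldsConsumer();  // writing side (stand-in)
        abstract String fieldsProducer();  // reading side: 4D enum API (stand-in)
    }

    // CodecProvider resolves the codec name stored in a segment to an
    // instance; it is what gets handed to IndexWriter/IndexReader.
    static class CodecProvider {
        private final Map<String, Codec> codecs = new HashMap<>();
        void register(Codec codec) { codecs.put(codec.name, codec); }
        Codec lookup(String name) { return codecs.get(name); }
    }

    // A trivial concrete codec, for illustration only.
    static class DemoCodec extends Codec {
        DemoCodec() { super("Demo"); }
        String fieldsConsumer() { return "writer"; }
        String fieldsProducer() { return "producer"; }
    }
}
```

Storing the codec name per segment is what lets old segments keep being read by their original codec while new segments use a different one.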
19. Built-In Codecs: PreFlex
• Pre-flex indexes will be "read-only" in
Lucene 4.0
• all additions to indexes will be done with
CodecProvider's default codec
Warning: PreFlexCodec needs to
reorder terms in the term index on-the-fly:
the "surrogates dance"
→ use the index upgrader, LUCENE-3082
20. Default: StandardCodec (1)
The default codec was moved out of
o.a.l.index:
• it now lives in its own package,
o.a.l.index.codecs.standard
• is the default implementation
• similar to the pre-flex index format
• requires far less RAM to load the term index
21. Default: StandardCodec (2)
• Terms index:
VariableGapTermsIndexWriter/Reader
• uses finite state transducers from the new FST package
• removes useless suffixes
• reusable by other codecs
(Talk yesterday morning: "Finite State Automata in Lucene" by Dawid Weiss)
• Terms dict: BlockTermsWriter/Reader
• more efficient metadata decoding
• faster TermsEnum
• reusable by other codecs
• Postings: StandardPostingsWriter/Reader
• similar to the pre-flex postings
22. Built-In Codecs: PulsingCodec (1)
• Inlines low doc-freq terms into the terms dict
• Saves an extra seek to get the postings
• Excellent match for primary key fields, but also
"normal" fields (Zipf's law)
• Wraps any other codec
• Likely the default codec will use Pulsing
See Mike McCandless' blog:
http://s.apache.org/XEw
24. Built-In Codecs: SimpleText
• All postings stored in a plain-text _X.pst file
• Read / write
• Not performant
• Do not use in production!
• Fully functional
• Passes all Lucene/Solr unit tests
(slowly...)
• Useful / fun for debugging
See Mike McCandless' blog:
http://s.apache.org/eh
25. Other Codecs shipped with Lucene /
Lucene Contrib
• PerFieldCodecWrapper (core)
• Can be used to delegate specific fields to different
codecs (e.g., use Pulsing for the primary key)
• AppendingCodec (contrib/misc)
• Never rewinds a file pointer during write
• Useful for Hadoop
26. Heavy Development: Block Codecs
• Separate branch for developing block codecs:
• http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings
• All codecs use the default terms index, but different
encodings of postings
• Support for recent compression developments
• FOR, PFORDelta, Simple64, BulkVInt
• Framework for blocks of encoded ints (using
SepCodec)
• Easy to plug in a new compression scheme by providing
an encoder/decoder between a block of ints (int[]) and
IndexOutput/IndexInput
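The simplest of these schemes, Frame of Reference (FOR), can be shown standalone: delta-encode the sorted doc IDs of a block, then pack every delta with the same fixed bit width (the maximum any delta in the block needs), so decoding becomes branch-free shifts and masks. A hedged, self-contained sketch of that idea (not the bulkpostings branch code):

```java
public class ForBlockDemo {
    // Bit width needed for the largest delta in the block (at least 1).
    static int bitsRequired(int[] docs) {
        int prev = 0, maxDelta = 0;
        for (int d : docs) { maxDelta = Math.max(maxDelta, d - prev); prev = d; }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(maxDelta));
    }

    // Pack all deltas with the same fixed bit width into long words.
    static long[] pack(int[] docs) {
        int bits = bitsRequired(docs);
        long[] packed = new long[(docs.length * bits + 63) / 64];
        int prev = 0;
        for (int i = 0; i < docs.length; i++) {
            long delta = docs[i] - prev;
            prev = docs[i];
            int bitPos = i * bits;
            packed[bitPos / 64] |= delta << (bitPos % 64);
            if (bitPos % 64 + bits > 64)  // value straddles a word boundary
                packed[bitPos / 64 + 1] |= delta >>> (64 - bitPos % 64);
        }
        return packed;
    }

    // Branch-light decode: fixed shifts/masks, then a prefix sum of deltas.
    static int[] unpack(long[] packed, int bits, int count) {
        int[] docs = new int[count];
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int bitPos = i * bits;
            long delta = packed[bitPos / 64] >>> (bitPos % 64);
            if (bitPos % 64 + bits > 64)
                delta |= packed[bitPos / 64 + 1] << (64 - bitPos % 64);
            delta &= (1L << bits) - 1;
            prev += (int) delta;
            docs[i] = prev;
        }
        return docs;
    }
}
```

PFORDelta refines this by packing with a smaller "typical" bit width and patching the rare outlier deltas as exceptions, so one large delta does not inflate the whole block.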
27. Block Codecs: Use Case
• Smaller index & better compression for certain
index contents by replacing vInt
• Frequent terms (stop words) have low doc deltas
• Short fields (city names) have small position values
• TermQuery
• High doc-frequency terms
• MultiTermQuery
• Lots of terms (possibly with low docFreq) request
doc postings in order
28. Block Codecs: Disadvantages
• Crazy API for consumers, e.g. queries
• Removes the encapsulation of DocsEnum
• Consumer needs to load the blocks
• Consumer needs to do doc-delta
decoding
• Lots of queries may need duplicated
code:
• an impl for codecs that support "bulk
postings"
• a classical DocsEnum impl
30. Flexible Indexing - Current State
• All tests pass with a pool of random codecs
• If you find a bug, post the test output to the mailing list!
• The test output contains the random seed to reproduce it
(Talk yesterday evening:
"Using Lucene's Test Framework" by Robert Muir)
• many more tests and documentation are needed
• community feedback is highly appreciated
31. Flexible Indexing - TODO
• Serialize custom attributes to the index
• Convert all remaining queries to internally use
BytesRef terms
• Remove interning of field names
• Die, Term (& Token) class, die! ☺
• Make the block-/bulk-postings API more
user-friendly and merge it to SVN trunk
• Merge the DocValues branch
(Talk yesterday: "DocValues & Column Stride Fields in Lucene" by Simon Willnauer)
• More heavy refactoring!
33. Summary
• New 4D postings enum APIs
• A pluggable codec lets you customize the
index format
• Many codecs already available
• State-of-the-art postings formats are in
progress
• Innovation is easier!
• Exciting time for Lucene...
• Sizable performance gains and RAM/GC savings
34. We need YOU!
• Download nightly builds or check
out SVN trunk
• Run tests, run tests, run tests, …
• Use it in your own developments
• Report back your experience with
developing your own codecs
35. Contact
Uwe Schindler
uschindler@sd-datasolutions.de
http://www.thetaphi.de
@thetaph1
Many thanks to Mike McCandless for slides & help on preparing this talk!
SD DataSolutions GmbH
Wätjenstr. 49
28213 Bremen, Germany
+49 421 40889785-0
http://www.sd-datasolutions.de