This document summarizes a presentation about MontySolr, an extension that allows embedding CPython in Solr. It was created by Roman Chyla of CERN to connect Python and Java applications without compromises. MontySolr uses JCC to embed a Python interpreter in Java, allowing Python code to interface with Solr. This provides a robust, tested integration that works for any Python or C/C++ application and leverages the strengths of both Solr and Invenio.
How to Gain Greater Business Intelligence from Lucene/Solr
Presented by Patrick Beaucamp | Bpm-Conseil - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Vanilla, an Open Source business intelligence application by bpm-conseil.com, offers unique features such as report indexing through an embedded Lucene integration. Using Vanilla and Lucene, developers can manage both report indexing and external document indexing, which ultimately saves end users time when they search for specific keywords such as "product code" or "customer code." Vanilla can build upon an existing Solr/Lucene installation that takes care of all the indexing processes while Vanilla takes care of the reporting/dashboard creation. During this presentation, attendees will learn how we moved from an embedded Lucene API to a Solr/Lucene platform, and about the technical and business benefits of this architecture in terms of clustering, caching and access modes.
How The Guardian Embraced the Internet using Content, Search, and Open Source
This talk covers how The Guardian opened up its business, enriched it, and reached new markets with its Open Platform strategy. Stephen will cover the technical architecture, the implementation of Solr (the key technology powering the platform), and how The Guardian has used it to embrace disruption in the media space while finding new sources of revenue and innovation.
In the mid-1990s, the high-energy physics community (think FermiLab and CERN) started planning for the Large Hadron Collider. Managing the petabytes of data that would be generated by the facility and sharing it with the globally distributed community of over 10,000 researchers would be a major infrastructure and technology problem. This same community that brought us the web has now developed standards, software, and infrastructure for grid computing. In this seminar I'll present some of the exciting science that is being done on the Open Science Grid, the US national cyberinfrastructure linking 60 institutions (Harvard included) into a massive distributed computing and data processing system.
Introduction, overview, and caveats for Part 3 of our New Energy curriculum. In Part 3, we look at the key sciences which make energy from the quantum vacuum possible and how we can apply these principles in a myriad of specific technologies that will revolutionize life in the twenty-first century.
The Royal Society of Chemistry has an archive of published journals and books stretching back to 1841. In the past decade we have digitized this archive and semantically enriched our frontfile data with chemical structures linked to our free online chemical compound database, ChemSpider. In this talk we will survey our recent efforts to extract all kinds of data – chemical structures, experimental and bibliographic data – from both our backfile and frontfile. We will also discuss our future work to extract chemical reactions to host in our ChemSpider Reactions database and will discuss the potential applications of optical structure recognition technologies for converting structure images to structures as well as using similar techniques to convert experimental spectral data into interactive data formats. A key aspect of this project is the delivery of a crowdsourcing platform for the interactive annotation and validation of the extracted data.
Jean-Claude Bradley presents at the Science Commons Symposium on Feb 20, 2010 at the Microsoft Campus in Redmond. The talk covers doing Open Notebook Science using free and hosted tools, including new archiving protocols developed with Andrew Lang.
Text Classification Powered by Apache Mahout and Lucene
Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH at Lucene/Solr Revolution 2013 Dublin
Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents into feature vectors. Though this step is highly domain specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering. This session shows how to use faceting to quickly get an understanding of the fields in your documents. It will walk you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, including a few anecdotes on drafting domain-specific features.
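The first step the abstract describes, turning text into feature vectors over a fixed vocabulary, can be sketched in plain Python. The tokenizer and term-frequency vectorizer below are illustrative stand-ins for Lucene's analysis chain and Mahout's vectorizers, not their actual APIs:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-letters, loosely mimicking an analyzer chain
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def to_feature_vector(text, vocabulary):
    # Map a document to raw term frequencies over a fixed, ordered vocabulary
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in vocabulary]

docs = ["Solr powers search", "Mahout powers classification"]
vocabulary = sorted({t for d in docs for t in tokenize(d)})
vectors = [to_feature_vector(d, vocabulary) for d in docs]
```

Real pipelines would add TF-IDF weighting, n-grams and stop-word filtering, but the shape of the output, one numeric vector per document, is the same.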
Presented by Markus Klose, Search + Big Data Consultant SHI Elektronische Medien GmbH at Lucene/Solr Revolution 2013 Dublin
Kibana4Solr is search-driven, scalable, browser-based and extremely user-friendly, also for non-technical users. Logs are everywhere. Any device, system or human can potentially produce a huge amount of information saved in logs. The amount of available logs and their semi-structured nature make meaningful real-time processing quite a difficult task. Thus, valuable business insights stored in logs might never be found. Kibana4Solr is a search-driven approach to handling that challenge. It offers a user-friendly, browser-based dashboard which can be easily customized to particular needs. In this session Kibana4Solr will be introduced. Some light will be shed on its architectural features, some ideas will be given in terms of possible business use cases, and finally a live demo of Kibana4Solr will be shown.
Building Client-side Search Applications with Solr
Presented by Daniel Beach, Search Application Developer, OpenSource Connections
Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast-paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the-box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.
Integrate Solr with real-time stream processing applications
Presented by Timothy Potter, Founder, Text Centrix
Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr’s real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He’ll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real time. In addition, we'll see how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, commonly known as a percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.
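The percolator idea, matching each incoming tuple against a set of pre-registered queries rather than the other way around, can be sketched without Storm or Solr at all. Here the "queries" are just hypothetical sets of required terms, standing in for real parsed Lucene queries:

```python
# Minimal percolator sketch: stored "queries" are sets of required terms,
# and each incoming tuple's text is matched against every registered query.
stored_queries = {
    "solr-news": {"solr"},
    "storm-alerts": {"storm", "topology"},
}

def percolate(text):
    terms = set(text.lower().split())
    # A query fires when all of its required terms appear in the tuple
    return sorted(name for name, required in stored_queries.items()
                  if required <= terms)

matches = percolate("Deploying a Storm topology that indexes into Solr")
```

In the architecture the talk describes, this matching step would live inside a Storm bolt, with an embedded in-memory index replacing the set intersection.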
Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, and use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you'll learn how to make your indexing process blazing fast and your queries efficient even with large amounts of data in your collections. You'll also see how to optimize your queries to leverage caches as much as your deployment allows and how to observe your cluster with the Solr administration panel, JMX, and third-party tools. Finally, learn how to make changes to already deployed collections: split their shards and alter their schema by using the Solr API.
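The horizontal-scaling idea rests on routing each document to a shard by a stable hash of its id. The sketch below illustrates the principle only; real Solr uses MurmurHash ranges over the id (compositeId routing), not CRC32:

```python
import zlib

def shard_for(doc_id, num_shards):
    # A stable hash guarantees the same id always routes to the same shard,
    # which is what makes lookups and updates by id possible.
    return zlib.crc32(doc_id.encode()) % num_shards

assignments = {d: shard_for(d, 4) for d in ("doc1", "doc2", "doc3")}
# Routing is deterministic: recomputing gives identical shard assignments
consistent = all(shard_for(d, 4) == s for d, s in assignments.items())
```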
Presented by Rafal Kuć, Consultant and Software Engineer, Sematext Group, Inc.
Even though Solr can run without causing any trouble for long periods of time, it is very important to monitor and understand what is happening in your cluster. In this session you will learn how to use various tools to monitor how Solr is behaving at a high level, but also at the Lucene, JVM, and operating system levels. You'll see how to react to what you see and how to make changes to configuration, index structure and shard layout using the Solr API. We will also discuss different performance metrics to which you ought to pay extra attention. Finally, you'll learn what to do when things go awry: we will share a few examples of troubleshooting and then dissect what was wrong and what had to be done to make things work again.
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
In a recent project with the United States Patent and Trademark Office, OpenSource Connections was asked to prototype the next generation of patent search using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process. In this fast-paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parsing Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr's QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and then define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser. Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview of our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.
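To make the "syntax parser" step concrete, here is a hand-rolled recursive-descent parser for a hypothetical mini-grammar of terms joined by AND/OR. Real BRS is far richer (proximity, wildcards, field restriction), and Parboiled would generate the parser from grammar rules rather than loop by hand; this only shows the query-string-to-AST shape:

```python
import re

# Hypothetical mini-grammar, much simpler than real BRS:
#   expr := term (("AND" | "OR") term)*
# parsed left-associatively into nested ("OP", left, right) tuples.
def parse(query):
    tokens = re.findall(r"\w+", query)
    if not tokens:
        raise ValueError("empty query")
    node = ("term", tokens[0])
    i = 1
    while i < len(tokens):
        op = tokens[i].upper()
        if op not in ("AND", "OR") or i + 1 >= len(tokens):
            raise ValueError("expected AND/OR followed by a term")
        node = (op, node, ("term", tokens[i + 1]))
        i += 2
    return node

ast = parse("engine AND turbine OR rotor")
```

An AST like this is what a QParserPlugin would then walk to emit Lucene Query and SpanQuery objects.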
Many of us tend to hate or simply ignore logs, and rightfully so: they’re typically hard to find, difficult to handle, and cryptic to the human eye. But can we make logs more valuable and more usable if we index them in Solr, so we can search and run real-time statistics on them? Indeed we can, and in this session you’ll learn how to make that happen. In the first part of the session we’ll explain why centralized logging is important and what valuable information one can extract from logs, and we’ll introduce the leading tools from the logging ecosystem everyone should be aware of, from syslog and log4j to LogStash and Flume. In the second part we’ll teach you how to use these tools in tandem with Solr. We’ll show how to use Solr in a SolrCloud setup to index large volumes of logs continuously and efficiently. Then, we'll look at how to scale the Solr cluster as your data volume grows. Finally, we'll see how you can parse your unstructured logs and convert them to nicely structured Solr documents suitable for analytical queries.
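The final step, turning an unstructured log line into a structured document, amounts to pattern extraction. The line format and field names below are hypothetical; in practice LogStash or Flume grok patterns do this job:

```python
import re

# Hypothetical syslog-like line format: "<timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

def to_structured_doc(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        # Fall back to an unparsed document rather than dropping the line
        return {"message": line}
    return m.groupdict()

doc = to_structured_doc("2024-06-06 10:15:00 ERROR Disk quota exceeded")
```

Once every line becomes a field-per-value document like this, facet and range queries over level and timestamp come for free.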
Real-time Inverted Search in the Cloud Using Lucene and Storm
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
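The scaling trick the abstract describes, partitioning tens of thousands of subscription queries across workers so each document is matched against every slice in parallel, can be sketched as follows. Plain term-sets stand in for the per-worker in-memory Lucene indices, and the worker count and query names are made up:

```python
# Partition subscription queries across workers by hash of the query name,
# so each worker only evaluates its own slice against incoming documents.
def partition(queries, n_workers):
    slices = [dict() for _ in range(n_workers)]
    for name, required_terms in queries.items():
        slices[hash(name) % n_workers][name] = required_terms
    return slices

def match_slice(slice_queries, doc_terms):
    # A subscription fires when all its required terms appear in the document
    return {name for name, req in slice_queries.items() if req <= doc_terms}

queries = {f"q{i}": {f"term{i}"} for i in range(100)}
slices = partition(queries, 4)
doc_terms = {"term7", "term42"}
# In Storm this union would happen downstream of the parallel bolts
hits = set().union(*(match_slice(s, doc_terms) for s in slices))
```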
Solr's Admin UI - Where does the data come from?
Like many web applications in the past, the Solr Admin UI up until 4.0 was entirely server based. It used separate code on the server to generate its dashboards, overviews and statistics. All that code had to be maintained, and still you weren't really able to use that kind of data for the things you needed it for. It was wrapped in HTML, most of the time difficult to extract, and its structure changed from time to time without announcement. After a short look back, we're going to look into the current state of the Solr Admin UI: a client-side application, running completely in your browser. We'll see how it works, where it gets its data from, and how you can get the very same data and wire it into your own custom applications, dashboards and/or monitoring systems.
Steve will show how and why to use Solr’s new Schemaless Mode, under which document indexing can be performed with no up-front schema configuration. Solr uses content clues to choose among a predefined set of field types and then automatically adds previously unseen fields to the schema.
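The "content clues" idea can be sketched as trying increasingly general parses until one succeeds. The type names below only loosely follow Solr's field type naming, and real schemaless mode applies a configurable chain of update processors rather than this toy cascade:

```python
from datetime import datetime

def guess_field_type(value):
    # Try the most specific types first, falling back to plain text
    for parse, type_name in ((int, "plong"), (float, "pdouble")):
        try:
            parse(value)
            return type_name
        except ValueError:
            pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "pdate"
    except ValueError:
        return "text_general"

guessed = [guess_field_type(v) for v in ["42", "3.14", "2013-11-04", "Dublin"]]
```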
High Performance JSON Search and Relational Faceted Browsing with Lucene
Presented by Renaud Delbru, Co-Founder, SindiceTech
In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless documents, e.g. JSON or XML, and can then be made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in nested, schemaless, data-intensive scenarios. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. Take-away lessons from this session will be an awareness of using Lucene/Solr and Hadoop for relational and graph data search, as well as the awareness that it is now possible to have relational faceted browsers with sub-second response time on commodity hardware.
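One common way to make a tree-shaped schemaless document searchable is to flatten it into (path, value) pairs; this sketch shows that representation only, not SIREn's actual encoding:

```python
# Flatten a JSON-like tree into (dotted-path, leaf-value) pairs,
# the kind of representation a tree-search index can consume directly.
def flatten(node, path=""):
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for item in node:
            yield from flatten(item, path)
    else:
        yield (path, node)

doc = {"name": "SIREn", "tags": ["lucene", "solr"], "meta": {"year": 2013}}
pairs = sorted(flatten(doc))
```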
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
In this session we will show how to build a text classifier using Apache Lucene/Solr with the libSVM library. We classify our corpus of job offers into a number of predefined categories. Each indexed document (a job offer) then belongs to zero, one or more categories. Known machine learning techniques for text classification include the naïve Bayes model, logistic regression, neural networks, support vector machines (SVM), etc. We use Lucene/Solr to construct the feature vectors. Then we use the libSVM library, known as the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting, then using the Hadoop MapReduce framework we reconcile the results of our classifiers. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
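The one-vs-all reconciliation step can be sketched independently of libSVM: each binary classifier scores the document, and the document receives every label whose score clears a threshold, which is how one document ends up with zero, one, or several categories. The linear scorers and weights below are toy stand-ins for trained SVMs:

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def ova_classify(weights_per_label, features, threshold=0.0):
    # Reconcile the binary classifiers: keep every label whose
    # one-vs-all score exceeds the threshold.
    return sorted(label for label, w in weights_per_label.items()
                  if dot(w, features) > threshold)

weights = {"engineering": [1.0, -0.5], "sales": [-1.0, 1.0]}
labels = ova_classify(weights, [2.0, 1.0])
```

In the talk's setting the per-label scoring runs as parallel MapReduce tasks, with this reconciliation in the reduce step.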
Faceted search is a powerful technique to let users easily navigate the search results. It can also be used to develop rich user interfaces, which give an analyst quick insights about the documents space. In this session I will introduce the Facets module, how to use it, under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.
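At its core, a facet over a field is just a count of its distinct values across the matching documents. This toy sketch shows that idea; Lucene's Facets module adds taxonomy indexes, hierarchies and efficient per-segment counting on top:

```python
from collections import Counter

def facet_counts(docs, field):
    # Count how often each distinct value of `field` occurs in the result set
    return Counter(doc[field] for doc in docs if field in doc)

matching_docs = [
    {"title": "A", "category": "books"},
    {"title": "B", "category": "books"},
    {"title": "C", "category": "music"},
]
counts = facet_counts(matching_docs, "category")
```

A faceted UI then renders these counts ("books (2), music (1)") as drill-down links that narrow the query.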
Presented by Shai Erera, Researcher, IBM
Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted based on some criterion (e.g. modification date). This allows for efficient search early-termination as well as better index compression. Index replication lets you replicate a search index to achieve high availability and fault tolerance, as well as take hot index backups. In this talk we will introduce these modules and discuss implementation and design details as well as best practices.
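Why does a sorted index enable early termination? If documents are stored newest-first and the query wants the k newest matches, collection can stop as soon as k hits are found, without scanning the rest. A minimal sketch of that reasoning (the document shape and match predicate are made up):

```python
def top_k_newest(sorted_docs, matches, k):
    # `sorted_docs` is assumed already sorted newest-first, so the first
    # k matches are guaranteed to be the k newest; stop there.
    hits, scanned = [], 0
    for doc in sorted_docs:
        scanned += 1
        if matches(doc):
            hits.append(doc)
            if len(hits) == k:
                break   # early termination: the remaining docs can't win
    return hits, scanned

docs = [{"id": i} for i in range(100)]          # newest first by construction
hits, scanned = top_k_newest(docs, lambda d: d["id"] % 2 == 0, 3)
```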
As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of actual queries applied and how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how it can be used, including applications that monitor hundreds of thousands of news stories every day.
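One way such an "intelligent filter" can work (a plausible sketch, not necessarily Flax's actual implementation) is to index each stored query under one of its required terms, so a document only triggers full evaluation of the few queries whose gating term it actually contains:

```python
from collections import defaultdict

def build_prefilter(queries):
    # Gate each stored query on one required term, chosen deterministically
    by_term = defaultdict(list)
    for name, required_terms in queries.items():
        by_term[sorted(required_terms)[0]].append(name)
    return by_term

def candidate_queries(by_term, doc_terms):
    # Only queries whose gating term appears in the document survive the filter
    return sorted(name for term in doc_terms for name in by_term.get(term, []))

queries = {"oil": {"brent", "crude"}, "tech": {"solr"}, "sport": {"football"}}
by_term = build_prefilter(queries)
candidates = candidate_queries(by_term, {"brent", "prices", "rise"})
```

The surviving candidates still need full evaluation (the gate is necessary but not sufficient), which is where position extraction against the document would run.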
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker
Presented by Xavier Sanchez Loro, Ph.D, Trovit Search SL
This session explains the implementation and use case for spellchecking in the Trovit search engine. Trovit is a classified-ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several million indexed ads. Those indexes are segmented into several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using Solr and Lucene in order to help our users better find the desired ads and avoid the dreaded 0 results as much as possible. As such, our goal is not pure orthographic correction, but also suggestion of correct searches for a given site.
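The core of such a spellchecker is edit distance against the terms that actually occur in the site's index, which is what keeps suggestions from leading to 0 results. A minimal sketch, with a made-up index vocabulary and distance cutoff:

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(term, index_terms, max_distance=2):
    # Suggest only terms present in the index, so the corrected
    # search is guaranteed to return results.
    best = min(index_terms, key=lambda t: edit_distance(term, t))
    return best if edit_distance(term, best) <= max_distance else None

suggestion = suggest("apartmant", ["apartment", "car", "house"])
```

A contextual, per-site checker would draw `index_terms` from each site's own index, so "apartmant" suggests "apartment" on a homes site but something else on a cars site.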
Climate Impact of Software Testing at Nordic Testing Days, by Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help mitigate climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
A tale of scale & speed: How the US Navy is enabling software delivery from l...
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
UiPath Test Automation using UiPath Test Suite series, part 5
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Epistemic Interaction - tuning interfaces to provide information for AI support, by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Communications Mining Series - Zero to Hero - Session 1
CPython embedded in Solr - By Roman Chyla
1. MontySolr:
Embedding CPython in Solr
Roman Chyla, CERN
roman.chyla@cern.ch, May 26, 2011
Thursday, May 26, 2011
2. Why should I care?
- Our challenge is to connect Python and Java
- Without compromises
- We created MontySolr extension
- Robust, tested (will be used by our system)
- But works for any Python application (e.g. Django)
- And for any C/C++ app that Python understands!
- Open source (GPL v2)
- Try it out!
- https://github.com/romanchyla/montysolr
3. Outline
‣ Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up
4. CERN
- European Organization for Nuclear Research
- Switzerland, Geneva
- The largest laboratory for High Energy Physics
- Home to the Large Hadron Collider
- 40-50K HEP scientists worldwide
12. SPIRES
- Stanford Linear Accelerator Center - SLAC
- High-Energy Physics Literature Database
- Started December 1991
- The first web outside Europe/CERN
- The first database on web
16. Invenio
- Integrated digital library software behind INSPIRE
- Used by very large institutional repositories
- http://repositories.webometrics.info/toprep_inst.asp
- Customizable virtual collections
- Flexible management of metadata
- 3 000 authors per article
- Powerful search engine
- Incl. citation map analysis
- Written in Python (since 2001)
- 290 000 lines of code
17. Outline
- Context
‣ The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up
18. The Challenge
- HEP scientific community
- Searches are metadata-oriented
- However, full texts are changing the situation
- And we want to provide even better service
- Bigger volumes of data
- NLP processing
- Semantic search
20. The Challenge
Query: supersymmetry AND author:ellis
[Diagram: Invenio sends fulltext:supersymmetry to Solr; Solr returns 1-6M IDs back as IDs: 1;2;3;9...]
1. only IDs, no score = no ranking
2. score merging difficult (if no score available)
3. push IDs? (e.g. faceting)
30. What is the “best” solution?
- We love Python...
- ...and our applications are written in Python...
- But what if Solr is the master search engine?
- Merge results inside Solr?
- Typical size: 1-10 mil. IDs
- Expected latency: 1-2 s.
- What we want to achieve:
- Fast transfer of hits from Invenio to Solr
- Leverage the power of both (no compromises)
- Developer-friendly integration, simplicity
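The merge step behind these goals can be sketched in plain Python (illustrative names only, not MontySolr's API): Solr holds scored fulltext hits, Invenio returns a plain set of record IDs, and the final result keeps Solr's scores for the intersection.

```python
# Hypothetical sketch of merging Invenio hits with scored Solr results.
# solr_hits: list of (doc_id, score) pairs; invenio_ids: iterable of ints.
def merge_hits(solr_hits, invenio_ids):
    allowed = set(invenio_ids)  # O(1) membership tests, even for millions of IDs
    return [(doc, score) for doc, score in solr_hits if doc in allowed]

merged = merge_hits([(1, 2.5), (2, 1.1), (9, 0.7)], [2, 3, 9])
# merged == [(2, 1.1), (9, 0.7)]
```

The set-based filter keeps Solr's ranking intact while restricting results to what Invenio allows, which is the "no compromises" behaviour the slide asks for.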
31. Outline
- Context
- The Challenge
‣ Key components
- Available technologies
- Our approach
- Evaluation
- Demonstration
- Wrap-up
32. To embed Solr (in Java app)
- Your app simulates Java web container?
- use EmbeddedSolrServer
- It knows nothing about Java servlets?
- use DirectConnect class
- Maybe we are too lazy?
- Embed the web container (in my case Jetty)
- Seemed strange (webserver inside webserver)
- ... but it worked well
37. To use Solr in non-Java app
- Solr is already usable via HTTP requests, but we need something else here...
- Remote objects/calls?
- Pyro, execnet, CORBA, SOAP...
- or simply pipes?
- Access Python from Java?
- Jython
- JEPP
- Access Java from Python?
- JPype
- JCC
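For reference, the HTTP baseline the slide mentions looks like this from Python, using only the standard library; the base URL and core name are placeholders for a local Solr instance.

```python
# Querying Solr's select handler over plain HTTP (stdlib only).
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def solr_select(base_url, query, rows=10):
    # build e.g. http://host:8983/solr/core/select?q=...&rows=...&wt=json
    params = urlencode({'q': query, 'rows': rows, 'wt': 'json'})
    with urlopen('%s/select?%s' % (base_url, params)) as resp:
        return json.load(resp)

# e.g. solr_select('http://localhost:8983/solr/invenio', 'supersymmetry')
```

This is fine for loose coupling, but as the slide says, shipping millions of IDs per query over HTTP is exactly what MontySolr tries to avoid.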
38. Jython?
- Implementation of Python in 100% Java
- Both Java and Python code
- Truly multithreaded
- C modules will not work
- but see http://bit.ly/iTRYbb
- Slower than CPython
41. JEPP - Java Embedded Python
- Python code runs inside the Python interpreter
- Embeds the CPython interpreter in Java via the Java Native Interface (JNI)
- http://jepp.sourceforge.net/
- recently updated (27-Jan)
- but JCC is more active
43. JCC
- Embeds a JVM in Python
- C++ code generator
- C++ object interface wraps a Java library
- C++ wrappers conform to Python's C type system
- result: a complete Python extension module
47. To use Solr in non-Java app
                     Jython   JCC   JEPP
Python C modules       -       ✓      ✓
Speed                  -       ✓      ?
No code changes        ✓       -      ✓
Access from Python     ✓       ✓      -
Access from Java       ✓      ...     ✓
48. The first try
[Diagram: Invenio embedding Solr via JCC]
49. The devil is in the details...
50. GIL - Global Interpreter Lock
Unfortunately, a Python web app is not like a Java one...
51. GIL - Global Interpreter Lock
We can have 200 threads, but only 4 will run at a time...
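The point is easy to reproduce with a small pure-Python sketch: CPU-bound work gains nothing from extra threads, because the GIL lets only one thread execute Python bytecode at a time.

```python
import threading
import time

def burn(n):
    # CPU-bound loop; holds the GIL the whole time
    while n:
        n -= 1

N = 1_000_000

start = time.time()
burn(N)
burn(N)
serial = time.time() - start

start = time.time()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start
# threaded is roughly the same as serial: the two threads cannot run in parallel
```

This is why, as a later slide notes, MontySolr uses threading on the Java side but multiprocessing on the Python side.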
52. GIL - Global Interpreter Lock
53. Fortunately solution exists
- JCC can embed Python inside Java
- Special thanks to Andi Vajda! (JCC creator)
- We write ‘empty’ classes in Java ...
- ... and implement them in Python
Python with Java inside vs. Java with Python inside
54. The second try
[Diagram: Invenio frontend exchanging XML with Solr, which embeds the Invenio backend via JCC]
55. Implementing the bridge
- A special Java class
- With method pythonExtension()
- Native method pythonDecRef()
- JCC provides its implementation
- And a number of other native methods
- These will be implemented using Python
- Like writing JNI Java/C code, but without compilation...
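The bridge pattern can be mimicked in pure Python (illustrative names, not the real MontySolr classes): the base class plays the role of the Java side with its "native" method, and a subclass supplies the Python implementation.

```python
class BasicBridge:
    """Stands in for the Java class whose native methods JCC wires up."""
    def send_message(self, message):
        # in MontySolr this call crosses JNI and acquires the Python
        # thread state before dispatching to the Python side
        return self.receive_message(message)

    def receive_message(self, message):
        raise NotImplementedError('implemented on the Python side')

class EchoBridge(BasicBridge):
    def receive_message(self, message):
        return 'Python received: ' + message

reply = EchoBridge().send_message('hello')
# reply == 'Python received: hello'
```

The real Java and Python halves of this pattern follow on the "Hello World" slides.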
56. MontySolr extension
- JCC has great potential, but also added complexity...
- So the MontySolr project was born
- Modules must be built in shared mode
- JCC dynamic library loaded and started from the main thread
- Simple mechanism of the Python bridge and message
- Configurable handlers on the Python side
- Secured dereferencing of the native objects
- Threading on the Java side
- Multiprocessing on the Python side
- Easy ant targets (compilation) ...
57. Hello World - Java part
public class MontySolrBridge extends BasicBridge implements PythonBridge {
    private long pythonObject;

    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension() {
        return this.pythonObject;
    }

    public void finalize() throws Throwable {
        pythonDecRef();
    }

    public native void pythonDecRef();

    public void sendMessage(PythonMessage message) {
        PythonVM vm = PythonVM.get();
        vm.acquireThreadState();
        receive_message(message);
        vm.releaseThreadState();
    }

    public native void receive_message(PythonMessage message);
}
58. Hello World - Python part
from montysolr import MontySolrBridge

class SimpleBridge(MontySolrBridge):
    def __init__(self):
        super(SimpleBridge, self).__init__()

    def receive_message(self, message):
        query = message.getParam('query')
        message.setResults('Hello world!')
        print 'Python received from Java:', query
59. Example - running MontySolr
- Java side
- JRE (32/64 bit)
- Standard Solr/Lucene jars
- JCC dynamic library
- Python side
- Python interpreter (32/64 bit)
- 4 Python modules (jcc, solr, lucene, montysolr)
- In the main thread
- First we load JCC
- Then start Python interpreter ...
- ... load Python handlers
60. Solr as search service
[Diagram: Invenio frontend exchanging XML with Solr, which embeds the Invenio backend via JCC]
61. Example
[Diagram: Solr with MyCustomHandler]
62. Example
[Diagram: query refersto:author:ellis sent to Solr's MyCustomHandler]
63. Example - Solr custom handler
PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");
MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) msg.getResults();
}
64. Example - JNI connection
refersto:author:ellis
[Diagram: Solr's MyCustomHandler calling the Python Bridge over JNI, which in turn calls the Invenio wrappers]
66. Example - Python side
# handler is made 'visible' at startup
SolrpieTarget('Invenio:perform_search', perform_search)

# search time - called from Java
def perform_search(message):
    query = message.getParam("query")
    hits = call_real_search(query)
    # cast Python list into Java array
    message.setResults(JArray_ints(hits))
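The SolrpieTarget registration above amounts to a name-to-handler registry. A minimal pure-Python stand-in (hypothetical names, not MontySolr's classes) might look like:

```python
class MessageBus:
    """Maps 'Sender:message_name' keys to Python handler functions."""
    def __init__(self):
        self._targets = {}

    def register(self, key, func):
        self._targets[key] = func

    def dispatch(self, key, message):
        # in MontySolr, the Java side triggers this with the incoming message
        return self._targets[key](message)

bus = MessageBus()
bus.register('Invenio:perform_search', lambda msg: sorted(msg['ids']))
hits = bus.dispatch('Invenio:perform_search', {'ids': [9, 1, 3, 2]})
# hits == [1, 2, 3, 9]
```

Keeping the dispatch table on the Python side is what makes the handlers "configurable" without touching the Java code.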
67. Example
refersto:author:ellis
[Diagram: Solr's MyCustomHandler dispatching through the Python Bridge and Invenio wrappers to multiple Invenio processes]
68. Example - Java side again
PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");
MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) msg.getResults();
}
69. Solr as search service
[Diagram: Apache webserver frontend exchanging XML with Solr, which embeds the Invenio backend and multiple Invenio processes via JCC]
70. Outline
- Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
‣ Evaluation
- Wrap-up
74. Robust?
- Extensive siege tests show very good performance and stability under high load
- 100-200 users, complex searches
- 50 concurrent users, citation analysis
- JCC incurs small overhead
- We detected no memory leaks
- The same as dbpedia.org
- But watch out for errors in C
- An error in a C module brings down the whole JVM
- (errors in a pure Python module can be handled)
75. Easy to develop/maintain?
- Added complexity
- Java in the toolbox
- Need to compile C++ extensions
- Python/OS version dependencies
- For this we get
- Easy integration with Invenio
- The best of two applications
- A lot of features for free
- And we can control Solr from Python!
76. Outline
- Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
‣ Wrap-up
77. Wrap-up
- Our challenge was to connect two different languages/systems
- And we wanted to get the best of the two...
- So we had to plug Python into Solr
- And now our Solr knows citation analysis!
- We created the MontySolr extension
- Robust, tested (will be used by INSPIRE)
- Works for any Python application (e.g. Django)
- And for any C/C++ app that Python understands!
- Free software license
- Try it out! Help us make it better!
- https://github.com/romanchyla/montysolr
78. Questions?
- MontySolr
- https://github.com/romanchyla/montysolr
- Roman Chyla
- Fellow, CERN Scientific Information Service
- roman.chyla@cern.ch
- @rchyla
- https://svnweb.cern.ch/trac/rcarepo
80. Links
- Invenio platform
- http://invenio-software.org/
- INSPIRE Digital library
- http://inspirebeta.net/
- Diagrams of JCC and JEPP
- Andreas Schreiber: Mixing Java and Python
- http://www.slideshare.net/onyame/mixing-python-and-java
- On the Jython C Extension API
- http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
- Demo of a running service:
- http://insdev01.cern.ch
81. #1 - How to embed Solr (standard)
- solr.client.solrj.embedded.EmbeddedSolrServer
82. #2 - How to embed Solr (simplified)
- solr.servlet.DirectSolrConnection
- like the previous, but simpler
- all the queries are sent as strings; everything is just a string
- very flexible and probably suitable for quick integration
84. #3 - Example of a Solr custom handler
85. #4 - Example Python handler