Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr.
The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network, the next-generation multilingual search platform for the USPTO. GPSN, http://gpsn.uspto.gov, was the USPTO's first public application deployed in the cloud, and it allowed a very small development team to build a discovery interface across millions of patents.
This case study will cover:
• How we leveraged the Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
• Some of the innovative methods we used for converting XML-formatted data into usable information.
• Parsing through 5 TB of raw TIFF image data and converting it to a modern, web-friendly format.
• Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
• How we built “data sharing” features into the application to allow third party systems to build additional functionality on top of GPSN.
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
At Basis Technologies' Open Source Search conference I talked about a project I did this past year, and the lessons, both good and bad, that we learned.
Chris Bradford & Matt Overstreet review several Cassandra use cases we've encountered in state and federal government. C* solves many big data problems around storing, enriching, and improving access to data.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
David Kale and Ruben Fizsel from Skymind talk about deep learning for the JVM and the enterprise using deeplearning4j (DL4J). Deep learning (nouveau neural nets) has sparked a renaissance in empirical machine learning, with breakthroughs in computer vision, speech recognition, and natural language processing. However, many popular deep learning frameworks are targeted at researchers and poorly suited to enterprise settings that use Java-centric big data ecosystems. DL4J bridges the gap, bringing high-performance numerical linear algebra libraries and state-of-the-art deep learning functionality to the JVM.
Solr is a great tool to have in the data scientist's toolbox. In this talk, I walk through several demos of applying Solr to data science activities and explore various use cases for Solr and data science.
Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility that is especially useful when an initial data load is required but the hardware is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach, capable of enlisting hundreds of worker nodes to ingest data, tracking their progress, and relinquishing them back to the cloud when the job is done. The data set discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. The same basic approach was also used to make three sizes of PNG thumbnails of the patent grant TIFF images; in that case 150 worker nodes were used to generate 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer, along with tricks for using XSLT on very large XML documents.
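On that last point, one way to run XSLT over very large XML documents without building a full DOM is to stream the input through a StAX reader. The sketch below is only illustrative, not from the talk: the stylesheet and file names are hypothetical, and whether memory stays flat also depends on what the stylesheet itself does.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StreamingXslt {
    public static void main(String[] args) throws Exception {
        // Hypothetical stylesheet that maps a patent grant file to Solr XML.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("patent-to-solr.xsl"));
        // A StAX reader lets the transformer pull events instead of
        // loading a multi-gigabyte document into memory at once.
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("ipg100105.xml"));
        t.transform(new StAXSource(reader),
                new StreamResult(new FileOutputStream("solr-docs.xml")));
    }
}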
LucidWorks SiLK is an open source stack that combines Lucene/Solr with best-in-class open source data ingestion and analytics tools such as Flume, LogStash, and Kibana. This webinar will explore the features of SiLK and provide attendees with valuable information on how they can benefit from the following:
- A powerful UI to analyze time series data stored in Lucene/Solr
- Creating and sharing visualizations, dashboards and reports
- Discovery and analysis of data coming from servers, applications, devices and more
- Exploration of click, geospatial and social data in ways previously unimaginable
Join Apache Solr committer and Lucidworks engineer Tim Potter for a webinar to learn how to unlock and understand your big data - and get the most out of your Hadoop investment.
Slides for a talk.
Talk abstract:
In the dark of the night, if you listen carefully enough, you can hear databases cry. But why? As developers, we rarely consider what happens under the hood of widely used abstractions such as databases. As a consequence, we rarely think about the performance of databases. This is especially true of less widespread, but often very useful, NoSQL databases.
In this talk we will take a close look at NoSQL database performance, peek under the hood of the most frequently used features to see how they affect performance and discuss performance issues and bottlenecks inherent to all databases.
A 1-hour intro to search, Apache Lucene and Solr, and LucidWorks Search. Contains a quick start with LucidWorks Search and a demo using financial data (see GitHub project: http://bit.ly/lws-financial), as well as some basic vocabulary and search explanations.
From a student to an Apache committer: the practice of Apache IoTDB (jixuan1989)
This talk was given by Xiangdong Huang, a PPMC member of the Apache IoTDB (incubating) project, at the Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays an increasingly important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and the industrial Internet. Many companies have gradually increased their participation in the open source community, and developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source projects and communities to the world.
The invited guests of this lecture all come from the ASF community, including the chairman of the Apache Software Foundation, three Apache members, top-5 Apache code committers (according to the Apache annual report), the first committer on the Hadoop project in China, several Apache project mentors and VPs, and many Apache committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
Spark, or how to process data at lightning speed (Alexis Seigneurin)
Spark is part of the new generation of data-processing frameworks built on Hadoop. The tool uses memory aggressively to deliver processing times up to 100 times faster than Hadoop. In this session, we will discover the principles of data processing (notably MapReduce) and the options available for setting up a cluster (ZooKeeper, Mesos...). We will review the different modules offered by the framework, in particular Spark Streaming for processing continuous streams of data.
Presentation given at Ippon on December 11, 2014.
Apache Storm vs. Spark Streaming - two stream processing platforms compared (Guido Schmutz)
Storm as well as Spark Streaming are open-source frameworks supporting distributed stream processing. Storm was developed by Twitter and is a free and open source distributed real-time computation system that can be used with any programming language. It is written primarily in Clojure and supports Java by default. Spark is a fast and general engine for large-scale data processing, designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala. This presentation shows how you can implement stream processing solutions with the two frameworks, discusses how they compare, and highlights the differences and similarities.
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014 (Shalin Shekhar Mangar)
The traditional and typical search use case is one large search collection distributed among many nodes and shared by all users. However, there is a class of applications which need a large number of small or medium collections that can be used, managed, and scaled separately. This talk will cover our effort in helping a client set up a large-scale SolrCloud deployment with thousands of collections running on hundreds of nodes. I will describe the bottlenecks that we found in SolrCloud when running a large number of collections. I will also take you through the multiple features and optimizations that we contributed to Apache Solr to reduce or remove the choke points in the system. Finally, I will talk about the benchmarking process and the lessons learned from supporting such an installation in production.
Battle of the giants: Apache Solr vs ElasticSearch (Rafał Kuć)
Slides from my talk during ApacheCon EU 2012 - "Battle of the giants: Apache Solr vs ElasticSearch". Video available at http://player.vimeo.com/video/55645629
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch (NetConstructor, Inc.)
A search faceting presentation and comparison of Big Data textual content indexing/search analytics solutions. The presentation focuses on comparing the open source solutions provided by Apache Solr with those of ElasticSearch.
Got data? Let's make it searchable! This interactive presentation will demonstrate getting documents into Solr quickly, provide some tips on adjusting Solr's schema to better match your needs, and finally discuss how to showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.
Solr search engine with multiple table relations (Jay Bharat)
Here you can learn how to use the Solr search engine and implement it in your application, for example with PHP/MySQL.
I introduce how to handle data from multiple tables in Solr.
Building Enterprise Search Engines using Open Source Technologies (Rahul Singh)
Enterprise Search is a challenging problem for most organizations. Public search technologies such as Google can index content and use link popularity to rank content in addition to the basic keyword matches. Enterprise Search is different. Sometimes it requires specially designed indexes as well as several processing steps.
At the U.S. Patent & Trademark Office, part of the Department of Commerce, a team of professionals is building the next generation of search tools using open source technologies. Like any large undertaking, it’s not a simple plug and play project.
Main topics to be covered in this talk:
+ Architectures for Large Scale Enterprise Search
+ Leveraging Apache Cassandra & Spark
+ Customizing / Configuring Apache Solr and Indexing
+ Writing a custom Parser for Solr in Scala
Attendees will learn how eBay Germany has implemented Solr, why Solr was selected, which Solr features are utilized, and how Solr is configured and used in production. Recommended best practices will be profiled along with eBay Kleinanzeigen's plans for future deployment of Solr.
Anyone who has tried integrating search into their application knows how good and powerful Solr is, but has always wished it were simpler to get started and simpler to take to production.
I will talk about the recent features added to Solr that make it easier for users, and some of the changes we plan to add soon to make the experience even better.
Video that accompanies this presentation at: http://www.youtube.com/watch?v=1t3Z2pJyulA
Join us for a guided tour of the Alfresco SOLR integration and new search sub-systems. We’ll discuss how it works, the limitations of eventual consistency, guidance for configuration and set-up. We’ll also cover the steps required to migrate, improved PATH performance, in-query ACL evaluation, cross-language support and monitoring as well as performance.
Similar to ApacheCon Europe 2012 - Big Search 4 Big Data (20)
Smarter search drives value to your business. Delivering search that matches users to the right content is what you care about. But organizations often get stuck getting there. It turns out that you need quite a number of very different ingredients to deliver tremendous search. It can make your head spin! To help you think through where your team is on its road to smarter search, Pugh introduces the maturity model used by OpenSource Connections and walks you through a very concrete method to inventory needed skills and translate that into search roles for your team. He shows how to measure your capabilities in key areas of search to drive better ROI from search.
The right path to making search relevant - Taxonomy Bootcamp London 2019 (OpenSource Connections)
Three aspects of search quality; focusing on relevance; why this is not just a technology problem; measuring search maturity & relevance; open source tools and techniques; Solr and Elasticsearch
Payloads have been a powerful aspect of Lucene for a long time, but have only had limited exposure in Solr. The Tika project has only recently finished integrating the powerful Tesseract OCR library, bringing the prospect of OCR to the masses.
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl (OpenSource Connections)
Over the past year, the POLITICO team has developed a recommendation system for our users, which recommends not only news content to read but also news topics to subscribe to. This talk will discuss our development path, including dead-ends and performance trade-offs. In the end, the team produced a system based on search technology (in our case, Elasticsearch) and refined by machine learning techniques to achieve a balance between personalization and serendipity.
With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation that can be used to implement semantic/conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, including LSH, vector quantization, and k-means trees, and compare their performance in terms of speed and relevancy. Finally, I will describe how each technique can be implemented efficiently in a Lucene-based search engine such as Solr or Elasticsearch.
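For readers new to the topic, here is a minimal sketch (not from the talk) of the brute-force baseline those techniques speed up: re-ranking a candidate set pulled from the inverted index by cosine similarity between stored document vectors and the query vector.

import java.util.*;

public class CosineRerank {
    // Cosine similarity between two dense vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Re-rank candidate doc ids (e.g. the top N from a token query)
    // by similarity of their vectors to the query vector.
    static List<String> rerank(double[] query, Map<String, double[]> candidates) {
        List<String> ids = new ArrayList<>(candidates.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> cosine(query, candidates.get(id))).reversed());
        return ids;
    }
}

Techniques like LSH, vector quantization, and k-means trees exist precisely to avoid scoring every candidate this way.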
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger (OpenSource Connections)
To optimally interpret most natural language queries, it is necessary to understand the phrases, entities, commands, and relationships represented or implied within the search. Knowledge graphs serve as useful instantiations of ontologies which can help represent this kind of knowledge within a domain.
In this talk, we'll walk through techniques to build knowledge graphs automatically from your own domain-specific content, how you can update and edit the nodes and relationships, and how you can seamlessly integrate them into your search solution for enhanced query interpretation and semantic search. We'll have some fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "bbq near haystack" into
{ filter:["doc_type":"restaurant"], "query": { "boost": { "b": "recip(geodist(38.034780,-78.486790),1,1000,1000)", "query": "bbq OR barbeque OR barbecue" } } }
We'll also specifically cover use of the Semantic Knowledge Graph, a particularly interesting knowledge graph implementation available within Apache Solr that can be auto-generated from your own domain-specific content and which provides highly-nuanced, contextual interpretation of all of the terms, phrases and entities within your domain. We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding within your search engine.
For e-commerce applications, matching users with the items they want is the name of the game. If they can't find what they want then how can they buy anything?! Typically this functionality is provided through search and browse experience. Search allows users to type in text and match against the text of the items in the inventory. Browse allows users to select filters and slice-and-dice the inventory down to the subset they are interested in. But with the shift toward mobile devices, no one wants to type anymore - thus browse is becoming dominant in the e-commerce experience.
But there's a problem! What if your inventory is not categorized? Perhaps your inventory is user generated or generated by external providers who don't tag and categorize the inventory. No categories and no tags means no browse experience and missed sales. You could hire an army of taxonomists and curators to tag items - but training and curation will be expensive. You can demand that your providers tag their items and adhere to your taxonomy - but providers will buck this new requirement unless they see obvious and immediate benefit. Worse, providers might use tags to game the system - artificially placing themselves in the wrong category to drive more sales. Worst of all, creating the right taxonomy is hard. You have to structure a taxonomy to realistically represent how your customers think about the inventory.
Eventbrite is investigating a tantalizing alternative: using a combination of customer interactions and machine learning to automatically tag and categorize our inventory. As customers interact with our platform - as they search for events and click on and purchase events that interest them - we implicitly gather information about how our users think about our inventory. Search text effectively acts like a tag and a click on an event card is a vote for that clicked event is representative of that tag. We are able to use this stream of information as training data for a machine learning classification model; and as we receive new inventory, we can automatically tag it with the text that customers will likely use when searching for it. This makes it possible to better understand our inventory, our supply and demand, and most importantly this allows us to build the browse experience that customers demand.
In this talk I will explain in depth the problem space and Eventbrite's approach in solving the problem. I will describe how we gathered training data from our search and click logs, and how we built and refined the model. I will present the output of the model and discuss both the positive results of our work as well as the work left to be done. Those attending this talk will leave with some new ideas to take back to their own business.
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticsearch (OpenSource Connections)
Recently Elasticsearch has introduced a number of ways to improve the search relevance of your documents based on numeric features. In this talk I will present the newly introduced field types "rank_feature", "rank_features", "dense_vector", and "sparse_vector", and discuss in what situations and how they can be used to boost the scores of your documents. I will also talk about the inner workings of queries based on these fields, and related performance considerations.
Haystack 2019 - Architectural considerations on search relevancy in the conte... (OpenSource Connections)
With an increasing amount of relevancy factors, relevancy fine-tuning becomes more complex as changing the impact of factors produces increasingly more unintended side effects. In recent years, there has been a lot of discussion about how learning algorithms can replace manual relevancy fine-tuning in order to manage this complexity. However, discussions about the challenge of relevancy should additionally consider architectural aspects. Especially microservice-based architectures provide many ways to encapsulate and to separate complexities of search solutions, which facilitates optimizing the search as well as locating and fixing problems.
Generally, relevancy factors can be assigned to three different groups, each handled at a different stage of search request processing. The first group contains contextual factors that depend on certain characteristics of a query, such as query-related boosts lifting top-sellers for particular queries, or category-related boosts to distinguish products from their accessories. Such contextual factors can be handled as a step in the preprocessing of queries. The respective boosting information can simply be appended to the query before it is actually sent to the search engine. Ideally, the normalization of the query is done beforehand.
The second group contains factors that are considered for all queries in more or less the same way, e.g. a ranking function based on keyword occurrences, product topicality, or total sales. Factors in this group can be handled directly by configuring the search engine.
The third group contains situational factors. For instance, a certain product might be a good match for a certain query in general, but due to situational circumstances it should not appear among the top five products (e.g. because it is out of stock). Such situational factors can be handled by re-sorting result sets after they are returned by the search engine.
The handling of the different factors within successive stages of search request processing will be discussed from an architectural perspective. Implications for applying learning algorithms and the implementation of a personalized search will be considered.
Does your search application include a custom query syntax with various search operators such as Booleans, proximity, term or phrase frequency, capitalization, quoted text or as-is operator, and other advanced operators? Although most search applications offer a natural language-oriented search box, some advanced applications may also offer a custom query syntax for advanced users or automated tasks. The Lucene "classic" query operators that are supported by the Solr edismax query parser (Boolean, phrase with slop, wildcard, etc.) cover a good amount of use cases, but they only get you so far. In this talk, we will explore various strategies to support a custom and advanced query syntax in Solr, covering a spectrum of options from leveraging the out-of-the-box Solr query DSL, to a custom Solr query parser, and hybrid solutions in between. We will identify the options' pros and cons, discuss relevancy considerations, and illustrate the options in Java.
Haystack 2019 - Establishing a relevance focused culture in a large organization (OpenSource Connections)
For a relevance engineer one of the most difficult tasks in the tuning process is to convince others in the organization that this is a joint effort. Even the brightest search guru doesn't get very far when working in isolation, so establishing cross-collaboration through the organization is essential. But how to get there?
On top of that, in a large organization a relevance engineer often works on multiple seemingly unrelated search projects. The challenge is not to get drowned in building custom solutions for each project, but to design generic and re-usable strategies which solve many problems at once.
In this session we'll discuss how to build a widely supported basis for search quality improvements in an organization. It is full of practical tips and examples which could help you in establishing a cross-functional culture that is optimal for relevance tuning. It also zooms in on a holistic approach to solving multiple equivalent search issues at once.
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz... (OpenSource Connections)
Relevance metrics like NDCG or ERR require graded judgements to evaluate query relevance performance. But what happens when we don't know what 'good' looks like ahead of time? This talk will look at using click modeling techniques to infer relevance judgements from user interaction logs.
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via (OpenSource Connections)
The New York Times has had search for a long time but 2018 was the year in which the company engaged with relevance in a deep way. The aim of this talk is to share what we've learned as we've increased our search sophistication and some of the challenges we still face.
Some of the techniques we've adopted in this past year include offline metrics testing, reflective testing, and user engagement metrics. We now have a process in place to quickly get mapping changes out to production. As a team we also now have a vocabulary for talking about relevance, and we can use it to discuss trade-offs and goals in conjunction with our metrics.
We hope this talk is of use to those who've put off working on search relevance due to fear, uncertainty, or ambivalence. We will talk about how we went from working on everything but search relevance to finally pulling back the curtain on the search system. We hope what we've learned can help others get started.
3. Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
5. Telling some war stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
6. Not an intro to SolrCloud/ElasticSearch!
• Great round table discussion yesterday led by Mark Miller
• SolrCloud 4 Architecture talk in this room NEXT!
• Solr4 vs Elastic Search at 4:45 PM TODAY!
7. Background for Client X’s Project
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• 100’s of millions of documents to search
• Aggressive timeline.
• All the data must be searched per query.
• Limited selection of tools available.
• On Solr 3.x line
8. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
9. Boy meets Girl Story
[Architecture diagram, repeated with different stages highlighted on slides 9-13: Content Files and Metadata flow into an Ingest Pipeline, which feeds a bank of Solr instances.]
17. Make it easy to change sharding
public void run(Map options, List<SolrInputDocument> docs) throws
        InstantiationException, IllegalAccessException, ClassNotFoundException {
    // Look up the concrete sharding strategy by class name, so swapping
    // strategies is a configuration change, not a code change.
    IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
    indexStrategy.configure(options);
    for (SolrInputDocument doc : docs) {
        indexStrategy.addDocument(doc);
    }
}
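The deck shows only the calling code; a sketch of what the strategy abstraction behind it might look like (the interface and implementation below are assumptions, not shown in the slides):

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical interface matching the calling code above.
public interface IndexStrategy {
    void configure(Map options);
    void addDocument(SolrInputDocument doc);
}

// One possible implementation: route each document to a shard by
// hashing its id modulo the number of Solr cores.
class ModShardIndexStrategy implements IndexStrategy {
    private List<SolrServer> shards; // one client per shard, built in configure()

    public void configure(Map options) {
        // e.g. read shard URLs out of options and build a SolrServer for each
    }

    public void addDocument(SolrInputDocument doc) {
        int shard = Math.abs(doc.getFieldValue("id").hashCode()) % shards.size();
        try {
            shards.get(shard).add(doc);
        } catch (Exception e) {
            throw new RuntimeException("Failed to index " + doc.getFieldValue("id"), e);
        }
    }
}

Keeping the shard arithmetic behind an interface makes trying a different sharding scheme (by month, week, or hour) a configuration change rather than a rewrite.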
18. Separate JVM from Solr Cores
• Step 1: Fire up empty Solr’s on all the servers (nohup &).
• Step 2: Verify they started cleanly.
• Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2)
• Step 4: Create an “aggregator” core, passing in urls of Cores. (&property.shards=)
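Steps 3 and 4 can also be scripted with SolrJ's core admin API instead of curl; a sketch, assuming the era-appropriate SolrJ 3.x client classes (the host and core names are the slide's examples):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCores {
    public static void main(String[] args) throws Exception {
        // Talk to the core admin handler of an already-running, empty Solr JVM.
        SolrServer admin = new CommonsHttpSolrServer("http://search1.o19s.com:8983/solr");
        // Create a core named "run2"; the instanceDir is assumed to exist on the server.
        CoreAdminRequest.createCore("run2", "run2", admin);
    }
}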
23. Don’t Move Files
• SCP across machines is slow/error prone
• NFS share, single point of failure.
• Clustered file system like GFS (Global File System) can have “fencing” issues
• HDFS shines here.
• ZooKeeper shines here.
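To illustrate the HDFS point, a minimal sketch of publishing a batch file into a shared namespace instead of scp-ing it between machines (the namenode URL and paths are illustrative assumptions):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PublishToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative namenode URL; every worker sees the same namespace,
        // so no per-machine file copying is needed.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);
        fs.copyFromLocalFile(new Path("/tmp/batch-0001.avro"),
                             new Path("/ingest/incoming/batch-0001.avro"));
    }
}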
28. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
29. Using Solr as key/value store
[Architecture diagram: the ingest pipeline from earlier, with Content Files and Metadata flowing in, now also writes to a dedicated Solr key/value cache alongside the search Solrs.]
30. Using Solr as key/value store
• thousands of queries per second without real time get:
http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
• how fast with real time get?
http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
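The same two lookups from SolrJ, as a sketch: real time get requires Solr 4's update log to be enabled, and routing a SolrQuery to the /get handler is one way to reach it from SolrJ (the core name and document id are the slide's examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KeyValueLookup {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer(
                "http://localhost:8983/solr/run2_enrichment");

        // Standard lookup through the select handler.
        SolrQuery byQuery = new SolrQuery("id:DOC45242");
        byQuery.setFields("entities", "html");
        QueryResponse r1 = solr.query(byQuery);

        // Real time get: route the request to the /get handler instead.
        SolrQuery byGet = new SolrQuery();
        byGet.setRequestHandler("/get");
        byGet.set("id", "DOC45242");
        byGet.setFields("entities", "html");
        QueryResponse r2 = solr.query(byGet);
    }
}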
31. Push schema definition to the application
• Not “schema less”
• Just different owner of schema!
• Schema may have common set of fields like id, type, timestamp, version
• Nothing required.
q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
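With the application owning the schema, typing rides along in field-name suffixes via Solr's dynamic fields. A small sketch, assuming the stock example schema's dynamic-field rules (the values are made up; intensity_i and TYPE echo the slide's query):

import org.apache.solr.common.SolrInputDocument;

public class StreetlampDoc {
    // Build a document whose field types are carried in dynamic-field
    // suffixes, e.g. *_i for int and *_dt for date in the example schema.
    static SolrInputDocument reading() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "lamp-0042");                      // made-up id
        doc.addField("TYPE", "streetlamp_monitor");           // common "type" field
        doc.addField("intensity_i", 68);                      // *_i -> integer
        doc.addField("observed_dt", "2012-11-07T22:15:00Z");  // *_dt -> date
        return doc;
    }
}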
32. Don’t do expensive things in Solr
• Tika content extraction aka Solr Cell
• UpdateRequestProcessorChain
37. Beware JavaBin
[Diagram: the same ingest pipeline now spans a Solr 3.4 key/value cache and Solr 4 search cores, raising the question: which SolrJ version do I use?]
38. No JavaBin
• Avoid Jarmaggeddon
• Reflection? Ugh.
39. Avro!
• Supports serialization of data readable from multiple languages
• It’s smart XML, w/o the XML!
• Handles forward and reverse versions of an object
• Compact and fast to read.
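A minimal sketch of Avro's generic API serializing one pipeline record (the two-field schema is a made-up stand-in for the real document envelope):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroEnvelope {
    public static void main(String[] args) throws Exception {
        // Hypothetical envelope schema; a real pipeline would grow fields
        // over time, relying on Avro's schema-evolution rules.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Doc\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"html\",\"type\":\"string\"}]}");

        GenericRecord doc = new GenericData.Record(schema);
        doc.put("id", "DOC45242");
        doc.put("html", "<p>...</p>");

        // Compact binary encoding, readable from any language with the schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(doc, encoder);
        encoder.flush();
    }
}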
41. Tika as a pipeline?
• Auto detects content type
• Metadata structure has all the key/value needed for Solr
• Allows us to scale up with the Behemoth project.
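As a sketch of that idea: run Tika's auto-detecting parser outside Solr and copy its key/value Metadata straight onto a SolrInputDocument (the meta_*_s field naming is an assumption):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtract {
    static SolrInputDocument extract(String path) throws Exception {
        Metadata metadata = new Metadata();
        BodyContentHandler text = new BodyContentHandler(-1); // no write limit
        try (InputStream in = new FileInputStream(path)) {
            // Content type is auto-detected; no per-format code needed.
            new AutoDetectParser().parse(in, text, metadata);
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("content", text.toString());
        // Tika's Metadata is already key/value: copy it across as-is.
        for (String name : metadata.names()) {
            doc.addField("meta_" + name + "_s", metadata.get(name));
        }
        return doc;
    }
}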
42. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
43. Upgrade Lucene Indexes Easily
• Don’t reindex!
• Try out new versions of Lucene based search engines.
David Lyle
java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
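The same upgrade can also be driven programmatically; a sketch of the Lucene 4.x shape of the API (constructor signatures vary between Lucene versions, so treat this as an assumption):

import java.io.File;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File(args[0]));
        // Rewrites all segments into the current index format, in place.
        new IndexUpgrader(dir, Version.LUCENE_40, System.out, false).upgrade();
    }
}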
55. Building a Patents Index
[Bar chart, machine count vs. indexing time: 1 machine takes 5 days, 5 machines take 3 days, 300 machines take 30 minutes.]
What happens when we want to index 2 million patents in 30 minutes?
56. Amazon AWS is Good but...
• EC2 is costly
• Issues of access to internal data
• Firewall and security
57. Can we Cycle Scavenge?
• Data Center is heavily used 9 to 5 EST.
• Lesser, but significant load 8 to 10 PM EST
• Minimal CPU load at night.
• Amazon Spot Pricing for EC2
• SETI@home
• JavaGenes - Genetics processing
• Condor Platform (http://research.cs.wisc.edu/condor/)
58. Balancing Load
[Chart: Production Load vs. Batch Jobs across a 24-hour day (1 AM to 11 PM). Batch jobs are scheduled into the overnight trough when production load is minimal.]
59. Do I need Failover?
• Can I build quickly?
• Do I have a reliable cluster of servers?
• Am I spread across data centers?
• Is sooo 90’s....
60. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
64. Thank you!
Questions?
Nervous about speaking up? Ask me about it later!
• epugh@o19s.com
• @dep4b
• www.opensourceconnections.com
Editor's Notes
Search was the original big data problem. Now search is back, but with a new, cooler name, “Big Data”, and search is the dominant metaphor for exposing big data sets to business users so they can make actual decisions. Big Data is rapidly changing fields such as healthcare, and I maintain that the next revolution in healthcare won't be via a doctor wielding a scalpel, but via a doctor wielding a mouse.
SOLR-284, back in July 07, was a first cut at a content extraction library before Tika came along.
And I love Agile development processes. I think of agile as business -> requirements -> development -> testing -> systems administration.
And I don't mean this as a shot against Hadoop, but with the right hardware you can get a lot done in bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today on building large scaled-out ingestors.
Notice our property style? It made it easy to read in properties in both Bash and Java!
Try sharding at different sizes using mod. Try sharding by month, or week, or hour, depending on your volume of data.
We had huge leftover “enterprise” boxes with ginormous amounts of RAM and CPU. We were IO bound.
The verbose:gc and +PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 based on this data on one project.
Again, horse-racing two slaves can help. You can also pass in the connection information via the jconsole command line, which makes it easier to monitor a set of Solrs.
I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don't forget to store everything!)
You have many fewer Solrs than you do indexer processors.
Jukka did a great presentation yesterday.
Dollar Tree makes crap; the stores are always empty or missing items. You don't want your indexing to be like that. The space shuttle cost 500 MILLION dollars every time it launched. You don't want your indexing process to be like launching the space shuttle.
Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving.
HAL 9000 misbehaved. Runs every hour; looks at log files to determine if a Solr cluster is misbehaving. Especially important if you are on a cloud platform: they implement their servers on the cheapest commodity hardware.
Kaa the snake from The Jungle Book hypnotizing Mowgli. Danah Boyd, among others, has said that Big Data sometimes throws out thousands of years...