At Basis Technologies' Open Source Search conference, I talked about a project I worked on this past year and the lessons, both good and bad, that we learned.
The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network (GPSN, http://gpsn.uspto.gov), the next-generation multilingual search platform for the USPTO. GPSN was the USPTO's first public application deployed in the cloud, and it allowed a very small development team to build a discovery interface across millions of patents.
This case study will cover:
• How we leveraged the Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
• Innovative methods for converting XML-formatted data into usable information.
• Parsing through 5 TB of raw TIFF image data and converting it to a modern, web-friendly format.
• Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
• How we built “data sharing” features into the application to allow third party systems to build additional functionality on top of GPSN.
Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr.
War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
Chris Bradford & Matt Overstreet review several Cassandra use cases we’ve encountered in state and federal government. C* solves many big data problems when storing, enriching and improving access to data.
http://sigir2013.ie/industry_track.html#GrantIngersoll
Abstract: Apache Lucene and Solr are the most widely deployed search technology on the planet, powering sites like Twitter, Wikipedia, Zappos and countless applications across a large array of domains. They are also free, open source, extensible and extremely scalable. Lucene and Solr also contain a large number of features for solving common information retrieval problems ranging from pluggable posting list compression and scoring algorithms to faceting and spell checking. Increasingly, Lucene and Solr also are being (ab)used to power applications going way beyond the search box. In this talk, we'll explore the features and capabilities of Lucene and Solr 4.x, as well as look at how to (ab)use your search engine technology for fun and profit.
A 1 hour intro to search, Apache Lucene and Solr, and LucidWorks Search. Contains a quick start with LucidWorks Search and a demo using financial data (See Github prj: http://bit.ly/lws-financial) as well as some basic vocab and search explanations
Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility that is especially useful when an initial data load is required but the hardware is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach capable of enlisting hundreds of worker nodes to ingest data, track their progress, and relinquish them back to the cloud when the job is done. The data set that will be discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. Also, the same basic approach was used to make three sizes of PNG thumbnails of the patent grant TIFF images. In that case 150 worker nodes were used to generate 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer and tricks for using XSLT on very large XML documents.
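On that last point, the heart of an XSLT pass can be as small as the JDK's built-in javax.xml.transform pipeline; a minimal sketch, with file names that are illustrative rather than the talk's actual pipeline:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class GrantTransform {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet once, then push the source document through it.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("patent-grant.xsl"));
        t.transform(new StreamSource("grants.xml"), new StreamResult("solr-docs.xml"));
    }
}

For very large documents the usual trick is to split the input into record-sized chunks first, since XSLT 1.0 processors generally buffer the whole input tree in memory.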
Solr is a great tool to have in the data scientist's toolbox. In this talk, I walk through several demos of using Solr for data science activities and explore various use cases for Solr and data science.
The ultimate guide for Elasticsearch plugins - Itamar
Elasticsearch is a great product - for search, for scale, for analyzing data, and much more. But sometimes you need to do something that is not supported by Elasticsearch out of the box, and that's where plugins come into play.
Join me in this talk to explore the plugins land of Elasticsearch. We will discuss the various ways Elasticsearch can be extended, and the various types of plugins available to do that. By giving concrete examples and browsing the large selection of pre-made plugins, we will see how plugins can help us overcome various challenges. We will also discuss possible issues with plugins, and ways to work around them.
Finally, we will discuss scenarios in which custom plugin development is necessary and can really save the day. By showing a demo of one such scenario, and the way we built and debugged a plugin to solve it, we will complete the picture of the Elasticsearch plugin land, and hopefully inspire you to create your own!
In just a few short years, search has quickly evolved from being a small text box in the nether regions of a website to being front and center in our lives. Increasingly, however, search engine technology is also being used for practical, real-time recommendations, events processing, complex spatial functionality and time series analysis capable of not only matching users' queries in text, but also driving real-time decision making and analytics. In fact, open source Apache Lucene/Solr can do all of this and more by taking advantage of new data structures and algorithms that complement more traditional IR approaches. In this demo-driven talk, Lucene committer Grant Ingersoll will take a look at some of the new and exciting ways users are leveraging Lucene/Solr and related technology to drive deeper insight into information needs that go beyond keywords in a text box.
Practical Elasticsearch - real world use cases - Itamar
Elasticsearch - a search and real-time analytics server based on Apache Lucene - is gaining a lot of popularity lately, and is being used world-wide to power many sophisticated systems. While many use it for the "standard" stuff (that is, simple full-text search and real-time log analysis), there are some really interesting usage patterns that can prove useful in many real-world scenarios. In this talk we will briefly talk about Elasticsearch and its common use cases, and then showcase some less common ones that leverage Elasticsearch in interesting and oftentimes innovative ways.
Elasticsearch - Distributed search & analytics on BigData made easy - Itamar
Elasticsearch is a cloud-ready, super scalable search engine which is gaining a lot of popularity lately. It is mostly known for being extremely easy to set up and integrate with any technology stack. In this talk we will introduce Elasticsearch, and start by looking at some of its basic capabilities. We will demonstrate how it can be used for document search and even log analytics for DevOps and distributed debugging, and peek into more advanced usages like real-time aggregations and percolation. Obviously, we will make sure to demonstrate how Elasticsearch can be scaled out easily to work on a distributed architecture and handle pretty much any load.
Getting a Neural Network Up and Running with OpenLab - Melvin Hillsman
For the everyday developer wanting to explore AI/ML, access to hardware can be challenging to obtain and maintain, even for the most rudimentary applications and testing. Needing to go beyond a single development machine running locally only increases the challenge. OpenLab is curated infrastructure, accessible to open source projects and individuals working within and on open source projects, designed to help address this use case. Access to GPU, FPGA, IoT, and more allows HPC, AI/ML, Deep Learning, or other testing and applications. In this presentation, we will walk through getting an account with OpenLab, obtaining resources, and getting a neural network up and running with an app that will bring back great childhood memories.
General introduction to agile practices like Scrum and Kanban. Also covers which situations Agile suits best, which situations Agile doesn't help with, and what an Agile team should look like. This deck is a general intro to Agile for OpenSource Connections clients.
This case study concerns moving large amounts of patent data from Cassandra to Solr: how we approached the problem, how we introduced Spark as a solution, and how we optimized the Spark job. I will cover:
* Understanding the parts of a Spark job: which components run where, and common issues.
* Adding metrics to show where the pain points are in your code (see the sketch after this list).
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.
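A minimal sketch of the metrics bullet above, using Spark accumulators; the class name and the line-level filter are illustrative, not the actual GPSN job:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

public class PatentEtlMetrics {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("patent-etl").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Accumulators surface per-stage counts in the Spark UI,
        // which is where the pain points show up.
        LongAccumulator parsed = spark.sparkContext().longAccumulator("parsed");
        LongAccumulator failed = spark.sparkContext().longAccumulator("failed");

        JavaRDD<String> raw = jsc.textFile(args[0]);
        JavaRDD<String> ok = raw.filter(line -> {
            if (line.contains("<patent")) { parsed.add(1); return true; }
            failed.add(1);
            return false;
        });
        ok.saveAsTextFile(args[1]);
        spark.stop();
    }
}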
After an introduction to the basic tenets of Agile and some Agile practices, this presentation to Richmond SPIN (Software Process Improvement Network) talks about ways to convince your organization or clients to use Agile software development practices. Based on a presentation given at Agile 2009 by Arin Sime, Senior Consultant with OpenSource Connections.
Lucene - 10 years of more or less classic uses - Sylvain Wallez
Lessons learned from four projects using Lucene in very different contexts: document search, e-commerce, an advertising engine, and music-affinity matching.
Search is everywhere, and therefore so is Apache Lucene. While it provides amazing out-of-the-box defaults, there are enough weird projects that require custom search scoring and ranking. In this talk, I'll walk through how to use Lucene to implement your custom scoring and search ranking. We'll see how you can achieve amazing power (and responsibility) over your search results. We'll see the flexibility of Lucene's data structures and explore the pros/cons of custom Lucene scoring vs other methods of improving search relevancy.
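As a taste of what custom scoring can look like - a sketch assuming Lucene's TFIDF-era Similarity API, not code from this talk - here is a Similarity that flattens term frequency so repeated keywords stop dominating the ranking:

import org.apache.lucene.search.similarities.ClassicSimilarity;

public class FlatTfSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        // Binary term frequency: a term either matched or it didn't.
        return freq > 0 ? 1.0f : 0.0f;
    }
}

It has to be installed on both sides of the index: indexWriterConfig.setSimilarity(new FlatTfSimilarity()) at write time and indexSearcher.setSimilarity(new FlatTfSimilarity()) at query time.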
Solr + Hadoop - Search your Big Data system with ease - francelabs
A Hadoop system aims to make Big Data easy to manage, both in terms of storage and in terms of computation. It does not focus on exploring the data it hosts. The Apache Solr search engine is becoming the reference search tool in the Hadoop ecosystem, adopted by Cloudera and Hortonworks. In this talk, they first present a history of the two projects, to properly understand how they relate. They then explain the different possible levels of integration, and finish with an integration demo, to show the advantages of using Solr to explore the big data in a Hadoop system.
From a student to an Apache committer: practice of Apache IoTDB - jixuan1989
This talk was given by Xiangdong Huang, a PPMC member of the Apache IoTDB (incubating) project, at the Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays a more and more important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and the industrial Internet. Many companies have gradually increased their participation in the open source community. Developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source software projects and communities to the world.
The invited guests of this lecture are all from the ASF community, including the chairman of the Apache Software Foundation, three Apache members, Top 5 Apache code committers (according to the Apache annual report), the first committer on the Hadoop project in China, several Apache project mentors or VPs, and many Apache committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
Agile Data: Building Hadoop Analytics Applications - DataWorks Summit
Mining data requires a deep investment in people and time. How can you be sure you're building the right models? What tools help you connect with the customer's needs? With this hands-on presentation, you'll learn a flexible toolset and methodology for building effective analytics applications. Agile Data (the book) shows you how to create an environment for exploring data, using lightweight tools such as Python, Apache Pig, and the D3.js (Data-Driven Documents) JavaScript library. You'll learn an iterative approach that allows you to quickly change the kind of analysis you're doing, as you discover what the data is telling you. All the example code in this book is available as working web applications. We will cover how to:
* Build an application to mine your own email inbox
* Use different data structures and algorithms to extract multiple features from a single dataset, and learn how different perspectives can yield insight
* Rapidly boot your applications as simple front-ends to a document store
* Add features driven by descriptive and inferential statistics, machine learning, and data visualization
* Gather usage data and talk to real users to help guide your data-driven exploration
Data Science at Scale: Using Apache Spark for Data Science at Bitly - Sarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
The economies of scaling software - Abdel Remani - jaxconf
You spend your precious time building the perfect application. You do everything right. You carefully craft every piece of code and rigorously follow the best practices and design patterns, you apply the most successful methodologies software engineering has to offer with discipline, and you pay attention to the most minuscule of details to produce the best user experience possible. It all pays off eventually, and you end up with a beautiful code base that is not only reliable but also performs well. You proudly watch your baby grow, as new users come in bringing more traffic your way and craving new features. You keep them happy and they keep coming back. One morning, you wake up to servers crashing under load, and data stores failing to keep up with all the demand. You panic. You throw in more hardware and try to optimize, but the hungry crowd that was once your happy user base catches up to you. Your success is slipping through your fingers. You find yourself stuck between having to rewrite the whole application and a hard place. It's frustrating, dreadful, and painful to say the least. Don't be that guy! Save your soul before it's too late, and come learn how to build, deploy, and maintain enterprise-grade Java applications that scale from day one. Topics covered include: parallelism, load distribution, state management, caching, big data, asynchronous processing, and static content delivery. Leveraging cloud computing, scaling teams, and DevOps will also be discussed. P.S. This session is more technical than you might think.
http://jaxconf.com/sessions/economies-scaling-software
Stackato presentation done at the Nordic Perl Workshop 2012 in Stockholm, Sweden
More information available at: https://logiclab.jira.com/wiki/display/OPEN/Stackato
Anyone who has tried integrating search in their application knows how good and powerful Solr is, but has always wished it were simpler to get started and simpler to take to production.
I will talk about the recent features added to Solr making it easier for users and some of the changes we plan on adding soon to make the experience even better.
Intro to Machine Learning with H2O and AWS - Sri Ambati
Navdeep Gill @ Galvanize Seattle - May 2016
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Time Series Anomaly Detection with Azure and .NET - Marco Parenzan
If you have any device or source that generates values over time (even a log from a service), you want to determine whether the time series is normal within a given time frame, or whether you can detect some anomalies. What can you do as a developer (not a data scientist) with .NET or Azure? Let's see how in this session.
Moving Quickly with Data Services in the Cloud - Matthew Dimich
How is the cloud changing data storage options for development teams at Thomson Reuters? Come hear how projects are changing the way they work with data in the cloud, and what role a centralized cloud team can play in helping your business get products to market more quickly without worrying about ending up on the front page of the news as the latest data breach. Any storage medium is up for discussion, but we'll primarily stick to relational databases, Elasticsearch, NoSQL, and object storage. This will be useful both to teams that are looking to just get started in AWS and to teams who already have production workloads in AWS. Although it assumes a basic knowledge of the relational database, Elasticsearch, and NoSQL options in AWS, you will be able to get value even if you haven't used those technologies before.
Similar to Building a Lightweight Discovery Interface for China's Patents @ NYC Solr/Lucene Meetup
Smarter search drives value to your business. Delivering search that matches users to the right content is what you care about. But organizations often get stuck getting there. It turns out that you need quite a number of very different ingredients to deliver tremendous search. It can make your head spin! To help you think through where your team is on its road to smarter search, Pugh introduces the maturity model used by OpenSource Connections and walks you through a very concrete method to inventory needed skills and translate that into search roles for your team. He shows how to measure your capabilities in key areas of search to drive better ROI from search.
The right path to making search relevant - Taxonomy Bootcamp London 2019 - OpenSource Connections
Three aspects of search quality; focusing on relevance; why this is not just a technology problem; measuring search maturity & relevance; open source tools and techniques; Solr and Elasticsearch
Payloads have been a powerful aspect of Lucene for a long time, but have only had limited exposure in Solr. The Tika project has only recently finished integrating the powerful Tesseract OCR library, bringing the prospect of OCR to the masses.
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl - OpenSource Connections
Over the past year, the POLITICO team has developed a recommendation system for our users, which recommends not only news content to read but also news topics to subscribe to. This talk will discuss our development path, including dead-ends and performance trade-offs. In the end, the team produced a system based on search technology (in our case, Elasticsearch) and refined by machine learning techniques to achieve a balance between personalization and serendipity.
With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation that can be used to implement semantic / conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, including LSH, vector quantization and k-means trees, and compare their performance in terms of speed and relevancy. Finally, I will describe how each technique can be implemented efficiently in a Lucene-based search engine such as Solr or Elasticsearch.
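As a sketch of the first of those techniques - random-hyperplane LSH, with illustrative parameters - a dense vector is reduced to a bit signature whose substrings can be indexed as ordinary terms:

import java.util.Random;

public class HyperplaneLsh {
    private final double[][] planes; // one random hyperplane per signature bit

    public HyperplaneLsh(int bits, int dims, long seed) {
        Random rnd = new Random(seed);
        planes = new double[bits][dims];
        for (double[] p : planes)
            for (int d = 0; d < p.length; d++) p[d] = rnd.nextGaussian();
    }

    // Similar vectors fall on the same side of most hyperplanes,
    // so their signatures share most bits.
    public String signature(double[] vec) {
        StringBuilder sig = new StringBuilder();
        for (double[] p : planes) {
            double dot = 0;
            for (int d = 0; d < vec.length; d++) dot += p[d] * vec[d];
            sig.append(dot >= 0 ? '1' : '0');
        }
        return sig.toString(); // index n-grams of this string as terms
    }
}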
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger - OpenSource Connections
To optimally interpret most natural language queries, it is necessary to understand the phrases, entities, commands, and relationships represented or implied within the search. Knowledge graphs serve as useful instantiations of ontologies which can help represent this kind of knowledge within a domain.
In this talk, we'll walk through techniques to build knowledge graphs automatically from your own domain-specific content, how you can update and edit the nodes and relationships, and how you can seamlessly integrate them into your search solution for enhanced query interpretation and semantic search. We'll have some fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "bbq near haystack" into
{ filter:["doc_type":"restaurant"], "query": { "boost": { "b": "recip(geodist(38.034780,-78.486790),1,1000,1000)", "query": "bbq OR barbeque OR barbecue" } } }
We'll also specifically cover use of the Semantic Knowledge Graph, a particularly interesting knowledge graph implementation available within Apache Solr that can be auto-generated from your own domain-specific content and which provides highly-nuanced, contextual interpretation of all of the terms, phrases and entities within your domain. We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding within your search engine.
For e-commerce applications, matching users with the items they want is the name of the game. If they can't find what they want then how can they buy anything?! Typically this functionality is provided through search and browse experience. Search allows users to type in text and match against the text of the items in the inventory. Browse allows users to select filters and slice-and-dice the inventory down to the subset they are interested in. But with the shift toward mobile devices, no one wants to type anymore - thus browse is becoming dominant in the e-commerce experience.
But there's a problem! What if your inventory is not categorized? Perhaps your inventory is user generated or generated by external providers who don't tag and categorize the inventory. No categories and no tags means no browse experience and missed sales. You could hire an army of taxonomists and curators to tag items - but training and curation will be expensive. You can demand that your providers tag their items and adhere to your taxonomy - but providers will buck this new requirement unless they see obvious and immediate benefit. Worse, providers might use tags to game the system - artificially placing themselves in the wrong category to drive more sales. Worst of all, creating the right taxonomy is hard. You have to structure a taxonomy to realistically represent how your customers think about the inventory.
Eventbrite is investigating a tantalizing alternative: using a combination of customer interactions and machine learning to automatically tag and categorize our inventory. As customers interact with our platform - as they search for events and click on and purchase events that interest them - we implicitly gather information about how our users think about our inventory. Search text effectively acts like a tag, and a click on an event card is a vote that the clicked event is representative of that tag. We are able to use this stream of information as training data for a machine learning classification model; and as we receive new inventory, we can automatically tag it with the text that customers will likely use when searching for it. This makes it possible to better understand our inventory, our supply and demand, and most importantly this allows us to build the browse experience that customers demand.
In this talk I will explain in depth the problem space and Eventbrite's approach in solving the problem. I will describe how we gathered training data from our search and click logs, and how we built and refined the model. I will present the output of the model and discuss both the positive results of our work as well as the work left to be done. Those attending this talk will leave with some new ideas to take back to their own business.
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticsearch - OpenSource Connections
Recently Elasticsearch has introduced a number of ways to improve the search relevance of your documents based on numeric features. In this talk I will present the newly introduced field types "rank_feature", "rank_features", "dense_vector", and "sparse_vector" and discuss in what situations and how they can be used to boost scores of your documents. I will also talk about the inner workings of queries based on these fields, and related performance considerations.
Haystack 2019 - Architectural considerations on search relevancy in the conte... - OpenSource Connections
With an increasing amount of relevancy factors, relevancy fine-tuning becomes more complex as changing the impact of factors produces increasingly more unintended side effects. In recent years, there has been a lot of discussion about how learning algorithms can replace manual relevancy fine-tuning in order to manage this complexity. However, discussions about the challenge of relevancy should additionally consider architectural aspects. Especially microservice-based architectures provide many ways to encapsulate and to separate complexities of search solutions, which facilitates optimizing the search as well as locating and fixing problems.
Generally, relevancy factors can be assigned to three different groups, each handled at a different stage of the search request processing. The first group contains contextual factors that depend on certain characteristics of a query, such as query-related boosts lifting up top-sellers for queries or category-related boosts to distinguish products from their accessories. Such contextual factors can be handled as a step of the preprocessing of queries. The respective boosting information can simply be appended to the query before it is actually sent to the search engine. Ideally, the normalization of the query is done beforehand.
The second group contains factors that are considered for all queries in more or less the same way, e.g. a ranking function based on keyword occurrences, product topicality or sales in total. Factors related to this group can be handled directly by configuring the search engine.
The third group contains situational factors. For instance, a certain product might be a good match for a certain query in general, but for situational circumstances it should not appear among the top five products (e. g. because it is out of stock). Such situational factors can be handled by resorting result sets, after they were returned by the search engine.
The handling of the different factors within successive stages of search request processing will be discussed from an architectural perspective. Implications for applying learning algorithms and the implementation of a personalized search will be considered.
Does your search application include a custom query syntax with various search operators such as Booleans, proximity, term or phrase frequency, capitalization, quoted text or as-is operator, and other advanced operators? Although most search applications offer a natural language-oriented search box, some advanced applications may also offer a custom query syntax for advanced users or automated tasks. The Lucene "classic" query operators that are supported by the Solr edismax query parser (Boolean, phrase with slop, wildcard, etc.) cover a good amount of use cases, but they only get you so far. In this talk, we will explore various strategies to support a custom and advanced query syntax in Solr, covering a spectrum of options from leveraging the out-of-the-box Solr query DSL, to a custom Solr query parser, and hybrid solutions in between. We will identify the options' pros and cons, discuss relevancy considerations, and illustrate the options in Java.
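To make the "custom Solr query parser" option concrete, here is a hedged sketch assuming recent Solr APIs; the NEAR/ operator and its rewrite are toy examples, not a real syntax:

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParser;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class AdvancedSyntaxQParserPlugin extends QParserPlugin {
    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            @Override
            public Query parse() throws SyntaxError {
                // A real implementation would tokenize the custom syntax;
                // here we just rewrite a toy operator and delegate.
                String rewritten = qstr.replace("NEAR/", "~");
                return new LuceneQParser(rewritten, localParams, params, req).parse();
            }
        };
    }
}

Registered in solrconfig.xml as a queryParser plugin, this becomes the hybrid option: keep the stock parser for the easy operators and pre-rewrite only the custom ones.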
Haystack 2019 - Establishing a relevance focused culture in a large organizat... - OpenSource Connections
For a relevance engineer one of the most difficult tasks in the tuning process is to convince others in the organization that this is a joint effort. Even the brightest search guru doesn't get very far when working in isolation, so establishing cross-collaboration through the organization is essential. But how to get there?
On top of that, in a large organization a relevance engineer often works on multiple seemingly unrelated search projects. The challenge is not to get drowned in building custom solutions for each project, but to design generic and re-usable strategies which solve many problems at once.
In this session we'll discuss how to build a widely supported basis for search quality improvements in an organization. It is full of practical tips and examples which could help you in establishing a cross-functional culture that is optimal for relevance tuning. It also zooms in on a holistic approach of solving multiple equivalent search issues at once.
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz... - OpenSource Connections
Relevance metrics like NDCG or ERR require graded judgements to evaluate query relevance performance. But what happens when we don't know what 'good' looks like ahead of time? This talk will look at using click modeling techniques to infer relevance judgements from user interaction logs.
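The counting core of the simplest such model can be sketched as follows; the log format and names are assumed, and real click models also correct for position bias:

import java.util.HashMap;
import java.util.Map;

public class CtrJudgments {
    // key = query + "\t" + docId, value = [impressions, clicks]
    private final Map<String, long[]> stats = new HashMap<>();

    public void observe(String query, String docId, boolean clicked) {
        long[] s = stats.computeIfAbsent(query + "\t" + docId, k -> new long[2]);
        s[0]++;              // the document was shown for this query
        if (clicked) s[1]++; // and the user clicked it
    }

    // A crude graded judgment: click-through rate in [0,1],
    // only trustworthy once impressions are large enough.
    public double judgment(String query, String docId) {
        long[] s = stats.get(query + "\t" + docId);
        return (s == null || s[0] == 0) ? 0.0 : (double) s[1] / s[0];
    }
}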
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via - OpenSource Connections
The New York Times has had search for a long time but 2018 was the year in which the company engaged with relevance in a deep way. The aim of this talk is to share what we've learned as we've increased our search sophistication and some of the challenges we still face.
Some of the techniques we've adopted in this past year include offline metrics testing, reflective testing, and user engagement metrics. We now have a process in place to quickly get mapping changes out to production. As a team we now also have a vocabulary for talking about relevance and can use it to discuss trade-offs and goals in conjunction with our metrics.
We hope this talk is of use to those who've put off working on search relevance due to fear, uncertainty, or ambivalence. We will talk about how we went from working on everything but search relevance to finally pulling back the curtain on the search system. We hope what we've learned can help others get started.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Observability Concepts EVERY Developer Should Know - DeveloperWeek Europe - Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble… many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
2. Who am I?
• Principal of OpenSource Connections - Solr/Lucene search consultancy: http://bit.ly/OSCCommercialSummary
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
9. • First USPTO application in "the cloud"
• Simple, and discoverable
• Expresses our philosophy of "Cloud meets Ocean"
• Check it out at http://gpsn.uspto.gov
10. Telling some stories
➡ How to inject "Discovery" into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don't be Afraid to Share!
13. Grok data at a gut level - look for outliers
[Diagram spanning UX and Data: user interviews, surveys, card sorting, scenarios/personas; brainstorm, mockups, proof of concept]
14. Where to spend time?
[Chart: effort split across UX, Engine, and Data - a 40% / 20% / 40% split, contrasted with the 40% / 40% / 20% we spent]
15. Telling some stories
• How to inject "Discovery" into your app
➡ The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don't be Afraid to Share!
17. Boy meets Girl Story
[Diagram: content files and metadata feed an ingest pipeline that drives the discovery UX]
18. Nothing but JS and Solr!
• Updates are quarterly
• User state in browser
• Solr is the "RESTful" API ;-)
• KISS: EmberJS + Solr
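A sketch of what that implies: any HTTP client can play the browser's role against /select. Host, core name, and params here are illustrative, not GPSN's real endpoint:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrSelect {
    public static void main(String[] args) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(
                "http://localhost:8983/solr/patents/select?q=solar+panel&wt=json&rows=10"))
                .GET().build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body()); // the raw JSON the single-page app renders
    }
}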
19. How we built it
[Architecture diagram: browsers, mobile/tablet clients, and third-party applications talk to the GPSN UI (Bootstrap CSS) - an EmberJS single-page search app - exchanging HTML, XML, and JSON; behind it sit application servers, Solr, an S3 bucket, and a server dashboard]
20. Yes, Solr is hanging out there on the Net…
• Using Jetty container security to lock down everything but the /select handler.
• Yes, the /admin interface appears to load, but no panels load.
• Go ahead, do a delete query! I dare you. Actually, please don't. ;-)
21. Single 550 GB index
• Solr + index are in an Amazon AMI image.
• Currently running two independent Solrs.
• Optimize works! Still.
• Elastic Load Balancer + AutoScale spins up more Solrs if needed.
• Threw lots of "provisioned IOPS" at the VM
23. Spyglass
• EmberJS-based widget framework - better than AjaxSolr!
• List of results
• Facets
• Autocomplete
• "Deploy" is just .html + .js. S3 bucket!
• Tooling is a pain. EmberJS is complex!
29. Don’t Move Files
• Copying 5 TB data up to S3 was very
painful.
• We used S3Funnel which is “rsync like”
• We bought more network bandwidth for
our office
32. Think about Data Volume
• Started with an older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.
• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!)
• 8 shards dropped indexing time from 12 hours to 2 hours. Merging took 5!
• We had too many steps in our pipeline
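The shard-merge step can be done with Lucene's IndexWriter.addIndexes; a sketch assuming a recent Lucene API rather than the 4.x-era one this project used:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeShards {
    // Usage: MergeShards <merged-dir> <shard-dir> [<shard-dir> ...]
    public static void main(String[] args) throws Exception {
        try (Directory merged = FSDirectory.open(Paths.get(args[0]));
             IndexWriter writer = new IndexWriter(merged,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            Directory[] shards = new Directory[args.length - 1];
            for (int i = 1; i < args.length; i++) {
                shards[i - 1] = FSDirectory.open(Paths.get(args[i]));
            }
            writer.addIndexes(shards); // copies segments over; no re-analysis
            writer.forceMerge(1);      // optional collapse to one segment - merging is the slow part
        }
    }
}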
33. Building a Patents Index
[Chart - machine count vs. time to build the index: 1 machine ≈ 5 days, 5 machines ≈ 3 days, 300 machines ≈ 30 minutes]
34. Telling some stories
• How to inject "Discovery" into your app
• The Cloud to the Rescue (sorta!)
➡ Parsers and Parsers and Parsers
• Don't be Afraid to Share!
37. Lots of File Types
• Sometimes in ZIP archives, sometimes not!
• Multiple XML formats, as well as CSV and EDI
• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
38. Tika as a pipeline!
• Auto-detects content type
• Metadata structure has all the key/value pairs needed for Solr
• Allows us to scale up with the Behemoth project (and others!).
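A sketch of that pipeline idea: Tika's AutoDetectParser fills a Metadata object whose key/value pairs map straight onto Solr fields. Field names here are ours, not GPSN's:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaToSolr {
    public static SolrInputDocument toDoc(String path) throws Exception {
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("text", handler.toString());
        // Metadata is already key/value, so it maps directly onto Solr fields.
        for (String name : metadata.names()) {
            doc.addField("meta_" + name, metadata.get(name));
        }
        return doc;
    }
}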
40. Detector to pick File

// Imports and comments added for completeness; GreenbookParser is defined
// elsewhere in the GPSN codebase.
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // Greenbook patent records are marked by a PATN tag near the start of the file.
    private static Pattern pattern = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        MediaType type = MediaType.OCTET_STREAM;
        // Peek at the first 1 KB without consuming the underlying stream.
        InputStream lookahead = new LookaheadInputStream(stream, 1024);
        String extract = org.apache.commons.io.IOUtils.toString(lookahead, "UTF-8");

        Matcher matcher = pattern.matcher(extract);
        if (matcher.find()) {
            type = GreenbookParser.MEDIA_TYPE;
        }

        lookahead.close();
        return type;
    }
}
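Not shown on the slide: the detector still has to be wired into Tika. One way, assuming Tika's CompositeDetector (a sketch):

import java.util.Arrays;
import org.apache.tika.detect.CompositeDetector;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;

public class DetectorWiring {
    public static Detector gpsnDetector() {
        // Custom detector first, so Greenbook files don't fall through
        // to a generic text/plain match.
        return new CompositeDetector(Arrays.asList(
                new GreenbookDetector(), new DefaultDetector()));
    }
}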
41. Telling some stories
• How to inject "Discovery" into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡ Don't be Afraid to Share!
42. Your Search solution isn't perfect
• Allow users to export data
• Most business users want to work in Excel! Accept it!
• Allow other applications to build on top of it.
43. GPSN has
• Lots of easy "Print to PDF" options.
• Data stored in S3 as individual patent files and chunky downloads.
• Filtering to expand or select specific data sets.
• Permalinks: simple, very sharable URLs.
• Underlying Solr service is exposed to the public via a proxy. You can query Solr yourself.
• Need advanced querying? Use Lucene syntax in the search bar.
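For instance, with hypothetical field names (the deck doesn't show GPSN's schema), a Lucene-syntax query in the search bar might look like:

    title:(solar AND panel) AND publication_date:[20100101 TO 20101231]

Field-scoped terms, Booleans, and range queries are standard Lucene syntax; per slide 20, the same query can also be sent to the public /select handler directly.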