Presentation on the architecture, scalability concerns, performance bottlenecks, operational characteristics, and lessons learned while designing and implementing Yammer's distributed real-time search system.
Building a global listening platform with Solr presents technical and global challenges. The speaker will demonstrate a platform they built in 3 months using Solr and Basis Technology products for content acquisition, analysis including language identification and entity extraction, and search visualization. Key aspects include distributed processing pipelines for analysis, language-specific indexing, and dashboard interfaces beyond basic search results.
Extending Solr: Building a Cloud-like Knowledge Discovery Platform - Lucidworks (Archived)
For CareerBuilder, a 1% deviance in search relevancy can mean millions of missed job opportunities for our users. When CareerBuilder moved to Solr from an expensive, proprietary search vendor, our top priorities were maintaining the quality of our search results and drastically improving our agility.
A Study of I/O and Virtualization Performance with a Search Engine based on ... - Lucidworks (Archived)
Documentum xPlore provides an integrated search facility for the Documentum Content Server. The standalone search engine is based on EMC's xDB (a native XML database) and Lucene. In this talk we introduce xPlore and some of its key components and capabilities. These include aspects of a tight integration of Lucene with the XML database: xQuery translation and optimization into Lucene queries and APIs, as well as transactional updates to Lucene. In addition, xPlore is being deployed aggressively into virtualized environments. We cover some performance results and tuning tips in these areas (both disk I/O and VM).
“Metadata is king!” Thus proclaimed Steve Kearns of Basis Technology, Platinum Sponsor of Lucene Revolution, at the start of this standing-room-only session on Day 1 of the conference. Why? Because it provides a way to enhance otherwise unstructured data with a considerable amount of structure.
If your users can't find it, they can't buy it, right? In this talk, Apache Lucene and Solr committer Grant Ingersoll will discuss architecture, techniques, and tips for successfully deploying search tools like Lucene, Solr, and LucidWorks Enterprise in eCommerce environments.
Earlier this year, Sensis launched its Business Search API, which allows publishers to develop local search propositions powered by the two million business listings contained in the Australian Yellow Pages® and White Pages® directories.
This document advertises a FETAC Level 5 course with limited places available. It notes the training is taking place in a new state-of-the-art premises and will be delivered by an IT Training Specialist and an experienced web designer. The course qualifies for a FETAC Level 5 award.
This document discusses strategies for marketing technology at different stages of maturity. It begins by outlining the types of companies that can be created from a technology and how that impacts the marketing approach. It then discusses how to assess the maturity of a technology using Technology Readiness Levels (TRLs) and how that maps to potential customers and good outcomes. The document provides frameworks for targeting current versus new customers and covers the 7 Ps of technology marketing as they relate to the stage of maturity. Overall, it aims to help readers understand how to effectively market and position their technology based on its current state.
Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13 - Marty Kaszubowski
The document discusses several key points about entrepreneurship and new venture formation:
1) Entrepreneurship and new ventures are the primary drivers of economic growth, not small or large existing companies. High-growth startups create the most jobs.
2) Starting a new venture can be a rewarding career path that allows one to create something from nothing, take control of their career, and potentially make a big impact.
3) Successful new ventures focus on proving their business model and solutions before attempting large-scale growth ("Nail it, then scale it"). Having the right founding team and understanding customers are also important success factors.
4) While starting a venture involves significant risks and effort
This document summarizes new features and performance improvements in Solr 3.1. Key highlights include: improved faceting capabilities like numeric range facets; new features like spatial search, a faster highlighter, and the extended dismax query parser; distributed search support; and under-the-hood performance optimizations in Lucene like DocumentsWriterPerThread. It also previews upcoming SolrCloud functionality and discusses features not yet included in the 3.1 release.
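To make the numeric range facet feature concrete, here is a minimal sketch of how such a request's parameters could be assembled client-side. The field name `price` and the surrounding helper are hypothetical; the parameter names (`facet.range`, per-field `f.<field>.facet.range.start/end/gap`) follow Solr's documented range-faceting syntax.

```python
def range_facet_params(field, start, end, gap):
    """Build Solr query parameters for a numeric range facet on `field`."""
    return {
        "q": "*:*",                               # match all documents
        "facet": "true",                          # enable faceting
        "facet.range": field,                     # facet on this numeric field
        f"f.{field}.facet.range.start": str(start),
        f"f.{field}.facet.range.end": str(end),
        f"f.{field}.facet.range.gap": str(gap),   # bucket width
    }

# Buckets of width 100 over prices 0..1000
params = range_facet_params("price", 0, 1000, 100)
```

These parameters would typically be URL-encoded and sent to a Solr `/select` handler.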
This document provides guidance on preparing an effective investor presentation. It explains that investors will use the initial presentation meeting to evaluate entrepreneurs and eliminate those they don't believe in or trust. The key elements of a presentation are outlined as the problem being solved, the solution, the market size and growth, revenue model, current and targeted customers, distribution channels, competition, strategic partners, management team, financing needed, financial projections, and exit strategy. Presenters are advised to keep the presentation concise at around 12 slides, use visuals over text, and practice extensively to feel comfortable during the question and answer period.
Solr & Lucene at Etsy provides concise summaries of Gregg Donovan's experience using Solr and Lucene at Etsy and TheLadders, including optimizing Solr out-of-the-box, customizing at a low level, and knowing when each approach is best. The document also shares various techniques for improving relevance, performance, and customization including external file fields, boosting queries, impression tracking, and more.
The document summarizes the Lucene/Solr Revolution conference in 2013. It provides statistics on growth in attendee numbers from 2010-2013 in the US and Europe. It also gives an overview of the attendee profile, countries represented, sponsoring companies, and Lucene/Solr committers attending. The agenda for today and tomorrow is outlined, along with opportunities to win prizes through social media engagement and details of the conference party.
The document provides information about the Canadian pop punk band Simple Plan, including its members, albums, and the song "Crazy". The song talks about things people try to hide and discusses issues in society. It uses imagery of black and white changing to color to represent how life gets better no matter how grim it seems. The band also started a foundation to help children and teens facing issues like poverty and addiction.
- The document lists 5 students, a teacher, and the subject of English for a class.
- It provides lyrics to the song "Smoke on the Water" by Deep Purple, which describes a fire that burned down a building where the band was recording, forcing them to relocate to finish recording.
- Instructions are given to watch a YouTube video of the song and complete a task after clicking a link. References for additional information about Deep Purple and the song are also listed.
PANGAEA: Providing Access to Geoscientific Data Using Apache Lucene Java - Lucidworks (Archived)
PANGAEA is a data system that provides access to geoscientific data using Apache Lucene Java. It hosts over 1 million data sets containing over 8 billion data items related to fields like sediments, water, corals, atmosphere, and ice. Lucene is used to index metadata and full text for fast search and retrieval of data sets. Lucene also enables geographic search and works as a key-value store for lookups of data sets related to publications. The speaker developed Lucene's search capabilities and maintains its text analysis API.
The document is lyrics from the Green Day song "American Idiot" that criticize modern American culture and politics, describing the country as dominated by propaganda and paranoia, with the media controlling the "nation of hysteria".
The document discusses using Solr to power the search functionality on a dating site called Jazzed. It was created by eHarmony to handle a broader range of relationships. Solr allows for fast and effective search of user profiles, and supports features like faceting, geospatial search, and querying on structured profile fields. The architecture utilizes Solr, Voldemort, and other open source tools to handle search and data storage at scale.
This card from a daughter to her mother expresses love and appreciation for her mother on Mother's Day. It reflects on how their bond began at birth and has remained unbreakable through ups and downs. While one day is not enough to show appreciation, the daughter is grateful that her mother has always been there for her with love regardless of circumstances. She wishes her mother a love-filled Mother's Day.
Apache Lucene is a high-performance, cross-platform, full-featured Information Retrieval library in open source, suitable for nearly every application that requires full-text search features. http://www.lucidimagination.com/developer/whitepaper/Whats-New-in-Apache-Lucene-3-0
UnConference for Georgia Southern Computer Science, March 31, 2015 - Christopher Curtin
I presented to the Georgia Southern Computer Science ACM group. Rather than one topic for 90 minutes, I decided to do an UnConference. I presented them a list of 8-9 topics, let them vote on what to talk about, then repeated.
Each presentation was ~8 minutes (except the career talk) and was by no means an attempt to explain the full concept or technology, only to spark their interest.
Puppet Camp Duesseldorf 2014: Luke Kanies - Puppet Keynote (NETWAYS)
In this presentation, we start by briefly talking about why configuration management and automation tools are becoming increasingly important, along with our general approach and the community that supports it. We also provide a comprehensive overview of the technologies used with Puppet, so expect to learn more about Puppet Enterprise, Puppet, PuppetDB, MCollective, Forge, and more. Programs that help people learn about Puppet, like training and certification, are also covered.
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenzhong Xu | Current 2022 (HostedbyConfluent)
If you are a data scientist or a platform engineer, you probably can relate to the pains of working with the current explosive growth of Data/ML technologies and toolings. With many overlapping options and steep learning curves for each, it’s increasingly challenging for data science teams. Many platform teams started thinking about building an abstracted ML platform layer to support generalized ML use cases. But there are many complexities involved, especially as the underlying real-time data is shifting into the mainstream.
In this talk, we’ll discuss why ML platforms can benefit from a simple and "invisible" abstraction. We’ll offer some evidence on why you should consider leveraging streaming technologies even if your use cases are not real-time yet. We’ll share learnings (combining both ML and Infra perspectives) about some of the hard complexities involved in building such simple abstractions, the design principles behind them, and some counterintuitive decisions you may come across along the way.
By the end of the talk, I hope data scientists will walk away with some tips on how to evaluate ML platforms, and platform engineers will pick up a few architectural and design tricks.
Denny Lee introduced Azure DocumentDB, a fully managed NoSQL database service. DocumentDB provides elastic scaling of throughput and storage, global distribution with low latency reads and writes, and supports querying JSON documents with SQL and JavaScript. Common scenarios that benefit from DocumentDB include storing product catalogs, user profiles, sensor telemetry, and social graphs due to its ability to handle hierarchical and de-normalized data at massive scale.
Jive Software provides enterprise collaboration software that allows for open and flexible team collaboration. Their software offers a unified platform for communities, content, and workflow across customers, partners, and employees. It provides real-time notifications and co-authoring capabilities in a scalable and customizable system that integrates with other technologies.
This document summarizes a guest lecture at UNSW about contemporary software challenges and solutions. It discusses how technology can provide a competitive advantage if developed properly. It presents case studies of legacy systems that were difficult to change and scale, as well as examples of systems that used newer architectures like microservices. The lecture promotes approaches like test-driven development, REST, and self-organizing teams to build independent, scalable services.
The document introduces Puppet, an automation tool that allows users to define and enforce the desired state of systems. It discusses how Puppet can help accelerate IT processes, increase productivity, and provide insight. Key features of Puppet include defining the desired configuration of nodes, testing changes, enforcing configurations, and reporting differences. The document outlines Puppet's roadmap and ways to get involved with the Puppet community and training.
This document summarizes Mike Solomon's experience scaling YouTube's architecture using Python. It discusses:
1) Starting simply with all services on the same boxes before gradually separating services like search and thumbnails onto their own machines.
2) Systematically removing bottlenecks revealed by user demand growth and replacing components to maintain scalability.
3) Using caching, database partitioning, and bulk data migration techniques to scale the MySQL database horizontally as user traffic increased.
4) Favoring simplicity, customizing open source software, and taking a "Pythonic" approach to build scalable, transparent APIs.
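The database-partitioning point above can be sketched with a toy routing function. This is an illustrative pattern, not YouTube's actual code: all identifiers and the shard count are hypothetical; the idea is simply that hashing a user id deterministically picks one MySQL shard, so all of a user's rows live on one box.

```python
import hashlib

def shard_for_user(user_id: str, num_shards: int) -> int:
    """Deterministically map a user id to one of num_shards database shards."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Every lookup for the same user routes to the same shard,
# so single-user queries never fan out across the cluster.
shard = shard_for_user("user-12345", 8)
```

Note that a plain modulo scheme reshuffles most keys when `num_shards` changes, which is why the bulk data migration techniques mentioned above matter in practice.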
ML on Big Data: Real-Time Analysis on Time Series - Sigmoid
This document discusses building a machine learning model for real-time time series analysis on big data. It describes using Spark and Kafka to ingest streaming sensor data and train a model to identify patterns and predict failures. The training phase identifies concepts in historical data to build a knowledge base. In real-time, incoming data is processed in microbatches to identify patterns and sequences matching the concepts, triggering alerts. Challenges addressed include handling large volumes of small files and sharing data between batches for signals spanning multiple batches.
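The last challenge mentioned, signals that span multiple microbatches, can be illustrated with a small sketch. This is not the talk's implementation: the function and the list-based "batches" are hypothetical stand-ins for Spark microbatches, showing only the carry-over idea of keeping the tail of each batch so a pattern straddling a boundary is still detected.

```python
def detect_pattern(batches, pattern):
    """Count occurrences of `pattern` across a stream of microbatches,
    carrying the tail of each batch forward so matches that span
    batch boundaries are not lost."""
    hits = 0
    carry = []  # last len(pattern)-1 items of the previous window
    for batch in batches:
        window = carry + batch
        for i in range(len(window) - len(pattern) + 1):
            if window[i:i + len(pattern)] == pattern:
                hits += 1
        # A full pattern cannot fit inside the carry alone, so no
        # match is ever counted twice.
        carry = window[-(len(pattern) - 1):] if len(pattern) > 1 else []
    return hits

# The sequence 1,2,3 appears twice, once across a batch boundary.
count = detect_pattern([[0, 1], [2, 3, 1], [2], [3]], [1, 2, 3])
```

In a real Spark Streaming job the carry would live in managed state (e.g. per-key state between batches) rather than a local variable.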
You are already the Duke of DevOps: you have mastered CI/CD, you have feature teams with ops skills, and your TTM rocks! But you are having difficulty scaling it: you have quality issues, and QoS is at risk. You are quick to adopt practices that increase development flexibility and deployment velocity. An urgent question follows on the heels of these benefits: how much confidence can we have in the complex systems we put into production? Let's talk about the next hype of DevOps: SRE, error budgets, continuous quality, observability, and Chaos Engineering.
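The error-budget idea mentioned above reduces to simple arithmetic: an availability SLO implies a fixed allowance of downtime per window, and releases pause when the allowance is spent. A minimal sketch (the function name and window are illustrative, not from the talk):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo: target availability as a fraction, e.g. 0.999 for 99.9%.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of error budget.
budget = error_budget_minutes(0.999)
```

Teams typically track consumed downtime against this number; when the budget is exhausted, feature releases yield to reliability work.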
DockerCon SF 2019 - Observability Workshop - Kevin Crawley
This document contains the slides from a workshop on observability presented by Kevin Crawley of Instana and Single Music. The workshop covered distributed tracing using Jaeger and Prometheus, challenges with open source monitoring tools, and advanced use cases for distributed tracing demonstrated through Single Music's experience. The agenda included labs on setting up Kubernetes and applications, monitoring metrics with Grafana and Prometheus, distributed tracing with Jaeger, and analytics use cases.
2016 - 10 questions you should answer before building a new microservice - devopsdaysaustin
Session Presentation by Brian Kelly
Microservices appear simple to build on the surface, but there's more to creating them than just launching some code running in a container. This talk outlines 10 important questions that should be answered about any new microservice before development begins on it, and certainly before it gets deployed into production.
This document provides an overview of cloud computing options and considerations for migrating to the cloud. It discusses infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) deployment models. It also covers assessing the current environment, defining the migration scope, and customizing the migration approach and destination environment. The document emphasizes understanding business needs, having a plan for flexibility as capabilities evolve rapidly, and using migration as an opportunity to restructure content and optimize storage.
Moving to Microservices with the Help of Distributed Traces - KP Kaiser
Moving away from a monolith to a microservices architecture is a process fraught with hidden challenges. There's legacy code, infrastructure, and organizational processes that all need to change, in order to make the switch successful.
But microservices come with a huge increase in infrastructure complexity. We'll see how distributed traces empower developers to work with greater autonomy, in increasingly complex deployment environments.
This document discusses strategies for handling large amounts of data in web applications. It begins by providing examples of how much data some large websites contain, ranging from terabytes to petabytes. It then covers various techniques for scaling data handling capabilities including vertical and horizontal scaling, replication, partitioning, consistency models, normalization, caching, and using different data engine types beyond relational databases. The key lessons are that data volumes continue growing rapidly, and a variety of techniques are needed to scale across servers, datacenters, and provide high performance and availability.
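Of the techniques listed above, caching is the easiest to show in miniature. The following is a generic cache-aside sketch, not anything from the talk: the class and its dict-backed "store" are hypothetical stand-ins for a real cache (e.g. memcached) in front of a database.

```python
class CacheAside:
    """Minimal cache-aside read path: check the cache first, fall back
    to the backing store on a miss, and populate the cache afterward."""

    def __init__(self, store):
        self.store = store   # stands in for the slow backing database
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.store[key]   # the expensive lookup in a real system
        self.cache[key] = value   # warm the cache for subsequent reads
        return value

reader = CacheAside({"user:1": {"name": "Ada"}})
first = reader.get("user:1")    # miss: hits the store
second = reader.get("user:1")   # hit: served from cache
```

The same read path is what makes invalidation hard: writes must either update or evict the cached entry, which is where the consistency models discussed above come in.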
The document introduces Yann Yu from Lucidworks and provides information about Lucidworks and its products Solr and Hadoop. It discusses how Solr can be used to provide search capabilities for large amounts of both structured and unstructured data stored in Hadoop. Integrating Solr and Hadoop allows for fast search across big data stored in Hadoop along with real-time indexing and querying capabilities. Examples discussed include enabling enterprise-wide search of documents stored in Hadoop and using Flume to index log data from Hadoop into Solr for real-time analytics and search.
Couchbase Connect 2014: Lucidworks CEO Will Hayes takes you on a fantastic voyage through the hope and the hype of big data and why the future is search-centric.
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr - Lucidworks (Archived)
The document discusses integrating Hadoop and Solr to enable fast, ad-hoc search across structured and unstructured big data stored in Hadoop. It provides examples of how Hadoop can be used for large-scale storage and processing while Solr is used for real-time querying and search. Specifically, it describes how the Lucidworks HDFS connector can process documents from HDFS and index them into SolrCloud for search, and how log data can be ingested from Flume into HDFS for archiving and extracted fields can be indexed into Solr in real-time for search and analytics dashboards.
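The "extracted fields can be indexed into Solr" step can be sketched as a parser that turns one access-log line into a field dictionary ready for indexing. The regex, field names, and sample line below are illustrative assumptions (a common combined-log shape), not the connector's actual schema.

```python
import re

# Hypothetical pattern for a combined-format access log line.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def to_solr_doc(line: str) -> dict:
    """Extract indexable fields from a log line; empty dict if unparseable."""
    m = LOG_LINE.match(line)
    if not m:
        return {}
    doc = m.groupdict()
    doc["status"] = int(doc["status"])  # numeric field for range facets
    return doc

doc = to_solr_doc(
    '1.2.3.4 - - [10/Oct/2014:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'
)
```

In the pipeline described above, documents like this would be batched and posted to SolrCloud, while the raw lines land in HDFS for archival.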
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business - Lucidworks (Archived)
Box uses the Solr search platform to power content search across its 25 million+ users. Some key aspects of Box's search implementation with Solr include:
1) The Solr index is sharded or split across multiple shards for high availability and scalability, with each file identifier mapped to a specific shard.
2) Search queries are handled by a front-end load balancer that distributes queries across multiple search head nodes for high availability.
3) Solr documents contain metadata like file owner, parent folders, and extracted text to support search by content, ownership, and folder structure.
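The file-to-shard mapping described above is commonly implemented by hashing the file identifier modulo the shard count, so indexing and queries for a given file always land on the same shard. A minimal sketch of that idea (an illustration only, not Box's actual routing scheme):

```python
# Hash-based shard routing: map each file identifier to a fixed shard.
# Hypothetical sketch; not Box's actual scheme.

import hashlib

def shard_for(file_id: str, num_shards: int) -> int:
    # A stable hash (not Python's per-process randomized hash()) keeps the
    # mapping consistent across processes and restarts.
    digest = hashlib.md5(file_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same identifier always routes to the same shard.
assert shard_for("file-123", 8) == shard_for("file-123", 8)
assert 0 <= shard_for("file-456", 8) < 8
```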
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
The document discusses benchmarking the performance of SolrCloud clusters. It describes Timothy Potter's experience operating a large SolrCloud cluster at Dachis Group. It outlines a methodology for benchmarking indexing performance by varying the number of servers, shards, and replicas. Results show near-linear scalability as nodes are added. The document also introduces the Solr Scale Toolkit for deploying and managing SolrCloud clusters using Python and AWS. It demonstrates integrating Solr with tools like Logstash and Kibana for log aggregation and dashboards.
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
The document discusses how search has evolved beyond traditional keyword search to include more complex tasks like recommendations, classifications, and analytics using distributed technologies like Hadoop. It provides an overview of new capabilities in Lucene/Solr like reduced memory usage, pluggable codecs, and spatial search upgrades. LucidWorks offers products like Solr and SiLK that integrate with Hadoop and provide search and analytics capabilities across distributed data.
This document discusses integrating search capabilities with Hadoop's big data analytics. It explains that Hadoop is well-suited for distributed storage and processing of large datasets, while search excels at free-text retrieval and indexing large amounts of text. The document outlines how the speaker's company integrated Hadoop and search using HBase replication to a search index, allowing results from Hadoop jobs to be searchable in near real-time. It provides an example use case of monitoring tweets for keywords and extracting mentioned URLs to visualize popular links.
Solr 4.7 and 4.8 include new features such as asynchronous execution of long-running actions, cursors for deep paging, document expiration, dynamic synonyms and stopwords, SSL support in SolrCloud, and improved collections API. Future versions will focus on ZooKeeper as the single source of truth, incremental field updates, multi-valued DocValues sorting, and removing legacy field types. The speaker also discussed related open source projects from LucidWorks for deploying Solr on AWS, log processing, and data quality.
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
This document discusses how Apache Solr can power ecommerce search and provides examples of companies using it. It outlines basic features for ecommerce like facets, highlighting, and boosting as well as advanced features like spatial search and analytics. The document also provides tips for ecommerce search like understanding user needs, debugging issues, and leveraging signals from user behavior to improve relevance.
Target transitioned from their previous search platform to using Solr. Some benefits they found included the speed of importing data into Solr and the ease of adding additional data signals to improve relevancy. However, they had to start from scratch on their relevancy strategy in Solr and found facets worked differently between the platforms. Target also discussed how they were able to improve relevancy by incorporating guest activity data on their website to surface more viewed and ordered items.
Exploration of multidimensional biomedical data in PubChem, Presented by Lia...
The document discusses the development of a new search system for PubChem to allow for exploration of multidimensional biomedical data. The new system was needed to address the challenges of handling large and heterogeneous datasets with many relationships between data types in a way that allows for fast querying. The system leverages Apache Solr to provide features like full-text search, faceting, molecule structure searching and joining of related data. It includes backend components like Solr, SQL and specialized search engines as well as web APIs and frontend interfaces like reusable widgets and a new search interface.
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
This document discusses Solr, an open source search platform from the Apache Lucene project. It provides full-text search, faceted search, auto-suggest capabilities, and supports multiple file formats for document indexing. The document outlines Solr's architecture and components, provides usage examples from large government sites, and recommends related open source tools.
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
This document discusses building a lightweight discovery interface for Chinese patents. It describes using parsers and the cloud to ingest various patent file formats and metadata in order to build a search interface. It emphasizes spending adequate time on user experience design and sharing data with users and other applications.
Big Data Challenges, Presented by Wes Caldwell at SolrExchange DC
ISS is a software solutions company that provides big data management tools to Department of Defense and intelligence community customers. They have over 800 employees across several US offices. Their solutions are reusable, license-free for the US government, and scalable from single users to large networks with thousands of users. Customers have thousands of heterogeneous data sources that create data at an increasing rate, making effective search and analytics tools necessary to help analysts extract useful information and actionable intelligence from large amounts of unstructured data in tactical environments. ISS argues that search must be the cornerstone of an effective big data strategy, allowing normalization, indexing, and semantic search of content to help analysts focus their efforts and gain insights from large data sets.
What's New in Lucene/Solr, Presented by Grant Ingersoll at SolrExchange DC
Lucene and Solr 4.8 include improvements to speed, flexibility, and scalability. Key updates include native near real-time support in Lucene, faster indexing with document writer per thread, and improved fuzzy and wildcard query processing. Solr 4 offers new faceting, geospatial, and distributed capabilities. Both projects provide easier configuration and more pluggable scoring and indexing options to improve search relevance and performance.
This document summarizes Sean Timm's presentation on Solr and Lucene at AOL. It discusses AOL's history with search technologies including using Open Directory Project (ODP) and building search into AOL Server using their own retrieval model (CPL). It describes AOL's contributions to Solr/Lucene including the Data Import Handler. It provides recommendations for contributing to the Solr/Lucene community such as answering questions, improving documentation, and submitting patches. It highlights some of AOL's applications of Solr like search for MapQuest, AIM, Mail, and analyzing Sarah Palin's emails.
This document provides an introduction to SolrCloud, which enables horizontal scaling of a Solr search index using sharding and replication. Key terminology is defined, including ZooKeeper, nodes, collections, shards, replicas, and leaders. The document outlines the high-level SolrCloud architecture and discusses features like sharding, document routing, replication, distributed indexing and querying. Challenges around consistency and availability are also covered.
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchange DC
Doug discusses challenges with collaboration between search developers and content experts when optimizing search relevancy. The current process of developers making changes and experts having to wait a week for results is inefficient. Doug proposes applying test-driven development principles to search by having experts continuously test search results and provide feedback on changes in real-time. This allows developers to get immediate feedback and ensures changes are improving search quality. Doug's company built a tool called Quepid that implements this approach to enable better collaboration between experts and developers when optimizing search.
This document discusses building a data-driven log analysis application using LucidWorks SILK. It begins with an introduction to LucidWorks and discusses the continuum of search capabilities from enterprise search to big data search. It then describes how SILK can enable big data search across structured and unstructured data at massive scale. The solution components involve collecting log data from various sources using connectors, ingesting it into Solr, and building visualizations for analysis. It concludes with a demo and contact information.
LucidWorks App for Splunk Enterprise is the first of its kind, specifically designed to allow companies to analyze and manage the health and availability of their Solr deployments in Splunk software. The solution integrates multi-structured data indexed by Solr directly into Splunk® Enterprise, giving system administrators the ability to look at the intersection of documents, customer records or other unstructured data sources as they relate to machine data. This enables companies to optimize their Solr applications, glean insights from search and usage patterns and spot security concerns to improve end user experiences and derive more business value from data-driven applications.
This webinar will explore the features of the App, and provide attendees with valuable information on the following key components:
Solr Monitor: Monitor the health and availability and utilization of LucidWorks and/or Solr deployments with pre-defined data inputs, dashboards and reports
Search Analytics: Perform user behavior and click-stream analysis with pre-built search analytics reports and fields
NoSQL Lookups: Using Splunk’s lookup facility, enrich your Splunk reports with data of any structure from Solr’s fully indexed and searchable NoSQL datastore
Search Time Joins: Join Splunk data with human generated and other unstructured data sources stored in Solr at search time for developing data-driven applications
Introducing LucidWorks App for Splunk Enterprise webinar
Real Time Search at Yammer
1. Real-time revolution at work: REAL-TIME SEARCH AT YAMMER. May 25, 2011. By Boris Aleksandrovsky, Yammer, Inc. http://www.linkedin.com/in/baleksan
4. Challenges - from information to knowledge: information, facts, knowledge; attention, engagement, retention; messages, metadata, personalized search.
Similar to how a single malt is made, knowledge is distilled from information, facts and experience. The role of the search engine is to capture the process and make it readily available.
A private and secure enterprise social network for coworkers and colleagues to communicate, collaborate, and coordinate. An interactive online knowledge base that connects dispersed workers in ways that are easy, real-time, social, and searchable. A way to share what's relevant with the right colleagues, by drawing attention to and discussing important issues. "The social glue" of an organization, driving better collaboration and process improvements while preserving institutional knowledge. Real-time communication and coordination; business continuity and relevance; global connectivity, accessible anywhere.
“Introducing Yammer: combining the new ways we communicate with the consumerization of enterprise software to achieve faster communications, better collaboration, and more productivity.”
Overview of the key features, but emphasize this is a knowledge base:
- Search: find answers and topics, identify collaborators and experts.
- Messaging and Feeds: ask questions, start discussions; share news, links, opinions, and ideas; streamline communication, understand context in threaded conversations.
- My Feed, Company Feed, RSS Feeds: follow what and who are of most interest to you, stay on top of company news, add RSS to stay informed.
- Direct Messaging: send private direct messages to co-workers, reduce email volume, add others who can catch up by reading thread histories.
- User Profiles: each user creates a profile with their photo, title, and background; easily connect with co-workers and expertise.
- Company Directory: upgrade to enterprise for additional security and admin features, including company directory integration; help new employees quickly get up to speed.
- Groups and Communities: build engagement by creating internal groups around projects and topics, and external communities with partners and customers.
- Applications: share files, enhance productivity, and increase collaboration through Yammer’s suite of core apps and a la carte third-party apps for document sharing, tracking, helpdesk ticketing, and more.
- Integrations: SharePoint 2007 and 2010, Outlook, Salesforce; soon: Box.
- Access and Mobility: access Yammer anywhere, through the web, desktop client, IM, SMS, Microsoft SharePoint, and mobile applications (iPhone, BlackBerry, Android, Windows Mobile).
- Translations: soon available in 100 languages.
- Network Consultation and Support: included with the enterprise upgrade.
Other things to talk about if you like:
- @People and #Topics: quickly loop co-workers into conversations and tag topics for further information discovery and sharing.
Connectivity and Crisis Communications: connect your dispersed workforce, crowdsource ideas, and broadcast company-wide in times of critical need.
“ We know our product inside and out from our work with over 100K+ company networks. From product iterations to customer use cases, to deployment and engagement services, we have a depth of expertise that has made us the market leader.”
Before getting into the product – let’s get at the problem(s) Yammer is attempting to address…
From the perspective of search, people use Yammer today in two modes. First, they want simply to recover information that might have scrolled out of view in their Yammer feed. This is very similar to Twitter: I check it once in a while, but what have I missed since the last time? For this use case we want to present search results in reverse chronological order and answer simple queries. The second mode is knowledge exploration. Yammer is a knowledge base created by interactions between colleagues over time within a company. Yammer can help with the on-boarding process, FAQs, tips, computer setup, company procedures and processes, practices and culture. For this, search is an entry point and quite possibly the most important interaction element. We need to answer complicated queries and present results based on textual similarity, popularity, engagement and social distance.
Wikis working in groups: people are creating some connections, but they are not well organized.
The biggest challenges for search at Yammer are the real-time nature of the information and the complicated relevancy story. Information on Yammer should be indexed and available for users to search in real time, virtually in less than a second. This makes the Yammer indexing system similar to Twitter, where tweets are indexed in real time. Search results likewise are available in reverse chronological order, based on the assumption that for certain types of events, timeliness is the most pertinent characteristic. This maps really well onto content like news, where relevancy declines fairly rapidly as time passes, or onto content that is more transient in nature, like events and meetings. There are other types of content where the relationship between the creator of the content and the searcher is important, and where the sheer popularity of the content matters as well. This is more of a Facebook newsfeed case, which tries to present content from the people you value or interact with most. A good example would be communications from your boss, or an expert opinion you trust. Popular discussion threads that capture the attention of the company are important to find, since they usually encompass the company culture. There are, however, other types of content that are much more knowledge-heavy; for retrieving these, textual similarity, reputation and potential for engagement are more important than timeliness. For instance, when a sales representative is searching for a relevant approach to a particular client industry, he would be interested in the experiences of all the other salespeople who have tried to sell to that industry, and he would want to look back as far as the records go. This is a case where Yammer's search system is trying to act more like Google's.
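One common way to blend the signals discussed above (textual similarity, recency, popularity, social distance) is a weighted score with exponential time decay. The sketch below is purely illustrative: the weights, the half-life, and the function shape are invented for the example, not Yammer's actual ranking formula.

```python
# Blending relevance signals into one score. Purely illustrative:
# weights and decay constant are invented, not Yammer's formula.

import math

def score(text_similarity, age_days, popularity, social_distance,
          w_text=0.5, w_recency=0.2, w_pop=0.2, w_social=0.1,
          half_life_days=30.0):
    # Recency decays exponentially: news-like content fades fast.
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    # Closer colleagues (smaller social distance) contribute more.
    social = 1.0 / (1.0 + social_distance)
    return (w_text * text_similarity + w_recency * recency
            + w_pop * popularity + w_social * social)

# A fresh, popular message from a close colleague outranks an old,
# obscure one with the same textual match.
fresh = score(text_similarity=0.8, age_days=1, popularity=0.9, social_distance=1)
old = score(text_similarity=0.8, age_days=365, popularity=0.1, social_distance=4)
assert fresh > old
```

For the knowledge-heavy mode described above, one would shrink `w_recency` (or lengthen the half-life) so that old but highly relevant threads still surface.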
Out-of-order delivery is the source of all (most) evil - easily 50% of the complexity is there.
Solutions:
- Guarantee in-order delivery (buffer and wait): degrades performance and availability, and only guarantees very eventual consistency.
- Minimize the probability and forget: limited by timestamp precision and clock skew.
- Arbitrate: based on timestamps / vector clocks (timestamp + versions), on semantics, or on business cases; requires indexing tombstones (mark-for-delete).
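The timestamp-arbitration approach in the notes above can be sketched roughly as follows. This is a minimal last-write-wins illustration, not Yammer's actual code; the names (`IndexEntry`, `ArbitratingIndex`) are invented for the example.

```python
# Last-write-wins arbitration for out-of-order index updates, using
# per-document timestamps and tombstones (mark-for-delete).
# Hypothetical sketch; names are invented for illustration.

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class IndexEntry:
    doc_id: str
    timestamp: float          # event time at the source of truth
    deleted: bool = False     # tombstone flag
    body: Optional[str] = None

class ArbitratingIndex:
    """Applies updates in any arrival order; the highest timestamp wins.
    Deletes are kept as tombstones so a late-arriving stale update
    cannot resurrect a deleted document."""

    def __init__(self) -> None:
        self._entries: Dict[str, IndexEntry] = {}

    def apply(self, entry: IndexEntry) -> None:
        current = self._entries.get(entry.doc_id)
        # Ignore the update if we already hold a newer (or equal) version.
        if current is not None and current.timestamp >= entry.timestamp:
            return
        self._entries[entry.doc_id] = entry

    def get(self, doc_id: str) -> Optional[IndexEntry]:
        entry = self._entries.get(doc_id)
        # Tombstoned documents are invisible to searchers.
        if entry is None or entry.deleted:
            return None
        return entry

# Out-of-order arrival: a delete (t=3) lands before a stale edit (t=2).
idx = ArbitratingIndex()
idx.apply(IndexEntry("m1", timestamp=3, deleted=True))
idx.apply(IndexEntry("m1", timestamp=2, body="stale edit"))
assert idx.get("m1") is None  # tombstone wins; stale update ignored
```

Note that the tombstone must itself be indexed (not merely applied as a hard delete), exactly as the notes say, or the stale t=2 update would re-create the document.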
- Dual indexing: a primary index for serving and a secondary index for reindexing.
- Verify the secondary index for consistency; then, for each replica in turn: shut down, move the secondary index to primary, restart.
- Availability should not be affected, except for a slight chance of system failure on the serving replica.
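The rolling swap described above can be sketched as follows. This is a toy model of the procedure, with simple stand-ins for the real shutdown/move/restart operations.

```python
# Rolling index swap: promote each replica's secondary (reindexed) index
# to primary, one replica at a time, so search stays available throughout.
# Hypothetical sketch; replica operations are simple stand-ins.

class Replica:
    def __init__(self, name):
        self.name = name
        self.serving = True
        self.primary = "old-index"
        self.secondary = "new-index"

    def verify_secondary(self):
        # Stand-in for a real consistency check against the source of truth.
        return self.secondary is not None

def rolling_swap(replicas):
    for r in replicas:
        if not r.verify_secondary():
            raise RuntimeError(f"{r.name}: secondary index failed verification")
        r.serving = False                           # shut down this replica only
        r.primary, r.secondary = r.secondary, None  # mv secondary -> primary
        r.serving = True                            # restart
        # All other replicas kept serving while this one was down.

replicas = [Replica("a"), Replica("b"), Replica("c")]
rolling_swap(replicas)
assert all(r.primary == "new-index" and r.serving for r in replicas)
```

The key design point is that only one replica is down at any moment, which is why availability is unaffected unless the sole serving replica happens to fail during the swap.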
- Indexing problems.
Detect: an index integrity tool checks the index against the source of truth and identifies what needs patching.
Reindex: fill gaps, or reindex the whole corpus into the secondary index and swap it with the primary.
Repair job: patch in place.
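An integrity tool of the kind described above boils down to diffing the index against the source of truth. A minimal sketch, with in-memory dicts standing in for the real stores:

```python
# Index integrity check: diff the search index against the source of
# truth, reporting missing docs (gaps to reindex), stale docs (to patch
# in place), and orphans (deleted upstream). Hypothetical sketch.

def integrity_check(source_of_truth, index):
    missing = [doc_id for doc_id in source_of_truth if doc_id not in index]
    stale = [doc_id for doc_id in source_of_truth
             if doc_id in index and index[doc_id] != source_of_truth[doc_id]]
    orphaned = [doc_id for doc_id in index if doc_id not in source_of_truth]
    return missing, stale, orphaned

source = {"m1": "hello", "m2": "world", "m3": "new"}
index = {"m1": "hello", "m2": "old", "m4": "deleted upstream"}
missing, stale, orphaned = integrity_check(source, index)
assert missing == ["m3"]    # gap: needs reindexing
assert stale == ["m2"]      # patch in place
assert orphaned == ["m4"]   # remove / tombstone
```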
Replica query routing strategies:
Call all: the most predictable latency profile, with an index warm-up advantage.
Round robin: for when the system is under load stress.
Least busy: the most complicated; requires polling metrics, and is prone to errors under bursty activity.
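Two of the strategies above are easy to sketch. The following is illustrative only; a real router would pull the "least busy" load figures from a metrics poll rather than a dict.

```python
# Replica selection sketches: round-robin and least-busy.
# Hypothetical; load numbers stand in for a real metrics poll.

import itertools

class RoundRobin:
    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def pick(self, load=None):
        # Ignores load entirely: cheap and fair under stress.
        return next(self._cycle)

class LeastBusy:
    def pick(self, load):
        # 'load' maps replica name -> current in-flight queries.
        # Fragile under bursty traffic: the poll is already stale.
        return min(load, key=load.get)

rr = RoundRobin(["r1", "r2", "r3"])
assert [rr.pick() for _ in range(4)] == ["r1", "r2", "r3", "r1"]

lb = LeastBusy()
assert lb.pick({"r1": 7, "r2": 2, "r3": 5}) == "r2"
```

"Call all" is simply fanning the query out to every replica and taking the first answer, which is what gives it the predictable latency profile and keeps every replica's caches warm.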
- Testing.
Indexing: idempotent and tolerant of out-of-order delivery - 10K docs delivered in random order with X% of dupes.
Search: build a small manual index by recording events; create unit-test-style tests with asserts.
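The indexing test described above can be sketched like this: deliver the documents in random order with duplicates injected, and assert the index converges to the same final state regardless of arrival order. Illustrative only; a simple last-write-wins dict stands in for the real indexer.

```python
# Idempotence / out-of-order test sketch: 10K docs delivered in random
# order with ~10% duplicates must yield the same final index.

import random

def index_docs(events):
    index = {}
    for doc_id, version, body in events:
        current = index.get(doc_id)
        # Idempotent last-write-wins: re-delivery of an old or equal
        # version is a no-op.
        if current is None or version > current[0]:
            index[doc_id] = (version, body)
    return index

# 10K docs; inject duplicates and shuffle the delivery order.
events = [(f"doc{i}", 1, f"body{i}") for i in range(10_000)]
events += random.sample(events, 1_000)   # ~10% duplicates
random.shuffle(events)

index = index_docs(events)
assert len(index) == 10_000
assert index["doc42"] == (1, "body42")
```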
- Production: metrics; alerts via Zabbix (Zabbix is awesome); Puppet; Ganglia for machine-level diagnostics; have enough redundancy.
Gauges are instantaneous readings of values (e.g., a queue depth).
Counters are 64-bit integers which can be incremented or decremented.
Meters are increment-only counters which keep track of the rate of events. They provide mean rates, plus exponentially-weighted moving averages which use the same formula as the UNIX 1-, 5-, and 15-minute load averages.
Histograms capture distribution measurements about a metric: the count, maximum, minimum, mean, standard deviation, median, and the 75th, 95th, 98th, 99th, and 99.9th percentiles of the recorded values. (They do so using a method called reservoir sampling, which allows them to efficiently keep a small, statistically representative sample of all the measurements.)
Timers record the duration as well as the rate of events. In addition to the rate information that meters provide, timers also provide the same metrics as histograms about the recorded durations. (The samples that timers keep in order to calculate percentiles are biased towards more recent data, since you probably care more about how your application is doing now than how it has done historically.)
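The meter described above is the least obvious of these metric types, so here is a rough sketch of one: an increment-only counter plus an exponentially weighted moving average of the event rate, using the same alpha formula as the UNIX load averages. Illustrative only; the tick interval and class shape are assumptions, not the actual Metrics library implementation.

```python
# Sketch of a "meter" metric: increment-only counter plus an EWMA of the
# event rate, in the style of the UNIX 1-minute load average.
# Hypothetical sketch, not the real Metrics library code.

import math

class Meter:
    TICK_SECONDS = 5.0  # EWMA is updated on a fixed tick, as in load averages

    def __init__(self, window_minutes=1.0):
        self.count = 0
        self._uncounted = 0
        self._rate = None  # events per second
        self._alpha = 1 - math.exp(-self.TICK_SECONDS / (window_minutes * 60.0))

    def mark(self, n=1):
        # Increment-only, per the description above.
        self.count += n
        self._uncounted += n

    def tick(self):
        # Called every TICK_SECONDS by a scheduler in a real implementation.
        instant_rate = self._uncounted / self.TICK_SECONDS
        self._uncounted = 0
        if self._rate is None:
            self._rate = instant_rate
        else:
            self._rate += self._alpha * (instant_rate - self._rate)

    def one_minute_rate(self):
        return self._rate or 0.0

m = Meter()
for _ in range(12):   # one simulated minute at a steady 50 events per tick
    m.mark(50)
    m.tick()
assert m.count == 600
assert abs(m.one_minute_rate() - 10.0) < 1.0  # ~10 events/sec
```

A gauge, by contrast, needs no state at all (it just reads a value on demand), and a histogram would add a reservoir sample of recorded values on top of the counter.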