Sudarshan Gaikaiwari - Lucene @ Yelp


Yelp runs several search services on Lucene, including business search, phone search, list search, review search, and auto-completion. Search was originally too slow because queries had to seek across hard disks, and the index was too large to fit in memory on a single machine. The solution was to shard the index across multiple machines ("federation") and have a coordinator, the federator, retrieve and combine results. Lucy indexes and searches the individual shards. Statistical modeling and simulation showed that far fewer hits need to be retrieved from each shard to obtain the overall top results than the naive approach requires.


Bio

1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp

Outline

1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits

Hard Disk Seek Latency
Disk seek 10,000,000 ns

Source: "Software Engineering Advice from Building Large-Scale Distributed Systems", Jeffrey Dean

RAM read latency
Main memory reference 100 ns

Pinning Index in RAM

● vmtouch
● mlock
● http://hoytech.com/vmtouch/

Problem

Index is too large to fit in memory on a single machine

Geographical Sharding drawbacks

1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found.

Federation

1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable

Mapping businesses to shards

1. Assigning businesses to shards

shard = shardlist[hash(business_id) % len(shardlist)]

Problems
1. Adding a new shard changes len(shardlist), so most businesses map to a different shard and all of them have to be re-indexed
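
The cost of the modulo scheme above can be illustrated with a small sketch. The shard names, the business-id format, and the use of MD5 as a stable hash are assumptions made for illustration only (Python's built-in `hash()` is salted per process for strings, so it is not used here); none of them come from the talk.

```python
import hashlib

def stable_hash(business_id: str) -> int:
    """Deterministic hash of a business id (MD5 is an illustrative choice)."""
    return int(hashlib.md5(business_id.encode("utf-8")).hexdigest(), 16)

def shard_for(business_id: str, shardlist: list) -> str:
    """The slide's rule: shardlist[hash(business_id) % len(shardlist)]."""
    return shardlist[stable_hash(business_id) % len(shardlist)]

def fraction_moved(business_ids, old_shards, new_shards) -> float:
    """Share of businesses whose shard changes when the shard list changes."""
    moved = sum(
        1 for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)

# Hypothetical ids and shard names.
ids = [f"biz-{i}" for i in range(10_000)]
four = ["shard0", "shard1", "shard2", "shard3"]
five = four + ["shard4"]

moved = fraction_moved(ids, four, five)
print(f"{moved:.0%} of businesses change shard when a fifth shard is added")
```

Going from four to five shards, a business keeps its shard only when its hash has the same remainder modulo 4 and modulo 5, which is the case for roughly one hash in five, so about four in five businesses move.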

Virtual Nodes: Advantages

1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
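
A minimal sketch of how vbuckets could give this flexibility, assuming a fixed number of virtual buckets and a lookup table from vbucket to shard. `NUM_VBUCKETS`, the MD5 hash, the round-robin initial assignment, and the shard names are all hypothetical; the slides do not describe the actual table layout.

```python
import hashlib

NUM_VBUCKETS = 1024  # assumed value; fixed for the lifetime of the index

def vbucket_for(business_id: str) -> int:
    """Hash a business id to a vbucket; this never changes when shards do."""
    digest = hashlib.md5(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS

def build_table(shards):
    """Initial round-robin assignment of vbuckets to shards."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}

def move_vbuckets(table, vbuckets, target):
    """Return a new table with the given vbuckets reassigned to target."""
    new_table = dict(table)
    for vb in vbuckets:
        new_table[vb] = target
    return new_table

def shard_for(business_id, table):
    return table[vbucket_for(business_id)]

table = build_table(["shard0", "shard1", "shard2", "shard3"])

# Split a hot shard: hand every other vbucket of shard0 to a new shard4.
hot = [vb for vb, shard in table.items() if shard == "shard0"]
new_table = move_vbuckets(table, hot[::2], "shard4")

ids = [f"biz-{i}" for i in range(10_000)]
changed = [b for b in ids if shard_for(b, table) != shard_for(b, new_table)]
moved = len(changed) / len(ids)
print(f"{moved:.1%} of businesses move")
```

Only the businesses in the reassigned vbuckets (128 of 1024 here) need to be re-indexed; everything else stays where it was.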

Lucy Master Slave Architecture

Separate indexing (masters)
A master for each shard of a service

Searching (slaves)
A slave for every replica of a service

Federator: Combining results across shards
1. Once we distribute an index across shards, we need a component that searches all of the shards and combines their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC.
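
The fan-out and merge can be sketched as follows. This is not the federator's code: the standard-library `asyncio` loop stands in for the Tornado IO loop, and an in-process dictionary of made-up hits (`FAKE_SHARDS`) stands in for JSON-RPC calls to real shards. Scores are assumed to be comparable across shards.

```python
import asyncio
import heapq

# Hypothetical shard contents: (score, business_id) pairs standing in for
# what a real shard would return over JSON-RPC.
FAKE_SHARDS = {
    "shard0": [(9.1, "b17"), (4.2, "b03"), (1.0, "b44")],
    "shard1": [(8.7, "b21"), (7.9, "b08")],
    "shard2": [(9.5, "b30"), (2.2, "b12"), (0.3, "b51")],
}

async def search_shard(name, query, num_hits):
    """Stand-in for a network call to one shard (the query is ignored here)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    hits = sorted(FAKE_SHARDS[name], reverse=True)
    return hits[:num_hits]

async def federate(query, num_hits):
    """Query every shard concurrently and merge by descending score."""
    per_shard = await asyncio.gather(
        *(search_shard(name, query, num_hits) for name in FAKE_SHARDS)
    )
    # Each per-shard list is already sorted descending, so a k-way merge works.
    merged = heapq.merge(*per_shard, reverse=True)
    return list(merged)[:num_hits]

top = asyncio.run(federate("pizza", 3))
print(top)
```

Because every shard returns its hits already sorted, the coordinator only needs a k-way merge rather than a full sort.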

Executing queries

1. Gather the top results for a query
2. Collect attribute statistics for attributes like places, categories
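
One way to picture the second step is as per-attribute counts over the matching businesses, merged across shards. The hit structure, field names, and values below are invented for illustration; the slides do not show how Lucy represents attributes.

```python
from collections import Counter

def attribute_stats(hits, attribute):
    """Count how many hits carry each value of one attribute."""
    counts = Counter()
    for hit in hits:
        for value in hit.get(attribute, []):
            counts[value] += 1
    return counts

def merge_stats(per_shard_counts):
    """Sum the per-shard counters into one set of statistics."""
    total = Counter()
    for counts in per_shard_counts:
        total.update(counts)
    return total

# Hypothetical hits from two shards.
shard_a = [
    {"id": "b1", "categories": ["pizza", "italian"], "places": ["SoMa"]},
    {"id": "b2", "categories": ["pizza"], "places": ["Mission"]},
]
shard_b = [
    {"id": "b3", "categories": ["italian"], "places": ["SoMa"]},
]

categories = merge_stats(
    [attribute_stats(shard_a, "categories"), attribute_stats(shard_b, "categories")]
)
places = merge_stats(
    [attribute_stats(shard_a, "places"), attribute_stats(shard_b, "places")]
)
print(categories, places)
```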

Lucene

1. Efficiently executes queries over the index
2. Scores how relevant each business is to the words in the query (word score)
3. Upgrading Lucene to 2.9/3.1 is a work in progress

Efficiently Retrieving top k hits

1. When a user moves through multiple pages, the number of hits to be retrieved increases

num hits = start + count

2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them
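
The naive approach can be sketched as below, using synthetic integer scores spread evenly over five hypothetical shards. The function and variable names are illustrative assumptions.

```python
import heapq
from itertools import islice

def naive_page(shard_results, start, count):
    """Fetch start + count hits from every shard, merge, and slice one page.

    shard_results: one list of scores per shard, each sorted descending.
    Returns the page and the total number of hits transferred.
    """
    num_hits = start + count
    fetched = [hits[:num_hits] for hits in shard_results]
    merged = list(islice(heapq.merge(*fetched, reverse=True), num_hits))
    transferred = sum(len(hits) for hits in fetched)
    return merged[start:num_hits], transferred

# Synthetic data: scores 0..4999 dealt round-robin to 5 shards.
shards = [sorted(range(i, 5000, 5), reverse=True) for i in range(5)]

page, transferred = naive_page(shards, start=490, count=10)
print(page)
print(transferred)
```

For the page covering hits 490 to 499, the coordinator pulls 500 hits from each of the five shards, 2,500 in total, to display 10.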

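Binomial Distribution
Probability that r of the top k hits are in a particular shard, where p is the probability that a given hit is in that shard:

$P(r) = \binom{k}{r} p^r (1 - p)^{k - r}$

Mean: $\mu = kp$

Variance: $\sigma^2 = kp(1 - p)$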
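
A short sketch of the binomial model, using only the standard library. The helper `hits_per_shard`, which requests the mean plus a chosen number of standard deviations, is an assumption: for k = 100 and p = 0.2 the mean is 20 and the standard deviation is 4, so 1, 3 and 6 standard deviations give 24, 32 and 44, which coincide with the per-shard counts on the Simulation slide. That reading is an inference from the numbers, not something the slides state.

```python
import math

def binomial_pmf(r, k, p):
    """Probability that exactly r of the top k hits fall in one shard."""
    return math.comb(k, r) * p**r * (1 - p) ** (k - r)

def mean_and_std(k, p):
    """Mean k*p and standard deviation sqrt(k*p*(1-p)) of the binomial."""
    mean = k * p
    std = math.sqrt(k * p * (1 - p))
    return mean, std

def hits_per_shard(k, p, n_sigmas):
    """Hypothetical rule: request mean + n_sigmas standard deviations."""
    mean, std = mean_and_std(k, p)
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(mean + n_sigmas * std, 9))

k, p = 100, 0.2
mean, std = mean_and_std(k, p)
requested = [hits_per_shard(k, p, n) for n in (1, 3, 6)]

# Probability that one shard holds more than 44 of the top 100 hits.
tail_44 = sum(binomial_pmf(r, k, p) for r in range(45, k + 1))
print(mean, std, requested, tail_44)
```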

Simulation
k = 100, p = 0.2

Formula            Hits selected from each shard    Results missed (%)
(not preserved)    24                               0.017
(not preserved)    32                               0.0001407
(not preserved)    44                               0.00000
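
A simulation in this spirit can be sketched as follows. It assumes five equally likely shards (p = 0.2), places each of the top k hits on a random shard, and counts the hits that are lost when only a fixed number is requested per shard. The trial count and seed are arbitrary, and the sketch is not expected to reproduce the exact figures in the table, since the original simulation setup is not given.

```python
import random

def missed_fraction(k, n_shards, per_shard, trials=2000, seed=0):
    """Fraction of the true top-k hits lost when each shard returns at most
    per_shard hits, averaged over random placements of hits on shards."""
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        counts = [0] * n_shards
        for _ in range(k):
            counts[rng.randrange(n_shards)] += 1
        # A shard holding more top-k hits than it returns loses the excess.
        missed += sum(max(0, c - per_shard) for c in counts)
    return missed / (k * trials)

results = {m: missed_fraction(k=100, n_shards=5, per_shard=m) for m in (24, 32, 44, 100)}
for m, frac in results.items():
    print(f"{m:>3} hits per shard: {frac:.6%} of top hits missed")
```

Requesting all 100 hits from every shard can never miss a result, and the missed fraction falls quickly as the per-shard count moves further above the mean of 20.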

Results

1. ~50% savings over 100 hits (44 hits requested from each shard)
2. 77% savings over 1000 hits (228 hits requested from each shard)

Future work

1. In memory index
2. Move towards real time search





Sudarshan Gaikaiwari - Lucene @ Yelp

1. Lucene @ Yelp Sudarshan Gaikaiwari

2. Bio 1. Over a decade of experience in information retrieval 2. Used IR techniques at Symantec's DLP group 3. Search Engineer at Yelp

3. Outline 1. Overview of search services at Yelp 2. Federation Motivation 3. Lucy Indexing 4. Lucy Searching 5. Efficiently Retrieving top k hits

4. The services we provide

5. Lucy: business search

6. Lucy also powers phone search

7. Cathy: she 'talks' a lot

8. Listsearch: it searches lists....

9. Reviewsearch: it searches reviews....

10. DYM: did you really mean that?

11. Suggest: auto completion

12. Federation Motivation

13. Problem Search is too slow

14. Hard Disk Seek Latency Disk seek 10,000,000 ns Source: "Software Engineering Advice from Building Large-Scale Distributed Systems", Jeffrey Dean

15. RAM read latency Main memory reference 100 ns

16. Pinning Index in RAM ● vmtouch ● mlock ● http://hoytech.com/vmtouch/

17. Problem Index is too large to fit in memory on a single machine

18. Geographical sharding

19. Geographical Sharding drawbacks 1. Cumbersome manual process to determine shard boundary 2. No guarantee that a boundary can be found.

20. Federation 1. Split index across multiple machines 2. Shard on business id 3. TF-IDF scores from different machines should be comparable

21. Mapping businesses to shards 1. Assigning businesses to shards shard = shardlist[hash(business_id) % len(shardlist)] Problems 1. Involves re-indexing all the businesses if we want to add a new shard

22. Virtual Nodes

23. Advantages 1. Flexibility (move vbuckets from one shard to another) 2. Split hot spot shards

24. Lucy Master Slave Architecture Separate indexing (masters) A master for each shard of a service Searching (slaves) A slave for every replica of a service

25. Lucy Indexing

26.

27.

28.

29.

30. Lucy Searching

31.

32. Federator: Combining results across shards 1. Once we distribute an index across shards, we need a component that searches all of the shards and combines their results. 2. Written in Python (runs inside a Python web process). 3. Uses the Tornado IO loop to send requests to all shards. 4. The transfer protocol for the requests is JSON-RPC.

33. Lucy Server

34.

35.

36. Tokens to Business Attributes

37. Executing queries 1. Gather the top results for a query 2. Collect attribute statistics for attributes like places, categories

38. Lucene 1. Efficiently executes queries over the index 2. Scores how relevant each business is to the words in the query (word score) 3. Upgrading Lucene to 2.9/3.1 is a work in progress

39.

40. Successive geobounds relaxation

41. Successive geobounds relaxation

42. Federation

43. Efficiently Retrieving top k hits 1. When a user moves through multiple pages, the number of hits to be retrieved increases: num hits = start + count 2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them

44. Distribution of hits in shards

45.

46. Probability a hit is in a shard

47. Binomial Distribution Probability (r of top k hits) are in a particular shard Mean Variance

48. Formula Std Deviation Formula

49. Simulation Formula Hits selected from each Results Missed (%) shard k = 100 p = 0.2 24 0.017 32 0.0001407 44 0.00000

50. Simulation Graph

51. Results 1. ~ 50% savings over 100 hits (44 hits requested from each shard) 2. 77% savings over 1000 hits (228 hits requested from each shard)

52. Future work 1. In memory index 2. Move towards real time search

53. Come Join Us!

54. Thank You smg@yelp.com
