SlideShare a Scribd company logo
Indexing Stuff &&
Things with Sphinx and Perl
Houston Perl Mongers
May 8th, 2014
Hosted by cPanel, Inc.
Brett Estrade <estrabd@gmail.com>
Sphinx
● full text search indexer and daemon
● indexer - builds indexes
● searchd - services search requests
● very easy to install and configure
Sphinx Data Sources
● Directly from MySQL (MariaDB), PostgreSQL
○ Indexing data from arbitrary SQL
○ Excellent for fast reading of expensive JOINs
● XMLPipe2
○ General intermediate data understood by Sphinx
Search Interface
● Native protocol (e.g., Sphinx::Search)
● Supports MySQL protocol (4.1)
○ Subset of SQL supported is called SphinxQL
indexer data
named index for
searchd
searchd config
Client Example - Sphinx::Search
search term -
empty string
returns “all”
Search Results
Some Common Use Cases
● Rebuild index from database regularly
● Incrementally add to existing index
● Query Sphinx for DB primary keys, make DB
call for related rows
● Query Sphinx for wanted data (no DB at all)
== my use case
Real Life Examples
1. Indexing MariaDB
2. Filtering on string using CRC32
3. Creating sources w/Sphinx::XML::Pipe2
4. Dynamic config w/Sphinx::Config::Builder
Indexing MariaBD ~2.25 Million Rows
● Use case - saving eBay auction data in DB
● Providing search interface to it
● Demo run of indexer
How to Filter on Strings
● Requires CRC32 hashing (strings to ints)
● When indexing, use MySQL’s CRC32 function
● Use Perl’s String::CRC32 to encode string,
○ then set filter
And inside of client, use Perl’s String::CRC32 to encode to the same integer
Transforming Things to XMLPipe2
● XMLPipe2 is Sphinx’s generic data format
● Extract/Transform scripts -> XMLPipe2
● use Sphinx::XML::Pipe2; #’nuff said
Sample XMLPipe2 File
Sample XMLPipe2 Source Conf Entry
Example XMLPipe2 Use Case
● Monitor ephemera,e.g. active eBay listings
● Don’t want to use a database
● Many data partitions (i.e., indexes)
○ e.g., by store, by category, etc
○ > 250 (yikes!)
● Data partitions change over time (slowly)
Dynamic Indexing of XMLPipe2 Stuff
● Fact - Sphinx partitions data by indexes
● Problem - each index uses its own data file
○ data as XMLPipe2
● Challenge - how to manage a changing set
of indexes?
Sphinx’s --config to the Rescue!
● Config files are typically static, right?
● Sphinx can handle executables via --config
● indexer --config ./generate-config.pl --all
Sphinx::Config::Builder
● Module I created specifically for this case
○ uploaded to CPAN
● Why? No Sphinx config builders were a fit
● Module is low level and does what I need
○ i.e., dynamically builds a XMLPipe2 specific config
● A+ 100 Passing
○ http://cpantesters.org/distro/S/Sphinx-Config-Builder.html
Solution
● Expects XML2Pipe data files to already exist
● Iterate over array of indexes to build
● Creates “source” entries for XMLPipe2 data
● Creates “index” entries for each “source”
Demo
Tip of the Iceberg
● Sphinx has TONs of options and modes
● Tons of areas of application
● Many clients, Simple interface
● Super easy to install and maintain
Thank You!
● http://sphinxsearch.com/
● cpan://Sphinx::Search
● cpan://Sphinx::Config::Builder
● http://houston.pm.org

More Related Content

What's hot

Kafka Workshop
Kafka WorkshopKafka Workshop
Kafka Workshop
Alexandre André
 
Prometheus london
Prometheus londonPrometheus london
Prometheus london
wyukawa
 
Xephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backendsXephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backends
University of California, Santa Cruz
 
Writing Well-Behaved Unix Utilities
Writing Well-Behaved Unix UtilitiesWriting Well-Behaved Unix Utilities
Writing Well-Behaved Unix Utilities
Rob Miller
 
Node collaboration - sharing information between your systems
Node collaboration - sharing information between your systemsNode collaboration - sharing information between your systems
Node collaboration - sharing information between your systems
m_richardson
 
Containers and Logging
Containers and LoggingContainers and Logging
Containers and Logging
Eduardo Silva Pereira
 
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Andrii Vozniuk
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
Deep Kapadia
 
What Reika Taught us
What Reika Taught usWhat Reika Taught us
Mongo db admin_20110316
Mongo db admin_20110316Mongo db admin_20110316
Mongo db admin_20110316
radiocats
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Jeremy Zawodny
 
Presto in my_use_case2
Presto in my_use_case2Presto in my_use_case2
Presto in my_use_case2
wyukawa
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
Romi Kuntsman
 
Fluentd and AWS at classmethod
Fluentd and AWS at classmethodFluentd and AWS at classmethod
Fluentd and AWS at classmethod
Treasure Data, Inc.
 
Deployment of xlwings-powered spreadsheets (webinar)
Deployment of xlwings-powered spreadsheets (webinar)Deployment of xlwings-powered spreadsheets (webinar)
Deployment of xlwings-powered spreadsheets (webinar)
xlwings
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
Jeremy Zawodny
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
Jeremy Zawodny
 

What's hot (20)

Kafka Workshop
Kafka WorkshopKafka Workshop
Kafka Workshop
 
Prometheus london
Prometheus londonPrometheus london
Prometheus london
 
Xephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backendsXephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backends
 
Writing Well-Behaved Unix Utilities
Writing Well-Behaved Unix UtilitiesWriting Well-Behaved Unix Utilities
Writing Well-Behaved Unix Utilities
 
Node collaboration - sharing information between your systems
Node collaboration - sharing information between your systemsNode collaboration - sharing information between your systems
Node collaboration - sharing information between your systems
 
Containers and Logging
Containers and LoggingContainers and Logging
Containers and Logging
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
Logstash
LogstashLogstash
Logstash
 
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 
What Reika Taught us
What Reika Taught usWhat Reika Taught us
What Reika Taught us
 
Mongo db admin_20110316
Mongo db admin_20110316Mongo db admin_20110316
Mongo db admin_20110316
 
DrupalANDElasticsearch
DrupalANDElasticsearchDrupalANDElasticsearch
DrupalANDElasticsearch
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
 
Presto in my_use_case2
Presto in my_use_case2Presto in my_use_case2
Presto in my_use_case2
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
Fluentd and AWS at classmethod
Fluentd and AWS at classmethodFluentd and AWS at classmethod
Fluentd and AWS at classmethod
 
Deployment of xlwings-powered spreadsheets (webinar)
Deployment of xlwings-powered spreadsheets (webinar)Deployment of xlwings-powered spreadsheets (webinar)
Deployment of xlwings-powered spreadsheets (webinar)
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
 

Viewers also liked

Pressure Groups and the Millennium Development Goals
Pressure Groups and the Millennium Development GoalsPressure Groups and the Millennium Development Goals
Pressure Groups and the Millennium Development Goalslucyannemorgan
 
Work WIth Redis and Perl
Work WIth Redis and PerlWork WIth Redis and Perl
Work WIth Redis and Perl
Brett Estrade
 
Openmp combined
Openmp combinedOpenmp combined
Openmp combined
Brett Estrade
 
Qore for the Perl Programmer
Qore for the Perl ProgrammerQore for the Perl Programmer
Qore for the Perl Programmer
Brett Estrade
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving Cars
LinkedIn
 

Viewers also liked (6)

Pressure Groups and the Millennium Development Goals
Pressure Groups and the Millennium Development GoalsPressure Groups and the Millennium Development Goals
Pressure Groups and the Millennium Development Goals
 
Work WIth Redis and Perl
Work WIth Redis and PerlWork WIth Redis and Perl
Work WIth Redis and Perl
 
6 Suffering Of Christ
6 Suffering Of Christ6 Suffering Of Christ
6 Suffering Of Christ
 
Openmp combined
Openmp combinedOpenmp combined
Openmp combined
 
Qore for the Perl Programmer
Qore for the Perl ProgrammerQore for the Perl Programmer
Qore for the Perl Programmer
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving Cars
 

Similar to Sphinx && Perl Houston Perl Mongers - May 8th, 2014

A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
Stitch Fix Algorithms
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013
Emanuel Calvo
 
Scaling / optimizing search on netlog
Scaling / optimizing search on netlogScaling / optimizing search on netlog
Scaling / optimizing search on netlog
removed_8e0e1d901e47de676f36b9b89e06dc97
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
bangaloredjangousergroup
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_en
Ogibayashi
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
Databricks
 
Mongodb Performance
Mongodb PerformanceMongodb Performance
Mongodb Performance
Jack
 
InfiniFlux Feature perf comp_v1
InfiniFlux Feature perf comp_v1InfiniFlux Feature perf comp_v1
InfiniFlux Feature perf comp_v1
InfiniFlux
 
There is Javascript in my SQL
There is Javascript in my SQLThere is Javascript in my SQL
There is Javascript in my SQL
PGConf APAC
 
Postgrtesql as a NoSQL Document Store - The JSON/JSONB data type
Postgrtesql as a NoSQL Document Store - The JSON/JSONB data typePostgrtesql as a NoSQL Document Store - The JSON/JSONB data type
Postgrtesql as a NoSQL Document Store - The JSON/JSONB data type
Jumping Bean
 
TYPO3 Transition Tool
TYPO3 Transition ToolTYPO3 Transition Tool
TYPO3 Transition Tool
crus0e
 
Wattpad - Spark Stories
Wattpad - Spark StoriesWattpad - Spark Stories
Wattpad - Spark Stories
Rylan Halteman
 
IniniFlux Feature_Perf_Comparison
IniniFlux Feature_Perf_ComparisonIniniFlux Feature_Perf_Comparison
IniniFlux Feature_Perf_Comparison
InfiniFlux
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and KibanaBuilding a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Mushfekur Rahman
 
Programming the Semantic Web
Programming the Semantic WebProgramming the Semantic Web
Programming the Semantic Web
Steffen Staab
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Serverless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience reportServerless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience report
Metosin Oy
 
Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021
Max Lapan
 
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic WebESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Webeswcsummerschool
 

Similar to Sphinx && Perl Houston Perl Mongers - May 8th, 2014 (20)

A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013
 
Scaling / optimizing search on netlog
Scaling / optimizing search on netlogScaling / optimizing search on netlog
Scaling / optimizing search on netlog
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_en
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Mongodb Performance
Mongodb PerformanceMongodb Performance
Mongodb Performance
 
InfiniFlux Feature perf comp_v1
InfiniFlux Feature perf comp_v1InfiniFlux Feature perf comp_v1
InfiniFlux Feature perf comp_v1
 
There is Javascript in my SQL
There is Javascript in my SQLThere is Javascript in my SQL
There is Javascript in my SQL
 
Postgrtesql as a NoSQL Document Store - The JSON/JSONB data type
Postgrtesql as a NoSQL Document Store - The JSON/JSONB data typePostgrtesql as a NoSQL Document Store - The JSON/JSONB data type
Postgrtesql as a NoSQL Document Store - The JSON/JSONB data type
 
TYPO3 Transition Tool
TYPO3 Transition ToolTYPO3 Transition Tool
TYPO3 Transition Tool
 
Wattpad - Spark Stories
Wattpad - Spark StoriesWattpad - Spark Stories
Wattpad - Spark Stories
 
IniniFlux Feature_Perf_Comparison
IniniFlux Feature_Perf_ComparisonIniniFlux Feature_Perf_Comparison
IniniFlux Feature_Perf_Comparison
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and KibanaBuilding a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
 
Programming the Semantic Web
Programming the Semantic WebProgramming the Semantic Web
Programming the Semantic Web
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Serverless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience reportServerless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience report
 
Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021
 
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic WebESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
ESWC SS 2013 - Tuesday Keynote Steffen Staab: Programming the Semantic Web
 

Recently uploaded

National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 

Recently uploaded (20)

National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 

Sphinx && Perl Houston Perl Mongers - May 8th, 2014

  • 1. Indexing Stuff && Things with Sphinx and Perl Houston Perl Mongers May 8th, 2014 Hosted by cPanel, Inc. Brett Estrade <estrabd@gmail.com>
  • 2. Sphinx ● full text search indexer and daemon ● indexer - builds indexes ● searchd - services search requests ● very easy to install and configure
  • 3. Sphinx Data Sources ● Directly from MySQL (MariaDB), PostgreSQL ○ Indexing data from arbitrary SQL ○ Excellent for fast reading of expensive JOINs ● XMLPipe2 ○ General intermediate data understood by Sphinx
  • 4. Search Interface ● Native protocol (e.g., Sphinx::Search) ● Supports MySQL protocol (4.1) ○ Subset of SQL supported is called SphinxQL
  • 8. Client Example - Sphinx::Search search term - empty string returns “all”
  • 10. Some Common Use Cases ● Rebuild index from database regularly ● Incrementally add to existing index ● Query Sphinx for DB primary keys, make DB call for related rows ● Query Sphinx for wanted data (no DB at all) == my use case
  • 11. Real Life Examples 1. Indexing MariaDB 2. Filtering on string using CRC32 3. Creating sources w/Sphinx::XML::Pipe2 4. Dynamic config w/Sphinx::Config::Builder
  • 12. Indexing MariaBD ~2.25 Million Rows ● Use case - saving eBay auction data in DB ● Providing search interface to it ● Demo run of indexer
  • 13. How to Filter on Strings ● Requires CRC32 hashing (strings to ints) ● When indexing, use MySQL’s CRC32 function ● Use Perl’s String::CRC32 to encode string, ○ then set filter
  • 14. And inside of client, use Perl’s String::CRC32 to encode to the same integer
  • 15. Transforming Things to XMLPipe2 ● XMLPipe2 is Sphinx’s generic data format ● Extract/Transform scripts -> XMLPipe2 ● use Sphinx::XML::Pipe2; #’nuff said
  • 18. Example XMLPipe2 Use Case ● Monitor ephemera,e.g. active eBay listings ● Don’t want to use a database ● Many data partitions (i.e., indexes) ○ e.g., by store, by category, etc ○ > 250 (yikes!) ● Data partitions change over time (slowly)
  • 19. Dynamic Indexing of XMLPipe2 Stuff ● Fact - Sphinx partitions data by indexes ● Problem - each index uses its own data file ○ data as XMLPipe2 ● Challenge - how to manage a changing set of indexes?
  • 20. Sphinx’s --config to the Rescue! ● Config files are typically static, right? ● Sphinx can handle executables via --config ● indexer --config ./generate-config.pl --all
  • 21. Sphinx::Config::Builder ● Module I created specifically for this case ○ uploaded to CPAN ● Why? No Sphinx config builders were a fit ● Module is low level and does what I need ○ i.e., dynamically builds a XMLPipe2 specific config ● A+ 100 Passing ○ http://cpantesters.org/distro/S/Sphinx-Config-Builder.html
  • 22. Solution ● Expects XML2Pipe data files to already exist ● Iterate over array of indexes to build ● Creates “source” entries for XMLPipe2 data ● Creates “index” entries for each “source”
  • 23. Demo
  • 24. Tip of the Iceberg ● Sphinx has TONs of options and modes ● Tons of areas of application ● Many clients, Simple interface ● Super easy to install and maintain
  • 25. Thank You! ● http://sphinxsearch.com/ ● cpan://Sphinx::Search ● cpan://Sphinx::Config::Builder ● http://houston.pm.org