# Evolving Search Relevancy: Presented by James Strassburg, Direct Supply

Presented at Lucene/Solr Revolution 2014

Published in: Software
1. Evolving Search Relevancy
   - James Strassburg, Senior Architect - Direct Supply
   - @jstrassburg
2. Agenda
   - An Optimization Problem
   - Genetic Algorithm Overview
   - Modeling Solr Parameters
   - Fitness Function
3. sir can you help me… ????
   - "iam from indonesia want to build search engine like a Google and i want to build the system using Genetic Algorithm but iam confused what will i do first. Thanks before."
4. Search Algorithm Parameters
   - `/select?q=foo&defType=dismax&qf=name^20+desc^10&pf=name^10&ps=3&mm=2&bf="ord(popularity)^0.05"`
   - and many more
5. Where did those numbers come from?
   - I made them up… shhhhhhh.
   - Then we tweaked them after testing.
6. An Optimization Problem
   - So, how do we know we have the best set of numbers? Or even a good set?
   - We have an optimization problem.
7. Sample Schema
   - `<field name="name" type="text_en" indexed="true" stored="true" required="true" multiValued="false" omitNorms="true"/>`
   - `<field name="description" type="text_en" indexed="true" stored="true" multiValued="false" omitNorms="true"/>`
8. Sample Data Set
   - `[{ "name":"Red Lobster", "description":"We deliver the freshest caught seafood every day." },{ "name":"Joe's Crab Shack", "description":"We serve delicious red crabs, rock crabs, large lobsters, and other delicious seafood. Our lobsters are our specialty."}]`
   - `http://localhost:8983/solr/restaurantsCollection/select?q=red+lobster&defType=dismax&qf=name+description&indent=true&fl=name+description`
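
A query like the one on this slide can be assembled in code, which becomes useful once parameter values come from an algorithm rather than from hand tuning. The following is a minimal sketch using only Python's standard library. The helper name `build_select_url` is hypothetical; the host, collection, and parameters are the ones shown on the slide.

```python
from urllib.parse import urlencode

# Base URL taken from the slide; adjust for your own Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/restaurantsCollection/select"


def build_select_url(query, fields, base=SOLR_SELECT):
    """Build a dismax select URL (hypothetical helper, not part of Solr)."""
    params = {
        "q": query,
        "defType": "dismax",
        "qf": " ".join(fields),  # urlencode turns the spaces into '+'
        "indent": "true",
        "fl": " ".join(fields),
    }
    return base + "?" + urlencode(params)


url = build_select_url("red lobster", ["name", "description"])
print(url)
```

If boosts such as `name^20` are added to `qf`, `urlencode` percent-encodes the caret as `%5E`, which is an equivalent encoding of the same query string.
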
9. Genetic Algorithms
   - A tool for solving optimization problems
   - Based on ideas from genetics, evolution, and natural selection
   - DEAP – Distributed Evolutionary Algorithms in Python
10. Genetic Algorithms
    - Define candidate solution encoding
    - Define a fitness function
    - Generate random solutions
    - Select candidates for reproduction
    - Use crossover and mutation to create a new generation
    - Repeat until some criterion is met
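
The steps on this slide can be sketched as a small loop. This is a toy version in plain Python, not the DEAP-based code from the talk. The function name `evolve`, the tournament selection, the elitism, the 0.2 mutation rate, and the bit-counting fitness are all assumptions chosen to keep the example short.

```python
import random


def evolve(fitness, genome_length=8, pop_size=20, generations=30, seed=0):
    """Toy genetic algorithm over bit lists; returns the fittest individual."""
    rng = random.Random(seed)

    # Generate random candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]

    def select():
        # Tournament selection: the fitter of two random candidates reproduces.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        # Elitism: carry the current best forward unchanged.
        next_gen = [max(population, key=fitness)]
        while len(next_gen) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randint(1, genome_length - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.2:                   # occasional mutation
                i = rng.randrange(genome_length)
                child[i] = 1 - child[i]
            next_gen.append(child)
        population = next_gen

    return max(population, key=fitness)


# Placeholder fitness: count the 1 bits (the classic "OneMax" toy problem).
best = evolve(fitness=sum)
print(best, sum(best))
```

For search relevancy, the placeholder fitness would be replaced by a function that decodes the bits into query parameters and scores the results they produce.
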
11. Crossover and Mutation
    - Parent 1: [1,0,1,1,1,0,1,1]
    - Parent 2: [0,0,0,0,1,1,1,1]
    - Child: [1,0,0,1,1,0,1,0]
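
The slide does not say which operators produced the child, so the following is one possible reading, not the talk's method: a uniform crossover followed by a single bit flip. The function names and the mask are hypothetical. In the slide's child the last bit is 0 although both parents have 1 there, so crossover alone cannot explain it and a mutation must be involved.

```python
def uniform_crossover(parent1, parent2, mask):
    """Take each gene from parent1 where mask is 1, otherwise from parent2."""
    return [a if m else b for a, b, m in zip(parent1, parent2, mask)]


def flip_bit(genome, index):
    """Return a copy of genome with the bit at index flipped (a point mutation)."""
    mutated = list(genome)
    mutated[index] = 1 - mutated[index]
    return mutated


parent1 = [1, 0, 1, 1, 1, 0, 1, 1]
parent2 = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical mask: in a real run it would be drawn at random.
mask = [1, 1, 0, 1, 1, 1, 1, 1]
crossed = uniform_crossover(parent1, parent2, mask)  # [1, 0, 0, 1, 1, 0, 1, 1]
child = flip_bit(crossed, 7)                         # [1, 0, 0, 1, 1, 0, 1, 0]
print(child)
```
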
12. Encoding Parameters
    - `>>> import sys`
    - `>>> sys.float_info`
    - `sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)`
13. Encoding Parameters
    - `>>> import numpy`
    - `>>> single = numpy.float32(3.4)`
    - `>>> single`
    - `3.4000001`
    - `>>> half_single = numpy.float16(3.4)`
    - `>>> half_single`
    - `3.4004`
14. Encoding Parameters
    - `/select?q=foo&qf=field^35.2` versus `/select?q=foo&qf=field^35.3`
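
The three encoding slides above seem to make the point that standard float formats carry far more precision than a boost value needs, since 35.2 and 35.3 are unlikely to rank results differently. As a rough illustration, this sketch uses the standard `struct` module instead of numpy to show how many bits each format spends on one parameter and what 3.4 becomes after a round trip. The helper name `round_trip` is hypothetical.

```python
import struct


def round_trip(value, fmt):
    """Pack value into the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# '<d' = 64-bit double, '<f' = 32-bit single, '<e' = 16-bit half precision.
for fmt in ("<d", "<f", "<e"):
    bits = struct.calcsize(fmt) * 8
    print(bits, round_trip(3.4, fmt))
```

Even the 16-bit format gives the genetic algorithm 65,536 bit patterns to explore per parameter, most of which would behave almost identically as boosts.
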
15. Decimal / Fibonacci Encoding
    - 0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
    - 16 values encode into 4 bits
    - Supports fast evolution
    - Avoids relative maxima
16. Decimal / Fibonacci Encoding
    - 0.0 => [0, 0, 0, 0]
    - 0.2 => [0, 0, 0, 1]
    - 0.4 => [0, 0, 1, 0]
    - …
    - 1 => [0, 1, 0, 1]
    - 2 => [0, 1, 1, 0]
    - …
    - 144 => [1, 1, 1, 1]
17. Candidate Solution Encoding
    - `/select?q=foo&qf=name^0.4+desc^13`
    - 0.4 => [0, 0, 1, 0]
    - 13 => [1, 0, 1, 0]
    - Candidate Solution: [0, 0, 1, 0, 1, 0, 1, 0]
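
The encoding on the slides above amounts to a 16-entry lookup table indexed by a 4-bit number. The sketch below reproduces the slide's candidate solution. The function names are hypothetical, and the table assumes a 0.6 entry at index 3, which is inferred from the bit patterns shown for 1, 13, and 144.

```python
# Value table assumed from the slides: index 3 (0.6) is inferred so that
# 1 maps to [0, 1, 0, 1] and 144 maps to [1, 1, 1, 1].
VALUES = [0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]


def encode(value):
    """Return the 4-bit list for a value in the table."""
    index = VALUES.index(value)
    return [int(bit) for bit in format(index, "04b")]


def decode(bits):
    """Return the table value for a 4-bit list."""
    return VALUES[int("".join(str(b) for b in bits), 2)]


def decode_candidate(candidate, fields):
    """Turn a flat bit list into a qf string, 4 bits per field."""
    boosts = [decode(candidate[i * 4:(i + 1) * 4]) for i in range(len(fields))]
    return " ".join(f"{name}^{boost}" for name, boost in zip(fields, boosts))


candidate = encode(0.4) + encode(13)
print(candidate)                                      # [0, 0, 1, 0, 1, 0, 1, 0]
print(decode_candidate(candidate, ["name", "desc"]))  # name^0.4 desc^13
```

Because every 4-bit pattern decodes to a valid boost, crossover and mutation can never produce an invalid candidate.
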
18. Fitness Function
    - Measure how well a candidate solution solves the problem
    - Should be very fast
19. Normalized Discounted Cumulative Gain
    - Very relevant > relevant > not relevant
    - Relevant results are more useful if they appear earlier
    - The resulting score should be independent of the query
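
One common formulation of NDCG divides each result's graded relevance by the base-2 logarithm of its rank plus one, sums those terms, and normalizes by the score of the ideal ordering. The sketch below uses that formulation. The talk does not state which variant it uses, and the 2/1/0 grading scale is an assumption.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: later positions contribute less."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Assumed grading: 2 = very relevant, 1 = relevant, 0 = not relevant.
print(ndcg([2, 1, 0]))  # 1.0, already in the ideal order
print(ndcg([0, 1, 2]))  # lower, because the best result is ranked last
```

Dividing by the ideal ordering keeps the score between 0 and 1 whatever the query, so scores from different queries can be averaged into one fitness value.
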
20. Precision and Recall
    - Precision – Likelihood that a returned result was correct
    - Recall – Likelihood that a relevant result was returned
21. F-measure
    - Harmonic mean of precision and recall
    - Punishes outliers
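
Precision, recall, and the F-measure from the two slides above can be computed from a set of returned documents and a set of relevant ones. This is a minimal sketch. The function name and the document ids are made up for illustration, and it computes the balanced F1 variant.

```python
def precision_recall_f1(returned, relevant):
    """Return (precision, recall, F1) for two collections of document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical document ids: 4 returned, 3 of them among the 6 relevant ones.
p, r, f1 = precision_recall_f1(["a", "b", "c", "x"],
                               ["a", "b", "c", "d", "e", "f"])
print(p, r, f1)
```

Here precision is 0.75 and recall is 0.5. The arithmetic mean would be 0.625, while the harmonic mean is 0.6: it is pulled toward the weaker of the two values, which is how the measure punishes a lopsided result.
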
22. Analytics in Schema
    - `<field name="searchTermInteractions" type="lowercase" indexed="true" stored="true" multiValued="true"/>`
23. Demo
24. Resources
    - DEAP - https://code.google.com/p/deap/
    - My GitHub repo for this example - https://github.com/jstrassburg/evolving-search-relevancy
    - @jstrassburg