Vespa, A Tour


A tour of the recently open sourced Vespa search and data engine from Oath

Published in: Technology

  1. 1. Vespa, a tour
  2. 2. Me
      Matt Overstreet, OpenSource Connections
      Stuff I do:
      • Solr/Elasticsearch/Searchy-stuff
      • DataStax Cassandra
      • Software Development
  3. 3. What is it?
      "Big data. Real time. The open big data serving engine: Store, search, rank and organize big data at user serving time." [1]
      [1] http://
  4. 4. What does it do?
      Use Vespa to build:
      • Search applications
      • Personalized recommendation
      • Navigation pages computed on demand
      • Realtime data displays - tag clouds, maps, graphs
  5. 5. Configuring
  6. 6. Application Packages
      A Vespa application package is the set of configuration files and Java plugins that together define the behavior of a Vespa system.
  7. 7. Services.xml
      Primary config file for an application package.
      • <search> sets up the search endpoint for Vespa queries. The default port is 8080.
      • <nodes> defines the nodes required per service. (See the reference for more on container cluster setup.)
      • <content> defines how documents are stored and searched
  8. 8. Search Definition
      Field Definitions:
      • index: Create a search index for this field
      • attribute: Store this field in memory as an attribute, for sorting, searching and grouping
      • summary: Let this field be part of the document summary in the result set
  9. 9. Stopwords, Synonyms and Query Rewriting
      [stopword] -> ; # (Replace them by nothing)
      [stopword] :- and, or, the, be;
      lotr -> lord of the rings;
      [brand] -> company:[brand];
      [brand] :- sony, dell, ibm, hp;
      [category] +> $category:[category];
      [category] :- laptop, digital camera, camera;
      [destination] (in, by, at, on) [place] +> $name:[destination]
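To make the effect of the first few rules on this slide concrete, here is a toy Python sketch. It is not Vespa's semantic rule engine: the function name `rewrite` and the single-pass, per-term matching are assumptions made for illustration, and only the stopword, `lotr`, and `[brand]` rules are covered.

```python
# Toy illustration of three of the rewrite rules above.
# Hypothetical helper, not part of Vespa; real rules are evaluated by Vespa itself.
STOPWORDS = {"and", "or", "the", "be"}                 # [stopword] :- and, or, the, be;
SYNONYMS = {"lotr": ["lord", "of", "the", "rings"]}    # lotr -> lord of the rings;
BRANDS = {"sony", "dell", "ibm", "hp"}                 # [brand] :- sony, dell, ibm, hp;


def rewrite(query):
    """Apply the rules once to each whitespace-separated input term."""
    out = []
    for term in query.lower().split():
        if term in STOPWORDS:
            continue                          # [stopword] -> ;  (replace by nothing)
        if term in SYNONYMS:
            out.extend(SYNONYMS[term])        # replace the term with its expansion
        elif term in BRANDS:
            out.append("company:" + term)     # [brand] -> company:[brand];
        else:
            out.append(term)
    return out


print(rewrite("the lotr and sony laptop"))
```

Because this sketch only checks the original input terms, the "the" introduced by the `lotr` expansion is kept; how a real rule base orders and re-applies rules is not covered here.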
  10. 10. Linguistics
  11. 11. Default Linguistics
      • Tokenization on whitespace
      • Kstemmer for stemming
      • Changing linguistics means writing code
      • Only English for stemming, waiting for community support to extend
      See http:// for more information.
  12. 12. Custom Linguistics
      Start here: https:// engine/vespa/tree/master/linguistics/src/main/java/com/yahoo/language/simple
  13. 13. Ranking
  14. 14. First, Querying and a Little YQL
      http://localhost:8080/search/
        ?yql=select * from sources * where userQuery()
        &query=trees
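The query on this slide is shown unencoded for readability. A minimal Python sketch of building the same request URL with proper percent-encoding follows; the helper name `build_search_url` is hypothetical, and the host and port are taken from the slide.

```python
from urllib.parse import urlencode


def build_search_url(user_query, host="http://localhost:8080"):
    """Build the search URL from the slide, URL-encoding both parameters."""
    params = {
        "yql": "select * from sources * where userQuery()",
        "query": user_query,
    }
    return host + "/search/?" + urlencode(params)


print(build_search_url("trees"))
```

The resulting string can be passed to any HTTP client once a Vespa instance is listening on that port.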
  15. 15. Other YQL Examples
      Numerics
      select * from sources * where 500 >= price;
      Grouping and aggregates
      select * from sources * where sddocname contains 'purchase' | all(group(customer) each(output(sum(price))));
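For readers new to the grouping syntax, the expression `all(group(customer) each(output(sum(price))))` computes one price total per customer. The Python sketch below reproduces that computation over an in-memory list; the sample records are made up for illustration and the function name is hypothetical.

```python
from collections import defaultdict


def sum_price_by_customer(purchases):
    """Group purchase records by customer and sum the price in each group."""
    totals = defaultdict(int)
    for purchase in purchases:
        totals[purchase["customer"]] += purchase["price"]
    return dict(totals)


# Made-up sample data, not from the slides.
sample = [
    {"customer": "Smith", "price": 100},
    {"customer": "Jones", "price": 250},
    {"customer": "Smith", "price": 50},
]
print(sum_price_by_customer(sample))
```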
  16. 16. NativeRank
      "Out of the box" ranking for Vespa combines [1]:
      • Field/Attribute Match
      • Proximity
      Good for text ranking, but should be combined with other features for even better relevancy.
      [1] http://
  17. 17. Ranking Expressions
      Built with query features:
      nativeRank + query(deservesFreshness) * freshness(timestamp)
  18. 18. More Features
      Feature                              Description
      term(n).significance                 normalized number (between 0.0 and 1.0) describing the significance of the term
      term(n).connectedness                normalized strength with which this term is connected to the previous term
      queryTermCount                       number of terms in this query
      fieldLength(name)                    number of terms in this field
      fieldMatch(name)                     normalized measure of degree to which query and field matched
      fieldMatch(name).queryCompleteness   normalized ratio of query tokens matched in the field
      fieldMatch(name).fieldCompleteness   normalized ratio of query tokens which was matched
      distanceToPath(name).distance        Euclidean distance from a path through 2D space
      Full list: http:// features.html
  19. 19. Two Phase Ranking
      search myapp {
          …
          rank-profile default inherits default {
              first-phase {
                  expression: nativeRank + query(deservesFreshness) * freshness(timestamp)
              }
              second-phase {
                  expression {
                      0.7 * ( 0.7*fieldMatch(title) + 0.2*fieldMatch(description) + 0.1*fieldMatch(body) ) +
                      0.3 * attributeMatch(keywords)
                  }
                  rerank-count: 200
              }
          }
      }
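The idea behind the two phases can be sketched in a few lines of Python: score everything with the cheap first-phase function, then re-score only the top `rerank-count` hits with the more expensive one. This is a simplified, single-node illustration under assumed names (`two_phase_rank`, `first_phase`, `second_phase`); it does not model how Vespa distributes or rescales scores.

```python
def two_phase_rank(docs, first_phase, second_phase, rerank_count=200):
    """Order docs by first_phase, then re-order only the top rerank_count by second_phase."""
    ranked = sorted(docs, key=first_phase, reverse=True)
    head, tail = ranked[:rerank_count], ranked[rerank_count:]
    head = sorted(head, key=second_phase, reverse=True)   # expensive scoring on few docs
    return head + tail                                    # tail keeps first-phase order


# Made-up documents with precomputed scores standing in for the two expressions.
docs = [
    {"id": "a", "cheap": 3, "costly": 1},
    {"id": "b", "cheap": 2, "costly": 5},
    {"id": "c", "cheap": 1, "costly": 9},
]
result = two_phase_rank(docs, lambda d: d["cheap"], lambda d: d["costly"], rerank_count=2)
print([d["id"] for d in result])
```

Note that document "c" has the best second-phase score but is never re-scored, because it did not make the first-phase cut. That trade-off is what `rerank-count` controls.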
  20. 20. Side Note: Literal boosting
      Vespa stems by default, but allows access to the literal value.
      field title type string {
          indexing: index
          rank: literal
      }
      You can write this ranking expression:
      0.9*fieldMatch(title) + 0.1*fieldMatch(title_literal)
  21. 21. Tensors
      Multi-dimensional arrays of values.
      {
          "user_id": 270,
          "user_item_cf": {
              "user_item_cf:0": -1.750116e-05,
              "user_item_cf:1": 9.730623e-05,
              "user_item_cf:2": 8.515047e-05,
              "user_item_cf:3": 6.9297894e-05,
              "user_item_cf:4": 7.343942e-05,
              "user_item_cf:5": -0.00017635927,
              "user_item_cf:6": 5.7642872e-05,
              "user_item_cf:7": -6.6685796e-05,
              "user_item_cf:8": 8.5506894e-05,
              "user_item_cf:9": -1.7209566e-05
          }
      }
  22. 22. Searching With Tensors
      rank-profile tensor {
          first-phase {
              expression: sum(query(user_item_cf) * attribute(user_item_cf))
          }
      }
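The expression `sum(query(user_item_cf) * attribute(user_item_cf))` multiplies the cells the two tensors have in common and adds up the products, which amounts to a dot product. A minimal Python sketch of that arithmetic, assuming each tensor is represented as a plain dict from cell label to value (a representation chosen here for illustration, not Vespa's internal one):

```python
def sparse_dot(query_tensor, doc_tensor):
    """Sum of products over the cell labels present in both tensors."""
    return sum(
        value * doc_tensor[label]
        for label, value in query_tensor.items()
        if label in doc_tensor
    )


# Made-up values; only label "1" is shared, so the score is 2.0 * 3.0.
print(sparse_dot({"0": 1.0, "1": 2.0}, {"1": 3.0, "2": 4.0}))
```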
  23. 23. Ranking with TensorFlow models
      search tf {
          document tf {
              field document_tensor type tensor(d0[1],d1[784]) {
                  indexing: attribute | summary
                  attribute: tensor(d0[1],d1[784])
              }
          }
          rank-profile default inherits default {
              macro input_tensor() {
                  expression: attribute(document_tensor)
              }
              first-phase {
                  expression: sum(tensorflow("my_model/saved", "serving_default", "output"))
              }
          }
      }
  24. 24. Try it
  25. 25. With Docker
      git clone
      export VESPA_SAMPLE_APPS=`pwd`/sample-apps
      docker run --detach --name vespa --hostname vespa-container --privileged \
        --volume $VESPA_SAMPLE_APPS:/vespa-sample-apps --publish 8080:8080 vespaengine/vespa
      http://