Building an easy to use search solution (for different languages) Ivo Lukač @ J.Boye Aarhus 13: Web & Intranet Conference
1. Building an easy to
use search solution
(for different languages)
Ivo Lukač @ J.Boye Aarhus 13: Web & Intranet Conference
“Making search work” track
2. Speaker
• Co-owner of Netgen - web development
agency, Zagreb, Croatia
• Started as developer 11 years ago
• Now I do a variety of things, but can best be
described as International Business Developer
www.netgenlabs.com
3. So I am still a developer! :)
4. Use case
• Regulatory reform project: cutting unneeded
legislation, laws and/or procedures
• Netgen is the technology implementation partner
• Project led by Sense Consulting
• Croatia, Egypt, Vietnam, Armenia, Iraq - mostly
“exotic” countries
5. We would rather work in
Denmark, but it seems that
it doesn’t need such a
solution :(
7. Solution
• In 2006: a simple filter
• Today: a flexible information architecture powered by
eZ Publish CMS, with Solr for search
• Usually 70% common features, 30% customisation
• Aiming for 90%/10%
• If you are interested in tech specifics, ask me later…
8. Search features
• Simple (default) and advanced search (with filters)
• Full text search on complex data, boosting on attribute level
• Filtering with multilevel tags/taxonomies
• Stopwords
• Search time spelling based on indexed data
• Sometimes using faceting on result set
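As a rough illustration of how several of these features map onto a Solr request, here is a minimal sketch that only builds the query string. The parameter names (defType, qf, fq, spellcheck, facet, facet.field) are standard Solr request parameters; the field names (title, summary, body, tags), the boost values and the helper build_search_params are hypothetical and not taken from the project. Stopwords are normally configured in the index schema rather than in the request, and spellcheck only responds if a spellcheck component is set up on the request handler.

```python
from urllib.parse import urlencode


def build_search_params(term, tags=(), facet_field=None):
    """Build Solr request parameters for one search (illustrative only)."""
    params = [
        ("q", term),
        ("defType", "edismax"),
        # Attribute-level boosting: hypothetical fields and weights.
        ("qf", "title^3 summary^2 body"),
        # Ask for spelling suggestions based on the indexed data.
        ("spellcheck", "true"),
        ("wt", "json"),
    ]
    # One filter query per selected tag (advanced search with filters).
    for tag in tags:
        params.append(("fq", f'tags:"{tag}"'))
    # Optional faceting on the result set.
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return params


params = build_search_params("licence", tags=["trade/import"], facet_field="tags")
query_string = urlencode(params)
print(query_string)
```

The simple (default) search would call the helper with the term only; the advanced search adds tags and, where useful, a facet field.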
11. Characters
• At the beginning we didn’t have Unicode, and it was a mess!
• Unicode solved a lot of problems but not all
• The same character can be encoded by more than one byte
sequence, and this is not normalised by default
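The normalisation problem can be shown with a small sketch using Python's standard unicodedata module. The choice of the NFC form and the helper name normalise are assumptions for illustration; the talk does not say which form or tool the project used.

```python
import unicodedata

# "č" as one precomposed code point vs. "c" followed by a combining caron.
composed = "Luka\u010d"
decomposed = "Lukac\u030c"


def normalise(text: str) -> str:
    """Return the NFC (canonical composed) form of the text."""
    return unicodedata.normalize("NFC", text)


# Both strings look identical on screen but differ as code points and bytes.
print(composed == decomposed)                        # False
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))               # 6 7
print(normalise(composed) == normalise(decomposed))  # True
```

Applying the same normalisation to both indexed content and search terms keeps visually identical words from missing each other.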
12. Indexing
• Indexing files like Word, PDF or similar proved
to be problematic due to character problems
• Token delimiter configuration can be
language specific
• Stemming is sometimes supported, sometimes
not
14. Blind work
• the biggest challenge is that developers don’t know the
language
• first level of testing is very hard
• still can’t trust Google Translate
15. What vehicle would you
use to transport 10 cases
of Heineken?
18. Main idea
• let’s try to assess search result quality
• use editors for rating (not the public)
• use most frequently searched terms (we
can’t test all)
• rate results above the fold
19. The tool
• integrated in the public site
• added thumbs up/down buttons for the first X
results, shown only to editors
20. Demo
• imported articles about CMS topics from various
sources into a test instance
• rating result quality of 7 search terms
• thumbs up/down for the 3 suggested search results
• Test periods are used for framing test data
23. Rate measures
• Discounted Cumulative Gain (DCG) - rate sum
discounted based on position in search results
• Normalised Discounted Cumulative Gain (NDCG) -
discounted rate sum normalised against best possible
outcome (to get percentage as the unit)
• Popularity based NDCG - takes into account the
popularity of the search term
http://en.wikipedia.org/wiki/Discounted_cumulative_gain
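A minimal sketch of these measures, under stated assumptions: thumbs up counts as 1 and thumbs down as 0, the discount is the log2 form from the Wikipedia article, and the "best possible outcome" is the same ratings sorted best first. The popularity weighting (an average of per-term NDCG weighted by search count) is only one plausible reading of the third bullet, and the function names are hypothetical.

```python
import math


def dcg(ratings):
    """Sum of ratings, each discounted by log2(position + 1), positions from 1."""
    return sum(r / math.log2(pos + 1) for pos, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG divided by the DCG of the same ratings in the best possible order."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0


def popularity_ndcg(ratings_by_term, searches_by_term):
    """Average NDCG over search terms, weighted by how often each was searched.

    Assumption: this weighting is illustrative, not the talk's exact formula.
    """
    total = sum(searches_by_term[term] for term in ratings_by_term)
    if total == 0:
        return 0.0
    weighted = sum(ndcg(ratings) * searches_by_term[term]
                   for term, ratings in ratings_by_term.items())
    return weighted / total


# Editor ratings for the first three results: up, down, up.
print(dcg([1, 0, 1]))   # 1.5
print(ndcg([1, 0, 1]))  # below 1.0, because a good result sits in third place
```

Multiplying the NDCG value by 100 gives the percentage mentioned above. A term with no thumbs up at all scores 0 here, which ties in with the "no good result" case.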
24. Known problems
• What if good results are not showing? - something bad
is going on with the search engine
• what if there is no good result?
• what about new content added over time?
• at the end of the day measurements are good for
comparing between test periods, not meaningful by
themselves
25. Improvements
• opening rating to public users
• using clicks as ratings
• implement “did you find what you were looking for?”
feature
• integrate with analytics
• use rating data to boost particular items in search!
26. Questions now or later
ivo@netgen.hr
ilukac.com/twitter
ilukac.com/facebook
ilukac.com/gplus
ilukac.com/linkedin