Search Analytics Business Value & NoSQL Backend

Otis Gospodnetić – Sematext International
@otisg ◦ @sematext ◦ sematext.com

sematext.com/search-analytics

About Otis Gospodnetić
• ASF Member: Lucene, Solr, Nutch, Mahout

• Author: Lucene in Action 1 & 2

• Entrepreneur: Sematext, Simpy

2
Copyright 2011 Sematext Int'l. All rights reserved.

Sematext Metrics
● 100% organic: no GMO, no VC
● 4 years old
● < 10 people
● 7 countries
● 3 timezones
● 2 continents
● > 100 customers

3

About Sematext
Products & Services
Consulting, Development, Tech Support:

● Search (Lucene, Solr, ElasticSearch...)
● Big Data (Hadoop, HBase, Voldemort...)
● Web Crawling (Nutch, Droids)
● Machine Learning (Mahout)

4

Agenda

● What is Search Analytics and why it matters
● Example reports and their value
● What we built, why, and how

5

Communication
● twitter.com/sematext
● twitter.com/otisg
● hash tags: #stsa or #stanalytics
● http://sematext.com/search-analytics/index.html
● Raise your hand!
● otis@sematext.com

6

The Compass

Search logs are your Map
Search Analytics is your Compass

7

High Level Why

search users
search experience
search providers

8

High Level Why
search users: "This search sucks! It takes 17 tries to find anything here! F!?@#$%^&?!?"
search experience
search providers: "Cool, the latest search tweaks made our site really sticky! Awesome!"

9

Don't Be Like This Dude

10

Got Clue?

Performance Monitoring

Tuning Search Analytics UI

Quality Assurance

11

More Concrete Why
● Measure and monitor everything. Introspection.
● Supports (re)design, navigation choices
● Helps with content acquisition & enhancement
● Improve search experience
● Moolah

12

The Moment of Truth
Question for the audience #1

What do you use for Search Analytics?

a) Home grown stuff
b) Google Analytics
c) Omniture
d) Webtrends
e) Other
f) Nothing

13

Search Analytics Outline
● Collect: queries & clicks & interactions & ...
● Analyze: actions / xactions / conversions
● Output: reports – over time
● Output++: feedback loop (remember this)

● The means, not the goal
● Ongoing, not one-off

14

Search vs. Web Analytics
● Search analytics captures user intent and information needs directly; web analytics has to infer them
● The two go hand in hand
● Ideally you can relate data from both or even unify it

15

Example Core Reports
● Rate & Volume, Latency (mean, 90th percentile)
● Click Through Rate, Mean Reciprocal Rank
● Top Queries by count, clicks, 0 hits...
● Query Trending
● Top Seen Docs, Top Clicked Docs (msft)
● Page & Click Depth
● Facet & Sort Usage
● ...
16
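
Two of the reports above, Click Through Rate and Mean Reciprocal Rank, plus the 0-hits view of Top Queries, can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the record layout `(query, hits, clicked_rank)` and the sample data are assumptions, and MRR here uses the rank of the first click as a stand-in for the first relevant result.

```python
from collections import defaultdict

# Hypothetical log records: (query, number_of_hits, clicked_rank or None).
# clicked_rank is the 1-based position of the first clicked result.
SEARCH_LOG = [
    ("hbase scan", 120, 1),
    ("hbase scan", 120, None),
    ("flume sink", 40, 2),
    ("solr facet", 15, 4),
    ("lucene in acton", 0, None),
]


def click_through_rate(log):
    """Fraction of searches that led to at least one click."""
    clicked = sum(1 for _, _, rank in log if rank is not None)
    return clicked / len(log)


def mean_reciprocal_rank(log):
    """Average of 1/rank of the first click; searches without a click count as 0."""
    return sum(1.0 / rank if rank else 0.0 for _, _, rank in log) / len(log)


def zero_hit_queries(log):
    """Queries that returned no results, with how often each occurred."""
    counts = defaultdict(int)
    for query, hits, _ in log:
        if hits == 0:
            counts[query] += 1
    return dict(counts)
```

A low MRR with a healthy CTR suggests users do find something, but only after scrolling past the top results.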

More Reports in More Detail
● See "Search Analytics: What? Why? How?"

http://blog.sematext.com/tag/analytics/

17

Part Dos
Switching gears... Juno digs NoSQL

18

What We've Built
● Search Analytics SaaS
● Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
● Trending over time
● Comparisons of time periods
● Top N reports
● Filter, slice and dice

19

Who Needs a Compass?
● We need it
● search-hadoop.com & search-lucene.com

● Our customers need it!

● You?

20

Sematext Search Analytics

21

Big Dreams
● SaaS
● Multitenant
● Large Scale – Massive Data
● Cloud

22

Storage Choices
● RDBMS: MySQL, PostgreSQL
● HDFS
● Hive
● HBase
● Cassandra

23

SaaS vs. In-House
Question for the audience #2

SaaS vs in-house Search Analytics?

a) SaaS
b) in-house

24

Data Flow
● See Search Analytics with Flume and HBase
http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/

29

Data Collection
● See Search Analytics with Flume and HBase
http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/

30

Core Tech
● JavaScript Beacons
● Metric Capture Web App aka Receiver
● Flume Agents, Collectors, Sinks
● HBase
● MapReduce Aggregations
● Search Analytics Reporting Web App
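
The first hop in this chain, a JavaScript beacon hitting the Receiver, boils down to turning a request URL into an event record. The sketch below shows that step only; the parameter names (`t`, `e`, `q`, `h`, `ms`) and the receiver URL are hypothetical, not the actual wire format.

```python
from urllib.parse import urlsplit, parse_qs


def parse_beacon(url):
    """Turn a beacon request URL into a flat event dict.

    Parameter names are made up for this sketch:
    t = tenant, e = event type, q = query, h = hit count, ms = latency.
    """
    params = parse_qs(urlsplit(url).query)

    def first(name, default=""):
        # parse_qs returns a list per key; take the first value if present.
        return params.get(name, [default])[0]

    return {
        "tenant": first("t"),
        "type": first("e"),
        "query": first("q"),
        "hits": int(first("h", "0")),
        "latency_ms": int(first("ms", "0")),
    }
```

Events in this shape are what the later stages (collection, storage, aggregation) would carry along.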

31

What is Flume
● Distributed data/log collection service
● Scalable, configurable, extensible
● Centrally manageable, open source

● Agents get data from app, Collectors save it
● Abstractions: Source → Decorator(s) → Sink
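
The Source → Decorator(s) → Sink abstraction can be pictured as a chain of generators. This is a conceptual sketch in plain Python, not Flume's API or configuration syntax; all function names are invented for the illustration.

```python
def source(lines):
    """Source: produces raw events, here from an in-memory list."""
    for line in lines:
        yield {"body": line}


def drop_empty(events):
    """Decorator: filters out events whose body is blank."""
    for event in events:
        if event["body"].strip():
            yield event


def add_host(events, host):
    """Decorator: annotates each event as it passes through."""
    for event in events:
        yield {**event, "host": host}


def list_sink(events, store):
    """Sink: final destination; appends events to a list and returns the count."""
    count = 0
    for event in events:
        store.append(event)
        count += 1
    return count
```

Usage: `list_sink(add_host(drop_empty(source(lines)), "web1"), store)` wires one source through two decorators into one sink.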

32

What is HBase
● Scalable, reliable, distributed, column-oriented DB
● On top of HDFS
● MapReducable

33

Data Flow, Detailed

34

Why Flume
● Reliable delivery, e.g. queue msgs locally if destination unreachable
● Easy, centralized management via Web UI or console
● Good community, good progress, now @ASF
● But: more complex, more moving parts
● On Flume: slideshare.net/cloudera/inside-flume
● Alternatives: Kafka, Scribe...

35

Why HBase
● Scalable raw & aggregate data storage
● MapReduce data input
● Fast scans for time ranges, fast key lookups
● Easy storage and compute power expansion
● Good looking roadmap, community, progress
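
Fast time-range scans and key lookups follow from HBase keeping rows sorted by row key. The toy store below imitates that property with a sorted list. The key layout (tenant, then a sortable hour string) is an assumption made for this sketch, not the schema actually used.

```python
import bisect


class SortedStore:
    """Toy stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        """Point lookup by exact key."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


def row_key(tenant, hour):
    """Hypothetical key layout: tenant, then a sortable hour such as 2011051714."""
    return f"{tenant}|{hour}"
```

Because the tenant comes first and the hour sorts lexicographically, one tenant's time range is a single contiguous scan.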

36

Open Sourcing
● 2 open-source projects:
github.com/sematext/HBaseWD
github.com/sematext/HBaseHUT
● See sematext.com/open-source/index.html

● Patches for Flume and HBase
blog.sematext.com/tag/flume/

37

Challenges
● Data size. Solutions:
● Compression (4-5x smaller with LZO)
● Data pruning (variable levels)
● Query string distribution: very long-tail
● Lots of data to process, update, aggregate
● Young tools: Flume, HBase
● Poor IO on EC2
● Hadoop distributions
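
One way to read "data pruning" against a long-tail query distribution: keep per-query counts only for queries seen often enough and fold the rest into a single bucket. The sketch below is an assumed interpretation for illustration; the threshold and function name are made up.

```python
from collections import Counter


def prune_long_tail(query_counts, min_count):
    """Keep queries seen at least min_count times; fold the rest into one bucket."""
    kept = Counter()
    tail_total = 0
    tail_distinct = 0
    for query, count in query_counts.items():
        if count >= min_count:
            kept[query] = count
        else:
            tail_total += count
            tail_distinct += 1
    # Totals are preserved: kept counts plus tail total equal the input total.
    return kept, {"distinct": tail_distinct, "total": tail_total}
```

Raising `min_count` for older time periods would give coarser pruning levels while keeping overall volume numbers exact.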

38

Output++
● AutoComplete - $MM improvement
● Better DYM Spellchecker
● Related Searches
● Recommendations
● Relevance Feedback
● ...
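
As an example of feeding analytics back into search, AutoComplete suggestions can be derived from logged queries. This is a minimal sketch under assumed inputs, `(query, hits)` pairs: queries that returned no hits are skipped so that dead ends are not suggested, and suggestions are ranked by how often the query was seen.

```python
from collections import Counter


def build_suggester(query_log, min_hits=1):
    """Build a prefix suggester from (query, hits) pairs in a search log."""
    counts = Counter(
        query.strip().lower() for query, hits in query_log if hits >= min_hits
    )

    def suggest(prefix, limit=5):
        prefix = prefix.strip().lower()
        matches = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
        # Most frequent first; ties broken alphabetically for stable output.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest
```

A linear scan is fine for a sketch; a real deployment would more likely use a trie or the search engine's own suggester.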

39

Closing the Loop

search users
search experience
search providers

40

Resource
Search Analytics for Your Site
Louis Rosenfeld

http://rosenfeldmedia.com/books/searchanalytics/

41

We're Hiring
Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig working with and in open-source?
We're hiring world-wide!
http://sematext.com/about/jobs.html

42

Contact
sematext.com
blog.sematext.com
@sematext
@otisg
otis@sematext.com

Want SA? Grab me or go to:
sematext.com/search-analytics

Hash tags: #stsa or #stanalytics
43
