Full-Text Search on Steroids: Sphinx Search
Diego Sapriza
Senior Software Engineer
PHPer
~ DevOps
uruguaSHo
Uruguay

the country of
playoffs (repechajes)
per inhabitant
@AV4TAr
PHP.meetup.uy
DevOps.meetup.uy
http://AV4TAr.com
Big Data
Project
•  Search engine
•  Millions of records
•  Relevance
•  Speed
•  Scaling
•  Faceted search
•  Tags
•  Geo-searches
•  Simplicity
Data is useless…

if I can't find anything
relevant
and fast
FAST
getting the results we need
instead of the results that merely match
our query.
Sphinx 2.2.1 beta!
What is it?

a full-text search engine
indexes databases (and XML)
designed to scale easily
Why use it?

indexing and search speed
better relevance
scalability
faceted search
geo-search
morphology
HTML stripping
…
IT FLIES!!!
Basic idea

configure the index
index
query the index
repeat
Components

indexer
searchd

database
data source

application
client
Where do I get the data from?

SQL
mysql, pgsql, mssql, odbc,…

database
data source

XMLpipes
indexer
How and where do I index the data?

stopwords, wordforms, …
periodic execution
hello, "client"
processes queries
using indexes

searchd
Sphinx API
php, python, ruby, java, c#, nodejs, haskell…

SphinxQL
mysql

SphinxSE
storage engine

application
client
Sphinx API
php, python, ruby, java, c#, nodejs, haskell…

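<?php
require('/path/to/sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('10.1.1.4', 3312);                   // searchd host and port
$cl->SetFilter('author_id', array(123));            // only documents with author_id = 123
$cl->SetSortMode(SPH_SORT_ATTR_DESC, 'post_date');  // newest post_date first
$result = $cl->Query('test', 'main delta');         // search "test" in the main and delta indexes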
SphinxQL
look ma, no database!!!

mysql_connect() to Sphinx
Sphinx SE
storage engine

SELECT *
FROM sphinx_table s
JOIN products p ON p.id = s.id
WHERE s.query = '@title iPad'
ORDER BY p.price ASC
Data flow (diagram): database / data source, indexer, searchd, application / client
Interaction (diagram): database / data source, indexer, searchd, application / client
data source

source users_index
{
    type     = mysql
    sql_user = sphinx
    sql_pass = sph.09$
    sql_db   = wby_beta
    sql_host = 127.0.0.1

    sql_query = SELECT u.id, u.id as users_id, CONCAT( u.name, ' ', \
        u.lastname ) AS name, u.profession, IF(u.gender='m',1,IF(u.gender='f',2,3)) as \
        numeric_gender, u.city, u.state, u.country, c.email FROM users u, credentials c \
        WHERE c.userHash = u.credentials_userHash AND u.temporal = 'n'

    sql_attr_uint = users_id
    sql_attr_uint = numeric_gender
}

index

index users_index
{
    source        = users_index
    path          = /wby/sphinx/data/usersindex
    docinfo       = extern
    min_word_len  = 2
    charset_type  = sbcs
    min_infix_len = 3
    enable_star   = 0
}

indexer

indexer
{
    mem_limit    = 4096MB
    max_iops     = 0
    write_buffer = 12M
    max_iosize   = 1048576
}

searchd

searchd
{
    #listen = 127.0.0.1:3312
    listen         = 0.0.0.0:3312
    log            = /wby/sphinx/searchd.log
    query_log      = /wby/sphinx/query.log
    read_timeout   = 5
    client_timeout = 300
    max_children   = 30
    pid_file       = /wby/sphinx/searchd.pid
    max_matches    = 1000
}
data sources (mysql, pgsql, odbc)

source users_src
{
    type     = mysql
    sql_user = DBUSER
    sql_pass = ******
    sql_db   = DB1
    sql_host = 127.0.0.1

    sql_query = SELECT id, nombre, edad, ciudad, \
        fecha_edit FROM users

    sql_attr_uint      = edad
    sql_attr_timestamp = fecha_edit
}
Sphinx returns
"only"
ids and attributes
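Since searchd hands back only document ids and attributes, the application usually loads the full rows from the database afterwards. Below is a minimal sketch of that step, assuming an in-memory SQLite table as a stand-in for the real database. The table contents, the `matched_ids` list and the helper `fetch_in_order` are illustrative assumptions, not part of the talk.

```python
import sqlite3

def fetch_in_order(conn, ids):
    """Load rows for the ids returned by the search engine, keeping their order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, nombre, ciudad FROM users WHERE id IN ({placeholders})",
        list(ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database returns rows in its own order; restore the relevance order.
    return [by_id[i] for i in ids if i in by_id]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, nombre TEXT, ciudad TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ana", "Montevideo"), (2, "Luis", "Salto"), (3, "Eva", "Rivera")],
)

# Hypothetical ids as they might come back from searchd, best match first.
matched_ids = [3, 1]
results = fetch_in_order(conn, matched_ids)
print(results)
```

Ids that no longer exist in the database are skipped, which can happen when the index is older than the data.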
disk-based index (source: mysql)

index users_index
{
    source       = users_src
    path         = /data/usrs_index
    min_word_len = 2
    charset_type = utf-8
}
  
disk-based index
multiple sources (mysql, odbc, pgsql)

index users_index
{
    source = users_src
    source = users_src1
    source = users_src2
    ...
}
  
distributed index
(diagram: mysql, mysql, agent, agent, xml)

index users_index_dist
{
    type  = distributed
    local = archive
    agent = srv1.net:9312:src2
    agent = srv2.net:9312:src3
}
  
real-time index

index rt_users_index
{
    type              = rt
    path              = /sph/data/rt_usersindex
    rt_field          = name
    rt_field          = city
    rt_attr_uint      = id
    rt_attr_timestamp = date_added
    rt_mem_limit      = 256MB
}
# ./indexer users_index

indexing main
# ./indexer user_timelines --rotate
Sphinx 2.0.3-release (r3043)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/sphinx/etc/sphinx.conf'...
indexing index 'user_timelines'...
collected 1.303.297 docs, 4631.5 MB
sorted 769.8 Mhits, 100.0% done
total 1.303.297 docs, 4631519329 bytes
total 1463.481 sec, 3164727 bytes/sec, 890.54 docs/sec
total 1665 reads, 62.531 sec, 1639.9 kb/call avg, 37.5 msec/call avg
total 5302 writes, 12.536 sec, 1022.3 kb/call avg, 2.3 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=22994).

~24 minutes, 4.5GB.
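The "~24 minutes" figure can be checked against the two `total` lines of the indexer output. The sketch below parses those lines and recomputes the throughput. The helper `parse_totals` is an illustrative assumption, and the parsing relies on the document count using `.` as a thousands separator, as it does in this particular log.

```python
import re

# The two summary lines, copied from the indexer output above.
INDEXER_OUTPUT = """\
total 1.303.297 docs, 4631519329 bytes
total 1463.481 sec, 3164727 bytes/sec, 890.54 docs/sec
"""

def parse_totals(output):
    """Return (docs, size_bytes, seconds) from the indexer summary lines."""
    docs_match = re.search(r"total ([\d.]+) docs, (\d+) bytes", output)
    time_match = re.search(r"total ([\d.]+) sec", output)
    # In this log the document count is printed as 1.303.297.
    docs = int(docs_match.group(1).replace(".", ""))
    size_bytes = int(docs_match.group(2))
    seconds = float(time_match.group(1))
    return docs, size_bytes, seconds

docs, size_bytes, seconds = parse_totals(INDEXER_OUTPUT)
minutes = seconds / 60
docs_per_sec = docs / seconds
bytes_per_sec = size_bytes / seconds

print(f"{docs} docs in {minutes:.1f} min, {size_bytes / 1e9:.2f} GB")
print(f"{docs_per_sec:.1f} docs/sec, {bytes_per_sec / 1e6:.2f} MB/sec")
```

The recomputed rates agree with the ones the indexer printed to within rounding.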
matching modes
•  SPH_MATCH_ALL*
•  SPH_MATCH_ANY
•  SPH_MATCH_PHRASE
•  SPH_MATCH_BOOLEAN
•  SPH_MATCH_EXTENDED
•  SPH_MATCH_FULLSCAN
extended syntax
•  AND / OR:
hola | mundo, hola & mundo

•  NOT:
hola -mundo

•  Field search:
@title hola @body mundo
extended syntax
•  By phrase:
"Hola mundo"

•  By proximity:
"Hola mundo"~10

•  Distance:
hola NEAR/10 mundo
much more
•  aaa << bbb << ccc
•  ^hello world$
•  "Chile" PARAGRAPH "Mundial"
•  @* hello
•  @!(title,body) hello world
•  @body[50] hello
"hello world" @title "example program"~5 @body python -(php|perl) @* code
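Because characters such as `-`, `|`, `@` and `"` are operators in the extended syntax, raw user input is normally escaped before it goes into a query. The sketch below shows one way to do that. The character list is assumed to match the one used by `EscapeString()` in the bundled PHP API, so check it against your Sphinx version. The helpers `escape_match` and `field_query` are hypothetical names introduced for illustration.

```python
# Characters treated as operators by the extended query syntax.
# Assumption: this mirrors the list escaped by the PHP API's EscapeString().
SPECIAL_CHARS = '\\()|-!@~"&/^$='

def escape_match(text):
    """Prefix every operator character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)

def field_query(field, user_text):
    """Build a field-restricted query such as '@title ...' from raw user input."""
    return f"@{field} {escape_match(user_text)}"

print(field_query("title", "hola -mundo"))
print(field_query("body", 'C++ "quick" (test)'))
```

When the query is sent through SphinxQL, the escaped text still has to be quoted as a SQL string literal by the client library.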
cta1sfter:/srv/sphinx/bin# mysql -P9306 --protocol=tcp --prompt='sphinxQL> '

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.3-release (r3043)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

sphinxQL> SELECT * from user_timelines WHERE MATCH ('superbowl');
+-----------+--------+------------+-----------+----------+------------+
| id        | weight | twitter_id | tweets_id | link_id  |  created   |
+-----------+--------+------------+-----------+----------+------------+
| 109531197 |   4675 |   24488771 |  57371370 | 35471785 | 1359858567 |
| 109492540 |   4673 |   56690354 |  57351558 | 35459063 | 1359843568 |
| 109493484 |   4673 |   24488771 |  57351953 | 35459063 | 1359843239 |
| 109496715 |   4673 |   24488771 |  57353282 | 35459063 | 1359843352 |
| 109496743 |   4673 |   24488771 |  57353292 | 35459063 | 1359843241 |
| 109496779 |   4673 |   24488771 |  57353305 | 35459063 | 1359842932 |
...
+-----------+--------+------------+-----------+----------+------------+
20 rows in set (0.04 sec)
How do I keep the indexes
up to date?
Especially the big ones!!!
the DELTA, you
must use.
merge
Watch out for disk space!!!
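The main + delta idea can be sketched without Sphinx at all: a counter table remembers the highest id covered by the last full build, and the delta picks up only the rows added since. The example below simulates this with an in-memory SQLite database. The `documents` table and the functions `build_main` and `build_delta` are illustrative assumptions, and the lists of ids stand in for what the indexer would actually index.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE sph_counter (counter_id TEXT PRIMARY KEY, max_doc_id INTEGER);
INSERT INTO documents (body) VALUES ('uno'), ('dos'), ('tres');
""")

def build_main(db):
    """Full rebuild: record the highest id, then take everything up to it."""
    db.execute(
        "REPLACE INTO sph_counter "
        "SELECT 'documents', COALESCE(MAX(id), 0) FROM documents"
    )
    return [r[0] for r in db.execute(
        "SELECT id FROM documents WHERE id <= "
        "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 'documents') "
        "ORDER BY id"
    )]

def build_delta(db):
    """Cheap rebuild: take only the rows added after the last full build."""
    return [r[0] for r in db.execute(
        "SELECT id FROM documents WHERE id > "
        "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 'documents') "
        "ORDER BY id"
    )]

main_ids = build_main(db)          # slow, run rarely
db.execute("INSERT INTO documents (body) VALUES ('cuatro'), ('cinco')")
delta_ids = build_delta(db)        # fast, run often
print(main_ids, delta_ids)
```

Queries then search both indexes together. Sphinx's indexer also offers a `--merge` option for folding a delta into its main index; check the documentation of your version for the exact invocation and its disk-space requirements.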
Geodistance
mysql> SELECT *,
    CONTAINS(GEOPOLY2D(40.95164274496,-76.88583678218,
                       41.188446201688,-73.203723511772,
                       39.900666261352,-74.171833538046,
                       40.059260979044,-76.301076056469),
             latitude_deg,longitude_deg) AS inside
    FROM geodemo WHERE inside=1 LIMIT 0,100;
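The query above is a polygon containment test: it keeps rows whose (latitude_deg, longitude_deg) point falls inside the four-vertex polygon. The sketch below illustrates the idea with a plain ray-casting test on the same vertices. It treats the coordinates as planar, which is only an approximation and does not reproduce how Sphinx accounts for Earth's curvature. The function `contains` and the sample points are assumptions made for illustration.

```python
# Polygon vertices as (latitude, longitude) pairs, copied from the query above.
POLYGON = [
    (40.95164274496, -76.88583678218),
    (41.188446201688, -73.203723511772),
    (39.900666261352, -74.171833538046),
    (40.059260979044, -76.301076056469),
]

def contains(polygon, lat, lon):
    """Planar ray-casting test: count polygon edges crossed by a ray from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's longitude?
        if (lon1 > lon) != (lon2 > lon):
            # Latitude at which the edge crosses that longitude.
            cross_lat = lat1 + (lon - lon1) * (lat2 - lat1) / (lon2 - lon1)
            if lat < cross_lat:
                inside = not inside
    return inside

print(contains(POLYGON, 40.5, -75.0))  # a point near the middle of the polygon
print(contains(POLYGON, 45.0, -75.0))  # a point well to the north
```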
TIP: sphinx.conf.php
#!/usr/bin/php
<?php for ($i=1; $i<=4; $i++): ?>
source chunk<?= $i ?>
{
    sql_host = localhost
    sql_user = sphinx_usr
    sql_pass = ****
    sql_db   = dbchunk<?=$i?>
    . . .
}
<?php endfor; // end source loop ?>
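The same trick works with any scripting language that can print the configuration. Below is a rough Python equivalent of the loop above, offered as a sketch. The template and the function `render_sources` are illustrative assumptions, and the settings elided on the slide stay elided here.

```python
# One source block per database chunk; {n} is replaced by the chunk number.
SOURCE_TEMPLATE = """\
source chunk{n}
{{
    sql_host = localhost
    sql_user = sphinx_usr
    sql_pass = ****
    sql_db   = dbchunk{n}
    # ... remaining settings elided, as on the slide
}}
"""

def render_sources(count):
    """Return the text of `count` numbered source blocks."""
    return "\n".join(SOURCE_TEMPLATE.format(n=i) for i in range(1, count + 1))

config_text = render_sources(4)
print(config_text)
```

Generating the repeated blocks keeps the four chunk definitions identical apart from the chunk number.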
facts
•  standalone
•  multiple databases
•  does not update the indexes on its own
•  Sphinx only returns ids
•  high disk usage
•  easy to integrate

•  ordering by relevance
•  exact search / boolean search...
•  APIs in several languages
•  implements the MySQL protocol
•  easy to scale
Questions?
@AV4TAr
http://AV4TAr.com
Thanks, see you at...
cta1sfter:/srv/sphinx/bin# mysql -P9306 --protocol=tcp --prompt='sphinxQL> '

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.3-release (r3043)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

sphinxQL> SELECT * from user_timelines WHERE MATCH ('superbowl');
+-----------+--------+------------+-----------+----------+--------+-----------+---------------+
| id        | weight | twitter_id | tweets_id | link_id  | tld_id | extracted | created_stamp |
+-----------+--------+------------+-----------+----------+--------+-----------+---------------+
| 109531197 |   4675 |   24488771 |  57371370 | 35471785 | 132427 |         1 |    1359858567 |
| 109492540 |   4673 |   56690354 |  57351558 | 35459063 |    685 |         1 |    1359843568 |
| 109493484 |   4673 |   24488771 |  57351953 | 35459063 |    685 |         1 |    1359843239 |
| 109496715 |   4673 |   24488771 |  57353282 | 35459063 |    685 |         1 |    1359843352 |
| 109496743 |   4673 |   24488771 |  57353292 | 35459063 |    685 |         1 |    1359843241 |
| 109496779 |   4673 |   24488771 |  57353305 | 35459063 |    685 |         1 |    1359842932 |
...
+-----------+--------+------------+-----------+----------+--------+-----------+---------------+
20 rows in set (0.04 sec)

sphinxQL> show meta;
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| total         | 1000      |
| total_found   | 6302      |
| time          | 0.034     |
| keyword[0]    | superbowl |
| docs[0]       | 6302      |
| hits[0]       | 12189     |
+---------------+-----------+
6 rows in set (0.00 sec)
source user_timelines : base
{
    sql_query_pre = SELECT @tt_id:=id FROM `tweets_timelines` WHERE `created` <= \
        DATE_SUB(CURDATE(),INTERVAL 8 DAY) ORDER BY created DESC LIMIT 1

    sql_query_pre = REPLACE INTO sph_counter SET counter_id = "user_timelines", modif=NOW(), \
        max_doc_id = ( SELECT MAX(id) max FROM tweets_timelines), last_doc_id = max_doc_id

    sql_query = SELECT tt.id, tt.twitter_id, tt.tweets_id, lm.id AS link_id, lm.expanded_link, \
        lm.title, lm.description, lm.body, lm.tld_id, lm.extracted, UNIX_TIMESTAMP(tt.created) AS \
        created_stamp FROM links_metadata lm, tweets_timelines tt WHERE tt.id >= @tt_id AND lm.extracted = 1 \
        AND tt.links_id = lm.id AND tt.id <= (SELECT max_doc_id FROM sph_counter WHERE \
        counter_id="user_timelines")

    sql_attr_uint      = twitter_id
    sql_attr_uint      = tweets_id
    sql_attr_uint      = link_id
    sql_attr_uint      = tld_id
    sql_attr_timestamp = created_stamp
    sql_attr_uint      = extracted
}

index user_timelines
{
    source               = user_timelines
    html_strip           = 1
    html_remove_elements = a, img
    path                 = /sphinx/data/user_timelines_index
    docinfo              = extern
    charset_type         = utf-8
}
source delta_user_timelines : user_timelines
{
    sql_query_pre = SET NAMES utf8

    sql_query_pre = SELECT @tt_id:=id FROM `tweets_timelines` WHERE `created` <= \
        DATE_SUB(CURDATE(),INTERVAL 8 DAY) ORDER BY created DESC LIMIT 1

    sql_query_pre = SELECT @max:=max(tt.id) FROM links_metadata lm, tweets_timelines tt \
        WHERE lm.extracted = 1 AND tt.links_id = lm.id

    sql_query = SELECT tt.id, tt.twitter_id, tt.tweets_id, lm.id AS link_id, lm.expanded_link, \
        lm.title, lm.description, lm.body, lm.tld_id, lm.extracted, \
        UNIX_TIMESTAMP(tt.created) AS created_stamp \
        FROM links_metadata lm, tweets_timelines tt \
        WHERE tt.id >= @tt_id AND lm.extracted = 1 AND tt.links_id = lm.id AND \
        tt.id>( SELECT max_doc_id FROM sph_counter WHERE counter_id="user_timelines" )

    sql_query_post = UPDATE sph_counter SET last_doc_id=@max WHERE counter_id="user_timelines"
}

index delta_user_timelines : user_timelines
{
    source               = delta_user_timelines
    html_strip           = 1
    html_remove_elements = a, img
    path                 = /sphinx/data/delta_user_timelines_index
    docinfo              = extern
    charset_type         = utf-8
}
Links
•  http://sphinxsearch.com/docs/current.html
•  http://AV4TAr.com
•  http://bit.ly/sphinx-autosuggest
•  http://bit.ly/sphinx-query-builder
•  http://bit.ly/sphinx-zfconf-011
•  http://bit.ly/sphinx-high-performance
Búsquedas Full Text con esteroides - Sphinx Search

  • 2. Diego Sapriza Senior Soft. Engineer PHPer ~ DevOps
  • 4. Uruguay el país de los repechajes
  • 6.
  • 7.
  • 8. @AV4TAr .   .uy   PHP.meetup.uy   DevOps.meetup.uy   http://AV4TAr.com
  • 10. Proyecto •  Buscador •  Millones de registros • Relevancia •  Velocidad •  Escalamiento •  Búsqueda Facetada •  Tags •  Geo-búsquedas •  Simplicidad.-
  • 11. Los datos NO sirven… si no puedo encontrar nada relevante y rápido
  • 13. obtener los resultados que necesitamos en vez de los resultados que solo coinciden con nuestra consulta.
  • 15. ¿Qué es? motor de búsqueda Full Text indexa Bases de Datos (y xmls) diseñado para escalar fácilmente
  • 16. ¿Porqué usarlo? velocidad de indexación y búsqueda mejor relevancia escalabilidad búsquedas Facetadas geo-búsqueda morfología HTML Stripping …
  • 20. ¿De dónde saco los datos? SQL mysql, pgsql, mssql, odbc,… base de datos orígen de datos XMLpipes
  • 21. indexer ¿Cómo y dónde indexo los datos? stopwords, wordforms, … ejecución períodica
  • 23. Sphinx API php, python, ruby, java, c#, nodejs, haskell… SphinxQL mysql SphinxSE storage engine aplicación cliente
  • 24. Sphinx API php, python, ruby, java, c#, nodejs, haskell… <?php! require('/path/to/sphinxapi.php');! $cl = new SphinxClient();! $cl->SetServer('10.1.1.4', 3312);! $cl->SetFilter('author_id', array (123));! $cl->SetSortMode(SPH_SORT_ATTR_DESC, 'post_date');! $cl->Query('test', 'main delta');!
  • 25. SphinxQL mirá mamá sin Base de Datos!!! mysql_connect() a sphinx!
  • 26. Sphinx SE storage engine SELECT * ! FROM sphinx_table s! JOIN products p ON p.id = s.id! WHERE s.query = ‘@title iPad’! ORDER BY p.price ASC!
  • 29. source users_index! {! !type = mysql! !sql_user = sphinx! !sql_pass = sph.09$! !sql_db = wby_beta! !sql_host = 127.0.0.1! ! !sql_query = SELECT u.id, u.id as users_id, CONCAT( u.name, ' ', u.lastname ) AS name, u.profession, IF(u.gender='m',1,IF(u.gender='f',2,3)) as numeric_gender, u.city, u.state, u.country, c.email FROM users u, credentials c WHERE c.userHash = u.credentials_userHash AND u.temporal = 'n'! ! !sql_attr_uint = users_id! !sql_attr_uint = numeric_gender! }! ! index users_index! {! !source = users_index! !path = /wby/sphinx/data/usersindex! !docinfo = extern! !min_word_len = 2! !charset_type = sbcs! !min_infix_len = 3! !enable_star = 0! }! ! data source índice indexer! {! !mem_limit!= 4096MB! !max_iops != 0! !write_buffer != 12M! !max_iosize != 1048576! ! }! ! searchd! {! !#listen = 127.0.0.1:3312! !listen = 0.0.0.0:3312! !log ! ! != /wby/sphinx/searchd.log! !query_log = /wby/sphinx/query.log! !read_timeout = 5! !client_timeout = 300! !max_children = 30! !pid_file = /wby/sphinx/searchd.pid! !max_matches = 1000 !! }! ! indexer searchd
  • 30. data sources source users_src! {! !type = mysql! !sql_user !sql_pass !sql_db !sql_host = = = = DBUSER! ******! DB1! 127.0.0.1! pgsql   odbc   mysql   ! !sql_query = ! id, nombre, edad, ciudad, ! fecha_edit FROM users! SELECT ! = edad! !sql_attr_timestamp = fecha_edit! !sql_attr_uint }!
  • 32. índice disk-based index users_index! {! !source = users_src! !path = /data/usrs_index! !min_word_len = 2! !charset_type = utf-8! }! mysql  
  • 33. índice disk-based index users_index! {! source = users_src! source = users_src1! source = users_src2! !! !...! multiples orígenes mysql   odbc   pgsql  
  • 34. índice Distribuído index users_index_dist! {! type = distributed! local = archive! agent = srv1.net:9312:src2! agent = srv2.net:9312:src3! }! mysql   mysql   agent   agent   xml  
  • 35. índice Real Time index rt_users_index! {! !type = rt! !path = /sph/data/rt_usersindex! !rt_field = name! !rt_field = city! !rt_attr_uint = id! !rt_attr_timestamp = date_added! !rt_mem_limit = 256MB! }!
  • 37. indexar main # ./indexer user_timelines --rotate Sphinx 2.0.3-release (r3043) Copyright (c) 2001-2011, Andrew Aksyonoff Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com) using config file '/sphinx/etc/sphinx.conf'... indexing index 'user_timelines'... collected 1.303.297 docs, 4631.5 MB sorted 769.8 Mhits, 100.0% done total 1.303.297 docs, 4631519329 bytes total 1463.481 sec, 3164727 bytes/sec, 890.54 docs/sec total 1665 reads, 62.531 sec, 1639.9 kb/call avg, 37.5 msec/call avg total 5302 writes, 12.536 sec, 1022.3 kb/call avg, 2.3 msec/call avg rotating indices: succesfully sent SIGHUP to searchd (pid=22994). ~24 minutos, 4.5GB.
  • 39. extended sintaxis •  y / o: hola | mundo, hola & mundo! •  No: hola –mundo! •  Búsqueda por campo: @title hola @body mundo!
  • 40. extended sintaxis •  x Frase: “Hola mundo”! •  x Proximidad: “Hola mundo”~10! •  Distancia: hola NEAR/10 mundo!
  • 41. mucho más •  •  •  •  •  •  aaa << bbb << ccc! ^hello world$! ”Chile" PARAGRAPH ”Mundial”! @* hello! @!(title,body) hello world! @body[50] hello!
  • 42. "hello world" @title "example program"~5 @body python -(php|perl) @* code!
  • 43. cta1sfter:/srv/sphinx/bin#  mysql  -­‐P9306  -­‐-­‐protocol=tcp  -­‐-­‐prompt='sphinxQL>  ’     Welcome  to  the  MySQL  monitor.    Commands  end  with  ;  or  g.   Your  MySQL  connection  id  is  1   Server  version:  2.0.3-­‐release  (r3043)     Type  'help;'  or  'h'  for  help.  Type  'c'  to  clear  the  buffer.     sphinxQL>  SELECT  *  from  user_timelines  WHERE  MATCH  ('superbowl');   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  id                |  weight  |  twitter_id  |  tweets_id  |  link_id    |    created      |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  109531197  |      4675  |      24488771  |    57371370  |  35471785  |  1359858567  |     |  109492540  |      4673  |      56690354  |    57351558  |  35459063  |  1359843568  |     |  109493484  |      4673  |      24488771  |    57351953  |  35459063  |  1359843239  |     |  109496715  |      4673  |      24488771  |    57353282  |  35459063  |  1359843352  |     |  109496743  |      4673  |      24488771  |    57353292  |  35459063  |  1359843241  |     |  109496779  |      4673  |      24488771  |    57353305  |  35459063  |  1359842932  |     ...   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   20  rows  in  set  (0.04  sec)  
  • 44. ¿Cómo mantengo los índices actualizados? Sobre todos los grandes!!!
  • 46.
  • 47. merge
  • 48. Cuidado con el espacio en disco!!!
  • 51. TIP: shpinx.conf.php #!/usr/bin/php <?php for ($i=1; $i<=4; $i++): ?> source chunk<?= $i ?> { sql_host = localhost sql_user = sphinx_usr sql_pass = **** sql_db = dbchunk<?=$i?> . . . } <?php endfor; // end source loop ?>
  • 52. facts •  •  •  •  •  •  standalone múltiples BDS no actualiza los índices solo sphinx solo devuelve ids Gran consumo de disco Fácil de integrar •  órden por relevancia •  exact search / boolean search... •  API en varios lenguajes •  implementa protocolo MySQL •  Fácil de escalar
  • 55. cta1sfter:/srv/sphinx/bin# mysql -P9306 --protocol=tcp --prompt='sphinxQL> '
        Welcome to the MySQL monitor.  Commands end with ; or \g.
        Your MySQL connection id is 1
        Server version: 2.0.3-release (r3043)

        Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

        sphinxQL> SELECT * from user_timelines WHERE MATCH ('superbowl');
        +-----------+--------+------------+-----------+----------+--------+-----------+---------------+
        | id        | weight | twitter_id | tweets_id | link_id  | tld_id | extracted | created_stamp |
        +-----------+--------+------------+-----------+----------+--------+-----------+---------------+
        | 109531197 |   4675 |   24488771 |  57371370 | 35471785 | 132427 |         1 |    1359858567 |
        | 109492540 |   4673 |   56690354 |  57351558 | 35459063 |    685 |         1 |    1359843568 |
        | 109493484 |   4673 |   24488771 |  57351953 | 35459063 |    685 |         1 |    1359843239 |
        | 109496715 |   4673 |   24488771 |  57353282 | 35459063 |    685 |         1 |    1359843352 |
        | 109496743 |   4673 |   24488771 |  57353292 | 35459063 |    685 |         1 |    1359843241 |
        | 109496779 |   4673 |   24488771 |  57353305 | 35459063 |    685 |         1 |    1359842932 |
        ...
        +-----------+--------+------------+-----------+----------+--------+-----------+---------------+
        20 rows in set (0.04 sec)

        sphinxQL> show meta;
        +---------------+-----------+
        | Variable_name | Value     |
        +---------------+-----------+
        | total         | 1000      |
        | total_found   | 6302      |
        | time          | 0.034     |
        | keyword[0]    | superbowl |
        | docs[0]       | 6302      |
        | hits[0]       | 12189     |
        +---------------+-----------+
        6 rows in set (0.00 sec)
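One detail in the SHOW META output is easy to miss: total (1000) is lower than total_found (6302). In Sphinx, total_found counts every matching document, while total is how many of them can actually be retrieved, capped by the max_matches setting (1000 by default). A minimal Python sketch of what that means for pagination; the rows are copied from the slide, the helper name `page_counts` is hypothetical, and the page size of 20 matches the "20 rows in set" above:

```python
import math

# (name, value) rows exactly as SHOW META printed them on the slide.
meta_rows = [
    ("total", "1000"),
    ("total_found", "6302"),
    ("time", "0.034"),
    ("keyword[0]", "superbowl"),
    ("docs[0]", "6302"),
    ("hits[0]", "12189"),
]

def page_counts(rows, page_size):
    """Return (pages that can be fetched, pages of matching documents)."""
    meta = dict(rows)
    reachable = math.ceil(int(meta["total"]) / page_size)
    matching = math.ceil(int(meta["total_found"]) / page_size)
    return reachable, matching

reachable, matching = page_counts(meta_rows, page_size=20)
print(f"{reachable} pages can be fetched out of {matching} pages of matches")
```

A pager built on total_found alone would offer links to pages that searchd will not return unless max_matches is raised.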
  • 56. source user_timelines : base
        {
            # Lower bound of the indexed window: newest row that is at least 8 days old
            sql_query_pre = SELECT @tt_id:=id FROM `tweets_timelines` \
                WHERE `created` <= DATE_SUB(CURDATE(),INTERVAL 8 DAY) \
                ORDER BY created DESC LIMIT 1

            # Record the highest id this (main) index will cover
            sql_query_pre = REPLACE INTO sph_counter SET counter_id = "user_timelines", modif=NOW(), \
                max_doc_id = ( SELECT MAX(id) max FROM tweets_timelines), last_doc_id = max_doc_id

            sql_query = SELECT tt.id, tt.twitter_id, tt.tweets_id, lm.id AS link_id, lm.expanded_link, \
                lm.title, lm.description, lm.body, lm.tld_id, lm.extracted, \
                UNIX_TIMESTAMP(tt.created) AS created_stamp \
                FROM links_metadata lm, tweets_timelines tt \
                WHERE tt.id >= @tt_id AND lm.extracted = 1 AND tt.links_id = lm.id \
                AND tt.id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id="user_timelines")

            sql_attr_uint      = twitter_id
            sql_attr_uint      = tweets_id
            sql_attr_uint      = link_id
            sql_attr_uint      = tld_id
            sql_attr_timestamp = created_stamp
            sql_attr_uint      = extracted
        }

        index user_timelines
        {
            source               = user_timelines
            html_strip           = 1
            html_remove_elements = a, img
            path                 = /sphinx/data/user_timelines_index
            docinfo              = extern
            charset_type         = utf-8
        }
  • 57. source delta_user_timelines : user_timelines{
            sql_query_pre = SET NAMES utf8

            sql_query_pre = SELECT @tt_id:=id FROM `tweets_timelines` \
                WHERE `created` <= DATE_SUB(CURDATE(),INTERVAL 8 DAY) \
                ORDER BY created DESC LIMIT 1

            # Highest extracted id available now; saved as last_doc_id by sql_query_post
            sql_query_pre = SELECT @max:=max(tt.id) FROM links_metadata lm, tweets_timelines tt \
                WHERE lm.extracted = 1 AND tt.links_id = lm.id

            # Only rows newer than what the main index already covers
            sql_query = SELECT tt.id, tt.twitter_id, tt.tweets_id, lm.id AS link_id, lm.expanded_link, \
                lm.title, lm.description, lm.body, lm.tld_id, lm.extracted, \
                UNIX_TIMESTAMP(tt.created) AS created_stamp \
                FROM links_metadata lm, tweets_timelines tt \
                WHERE tt.id >= @tt_id AND lm.extracted = 1 AND tt.links_id = lm.id \
                AND tt.id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id="user_timelines" )

            sql_query_post = UPDATE sph_counter SET last_doc_id=@max WHERE counter_id="user_timelines"
        }

        index delta_user_timelines : user_timelines{
            source               = delta_user_timelines
            html_strip           = 1
            html_remove_elements = a, img
            path                 = /sphinx/data/delta_user_timelines_index
            docinfo              = extern
            charset_type         = utf-8
        }
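The split between the two sources comes down to one stored number, max_doc_id: the main source indexes ids up to it, the delta source indexes ids above it. A minimal Python simulation of that bookkeeping, assuming plain integer ids and ignoring the 8-day window and the `extracted = 1` filter; `build_main`, `build_delta` and the `counter` dict (a stand-in for the sph_counter table) are hypothetical names:

```python
def build_main(doc_ids, counter):
    """Mimic the main source: store the current max id, index everything up to it."""
    counter["max_doc_id"] = max(doc_ids, default=0)
    return {i for i in doc_ids if i <= counter["max_doc_id"]}

def build_delta(doc_ids, counter):
    """Mimic the delta source: index only ids above the stored max_doc_id."""
    return {i for i in doc_ids if i > counter["max_doc_id"]}

counter = {}                        # stand-in for the sph_counter row
table = [1, 2, 3, 4, 5]             # ids present when the main index is built
main = build_main(table, counter)

table += [6, 7]                     # new rows arrive afterwards
delta = build_delta(table, counter)

print("main:", sorted(main), "delta:", sorted(delta))
```

Because the two id ranges never overlap, a search over both indexes together (as in the `'main delta'` index list of the earlier Sphinx API example) sees each document once, and only the small delta has to be rebuilt frequently.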