@nicolas_frankel
Adding search to a legacy application
without hassle
2022
Thanks to our sponsors
Me, myself and I
 Developer
 Developer advocate
 Interested in system
architecture
David Pilato
The problem with search
“All applications evolve to the
point that they need to add a
search capability”
-- Anonymous
First decision step
 Do you keep your SQL
database?
• Queries using LIKE
• Dedicated search database
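If you keep the SQL database, search boils down to LIKE queries. A minimal sketch of that approach, using an in-memory SQLite database (table and column names are made up for illustration):

```python
import sqlite3

# Illustrative products table in an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO products (name) VALUES (?)",
                 [("Apple iPhone",), ("Pineapple slicer",), ("Pear juice",)])

# LIKE does plain substring matching: no relevance ranking, no typo
# tolerance, and a leading wildcard usually forces a full table scan.
rows = conn.execute(
    "SELECT name FROM products WHERE name LIKE ?", ("%apple%",)
).fetchall()
print([r[0] for r in rows])  # ['Apple iPhone', 'Pineapple slicer']
```

The substring match happily returns "Pineapple slicer" for "apple" while offering no ranking at all, which is exactly why a dedicated search database becomes tempting.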
Second decision step
 Choosing the search engine
datastore
• Elasticsearch
• Solr
• Algolia
• Cloud-based search engines
Third decision step
 Choosing the “right” architecture
• Index data in application code
• Index data via ORM framework
• Custom
• Hibernate Search
• Index data from database via a
batch job
• Something else?
The problem of dual writes
 No transaction between the
database and Elasticsearch
 Which operation should we
execute first:
1. Update database and hope
Elasticsearch succeeds?
2. The other way around?
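The failure mode can be simulated in a few lines: two independent stores, no transaction spanning them, and a write that fails halfway through (all names here are made up for illustration):

```python
# Minimal simulation of the dual-write problem: the database and the
# search engine are independent stores with no shared transaction.
db, es = {}, {}

def save(entity_id, data, es_fails=False):
    db[entity_id] = data          # step 1: database write commits
    if es_fails:
        raise ConnectionError("Elasticsearch unavailable")
    es[entity_id] = data          # step 2: index write may never happen

try:
    save(1, {"name": "pear"}, es_fails=True)
except ConnectionError:
    pass  # the caller may retry, but the database write already committed

# The two stores have silently diverged:
print(1 in db, 1 in es)  # True False
```

Reversing the order only mirrors the problem: then a failed database write leaves a ghost document in the index instead.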
https://imgflip.com/i/5vm482
Source data from the database to Elasticsearch
The problem with batches
[Sequence diagram: a scheduled Job runs SELECT * against the Database and INDEXes the rows into Elasticsearch; a row INSERTed after the SELECT is NOT FOUND in Elasticsearch until the next run, when it is finally FOUND]
The problem with batches
 Run too frequently
• Waste of resources
 Run not frequently enough
• Data not searchable
Change-Data-Capture
“In databases, Change Data Capture
is a set of software design patterns
used to determine and track the
data that has changed so that action
can be taken using the changed data.
CDC is an approach to data
integration that is based on the
identification, capture and delivery
of the changes made to enterprise
data sources.”
-- https://en.wikipedia.org/wiki/Change_data_capture
CDC implementation options
1. Polling + Timestamps on rows
2. Polling + Version numbers on
rows
3. Polling + Status indicators on
rows
4. Triggers on tables
5. Log scanners
-- https://en.wikipedia.org/wiki/Change_data_capture
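Option 1 above, polling with a timestamp column, can be sketched as follows (SQLite in-memory; table, column, and timestamp values are illustrative):

```python
import sqlite3

# CDC via polling + timestamps: each row carries an updated_at column,
# and every poll fetches only rows modified since the last high-water mark.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "apple", "2022-01-01T10:00:00"),
                  (2, "pear",  "2022-01-01T11:00:00")])

last_seen = "2022-01-01T10:30:00"  # high-water mark from the previous poll

changed = conn.execute(
    "SELECT id, name FROM products WHERE updated_at > ? ORDER BY updated_at",
    (last_seen,)
).fetchall()
print(changed)  # [(2, 'pear')]
```

Note the inherent weaknesses of this option: hard DELETEs leave no row to poll, and the approach trusts application code (and clocks) to maintain the timestamp correctly, which is one reason log scanners are the more robust choice.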
What is a transaction/binary/etc. log?
“The binary log contains ‘events’
that describe database changes
such as table creation
operations or changes to table
data.”
-- https://dev.mysql.com/doc/refman/8.0/en/binary-log.html
What if we “hacked” the log?
Sample MySQL binlog
### UPDATE `test`.`t`
### WHERE
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2='apple' /* VARSTRING(20) meta=20 nullable=0 is_null=0 */
### @3=NULL /* VARSTRING(20) meta=0 nullable=1 is_null=1 */
### SET
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2='pear' /* VARSTRING(20) meta=20 nullable=0 is_null=0 */
### @3='2009:01:01' /* DATE meta=0 nullable=1 is_null=0 */
# at 569
#150112 21:40:14 server id 1  end_log_pos 617 CRC32 0xf134ad89  Table_map: `test`.`t` mapped to number 251
# at 617
#150112 21:40:14 server id 1  end_log_pos 665 CRC32 0x87047106  Delete_rows: table id 251 flags: STMT_END_F
But…
 Implementation-dependent
 Fragile
 Who maintains/debugs it?
Debezium to the rescue
 Java-based abstraction layer
for CDC
 Provided by Red Hat
 Apache v2 licensed
Debezium
“Debezium records all row-level
changes within each database
table in a change event stream”
-- https://debezium.io/
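As an illustration, registering a Debezium MySQL connector with Kafka Connect looks roughly like the payload below. Hostnames, credentials, and database names are made up; the property names follow the Debezium 1.x MySQL connector, so check the version you deploy:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "changeme",
    "database.server.id": "184054",
    "database.server.name": "app",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```

Once registered, the connector reads the binlog and publishes one change-event topic per captured table.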
Debezium connector plugins
 Production-ready
• MongoDB
• MySQL
• PostgreSQL
• SQL Server
• DB2 (!)
• Oracle
 Incubating
• Cassandra
• Vitess
Stream instead of batch
 Propagate changes as soon
as they occur
 Still eventual consistency
• Because data is duplicated
across distributed systems!
 But the gap is reduced to the
smallest possible window
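The consumer side of such a stream is mostly a translation step. A sketch of turning a (simplified) Debezium change event into an Elasticsearch bulk action; the event shape follows Debezium's envelope ("op", "before", "after"), while the index name and id field are made up:

```python
import json

def to_bulk_action(event, index="products"):
    """Translate a simplified Debezium change event into an
    Elasticsearch bulk action (op: c=create, u=update, d=delete,
    r=snapshot read)."""
    op = event["op"]
    if op in ("c", "u", "r"):
        doc = event["after"]
        return [{"index": {"_index": index, "_id": doc["id"]}}, doc]
    if op == "d":
        return [{"delete": {"_index": index, "_id": event["before"]["id"]}}]
    return []  # ignore other event types (e.g. schema changes)

event = {"op": "u",
         "before": {"id": 1, "name": "apple"},
         "after":  {"id": 1, "name": "pear"}}
print(json.dumps(to_bulk_action(event)))
```

Because events arrive as soon as the binlog records them, the index lags the database only by the pipeline latency instead of by a batch interval.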
Thanks for your attention!
 https://blog.frankel.ch/
 @nicolas_frankel
 https://bit.ly/legacy-search
 https://apisix.apache.org/

SnowCamp - Adding search to a legacy application

Editor's Notes

  • #13
    @startuml
    database "\nDatabase\n" as db
    database "\nElasticsearch\n" as es
    interface API as api
    interface SQL as sql
    component " Batch " as batch << job >> << <&clock> scheduled >>
    db -up-() sql
    es -up-() api
    batch .left.> sql: read
    batch .right.> api: write
    @enduml