This document discusses strategies for importing data from external systems into PostgreSQL and Elixir applications. It describes using the file_fdw foreign data wrapper to treat CSV files as database tables, allowing the use of SQL to synchronize and import only changed data. Sample Elixir code shows how to implement periodic synchronization by querying for differences and importing batches of records. The strategy is 30x faster than previous queue-based approaches and leverages the database's capabilities rather than treating it as simple storage.
20. Create an extension!
defmodule MyImporter.Repo.Migrations.AddFileFdwExtension do
  use Ecto.Migration

  def up do
    execute("CREATE EXTENSION file_fdw;")
  end

  def down do
    execute("DROP EXTENSION file_fdw;")
  end
end
21. Create a virtual file server!
defmodule MyImporter.Repo.Migrations.AddForeignFileServer do
  use Ecto.Migration

  @server_name "files"

  def up do
    execute("CREATE SERVER #{@server_name} FOREIGN DATA WRAPPER file_fdw;")
  end

  def down do
    execute("DROP SERVER #{@server_name};")
  end
end
22. Create one Table per file
defmodule MyImporter.Repo.Migrations.AddForeignCompaniesTable do
  use Ecto.Migration

  @up ~s"""
  CREATE FOREIGN TABLE companies (
    company_id text,
    name text
  ) SERVER files
  OPTIONS (filename '/files/Company.csv', format 'csv');
  """

  def change do
    # execute/2 takes the "up" statement and the "down" statement
    execute(@up, "DROP FOREIGN TABLE companies;")
  end
end
27. Give me some options!
defmodule MyImporter.Repo.Migrations.AddForeignCompaniesTable do
  use Ecto.Migration

  @up ~s"""
  CREATE FOREIGN TABLE companies (
    company_id text,
    name text
  ) SERVER files
  OPTIONS (filename '/files/Company.csv', format 'csv', header 'on',
           delimiter '|');
  """

  def change do
    execute(@up, "DROP FOREIGN TABLE companies;")
  end
end
29. Give me some (more) options!
defmodule MyImporter.Repo.Migrations.AddForeignCompaniesTable do
  use Ecto.Migration

  @up ~s"""
  CREATE FOREIGN TABLE companies (
    company_id text,
    name text
  ) SERVER files
  OPTIONS (filename '/files/Company.csv',
           format 'csv', header 'on', delimiter '|', quote E'\x01');
  """

  def change do
    execute(@up, "DROP FOREIGN TABLE companies;")
  end
end
31. Selecting from CSV Files directly
> SELECT name, company_id FROM companies
name              company_id
ACME Corp.        11012
Wonka Industries  22133
Stark Industries  55251
[...]             [...]
32. Using PostgreSQL’s built in functions
> SELECT TRIM(name), company_id, MD5(CONCAT(TRIM(name), company_id)) AS hash
  FROM companies

name              company_id  hash
ACME Corp.        11012       b4fa5d3e03248e285c6cc57ac4f4862e
Wonka Industries  22133       9256bbfa403aee8a35bf3bb4c08f3500
Stark Industries  55251       91c113bef46e20ab167a8d4633bc0901
33. External CSV as table
name              company_id  hash
ACME Corp.1       11012       b4fa5dge03238e285c6cc57ac4f3822e
Wonka Industries  22133       9256bbfa403aee8a35bf3bb4c08f3500
Stark Industries  55251       91c113bef46e20ab167a8d4633bc0901

Internal
name              company_id  hash
ACME Corp.        11012       b4fa5d3e0348e285c6cc57ac4f4862e2
Wonka Industries  22133       9256bbfa403aee8a35bf3bb4c08f3500
Stark Industries  55251       91c113bef46e20ab167a8d4633bc0901

LEFT JOIN
name         company_id
ACME Corp.1  11012
34. Using JOINs
> SELECT external.company_id, external.name
  FROM companies external
  LEFT JOIN imported_companies imported
    ON MD5(CONCAT(external.company_id, TRIM(external.name)))
     = MD5(CONCAT(imported.external_id, TRIM(imported.name)))
  WHERE imported.external_id IS NULL

name         company_id
ACME Corp.1  11012
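To make the join condition easier to follow, here is a minimal in-memory sketch of the same idea in plain Elixir: hash the compared fields on both sides and keep only the external rows whose hash is unknown. The module `DiffSketch`, its function names, and the sample maps are hypothetical and exist only for illustration; in the approach described in these slides the database does this work.

```elixir
defmodule DiffSketch do
  # Hypothetical helper: hashes the same fields the SQL join compares
  # (the id followed by the trimmed name).
  def row_hash(id, name) do
    :crypto.hash(:md5, id <> String.trim(name))
    |> Base.encode16(case: :lower)
  end

  # Returns the external rows whose hash has no match among the imported
  # rows, i.e. the rows the LEFT JOIN ... IS NULL query would return.
  def changed_or_new(external, imported) do
    known =
      MapSet.new(imported, fn %{external_id: id, name: name} -> row_hash(id, name) end)

    Enum.reject(external, fn %{company_id: id, name: name} ->
      MapSet.member?(known, row_hash(id, name))
    end)
  end
end

external = [
  %{company_id: "11012", name: "ACME Corp.1"},
  %{company_id: "22133", name: "Wonka Industries "}
]

imported = [
  %{external_id: "11012", name: "ACME Corp."},
  %{external_id: "22133", name: "Wonka Industries"}
]

DiffSketch.changed_or_new(external, imported)
# returns [%{company_id: "11012", name: "ACME Corp.1"}]
```

The trailing space in "Wonka Industries " is trimmed before hashing, so only the renamed ACME row counts as changed.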
36. Show me some Elixir, already!
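defmodule MyImporter.Companies.SQLModule do
  # Assumed alias so that Importer.run/3 resolves to the import module
  alias MyImporter.Companies.ImportModule, as: Importer

  def sync do
    find_companies() |> upsert()
  end

  def upsert(companies) do
    Importer.run(companies, &map/1, &import_batch/1)
  end
end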
37. Show me some Elixir code, already!
defmodule MyImporter.Companies.SQLModule do
  alias Ecto.Adapters.SQL
  alias MyImporter.Repo

  # Returns a lazy stream of result pages; it has to be consumed
  # inside a transaction.
  def find_companies do
    SQL.stream(Repo,
      """
      SELECT external.company_id, external.name
      FROM companies external
      LEFT JOIN internal_companies internal
        ON MD5(CONCAT(external.company_id, TRIM(external.name)))
         = MD5(CONCAT(internal.external_id, TRIM(internal.name)))
      WHERE internal.external_id IS NULL
      """
    )
  end
end
38. Show me some Elixir code - SQL Module (cont.)
defmodule MyImporter.Companies.SQLModule do
  defp map([external_id, name]) do
    %{
      name: String.trim(name),
      external_id: external_id,
      inserted_at: DateTime.utc_now(),
      updated_at: DateTime.utc_now()
    }
  end

  defp import_batch(batch) do
    Repo.insert_all(Company, batch,
      on_conflict: :replace_all,
      conflict_target: :external_id
    )
  end
end
39. Show me some Elixir code - Import Module 2
defmodule MyImporter.Companies.ImportModule do
  # Public so that the SQL module can call it as Importer.run/3
  def run(source, item_mapper, batch_processor) do
    # Runs the side effect, then passes the batch on so it can be counted
    processor = fn batch ->
      batch_processor.(batch)
      batch
    end

    Repo.transaction(fn ->
      source
      |> Stream.flat_map(fn %{rows: rows} -> rows end)
      |> Stream.map(item_mapper)
      |> Stream.chunk_every(2000)
      |> Stream.flat_map(processor)
      |> Enum.count()
    end)
  end
end
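The batching pipeline can be tried without a database. The sketch below is a stand-alone approximation, not the code from the slides: `BatchSketch`, the `batch_size` parameter, and the sample pages are hypothetical, and the transaction is left out. It shows how result pages are flattened into rows, mapped, chunked, and handed to a batch processor, with the row count as the return value.

```elixir
defmodule BatchSketch do
  # Flattens result pages into rows, maps each row, and hands fixed-size
  # batches to `batch_processor`. Returns the number of processed rows.
  def run(source, item_mapper, batch_processor, batch_size \\ 2000) do
    source
    |> Stream.flat_map(fn %{rows: rows} -> rows end)
    |> Stream.map(item_mapper)
    |> Stream.chunk_every(batch_size)
    |> Stream.flat_map(fn batch ->
      batch_processor.(batch)
      batch
    end)
    |> Enum.count()
  end
end

# Stand-in for the pages a SQL stream would yield.
pages = [%{rows: [["1", " A "], ["2", "B"]]}, %{rows: [["3", "C"]]}]
mapper = fn [id, name] -> %{external_id: id, name: String.trim(name)} end
# Stand-in for the database write: just report each batch size.
processor = fn batch -> IO.puts("batch of #{length(batch)}") end

BatchSketch.run(pages, mapper, processor, 2)
# prints "batch of 2", then "batch of 1"; returns 3
```

Because every step is a `Stream`, rows are pulled through one batch at a time and the full result set is never held in memory at once.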
40. Trigger mechanism
● SQL Module: runs SQL periodically
● Import Module: imports the changed data
● Data Trigger: checks a timestamp, saves the state in a GenServer
● Supervisor
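One way the data trigger could look is sketched below, under stated assumptions: the timestamp being checked is taken to be the CSV file's modification time, and `TriggerSketch` with its `path`, `interval_ms`, and `on_change` options is hypothetical, not taken from the talk. The GenServer keeps the last seen timestamp in its state and calls `on_change` only when the file has changed.

```elixir
defmodule TriggerSketch do
  use GenServer

  # `path`, `interval_ms` and `on_change` are hypothetical options.
  def start_link(opts), do: GenServer.start_link(__MODULE__, Map.new(opts))

  @impl true
  def init(%{path: _, on_change: _} = opts) do
    state = Map.merge(%{interval_ms: 60_000, last_mtime: nil}, opts)
    # Run the first check right after startup.
    send(self(), :check)
    {:ok, state}
  end

  @impl true
  def handle_info(:check, state) do
    # Schedule the next check, then perform the current one.
    Process.send_after(self(), :check, state.interval_ms)
    {:noreply, check(state)}
  end

  # Runs the callback only when the file's modification time has changed.
  # A missing or unreadable file leaves the state untouched.
  def check(%{path: path, last_mtime: last, on_change: on_change} = state) do
    case File.stat(path, time: :posix) do
      {:ok, %File.Stat{mtime: mtime}} when mtime != last ->
        on_change.()
        %{state | last_mtime: mtime}

      _ ->
        state
    end
  end
end
```

Since `use GenServer` defines a default `child_spec/1`, a supervisor could start it with a child entry such as `{TriggerSketch, path: "/files/Company.csv", on_change: &MyImporter.Companies.SQLModule.sync/0}`, assuming `sync/0` is the entry point to run.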
42. Why not to use this strategy?
Like… all the time?
● Business logic lives in the database
○ Harder to change
○ Database Server needs to know about the files
● Foreign Table is tied to the file
● Requires SQL knowledge*
● It’s actually a bit harder to test than a single import script
○ It doesn’t make sense for single imports
43. Why to use this strategy?
Like… some of the time?
● It’s fast
○ The database itself is the limiting factor
○ We effectively run COPY on query
● It’s really nice for comparing state (thank you, SQL!)
○ Easy to get diffs
○ Easy to join files in memory to create a substate
● It’s straightforward to implement a synching mechanism
44. 30x faster
By switching strategies for importing customer data
(Full disclosure: by switching from a queue-based strategy to a synching strategy)
45. Learnings
● Do not treat your database as dumb storage
○ Leverage its capabilities!
○ Read the docs
● There is more than one way to do things
● “Making it go fast” == “Not doing a lot of things”