This presentation will be given by Dan Belibov and Daniel Toader, developers in eMAG's Recommendation engine team. Find out how they combined PHP with Golang, Kafka and Neo4j to achieve a good mix between business requirements and team goals.
3. Who we are
eMAG is the biggest online shop and marketplace in
Romania and one of the biggest in Hungary, Bulgaria and
Poland.
Our team works on the internal Recommendation engine,
which provides relevant products for customers on the
site and in mobile applications, emails and showrooms.
9. Common approach
What we want: ~200ms - max response time
● Symfony framework
● Guzzle HTTP Client
Result: max response time exceeded, because SSL validation alone takes >120ms
Potential solution: pfsockopen *
* Caveat: when the connection is broken by a physical network failure, pfsockopen() still returns a handle as if the connection were working.
10. Our final approach
Go is an open source programming
language that makes it easy to build
simple, reliable, and efficient
software.
11. Our final approach
● A tiny compiled Golang web server and CLI command
● Native use of a shared connection pool
● SSL validation is executed once per pooled connection, not once per request
● No external libraries needed
● Connection errors are handled by internal logic
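The connection-pool point can be illustrated with a minimal sketch that uses only the Go standard library. It is not the team's actual service: the test server, the `fetchTwice` and `demoReuse` helpers and the "ok" response are all assumptions made for illustration. It sends two HTTPS requests through one shared `http.Client` and records whether each request got a fresh or a reused connection, which is where the TLS handshake cost is saved.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// fetchTwice sends two sequential GET requests through the same client and
// records, per request, whether the connection came from the idle pool.
func fetchTwice(client *http.Client, url string) ([]bool, error) {
	reused := make([]bool, 0, 2)
	for i := 0; i < 2; i++ {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) {
				reused = append(reused, info.Reused)
			},
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		// Drain and close the body so the connection goes back to the pool.
		if _, err := io.Copy(io.Discard, resp.Body); err != nil {
			resp.Body.Close()
			return nil, err
		}
		resp.Body.Close()
	}
	return reused, nil
}

// demoReuse starts a local HTTPS test server (a stand-in for a real backend)
// and runs fetchTwice against it with the server's preconfigured client.
func demoReuse() ([]bool, error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()
	return fetchTwice(srv.Client(), srv.URL)
}

func main() {
	reused, err := demoReuse()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("connection reused per request:", reused)
}
```

The first request has to open a connection and complete the TLS handshake. The second one should find that connection idle in the pool and skip the handshake, which is the saving the slide describes.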
15. Common approach
In order to do this we needed to process more than a million events per day (such as visits,
orders, ratings, favourites, etc.).
Simple:
● Process the event as it happens
● A lot of events generate high load and service unavailability
Better:
● Queue system (e.g. RabbitMQ)
● Each message needed to be fetched, processed and acknowledged
● Full queues can lead to data loss
16. Our final approach
Apache Kafka® is a distributed streaming
platform. It is used for building real-time
data pipelines and streaming apps. It is
horizontally scalable, fault-tolerant,
wicked fast, and runs in production in
thousands of companies.
17. Our final approach
● Publish and subscribe to streams of records, similar to a message
queue or enterprise messaging system
● Store streams of records in a fault-tolerant durable way
● Keep the stream available for a configurable retention period
● Any number of independent stream cursors can read the same stream
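The cursor idea in the list above can be modelled with a small in-memory sketch. This is not the Kafka client API: the `Log` type, its methods and the consumer names `recommender` and `analytics` are hypothetical, chosen only to show that reading advances a per-consumer position while the records themselves stay in the log.

```go
package main

import "fmt"

// Log is an append-only list of records, loosely modelling a single stream.
// Each consumer has its own cursor; reading never removes a record.
type Log struct {
	records []string
	cursors map[string]int // consumer name -> offset of the next unread record
}

// NewLog returns an empty log with no registered cursors.
func NewLog() *Log {
	return &Log{cursors: make(map[string]int)}
}

// Append adds a record to the end of the log.
func (l *Log) Append(rec string) {
	l.records = append(l.records, rec)
}

// Poll returns up to max unread records for the given consumer and moves
// only that consumer's cursor forward. No acknowledgement step is needed.
func (l *Log) Poll(consumer string, max int) []string {
	if max < 0 {
		max = 0
	}
	start := l.cursors[consumer]
	end := start + max
	if end > len(l.records) {
		end = len(l.records)
	}
	out := append([]string(nil), l.records[start:end]...)
	l.cursors[consumer] = end
	return out
}

func main() {
	log := NewLog()
	for _, ev := range []string{"visit", "order", "rating"} {
		log.Append(ev)
	}

	// Two independent consumers read the same stream at their own pace.
	fmt.Println(log.Poll("recommender", 2))
	fmt.Println(log.Poll("analytics", 10))
	fmt.Println(log.Poll("recommender", 10))
}
```

Because each consumer only owns an offset, adding another reader costs one more map entry and does not affect the readers that already exist.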
18. Our final approach
● With Apache Kafka, messages only need to be fetched and processed; the stream
cursor is moved by the read operation itself.
● Instead of using a lot of processes in PHP, we used fewer processes in Golang
with goroutines.
● We built a custom connector for Golang, which we want to make public in the
future.
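The "fewer processes, more goroutines" point can be sketched as a small worker pool inside a single Go process. The `Event` type, the `countByKind` function and the sample events are assumptions made for this illustration; a real consumer would read from Kafka instead of a slice.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical user event such as a visit, an order or a rating.
type Event struct {
	Kind       string
	CustomerID int
}

// countByKind distributes events over a fixed number of goroutines inside
// one process and returns how many events of each kind were handled.
func countByKind(events []Event, workers int) map[string]int {
	if workers < 1 {
		workers = 1
	}
	in := make(chan Event)
	counts := make(map[string]int)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range in {
				// Real processing would happen here; the sketch only counts.
				mu.Lock()
				counts[ev.Kind]++
				mu.Unlock()
			}
		}()
	}

	for _, ev := range events {
		in <- ev
	}
	close(in) // lets the workers drain the channel and exit
	wg.Wait()
	return counts
}

func main() {
	events := []Event{
		{Kind: "visit", CustomerID: 1},
		{Kind: "order", CustomerID: 1},
		{Kind: "visit", CustomerID: 2},
		{Kind: "rating", CustomerID: 3},
	}
	fmt.Println(countByKind(events, 3))
}
```

Goroutines are cheap compared with operating system processes, so the number of workers can be tuned per machine without starting additional processes.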
21. Common approach
Data: ~5 million customers, ~20 million users, ~5 million products and >200 million relations
Using a relational database:
● Needs associative entity tables - further increase join operation costs
● A lot of updates - generate big load
● Inconsistent data - hard to detect
● Complex queries - require processing power
22. Our final approach: use graph database
Bulk write                  2nd place                2nd place         1st place
Single write                2nd place                3rd place         1st place
Read speed (single read)    2nd place                2nd place         1st place
Similar query (graph func)  1st place (100ms - 10s)  3rd place (>25s)  2nd place (20-70s)
DB size                     16 GB                    22 GB             22 GB
23. Our final approach
Neo4j is a graph database
management system and is the
most popular graph database
according to DB-Engines ranking,
and the 22nd most popular
database overall.
24. Our final approach
Using Neo4j for native graph storage:
● Keeps relations as entities with information attached
● Easy to find dependencies and orphan nodes
● Uses Cypher as the query language, Bolt as the binary protocol over TCP for drivers, and
the Java Core API for low-level graph handling
● Drivers are available for a lot of languages, including PHP
25. Comparison: relational schema vs graph schema
SELECT rec.*
FROM Customer c
JOIN CustomerVisit cv1 ON c.Id = cv1.CustomerId
JOIN Product p ON p.Id = cv1.ProductId
JOIN CustomerVisit cv2 ON p.Id = cv2.ProductId
JOIN Customer cs ON cs.Id = cv2.CustomerId
JOIN CustomerVisit cv3 ON cs.Id = cv3.CustomerId
JOIN Product rec ON rec.Id = cv3.ProductId
WHERE c.Id = x
GROUP BY rec.Id
ORDER BY count(rec.Id) DESC
LIMIT 10
26. Comparison: relational schema vs graph schema
MATCH
(:Customer {id: x})-[:VISITED]->(:Product)<-[:VISITED]-(:Customer)-[o:VISITED]->(rec:Product)
WITH rec, count(o) AS freq
ORDER BY freq DESC
LIMIT 10
RETURN rec
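To make the traversal in the Cypher query above concrete, here is a minimal in-memory sketch of the same walk: customer, visited product, other visitor, that visitor's products. It is not how Neo4j executes the query. The `Graph` type, its methods and the sample ids are hypothetical, and it assumes at most one VISITED edge per customer and product pair. Since Cypher does not traverse the same relationship twice within one match, the sketch skips the starting customer and the product it arrived through.

```go
package main

import (
	"fmt"
	"sort"
)

// Graph stores VISITED edges in both directions for cheap traversal.
type Graph struct {
	visited  map[int]map[int]struct{} // customer id -> visited product ids
	visitors map[int]map[int]struct{} // product id -> visiting customer ids
}

// NewGraph returns an empty graph.
func NewGraph() *Graph {
	return &Graph{
		visited:  make(map[int]map[int]struct{}),
		visitors: make(map[int]map[int]struct{}),
	}
}

// AddVisit records one VISITED edge; repeated visits collapse into one edge.
func (g *Graph) AddVisit(customer, product int) {
	if g.visited[customer] == nil {
		g.visited[customer] = make(map[int]struct{})
	}
	if g.visitors[product] == nil {
		g.visitors[product] = make(map[int]struct{})
	}
	g.visited[customer][product] = struct{}{}
	g.visitors[product][customer] = struct{}{}
}

// Recommend walks customer -> product <- other customer -> rec, counts how
// often each rec is reached, and returns the most frequent product ids.
func (g *Graph) Recommend(customer, limit int) []int {
	freq := make(map[int]int)
	for p := range g.visited[customer] {
		for other := range g.visitors[p] {
			if other == customer {
				continue // would reuse the edge we arrived on
			}
			for rec := range g.visited[other] {
				if rec == p {
					continue // would reuse the other customer's edge to p
				}
				freq[rec]++
			}
		}
	}
	ids := make([]int, 0, len(freq))
	for id := range freq {
		ids = append(ids, id)
	}
	// Highest frequency first; ties are broken by id to keep output stable.
	sort.Slice(ids, func(i, j int) bool {
		if freq[ids[i]] != freq[ids[j]] {
			return freq[ids[i]] > freq[ids[j]]
		}
		return ids[i] < ids[j]
	})
	if limit < 0 {
		limit = 0
	}
	if limit < len(ids) {
		ids = ids[:limit]
	}
	return ids
}

func main() {
	g := NewGraph()
	g.AddVisit(1, 10)
	g.AddVisit(1, 20)
	g.AddVisit(2, 10)
	g.AddVisit(2, 30)
	g.AddVisit(2, 40)
	g.AddVisit(3, 10)
	g.AddVisit(3, 20)
	g.AddVisit(3, 30)
	fmt.Println(g.Recommend(1, 2))
}
```

Like the query on the slide, the sketch does not filter out products the customer has already visited; a production recommender would probably add that filter.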