An Insight Data Engineering Project by Emmanuel Awa, 2016a cohort.
A real-time, scalable deals-serving platform. It provides insights and search capabilities, helping end users save time and maximize profit.
For three weeks, I worked on www.exstreamlycheap.club, using open-source tools to create a single platform for inspiration, search, shopping and serving of deals.
16. Real-time pipeline
[Architecture diagram. Labels: Ingestion Layer, Speed Layer, Serving Layer, Queries, Async Hybrid Distributed Query Engine, RESTful API. Load: engineered user data, ~200k events/min]
18. Project challenges
BAD API DESIGN
1. Pagination.
2. Maximum of 100 results per page.
3. Dynamic API without a firehose or sockets.
4. Duplicate deals with synchronous API calls.
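The pagination and duplicate-deal problems listed above can be sketched in Python. This is only a minimal illustration under assumptions: `fetch_page` is a hypothetical callable standing in for the real HTTP request, and each deal is assumed to carry a unique `id` field. Neither detail is confirmed by the slides.

```python
from typing import Callable, Dict, Iterator, List

PER_PAGE = 100  # page-size cap listed above


def iter_unique_deals(
    fetch_page: Callable[[int, int], List[Dict]],
    max_pages: int = 1000,
) -> Iterator[Dict]:
    """Walk pages in order and yield each deal once, keyed by its id."""
    seen = set()
    for page in range(1, max_pages + 1):
        deals = fetch_page(page, PER_PAGE)
        if not deals:
            break  # an empty page is treated as the end of the feed
        for deal in deals:
            if deal["id"] not in seen:
                seen.add(deal["id"])
                yield deal


def fake_fetch(page: int, per_page: int) -> List[Dict]:
    # Stand-in data (hypothetical): deal 2 appears on both pages,
    # as can happen when the feed shifts between two calls.
    pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 2}, {"id": 3}]}
    return pages.get(page, [])


unique_ids = [deal["id"] for deal in iter_unique_deals(fake_fetch)]
```

Keeping the set of seen ids in one place means the same filter could be reused if the page requests were later issued concurrently.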
ROBUST PLUGINS
PyKafka vs Kafka-Python
1. Balanced consumer.
2. Topic-to-partition assignment: hash partitioning.
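Hash partitioning, as named above, can be illustrated with a few lines of Python. This is a generic sketch, not the partitioner shipped with PyKafka or Kafka-Python: the function name `partition_for` is hypothetical, and CRC32 is used only because it is available in the standard library and is stable across runs.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index with a stable hash."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


# Example keys (hypothetical): the same key always lands on the same partition.
keys = [b"home_goods", b"electronics", b"home_goods"]
assignments = [partition_for(key, 4) for key in keys]
```

Because the mapping depends only on the key and the partition count, all messages for one key stay in order on one partition, but changing the partition count reshuffles the assignment.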
GENERAL ENGINEERING CONSTRAINTS
1. Design and architecture choices.
2. Deep dive into tools: tweaking source code.
3. Constant Cassandra crashes under real-time writes.
4. DevOps.
20. PROPER DB INDEXES
Partition and clustering keys
CREATE TABLE trend_with_price (
    ...
    PRIMARY KEY (price, discount)
) WITH CLUSTERING ORDER BY (discount DESC);
Secondary indexes
CREATE INDEX trend_with_price_category_idx ON trend_with_price (category);
Secret to answering the questions?
27. Benchmarking exercises
Elasticsearch and Cassandra
METRICS - 20 GB of dirty data
1. I/O - reads and writes.
2. Four (4) EC2 m4.xlarge clusters.
CONSIDERATIONS
1. Elasticsearch vs Cassandra.
2. Elasticsearch on Cassandra.
GENERAL PROCESS FLOW
1. Read dirty Python dictionary from DB.
2. Parse and process.
3. Write back to DB.
4. Profile the process.
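The process flow above can be sketched roughly as follows. The sketch assumes that a "dirty Python dictionary" means a dict stored as its text representation, possibly malformed; the helper names (`parse_dirty`, `clean`, `run_batch`) and the sample rows are hypothetical, and the database read and write steps are replaced by plain lists.

```python
import ast
import time


def parse_dirty(raw: str) -> dict:
    """Parse a dict literal stored as text; return {} if it is malformed."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return {}
    return value if isinstance(value, dict) else {}


def clean(record: dict) -> dict:
    """Strip whitespace from string values and drop empty fields."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in ("", None):
            out[key] = value
    return out


def run_batch(rows):
    """Parse and clean every row, returning the results and the elapsed seconds."""
    start = time.perf_counter()
    cleaned = [clean(parse_dirty(row)) for row in rows]
    elapsed = time.perf_counter() - start
    return cleaned, elapsed


# Stand-in for rows read from the DB (hypothetical data).
rows = ["{'title': '  50% off  ', 'price': 10, 'note': ''}", "not a dict"]
cleaned, elapsed = run_batch(rows)
```

Timing the whole batch with `time.perf_counter` gives a coarse profile; a per-step breakdown would need a profiler such as `cProfile`.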
Elasticsearch advantages
1. Good for preserving data indexes.
2. Great for more reads than writes.
3. Analytics and text search.
Cassandra advantages
1. Good for fast writes.
2. Preserving data schemas.
3. Uptime-critical and time-series data.
29. BIGGEST PROJECT CHALLENGE - API constraints
API pagination and max per page:
http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100
Freezing time for a real-time, non-firehose data source is hard
1. Three queries done at the same time.
2. Not fun – Inconsistent.
3. Application depends largely on total counts.
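The dependence on total counts can be made concrete with a little arithmetic. This is a sketch under assumptions: it takes the total number of deals as a plain integer (how the API reports it is not shown in the slides), and the function names and sample totals are hypothetical.

```python
import math

PER_PAGE = 100  # page-size cap from the API constraints above


def page_count(total: int, per_page: int = PER_PAGE) -> int:
    """Number of page requests needed to cover `total` deals."""
    if total < 0 or per_page <= 0:
        raise ValueError("total must be >= 0 and per_page must be > 0")
    return math.ceil(total / per_page)


def plan_is_stale(total_at_start: int, total_now: int, per_page: int = PER_PAGE) -> bool:
    """True when the total moved enough to change the number of pages."""
    return page_count(total_at_start, per_page) != page_count(total_now, per_page)


pages_before = page_count(1234)    # 13 pages
pages_after = page_count(1301)     # 14 pages
stale = plan_is_stale(1234, 1301)  # True: the original page plan no longer fits
```

If the total shifts while pages are still being fetched, a plan computed from the first response can miss or repeat deals, which is one way the inconsistency noted above can show up.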
[Screenshots: Page #1 loads for the first time; Page #1 refreshes within milliseconds; Page #2 loads]