ExStreamlycheap Final Slides


An Insight Data Engineering project by Emmanuel Awa, 2016A cohort. A real-time, scalable deals-serving platform. It provides insights and search capabilities, maximizing time and profit for end users. For three weeks, I worked on www.exstreamlycheap.club, using open source tools to create a single platform for inspiration, search, shopping and serving of deals.

Software

hello!
I am Emmanuel Awa
Insight Data Engineering Fellow 2016 Spring cohort.
You can find me at @awaemma

content.
1. project motivation.
2. Engineering solution.
3. Challenges.
4. Takeaways

1.
Project Motivation
...what did I spend three weeks working on?

“I have learnt to seek my happiness in limiting my desires, rather than attempting to satisfy them.”
~ John Stuart Mill

Big concept
1. Real-time, scalable deals-serving platform.
2. Insights and searching capabilities.
3. Maximizing time and profit.

sample queries
User’s holistic view - real time trends visualization.

sample queries
One highly scalable search platform: price or discount options.

sample queries
Engineer User purchase interaction…

2.
Engineering Solution
...Finding the right tools for the job.

data source - sqoot api
Rich Merchant Info.
○ Location based queries.

data source - sqoot api
Scaled to all categories
○ ~ 11 million served.
○ Over 80 categories.

pipeline - λ architecture
Ingestion → Batch / Speed layer → Serving layer → Queries

batch pipeline
[Diagram labels: Hybrid Streaming API Interaction; Async Hybrid Distributed Query Engine; restful api; Queries; Serving Layer; Batch Layer; Ingestion Layer]

Real Time pipeline
[Diagram labels: Async Hybrid Distributed Query Engine; restful api; Queries; Serving Layer; Speed Layer; Ingestion Layer]
Engineered user data: ~ 200k events / min

3.
Challenges
...now it was easy, right?

project challenges.
BAD API DESIGN
1. Pagination.
2. Max. 100 per page.
3. Dynamic api without firehose or sockets.
4. Duplicate deals with sync api calls.
ROBUST PLUGINS.
PyKafka vs Kafka-Python
1. Balanced consumer.
2. Topic to Partition assignment - HASH PARTITIONING.
GENERAL ENGR. CONSTRAINTS.
1. Design and architecture choice.
2. Tools deep dive - Tweak source code.
3. Constant Cassandra crashes - Real time writes.
4. DevOps
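The hash-partitioning point above can be illustrated with a small sketch. This is not the partitioner of PyKafka or Kafka-Python; the function name `partition_for`, the choice of CRC32, and the category names other than `home_goods` are assumptions made only to keep the example deterministic and self-contained.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition using a stable hash.

    CRC32 is an assumption chosen for illustration; real clients may use
    a different hash function.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # The same key always lands on the same partition.
    for category in ["home_goods", "travel", "dining"]:
        print(category, "->", partition_for(category, 4))
```

Because the mapping depends only on the key and the partition count, all messages for one category go to one partition, which keeps per-category ordering but can skew load if a few categories dominate.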

PROPER DB INDEXES
Partition and Clustering keys
CREATE TABLE trend_with_price (... PRIMARY KEY (price, discount)) WITH CLUSTERING ORDER BY (discount DESC);
Secondary indexes
CREATE INDEX trend_with_price_category_idx ON trend_with_price (category);
secret to answering the questions?
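To show why the key choice matters, here is a toy in-memory model of a table keyed by `(price, discount)`: `price` picks the partition and `discount` orders rows inside it, descending. It is a sketch of the idea only, not how Cassandra stores data; the class name `TrendWithPrice`, its methods, and the `title` value are hypothetical.

```python
class TrendWithPrice:
    """Toy model: partition key = price, clustering key = discount (DESC)."""

    def __init__(self):
        # price -> {discount -> title}; one row per (price, discount) pair
        self._partitions = {}

    def upsert(self, price, discount, title):
        # Writing the same primary key again overwrites the row.
        self._partitions.setdefault(price, {})[discount] = title

    def select(self, price, limit=None):
        """Return rows of one partition, highest discount first."""
        rows = self._partitions.get(price, {})
        ordered = sorted(rows.items(), key=lambda kv: kv[0], reverse=True)
        result = [(price, discount, title) for discount, title in ordered]
        return result if limit is None else result[:limit]
```

A query for one price reads a single partition and gets the best discounts first without extra sorting by the caller, which is the kind of access pattern the clustering order is meant to serve.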

KAFKA CONSUMPTION OFFSETS
Topic Offsets
Set the right start offset per partition
secret to answering the
questions?
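A minimal sketch of "set the right start offset per partition", written without any Kafka client so it runs on its own. The function `resolve_start_offsets` and its arguments are hypothetical; it assumes a committed offset means "next offset to read" and falls back to a reset policy when no usable commit exists.

```python
def resolve_start_offsets(committed, earliest, latest, reset="earliest"):
    """Choose a start offset for every partition.

    committed: partition -> committed offset, or None if nothing was committed
    earliest:  partition -> first offset still available in the log
    latest:    partition -> log end offset (next offset to be written)
    reset:     fallback policy, "earliest" or "latest"
    """
    if reset not in ("earliest", "latest"):
        raise ValueError("reset must be 'earliest' or 'latest'")
    starts = {}
    for partition, low in earliest.items():
        high = latest[partition]
        offset = committed.get(partition)
        if offset is None or not (low <= offset <= high):
            # No commit, or the commit points outside the retained log.
            starts[partition] = low if reset == "earliest" else high
        else:
            starts[partition] = offset
    return starts
```

Handling each partition separately matters because partitions advance and expire independently; one stale commit should not force a full replay of the others.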

About me
- Masters in CS
- 2 ½ yrs SE in Travel
- Nigerian
- Hobbyist Photographer

thanks!
Any questions?
You can find me at
@awaemma
awaemmanuel1@gmail.com
github.com/awaemmanuel
linkedin.com/in/emmanuelawa

benchmarking pipeline
[Diagram labels: Async Hybrid Distributed Query Engine; restful api; Queries; Serving Layer; Batch Layer; Ingestion Layer]

BIGGEST PROJECT CHALLENGE - api constraints
API Pagination and max per page:
http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100
Freezing time for a real-time, non-firehose data source is hard
1. Three queries done at the same time.
2. Not fun - inconsistent.
3. Application depends largely on total counts.
[Screenshots: Page #1 loads first time; Page #1 refresh in millisecs; Page #2 loads]
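The pagination problem can be sketched as a loop that walks pages and drops duplicates that appear when the data shifts between requests. The `fetch_page` callable, the `id` field, and the `(total, deals)` return shape are assumptions for illustration, not the documented Sqoot response format.

```python
import math


def page_count(total, per_page=100):
    """Number of pages needed for `total` records at `per_page` each."""
    return math.ceil(total / per_page)


def collect_deals(fetch_page, per_page=100):
    """Walk every page and keep the first copy of each deal id.

    fetch_page(page, per_page) must return (total, deals), where deals is
    a list of dicts with an "id" key (a hypothetical shape).
    """
    seen = {}
    page = 1
    while True:
        total, deals = fetch_page(page, per_page)
        for deal in deals:
            # A deal pushed onto a later page by new arrivals is kept once.
            seen.setdefault(deal["id"], deal)
        # The total is re-read on every call because it can change mid-walk.
        if not deals or page >= page_count(total, per_page):
            break
        page += 1
    return list(seen.values())
```

At the 100-per-page cap, roughly 11 million records would need about 110,000 page requests, so any inconsistency between calls has plenty of room to show up.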

Async distributed query engine - (Async DQE)
1. First Stage Master Producer (FSM)
2. Intermediate Hybrid Consumer-Producer
3. Final Stage Consumer
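The three stages listed above can be sketched with `asyncio` queues. The slides do not show the implementation, so this is only one possible shape: the queues stand in for whatever messaging layer connected the stages, and all function names and the `fetch` callable are hypothetical.

```python
import asyncio

_DONE = object()  # sentinel marking the end of the stream


async def master_producer(pages, out_q):
    """Stage 1: emit the work items (here, page numbers to query)."""
    for page in pages:
        await out_q.put(page)
    await out_q.put(_DONE)


async def hybrid_consumer_producer(in_q, out_q, fetch):
    """Stage 2: consume page numbers, fetch them, produce single deals."""
    while True:
        page = await in_q.get()
        if page is _DONE:
            await out_q.put(_DONE)  # pass the end marker downstream
            return
        for deal in await fetch(page):
            await out_q.put(deal)


async def final_consumer(in_q, sink):
    """Stage 3: consume deals and hand them to the sink (a list here)."""
    while True:
        deal = await in_q.get()
        if deal is _DONE:
            return
        sink.append(deal)


async def run_pipeline(pages, fetch):
    q1, q2 = asyncio.Queue(), asyncio.Queue()
    sink = []
    await asyncio.gather(
        master_producer(pages, q1),
        hybrid_consumer_producer(q1, q2, fetch),
        final_consumer(q2, sink),
    )
    return sink
```

With one worker per stage the output order follows the input order; running several middle-stage workers would raise throughput at the cost of that ordering.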


1. ExStreamlyCheap.club

2. hello! I am Emmanuel Awa Insight Data Engineering Fellow 2016 Spring cohort. You can find me at @awaemma

3. content. 1. project motivation. 2. Engineering solution. 3. Challenges. 4. Takeaways

4. 1. Project Motivation ...what did I spend three weeks working on?

5. “I have learnt to seek my happiness in limiting my desires, rather than attempting to satisfy them.” ~ John Stuart Mill

6. Big concept 1. Real time scalable deals serving platform 2. Insights and searching capabilities 3. Maximizing time and profit.

7. sample queries User’s holistic view - real time trends visualization.

8. sample queries One highly scalable search platform: price or discount options.

9. sample queries Engineer User purchase interaction…

10. sample queries … and reaction.

11. 2. Engineering Solution ...Finding the right tools for the job.

12. data source - sqoot api Rich Merchant Info. ○ Location based queries.

13. data source - sqoot api Scaled to all categories ○ ~ 11 million served. ○ Over 80 categories.

14. pipeline - λ architecture: Ingestion → Batch / Speed layer → Serving layer → Queries

15. batch pipeline [Diagram labels: Hybrid Streaming API Interaction; Async Hybrid Distributed Query Engine; restful api; Queries; Serving Layer; Batch Layer; Ingestion Layer]

16. Real Time pipeline [Diagram labels: Async Hybrid Distributed Query Engine; restful api; Queries; Serving Layer; Speed Layer; Ingestion Layer] Engineered user data: ~ 200k events / min

17. 3. Challenges ...now it was easy, right?

18. project challenges. BAD API DESIGN: 1. Pagination. 2. Max. 100 per page. 3. Dynamic api without firehose or sockets. 4. Duplicate deals with sync api calls. ROBUST PLUGINS (PyKafka vs Kafka-Python): 1. Balanced consumer. 2. Topic to Partition assignment - HASH PARTITIONING. GENERAL ENGR. CONSTRAINTS: 1. Design and architecture choice. 2. Tools deep dive - Tweak source code. 3. Constant Cassandra crashes - Real time writes. 4. DevOps

19. 4. Takeaways ...some lessons learned.

20. PROPER DB INDEXES. Partition and Clustering keys: CREATE TABLE trend_with_price (... PRIMARY KEY (price, discount)) WITH CLUSTERING ORDER BY (discount DESC); Secondary indexes: CREATE INDEX trend_with_price_category_idx ON trend_with_price (category); secret to answering the questions?

21. KAFKA CONSUMPTION OFFSETS Topic Offsets Set the right start offset per partition secret to answering the questions?

22. secret to answering the questions?

23. that’s all folks!

24. About me - Masters in CS - 2 ½ yrs SE in Travel - Nigerian - Hobbyist Photographer

25. thanks! Any questions? You can find me at @awaemma awaemmanuel1@gmail.com github.com/awaemmanuel linkedin.com/in/emmanuelawa

26. BACKUP SLIDES

27. Benchmarking exercises: Elasticsearch and Cassandra. METRICS - 20 GB of dirty data: 1. I/O - Reads and Writes. 2. EC2 four (4) m4.xlarge clusters. CONSIDERATIONS: 1. Elasticsearch vs Cassandra. 2. Elasticsearch on Cassandra. GENERAL PROCESS FLOW: 1. Read dirty python dictionary from DB. 2. Parse and process. 3. Write back to DB. 4. Profile process. Elasticsearch Advantages: 1. Good for preserving data indexes. 2. Great for more reads than writes. 3. Analytics and text search. Cassandra Advantages: 1. Good for fast writes. 2. Preserving data schemas. 3. Uptime critical and Time series data.
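The parse-and-profile steps of the process flow above might look like the sketch below. It assumes each "dirty python dictionary" is a dict literal stored as text; the function names are hypothetical, and the database read and write steps are left out so the example runs on its own.

```python
import ast
import time


def parse_dirty_record(raw):
    """Parse a dict literal stored as text; return None if it is unusable."""
    try:
        value = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return None
    return value if isinstance(value, dict) else None


def profile_batch(raw_records):
    """Parse a batch and report how long the parsing took, in seconds."""
    start = time.perf_counter()
    parsed = [rec for rec in map(parse_dirty_record, raw_records) if rec is not None]
    elapsed = time.perf_counter() - start
    return parsed, elapsed
```

`ast.literal_eval` is used instead of `eval` so that stored text cannot execute code while being parsed.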

28. benchmarking pipeline [Diagram labels: Async Hybrid Distributed Query Engine; restful api; Queries; Serving Layer; Batch Layer; Ingestion Layer]

29. BIGGEST PROJECT CHALLENGE - api constraints. API Pagination and max per page: http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100 Freezing time for a real-time, non-firehose data source is hard: 1. Three queries done at the same time. 2. Not fun - inconsistent. 3. Application depends largely on total counts. [Screenshots: Page #1 loads first time; Page #1 refresh in millisecs; Page #2 loads]

30. Async distributed query engine - (Async DQE) 1. First Stage Master Producer (FSM) 2. Intermediate Hybrid Consumer-Producer 3. Final Stage Consumer

31. THE END...
