…where the best deals find you in real time.
Emmanuel Awa
• For the love of deals: we all love them.
• A real-world engineering challenge.
MOTIVATION
• ONE platform: searches and shopping inspired by the user's preferences.
MOTIVATION
Sqoot API.
• Scaled to all categories offered by the API
Sample Data
• User interaction – engineered data for 1B users
Current Data Source
• Any trending deals?
• Top-selling providers
• Categorize deals based on price and discount percentages.
• Friends' purchase patterns
Sample Queries.
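The sample queries above can be sketched in plain Python. This is only an illustration, not the project's implementation (the notes say aggregation was done in Spark). The field names `provider_name`, `number_sold`, `price`, `discount_percentage` and `title` are taken from the sample table later in the deck; the function names and the band widths are assumptions made for this example.

```python
from collections import Counter, defaultdict


def top_providers(deals, n=3):
    """Return the n providers with the highest total number_sold."""
    sold = Counter()
    for deal in deals:
        sold[deal["provider_name"]] += deal["number_sold"]
    return sold.most_common(n)


def bucket(value, width):
    """Lower edge of the fixed-width band that contains value."""
    return int(value // width) * width


def categorize(deals, price_width=50, discount_width=25):
    """Group deal titles by (price band, discount band).

    The band widths are hypothetical defaults, not values from the deck.
    """
    groups = defaultdict(list)
    for deal in deals:
        key = (
            bucket(deal["price"], price_width),
            bucket(deal["discount_percentage"], discount_width),
        )
        groups[key].append(deal["title"])
    return dict(groups)
```

With bands like these, a deal priced at 45 with a 40% discount lands in the same group as one priced at 20 with a 30% discount.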
• Complex queries? Real-time response?
Sample Queries.
Current Pipeline
API
INGESTION
BATCH LAYER
SERVING LAYER
Hybrid
Streaming
API
Interaction
and deals
collection
• API DESIGN
• Bad or good?
Biggest Engineering
Challenges
• Pagination limits and constant API updates.
http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100
• Freezing time for a real-time, non-firehose data source is hard.
Data Source Constraints
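A minimal sketch of how the page URLs for one category could be derived from a reported total. It reuses the URL shape shown on the slide, including its `;` separators and the placeholder `api_key=xxxxxx`. The helper names `page_count` and `page_urls`, and the idea that the total is known up front, are assumptions for illustration.

```python
import math

BASE_URL = "http://api.sqoot.com/v2/deals"  # endpoint as shown on the slide


def page_count(total, per_page=100):
    """Number of pages needed to cover `total` deals."""
    return math.ceil(total / per_page) if total > 0 else 0


def page_urls(category_slug, total, per_page=100, api_key="xxxxxx"):
    """Build one URL per page, in the ';'-separated form used on the slide."""
    return [
        f"{BASE_URL}?api_key={api_key};category_slug={category_slug}"
        f";page={page};per_page={per_page}"
        for page in range(1, page_count(total, per_page) + 1)
    ]
```

Because the page count is a function of the total, any change in the total while a crawl is running shifts which deals sit on which page. That is the inconsistency the next slide describes.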
Biggest Project Challenge
Three queries done at the same time.
Not fun: the results are inconsistent. Pagination depends largely on the total.
New Page refresh New
• ASYNC DISTRIBUTED QUERYING ENGINE
• First Stage Master Producer (FSM)
• Intermediate Hybrid Consumer-Producer
• Final Stage Consumer
Design to solve this?
Architecture
• FIRST STAGE MASTER
Compute page chunks
Leaky bucket approach
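The two ideas on this slide can be sketched as follows: split the page numbers into chunks, then release the chunks at a fixed rate, which is a simplified form of a leaky bucket. In the project the chunks went to a Kafka topic; here `emit` is any callable, and all names and parameters are hypothetical.

```python
import time


def page_chunks(total_pages, chunk_size):
    """Split pages 1..total_pages into consecutive chunks of chunk_size."""
    pages = list(range(1, total_pages + 1))
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def release_at_fixed_rate(chunks, interval_s, emit, sleep=time.sleep):
    """Hand chunks to `emit` one at a time, pausing interval_s between them.

    `emit` stands in for producing to a queue or topic. `sleep` is
    injectable so the pacing can be checked without real waiting.
    """
    for index, chunk in enumerate(chunks):
        if index > 0:
            sleep(interval_s)  # constant drain rate, regardless of backlog
        emit(chunk)
```

Pacing the output at a constant rate keeps the downstream consumers, and the external API they call, from being hit with a burst of requests.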
• FIRST STAGE MASTER Cont’d
• HYBRID CONSUMER-PRODUCER
Fetch and produce actual data.
• FINAL STAGE CONSUMER
Persist data to HDFS
• Nigerian.
• Master's in Computer Science – Brandeis University, MA
• Software Engineer, 2½ years.
• Hobbyist photographer.
About Me.
• PyKafka vs. kafka-python.
• Balanced consumer.
• Topic-to-partition assignment – hash partitioning.
• Engineering an architecture to handle a complex real-world data source.
• Deep dive: tweaking source code for the use case.
• DevOps
• General learning curves.
Other Challenges
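Hash partitioning, as named in the list above, maps each message key to a partition deterministically, so all messages with the same key land on the same partition. The sketch below shows the idea only. It uses CRC32 as a stand-in hash; Kafka clients ship their own partitioners, and `partition_for` is a hypothetical name, not a PyKafka or kafka-python call.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a string key to a partition index in [0, num_partitions).

    CRC32 is a stand-in chosen because it is stable across runs,
    unlike Python's built-in hash() for strings.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Keying by category, for example, would keep every deal of one category on a single partition, at the cost of uneven load when some categories are much larger than others.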
CREATE TABLE trending_categories_with_price (
    category text,
    created_at timestamp,
    updated_at timestamp,
    expires_at timestamp,
    description text,
    fine_print text,
    price float,
    discount_percentage float,
    id bigint,
    merchant_address text,
    merchant_country text,
    merchant_id bigint,
    merchant_latitude text,
    merchant_longitude text,
    merchant_locality text,
    merchant_name text,
    merchant_phone_number text,
    merchant_region text,
    number_sold float,
    online boolean,
    provider_name text,
    title text,
    url text,
    PRIMARY KEY ((price), category, discount_percentage)
) WITH CLUSTERING ORDER BY (category ASC, discount_percentage DESC);
Sample tables
• Elasticsearch, Cassandra, or Elasticsearch on Cassandra?
• Elasticsearch –
  • Good at preserving index data.
  • Great for more reads than writes.
  • Analytics.
  • Search.
• Cassandra –
  • Good for fast writes.
  • Preserves the data schema.
  • Uptime-critical workloads.
  • Time series.
Elasticsearch vs Cassandra
Benchmarking Pipeline
API
INGESTION
BATCH LAYER
SERVING LAYER
Hybrid
Streaming
API
Interaction
and deals
collection

ExStreamly Cheap - Insight Data Engineering 2016a Project


Editor's Notes

  1. The engineering challenge of using external data sources whose technical constraints you have no control over.
  2. Choice of tools, and the reasons behind each choice.
  3. The velocity of change in such APIs can cause terrible behavior in your app. Getting a snapshot to fetch unique data is hard: the crawl takes a long time, and the API changes a lot in the meantime.
  4. Crawling the API synchronously? Duplicates and dead. Deals are constantly pushed down to other pages. I engineered a bespoke solution for that. My project depends largely on the total in order to fetch the complete set of deals.
  5. 1. An asynchronous distributed engine queries the API and tries to compute which pages to fetch. 2. It sends them to multiple consumers in a leaky-bucket fashion, and then writes output synchronously, using bounded semaphores to try to maintain consistency. 3. The order of fetches wasn't important; aggregation and sorting are done in Spark. 4. The main point is UNIQUENESS, as much as possible.
  6. One producer per category. It communicates with the Sqoot API, intelligently computes the page numbers to fetch (also considering time deltas), and produces URLs with page chunks to a Kafka topic queue. Consumer-producers quickly fetch the data and produce the actual data to another topic for further processing.
  7. Communicate with the API server. Determine which categories to fetch.
  8. Computes page chunks for the available consumers to fetch in a leaky-bucket fashion.
  9. Consumes the defined URLs and page-chunk lists from the FSM. Non-blocking: spins up multiple threads, as many as the length of the page-chunk list. The producer-defined URLs are consumed for aggregation of data. Syncing consumer output? Bounded semaphores.
  10. Hash partitions. Started building my own, but found a more robust tool that handled that. kafka-python vs. PyKafka.
  11. Elasticsearch: loaded 15GB of data, read and processed it, and profiled each stage.
  12. Hash partitions. Started building my own, but found a more robust tool that handled that. kafka-python vs. PyKafka.
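Notes 5 and 9 describe one thread per page chunk, with consumer output synchronized through bounded semaphores. A minimal sketch of that pattern with Python's standard `threading` module follows. The `fetch_page` callable stands in for the real HTTP call, the in-memory list stands in for the output topic or file, and all names are hypothetical.

```python
import threading


def fetch_chunk(pages, fetch_page, sink, write_slots):
    """Fetch every page in one chunk and append the results to sink."""
    for page in pages:
        deals = fetch_page(page)  # the slow, network-bound part runs unguarded
        with write_slots:         # only max_writers threads write at once
            sink.extend(deals)


def fetch_all(chunks, fetch_page, max_writers=1):
    """Run one thread per chunk and return everything that was fetched.

    Result order depends on thread scheduling, which matches the notes:
    fetch order was not important, uniqueness was.
    """
    sink = []
    write_slots = threading.BoundedSemaphore(max_writers)
    threads = [
        threading.Thread(target=fetch_chunk, args=(chunk, fetch_page, sink, write_slots))
        for chunk in chunks
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sink
```

A bounded semaphore raises an error if it is released more times than it was acquired, which makes accounting mistakes in the writer logic fail loudly instead of silently widening the limit.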