Cataloging: The Art and Science of it


Tech Talk at the 1st Flipkart Tech Conference: Slash N

Creating a Catalog Management System is non-trivial at scale, and scale operates at every level: the volume of data, the size of the catalog, and the flux of changes. Finding an ideal model to represent the relationships and the elasticity of the data is itself a non-trivial science. In this talk, let's try to figure out which part of it is science, and where we cross the boundary into art.

Published in: Technology

Cataloging: The Art and Science of it

  1. Cataloging: The Art & Science of it... Utkarsh, Principal Architect @ Flipkart.com. Sunday 3 March 13
  2. Art vs Science
     • Art: Imaginative, Free Form, Creative
     • Science: Measurable, Formulative, Methodical, Set Patterns
  3. What is Cataloging?
     • Catalog: a list or itemized display, usually including descriptive information or illustrations.
     • Cataloging: (a) to list or include in a catalog; (b) to classify according to a categorical system.
     We define it as: Cataloging is the process of managing the inventory of products through the entire lifecycle of creating, updating, de-provisioning/re-provisioning and deletion.
  4. Why is the problem interesting?
     • Ever growing - “size”
     • Dynamic nature of the Metadata - “elasticity”
     • Association(s) between data elements - “flexibility”
     • Flux of changes - “variability”
     • De-coupled systems & Data Ownership - “data duplication”
  5. How do we solve it?
     • Be Comprehensive & Imaginative
     • Be Methodical & Flexible
     • Work with Patterns & Create new Patterns
     • Be a Composer, be an artist (blend where required)
  6. What do we solve?
     • Identify Data Elements
     • Identify Relationships between Data Elements
     • Identify Data Usage patterns (Query patterns)
     • Create an ideal representation: Logical Model
     • Characterize the Data Store(s)
     • Architect the Catalog Data Cluster
     • Define Views/Interface(s)
  7. Identify Data Elements
     [Diagram labels: Product Stock Sellers Biblio Product Category Product Variants SLAs Supplier Product Taxation Images Pricing Contributors ?]
     Be Comprehensive; Be Imaginative !!
  8. Identify Relationships
     [Diagram nodes: Compilation 1, Compilation 2, Book, Physical Product, Year, Author, Genre, ?. Edge labels: “has A”, “is A”, “belongs to”.]
     Be Comprehensive; Be Imaginative !!
  9. Identify Data Query Patterns
     • Is the querying real-time or offline (from the customer's perspective)?
     • Is the query “Id” based, or does it use filters (ad hoc or pre-defined)?
     • Does the query link multiple data elements?
     • Understand: query SLAs at ever increasing scale
     • Question: why is the client writing such a query?
     E.g.:
     a. A book with a specific title: Secret of the Nagas
     b. Books by Chetan Bhagat published in 2012
     c. Books which are Thrillers, published post 2005, written in Hindi and published by Rupa Publications
  10. Identification is Non-Trivial. Example: “Book” identification --> “Title”
  11. Identification is Non-Trivial. Example: “Book” identification --> “Title”; “Title” + “Publisher”
  12. Identification is Non-Trivial. Example: “Book” identification --> “Title”; “Title” + “Publisher”; “Title” + “Publisher” + “Edition”
  13. Identification is Non-Trivial. Example: “Book” identification --> “Title”; “Title” + “Publisher”; “Title” + “Publisher” + “Edition”; “Title” + “Publisher” + “Edition” + “Variant”
  14. Identification is Non-Trivial. Example: “Book” identification --> “Title”; “Title” + “Publisher”; “Title” + “Publisher” + “Edition”; “Title” + “Publisher” + “Edition” + “Variant”; “Title” + “Publisher” + “Edition” + “Variant” + ?? Be Imaginative - an Artist’s brush stroke !!
  15. Logical Model: Schema
      • Model: Entities as Tables; Relationships as Constraints; Queries supported through indexes and joins
      • Stores: Relational Databases (MySQL, Oracle, Postgres et al)
      • + Rich Query Support
      • + Built-in support for Relationships
      • + Indexes
      • - Elasticity: frequent addition/deletion of columns; growing secondary indexes
      • - Not optimized for some use-cases: Key-Values; Data Blobs/Graphs
  16. Logical Model: Semi-Schema
      • Model: Blobs (Documents) of Data; Linkages between Documents; Queries supported through document identifiers and document references
      • Stores: Document Stores (MongoDB, CouchBase et al)
      • + Flexibility: “Documents” are less rigid
      • + Query Language to retrieve based on content of “Document”
      • - Complex Relationships are non-trivial
      • - “Linked” Document Queries may not be optimized
  17. Logical Model: No Schema
      • Model: Data Blobs; Rules/Relationship definitions; Queries supported through data “views”, indexes, search based on reverse indexing etc ...
      • Stores: Other NoSQL Stores (HBase, RIAK, Cassandra, et al)
      • + Elasticity: variability of data format; Secondary Indices
      • + Tunable performance
      • - Relational data is a force-fit (sub-optimal)
      • +/- Querying models are specific to Stores
  18. Catalog Data Cluster
      [Diagram labels: Catalog Data, Biblio Data, Product Data, UGC on Products, Compliance Data, Pricing/Accounting, ?]
      - “View”/“Data” Partitions
      - Blend multiple data stores
      - Interfaces provide view to the underlying data
      - Scale uniformly for data elements
  19. Data Store Characterization
      • Data characteristics: Reliability (availability and redundancy); Consistency; Support for indexes
      • SLAs: Volumes; Throughput; Latencies
      • Elasticity: increase in scale; evolving catalog definitions
      • Querying capability: Filters; secondary indexes; linkages/relationships
      Be Comprehensive; be Methodical but be unbounded by choices - a Scientist who has a palette of colors in hand !!
  20. Data Store Characterization
      • CAP (C, A, P): which 2 do we pick? Can the data store help configure any 2?
      • Operational ease (monitoring, reporting, config mgmt ..)
      • Pluggability with Distributed Computing platforms
  21. Define Views & Interfaces
      • Cataloging has multiple use-cases which are business centric
      • Use-cases evolve; and so do the “views” to the data
      • “Views” as multiple interpretations of the data
      • De-coupled with the underlying data
      • Underlying data form has to be elastic
      • Overlaid views have to be adaptive
      [Diagram: View Layer (Precomputed View(s), Dynamic View(s)); Data Access Interface; Data 1, Data 2, Data 3, Data 4]
  22. Architect for Scale & Performance
      • Identify Usage Patterns
      • Right Tools for Job
      • Right Abstractions
      • Pluggable Solution Stacks
      • Decoupled Data
      • Offline Processing
  23. Measure, Monitor & Evolve
      • SLAs change; the system has to be adaptive
      • Start off with established goals; benchmark and meet the initial set of goals
      • Changes are gradual; plan at the first symptom
      • Listen for system(s) not coping
      • Always work towards incremental changes; an entire overhaul of the systems will be counterproductive
      Be Curious, have doubts, deeply introspect - be the ultimate Scientist !!
  24. Change is constant ... adapt
      • Requirements evolve
      • Business introduces flux
      • Data interpretations grow
      • Be flexible, adaptive, imaginative ... work as a Scientist who appreciates Art !!
  25. Thank you! My Co-ordinates: utkarsh@flipkart.com
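
As a rough illustration of slides 9 to 14, the sketch below treats a book's identity as the composite “Title” + “Publisher” + “Edition” + “Variant” and contrasts an “Id”-based lookup with filter-based queries. It is not taken from the talk: the `Book` and `Catalog` classes, their field names, and the sample rows are all made up for this example, and the in-memory dictionary stands in for whatever data store is actually chosen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical composite identity: (title, publisher, edition, variant).
BookKey = Tuple[str, str, str, str]


@dataclass(frozen=True)
class Book:
    # Identity attributes, in the order the slides add them.
    title: str
    publisher: str
    edition: str
    variant: str
    # Descriptive attributes, used here only for filtering.
    author: str
    genre: str
    language: str
    year: int

    def key(self) -> BookKey:
        # Title alone is ambiguous, so identity is the full tuple.
        return (self.title, self.publisher, self.edition, self.variant)


class Catalog:
    def __init__(self) -> None:
        self._books: Dict[BookKey, Book] = {}

    def add(self, book: Book) -> None:
        if book.key() in self._books:
            raise ValueError(f"duplicate identity: {book.key()}")
        self._books[book.key()] = book

    def get(self, key: BookKey) -> Optional[Book]:
        # "Id"-based query: a single dictionary lookup.
        return self._books.get(key)

    def find(self, *predicates: Callable[[Book], bool]) -> List[Book]:
        # Filter-based query: a full scan here; a real store would use indexes.
        return [b for b in self._books.values() if all(p(b) for p in predicates)]


# Made-up sample rows, for illustration only.
catalog = Catalog()
catalog.add(Book("Title A", "Publisher P", "1st", "Paperback",
                 "Author X", "Thriller", "Hindi", 2008))
catalog.add(Book("Title A", "Publisher P", "2nd", "Paperback",
                 "Author X", "Thriller", "Hindi", 2012))
catalog.add(Book("Title B", "Publisher Q", "1st", "Hardcover",
                 "Author Y", "Romance", "English", 2012))

# (a) By title: two editions share it, so title alone is not an identity.
same_title = catalog.find(lambda b: b.title == "Title A")
# (b) By author and publication year.
by_author_2012 = catalog.find(lambda b: b.author == "Author X",
                              lambda b: b.year == 2012)
# (c) A filter linking several attributes.
thrillers = catalog.find(
    lambda b: b.genre == "Thriller",
    lambda b: b.year > 2005,
    lambda b: b.language == "Hindi",
    lambda b: b.publisher == "Publisher P",
)
print(len(same_title), len(by_author_2012), len(thrillers))
```

The identity key is kept separate from the descriptive attributes so that the open “+ ??” on slide 14 can be handled by widening `key()` without touching the filter queries. Every `find` call scans the whole catalog, which is the cost the query-pattern questions on slide 9 are meant to surface before a store and its indexes are chosen.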
