
Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Published in: Technology
  • Questions: What is the largest catalog size you have handled? (Flipkart: 18 million products.) How often do you add metadata? How often do you change aspects like pricing and stock?
  • Question: Has anyone here, as an engineer, used some sort of CMS, home-grown or commercially available? Regardless, the mental exercise of building a CMS helps you choose the best fit.
  • Question: Which is the most important data element you see in the blocks above? The “?”.
  • Focus on “??”
  • “Documents” are less rigid: columns can grow. Unlike a relational database, where each record has the same set of fields and unused fields might be kept empty, a document (record) has no empty fields. New information can be added without explicitly stating which other pieces of information are left out.
  • Schema-driven vs. no-schema (KV, columnar, graphs ...); search and indexing; static vs. dynamic queries.
  • Views can be pre-computed or entirely dynamic (optimize per use case). “Views” have to be dynamically supported.
  • Identify usage pattern(s) for every piece of data in the Catalog Data Cluster. One solution doesn't fit all: use the right tools for the job. Use the right abstractions; logical definitions should change far less often than physical components. Application solutions and hardware should be pluggable/replaceable. Keep data de-coupled from business logic. Based on usage patterns, consider offline pre-processing and distributed computations.

    1. Cataloging: The Art & Science of it...
       Utkarsh, Principal Architect @ Flipkart.com
    2. Art vs Science
       Art: Imaginative, Free Form, Creative
       Science: Measurable, Formulative, Methodical, Set Patterns
    3. What is Cataloging?
       • Catalog: A list or itemized display usually including descriptive information or illustrations.
       • Cataloging:
         a. To list or include in a catalog
         b. To classify according to a categorical system
       We define it as: Cataloging is the process of managing the inventory of products through the entire lifecycle of creating, updating, de-provisioning/re-provisioning and deletion.
    4. Why is the problem interesting?
       • Ever growing - “size”
       • Dynamic nature of the Metadata - “elasticity”
       • Association(s) between data elements - “flexibility”
       • Flux of changes - “variability”
       • De-coupled systems & Data Ownership - “data duplication”
    5. How do we solve it?
       • Be Comprehensive & Imaginative
       • Be Methodical & Flexible
       • Work with Patterns & Create new Patterns
       • Be a Composer, be an artist (blend where required)
    6. What do we solve?
       • Identify Data Elements
       • Identify Relationships b/w Data Elements
       • Identify Data Usage patterns (Query patterns)
       • Create an ideal representation: Logical Model
       • Characterize the Data Store(s)
       • Architect the Catalog Data Cluster
       • Define Views/Interface(s)
    7. Identify Data Elements
       [Diagram; box boundaries lost in extraction. Labels: Product Stock Sellers Biblio Product Category Product SLAs Variants Supplier Product Taxation Images Pricing Contributors ?]
       Be Comprehensive; Be Imaginative !!
    8. Identify Relationships
       [Diagram. Nodes: ?, Compilation 1, Compilation 2, Physical Product, Book, Year, Author, Genre. Edge labels: has A, is A, belongs to]
       Be Comprehensive; Be Imaginative !!
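One way to make the relationships on slide 8 concrete is to store them as labeled edges. The sketch below is illustrative only: the diagram is flattened in this transcript, so the direction of each edge (for example, that a Compilation "has a" Book) is an assumed reading, and the helper names `build_index` and `related` are hypothetical.

```python
from collections import defaultdict

# Edges as (source, relationship, target). The edge directions are an assumed
# reading of the flattened slide diagram, not something the slide confirms.
EDGES = [
    ("Book", "is a", "Physical Product"),
    ("Compilation 1", "has a", "Book"),
    ("Compilation 2", "has a", "Book"),
    ("Book", "belongs to", "Year"),
    ("Book", "belongs to", "Author"),
    ("Book", "belongs to", "Genre"),
]


def build_index(edges):
    """Group targets by (source, relationship) for quick lookup."""
    index = defaultdict(list)
    for source, relation, target in edges:
        index[(source, relation)].append(target)
    return index


def related(index, source, relation):
    """Return every target linked from source by relation (empty if none)."""
    return index.get((source, relation), [])


index = build_index(EDGES)
print(related(index, "Book", "belongs to"))
```

Adding a new relationship type is then a data change (one more tuple), which is the kind of flexibility the slide asks for.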
    9. Identify Data Query Patterns
       • Is the querying real-time or offline (customer perspective)
       • Is the query “Id” based or use of filters (adhoc or pre-defined)
       • Is the query linking multiple data elements
       • Understand: Query SLAs at ever increasing scale
       • Question: why is the client writing such a query
       Eg:
       • Book with a specific title: Secret of the Nagas
       • Books by Chetan Bhagat published in 2012
       • Books which are Thrillers, published post 2005, written in Hindi and published by Rupa Publications
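The difference between "Id" based and filter based queries on slide 9 can be sketched over an in-memory list. This is a minimal illustration, not a real catalog store: the records are placeholders (only the names that appear on the slide are reused, and all other field values are invented for the example), and `book`, `by_id` and `find` are hypothetical helpers.

```python
def book(book_id, title, author, year, genre, language, publisher):
    """Build one illustrative catalog record."""
    return {"id": book_id, "title": title, "author": author, "year": year,
            "genre": genre, "language": language, "publisher": publisher}


# Placeholder data: field values are illustrative, not real bibliographic facts.
BOOKS = [
    book("B1", "Secret of the Nagas", "Author A", 2011, "Fiction", "English", "Publisher X"),
    book("B2", "Placeholder Title 2", "Chetan Bhagat", 2012, "Fiction", "English", "Rupa Publications"),
    book("B3", "Placeholder Title 3", "Chetan Bhagat", 2009, "Fiction", "English", "Rupa Publications"),
    book("B4", "Placeholder Title 4", "Author B", 2008, "Thriller", "Hindi", "Rupa Publications"),
    book("B5", "Placeholder Title 5", "Author B", 2003, "Thriller", "Hindi", "Rupa Publications"),
    book("B6", "Placeholder Title 6", "Author C", 2010, "Thriller", "English", "Rupa Publications"),
]

# "Id" based access: a direct key lookup.
BY_ID = {b["id"]: b for b in BOOKS}


def by_id(book_id):
    return BY_ID.get(book_id)


def find(predicate):
    """Filter based access: scan every record and keep the matches."""
    return [b["id"] for b in BOOKS if predicate(b)]


by_title = find(lambda b: b["title"] == "Secret of the Nagas")
by_author_year = find(lambda b: b["author"] == "Chetan Bhagat" and b["year"] == 2012)
multi_filter = find(lambda b: b["genre"] == "Thriller" and b["year"] > 2005
                    and b["language"] == "Hindi"
                    and b["publisher"] == "Rupa Publications")
print(by_title, by_author_year, multi_filter)
```

The lookup by id touches one record, while each filter scans the whole list; at catalog scale that gap is what pushes filter queries toward indexes or search.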
    10-14. Identification is Non Trivial (built up one line per slide)
       Example “Book”
       Identification -->
       “Title”
       “Title” + “Publisher”
       “Title” + “Publisher” + “Edition”
       “Title” + “Publisher” + “Edition” + “Variant”
       “Title” + “Publisher” + “Edition” + “Variant” + ??
       Be Imaginative - an Artist’s brush stroke !!
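The progression on slides 10 to 14 can be checked mechanically: count how many distinct products each candidate key can tell apart. The following is a sketch under assumptions: the four sample items are invented placeholders, and `BookIdentity` and `count_distinct` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookIdentity:
    title: str
    publisher: str
    edition: str
    variant: str


# Four different sellable items that all share one title (placeholder values).
ITEMS = [
    BookIdentity("Example Title", "Publisher A", "1st", "Paperback"),
    BookIdentity("Example Title", "Publisher B", "1st", "Paperback"),
    BookIdentity("Example Title", "Publisher B", "2nd", "Paperback"),
    BookIdentity("Example Title", "Publisher B", "2nd", "Hardcover"),
]


def count_distinct(items, fields):
    """Number of items distinguishable when only `fields` form the key."""
    return len({tuple(getattr(item, f) for f in fields) for item in items})


for fields in (
    ("title",),
    ("title", "publisher"),
    ("title", "publisher", "edition"),
    ("title", "publisher", "edition", "variant"),
):
    print(" + ".join(fields), "->", count_distinct(ITEMS, fields))
```

Each added field separates one more item in this sample; the trailing "??" on the slide is a reminder that real data may still need a further attribute.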
    15. Logical Model: Schema
       • Entities as Tables
       • Relationships as Constraints
       • Queries supported through indexes and joins
       + Rich Query Support
       + Built-in support for Relationships
       + Indexes
       - Elasticity
         * Frequent addition/deletion of columns
         * Growing secondary indexes
       - Not optimized for some use-cases
         * Key-Values
         * Data Blobs/Graphs
       Relational Databases:
         * MySQL, Oracle, Postgres et al
    16. Logical Model: Semi-Schema
       • Blobs (Documents) of Data
       • Linkages between Documents
       • Queries supported through document identifiers and document references
       + Flexibility: “Documents” are less rigid
       + Query Language to retrieve based on content of “Document”
       - Complex Relationships are non-trivial
       - “Linked” Document Queries may not be optimized
       Document Stores:
         * MongoDB, Couchbase et al
    17. Logical Model: No Schema
       • Data Blobs
       • Rules/Relationship definitions
       • Queries supported through data “views”, indexes, search based on reverse indexing etc ...
       + Elasticity
         * Variability of data format
         * Secondary Indices
       + Tunable performance
       - Relational data is a force-fit (sub-optimal)
       +/- Querying models are specific to Stores
       Other NoSQL Stores:
         * HBase, Riak, Cassandra, et al
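The elasticity trade-off across the three Logical Model slides can be seen in a few lines. This is only a sketch: it uses Python's built-in `sqlite3` as a stand-in for a relational database and plain dicts as a stand-in for a document store, and the table name, column names and values are all invented for illustration.

```python
import sqlite3

# Relational side: every row shares the table's columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO product VALUES (?, ?)", ("P1", "Example Phone"))

# A new attribute needs a schema change, and existing rows get an empty field.
conn.execute("ALTER TABLE product ADD COLUMN isbn TEXT")
conn.execute("INSERT INTO product VALUES (?, ?, ?)",
             ("P2", "Example Book", "ISBN-PLACEHOLDER"))
rows = conn.execute("SELECT id, title, isbn FROM product ORDER BY id").fetchall()
conn.close()

# Document side: each document carries only the fields it actually has.
documents = [
    {"id": "P1", "title": "Example Phone"},
    {"id": "P2", "title": "Example Book", "isbn": "ISBN-PLACEHOLDER"},
]

print(rows)
print([sorted(doc) for doc in documents])
```

The relational rows come back with an explicit empty `isbn` for the phone, while the phone's document simply has no such key. The flip side, as the slides note, is that the relational form keeps joins and indexes that the document form has to work harder for.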
    18. Catalog Data Cluster
       [Diagram; box boundaries lost in extraction. Labels: Catalog Data, Biblio Data, Product Data, UGC on Products, Compliance Data, Pricing/Accounting, ?]
       - “View”/“Data” Partitions
       - Blend multiple data stores
       - Interfaces provide view to the underlying data
       - Scale uniformly for data elements
    19. Data Store Characterization
       • Data characteristics:
         - Reliability (availability and redundancy)
         - Consistency
       • Elasticity
         - increase in scale
         - evolving catalog definitions
       • Querying capability
         - Support for indexes
         - Filters; secondary indexes
         - linkages/relationships
       • SLAs
         - Volumes
         - Throughput
         - Latencies
       Be Comprehensive; be Methodical but be unbounded by choices - a Scientist who has a palette of colors in hand !!
    20. Data Store Characterization
       • CAP (C, A, P): which 2 do we pick? Can the data store help configure any 2?
       • Operational ease (monitoring, reporting, config mgmt ..)
       • Pluggability with Distributed Computing platforms
    21. Define Views & Interfaces
       • Cataloging has multiple use-cases which are business centric
       • Use-cases evolve; and so do the “views” to the data
       • “Views” as multiple interpretations of the data
       • De-coupled from the underlying data
       • Underlying data form has to be elastic
       • Overlaid views have to be adaptive
       [Diagram: View Layer (Precomputed View(s), Dynamic View(s)) over a Data Access Interface over Data 1, Data 2, Data 3, Data 4]
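The layering on slide 21 (views over a data access interface over the data) might look like the sketch below. It is a toy model under assumptions: `DataAccess`, `ViewLayer`, `register`, `refresh` and the sample records are all hypothetical names and values, chosen only to show the difference between a precomputed and a dynamic view.

```python
from collections import Counter


class DataAccess:
    """Stand-in for the Data Access Interface: hides how records are stored."""

    def __init__(self, records):
        self._records = list(records)

    def all(self):
        return list(self._records)

    def add(self, record):
        self._records.append(record)


class ViewLayer:
    """Views are named functions over the data, cached or computed on demand."""

    def __init__(self, data):
        self._data = data
        self._builders = {}
        self._precomputed = {}

    def register(self, name, builder, precompute=False):
        self._builders[name] = builder
        if precompute:
            self.refresh(name)

    def refresh(self, name):
        """Rebuild a precomputed view from the current data."""
        self._precomputed[name] = self._builders[name](self._data.all())

    def get(self, name):
        if name in self._precomputed:
            return self._precomputed[name]              # precomputed: cached result
        return self._builders[name](self._data.all())   # dynamic: built per request


def count_by_category(records):
    return dict(Counter(r["category"] for r in records))


data = DataAccess([{"title": "A", "category": "Books"},
                   {"title": "B", "category": "Books"}])
views = ViewLayer(data)
views.register("titles", lambda rs: [r["title"] for r in rs])
views.register("count_by_category", count_by_category, precompute=True)

data.add({"title": "C", "category": "Mobiles"})
dynamic_result = views.get("titles")
stale_result = views.get("count_by_category")
views.refresh("count_by_category")
fresh_result = views.get("count_by_category")
print(dynamic_result, stale_result, fresh_result)
```

The dynamic view sees the new record immediately, while the precomputed view stays stale until refreshed; which one to use depends on the query SLA and how often the underlying data changes.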
    22. Architect for Scale & Performance
       • Identify Usage Patterns
       • Right Tools for Job
       • Right Abstractions
       • Pluggable Solution Stacks
       • Decoupled Data
       • Offline Processing
    23. Measure, Monitor & Evolve
       • SLAs change; system has to be adaptive
       • Start off with established goals; benchmark and meet the initial set goals
       • Changes are gradual; plan at the first symptom
       • Listen for system(s) not coping
       • Always work towards incremental changes; an entire overhaul of the systems will be counterproductive
       Be Curious, have doubts, deeply introspect - be the ultimate Scientist !!
    24. Change is constant ... adapt
       • Requirements evolve
       • Business introduces flux
       • Data interpretations grow
       • Be flexible, adaptive, imaginative...
       ... work as a Scientist who appreciates Art !!
    25. Thank you !
       My Co-ordinates: utkarsh@flipkart.com
