4. From an engineering point of view
• Multiple independent apps/modules (data stores)
• Data is not consistent across apps
• Heterogeneous storage types
• Different historical data storage policies
• Each module is owned by a separate team
5. From a business point of view
• Same domain
• Same customer base (at least partially)
• We have all the necessary data to provide more sophisticated reporting
6. That results in requests like
• Can I build a couple of custom reports for a sales demo?
• We need more visibility across the board
• Hospital X is our major customer, let’s give them integrated insight
9. Major issues with the ETL-based approach
• Long feedback loop for customer-facing development (demos, custom reports, on-demand requests)
• Consumes a lot of engineering resources
• Code-base growth
• Unnecessary load on production data stores
• High cost
10. It seems we need a generic data source
• Should contain all of the data
• SQL-like interface
• Fast enough for on-demand analysis
• Cheap
11. Solution options – Data Warehouse
Pros
• Storage with predefined structure
• One version of truth
• Contains all the data
• SQL-based interface
Cons
• Long time to implement
• High cost
• Requires new development to support new data
13. Solution options – Data lake
Pros
• Fast to implement
• Cheap
• Contains all the data and more
• SQL-based interface
• Easy to add new data source
Cons
• Semi-structured data
• Can’t be used as a source of truth
• Data consistency is poor
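To make the "semi-structured data" and "SQL-based interface" points concrete, here is a minimal sketch. It assumes raw events arrive as JSON lines from independent modules and uses Python's built-in sqlite3 as a stand-in for whatever query engine the lake actually uses. The event fields (`source`, `patient_id`, `score`, `duration_sec`) and the sample records are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events from two independent modules.
# Records do not share a schema: some fields are simply missing.
RAW_EVENTS = [
    '{"source": "surveys", "patient_id": 1, "score": 9}',
    '{"source": "surveys", "patient_id": 2, "score": 4}',
    '{"source": "calls", "patient_id": 1, "duration_sec": 120}',
    '{"source": "surveys", "patient_id": 3}',
]


def load_events(conn, raw_lines):
    """Load JSON lines into one wide table; missing fields become NULL."""
    conn.execute(
        "CREATE TABLE events ("
        "source TEXT, patient_id INTEGER, score INTEGER, duration_sec INTEGER)"
    )
    for line in raw_lines:
        rec = json.loads(line)
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            (
                rec.get("source"),
                rec.get("patient_id"),
                rec.get("score"),
                rec.get("duration_sec"),
            ),
        )


def average_score_by_source(conn):
    """Ad hoc SQL over the loaded events; AVG skips NULL scores."""
    return conn.execute(
        "SELECT source, AVG(score), COUNT(*) "
        "FROM events GROUP BY source ORDER BY source"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_events(conn, RAW_EVENTS)
    for row in average_score_by_source(conn):
        print(row)
```

The query runs without any upfront modelling, which is why this kind of setup is fast to implement. The NULLs left by missing fields are the flip side: the averages silently skip incomplete records, which is one way poor consistency shows up.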
14. Product
Domain: Healthcare
Product: Provide deep insight into patient experience
Project goal:
1. Gather patient feedback
2. Provide deep insight into feedback by means of reporting
20. Stage 1 - Results
Pros
• Took about a month to implement by a team of 2 people
• Data is there
• Cheap
• Fast enough
• Analyst team can start playing with it
Cons
• Data quality is lacking
• Data is only current as of yesterday
• Overall pipeline performance
21. Stage 2 – Make it pleasant to swim in
Operational storages → One version of the truth → Reporting-oriented
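The three layers named on this slide can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline from the talk: it uses Python's built-in sqlite3, and the table, view, and column names as well as the sample rows are all hypothetical.

```python
import sqlite3


def build_layers(conn):
    """Create a raw layer, a curated layer, and a reporting layer."""
    # Layer 1: raw copy of operational data, including superseded rows.
    conn.execute(
        "CREATE TABLE raw_feedback ("
        "feedback_id INTEGER, hospital TEXT, score INTEGER, loaded_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_feedback VALUES (?, ?, ?, ?)",
        [
            (1, "Hospital X", 8, "2020-01-01"),
            (1, "Hospital X", 9, "2020-01-02"),  # later correction of record 1
            (2, "Hospital X", 5, "2020-01-01"),
            (3, "Hospital Y", 7, "2020-01-01"),
        ],
    )
    # Layer 2: one version of the truth, keeping the latest row per record.
    conn.execute(
        """
        CREATE VIEW curated_feedback AS
        SELECT r.feedback_id, r.hospital, r.score
        FROM raw_feedback AS r
        WHERE r.loaded_at = (
            SELECT MAX(newer.loaded_at)
            FROM raw_feedback AS newer
            WHERE newer.feedback_id = r.feedback_id
        )
        """
    )
    # Layer 3: reporting-oriented aggregate built only on the curated layer.
    conn.execute(
        """
        CREATE VIEW report_avg_score AS
        SELECT hospital, AVG(score) AS avg_score, COUNT(*) AS responses
        FROM curated_feedback
        GROUP BY hospital
        """
    )


def hospital_report(conn):
    """Read the reporting layer, ordered by hospital name."""
    return conn.execute(
        "SELECT hospital, avg_score, responses "
        "FROM report_avg_score ORDER BY hospital"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_layers(conn)
    for row in hospital_report(conn):
        print(row)
```

Because the curated and reporting layers are plain SQL views here, new data sets can be defined without touching the loading code, which matches the idea of analysts producing them on their own.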
22. Stage 2 - Results
Cons
• Data is only current as of yesterday
• Overall pipeline performance
Pros
• Implemented by the analyst team with support of 1 engineer
• Data quality is there
• Doesn’t require engineering involvement to produce new data sets