Big Data LDN 2017: Unleash Data Science Upon Your Organisation
1. Unleash Data Science Across Your Organisation
Simon Ricketts
Customer Engagement Director
SYNTASA
Dimitris Pertsinis
Head of Data Science
Telegraph Media Group
8. The Key Non-Points
- An In-Depth Comparison of Infrastructures (Hadoop vs GCP vs AWS)
- Yet Another Overview of the DS Hierarchy of Needs or Maturity Model
- Prescriptive Success
- Complaining (Maybe Some)
10. Pre Lake
- Interest in Data Science, Team Assembled
- Any Significant Project Required Weeks of Data Herding
- Hard to Offer Value with Disparate Data Sets and Lack of Clarity on Schemas
- Investment is Required to Supercharge Returns
- Bring That Data In and Design It to Be Less Disparate
- Day 0 – Data Lake Delivered
12. Post Lake
In at the Deep End
- Business Has Invested and Now Wants a Return
- New Data Sources Into the Lake at a Rate of 1-2 per Week
- Design Allowed for Deterministic Joining of Most Sources
- Team of Data Engineers Assembled
- New Kinds of Silos
- Lack of Documentation
- Technology Starts Building Products on Top
- Data Engineers Become Resource Gold
- Prototyping Is Faster, but Lack of Familiarity With Datasets Adds Time
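
The "deterministic joining" point can be illustrated with a minimal sketch: when sources share an exact key, records are matched by equality on that key rather than by fuzzy or probabilistic matching. The field names (`user_id`, `page`, `plan`) and the sample records are hypothetical, chosen only for illustration; the slides do not say which keys the lake design used.

```python
from collections import defaultdict


def deterministic_join(left, right, key):
    """Inner-join two lists of dict records on an exact shared key."""
    # Index the right-hand records by key so each lookup is a dict access.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for l_row in left:
        # Only exact key matches are joined; unmatched rows are dropped.
        for r_row in index.get(l_row[key], []):
            joined.append({**r_row, **l_row})
    return joined


# Hypothetical sample data standing in for two sources in the lake.
web_events = [
    {"user_id": "u1", "page": "/news"},
    {"user_id": "u2", "page": "/sport"},
    {"user_id": "u3", "page": "/travel"},
]
subscriptions = [
    {"user_id": "u1", "plan": "premium"},
    {"user_id": "u2", "plan": "free"},
]

joined = deterministic_join(web_events, subscriptions, "user_id")
```

Here `u3` has no subscription record, so it does not appear in `joined`. A shared key of this kind is what makes the join repeatable across runs.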
14. Deploying Syntasa
From a Naïve Observer
- Pre-Lake Decision to Go With GCP and BigQuery (BQ). Reasons:
  - Lack of Maintenance Overhead
  - Resource Allocation: Speed (~1 Minute to Set Up a Cluster and Deploy Code)
  - Cost
- Previous Employer: On-Premise Hadoop (on Lockdown)
  - Massive Difference in Speed of Deployment and Processing
  - Access to Outside World: Edge Node on Lockdown
16. Unleash Your Team
- Automate Data Consolidation, Data Validation, Data Transformation
- Consolidate Front End, Back End System and Service Data
- Minimise Your Data Exploration and Data Cleaning Time From Days to Hours
- Free Your Team’s Time to Allow Better Prototyping
- Optimise the Time Your Data Engineers Need to be Involved
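
As a rough illustration of automating consolidation, validation and transformation, the sketch below chains the three steps over plain Python records. It is not SYNTASA's implementation. The function names, the required fields (`user_id`, `timestamp`) and the sample records are all assumptions made for the example, and `user_id` is assumed to be a string.

```python
# Hypothetical required fields; a real pipeline would take these from a schema.
REQUIRED_FIELDS = ("user_id", "timestamp")


def consolidate(*sources):
    """Merge (name, records) pairs into one list, tagging each row's origin."""
    rows = []
    for name, records in sources:
        for record in records:
            rows.append({**record, "source": name})
    return rows


def validate(rows):
    """Split rows into valid and rejected based on required fields."""
    valid, rejected = [], []
    for row in rows:
        if all(row.get(field) not in (None, "") for field in REQUIRED_FIELDS):
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected


def transform(rows):
    """Normalise the join key so later joins match exactly."""
    return [{**row, "user_id": row["user_id"].strip().lower()} for row in rows]


def run_pipeline(*sources):
    """Consolidate, validate, then transform; rejected rows are kept for review."""
    valid, rejected = validate(consolidate(*sources))
    return transform(valid), rejected


# Hypothetical front-end and back-end records.
front_end = [{"user_id": " U1 ", "timestamp": "2017-11-15T10:00:00"}]
back_end = [{"user_id": "u2", "timestamp": ""}]

clean, rejected = run_pipeline(("front_end", front_end), ("back_end", back_end))
```

Returning the rejected rows, rather than silently dropping them, leaves a record of which sources are failing validation and why.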