The Evolving Landscape of Data Engineering
Andrei Savu - Event Co-organizer
Staff Engineer @ Twitter
Follow me @andreisavu
Andrei Savu
Staff Engineer @ Twitter:
* MoPub Backend & Data Pipelines
* Mobile App Monetization
Co-organizer of the Data Engineering Club.
Previously Tech Lead at Cloudera via the Axemblr acquisition. Started the Cloud engineering team.
Topics
The Past:
● OSS communities
● AWS history
● Google Cloud history
The Present: Patterns
The Future: Wish List
The Past - OSS
Weeks of Provisioning
Static Infrastructure
Commodity Hardware
Commodity Networking
Data Locality Important
Running in the Public Cloud was unusual
CAPEX
The Past - AWS
Visionary Business
Fast iterations
Data Management as a key platform use case
Incredible Scale
Transition to “serverless”
OPEX & Elastic
The Past - Google Cloud
Visionary Products
Fast iterations
Machine Learning as a key use case
State of the Art data platform
Last 3 years on fast forward
Intelligent Billing
OPEX & Elastic
The Present: Patterns
Weeks to Minutes to Seconds
Hadoop/Spark ecosystem is mature.
We have a broad set of options.
Big Data is much Bigger (e.g. x1e.32xlarge: 3TB mem, 128 vCPUs, 14Gbps network)
Scale continues to be hard.
Cloud economics can be very disruptive (especially for data workloads)
High-performance networks are common.
Storage can be decoupled from compute.
Cluster locality is important.
Service Endpoints (not clusters, aka serverless, aka managed etc.).
Sophisticated Auto-scaling (batch & streaming, spot vs. on-demand, multi-AZ).
Multi-DC and Multi-Region from Day 1.
Various flavors of containers.
The Future: Wish List
A Data Catalog product as the center of the universe.
Data Monitoring Systems:
* statistical properties, anomaly detection, schema changes, consumption patterns etc.
More intelligence at the data infrastructure level:
* data format migrations, intelligent caching based on access patterns.
Declarative data transformation vs. explicit ETL.
Intelligent data sampling products. Scalability has a cost.
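To make the data monitoring item above more concrete, here is a minimal sketch of two of the checks it lists: schema change detection between batches and a simple anomaly check on a batch-level metric. Everything in it is an assumption for illustration and not something shown in the talk: the function names (`infer_schema`, `schema_changes`, `is_anomalous`), the record layout, and the 3-sigma threshold are all hypothetical, and a real monitoring product would likely use more robust statistics.

```python
from statistics import mean, stdev


def infer_schema(rows):
    """Map each field name to the set of Python type names seen in a batch."""
    schema = {}
    for row in rows:
        for field, value in row.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def schema_changes(old, new):
    """Report fields added, removed, or changed in type between two batches."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(f for f in shared if old[f] != new[f]),
    }


def is_anomalous(history, value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean.

    `history` holds past observations of the same metric (e.g. daily row
    counts). The 3-sigma default is an illustrative assumption.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold


# Hypothetical batches: a field changes type and a new field appears.
yesterday = [{"user_id": 1, "spend": 2.5}]
today = [{"user_id": "1", "spend": 2.5, "country": "RO"}]
changes = schema_changes(infer_schema(yesterday), infer_schema(today))

# Hypothetical daily row counts followed by a sudden drop.
row_counts = [1000, 1020, 980, 1010, 990]
drop_detected = is_anomalous(row_counts, 400)
```

In this sketch `changes` reports `country` as added and `user_id` as retyped, and the drop to 400 rows is flagged. The same pattern could be extended to other statistical properties (null rates, distinct counts) by feeding them through `is_anomalous` as separate metrics.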
Thanks!
Join the community on Meetup.com!
www.meetup.com/Data-Engineering-Club
www.dataeng.club
Do you want to present? Get in touch.
Feedback #dataengclub
