Data Engineering is a relatively new but fast-evolving discipline that spans multiple environments and technologies, from traditional data centers to hyper-scale cloud providers. It combines closed-source, homegrown, and open-source software to create scalable data pipelines and power incredible new product features.
In this presentation, we will go over the last 5-10 years of technology trends and advancements and bring them together in a story about modern-day Data Engineering and the magic behind it.
The Evolving Landscape of Data Engineering
1. The Evolving Landscape of Data Engineering
Andrei Savu - Event Co-organizer
Staff Engineer @ Twitter
Follow me @andreisavu
2. Andrei Savu
Staff Engineer @ Twitter:
* MoPub Backend & Data Pipelines
* Mobile App Monetization
Co-organizer of the Data Engineering Club.
Previously Tech Lead at Cloudera via the Axemblr acquisition. Started the Cloud engineering team.
3. Topics
The Past:
● OSS communities
● AWS history
● Google Cloud history
The Present: Patterns
The Future: Wish List
4. The Past - OSS
Weeks of Provisioning
Static Infrastructure
Commodity Hardware
Commodity Networking
Data Locality Important
Running in the Public Cloud was unusual
CAPEX
6. The Past - Google Cloud
Visionary Products
Fast iterations
Machine Learning as a key use case
State of the Art data platform
Last 3 years on fast forward
Intelligent Billing
OPEX & Elastic
7. The Present: Patterns
Weeks to Minutes to Seconds
Hadoop/Spark ecosystem is mature.
We have a broad set of options.
Big Data is much Bigger (e.g. x1e.32xlarge: 3TB mem, 128 vCPUs, 14Gbps network)
Scale continues to be hard.
Cloud economics can be very disruptive (especially for data workloads)
High-performance networks are common.
Storage can be decoupled from compute.
Cluster locality is important.
Service Endpoints (not clusters, aka serverless, aka managed, etc.).
Sophisticated Auto-scaling (batch & streaming, spot vs. on-demand, multi-AZ).
Multi-DC and Multi-Region from Day 1.
Various flavors of containers.
8. The Future: Wish List
A Data Catalog product as the center of the universe.
Data Monitoring Systems:
* statistical properties, anomaly detection, schema changes, consumption patterns etc.
More intelligence at the data infrastructure level:
* data format migrations, intelligent caching based on access patterns.
Declarative data transformation vs. explicit ETL.
Intelligent data sampling products. Scalability has a cost.
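To make the "Data Monitoring Systems" wish concrete, here is a minimal sketch of two of the checks listed above: detecting schema changes between two snapshots of a dataset, and flagging an anomalous daily row count. It is not taken from the talk. The function names, the `{column: type}` schema representation, and the z-score threshold of 3 are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def schema_changes(old_schema, new_schema):
    """Compare two {column: type} mappings (a hypothetical schema format)
    and report added, removed, and retyped columns."""
    old_cols, new_cols = set(old_schema), set(new_schema)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        "retyped": sorted(
            col for col in old_cols & new_cols
            if old_schema[col] != new_schema[col]
        ),
    }


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` when it lies more than `threshold` sample standard
    deviations from the mean of `history` (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]  # any change from a constant series
    return abs(latest - mean(history)) / sigma > threshold


if __name__ == "__main__":
    yesterday = {"id": "int", "ts": "string"}
    today = {"id": "int", "ts": "timestamp", "country": "string"}
    print(schema_changes(yesterday, today))

    daily_row_counts = [1000, 1020, 980, 1010, 990]
    print(is_anomalous(daily_row_counts, 400))
```

A real monitoring product would track many more statistical properties per column and would likely use more robust detectors than a z-score. The point of the sketch is only that such checks can run automatically against every dataset rather than being written by hand per pipeline.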
9. Thanks!
Join the community on Meetup.com!
www.meetup.com/Data-Engineering-Club
www.dataeng.club
Do you want to present? Get in touch.
Feedback #dataengclub