Successfully reported this slideshow.
You’ve unlocked unlimited downloads on SlideShare!
Big Data Analytics: From
ETL to Data Engineering
By Dmitry Tolpeko – November 2016
Is Data Engineering Just a New Term?
Is there any difference between ETL and Data Engineering? Or do they mean
the same thing?
I think they are very different, and I will try to explain.
What is ETL?
ETL means Extract-Transform-Load, and it is a very important part of building
and maintaining a Data Warehouse
Extract: Export data from one or many sources.
Transform (at various stages): Clean, transform, aggregate etc. data
Load (at various stages): Load data into the data warehouse, data marts i.e.
make data available for BI tools.
Transform: Mostly SQL and Procedural SQL
SQL Analytic (Window) functions such as LEAD/LAG for complex calculations
Data in Data Warehouse: Often redundant, duplicated and pre-aggregated
for performance reasons. But in most cases data is in first normal form i.e.
columns contain atomic values.
Also data is in Star schema: a set of dimension and fact tables.
Data Consumers: BI tools that mostly visualize data in interactive (filter,
drill-down etc.) reports and dashboards.
Big Data Analytics
Data Scientists are the first consumers of data now.
Feature engineering is the major step of building statistical and machine
learning models for advanced analytics:
Feature vectors containing hundreds of elements with each element
containing a list, map, nested structs, buckets, list of maps, map of maps
and so on.
Complex Calculations to create feature vectors within a data window (reduce
phase) require a non-SQL approach.
Skills: ETL Developer vs Data Engineer
SQL and procedural SQL
Analytic SQL functions
Data warehouse design and modelling
Python (Scala, Java), Spark, Pig, SQL
Distributed data processing concepts (MapReduce, Spark)
Statistics and machine learning concepts