Introduction to
Data Engineering
Vivek A. Ganesan
vivganes@gmail.com
Agenda
Copyright 2013, Vivek A. Ganesan, All rights reserved 1
o Introduction
o What is data engineering?
o Why data engineering?
o Required Skills
o Questions?
Introduction
o What’s with the name?
o All other names were taken :)
o Gods = Geeks on Data
o Well, it is now Geeking out on Data
o Why a Data Geek?
o Geeks are cool
o Data Geeks are way cool
Partial Omniscience (Super power of Prediction)
Data, Data, Data!
• Significant increase in data (Volume)
• Social Networks
• Transaction Logs
• Fast streams of data (Velocity)
• Sensor data
• Machine-to-machine data
• Different kinds of data (Variety)
• Text
• Audio
• Video
• This trend is only going to grow!
Note : EB = Exabyte = 1,000 Petabytes = 1 million Terabytes
Big Data Trends
Before Big Data
• Life was simple … well mostly
• The ETL engineers managed data pipelines
• The Data Scientists (they weren’t called that, btw, they were mostly Statisticians who programmed in SAS, SPSS or S) did the analysis
• Data Warehouses, Data marts and OLAP cubes were the platforms
• Data Analysts mostly generated reports but they were proficient in SQL, Excel, Pivot Tables etc.
• Data Architects … well, they architected :)
• They managed :
• Data models
• Star Schemas
• Data Governance
• Master Data Management (MDM)
• Data Security
• For the most part, they had to coax different groups to share data
Big Data – What Changed?
• Life … got interesting
• Huge data volumes: ETL became a problem
• Traditional Statistical tools couldn’t handle the volume
• Data Warehouses, Data marts and OLAP cubes are no longer the primary means of analysis; “in situ” analysis is preferred, i.e. the data is not moved to a separate analytics platform
• Data Analysts are still on point for reports, but they no longer have SQL interfaces (thanks to NoSQL and MapReduce)
• Data Architects … well, they still need to architect :)
• Still need :
• Data models
• Data Governance
• Data Security
• For the most part, they still have to coax different groups to share data
• They have to do all of this while the technology is rapidly evolving
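To make the loss of SQL interfaces concrete, here is a minimal sketch, in plain Python rather than on a real Hadoop cluster, of how a query an analyst would write as a one-line SQL GROUP BY count has to be restated as map, shuffle and reduce steps. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are illustrative assumptions, not part of any framework's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key; a real framework does this across machines."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(lines)))


# Hypothetical sample input standing in for a large log file.
SAMPLE_LINES = ["login ok", "login failed", "logout ok"]

if __name__ == "__main__":
    print(word_count(SAMPLE_LINES))
```

The three functions together do what a single GROUP BY clause does in SQL, which is one reason the deck lists SQL semantics as something still to be solved on these platforms.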
Life in the Big Data Universe
• The Good
• Data recognized as an asset
• Data Driven Products more common
• Working with Data is cool
• The Bad
• Complexity is overwhelming
• No sophisticated toolset yet
• Technology is fast changing
• The Ugly
• No SQL!
• Security
• Governance
• Performance
• The Opportunity
• Solve for :
• SQL semantics
• Data Governance
• Data Security
• Benchmarking, Profiling and Performance measurement tools
• Build :
• Real-time solutions
• Data Marts/Data Warehouses on top
Life in the Big Data Universe
Data Scientist
• Building Models
• Validation/Testing
• Algorithms
• Continuous Improvement
• Knowledge of :
• Statistics
• Linear Algebra
• Machine Learning
• R, MATLAB etc.
• Deep Domain Knowledge
Data Analyst
• Report Generation
• Data Exploration
• Hypotheses Testing
• Pattern Discovery
• Correlations
• Serendipitous Discovery
Data Engineer
• Data Pipelines
• Manage Platforms
• Productionalize Algorithms
• Agile Development
• Knowledge of :
• Platforms
• Algorithms
• Java, C++ etc.
• Scripting languages like Python
Data Engineering
• Strong CS Background
• Algorithms
• Database theory
• Scripting languages
• Server side languages
• Distributed Systems Background
• Clusters
• Networking
• Monitoring/Performance
• Data Science/Machine Learning
• Search/IR
• Text Analytics
• Classification
• Clustering
• Infrastructure
• Hadoop
• Cassandra
• MongoDB
• Platforms
• Solr
• Hive
• HBase
• Mahout
• Applications
• Recommendation Engines
• Fraud Prevention
• Disease Prevention
Data Engineer’s Role
• Data Dialysis – Cleaning up Data
• Hard to do at Scale
• Newer tools in this space
• Great scope for innovation
• ETL -> ELT
• Distributed Bulk loading
• Full-fledged data pipelines
• Supporting both data scientists and data analysts
• Productionalizing algorithms
• Production support
• Optimization
• A/B Testing and Continuous Improvement
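The ETL -> ELT shift above can be sketched as follows: raw records are loaded untouched first, and the cleanup (the "data dialysis" step) runs afterwards inside the data store. This is a minimal sketch that assumes an in-memory SQLite database as a stand-in for a distributed platform; the table names (`raw_events`, `clean_events`), the columns and the sample rows are all hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical raw records: (user name, amount as text). Some rows are dirty.
RAW_ROWS: List[Tuple[Optional[str], str]] = [
    (" Alice ", "10.50"),
    ("bob", "n/a"),    # unparseable amount
    (None, "3.00"),    # missing user name
    ("ALICE", "4.50"),
]


def load_raw(conn: sqlite3.Connection, rows) -> None:
    """Extract + Load: store the records exactly as received."""
    conn.execute("CREATE TABLE raw_events (user_name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", rows)


def transform_in_store(conn: sqlite3.Connection) -> None:
    """Transform: clean the data inside the database, after loading."""
    conn.execute(
        """
        CREATE TABLE clean_events AS
        SELECT LOWER(TRIM(user_name)) AS user_name,
               CAST(amount AS REAL) AS amount
        FROM raw_events
        WHERE user_name IS NOT NULL      -- drop rows with no user
          AND amount GLOB '[0-9]*'       -- keep amounts that start with a digit
        """
    )


def totals_by_user(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Report on the cleaned table, as an analyst would."""
    return conn.execute(
        "SELECT user_name, SUM(amount) FROM clean_events "
        "GROUP BY user_name ORDER BY user_name"
    ).fetchall()


def run_elt(rows) -> List[Tuple[str, float]]:
    """Run load, in-store transform and report against a fresh database."""
    conn = sqlite3.connect(":memory:")
    try:
        load_raw(conn, rows)
        transform_in_store(conn)
        return totals_by_user(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_elt(RAW_ROWS))
```

Because `raw_events` keeps the records as they arrived, the cleaning rules can be changed and re-run later without extracting the data again, which is the main appeal of loading before transforming.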
About this Meetup : Structure
• Agile teams
• Monthly Scrum
• Week 1 : Introduction to Problem
• Week 2 : Algorithm + Platform
• Week 3 : Technical help (Algorithm, Platform, Testing and Deployment)
• Week 4 : Panel + Demo
• Showcase Startups/Experts in the space
• Teams show demos
• Panel judges winners
• We might have prizes (needs to be figured out)
• Weekly Meetup (on Mondays)
• Might move to a bigger venue if there is enough demand
About this Meetup : Schedule
• May 29th : Kickoff
• Scrum 1
• June 3rd – Collaborative Filtering Introduction
• June 10th – MongoDB Introduction
• June 17th – Analytics on MongoDB
• June 24th – Panel + Demo
• Scrum 2 (TBD)
• Come along now, it will be fun!
• Oh, the name :)
Questions? Comments?
Thank You!
E-mail: vivganes@gmail.com
Twitter : onevivek
