Introduction to Data Engineering
Vivek A. Ganesan
vivganes@gmail.com
Agenda
Copyright 2013, Vivek A. Ganesan, All rights reserved 1
o Introduction
o What is data engineering?
o Why data engineering?
o Required Skills
o Questions?
Introduction
o What’s with the name?
o All other names were taken 
o Gods = Geeks on Data
o Well, it is now Geeking out on Data
o Why a Data Geek?
o Geeks are cool
o Data Geeks are way cool
Partial Omniscience (Super power of Prediction)
Data, Data, Data!
• Significant increase in data (Volume)
• Social Networks
• Transaction Logs
• Fast streams of data (Velocity)
• Sensor data
• Machine-to-machine data
• Different kinds of data (Variety)
• Text
• Audio
• Video
• This trend is only going to grow!
Note : EB = Exabyte = 1,000 Petabytes (1 million Terabytes)
Big Data Trends
Before Big Data
• Life was simple … well mostly
• The ETL engineers managed data pipelines
• The Data Scientists (they weren’t called that, btw, they were mostly Statisticians who programmed in SAS, SPSS or S) did the analysis
• Data Warehouses, Data marts and OLAP cubes were the platforms
• Data Analysts mostly generated reports but they were proficient in SQL, Excel, Pivot Tables etc.
• Data Architects … well, they architected
• They managed :
• Data models
• Star Schemas
• Data Governance
• Master Data Management (MDM)
• Data Security
• For the most part, they had to coax different groups to share data
Big Data – What Changed?
• Life … got interesting
• Huge data volumes – ETL became a problem
• Traditional Statistical tools couldn’t handle the volume
• Data Warehouses, Data marts and OLAP cubes not primary analytical means – “in situ” analysis preferred i.e. no moving data to an analytics platform
• Data Analysts still on point for reports but now they no longer had SQL interfaces (thanks to NoSQL and MapReduce)
• Data Architects … well, they still need to architect
• Still need :
• Data models
• Data Governance
• Data Security
• For the most part, they had to coax different groups to share data
• They have to do all of this when the technology is rapidly evolving
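To make the "no SQL interfaces" point concrete, here is a minimal sketch of what a simple report looks like once it has to be written in map/reduce style. It is plain single-process Python rather than any real framework's API, and the record layout, data and function names are all hypothetical. The SQL an analyst would previously have written is roughly `SELECT customer, SUM(amount) FROM transactions GROUP BY customer`.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical transaction log records: (customer, amount).
records = [("alice", 30), ("bob", 20), ("alice", 5), ("carol", 12), ("bob", 8)]


def map_phase(record):
    """Emit (key, value) pairs; here exactly one pair per input record."""
    customer, amount = record
    yield customer, amount


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def total_by_customer(rows):
    """Run map, shuffle and reduce in sequence over an in-memory list."""
    pairs = chain.from_iterable(map_phase(row) for row in rows)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


totals = total_by_customer(records)
print(totals)
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data over the network. The sketch only shows why a one-line GROUP BY turns into several functions.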
Life in the Big Data Universe
• The Good
• Data recognized as an asset
• Data Driven Products more common
• Working with Data is cool
• The Bad
• Complexity is overwhelming
• No sophisticated toolset yet
• Technology is fast changing
• The Ugly
• No SQL!
• Security
• Governance
• Performance
• The Opportunity
• Solve for :
• SQL semantics
• Data Governance
• Data Security
• Benchmarking, Profiling and Performance measurement tools
• Build :
• Real-time solutions
• Data Marts/Data Warehouses on top
Life in the Big Data Universe
Data Scientist / Data Analyst / Data Engineer
• Building Models
• Validation/Testing
• Algorithms
• Continuous Improvement
• Knowledge of :
• Statistics
• Linear Algebra
• Machine Learning
• R, MATLAB etc.
• Deep Domain Knowledge
• Report Generation
• Data Exploration
• Hypothesis Testing
• Pattern Discovery
• Correlations
• Serendipitous Discovery
• Data Pipelines
• Manage Platforms
• Productionalize Algorithms
• Agile Development
• Knowledge of :
• Platforms
• Algorithms
• Java, C++ etc.
• Scripting languages like Python
Data Engineering
• Strong CS Background
• Algorithms
• Database theory
• Scripting languages
• Server side languages
• Distributed Systems Background
• Clusters
• Networking
• Monitoring/Performance
• Data Science/Machine Learning
• Search/IR
• Text Analytics
• Classification
• Clustering
• Infrastructure
• Hadoop
• Cassandra
• MongoDB
• Platforms
• Solr
• Hive
• HBase
• Mahout
• Applications
• Recommendation Engines
• Fraud Prevention
• Disease Prevention
Data Engineer’s Role
• Data Dialysis – Cleaning up Data
• Hard to do at Scale
• Newer tools in this space
• Great scope for innovation
• ETL -> ELT
• Distributed Bulk loading
• Full-fledged data pipelines
• Supporting both data scientists and data analysts
• Productionalizing algorithms
• Production support
• Optimization
• A/B Testing and Continuous Improvement
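As a rough illustration of the "ETL -> ELT" and "Data Dialysis" bullets above, the sketch below loads raw rows untouched into a staging table and only then cleans them inside the store. It uses Python's built-in sqlite3 module as a stand-in for a real platform. The table names, columns and sample rows are hypothetical, and the cleaning rules (trim, lowercase, drop empty amounts, de-duplicate) are assumptions chosen for the example.

```python
import sqlite3

# Hypothetical raw rows with typical defects: stray whitespace, mixed case,
# a missing amount and an exact duplicate.
raw_rows = [
    (" Alice ", "30"),
    ("BOB", "20"),
    ("bob", ""),
    (" Alice ", "30"),
    ("carol", "12.5"),
]


def run_elt(rows):
    """Load raw rows first, then transform them inside the database."""
    conn = sqlite3.connect(":memory:")
    # Extract + Load: land the data as-is in a staging table.
    conn.execute("CREATE TABLE staging (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # Transform: clean after loading, using the store's own engine.
    conn.execute(
        """
        CREATE TABLE clean AS
        SELECT DISTINCT lower(trim(customer)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM staging
        WHERE trim(amount) <> ''
        """
    )
    result = conn.execute(
        "SELECT customer, amount FROM clean ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


cleaned = run_elt(raw_rows)
print(cleaned)
```

Keeping the staging table means the raw data is still available if the cleaning rules change later, which is one common argument for transforming after the load rather than before it.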
About this Meetup : Structure
• Agile teams
• Monthly Scrum
• Week 1 : Introduction to Problem
• Week 2 : Algorithm + Platform
• Week 3 : Technical help (Algorithm, Platform, Testing and Deployment)
• Week 4 : Panel + Demo
• Showcase Startups/Experts in the space
• Teams show demos
• Panel judges winners
• We might have prizes (needs to be figured out)
• Weekly Meetup (on Mondays)
• Might move to a bigger venue if there is enough demand
About this Meetup : Schedule
• May 29th : Kickoff
• Scrum 1
• June 3rd – Collaborative Filtering Introduction
• June 10th – MongoDB Introduction
• June 17th – Analytics on MongoDB
• June 24th – Panel + Demo
• Scrum 2 (TBD)
• Come along now, it will be fun!
• Oh, the name
Questions? Comments?
Thank You!
E-mail: vivganes@gmail.com
Twitter : onevivek

Introduction to Data Engineering

  • 1. Introduction to Data Engineering Vivek A. Ganesan vivganes@gmail.com
  • 2. Agenda Copyright 2013, Vivek A. Ganesan, All rights reserved 1 o Introduction o What is data engineering? o Why data engineering? o Required Skills o Questions?
  • 3. Introduction Copyright 2013, Vivek A. Ganesan, All rights reserved 2 o What’s with the name? o All other names were taken  o Gods = Geeks on Data o Well, it is now Geeking out on Data o Why a Data Geek? o Geeks are cool o Data Geeks are way cool Partial Omniscience (Super power of Prediction)
  • 4. Data, Data, Data!
      • Significant increase in data (Volume)
          • Social Networks
          • Transaction Logs
      • Fast streams of data (Velocity)
          • Sensor data
          • Machine-to-machine data
      • Different kinds of data (Variety)
          • Text
          • Audio
          • Video
      • This trend is only going to grow!
    Chart: Big Data Trends (Note: EB = Exabyte = 1,000 Petabytes = 1 million Terabytes)
  • 5. Before Big Data
      • Life was simple … well, mostly
      • The ETL engineers managed data pipelines
      • The Data Scientists (they weren’t called that, btw; they were mostly Statisticians who programmed in SAS, SPSS or S) did the analysis
      • Data Warehouses, Data marts and OLAP cubes were the platforms
      • Data Analysts mostly generated reports, but they were proficient in SQL, Excel, Pivot Tables etc.
      • Data Architects … well, they architected :)
      • They managed:
          • Data models
          • Star Schemas
          • Data Governance
          • Master Data Management (MDM)
          • Data Security
      • For the most part, they had to coax different groups to share data
  • 6. Big Data – What Changed?
      • Life … got interesting
      • Huge data volumes: ETL became a problem
      • Traditional Statistical tools couldn’t handle the volume
      • Data Warehouses, Data marts and OLAP cubes were no longer the primary analytical means; “in situ” analysis was preferred, i.e. no moving data to an analytics platform
      • Data Analysts were still on point for reports, but they no longer had SQL interfaces (thanks to NoSQL and Map Reduce)
      • Data Architects … well, they still need to architect :)
      • Still need:
          • Data models
          • Data Governance
          • Data Security
      • For the most part, they still have to coax different groups to share data
      • They have to do all of this while the technology is rapidly evolving
  • 7. Life in the Big Data Universe
      • The Good
          • Data recognized as an asset
          • Data Driven Products more common
          • Working with Data is cool
      • The Bad
          • Complexity is overwhelming
          • No sophisticated toolset yet
          • Technology is fast changing
      • The Ugly
          • No SQL!
          • Security
          • Governance
          • Performance
      • The Opportunity
          • Solve for:
              • SQL semantics
              • Data Governance
              • Data Security
              • Benchmarking, Profiling and Performance measurement tools
          • Build:
              • Real-time solutions
              • Data Marts/Data Warehouses on top
  • 8. Life in the Big Data Universe
      • Data Scientist
          • Building Models
          • Validation/Testing
          • Algorithms
          • Continuous Improvement
          • Knowledge of:
              • Statistics
              • Linear Algebra
              • Machine Learning
              • R, Matlab etc.
      • Data Analyst
          • Deep Domain Knowledge
          • Report Generation
          • Data Exploration
          • Hypotheses Testing
          • Pattern Discovery
          • Correlations
          • Serendipitous Discovery
      • Data Engineer
          • Data Pipelines
          • Manage Platforms
          • Productionalize Algorithms
          • Agile Development
          • Knowledge of:
              • Platforms
              • Algorithms
              • Java, C++ etc.
              • Scripting languages like Python
  • 9. Data Engineering
      • Strong CS Background
          • Algorithms
          • Database theory
          • Scripting languages
          • Server side languages
      • Distributed Systems Background
          • Clusters
          • Networking
          • Monitoring/Performance
      • Data Science/Machine Learning
          • Search/IR
          • Text Analytics
          • Classification
          • Clustering
      • Infrastructure
          • Hadoop
          • Cassandra
          • Mongo DB
      • Platforms
          • Solr
          • Hive
          • HBase
          • Mahout
      • Applications
          • Recommendation Engines
          • Fraud Prevention
          • Disease Prevention
  • 10. Data Engineer’s Role
      • Data Dialysis – Cleaning up Data
          • Hard to do at Scale
          • Newer tools in this space
          • Great scope for innovation
      • ETL -> ELT
          • Distributed Bulk loading
          • Full-fledged data pipelines
      • Supporting both data scientists and data analysts
      • Productionalizing algorithms
          • Production support
          • Optimization
          • A/B Testing and Continuous Improvement
  • 11. About this Meetup: Structure
      • Agile teams
      • Monthly Scrum
          • Week 1: Introduction to Problem
          • Week 2: Algorithm + Platform
          • Week 3: Technical help (Algorithm, Platform, Testing and Deployment)
          • Week 4: Panel + Demo
              • Showcase Startups/Experts in the space
              • Teams show demos
              • Panel judges winners
              • We might have prizes (needs to be figured out)
      • Weekly Meetup (on Mondays)
      • Might move to a bigger venue if there is enough demand
  • 12. About this Meetup: Schedule
      • May 29th: Kickoff
      • Scrum 1
          • June 3rd – Collaborative Filtering Introduction
          • June 10th – Mongo DB Introduction
          • June 17th – Analytics on Mongo DB
          • June 24th – Panel + Demo
      • Scrum 2 (TBD)
      • Come along now, it will be fun!
      • Oh, the name :)
  • 13. Questions? Comments? Thank You!
      E-mail: vivganes@gmail.com
      Twitter: onevivek
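
Slide 10 lists the shift from ETL to ELT without spelling it out. One way to read it: load the raw data first, then clean and transform it inside the data store, rather than transforming it before it is loaded. The sketch below is only an illustration under that reading, not anything taken from the talk. It uses Python's built-in sqlite3 module as a stand-in for a big data store, and the table names, sample rows and the run_elt helper are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a transaction log.
RAW_ROWS = [
    ("  Alice ", "12.50"),
    ("BOB", "7"),
    ("alice", "not-a-number"),  # dirty row, filtered out in the transform step
]


def run_elt(rows):
    """Load raw rows untouched, then clean and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")

    # Extract + Load: the staging table keeps the data exactly as received.
    conn.execute("CREATE TABLE staging (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

    # Transform: runs inside the data store, after loading.
    # Names are trimmed and lower-cased; rows whose amount is not a
    # plain number are dropped before summing.
    conn.execute(
        """
        CREATE TABLE totals AS
        SELECT lower(trim(customer)) AS customer,
               SUM(CAST(amount AS REAL)) AS total
        FROM staging
        WHERE amount GLOB '[0-9]*'
          AND amount NOT GLOB '*[^0-9.]*'
        GROUP BY lower(trim(customer))
        """
    )
    result = conn.execute(
        "SELECT customer, total FROM totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for customer, total in run_elt(RAW_ROWS):
        print(customer, total)
```

Because the staging table keeps the raw rows, the cleaning rules can be changed and re-run without re-extracting the data, which is part of the appeal of ELT when moving data is expensive.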