
Introduction to Data Engineering


Big Data Gods Meetup - Week 1 Presentation

Presentation Transcript

  • Introduction to Data Engineering
    Vivek A. Ganesan
    vivganes@gmail.com
  • Agenda
    Copyright 2013, Vivek A. Ganesan, All rights reserved
    o Introduction
    o What is data engineering?
    o Why data engineering?
    o Required Skills
    o Questions?
  • Introduction
    o What’s with the name?
      o All other names were taken
      o Gods = Geeks on Data
      o Well, it is now Geeking out on Data
    o Why a Data Geek?
      o Geeks are cool
      o Data Geeks are way cool
      o Partial Omniscience (Super power of Prediction)
  • Data, Data, Data!
    • Significant increase in data (Volume)
      • Social Networks
      • Transaction Logs
    • Fast streams of data (Velocity)
      • Sensor data
      • Machine-to-machine data
    • Different kinds of data (Variety)
      • Text
      • Audio
      • Video
    • This trend is only going to grow!
    Note: EB = Exabyte = 1,000 Petabytes (1 million Terabytes)
    [Figure: Big Data Trends]
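For a sense of scale, the unit arithmetic behind an exabyte can be checked with a short sketch. It assumes decimal (SI) prefixes, which the slide does not specify; the `UNITS` table and the `convert` helper are illustrative names, not part of the presentation.

```python
# Decimal (SI) byte units. Binary units (PiB, EiB) would use powers of 1024 instead.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18}


def convert(value, src, dst):
    """Convert a quantity between the byte units listed in UNITS."""
    return value * UNITS[src] / UNITS[dst]


if __name__ == "__main__":
    print(convert(1, "EB", "PB"))  # exabytes expressed in petabytes
    print(convert(1, "EB", "TB"))  # exabytes expressed in terabytes
```

Under these assumptions one exabyte is a thousand petabytes, or a million terabytes.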
  • Before Big Data
    • Life was simple … well mostly
    • The ETL engineers managed data pipelines
    • The Data Scientists (they weren’t called that, btw, they were mostly Statisticians who programmed in SAS, SPSS or S) did the analysis
    • Data Warehouses, Data marts and OLAP cubes were the platforms
    • Data Analysts mostly generated reports but they were proficient in SQL, Excel, Pivot Tables etc.
    • Data Architects … well, they architected
    • They managed:
      • Data models
      • Star Schemas
      • Data Governance
      • Master Data Management (MDM)
      • Data Security
    • For the most part, they had to coax different groups to share data
  • Big Data – What Changed?
    • Life … got interesting
    • Huge data volumes – ETL became a problem
    • Traditional Statistical tools couldn’t handle the volume
    • Data Warehouses, Data marts and OLAP cubes not primary analytical means – “in situ” analysis preferred, i.e. no moving data to an analytics platform
    • Data Analysts still on point for reports but now they no longer had SQL interfaces (thanks to NoSQL and Map Reduce)
    • Data Architects … well, they still need to architect
    • Still need:
      • Data models
      • Data Governance
      • Data Security
    • For the most part, they had to coax different groups to share data
    • They have to do all of this when the technology is rapidly evolving
  • Life in the Big Data Universe
    • The Good
      • Data recognized as an asset
      • Data Driven Products more common
      • Working with Data is cool
    • The Bad
      • Complexity is overwhelming
      • No sophisticated toolset yet
      • Technology is fast changing
    • The Ugly
      • No SQL!
      • Security
      • Governance
      • Performance
    • The Opportunity
      • Solve for:
        • SQL semantics
        • Data Governance
        • Data Security
        • Benchmarking, Profiling and Performance measurement tools
      • Build:
        • Real-time solutions
        • Data Marts/Data Warehouses on top
  • Life in the Big Data Universe
    • Data Scientist
      • Building Models
      • Validation/Testing
      • Algorithms
      • Continuous Improvement
      • Knowledge of:
        • Statistics
        • Linear Algebra
        • Machine Learning
        • R, Matlab etc.
    • Data Analyst
      • Deep Domain Knowledge
      • Report Generation
      • Data Exploration
      • Hypotheses Testing
      • Pattern Discovery
      • Correlations
      • Serendipitous Discovery
    • Data Engineer
      • Data Pipelines
      • Manage Platforms
      • Productionalize Algorithms
      • Agile Development
      • Knowledge of:
        • Platforms
        • Algorithms
        • Java, C++ etc.
        • Scripting languages like Python
  • Data Engineering
    • Strong CS Background
      • Algorithms
      • Database theory
      • Scripting languages
      • Server side languages
    • Distributed Systems Background
      • Clusters
      • Networking
      • Monitoring/Performance
    • Data Science/Machine Learning
      • Search/IR
      • Text Analytics
      • Classification
      • Clustering
    • Infrastructure
      • Hadoop
      • Cassandra
      • Mongo DB
    • Platforms
      • Solr
      • Hive
      • HBase
      • Mahout
    • Applications
      • Recommendation Engines
      • Fraud Prevention
      • Disease Prevention
  • Data Engineer’s Role
    • Data Dialysis – Cleaning up Data
      • Hard to do at Scale
      • Newer tools in this space
      • Great scope for innovation
    • ETL -> ELT
      • Distributed Bulk loading
      • Full-fledged data pipelines
    • Supporting both data scientists and data analysts
    • Productionalizing algorithms
      • Production support
      • Optimization
      • A/B Testing and Continuous Improvement
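The "ETL -> ELT" and "Cleaning up Data" points can be illustrated with a minimal sketch: raw records are loaded untouched into a staging table first, and the cleanup then runs inside the data store. This is only an illustration under stated assumptions, not the presenter's method. It uses Python's built-in `sqlite3` as a stand-in for a real platform, and the table names (`raw_orders`, `clean_orders`), the sample rows and the cleaning rules are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a log or an export.
RAW_ROWS = [
    ("1", " Alice ", "42.50"),
    ("2", "bob", "not-a-number"),  # bad amount, should be rejected
    ("3", "", "10"),               # missing customer, should be rejected
    ("1", " Alice ", "42.50"),     # exact duplicate
]


def load_raw(conn, rows):
    """Extract + Load: land the data untouched in a staging table."""
    conn.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform_in_store(conn):
    """Transform: clean the data inside the store, after it has been loaded."""
    conn.execute(
        """
        CREATE TABLE clean_orders AS
        SELECT DISTINCT
            CAST(id AS INTEGER)   AS id,
            LOWER(TRIM(customer)) AS customer,
            CAST(amount AS REAL)  AS amount
        FROM raw_orders
        WHERE TRIM(customer) <> ''
          AND amount GLOB '[0-9]*'          -- starts with a digit
          AND amount NOT GLOB '*[^0-9.]*'   -- only digits and dots
        """
    )


def run_elt(rows):
    """Run load-then-transform on an in-memory database and return clean rows."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, rows)
    transform_in_store(conn)
    result = conn.execute(
        "SELECT id, customer, amount FROM clean_orders ORDER BY id"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_elt(RAW_ROWS))
```

Because the raw table is kept, the cleaning rules can be changed and re-run without re-extracting the source data, which is one common argument for loading before transforming.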
  • About this Meetup: Structure
    • Agile teams
    • Monthly Scrum
      • Week 1: Introduction to Problem
      • Week 2: Algorithm + Platform
      • Week 3: Technical help (Algorithm, Platform, Testing and Deployment)
      • Week 4: Panel + Demo
        • Showcase Startups/Experts in the space
        • Teams show demos
        • Panel judges winners
        • We might have prizes (needs to be figured out)
    • Weekly Meetup (on Mondays)
    • Might move to a bigger venue if there is enough demand
  • About this Meetup: Schedule
    • May 29th: Kickoff
    • Scrum 1
      • June 3rd – Collaborative Filtering Introduction
      • June 10th – Mongo DB Introduction
      • June 17th – Analytics on Mongo DB
      • June 24th – Panel + Demo
    • Scrum 2 (TBD)
    • Come along now, it will be fun!
    • Oh, the name
  • Questions? Comments?
    Thank You!
    E-mail: vivganes@gmail.com
    Twitter: onevivek