Become a Data Architect – session 1
Data Architect Base Salary:
135K+ for 4+ years of experience
150K (range 110K-180K)
There is a huge demand for data specialists:
• Data Engineer
• Data Analyst
• Data Architect
• Data Scientist
• Machine Learning Engineer
• Researcher (Data Science or Machine Learning)
• Managers and PMs
The biggest demand is for data architects.
Adding the word "Architect" to any technical profession
increases salary by ~20%, especially if you also add
the words "Senior" or "Enterprise":
• Senior Data Architect
• Enterprise Data Architect
With the job market so "hungry", education and
experience become optional:
• Optional: several years of relevant experience
• Optional: BS Degree
Data Architect Salary at Microsoft:
200K Microsoft (143K-232K)
Total compensation up to $318K
Two Ways to Grow
Learn new skills at work
while doing something else.
Reach 80% of readiness.
Become "entitled".
Get promoted.
Slow progress
Learn 5-10% of new skills.
Convince a manager to give you the new project/job
based solely on your enthusiasm and desire
(I can do this job, trust me ... )
Learn the skill while doing the job.
Fast progress
We do this
What Does a Data Architect Do?
Data Architect (DA):
• Interviews business stakeholders to understand requirements and constraints
• Proposes a solution diagram (usually constructs from templates)
• How data is loaded, stored, maintained, queried, and consumed
• How to do analytics (self-service), Machine Learning Modeling, reporting
• Select tools/technology, considering costs, compliance, privacy, security
• Automation, data lineage, data governance
• DA designs all stages and plans for execution: Design, Create, Deploy, Manage
• DA establishes models, policies, rules, standards that govern data collection, processing, storage, and usage
• DA advises and educates managers, engineers, analysts
Most Essential Technical Skills:
• Strong data modeling skills
• Database architecture and DW (Data Warehousing)
• ETL Tools
• Template Data Architectures in all three major Clouds
• Data governance know-how
• SQL, Python or R
• Analytics dashboarding (Power BI, Tableau, ...)
Business skills:
• Excellent communication skills.
• Listen to managers carefully to understand requirements
• Convert data challenges into automated processes
• Max results for min resources
• Excellent presentation skills
• Explain complex concepts to non-technical audience
• Advise data modelers, data engineers, database administrators, and junior architects
• Industry Knowledge, how data is collected, analyzed and utilized
• Maintaining flexibility in the face of big data developments.
Generic Data Diagram
Components of a big data architecture
from https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
• Design data-flow and data-storage strategy/architecture
• Build an inventory of data (available, needed, where to get)
• Work with IT, Data Scientists, and Management
• Identify and evaluate current data management technologies
• Create a fluid, end-to-end vision for how data will flow through an organization
• Develop data models for database structures
• Design, document, construct and deploy database architectures and apps
• Provide for scale, security, performance, data recovery, reliability, etc.
• Ensure data accuracy and accessibility
• Create frameworks / templates for solutions
• Constantly monitor, refine and report on the performance of data management systems
• Meld new systems with existing DW
• Produce and enforce database development standards
• Maintain a corporate repository of all data architecture artifacts and procedures
• Make presentations to upper management
Data Architect Responsibilities
• Good foundation in Computer Science, Software Architectures, Engineering
• Data structures, algorithms,
• System Design, Distributed System Design
• 3-tier architecture (presentation, logic, data) and MVC (Model, View, Controller)
• 3-tier using clusters (a.k.a. Sharded Architecture), consistent hashing
• Lambda Architecture - events
• Streaming Architecture (Kafka): no central DB, just a message bus
• Database types (SQL, NoSQL (key-value), graph, etc.)
• OLTP vs OLAP, columnar storage (VertiPaq), denormalise for speed
• MPP (Massively Parallel Processing) DW, Clusters, Polaris Engine
• SQL mastery (DML , DDL, DCL, TCL)
• DBs: Mainframe DB2, Sybase, MS SQL Server, MySQL,
• Oracle, PostgreSQL, MongoDB, DynamoDB, CosmosDB, BigQuery,
• Apache Cassandra, Snowflake, Pig, etc.
• Data Warehouse (Kimball Star Schema, Facts, Dimensions, Snowflake schema)
• ETL tools (bcp, Oracle Data Loader, Informatica, Ab Initio, StreamSets,
• ADF (Azure Data Factory), Azure Synapse Integrate, etc.)
• Data Analytics (Power BI, Tableau, visualizations, Reporting, self-service)
• Data modeling tools (ERWin, Enterprise Architect, Visio, etc.), UML
• Data schema, entities, relations, data flows, hierarchies
• CAP Theorem (Consistency, Availability, Partition Tolerance)
• Geographical redundancy,
• ACID transactions, dirty reads
• Replication, transaction log
• Distributed transactions, two-phase commit
• Backup/archival software
DA Technical Skills
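The Kimball star schema item above (facts, dimensions) can be illustrated with a small sketch. This is only an illustration using Python's built-in sqlite3 module; the table names (dim_product, dim_date, fact_sales) and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Two dimension tables hold descriptive attributes; the fact table holds
# measures plus foreign keys to the dimensions (hypothetical names).
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    day     TEXT,
    month   TEXT
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    amount     REAL
);
""")

conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Hardware"), (2, "Mouse", "Hardware"), (3, "Antivirus", "Software")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2024-01-15", "2024-01"), (2, "2024-02-10", "2024-02")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 2000.0), (2, 1, 5, 100.0), (3, 2, 1, 50.0), (1, 2, 1, 1000.0)],
)


def sales_by_category(connection):
    """Join the fact table to one dimension and aggregate the measure."""
    rows = connection.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON f.product_id = p.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)


totals = sales_by_category(conn)
print(totals)
```

A typical analytic query joins the central fact table to one or more dimensions and aggregates a measure, as sales_by_category does here.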
• Receiving/sending data in different formats (XML, SOAP, JSON, REST, protocol buffers)
• Working with APIs
• File formats: CSV, parquet, JSON, Apache Arrow
• Handling nulls, missing data, data quality and integrity
• Hadoop/Spark data processing, loading, map-reduce, Google Bigtable, HDFS, GDFS
• Streaming, Kafka, Event Hub, IoT ingesting
• Design patterns
• Big data handling
• Data mining
• Data security, access, data privacy, GDPR, differential privacy
• Risk assessment
• Data governance (measure and manage data quality, ownership, compliance, security, cleaning, standards, categories, encryption, etc.)
• Data lineage
• Agile methodologies and ERP implementation, GitHub, GitLab
• App Servers
• Machine Learning, predictive modeling, NLP and text analytics
• Python, C/C++, Java, Perl
• Unix/Linux and MS Windows
• Some Math and Statistics
• IaaS, PaaS, SaaS (Infrastructure, Platform, Software as a Service)
Reference architectures and specific tools for all 3 major Clouds
for ETL, SQL DW, Analytics, Machine Learning, Visualization, Reporting, etc.
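As one illustration of the "handling nulls, missing data, data quality" item above, here is a minimal sketch that counts missing values per column in a CSV extract. It uses only the Python standard library. The sample data and the set of tokens treated as missing (NULL_TOKENS) are assumptions made for this example; a real pipeline would agree on them with the data owners.

```python
import csv
import io

# Hypothetical sample extract; in practice this would come from a file or API.
RAW = """id,name,age
1,Alice,34
2,,41
3,Bob,
4,Carol,n/a
"""

# Assumption: these strings (case-insensitive) all mean "no value".
NULL_TOKENS = {"", "n/a", "null", "none"}


def profile_missing(text):
    """Return (row_count, {column: number_of_missing_values})."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {field: 0 for field in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for field in counts:
            value = row.get(field)
            # A short row yields None; otherwise compare against the tokens.
            if value is None or value.strip().lower() in NULL_TOKENS:
                counts[field] += 1
    return total, counts


total_rows, missing = profile_missing(RAW)
for column, n in missing.items():
    print(f"{column}: {n} of {total_rows} missing")
```

A profile like this is usually a first step: it shows where quality rules (defaults, rejection, imputation) need to be defined before loading.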
• S3 (Simple Storage Service)
• EC2 (Elastic Compute Cloud)
• Lambda functions (serverless)
• Databases: Redshift, Snowflake, Athena (serverless), Aurora (MySQL & PostgreSQL compatible DB), MariaDB, MySQL, PostgreSQL, Microsoft SQL Server, DynamoDB, Apache PrestoDB, Neptune Graph Database
• AWS Glue: managed ETL service
• AWS Data Pipeline
• AppFlow, Kinesis Firehose, AWS DataSync, AWS Database Migration Service
• Amazon EMR (Elastic MapReduce) - Hadoop, Spark
• Amazon Machine Learning: SageMaker (python, jupyter notebooks, deployment, ...)
• Amazon AI Services:
• Amazon Comprehend (extract from text)
• Amazon CodeGuru (auto code review)
• Amazon Lex (Chatbots)
• Amazon Forecast
• Amazon Textract (extract text and data from millions of docs)
• Amazon Kendra (Natural Language Search)
• Amazon Fraud Detector
• Amazon Rekognition - image/video analysis,
• Amazon Personalize - recommendation engine,
• Amazon Translate - real time translation
• Amazon Polly - text-to-speech
• Amazon Transcribe - speech to text
• Amazon QuickSight - Analytics dashboards, ...
Amazon – AWS (Amazon Web Services)
• ADLS (Azure Data Lake Storage Gen2) blobs and files
• ADF (Azure Data Factory)
• Microsoft SQL Server, SQL Data Warehouse
• Azure Functions
• Synapse
• Integrate
• SQL pools (serverless & dedicated),
• PySpark
• ADLS
• Databricks
• Machine Learning Studio
• CosmosDB, Link for Cosmos DB
• Power BI
• Cognitive Services
• Azure Purview (data lineage, governance)
• Azure DevOps (agile planning, CI/CD tools, code repos, etc.)
Azure
[Diagram labels: Synapse Integrate, Data Pipelines, Azure Data Lake Storage Gen2, Machine Learning, Web End-Points]
• Cloud Storage
• Storage Transfer Service
• Cloud Functions
• Databases:
• Cloud SQL: managed MySQL, PostgreSQL, and SQL Server
• Cloud BigQuery: Serverless DW, globally scalable, cost-effective
• Cloud Spanner: 99.999% availability, gaming, global fin. ledger, inventory
• Cloud Bigtable: NoSQL wide-column (similar to HBase & Apache Cassandra)
• Firestore: NoSQL for Mobile, IoT, ...
• Firebase Realtime DB: mobile, personalized ads, in-app chats, ...
• Memorystore: Redis or Memcached
• MongoDB Atlas
• Neo4j Aura (Graph DB)
• Datastax (NoSQL built on Apache Cassandra)
• Datalab, DataPrep, DataFlow
• Machine Learning: DataLab, ML Engine, AutoML
• BI Dashboards: Google Data Studio
• Colaboratory (Colab) - free jupyter notebooks with GPU - https://colab.research.google.com/
• Kaggle - ML competitions, code, notebooks (kernels), ... - https://www.kaggle.com/
Google Cloud Platform
Oracle Cloud
Oracle Infrastructure, DB, Java, ERP apps, NetSuite, HR, CRM, ...
BEAM - Business Event Analysis & Modeling
Interview business stakeholders, and document the data and
process:
Create Event Matrix (Excel) documenting facts and dimensions.
Use "starter" templates for the interview and documentation.
• https://agilebi.guru/project/business-event-analysis-and-modeling-beam-templates/
• https://medium.com/hitachisolutions-braintrust/agile-data-modeling-e09c703205c1
• http://www.decisionone.co.uk/training/
• https://www.linkedin.com/in/lawrencecorr/
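The Event Matrix mentioned above can be sketched as a simple grid of business events against the dimensions each one uses. The events and dimension names below are hypothetical examples, not taken from the slides or from the BEAM templates; the output is comma-separated so it can be pasted into Excel.

```python
# Hypothetical business events, each listed with the dimensions it involves.
EVENTS = {
    "Customer orders product": ["Customer", "Product", "Date", "Store"],
    "Carrier ships order": ["Customer", "Product", "Date", "Carrier"],
    "Customer returns product": ["Customer", "Product", "Date"],
}


def build_event_matrix(events):
    """Return (sorted dimension names, {event: {dimension: used?}})."""
    dims = sorted({d for used in events.values() for d in used})
    grid = {
        event: {d: d in used for d in dims}
        for event, used in events.items()
    }
    return dims, grid


def shared_dimensions(events):
    """Dimensions used by more than one event (candidates for reuse)."""
    counts = {}
    for used in events.values():
        for d in set(used):
            counts[d] = counts.get(d, 0) + 1
    return sorted(d for d, n in counts.items() if n > 1)


dimensions, matrix = build_event_matrix(EVENTS)
conformed = shared_dimensions(EVENTS)

# Print one header row, then one row per event with an X where a dimension applies.
print(",".join(["Event"] + dimensions))
for event, row in matrix.items():
    print(",".join([event] + ["X" if row[d] else "" for d in dimensions]))
print("Shared dimensions:", ", ".join(conformed))
```

Dimensions that appear in several events are the ones worth modeling once and reusing across fact tables.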
AWS to Azure services comparison
- https://docs.microsoft.com/en-us/azure/architecture/aws-professional/services
Browse Azure Architecture
- https://docs.microsoft.com/en-us/azure/architecture/browse/
Big data architectures
- https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
BEAM
• https://agilebi.guru/project/business-event-analysis-and-modeling-beam-templates/
• https://medium.com/hitachisolutions-braintrust/agile-data-modeling-e09c703205c1
• http://www.decisionone.co.uk/training/
• https://www.linkedin.com/in/lawrencecorr/
DA Resources
