Become a Data Architect – session 1
Data Architect Base Salary:
135K+ for 4+ years of experience
150K (range 110K-180K)
There is a huge demand for data specialists:
• Data Engineer
• Data Analyst
• Data Architect
• Data Scientist
• Machine Learning Engineer
• Researcher (Data Science or Machine Learning)
• Managers and PMs
The biggest demand is for data architects.
Adding the word "Architect" to any technical job title
increases salary by ~20%, especially if you also add
the words "Senior" or "Enterprise":
• Senior Data Architect
• Enterprise Data Architect
With the job market so "hungry", education and
experience become optional:
• Optional: several years of relevant experience
• Optional: BS Degree
Data Architect Salary at Microsoft:
200K (range 143K-232K)
Total compensation up to $318K
Two Ways to Grow
Slow progress:
• Learn new skills at work while doing something else.
• Reach 80% of readiness.
• Become "entitled".
• Get promoted.
Fast progress (we do this):
• Learn 5-10% of the new skills.
• Convince a manager to give you the new project/job based solely on your enthusiasm and desire ("I can do this job, trust me ...").
• Learn the skill while doing the job.
What Does a Data Architect Do?
Data Architect (DA):
• Interviews business stakeholders to understand requirements and constraints
• Proposes a solution diagram (usually constructed from templates)
• How data is loaded, stored, maintained, queried, and consumed
• How to do analytics (self-service), Machine Learning Modeling, reporting
• Selects tools/technology, considering costs, compliance, privacy, security
• Automation, data lineage, data governance
• DA designs all stages and plans for execution: Design, Create, Deploy, Manage
• DA establishes models, policies, rules, standards that govern data collection, processing, storage, and usage
• DA advises and educates managers, engineers, analysts
Most Essential Technical Skills:
• Strong data modeling skills
• Database architecture and DW (Data Warehousing)
• ETL Tools
• Template Data Architectures in all three major Clouds
• Data governance know-how
• SQL, Python or R
• Analytics dashboarding (Power BI, Tableau, ...)
Business skills:
• Excellent communication skills.
• Listen to managers carefully to understand requirements
• Convert data challenges into automated processes
• Max results for min resources
• Excellent presentation skills
• Explain complex concepts to a non-technical audience
• Advise data modelers, data engineers, database administrators, and junior architects
• Industry knowledge: how data is collected, analyzed, and utilized
• Maintain flexibility in the face of big data developments
Generic Data Diagram
[Figure: Components of a big data architecture, from https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/]
Data Architect Responsibilities
• Design data-flow and data-storage strategy/architecture
• Build an inventory of data (available, needed, where to get)
• Work with IT, Data Scientists, and Management
• Identify and evaluate current data management technologies
• Create a fluid, end-to-end vision for how data will flow through an organization
• Develop data models for database structures
• Design, document, construct and deploy database architectures and apps
• Provide for scale, security, performance, data recovery, reliability, etc.
• Ensure data accuracy and accessibility
• Create frameworks / templates for solutions
• Constantly monitor, refine and report on the performance of data management systems
• Meld new systems with existing DW
• Produce and enforce database development standards
• Maintain a corporate repository of all data architecture artifacts and procedures
• Make presentations to upper management
DA Technical Skills
• Good foundation in Computer Science, Software Architectures, Engineering
• Data structures, algorithms
• System Design, Distributed System Design
• 3-tier architecture, MVC (Model, View, Controller)
• 3-tier using clusters (a.k.a. Shared Architecture), consistent hashing
• Lambda Architecture - events
• Streaming Architecture (Kafka): no central DB, just a message bus
• Database types (SQL, NoSQL (key-value), graph, etc.)
• OLTP vs OLAP, columnar storage (VertiPaq), denormalise for speed
• MPP (Massively Parallel Processing) DW, Clusters, Polaris Engine
• SQL mastery (DML, DDL, DCL, TCL)
• DBs: Mainframe DB2, Sybase, MS SQL Server, MySQL, Oracle, PostgreSQL, MongoDB, DynamoDB, CosmosDB, BigQuery, Apache Cassandra, Snowflake, Pig, etc.
• Data Warehouse (Kimball Star Schema, Facts, Dimensions, Snowflake schema)
• ETL tools (bcp, Oracle Data Loader, Informatica, Ab Initio, StreamSets, ADF (Azure Data Factory), Azure Synapse Integrate, etc.)
• Data Analytics (Power BI, Tableau, visualizations, Reporting, self-service)
• Data modeling tools (ERWin, Enterprise Architect, Visio, etc.), UML
• Data schema, entities, relations, data flows, hierarchies
• CAP Theorem (Consistency, Availability, Partition Tolerance)
• Geographical redundancy
• ACID transactions, dirty reads
• Replication, transaction log
• Distributed transactions, two-phase commit
• Backup/archival software
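Consistent hashing, named in the list above, is easier to grasp from a small example. The following is a minimal sketch, not a production implementation: the `HashRing` class, the shard names, the key names, and the choice of SHA-256 with 100 virtual nodes per shard are all assumptions made for illustration. It shows the property that matters to an architect: when a shard is added, only a fraction of the keys move, and they all move to the new shard.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a stable integer position on the hash ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes per shard."""

    def __init__(self, nodes=(), replicas: int = 100):
        self.replicas = replicas
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> shard name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `replicas` virtual nodes for this shard on the ring."""
        for i in range(self.replicas):
            pos = ring_position(f"{node}#{i}")
            if pos in self._owner:
                continue  # position already taken (practically never happens)
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove(self, node: str) -> None:
        """Remove all virtual nodes that belong to this shard."""
        for i in range(self.replicas):
            pos = ring_position(f"{node}#{i}")
            if self._owner.get(pos) == node:
                del self._owner[pos]
                self._positions.remove(pos)

    def get(self, key: str) -> str:
        """Return the shard owning the first ring position after the key."""
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
keys = [f"customer-{i}" for i in range(1000)]
before = {k: ring.get(k) for k in keys}

ring.add("shard-d")
after = {k: ring.get(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a shard")
```

With plain modulo hashing (`hash(key) % number_of_shards`), adding a fourth shard would remap most keys; here only the keys that land on the new shard's ring segments change owner.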
• Receiving/sending data in different formats and protocols (XML, SOAP, JSON, REST, Protocol Buffers)
• Working with APIs
• File formats: CSV, Parquet, JSON, Apache Arrow
• Handling nulls, missing data, data quality and integrity
• Hadoop/Spark data processing, loading, map-reduce, Google Bigtable, HDFS, GDFS
• Streaming, Kafka, Event Hub, IoT ingesting
• Design patterns
• Big data handling
• Data mining
• Data security, access, data privacy, GDPR, differential privacy
• Risk assessment
• Data governance (measure and manage data quality, ownership, compliance, security, cleaning, standards, categories, encryption, etc.)
• Data lineage
• Agile methodologies and ERP implementation, GitHub, GitLab
• App Servers
• Machine Learning, predictive modeling, NLP and text analytics
• Python, C/C++, Java, Perl
• Unix/Linux and MS Windows
• Some Math and Statistics
• IaaS, PaaS, SaaS (Infrastructure, Platform, Software as a Service)
Reference architectures and specific tools for all 3 major Clouds for ETL, SQL DW, Analytics, Machine Learning, Visualization, Reporting, etc.
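As an illustration of the "handling nulls, missing data, data quality" item above, here is a minimal sketch of a per-column missing-value report using only the Python standard library. The sample data, the column names, the `NULL_TOKENS` set, and the `missing_report` helper are all hypothetical; a real pipeline would agree the null conventions with the data owners first.

```python
import csv
import io

# Hypothetical sample extract; in practice this would be a file or a query result.
SAMPLE_CSV = """customer_id,email,country,signup_date
1,ann@example.com,US,2023-01-15
2,,DE,2023-02-01
3,bob@example.com,NULL,
4,cy@example.com,FR,2023-03-09
"""

# Assumed spellings of "no value"; these vary from source system to source system.
NULL_TOKENS = {"", "null", "none", "n/a", "na"}


def is_missing(value) -> bool:
    """Treat None and the assumed null tokens (case-insensitive) as missing."""
    return value is None or value.strip().lower() in NULL_TOKENS


def missing_report(rows, columns):
    """Return {column: (missing_count, missing_ratio)} for the given rows."""
    counts = {column: 0 for column in columns}
    total = 0
    for row in rows:
        total += 1
        for column in columns:
            if is_missing(row.get(column)):
                counts[column] += 1
    return {
        column: (counts[column], counts[column] / total if total else 0.0)
        for column in columns
    }


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
report = missing_report(reader, reader.fieldnames)
for column, (count, ratio) in report.items():
    print(f"{column}: {count} missing ({ratio:.0%})")
```

A report like this is a cheap first step before loading data into a warehouse: columns with a high missing ratio are candidates for a default value, a rejection rule, or a conversation with the source system's owner.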
Amazon – AWS (Amazon Web Services)
• S3 (Simple Storage Service)
• EC2 (Elastic Compute Cloud)
• Lambda functions (serverless)
• Databases: Redshift, Snowflake, Athena (serverless), Aurora (MySQL & PostgreSQL compatible DB), MariaDB, MySQL, PostgreSQL, Microsoft SQL Server, DynamoDB, Apache PrestoDB, Neptune Graph Database
• AWS Glue: managed ETL service
• AWS Data Pipeline
• AppFlow, Kinesis Firehose, AWS DataSync, AWS Database Migration Service
• Amazon EMR (Elastic MapReduce) - Hadoop, Spark
• Amazon Machine Learning: SageMaker (Python, Jupyter notebooks, deployment, ...)
• Amazon AI Services:
• Amazon Comprehend (extract from text)
• Amazon CodeGuru (auto code review)
• Amazon Lex (Chatbots)
• Amazon Forecast
• Amazon Textract (extract text and data from millions of docs)
• Amazon Kendra (Natural Language Search)
• Amazon Fraud Detector
• Amazon Rekognition - image/video analysis
• Amazon Personalize - recommendation engine
• Amazon Translate - real time translation
• Amazon Polly - text-to-speech
• Amazon Transcribe - speech to text
• Amazon QuickSight - Analytics dashboards, ...
Azure
• ADLS (Azure Data Lake Storage Gen2) blobs and files
• ADF (Azure Data Factory)
• Microsoft SQL Server, SQL Data Warehouse
• Azure Functions
• Synapse: Integrate, SQL pools (serverless & dedicated), PySpark, ADLS
• Databricks
• Machine Learning Studio
• Cosmos DB, Synapse Link for Cosmos DB
• Power BI
• Cognitive Services
• Azure Purview (data lineage, governance)
• Azure DevOps (agile planning, CI/CD tools, code repos, etc.)
[Diagram: Synapse Integrate data pipelines, Azure Data Lake Storage Gen2, Machine Learning, web endpoints]
Google Cloud Platform
• Cloud Storage
• Storage Transfer Service
• Cloud Functions
• Databases:
• Cloud SQL: managed MySQL, PostgreSQL, and SQL Server
• BigQuery: serverless DW, globally scalable, cost-effective
• Cloud Spanner: 99.999% availability, gaming, global fin. ledger, inventory
• Cloud Bigtable: NoSQL wide-column (similar to HBase & Apache Cassandra)
• Firestore: NoSQL for Mobile, IoT, ...
• Firebase Realtime DB: mobile, personalized ads, in-app chats, ...
• Memorystore: Redis or Memcached
• MongoDB Atlas
• Neo4j Aura (Graph DB)
• DataStax (NoSQL built on Apache Cassandra)
• Datalab, Dataprep, Dataflow
• Machine Learning: Datalab, ML Engine, AutoML
• BI Dashboards: Google Data Studio
• Colaboratory (Colab) - free Jupyter notebooks with GPU - https://colab.research.google.com/
• Kaggle - ML competitions, code, notebooks (kernels), ... - https://www.kaggle.com/
Oracle Cloud
Oracle Infrastructure, DB, Java, ERP apps, NetSuite, HR, CRM, ...
BEAM - Business Event Analysis & Modeling
Interview business stakeholders and document the data and process:
Create Event Matrix (Excel) documenting facts and dimensions.
Use "starter" templates for the interview and documentation.
• https://agilebi.guru/project/business-event-analysis-and-modeling-beam-templates/
• https://medium.com/hitachisolutions-braintrust/agile-data-modeling-e09c703205c1
• http://www.decisionone.co.uk/training/
• https://www.linkedin.com/in/lawrencecorr/
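To make the event matrix concrete, here is a minimal sketch that builds one in Python and writes it as CSV text that Excel can open. The events, dimensions, and facts are invented examples, and the layout (one row per business event, one column per dimension, a final column for facts) is one plausible arrangement rather than the official BEAM template; see the links above for the real starter templates.

```python
import csv
import io

# Hypothetical results of stakeholder interviews: each business event
# with the dimensions it involves and the facts (measures) it records.
EVENTS = {
    "Customer places order": {
        "dimensions": {"customer", "product", "date", "store"},
        "facts": ["quantity", "order amount"],
    },
    "Warehouse ships order": {
        "dimensions": {"customer", "product", "date", "carrier"},
        "facts": ["shipping cost"],
    },
    "Customer returns product": {
        "dimensions": {"customer", "product", "date", "reason"},
        "facts": ["refund amount"],
    },
}


def event_matrix(events):
    """Return (header, rows): one row per event, one column per dimension."""
    dimensions = sorted(set().union(*(e["dimensions"] for e in events.values())))
    header = ["event"] + dimensions + ["facts"]
    rows = [
        [name]
        + ["x" if dim in event["dimensions"] else "" for dim in dimensions]
        + [", ".join(event["facts"])]
        for name, event in events.items()
    ]
    return header, rows


def shared_dimensions(events):
    """Dimensions used by two or more events; these need one agreed definition."""
    counts = {}
    for event in events.values():
        for dim in event["dimensions"]:
            counts[dim] = counts.get(dim, 0) + 1
    return sorted(dim for dim, n in counts.items() if n >= 2)


header, rows = event_matrix(EVENTS)
buffer = io.StringIO()
csv.writer(buffer).writerows([header] + rows)
matrix_csv = buffer.getvalue()  # CSV text; save to a file to open it in Excel

print(matrix_csv)
print("Shared dimensions:", shared_dimensions(EVENTS))
```

The shared dimensions are the ones worth modeling first, because several fact tables in the eventual star schema will depend on them.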
DA Resources
AWS to Azure services comparison
- https://docs.microsoft.com/en-us/azure/architecture/aws-professional/services
Browse Azure Architecture
- https://docs.microsoft.com/en-us/azure/architecture/browse/
Big data architectures
- https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
BEAM
• https://agilebi.guru/project/business-event-analysis-and-modeling-beam-templates/
• https://medium.com/hitachisolutions-braintrust/agile-data-modeling-e09c703205c1
• http://www.decisionone.co.uk/training/
• https://www.linkedin.com/in/lawrencecorr/

DA_01_Intro.pptx

  • 1. Become a Data Architect – session 1 Data Architect Base Salary: 135K+ for 4+ years of experience 150K (range 110K-180K) There is a huge demand for data specialists: • Data Engineer • Data Analyst • Data Architect • Data Scientist • Machine Learning Engineer • Researcher (Data Science or Machine Learning) • Managers and PMs The biggest demand is for data architects. Adding the word "Architect" to any technical profession increases salary by ~20%. Especially if you also add words "Senior" or "Enterprise": • Senior Data Architect • Enterprise Data Architect With job market being so "hungry", the education and experience becomes optional: • Optional: several years of relevant experience • Optional: BS Degree Data Architect Salary at Microsoft: 200K Microsoft (143K-232K) Total compensation up to $318K
  • 2. Two Ways to Grow Learn new skills at work while doing something else. Reach 80% of readiness. Become "entitled". Get promoted. Slow progress Learn 5-10% of new skills. Convince a manger to give you the new project/job based solely on your enthusiasm and desire (I can do this job, trust me ... ) Learn the skill while doing the job. Fast progress We do this
  • 3. What Does a Data Architect Do ? Data Architect (DA): • Interviews business stakeholders to understand requirements and constraints • Proposes a solution diagram (usually constructs from templates) • How data is loaded, stored, maintained, queried, and consumed • How to do analytics (self-service), Machine Learning Modeling, reporting • Select tools/technology, considering costs, compliance, privacy, security • Automation, data lineage, data governance • DA designs all stages and plans for execution: Design, Create, Deploy, Manage • DA establishes models, policies, rules, standards that govern data collection, processing, storage, and usage • DA advises and educates managers, engineers, analysts Most Essential Technical Skills: • Strong data modeling skills • Database architecture and DW (Data Warehousing) • ETL Tools • Template Data Architectures in all three major Clouds • Data governance know-how • SQL, Python or R • Analytics dashboarding (Power BI, Tableau, ...) Business skills: • Excellent communication skills. • Listen to managers carefully to understand requirements • Convert data challenges into automated processes • Max results for min resources • Excellent presentation skills • Explain complex concepts to non-technical audience • Advise data modelers, data engineers, database administrators, and junior architects • Industry Knowledge, how data is collected, analyzed and utilized • Maintaining flexibility in the face of big data developments.
  • 5. Components of a big data architecture from https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
  • 6. • Design data-flow and data-storage strategy/architecture • Build an inventory of data (available, needed, where to get) • Work with IT, Data Scientists, and Management • Identify and evaluate current data management technologies • Create a fluid, end-to-end vision for how data will flow through an organization • Develop data models for database structures • Design, document, construct and deploy database architectures and apps • Provide for scale, security, performance, data recovery, reliability, etc. • Ensure data accuracy and accessibility • Create frameworks / templates for solutions • Constantly monitor, refine and report on the performance of data management systems • Meld new systems with existing DW • Produce and enforce database development standards • Maintain a corporate repository of all data architecture artifacts and procedures • Make presentations to upper management Data Architect Responsibilities
  • 7. DA Technical Skills
• Good foundation in Computer Science, Software Architectures, Engineering
• Data structures, algorithms
• System Design, Distributed System Design
• 3-tier architecture; MVC (Model, View, Controller)
• 3-tier using clusters (a.k.a. Shared Architecture), consistent hashing
• Lambda Architecture - events
• Streaming Architecture (Kafka): no central DB, just a message bus
• Database types (SQL, NoSQL (key-value), graph, etc.)
• OLTP vs OLAP, columnar storage (VertiPaq), denormalise for speed
• MPP (Massively Parallel Processing) DW, Clusters, Polaris Engine
• SQL mastery (DML, DDL, DCL, TCL)
• DBs: Mainframe DB2, Sybase, MS SQL Server, MySQL, Oracle, PostgreSQL, MongoDB, DynamoDB, CosmosDB, BigQuery, Apache Cassandra, Snowflake, Pig, etc.
• Data Warehouse (Kimball Star Schema, Facts, Dimensions, Snowflake schema)
• ETL tools (bcp, Oracle Data Loader, Informatica, Ab Initio, StreamSets, ADF (Azure Data Factory), Azure Synapse Integrate, etc.)
• Data Analytics (Power BI, Tableau, visualizations, Reporting, self-service)
• Data modeling tools (ERWin, Enterprise Architect, Visio, etc.), UML
• Data schema, entities, relations, data flows, hierarchies
• CAP Theorem (Consistency, Availability, Partition Tolerance)
• Geographical redundancy
• ACID transactions, dirty reads
• Replication, transaction log
• Distributed transactions, two-phase commit
• Backup/archival software
• Receiving/sending data in different formats (XML, SOAP, JSON, REST, protocol buffers)
• Working with APIs
• File formats: CSV, Parquet, JSON, Apache Arrow
• Handling nulls, missing data, data quality and integrity
• Hadoop/Spark data processing, loading, map-reduce, Google Bigtable, HDFS, GDFS
• Streaming, Kafka, Event Hub, IoT ingesting
• Design patterns
• Big data handling
• Data mining
• Data security, access, data privacy, GDPR, differential privacy
• Risk assessment
• Data governance (measure and manage data quality, ownership, compliance, security, cleaning, standards, categories, encryption, etc.)
• Data lineage
• Agile methodologies and ERP implementation, GitHub, GitLab
• App Servers
• Machine Learning, predictive modeling, NLP and text analytics
• Python, C/C++, Java, Perl
• Unix/Linux and MS Windows
• Some Math and Statistics
• IaaS, PaaS, SaaS (Infrastructure, Platform, Software as a Service)
• Reference architectures and specific tools for all 3 major Clouds for ETL, SQL DW, Analytics, Machine Learning, Visualization, Reporting, etc.
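Consistent hashing, listed above under clustered architectures, is easier to grasp with a small example. The sketch below is a toy hash ring with virtual nodes, not a production implementation. The class name `HashRing`, the node names `db-a`/`db-b`/`db-c`, the `customer-N` keys, and the choice of MD5 as the ring hash are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per physical node
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed on the ring at `vnodes` positions.
        for i in range(self.vnodes):
            point = ring_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        # Drop every ring position owned by the node.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        # A key belongs to the first ring position clockwise from its hash.
        if not self._points:
            raise ValueError("ring is empty")
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["db-a", "db-b", "db-c"])
keys = [f"customer-{i}" for i in range(1000)]

before = {k: ring.get_node(k) for k in keys}
ring.remove_node("db-b")
after = {k: ring.get_node(k) for k in keys}

# Only keys that lived on the removed node change owner.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after removing db-b")
```

The point of the design is visible in `moved`: when a node leaves the cluster, only the keys it owned are reassigned, whereas a plain `hash(key) % node_count` scheme would reshuffle most keys.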
  • 8. Amazon – AWS (Amazon Web Services)
• S3 (Simple Storage Service)
• EC2 (Elastic Compute Cloud)
• Lambda functions (serverless)
• Databases: Redshift, Snowflake, Athena (serverless), Aurora (MySQL & PostgreSQL compatible DB), MariaDB, MySQL, PostgreSQL, Microsoft SQL Server, DynamoDB, Apache PrestoDB, Neptune Graph Database
• AWS Glue: managed ETL service
• AWS Data Pipeline
• AppFlow, Kinesis Firehose, AWS DataSync, AWS Database Migration Service
• Amazon EMR (Elastic MapReduce) - Hadoop, Spark
• Amazon Machine Learning: SageMaker (Python, Jupyter notebooks, deployment, ...)
• Amazon AI Services:
  • Amazon Comprehend (extract from text)
  • Amazon CodeGuru (auto code review)
  • Amazon Lex (chatbots)
  • Amazon Forecast
  • Amazon Textract (extract text and data from millions of docs)
  • Amazon Kendra (natural language search)
  • Amazon Fraud Detector
  • Amazon Rekognition - image/video analysis
  • Amazon Personalize - recommendation engine
  • Amazon Translate - real-time translation
  • Amazon Polly - text-to-speech
  • Amazon Transcribe - speech-to-text
• Amazon QuickSight - analytics dashboards, ...
  • 9.
• ADLS (Azure Data Lake Storage Gen2) blobs and files
• ADF (Azure Data Factory)
• Microsoft SQL Server, SQL Data Warehouse
• Azure Functions
• Synapse
  • Integrate
  • SQL pools (serverless & dedicated)
  • PySpark
  • ADLS
• Databricks
• Machine Learning Studio
• Cosmos DB, Link for Cosmos DB
• Power BI
• Cognitive Services
• Azure Purview (data lineage, governance)
• Azure DevOps (agile planning, CI/CD tools, code repos, etc.)
[Diagram labels: Azure Synapse, Integrate, Data Pipelines, Azure Data Lake Storage Gen2, Machine Learning, Web End-Points]
  • 10. Google Cloud Platform
• Cloud Storage
• Storage Transfer Service
• Cloud Functions
• Databases:
  • Cloud SQL: managed MySQL, PostgreSQL, and SQL Server
  • Cloud BigQuery: serverless DW, globally scalable, cost-effective
  • Cloud Spanner: 99.999% availability, gaming, global fin. ledger, inventory
  • Cloud Bigtable: NoSQL wide-column (similar to HBase & Apache Cassandra)
  • Firestore: NoSQL for mobile, IoT, ...
  • Firebase Realtime DB: mobile, personalized ads, in-app chats, ...
  • Memorystore: Redis or Memcached
  • MongoDB Atlas
  • Neo4j Aura (Graph DB)
  • DataStax (NoSQL built on Apache Cassandra)
• Datalab, Dataprep, Dataflow
• Machine Learning: Datalab, ML Engine, AutoML
• BI Dashboards: Google Data Studio
• Colaboratory (Colab) - free Jupyter notebooks with GPU - https://colab.research.google.com/
• Kaggle - ML competitions, code, notebooks (kernels), ... - https://www.kaggle.com/
Oracle Cloud
• Oracle Infrastructure, DB, Java, ERP apps, NetSuite, HR, CRM, ...
  • 11. BEAM - Business Event Analysis & Modeling
Interview business stakeholders, and document the data and process:
• Create an Event Matrix (Excel) documenting facts and dimensions.
• Use "starter" templates for the interview and documentation.
• https://agilebi.guru/project/business-event-analysis-and-modeling-beam-templates/
• https://medium.com/hitachisolutions-braintrust/agile-data-modeling-e09c703205c1
• http://www.decisionone.co.uk/training/
• https://www.linkedin.com/in/lawrencecorr/
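To show where an event matrix leads, here is a minimal sketch of one business event turned into a star schema: one fact table surrounded by dimension tables. The event ("customer orders product on a date"), all table and column names, and the sample rows are hypothetical and chosen only for illustration. It uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimensions describe who / what / when; the fact table stores the measures.
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE fact_order (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,   -- measure: how many
    revenue      REAL       -- measure: how much
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(20240101, "2024-01-01"), (20240102, "2024-01-02")])
conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 20240101, 2, 20.0),
                  (2, 1, 20240102, 1, 10.0),
                  (1, 2, 20240102, 3, 45.0)])

# Typical analytic query: aggregate the measures, grouped by a dimension attribute.
revenue_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.quantity), SUM(f.revenue)
    FROM fact_order AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

print(revenue_by_product)  # [('Gadget', 3, 45.0), ('Widget', 3, 30.0)]
```

Each row of the event matrix would typically yield one fact table like `fact_order`, and dimensions shared by several events (customer, date) are defined once and reused across fact tables.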
  • 12. DA Resources
• AWS to Azure services comparison - https://docs.microsoft.com/en-us/azure/architecture/aws-professional/services
• Browse Azure Architecture - https://docs.microsoft.com/en-us/azure/architecture/browse/
• Big data architectures - https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
BEAM
• https://agilebi.guru/project/business-event-analysis-and-modeling-beam-templates/
• https://medium.com/hitachisolutions-braintrust/agile-data-modeling-e09c703205c1
• http://www.decisionone.co.uk/training/
• https://www.linkedin.com/in/lawrencecorr/