Key projects in Data Science
March 30, 2024
Vijayananda Mohire
This is a summary of my five years of hands-on work towards gaining the required experience and skills in Data Science and Engineering. It includes key partner trainings from Google®.
Project No. | Project Name | Project Summary / Learning
Project 1: Olympic Medal Analysis
This project uses the Olympics.csv dataset from Kaggle. It provides details and insights on the medals won by athletes and countries. Below is one example.
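As a rough illustration of this kind of analysis, the sketch below counts medals per country using only Python's standard library. The inline sample rows and the column names `athlete`, `country`, and `medal` are assumptions made for the example; the actual Olympics.csv layout may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real Olympics.csv columns may differ.
SAMPLE_CSV = """athlete,country,medal
A. Runner,USA,Gold
B. Swimmer,USA,Silver
C. Jumper,Kenya,Gold
D. Rower,Kenya,Gold
E. Fencer,Italy,Bronze
"""


def medals_by_country(csv_text, medal=None):
    """Count medals per country, optionally restricted to one medal type."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if medal is None or row["medal"] == medal:
            counts[row["country"]] += 1
    return counts


all_medals = medals_by_country(SAMPLE_CSV)
gold_medals = medals_by_country(SAMPLE_CSV, medal="Gold")
print(all_medals.most_common())
print(gold_medals.most_common(1))
```

To run the same count on a real file, the `io.StringIO` wrapper could be swapped for an opened file handle.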
Project 2: Migrating from Spark to BigQuery via Dataproc
Migrating the original Spark code to Dataproc (lift-and-shift) and analyzing the Spark tasks: copying data to HDFS, reading the CSV files, and running Spark analysis using DataFrames and Spark SQL.
Project 3: Flight Departure Delay Analysis
This project provides insights into departure delays using Google BigQuery, SQL queries, and DataFrame plots to visualize the analysis. The dataset was read from Google-hosted storage, under the name cloud-training-demos.airline_ontime_data.flights.
Project 4: Exploratory Data Analysis using BigQuery
EDA with Python and Scikit-Learn, using linear regression and heatmaps, for predicting US house values and estimating taxi fares.
Project 5: Exploring and Creating an Ecommerce Analytics Pipeline with Cloud Dataprep v1.5 / Data wrangling / cleansing
Cloud Dataprep® by Trifacta® is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. In this lab we will explore the Cloud Dataprep UI to build an ecommerce transformation pipeline that will run at a scheduled interval and output results back into BigQuery.
The dataset we will be using is an ecommerce dataset that has millions of Google® Analytics records for the Google® Merchandise Store loaded into BigQuery.
In this lab, you learn how to perform these tasks:
• Connect BigQuery datasets to Cloud Dataprep
• Explore dataset quality with Cloud Dataprep
• Create a data transformation pipeline with Cloud Dataprep
• Schedule transformation job outputs to BigQuery
[Figures: Analytics Pipeline; Recipe Book with rules; Results with duplicate rows removed]
Project 6: Advanced Visualizations with TensorFlow Data Validation
This lab illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize your dataset. That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in our dataset.
First we'll use `tfdv.generate_statistics_from_csv` to compute statistics for our training data. TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions. Now let's use `tfdv.infer_schema` to create a schema for our data.
Does our evaluation dataset match the schema from our training dataset? This is especially important for categorical
features, where we want to identify the range of acceptable values.
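The idea of checking an evaluation set against the range of acceptable categorical values can be sketched in plain Python. This is not the TFDV API; the helper names `infer_domains` and `find_anomalies`, the feature names, and the sample rows are all hypothetical and exist only to show the concept.

```python
def infer_domains(rows, categorical_features):
    """Collect the set of values seen in training for each categorical feature."""
    domains = {name: set() for name in categorical_features}
    for row in rows:
        for name in categorical_features:
            domains[name].add(row[name])
    return domains


def find_anomalies(rows, domains):
    """Return {feature: sorted unexpected values} for values outside the domain."""
    anomalies = {}
    for row in rows:
        for name, allowed in domains.items():
            value = row[name]
            if value not in allowed:
                anomalies.setdefault(name, set()).add(value)
    return {name: sorted(values) for name, values in anomalies.items()}


# Hypothetical training and evaluation rows.
train_rows = [
    {"payment_type": "Cash", "company": "A"},
    {"payment_type": "Credit Card", "company": "B"},
]
eval_rows = [
    {"payment_type": "Cash", "company": "A"},
    {"payment_type": "Prcard", "company": "C"},
]

schema = infer_domains(train_rows, ["payment_type", "company"])
anomalies = find_anomalies(eval_rows, schema)
print(anomalies)
```

Each reported value is then either a data error to fix or a legitimate new category that should be added to the schema.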
Drift detection is supported for categorical features and between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of L-infinity distance, and you can set the threshold distance so that you receive warnings when the drift is higher than is acceptable.
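For a categorical feature, the L-infinity distance between two spans is the largest absolute difference between the normalized frequencies of any single value. The following is a minimal standard-library sketch of that calculation, not TFDV's own implementation; the sample spans and the `DRIFT_THRESHOLD` value are assumptions chosen for illustration.

```python
from collections import Counter


def normalized_counts(values):
    """Map each distinct value to its share of the span."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def l_infinity_distance(values_a, values_b):
    """Largest absolute gap between the two normalized frequency tables."""
    p = normalized_counts(values_a)
    q = normalized_counts(values_b)
    keys = set(p) | set(q)
    return max((abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys), default=0.0)


# Hypothetical values of one categorical feature on two consecutive days.
span_n = ["Cash"] * 6 + ["Credit Card"] * 4
span_n_plus_1 = ["Cash"] * 3 + ["Credit Card"] * 7

DRIFT_THRESHOLD = 0.1  # assumed threshold, tune per feature
distance = l_infinity_distance(span_n, span_n_plus_1)
drift_detected = distance > DRIFT_THRESHOLD
print(round(distance, 3), drift_detected)
```

Here the share of "Cash" falls from 0.6 to 0.3, so the distance is about 0.3 and the assumed threshold is exceeded.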
Skew and drift comparators can be added to visualize issues and make corrections. A few of the uses are:
1. Validating new data for inference to make sure that we haven't suddenly started receiving bad features
2. Validating new data for inference to make sure that our model has trained on that part of the decision surface
3. Validating our data after we've transformed it and done feature engineering (probably using TensorFlow Transform) to make sure we haven't done something wrong
Project 7: TPU Speed Data Pipelines
TPUs are very fast, and the stream of training data must keep up with their training speed. In this lab, you will learn how to load data from Cloud Storage with the tf.data.Dataset API to feed your TPU.
You will learn:
• To use the tf.data.Dataset API to load training data.
• To use TFRecord format to load training data efficiently from Cloud Storage.
Project 8: Data Pipelines – design and deploy
A Kubeflow pipeline on Google Cloud. This project has three components whose outputs are combined to produce a final response for the consumer.
https://github.com/vijaymohire/gcp/blob/main/MyPipeExample.ipynb
https://github.com/vijaymohire/gcp/blob/main/KubeflowpipelineRun.png
Project 9: Data ingestion, ETL
Project name: Serverless Data Processing with Dataflow - Writing an ETL Pipeline using Apache Beam and Cloud Dataflow (Python)
In this lab, you will learn how to:
• Build a batch Extract-Transform-Load pipeline in Apache Beam, which takes raw data from Google Cloud Storage and writes it to Google BigQuery.
• Run the Apache Beam pipeline on Cloud Dataflow.
• Parameterize the execution of the pipeline.
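The extract, transform, load, and parameterize steps listed above can be sketched with the standard library alone. This is a plain-Python stand-in and does not use the Apache Beam API; the field names `user_id` and `num_bytes`, the `--min_bytes` flag, and the in-memory list that stands in for a BigQuery table are all hypothetical.

```python
import argparse
import csv
import io
import json


def extract(text):
    """Extract: yield one raw dict per CSV line."""
    yield from csv.DictReader(io.StringIO(text))


def transform(record, min_bytes):
    """Transform: cast types and drop records below the size threshold."""
    num_bytes = int(record["num_bytes"])
    if num_bytes < min_bytes:
        return None
    return {"user_id": record["user_id"], "num_bytes": num_bytes}


def load(records, sink):
    """Load: append JSON rows to a sink (stand-in for a warehouse table)."""
    for record in records:
        sink.append(json.dumps(record, sort_keys=True))


def run(argv, raw_text, sink):
    """Parse pipeline parameters, then chain extract -> transform -> load."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--min_bytes", type=int, default=0)
    args = parser.parse_args(argv)
    transformed = (transform(r, args.min_bytes) for r in extract(raw_text))
    load((r for r in transformed if r is not None), sink)


RAW = "user_id,num_bytes\nu1,120\nu2,40\nu3,300\n"
table = []
run(["--min_bytes", "100"], RAW, table)
print(table)
```

Passing the threshold as a command-line argument mirrors the idea of parameterizing a pipeline run, so the same code can be executed with different settings without edits.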
Project 10: Feature Engineering
Project: Predict Bike Trip Duration with a Regression Model in BQML 2.5
In this lab, you learn to perform the following tasks:
• Query and explore the London bicycles dataset for feature engineering
• Create a linear regression model in BigQuery ML
• Evaluate the performance of your machine learning model
• Extract your model weights
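The create, evaluate, and extract-weights steps can be illustrated with a single-feature least-squares fit in plain Python. This is a conceptual sketch, not BigQuery ML; the feature values and durations below are made-up toy data chosen to lie exactly on a line.

```python
def fit_linear_regression(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(xs, ys, slope, intercept):
    """Average absolute gap between predictions and labels."""
    errors = (abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))
    return sum(errors) / len(xs)


# Toy data (hypothetical): a numeric feature and trip duration in seconds.
feature = [5, 10, 15, 20]
duration = [1500, 1400, 1300, 1200]

slope, intercept = fit_linear_regression(feature, duration)
mae = mean_absolute_error(feature, duration, slope, intercept)
print("weights:", slope, intercept, "MAE:", mae)
```

The slope and intercept play the role of the model weights, and the mean absolute error is one of several metrics that could be used to evaluate the fit.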
Impact of number of bicycles
A potential feature is the number of bikes in the station. Perhaps, we hypothesize, people keep bicycles longer if there are fewer bicycles on rent at the station they rented from.
1. In the query editor window, paste the following query:
   SELECT
     bikes_count,
     AVG(duration) AS duration
   FROM `bigquery-public-data`.london_bicycles.cycle_hire
   JOIN `bigquery-public-data`.london_bicycles.cycle_stations
     ON cycle_hire.start_station_name = cycle_stations.name
   GROUP BY bikes_count
2. Visualize your data in Looker Studio.
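To see the shape of this query without a BigQuery project, the same JOIN and GROUP BY can be run against a small in-memory SQLite database. The table and column names follow the query above, while the station names and rows are invented sample data; an `ORDER BY` is added here only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cycle_stations (name TEXT, bikes_count INTEGER);
    CREATE TABLE cycle_hire (start_station_name TEXT, duration INTEGER);
    -- Invented sample rows, not taken from the public dataset.
    INSERT INTO cycle_stations VALUES ('Station A', 5), ('Station B', 20);
    INSERT INTO cycle_hire VALUES
        ('Station A', 1800), ('Station A', 1200),
        ('Station B', 600), ('Station B', 900);
    """
)

rows = conn.execute(
    """
    SELECT bikes_count, AVG(duration) AS duration
    FROM cycle_hire
    JOIN cycle_stations
      ON cycle_hire.start_station_name = cycle_stations.name
    GROUP BY bikes_count
    ORDER BY bikes_count
    """
).fetchall()
conn.close()

print(rows)  # [(5, 1500.0), (20, 750.0)]
```

In this toy data, hires from the station with fewer bikes last longer on average, which is the pattern the hypothesis predicts; the real dataset may or may not show it.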
Project 11: Data quality
Project Name: Improving Data Quality
This notebook introduced a few concepts to improve data quality. We resolved missing values, converted the Date feature column to a datetime format, renamed feature columns, removed a value from a feature column, created one-hot encoding features, and converted temporal features to meaningful representations. By the end of our lab, we gained an understanding as to why data should be "cleaned" and "pre-processed" before input into a machine learning model.
1. Data Quality Issue #1: Missing Values. Each feature column has multiple missing values. In fact, we have a total of 18 missing values.
2. Data Quality Issue #2: Date DataType. Date is shown as an "object" datatype and should be a datetime. In addition, Date is in one column. Our business requirement is to see the Date parsed out to year, month, and day.
3. Data Quality Issue #3: Model Year. We are only interested in years greater than 2006, not "<2006".
4. Data Quality Issue #4: Categorical Columns. The feature column "Light_Duty" is categorical and has a "Yes/No" choice. We cannot feed values like this into a machine learning model. In addition, we need to one-hot encode the remaining "string"/"object" columns.
5. Data Quality Issue #5: Temporal Features. How do we handle year, month, and day?
https://github.com/vijaymohire/datascience/blob/main/dataengg/improve_data_quality-Lab%206.ipynb
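The five issues above can be walked through on a handful of rows using only the standard library. This is a sketch under stated assumptions, not the notebook's code: the `Fuel` column, the sample rows, filling missing values with the most frequent value, and the sine/cosine encoding of the month are all choices made for this example.

```python
import math
from collections import Counter
from datetime import datetime

# Hypothetical rows; "Fuel" is an invented column, None marks a missing value.
rows = [
    {"Date": "10/1/2018", "Model_Year": "2015", "Fuel": "Gasoline", "Light_Duty": "Yes"},
    {"Date": "11/15/2018", "Model_Year": "<2006", "Fuel": "Diesel", "Light_Duty": "No"},
    {"Date": "12/3/2018", "Model_Year": "2010", "Fuel": None, "Light_Duty": "Yes"},
    {"Date": "12/20/2018", "Model_Year": "2012", "Fuel": "Gasoline", "Light_Duty": "No"},
]


def clean(rows):
    # Issue 1: fill missing Fuel values with the most frequent value (assumed policy).
    present = [r["Fuel"] for r in rows if r["Fuel"] is not None]
    fuel_fill = Counter(present).most_common(1)[0][0]
    fuel_values = sorted(set(present))
    cleaned = []
    for r in rows:
        # Issue 3: drop the "<2006" bucket, keep explicit model years only.
        if r["Model_Year"] == "<2006":
            continue
        # Issue 2: parse the Date string and split it into year, month, day.
        date = datetime.strptime(r["Date"], "%m/%d/%Y")
        fuel = r["Fuel"] if r["Fuel"] is not None else fuel_fill
        out = {
            "year": date.year,
            "month": date.month,
            "day": date.day,
            "model_year": int(r["Model_Year"]),
            # Issue 4: map the Yes/No column to 1/0.
            "light_duty": 1 if r["Light_Duty"] == "Yes" else 0,
            # Issue 5: encode the month cyclically so December sits next to January.
            "month_sin": math.sin(2 * math.pi * date.month / 12),
            "month_cos": math.cos(2 * math.pi * date.month / 12),
        }
        # Issue 4 (continued): one-hot encode the remaining string column.
        for value in fuel_values:
            out["fuel_" + value] = 1 if fuel == value else 0
        cleaned.append(out)
    return cleaned


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

With pandas, the same steps would typically be expressed with column-wise operations, but the order of the fixes stays the same.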
Project 12: Terraform Deployment
Use Terraform to deploy Google resources to servers in the US and EU. Create the required VPC network and security groups, and deploy resources such as VM instances and storage.
https://github.com/vijaymohire/gcp/blob/main/Terraform%20for%20GCP%20resources%20deployment%20Demo.pdf
Disclaimer:
We have sourced the content from various courses and partner trainings. All details and references are for educational purposes only.
• Google® is a trademark of the Google® Corporation. All logos, trademarks and brand names belong to their respective owners as specified. We have no intention to infringe any copyrights or alter related permissions set by the owners. Please refer to the source websites for any further details. This is for educational and informational purposes only.
For more details, contact:
Bhadale IT Pvt. Ltd; Email: vijaymohire@gmail.com
