This is project work from our Data Science startup, produced as part of our internal training. It focuses on data management for AI, ML, and generative AI applications.
Key Projects: Data Science and Engineering
Key Projects in Data Science
March 30, 2024
This is a summary of my five years of hands-on work toward acquiring the required experience and skills in Data Science and Engineering, including key partner trainings from Google®.
Vijayananda Mohire
Key Projects in Data Science
**Project 1: Olympic Medal Analysis**
This project uses the Olympics.csv dataset from Kaggle. It provides various details and insights into the medals won by players and countries. Below is one example.
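A minimal pandas sketch of this kind of medal aggregation. The inline toy data stands in for Olympics.csv, and the column names `Team` and `Medal` are assumptions about the Kaggle file's schema, not taken from the original analysis:

```python
import pandas as pd

# Tiny in-memory stand-in for Olympics.csv; real column names may differ.
df = pd.DataFrame({
    "Team":  ["USA", "USA", "China", "Kenya", "China"],
    "Medal": ["Gold", "Silver", "Gold", "Bronze", None],  # None = no medal won
})

# Count medals per country, ignoring rows where no medal was won.
medal_counts = (
    df.dropna(subset=["Medal"])
      .groupby("Team")["Medal"]
      .count()
      .sort_values(ascending=False)
)
print(medal_counts.to_dict())  # e.g. {'USA': 2, 'China': 1, 'Kenya': 1}
```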
**Project 2: Migrating from Spark to BigQuery via Dataproc**
Migrating the original Spark code to Dataproc (lift-and-shift) and analyzing the Spark tasks: copying data to HDFS, reading the CSV files, and performing Spark analysis using DataFrames and Spark SQL.
**Project 3: Flight Departure Delay Analysis**
This project provides insights into departure delays using Google BigQuery, SQL queries, and DataFrame plots to visualize the analysis. The dataset used was Google's internal storage dataset cloud-training-demos.airline_ontime_data.flights.
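A query of the kind used in this analysis might look like the following sketch. The column names `departure_airport` and `departure_delay` are assumptions about this dataset's schema:

```sql
-- Average departure delay per departure airport (illustrative).
SELECT
  departure_airport,
  AVG(departure_delay) AS avg_delay,
  COUNT(*) AS num_flights
FROM
  `cloud-training-demos.airline_ontime_data.flights`
GROUP BY
  departure_airport
ORDER BY
  avg_delay DESC
LIMIT 10
```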
**Project 4: Exploratory Data Analysis using BigQuery**
EDA with linear regression using Python and Scikit-Learn, plus heatmaps, for predicting US house values and estimating taxi fares.
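The regression step can be sketched with Scikit-Learn on synthetic data. This is a minimal illustration with an exact linear relation, not the actual housing or taxi datasets:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data: value depends linearly on room count.
rooms = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = rooms.reshape(-1, 1)
y = 100.0 * rooms + 20.0  # exact linear relation, so the fit recovers it

model = LinearRegression().fit(X, y)
pred = model.predict(np.array([[5.0]]))[0]
print(round(float(model.coef_[0]), 2), round(float(pred), 2))  # 100.0 520.0
```

In the lab, correlations rendered as a heatmap (for example, `seaborn.heatmap` over `df.corr()`) guide which features to feed such a model.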
**Project 5: Exploring and Creating an Ecommerce Analytics Pipeline with Cloud Dataprep v1.5 (Data Wrangling / Cleansing)**
Cloud Dataprep® by Trifacta® is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. In this lab we explore the Cloud Dataprep UI to build an ecommerce transformation pipeline that runs at a scheduled interval and outputs results back into BigQuery. The dataset is an ecommerce dataset with millions of Google® Analytics records for the Google® Merchandise Store loaded into BigQuery.
In this lab, you learn how to perform these tasks:
- Connect BigQuery datasets to Cloud Dataprep
- Explore dataset quality with Cloud Dataprep
- Create a data transformation pipeline with Cloud Dataprep
- Schedule transformation jobs that output to BigQuery
Results with duplicate rows removed
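Duplicate removal of the kind shown in these results can also be expressed directly in BigQuery SQL. This is a sketch: the table name and key columns (`transaction_id`, `ingest_time`) are placeholders, not the lab's real schema:

```sql
-- Keep one row per assumed business key (placeholders, not the lab's schema).
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY transaction_id
      ORDER BY ingest_time DESC
    ) AS row_num
  FROM `project.dataset.ecommerce_raw`
)
WHERE row_num = 1
```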
**Project 6: Advanced Visualizations with TensorFlow Data Validation**
This lab illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize your dataset, including looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in the dataset.
First we use `tfdv.generate_statistics_from_csv` to compute statistics for our training data. TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions. Then we use `tfdv.infer_schema` to create a schema for our data.
Does our evaluation dataset match the schema from our training dataset? This is especially important for categorical features, where we want to identify the range of acceptable values.
Drift detection is supported for categorical features between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of L-infinity distance, and you can set the threshold distance so that you receive warnings when the drift is higher than is acceptable.
We add skew and drift comparators to visualize and make corrections. A few of the uses are:
1. Validating new data for inference to make sure that we haven't suddenly started receiving bad features
2. Validating new data for inference to make sure that our model has trained on that part of the decision surface
3. Validating our data after we've transformed it and done feature engineering (probably using TensorFlow Transform) to make sure we haven't done something wrong
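TFDV computes the categorical drift metric internally; a minimal standalone illustration of L-infinity distance between two spans' value distributions (not TFDV's own code) looks like this:

```python
def l_infinity_drift(counts_a, counts_b):
    """L-infinity distance between two categorical frequency distributions.

    counts_a / counts_b map category -> raw count for two consecutive spans
    (e.g. day N and day N+1 of training data).
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    categories = set(counts_a) | set(counts_b)
    return max(
        abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b)
        for c in categories
    )

day_n = {"red": 50, "blue": 50}
day_n1 = {"red": 80, "blue": 15, "green": 5}
drift = l_infinity_drift(day_n, day_n1)
print(drift)  # 0.35 -> above a threshold like 0.01, so a drift warning would fire
```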
**Project 7: TPU-Speed Data Pipelines**
TPUs are very fast, and the stream of training data must keep up with their training speed. In this lab, you will learn how to load data from Cloud Storage with the tf.data.Dataset API to feed your TPU.
You will learn:
- How to use the tf.data.Dataset API to load training data
- How to use the TFRecord format to load training data efficiently from Cloud Storage
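A minimal sketch of the TFRecord round trip, using a local temp file where the lab uses a `gs://` Cloud Storage path (the feature name `value` is an arbitrary placeholder):

```python
import os
import tempfile

import tensorflow as tf

# Write a few records to a TFRecord file (local here; a gs:// path works the same).
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(5):
        example = tf.train.Example(features=tf.train.Features(feature={
            "value": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
        }))
        writer.write(example.SerializeToString())

# Read them back as a tf.data.Dataset, the input format TPUs consume efficiently.
def parse(record):
    return tf.io.parse_single_example(
        record, {"value": tf.io.FixedLenFeature([], tf.int64)}
    )["value"]

dataset = tf.data.TFRecordDataset(path).map(parse).batch(5)
values = [int(v) for v in next(iter(dataset)).numpy()]
print(values)  # [0, 1, 2, 3, 4]
```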
**Project 8: Data Pipelines – Design and Deploy**
A Kubeflow pipeline on Google Cloud. This project offers three components that produce different outputs, which can be combined to provide a final response to the consumer.
https://github.com/vijaymohire/gcp/blob/main/MyPipeExample.ipynb
https://github.com/vijaymohire/gcp/blob/main/KubeflowpipelineRun.png
**Project 9: Data Ingestion and ETL**
Project name: Serverless Data Processing with Dataflow – Writing an ETL Pipeline using Apache Beam and Cloud Dataflow (Python)
In this lab, you will learn how to:
- Build a batch Extract-Transform-Load pipeline in Apache Beam that takes raw data from Google Cloud Storage and writes it to Google BigQuery
- Run the Apache Beam pipeline on Cloud Dataflow
- Parameterize the execution of the pipeline
**Project 10: Feature Engineering**
Project: Predict Bike Trip Duration with a Regression Model in BQML 2.5
In this lab, you learn to perform the following tasks:
- Query and explore the London bicycles dataset for feature engineering
- Create a linear regression model in BigQuery ML
- Evaluate the performance of your machine learning model
- Extract your model weights
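A BigQuery ML linear regression of the kind created in this lab might look like the following sketch; the engineered feature list is illustrative, and `bqml_tutorial` is a placeholder dataset name:

```sql
CREATE OR REPLACE MODEL `bqml_tutorial.bike_duration_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['duration']) AS
SELECT
  duration,
  start_station_name,
  EXTRACT(DAYOFWEEK FROM start_date) AS dayofweek,
  EXTRACT(HOUR FROM start_date) AS hourofday
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
```

Performance and weights can then be inspected with `ML.EVALUATE` and `ML.WEIGHTS`.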
Impact of number of bicycles
A potential feature is the number of bikes in the station. Perhaps, we hypothesize, people keep bicycles longer if there are fewer bicycles on rent at the station they rented from.
1. In the query editor window, paste the following query:

```sql
SELECT
  bikes_count,
  AVG(duration) AS duration
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire
JOIN
  `bigquery-public-data`.london_bicycles.cycle_stations
ON
  cycle_hire.start_station_name = cycle_stations.name
GROUP BY
  bikes_count
```

2. Visualize your data in Looker Studio.
**Project 11: Data Quality**
Project Name: Improving Data Quality
This notebook introduced a few concepts to improve data quality. We resolved missing values, converted the Date feature column to a datetime format, renamed feature columns, removed a value from a feature column, created one-hot encoded features, and converted temporal features to meaningful representations. By the end of the lab, we gained an understanding of why data should be "cleaned" and "pre-processed" before input into a machine learning model.
1. **Data Quality Issue #1**:
> **Missing Values**: Each feature column has multiple missing values. In fact, we have a total of 18 missing values.
2. **Data Quality Issue #2**:
> **Date DataType**: Date is shown as an "object" datatype and should be a datetime. In addition, Date is in one column. Our business requirement is to see the Date parsed out to year, month, and day.
3. **Data Quality Issue #3**:
> **Model Year**: We are only interested in years greater than 2006, not "<2006".
4. **Data Quality Issue #4**:
> **Categorical Columns**: The feature column "Light_Duty" is categorical and has a "Yes/No" choice. We cannot feed values like this into a machine learning model. In addition, we need to one-hot encode the remaining "string"/"object" columns.
5. **Data Quality Issue #5**:
> **Temporal Features**: How do we handle year, month, and day?
https://github.com/vijaymohire/datascience/blob/main/dataengg/improve_data_quality-Lab%206.ipynb
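The cleaning steps above can be sketched with pandas on toy data. The `Date`, `Model_Year`, and `Light_Duty` columns mirror the issues listed, but the values are invented for illustration:

```python
import pandas as pd

# Toy frame exhibiting the issue types from the lab.
df = pd.DataFrame({
    "Date": ["2007-01-15", "2008-03-02", None],
    "Model_Year": ["<2006", "2007", "2008"],
    "Light_Duty": ["Yes", "No", "Yes"],
})

# Issue 1: resolve missing values (here: drop rows missing Date).
df = df.dropna(subset=["Date"]).copy()

# Issue 2: convert Date to datetime and parse out year, month, and day.
df["Date"] = pd.to_datetime(df["Date"])
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
df["day"] = df["Date"].dt.day

# Issue 3: keep only model years greater than 2006.
df = df[df["Model_Year"] != "<2006"].copy()
df["Model_Year"] = df["Model_Year"].astype(int)

# Issue 4: one-hot encode the categorical Yes/No column.
df = pd.get_dummies(df, columns=["Light_Duty"])

print(sorted(df.columns))
```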
**Project 12: Terraform Deployment**
Use Terraform to deploy Google Cloud resources to servers in the US and EU: create the required VPC network and security groups, and deploy resources such as VM instances and storage.
https://github.com/vijaymohire/gcp/blob/main/Terraform%20for%20GCP%20resources%20deployment%20Demo.pdf
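A minimal Terraform sketch of such a multi-region deployment. The resource names, regions, and project ID are hypothetical placeholders, not the configuration from the linked demo:

```hcl
# Placeholder project; the two subnets mirror the US/EU split described above.
provider "google" {
  project = "my-project-id"
}

resource "google_compute_network" "vpc" {
  name                    = "demo-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "us" {
  name          = "demo-subnet-us"
  region        = "us-central1"
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.0.1.0/24"
}

resource "google_compute_subnetwork" "eu" {
  name          = "demo-subnet-eu"
  region        = "europe-west1"
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.0.2.0/24"
}

resource "google_compute_instance" "vm_us" {
  name         = "demo-vm-us"
  machine_type = "e2-micro"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    subnetwork = google_compute_subnetwork.us.id
  }
}
```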
Disclaimer:
We have sourced the content from various courses and partner trainings. All details and references are for educational purposes only.
• Google® is a trademark of Google LLC. All logos, trademarks, and brand names belong to their respective owners as specified. We have no intention of infringing any copyrights or altering related permissions set by the owners. Please refer to the source websites for further details. This document is for educational and informational purposes only.
For more details, contact:
Bhadale IT Pvt. Ltd; Email: vijaymohire@gmail.com