Key projects in Data Science
March 30, 2024
This is a summary of my five years of hands-on work towards gaining the required experience and skills in Data Science
and Engineering. It includes key partner trainings from Google®.
Vijayananda Mohire
Project No. | Project Name | Project Summary / Learning
1 Olympic Medal Analysis
This project uses the Olympics.csv dataset from Kaggle. It provides details and insights on the medals won by
various players and countries. Below is one example.
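A medal tally of this kind can be sketched with the Python standard library alone. The column names (`Year`, `Country`, `Athlete`, `Medal`) and the sample rows below are assumptions made for illustration; the actual Olympics.csv layout may differ, and `medals_by_country` is a hypothetical helper rather than code from the project.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real Olympics.csv columns may differ.
SAMPLE_CSV = """Year,Country,Athlete,Medal
2008,USA,Athlete A,Gold
2008,USA,Athlete B,Silver
2008,China,Athlete C,Gold
2012,USA,Athlete A,Gold
2012,Jamaica,Athlete D,Gold
"""


def medals_by_country(csv_file, medal=None):
    """Count medals per country, optionally restricted to one medal type."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if medal is None or row["Medal"] == medal:
            counts[row["Country"]] += 1
    return counts


# Count gold medals only, then all medals.
gold_counts = medals_by_country(io.StringIO(SAMPLE_CSV), medal="Gold")
all_counts = medals_by_country(io.StringIO(SAMPLE_CSV))

for country, total in all_counts.most_common():
    print(country, total, "medals,", gold_counts[country], "gold")
```

To run it against a real file, an open file handle would be passed in place of the `io.StringIO` wrapper.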
2 Migrating from Spark to BigQuery via Dataproc
Migrated the original Spark code to Dataproc (lift-and-shift) and analyzed the Spark tasks: copying data to HDFS,
reading the CSV files, and running Spark analysis using DataFrames and Spark SQL.
3 Flight Departure Delay Analysis
This project provides insights into departure delays using Google's BigQuery, SQL queries, and DataFrame plots to visualize
the analysis. The dataset used came from Google's internal storage, under the dataset name
cloud-training-demos.airline_ontime_data.flights
4 Exploratory data analysis using BigQuery
EDA using linear regression with Python and Scikit-Learn, plus heatmaps, for predicting US house values and estimating
taxi fares.
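As a rough illustration of the regression and correlation steps named above, the sketch below fits an ordinary least-squares line and computes a Pearson correlation (the quantity a correlation heatmap displays for each pair of columns). The project itself used Scikit-Learn; this version uses only the standard library, the trip distances and fares are made-up values, and `fit_line` and `pearson` are hypothetical helper names.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Made-up taxi data: trip distance in km and fare paid.
trip_km = [1.0, 2.0, 3.0, 4.0]
fare = [4.5, 6.5, 8.5, 10.5]

slope, intercept = fit_line(trip_km, fare)
correlation = pearson(trip_km, fare)
predicted_fare = slope * 5.0 + intercept  # estimate for a 5 km trip
```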
5 Exploring and Creating an Ecommerce Analytics Pipeline with Cloud Dataprep v1.5 / Data wrangling / cleansing
Cloud Dataprep® by Trifacta® is an intelligent data service for visually exploring, cleaning, and preparing structured
and unstructured data for analysis. In this lab we will explore the Cloud Dataprep UI to build an ecommerce
transformation pipeline that will run at a scheduled interval and output results back into BigQuery.
The dataset we will be using is an ecommerce dataset that has millions of Google® Analytics records for the Google®
Merchandise Store loaded into BigQuery.
In this lab, you learn how to perform these tasks:
• Connect BigQuery datasets to Cloud Dataprep
• Explore dataset quality with Cloud Dataprep
• Create a data transformation pipeline with Cloud Dataprep
• Schedule transformation job outputs to BigQuery
[Figures: Analytics Pipeline; Recipe Book with rules; Results with duplicate rows removed]
6 Advanced Visualizations with TensorFlow Data Validation
This lab illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize your dataset. That
includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift
and skew in our dataset.
First we'll use `tfdv.generate_statistics_from_csv` to compute statistics for our training data.
TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are
present and the shapes of their value distributions. Now let's use `tfdv.infer_schema` to create a schema for our
data.
Does our evaluation dataset match the schema from our training dataset? This is especially important for categorical
features, where we want to identify the range of acceptable values.
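The idea of checking evaluation data against a schema inferred from training data can be sketched in plain Python. This is not TFDV's API or its algorithm: `infer_schema` and `find_anomalies` below are hypothetical stand-ins that only track the set of values seen per categorical feature, and the feature names and rows are invented for illustration.

```python
def infer_schema(rows):
    """Record, per feature, the set of values seen in the training rows."""
    schema = {}
    for row in rows:
        for feature, value in row.items():
            schema.setdefault(feature, set()).add(value)
    return schema


def find_anomalies(rows, schema):
    """Return {feature: unexpected values} for rows checked against schema."""
    anomalies = {}
    for row in rows:
        for feature, value in row.items():
            if feature not in schema:
                anomalies.setdefault(feature, set()).add("<unknown feature>")
            elif value not in schema[feature]:
                anomalies.setdefault(feature, set()).add(value)
    return anomalies


# Invented categorical data: the evaluation set contains two unseen values.
train_rows = [
    {"payment_type": "Cash", "company": "A"},
    {"payment_type": "Credit Card", "company": "B"},
]
eval_rows = [
    {"payment_type": "Cash", "company": "A"},
    {"payment_type": "Prcard", "company": "C"},
]

schema = infer_schema(train_rows)
anomalies = find_anomalies(eval_rows, schema)
print(anomalies)
```

An anomaly found this way can be resolved either by fixing the data or by deliberately widening the schema to accept the new value.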
Drift detection is supported for categorical features and between consecutive spans of data (i.e., between span N and
span N+1), such as between different days of training data. We express drift in terms of L-infinity distance, and
you can set the threshold distance so that you receive warnings when the drift is higher than is acceptable.
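For a categorical feature, the L-infinity distance between two spans is the largest absolute difference between the normalized frequencies of any single value. The sketch below computes it with the standard library; the two spans and the 0.1 threshold are made-up values, and the helper names are hypothetical rather than part of TFDV.

```python
from collections import Counter


def distribution(values):
    """Normalized frequency of each category in a span of data."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute difference in frequency over all categories."""
    categories = set(p) | set(q)
    return max(
        (abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories),
        default=0.0,
    )


# Two consecutive spans (e.g. two days) of one categorical feature.
day_1 = ["Cash"] * 6 + ["Card"] * 4
day_2 = ["Cash"] * 3 + ["Card"] * 7

distance = l_infinity_distance(distribution(day_1), distribution(day_2))

THRESHOLD = 0.1  # hypothetical acceptable drift
drift_detected = distance > THRESHOLD
```

Here the share of "Cash" falls from 0.6 to 0.3, so the distance is about 0.3 and the check reports drift.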
Skew and drift comparators can be added to visualize problems and make corrections. A few of the uses are:
1. Validating new data for inference to make sure that we haven't suddenly started receiving bad features
2. Validating new data for inference to make sure that our model has trained on that part of the decision surface
3. Validating our data after we've transformed it and done feature engineering (probably using TensorFlow
Transform) to make sure we haven't done something wrong
7 TPU Speed Data Pipelines
TPUs are very fast, and the stream of training data must keep up with their training speed. In this lab, you will learn
how to load data from Cloud Storage with the tf.data.Dataset API to feed your TPU.
You will learn:
• To use the tf.data.Dataset API to load training data.
• To use TFRecord format to load training data efficiently from Cloud Storage.
8 Data Pipelines – design and deploy
A Kubeflow pipeline on Google Cloud. This project offers three components that produce different outputs, which can be
combined to provide a final response to the consumer.
https://github.com/vijaymohire/gcp/blob/main/MyPipeExample.ipynb
https://github.com/vijaymohire/gcp/blob/main/KubeflowpipelineRun.png
9 Data ingestion, ETL
Project name: Serverless Data Processing with Dataflow - Writing an ETL Pipeline using Apache Beam and Cloud
Dataflow (Python)
In this lab, you will learn how to:
• Build a batch Extract-Transform-Load pipeline in Apache Beam, which takes raw data from Google Cloud
Storage and writes it to Google BigQuery.
• Run the Apache Beam pipeline on Cloud Dataflow.
• Parameterize the execution of the pipeline.
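The overall shape of a parameterized batch ETL job can be shown without any cloud dependencies. The lab used Apache Beam on Cloud Dataflow; the sketch below is not Beam code. It is a plain-Python stand-in in which a list plays the role of the BigQuery table, the source lines are passed in directly instead of being read from Cloud Storage, and every flag name, field name, and path is an assumption made for illustration.

```python
import argparse
import json


def parse_args(argv=None):
    """Command-line parameters; all flag names here are hypothetical."""
    parser = argparse.ArgumentParser(description="Toy batch ETL job")
    parser.add_argument("--input", required=True, help="source path")
    parser.add_argument("--output_table", required=True, help="target table")
    parser.add_argument("--min_bytes", type=int, default=0,
                        help="drop rows smaller than this")
    return parser.parse_args(argv)


def extract(lines):
    """Yield the non-empty raw lines of the source."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def transform(line):
    """Parse one JSON line into a row with a fixed set of typed fields."""
    record = json.loads(line)
    return {"user_id": record["user_id"], "num_bytes": int(record["num_bytes"])}


def load(rows, table):
    """Append rows to the in-memory stand-in for the destination table."""
    table.extend(rows)
    return len(rows)


def run(argv, source_lines, table):
    """Wire the three stages together; source_lines stands in for args.input."""
    args = parse_args(argv)
    rows = [transform(line) for line in extract(source_lines)]
    rows = [row for row in rows if row["num_bytes"] >= args.min_bytes]
    return args, load(rows, table)


raw = [
    '{"user_id": "u1", "num_bytes": "120"}',
    "",
    '{"user_id": "u2", "num_bytes": "30"}',
]
table = []
args, loaded = run(
    ["--input", "gs://example-bucket/events.json",
     "--output_table", "example_dataset.logs",
     "--min_bytes", "50"],
    raw,
    table,
)
```

Keeping the source, destination, and filter threshold as arguments means the same job can be rerun against different inputs without code changes.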
10 Feature Engineering
Project: Predict Bike Trip Duration with a Regression Model in BQML 2.5
In this lab, you learn to perform the following tasks:
• Query and explore the London bicycles dataset for feature engineering
• Create a linear regression model in BigQuery ML
• Evaluate the performance of your machine learning model
• Extract your model weights
Impact of number of bicycles
A potential feature is the number of bikes in the station. Perhaps, we hypothesize, people keep bicycles longer if
there are fewer bicycles on rent at the station they rented from.
1. In the query editor window, paste the following query:
   SELECT
     bikes_count,
     AVG(duration) AS duration
   FROM
     `bigquery-public-data`.london_bicycles.cycle_hire
   JOIN
     `bigquery-public-data`.london_bicycles.cycle_stations
   ON
     cycle_hire.start_station_name = cycle_stations.name
   GROUP BY
     bikes_count
2. Visualize your data in Looker Studio.
11 Data quality
Project Name: Improving Data Quality
This notebook introduced a few concepts to improve data quality. We resolved missing values, converted the Date
feature column to a datetime format, renamed feature columns, removed a value from a feature column, created
one-hot encoding features, and converted temporal features to meaningful representations. By the end of the lab, we
had gained an understanding of why data should be "cleaned" and "pre-processed" before input into a machine
learning model.
1. Data Quality Issue #1: Missing Values. Each feature column has multiple missing values. In fact, we have a total
of 18 missing values.
2. Data Quality Issue #2: Date DataType. Date is shown as an "object" datatype and should be a datetime. In addition,
Date is in one column. Our business requirement is to see the Date parsed out to year, month, and day.
3. Data Quality Issue #3: Model Year. We are only interested in years greater than 2006, not "<2006".
4. Data Quality Issue #4: Categorical Columns. The feature column "Light_Duty" is categorical and has a "Yes/No"
choice. We cannot feed values like this into a machine learning model. In addition, we need to one-hot encode the
remaining "string"/"object" columns.
5. Data Quality Issue #5: Temporal Features. How do we handle year, month, and day?
https://github.com/vijaymohire/datascience/blob/main/dataengg/improve_data_quality-Lab%206.ipynb
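Several of the fixes listed above can be sketched on a toy table using only the standard library. The linked notebook is the reference for what was actually done; the rows below are invented, filling gaps with the most frequent value and encoding the month as a sine/cosine pair are assumptions about how issues #1 and #5 might be handled, and the date format is assumed to be month/day/year.

```python
import math
from collections import Counter
from datetime import datetime


def most_common(values):
    """Most frequent non-missing value, used here to fill gaps (assumption)."""
    return Counter(v for v in values if v is not None).most_common(1)[0][0]


def cyclical(value, period):
    """Map a periodic value (e.g. month 1-12) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def clean(rows):
    fill = most_common(r["Light_Duty"] for r in rows)
    cleaned = []
    for r in rows:
        if r["Model Year"] == "<2006":  # Issue 3: drop the "<2006" bucket
            continue
        # Issue 1: replace a missing value with the most frequent one.
        light = r["Light_Duty"] if r["Light_Duty"] is not None else fill
        # Issue 2: parse the date string and split it into parts.
        d = datetime.strptime(r["Date"], "%m/%d/%Y")
        # Issue 5: encode the month so that December sits next to January.
        month_sin, month_cos = cyclical(d.month, 12)
        cleaned.append({
            "year": d.year, "month": d.month, "day": d.day,
            "model_year": int(r["Model Year"]),
            "light_duty": 1 if light == "Yes" else 0,  # Issue 4: Yes/No to 1/0
            "month_sin": month_sin, "month_cos": month_cos,
        })
    return cleaned


# Invented rows using the column names mentioned in the issue list.
rows = [
    {"Date": "10/1/2018", "Light_Duty": "Yes", "Model Year": "2007"},
    {"Date": "12/15/2018", "Light_Duty": None, "Model Year": "2010"},
    {"Date": "3/5/2019", "Light_Duty": "No", "Model Year": "<2006"},
    {"Date": "3/6/2019", "Light_Duty": "Yes", "Model Year": "2012"},
]
cleaned = clean(rows)
```

One-hot encoding of the remaining string columns is left out of this sketch to keep it short.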
12 Terraform Deployment
Used Terraform to deploy Google resources to servers in the US and EU: created the required VPC network and security
groups, and deployed resources such as VM instances and storage.
https://github.com/vijaymohire/gcp/blob/main/Terraform%20for%20GCP%20resources%20deployment%20Demo.pdf
Disclaimer:
We have sourced the content from various courses and partner trainings. All details and references are for educational purposes only.
• Google® is a trademark of the Google® Corporation. All logos, trademarks and brand names belong to the respective owners as specified. We
have no intention to infringe any copyrights or alter related permissions set by the owners. Please refer to the source websites for any further
details. This is for educational and informational purposes only.
For more details, contact:
Bhadale IT Pvt. Ltd; Email: vijaymohire@gmail.com

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Key projects Data Science and Engineering

  • 1. Key projects in Data Science. March 30, 2024. This is a summary of my five years of hands-on work towards gaining the required experience and skills in Data Science and Engineering. It includes key partner trainings from Google®. Vijayananda Mohire
  • 2. Key projects in Data Science (columns: Project No. / Project Name / Project Summary / Learning)
    Project 1: Olympic Medal Analysis. This project uses the Olympics.csv dataset from Kaggle. It provides details and insights on the medals won by individual players and by countries. Below is one example.
  • 3.
  • 4.
  • 5. Project 2: Migrating from Spark to BigQuery via Dataproc. Migrating the original Spark code to Dataproc (lift-and-shift) and analyzing the Spark tasks: copying data to HDFS, reading the CSV files, and running Spark analysis using DataFrames and Spark SQL.
  • 6. Project 3: Flight Departure Delay Analysis. This project provides insights into departure delays using Google's BigQuery, SQL queries, and DataFrame plots to visualize the analysis. The dataset came from Google's internal storage, under the name cloud-training-demos.airline_ontime_data.flights
  • 7. Project 4: Exploratory Data Analysis using BigQuery. EDA with linear regression in Python and Scikit-Learn, plus heatmaps, for predicting US house values and estimating taxi fares.
  • 8. Project 5: Exploring and Creating an Ecommerce Analytics Pipeline with Cloud Dataprep v1.5 (data wrangling / cleansing)
    Cloud Dataprep® by Trifacta® is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. In this lab we explore the Cloud Dataprep UI to build an ecommerce transformation pipeline that runs at a scheduled interval and outputs results back into BigQuery.
    The dataset is an ecommerce dataset with millions of Google® Analytics records for the Google® Merchandise Store, loaded into BigQuery.
    In this lab, you learn how to perform these tasks:
      • Connect BigQuery datasets to Cloud Dataprep
      • Explore dataset quality with Cloud Dataprep
      • Create a data transformation pipeline with Cloud Dataprep
      • Schedule transformation job outputs to BigQuery
  • 10. Results with duplicate rows removed
    Project 6: Advanced Visualizations with TensorFlow Data Validation
    This lab illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize a dataset. That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in the dataset.
    First, we use `tfdv.generate_statistics_from_csv` to compute statistics for the training data. TFDV can compute descriptive statistics that give a quick overview of the data in terms of the features that are present and the shapes of their value distributions.
    Next, we use `tfdv.infer_schema` to create a schema for the data, and then check whether the evaluation dataset matches the schema from the training dataset. This is especially important for categorical features, where we want to identify the range of acceptable values.
    Drift detection is supported for categorical features and between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. Drift is expressed in terms of L-infinity distance, and you can set the threshold distance so that you receive warnings when the drift is higher than is acceptable.
    Skew and drift comparators can be added to visualize these issues and make corrections. A few of the uses are:
      1. Validating new data for inference to make sure that we haven't suddenly started receiving bad features
      2. Validating new data for inference to make sure that our model has trained on that part of the decision surface
  • 11.   3. Validating our data after we've transformed it and done feature engineering (probably using TensorFlow Transform) to make sure we haven't done something wrong
    Project 7: TPU Speed Data Pipelines
    TPUs are very fast, and the stream of training data must keep up with their training speed. In this lab, you learn how to load data from Cloud Storage with the tf.data.Dataset API to feed your TPU. You will learn:
      • To use the tf.data.Dataset API to load training data.
      • To use the TFRecord format to load training data efficiently from Cloud Storage.
    Project 8: Data Pipelines: design and deploy a Kubeflow pipeline on Google Cloud
    This project offers three components that produce different outputs, which can be combined to provide a final response to the consumer.
    https://github.com/vijaymohire/gcp/blob/main/MyPipeExample.ipynb
    https://github.com/vijaymohire/gcp/blob/main/KubeflowpipelineRun.png
    Project 9: Data ingestion, ETL
    Project name: Serverless Data Processing with Dataflow - Writing an ETL Pipeline using Apache Beam and Cloud Dataflow (Python)
    In this lab, you will learn how to:
  • 12.
      • Build a batch Extract-Transform-Load pipeline in Apache Beam, which takes raw data from Google Cloud Storage and writes it to Google BigQuery.
      • Run the Apache Beam pipeline on Cloud Dataflow.
      • Parameterize the execution of the pipeline.
    Project 10: Feature Engineering
    Project: Predict Bike Trip Duration with a Regression Model in BQML 2.5
    In this lab, you learn to perform the following tasks:
      • Query and explore the London bicycles dataset for feature engineering
      • Create a linear regression model in BigQuery ML
      • Evaluate the performance of your machine learning model
      • Extract your model weights
  • 13. Impact of number of bicycles
    A potential feature is the number of bikes in the station. Perhaps, we hypothesize, people keep bicycles longer if there are fewer bicycles on rent at the station they rented from.
      1. In the query editor window, paste the following query:
           SELECT bikes_count, AVG(duration) AS duration
           FROM `bigquery-public-data`.london_bicycles.cycle_hire
           JOIN `bigquery-public-data`.london_bicycles.cycle_stations
             ON cycle_hire.start_station_name = cycle_stations.name
           GROUP BY bikes_count
      2. Visualize your data in Looker Studio.
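To see what the join and aggregation in the query above compute, the same shape of query can be run against a tiny in-memory SQLite database. This is a sketch under stated assumptions: the two tables keep only the columns the query touches, the rows are invented, the BigQuery project and dataset qualifiers are dropped because SQLite does not use them, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy stand-ins for the two BigQuery tables; all rows are made up.
conn.executescript("""
CREATE TABLE cycle_stations (name TEXT PRIMARY KEY, bikes_count INTEGER);
CREATE TABLE cycle_hire (start_station_name TEXT, duration INTEGER);
INSERT INTO cycle_stations VALUES ('Station A', 5), ('Station B', 20);
INSERT INTO cycle_hire VALUES
  ('Station A', 1200), ('Station A', 1800),
  ('Station B', 600), ('Station B', 900), ('Station B', 300);
""")

# Same join and grouping as the lab query: average trip duration
# for each distinct bikes_count of the start station.
query = """
SELECT bikes_count, AVG(duration) AS duration
FROM cycle_hire
JOIN cycle_stations
  ON cycle_hire.start_station_name = cycle_stations.name
GROUP BY bikes_count
ORDER BY bikes_count
"""
rows = conn.execute(query).fetchall()
for bikes_count, avg_duration in rows:
    print(bikes_count, avg_duration)
conn.close()
```

In this toy data the station with fewer bikes has the longer average trip, which is the pattern the hypothesis predicts; whether the real dataset shows it is what the lab query checks.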
  • 14. Project 11: Data quality
    Project Name: Improving Data Quality
    This notebook introduced a few concepts for improving data quality. We resolved missing values, converted the Date feature column to a datetime format, renamed feature columns, removed a value from a feature column, created one-hot encoding features, and converted temporal features to meaningful representations. By the end of the lab, we understood why data should be "cleaned" and "pre-processed" before it is input into a machine learning model.
      1. Data Quality Issue #1, Missing Values: Each feature column has multiple missing values. In fact, we have a total of 18 missing values.
      2. Data Quality Issue #2, Date DataType: Date is shown as an "object" datatype and should be a datetime. In addition, Date is in one column. Our business requirement is to see the Date parsed out to year, month, and day.
      3. Data Quality Issue #3, Model Year: We are only interested in years greater than 2006, not "<2006".
      4. Data Quality Issue #4, Categorical Columns: The feature column "Light_Duty" is categorical and has a "Yes/No" choice. We cannot feed values like this into a machine learning model. In addition, we need to one-hot encode the remaining "string"/"object" columns.
      5. Data Quality Issue #5, Temporal Features: How do we handle year, month, and day?
    https://github.com/vijaymohire/datascience/blob/main/dataengg/improve_data_quality-Lab%206.ipynb
    Project 12: Terraform Deployment
    Use Terraform to deploy Google resources to servers in the US and EU. Create the required VPC network and security groups, and deploy resources such as VM instances and storage.
  • 15. https://github.com/vijaymohire/gcp/blob/main/Terraform%20for%20GCP%20resources%20deployment%20Demo.pdf
    Disclaimer: We have sourced the content from various courses and partner trainings. All details and references are for educational purposes only.
      • Google® is a trademark of the Google® Corporation. All logos, trademarks and brand names belong to the respective owners as specified. We have no intention to infringe any copyrights or alter related permissions set by the owners. Please refer to the source websites for any further details. This is for educational and informational purposes only.
    For more details, contact: Bhadale IT Pvt. Ltd; Email: vijaymohire@gmail.com