Data science Git management
Arindam Banerjee
Data Scientist, Ericsson
8/7/2020 Arindam Banerjee 1
Version Control Systems (VCS)
• Version control is indispensable for any kind of development.
• Crucial for collaboration and teamwork.
• A VCS records changes (revisions) to a file or set of files over time so that a
specific version can be recalled later.
• Part of Software Configuration Management (SCM).
• Revisions are generally thought of as a directed tree; once merges are included,
the history forms a directed acyclic graph.
• Revisions occur in sequence over time and can be arranged in order by revision
number or timestamp.
• Examples: Git, Subversion, Mercurial, CVS, etc.
Git
• “Git is a distributed version-control system for tracking changes
in source code during software development.” — Wikipedia
• The most widely recognized and used modern version control system.
• Git is free and open-source software distributed under the GNU General
Public License version 2.
• Git is a mature, actively maintained open source project originally
developed in 2005 by Linus Torvalds.
• Designed with performance, security and flexibility in mind.
Characteristics
• Supports non-linear development: rapid branching and merging, plus tools for
navigating a non-linear development history.
• Distributed: each developer has a full local copy of the repository alongside the
remote version.
• Fast and scalable.
• Toolkit-based design: a set of programs written in C and shell scripts for speed
and portability.
• Primarily developed on Linux, but also supports macOS and Windows.
GitHub
• GitHub.com, founded in 2008, is the best-known and most widely used
cloud-based version control platform.
• It has been a subsidiary of Microsoft since 2018.
• Project files are stored at a remote cloud location called a repository.
• Once local changes are pushed to GitHub, the remote version is
updated.
• Every change is logged as a recorded commit.
• Allows rollback to a previous version.
• Anyone can access and download a public remote repository.
• Only registered users can contribute, discuss, manage
repositories, and review code changes.
Branching
• One of the most important and efficient features of Git.
• A branch acts like a temporary copy of the project where changes can be made
first without breaking anything in the production version.
For Data Scientists
• Software engineers, developers, and data scientists use GitHub to store
their work, read documentation of others' work, and collaborate across
teams and organizations.
• For incremental development and deployment.
• CI/CD pipeline - DevOps operations.
• Faster release cycle - agile workflow with frequent smaller changes.
• Putting better and retrained models into production.
Creating a git repo
• Set up your project directory the way you want it.
• Run git init inside the top-level directory of your project.
• This creates a directory called .git in that location.
• This is where Git records all the commits and keeps track of everything.
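The steps above can be sketched as follows. This is a minimal illustration, assuming a throwaway temporary directory; the name my_project is a placeholder, not something from the slides.

```shell
# Hypothetical project location: a throwaway temp directory for illustration.
project_dir="$(mktemp -d)/my_project"
mkdir -p "$project_dir"
cd "$project_dir"

# Turn the directory into a Git repository.
git init

# List hidden entries: the new .git directory holds the repository's data.
ls -a
```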
Cloning a git repo
• Instead of creating a repo from scratch, one can clone an existing repo
from a remote location.
• Use the git clone command with the URL of the remote repository.
• git clone https://github.com/link/to/your/remote/repo
• Cloning creates a new directory and places the cloned Git repository in
it.
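A small sketch of cloning, assuming a local bare repository stands in for the remote so the example runs without network access. The names remote_repo.git and my_copy are made up for illustration; in practice the argument would be an https or ssh URL.

```shell
# Stand-in for a remote: an empty bare repository in a temp directory
# (assumption; a real workflow would pass a GitHub URL instead).
workdir="$(mktemp -d)"
git init --bare "$workdir/remote_repo.git"

cd "$workdir"

# Clone it. Git creates a new directory named after the repository
# (here: remote_repo) and places the clone inside it.
git clone "$workdir/remote_repo.git"

# An optional second argument chooses a different directory name.
git clone "$workdir/remote_repo.git" my_copy
```

Git warns that the cloned repository is empty, which is expected here because the stand-in remote has no commits yet.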
Check repo’s status
• The git status command is used to check the state of the repo.
• It is a good habit to run this command after any other Git command.
• It shows untracked new files that have been created in the
working directory.
• It shows tracked files that have been modified.
• It shows the branch you are currently on.
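To see these three pieces of information together, here is a minimal sketch. The file names and the repository-local identity are placeholders chosen for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init

# Repository-local identity so the commit works without global config
# (placeholder values, not from the slides).
git config user.name "Example User"
git config user.email "user@example.com"

echo "first draft" > report.txt
git add report.txt
git commit -m "Add report"

echo "second draft" >> report.txt   # modify a tracked file
echo "scratch" > notes.txt          # create a new, untracked file

# Long form: current branch, modified tracked files, untracked files.
git status

# Compact form: " M" marks a modified file, "??" an untracked one.
git status --short
```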
Information about commits
• The git log command displays the commit history of the current branch.
• It shows the SHA (a unique id), the author, the date, the commit message, etc.
• git log --stat shows the files modified in each commit, the
number of lines that were added or deleted, and a summary of the
modifications.
• git log --patch or git log -p shows the actual changes
made to each file.
• git show with no argument shows the most recent commit together with its
changes, similar to git log -p limited to one commit.
• git show XXXXXX shows the changes made in the commit with
SHA XXXXXX.
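The commands above can be tried on a small two-commit history. This is a sketch under assumed names (data.csv, the commit messages, and the local identity are placeholders). The --no-pager option only keeps the output from opening in a pager.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "id,value" > data.csv
git add data.csv
git commit -m "Add data header"

echo "1,42" >> data.csv
git commit -am "Add first data row"

git --no-pager log          # SHA, author, date, message for each commit
git --no-pager log --stat   # plus files changed and lines added/deleted
git --no-pager log -p       # plus the actual diff of each commit
git --no-pager show         # the most recent commit with its diff

# Look up the SHA of the previous commit and show just that commit.
first_sha="$(git rev-parse HEAD~1)"
git --no-pager show "$first_sha"
```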
Add commits to repo
• Git's staging step lets you keep making changes in the working
directory and, when you are ready to interact with version control, record
those changes in small, focused commits.
• A file must be placed in the Staging Index before it can be committed.
• git add file1 file2 … moves files from
the Working Directory to the Staging Index.
• git add . stages all changes in the current directory and its
subdirectories.
• git rm --cached file1... unstages files that were
added by mistake; the files stay in the working directory. (For a file that was
already committed, this instead stages its removal from the repository.)
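A short sketch of the staging commands, assuming a brand-new repository and three placeholder files. It shows a file being staged by mistake and then unstaged again.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init

echo "a" > file1.txt
echo "b" > file2.txt
echo "c" > file3.txt

# Stage two specific files.
git add file1.txt file2.txt
git status --short    # file1 and file2 staged ("A "), file3 untracked ("??")

# Stage everything in the current directory, including file3.txt.
git add .

# file3.txt was staged by mistake: remove it from the Staging Index only.
# The file itself remains in the working directory.
git rm --cached file3.txt
git status --short
```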
Git commit
• Run git status first.
• Configure at least a user name and email address before committing to a Git
repository, because this information is stored in each commit. Use the
commands below:
git config --global user.name "Firstname Lastname"
git config --global user.email "your.email@example.com"
• Commit using the git commit command:
git commit -m "Your commit message goes here."
• Use the git log command to review the commit just made.
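Put together, the sequence might look like the sketch below. One deliberate difference from the slide: the identity is set without --global, so it applies only to this throwaway repository and leaves any existing global configuration untouched. The file name and commit message are placeholders.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init

# Identity for this repository only (no --global in this sketch).
git config user.name "Firstname Lastname"
git config user.email "your.email@example.com"

echo "print('hello')" > train.py   # hypothetical file to commit
git add train.py

git status                          # confirm what is staged
git commit -m "Add training script"
git --no-pager log                  # review the commit just made
```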
Git Branching
• A powerful feature of Git.
• A branch represents an independent line of development. New commits are recorded in the
history of the current branch, which creates a fork in the history of the project.
• git branch "new_branch_name" creates a new branch with the
given name.
• git checkout "new_branch_name" switches to that branch.
Otherwise, even after branching, commits will still be made to the branch you were on (for example, main).
• git checkout -b "new_branch_name" creates the new branch and checks it out
in one step.
• git branch lists the branches and marks the active one with an asterisk.
• git branch -d "new_branch_name" deletes the branch.
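The branch commands can be exercised as in this sketch. The starting branch name is read from the repository rather than assumed, since it may be main or master depending on the Git configuration. The name another_branch and the local identity are placeholders.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "v1" > notebook.txt
git add notebook.txt
git commit -m "Initial commit"

# Remember the starting branch (main or master, depending on configuration).
start_branch="$(git symbolic-ref --short HEAD)"

git branch new_branch_name        # create the branch; still on start_branch
git checkout new_branch_name      # switch to it
git branch                        # the asterisk now marks new_branch_name

git checkout "$start_branch"
git checkout -b another_branch    # create and switch in one step

# A branch cannot be deleted while it is checked out, so switch back first.
git checkout "$start_branch"
git branch -d new_branch_name
git branch -d another_branch
```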
Git Merging
• The git merge command is used to combine branches in Git.
• git merge "new_branch_name" merges the named branch into the branch that is
currently checked out.
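A minimal merge sketch, assuming a single placeholder file and a branch that simply adds one commit. Because the starting branch has no new commits of its own, Git performs a fast-forward merge here; diverged histories would produce a merge commit instead.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "baseline" > model.txt
git add model.txt
git commit -m "Add baseline model"
start_branch="$(git symbolic-ref --short HEAD)"

# Do some work on a separate branch.
git checkout -b new_branch_name
echo "retrained" >> model.txt
git commit -am "Retrain model"

# Switch back and bring that work into the starting branch.
git checkout "$start_branch"
git merge new_branch_name

git --no-pager log --oneline
```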
Communicating with remote repo
• The git fetch command downloads commits, files, and references from
a remote repository into the local repo.
• Git isolates fetched content from existing local content, so that local
development work is in no way affected.
• git pull, on the other hand, does the same download AND then merges those
changes into the current local branch.
• The git push command is used to upload local repository content to a
remote repository.
• Push exports commits to remote branches.
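The difference between fetch, pull, and push can be sketched with two local clones sharing one stand-in remote. Everything here is an assumption for illustration: the bare repository replaces a real GitHub remote, and alice and bob are hypothetical collaborators.

```shell
workdir="$(mktemp -d)"
git init --bare "$workdir/remote.git"      # stand-in for the remote repository

# Alice clones, commits, and pushes the first version.
git clone "$workdir/remote.git" "$workdir/alice"
cd "$workdir/alice"
git config user.name "Alice Example"       # placeholder identity
git config user.email "alice@example.com"
echo "v1" > results.txt
git add results.txt
git commit -m "Add results"
branch="$(git symbolic-ref --short HEAD)"  # main or master
git push origin "$branch"

# Bob clones the remote and gets v1.
git clone "$workdir/remote.git" "$workdir/bob"

# Alice pushes a second change.
echo "v2" >> results.txt
git commit -am "Update results"
git push origin "$branch"

# Bob fetches: the new commit is downloaded, but his files are untouched.
cd "$workdir/bob"
git fetch origin
cat results.txt

# Bob pulls: fetch plus merge, so his working copy now includes v2.
git pull origin "$branch"
cat results.txt
```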
GitHub for your Personal Branding
• Make your work public.
• Participate in open-source projects.
• Recruiters search through GitHub for potential job candidates.
• GitHub activity, repos, commits, documentation etc. are evidence of a
skilled practitioner.
• GitHub Pages - Websites for you and your projects. Hosted directly from
your GitHub repository. Just edit, push, and your changes are live.