Setting Up a Linux Data Science Environment on Windows

Phillip Peng
Jan 5, 2024

A guaranteed-to-work guide to setting up a Linux data science environment for Windows users.

Summary chart created by Mermaid JS

Linux is often preferred for data science projects because many Python packages and tools are developed and tested primarily on it, and some do not work as well on other systems. Some advanced projects require Linux outright because of library compatibility, efficient memory management, or specialized tools that perform best there. Learning to use Linux is therefore valuable for handling data science tasks effectively.

Welcome to this guaranteed-to-work guide, which walks Windows users through setting up a Linux Python environment for data science projects using the Windows Subsystem for Linux (WSL) and Git. This setup provides the best of both worlds: the robustness of a Linux environment and the convenience of Windows.

Part 1: Installing WSL and Ubuntu

Step 1: Enable WSL

  1. Open PowerShell as Administrator and run:
wsl --install

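2. Restart your computer when prompted. On recent Windows 10 and Windows 11 builds, this command also installs Ubuntu as the default distribution, so Step 2 is only needed if Ubuntu is not installed afterwards or you want a specific version.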

Step 2: Install Ubuntu

  1. Open Microsoft Store and search for “Ubuntu.”
  2. Select your preferred version (e.g., Ubuntu 20.04) and click “Install.”
  3. Launch Ubuntu from the Start Menu. Set up your username and password.
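
Once Ubuntu is running, it can be handy to confirm that the shell really is inside WSL. The following is a minimal sketch, assuming the usual WSL behaviour that the kernel version string in /proc/version contains the word "microsoft". The function name is_wsl is an illustrative helper, not a standard command:

```shell
# is_wsl [FILE]: succeed if the kernel version file mentions "microsoft".
# FILE defaults to /proc/version; passing another path is mainly useful
# for trying the helper out on sample text.
is_wsl() {
    version_file="${1:-/proc/version}"
    grep -qi 'microsoft' "$version_file" 2>/dev/null
}

if is_wsl; then
    echo "Running inside WSL"
else
    echo "Not running inside WSL"
fi
```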

Part 2: Setting Up Python and Conda

Step 1: Update and Upgrade Ubuntu

  1. Open the Ubuntu terminal and run:
sudo apt update && sudo apt upgrade

Step 2: Install Miniconda

  1. Download Miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

2. Run the installer:

bash Miniconda3-latest-Linux-x86_64.sh

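3. Follow the on-screen instructions. You need to accept the license terms, confirm the installation location, and answer “yes” when the installer offers to initialize conda for your shell. Without that last step, the conda command will not be found after you reload your profile.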

4. Reload your profile:

source ~/.bashrc

5. Verify the installation:

conda --version
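
Before running the installer in step 2, it is reasonable to check that the download is intact. The sketch below compares a file's SHA-256 digest with an expected value. verify_sha256 is a hypothetical helper, and the expected hash has to be copied from the official Miniconda download listing (none is hard-coded here):

```shell
# verify_sha256 FILE EXPECTED_HASH
# Succeeds only if the SHA-256 digest of FILE equals EXPECTED_HASH.
verify_sha256() {
    actual_hash=$(sha256sum "$1" | cut -d ' ' -f 1)
    [ "$actual_hash" = "$2" ]
}

# Example usage (EXPECTED is a placeholder; paste the published hash):
# EXPECTED="<hash from the Miniconda download listing>"
# verify_sha256 Miniconda3-latest-Linux-x86_64.sh "$EXPECTED" \
#     && echo "Checksum OK" || echo "Checksum mismatch, download again"
```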

Part 3: Configuring the Data Science Environment

Step 1: Create a Conda Environment

  1. Create a new environment named ‘ds_env’ with Python 3.10:
conda create --name ds_env python=3.10

Of course, you have the flexibility to name your environment as you please and choose from various Python versions.

2. Activate the environment:

conda activate ds_env

With the environment activated, you are ready to install the data science packages you need into it.
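
If you rerun these steps later, it helps to know whether the environment already exists before creating it again. The following is a minimal sketch, assuming the usual conda env list output in which each non-comment line starts with the environment name. has_env is a hypothetical helper that reads that listing from standard input:

```shell
# has_env NAME: read `conda env list` output on stdin and succeed if an
# environment called NAME is listed (header lines start with '#').
has_env() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Only create ds_env when conda is available and the env is missing.
if command -v conda >/dev/null 2>&1; then
    if conda env list | has_env ds_env; then
        echo "ds_env already exists"
    else
        conda create --yes --name ds_env python=3.10
    fi
fi
```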

Step 2: Install Data Science Packages

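  1. Install essential packages like NumPy, Pandas, Matplotlib, Scikit-Learn, and Jupyter (jupyterlab is listed explicitly because the jupyter metapackage does not always include JupyterLab, which Step 3 launches):
conda install numpy pandas matplotlib scikit-learn jupyter jupyterlab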
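
After the install finishes, a quick import check can confirm that each package is usable from the active environment. This is a sketch rather than a conda feature: check_module is a hypothetical helper, and note that scikit-learn is imported under the name sklearn.

```shell
# check_module MODULE: report whether the current Python can import MODULE.
check_module() {
    if python3 -c "import $1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "MISSING: $1"
    fi
}

# Import names for the packages installed above.
for module in numpy pandas matplotlib sklearn; do
    check_module "$module"
done
```

Any line starting with MISSING suggests that the package was installed into a different environment than the one currently activated.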

Step 3: Launch Jupyter Lab

  1. Start Jupyter Lab:
jupyter lab

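2. Access Jupyter through the URL provided in the terminal. If no browser opens automatically under WSL, start it with jupyter lab --no-browser instead and paste the printed URL into a browser on Windows.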

Part 4: Integrate Git for Version Control

Git, a version control system, is essential for managing data science projects. It tracks changes to your code and other project files, so you can revert to a previous state when needed. Branching and merging let multiple team members work in parallel and combine their changes in a controlled way, and the commit history keeps the development of a project structured and easy to follow. Whether you work alone or in a team, Git is a key tool for any data scientist, so it is good practice to set it up in every environment you work in.

Step 1: Install Git

  1. Install Git in Ubuntu:
sudo apt install git

2. Configure Git with your user name and email:

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

Step 2: Using Git in Projects

  1. Initialize a new Git repository in your project directory:
git init

2. Track your files and commit changes:

git add .
git commit -m "Initial commit"

3. Connect to a remote repository (like GitHub) if needed. To add a remote repository to your Git configuration, use the following command:

git remote add [remote-name] [remote-url]

where [remote-name] is the short name you assign to the remote repository and [remote-url] is the URL of the remote repository.
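
The three steps above can be tried end to end in a throwaway directory. In this sketch the file name analysis.py and the remote URL are placeholders (the URL points at a made-up repository), and the identity is set locally so the example does not depend on your global Git configuration:

```shell
# Work in a temporary directory so no real project is touched.
project_dir=$(mktemp -d)
cd "$project_dir" || exit 1

git init --quiet
# Identity for this repository only (placeholder values).
git config user.name "Your Name"
git config user.email "your.email@example.com"

# Create a file, stage it, and record the first commit.
echo 'print("hello")' > analysis.py
git add .
git commit --quiet -m "Initial commit"

# Register a remote named "origin" (hypothetical URL; nothing is contacted).
git remote add origin https://github.com/your-username/your-repo.git

# Show the commit history and the configured remotes.
git log --oneline
git remote -v
```

Once a real remote URL is in place, git push -u origin followed by the branch name uploads the commits.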

Step 3: Best Practices with Git

  • Commit Often: Regular commits help in maintaining a clear history of changes.
  • Use Branches: For new features or experiments, use branches.
  • Pull/Push Regularly: If working in a team, regularly pull from and push to remote repositories to stay synchronized.
  • Use a ‘.gitignore’ file to list intentionally untracked files, such as temporary files, compiled code, and files containing sensitive information.
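
The "Use Branches" advice can be sketched as follows, again in a throwaway repository. The branch name experiment and the file name notes.txt are illustrative only:

```shell
repo_dir=$(mktemp -d)
cd "$repo_dir" || exit 1
git init --quiet
git config user.name "Your Name"
git config user.email "your.email@example.com"

echo "baseline" > notes.txt
git add notes.txt
git commit --quiet -m "Add baseline notes"

# Remember the default branch name (it may be main or master).
base_branch=$(git symbolic-ref --short HEAD)

# Try an idea on its own branch, leaving the default branch untouched.
git checkout --quiet -b experiment
echo "new idea" >> notes.txt
git commit --quiet -am "Try a new idea"

# Switch back and merge once the experiment has worked out.
git checkout --quiet "$base_branch"
git merge --quiet experiment
```

If the experiment does not work out, the branch can simply be left unmerged or deleted, and the default branch keeps its original history.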

A sample .gitignore file looks like this:

# ignore all .log files
*.log

# ignore specific directory
node_modules/

# ignore specific file
config.env

# ignore all files in a specific folder
temp/*
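
To see which rule, if any, matches a given path, Git provides the git check-ignore command. Here is a small sketch in a temporary repository, reusing three of the patterns above. The file names are made up for the demonstration:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
git init --quiet

# Write a small .gitignore using patterns from the sample.
printf '%s\n' '*.log' 'config.env' 'temp/*' > .gitignore

mkdir temp
touch train.log config.env temp/cache.bin model.py

# -v prints the matching rule for each ignored path; model.py matches
# no rule, so it is not listed.
git check-ignore -v train.log config.env temp/cache.bin model.py

# Ignored files do not show up as untracked.
git status --short
```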

Part 5: Integrating with Windows

Step 1: Accessing Windows Files from Ubuntu

  • You can access your Windows files under /mnt/c/ (the C: drive) in Ubuntu. This is a major convenience: much like mounting a project folder into a Docker container, it lets you work in the dedicated Linux environment while sharing the project files with Windows.
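
Inside WSL, the built-in wslpath tool converts between the two path styles (wslpath -u goes from Windows to Linux). To make the mapping itself concrete, here is a simplified sketch of that conversion for plain drive-letter paths. win_to_wsl is a hypothetical helper, the example path is made up, and the sketch assumes drives are mounted under /mnt, which is the WSL default:

```shell
# win_to_wsl 'X:\some\path': print the /mnt/x/some/path equivalent.
win_to_wsl() {
    # First character is the drive letter; WSL mounts it in lower case.
    drive=$(printf '%s' "$1" | cut -c 1 | tr '[:upper:]' '[:lower:]')
    # Skip the "X:" prefix and turn backslashes into forward slashes.
    rest=$(printf '%s' "$1" | cut -c 3- | tr '\\' '/')
    printf '/mnt/%s%s\n' "$drive" "$rest"
}

win_to_wsl 'C:\Users\me\projects\demo'   # prints /mnt/c/Users/me/projects/demo
```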

Step 2: Setting Up Visual Studio Code

  1. Install Visual Studio Code in Windows.
  2. Install the Remote - WSL extension (listed simply as “WSL” in current versions of the marketplace) for seamless integration.
  3. Start the VS Code editor: in the Ubuntu terminal, cd to the project folder and run the following command:
code .

Part 6: Best Practices and Tips

  1. Regularly update your packages to stay current.
  2. Use virtual environments for project-specific dependencies.
  3. Back up your work regularly.
  4. Familiarize yourself with Linux commands for efficient navigation and operation.
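
Tips 1 and 3 can be partly scripted. The sketch below is one possible approach, not a complete backup strategy. backup_project is a hypothetical helper that writes a timestamped tar archive of a project folder, and the commented conda commands show how packages might be updated and the environment recorded:

```shell
# backup_project SRC_DIR DEST_DIR
# Write DEST_DIR/<name>-<timestamp>.tar.gz containing SRC_DIR and print
# the path of the new archive.
backup_project() {
    src_dir=${1%/}
    dest_dir=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest_dir/$(basename "$src_dir")-$stamp.tar.gz"
    mkdir -p "$dest_dir" || return 1
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1
    echo "$archive"
}

# Example (both paths are placeholders):
# backup_project "$HOME/projects/demo" /mnt/c/Users/<name>/backups

# Keeping packages current and recording the environment for later rebuilds:
# conda update --all
# conda env export --name ds_env > environment.yml
```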

Conclusion

Congratulations! You now have a powerful data science environment at your fingertips, combining the flexibility of Linux and the convenience of Windows. Happy data exploring!
