Setting up the development environment on Google Virtual Machine

Data Engineering Zoom Camp 2023

I'm participating in this year's cohort of the Data Engineering Zoomcamp 2023, a free, community-led data engineering course that runs for about 8 weeks. In this post, I'll summarise the steps to configure a Google Cloud virtual machine so that it's ready for the rest of the course.

Configure SSH Keys

Generate a new SSH key with the following commands:

cd ~/.ssh
ssh-keygen -t rsa -f <key-file-name> -C <username> -b 2048

It will prompt you for a passphrase. You can leave it empty and press Enter, then press Enter again when it asks for confirmation.

This generates two files in the .ssh folder: the public key (gcp-blog.pub) and the private key (gcp-blog). Here, gcp-blog is the key file name I used; yours will match the <key-file-name> you passed to ssh-keygen.
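If you'd rather skip the prompts, the same key can be generated non-interactively. This is only a sketch: the key name and username are placeholders, and it writes to a temporary folder so that it can't overwrite an existing key (point KEY_DIR at ~/.ssh for real use).

```shell
# Placeholder values; replace them with your own key name and username.
KEY_DIR="$(mktemp -d)"      # temporary folder for the demo; use "$HOME/.ssh" for a real key
KEY_NAME="gcp-blog"
KEY_USER="your-username"

# -N "" sets an empty passphrase, -q suppresses the progress output.
ssh-keygen -q -t rsa -b 2048 -f "$KEY_DIR/$KEY_NAME" -C "$KEY_USER" -N ""

# The .pub file is the one whose contents get pasted into GCP.
cat "$KEY_DIR/$KEY_NAME.pub"
```

Passing -N "" has the same effect as pressing Enter at both passphrase prompts.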

Next, upload the public key to GCP with the following steps:

  • Open the gcp-blog.pub file and copy its contents. Or you can use the cat command to display the contents in the terminal.

  • Go to the Google Cloud console > Compute Engine > Settings > Metadata.

  • Click on SSH Keys > Add SSH Keys

  • Paste the contents of the public key that you copied previously into the text box and click Save.

Once you have a VM running (created in the next section), you can connect to it using the following command:

ssh -i <PATH_TO_PRIVATE_KEY> <USERNAME>@<EXTERNAL_IP>

Create a Virtual Machine

To set up a Virtual Machine:

  • Go to Compute Engine > VM Instances

  • Click on Create Instance.

  • Populate the configurations for the VM with the following details (Name and Region can be as per your preference):

  • Next, change the boot disk with the following configurations:

  • Leave the rest of the configurations to default values and click Create.

This will spin up a virtual machine instance for you. In order to ssh into this instance, run the following command:

ssh -i <PATH_TO_PRIVATE_KEY> <USERNAME>@<EXTERNAL_IP>

You can also configure an ssh alias, which is a convenient way to store the ssh details in a config file. You can follow my blog on this and set up your alias to easily connect with a VM.

I have created an alias for the VM by the name dezoomcamp. Here's the new command to ssh:

ssh dezoomcamp
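For reference, an alias like this is just a Host entry in ~/.ssh/config. Here's a minimal sketch that appends one, assuming the alias name dezoomcamp. The three values in angle brackets are placeholders that you need to replace with your own VM details before connecting.

```shell
# Make sure the folder exists, then append a Host entry for the VM.
mkdir -p "$HOME/.ssh"

# The quoted 'EOF' keeps the text literal, so nothing inside is expanded.
cat >> "$HOME/.ssh/config" <<'EOF'
Host dezoomcamp
    HostName <EXTERNAL_IP>
    User <USERNAME>
    IdentityFile <PATH_TO_PRIVATE_KEY>
EOF

# ssh refuses config files that other users can write to.
chmod 600 "$HOME/.ssh/config"
```

Note that this appends a new entry every time it runs, so run it once or edit the file by hand afterwards. The external IP of a VM can change after a restart, in which case HostName needs updating.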

If you want to connect to a VM using any other options, please go through the official documentation on Connecting to VMs.

Configure the Virtual Machine

Now that you have a Virtual Machine running and a way to ssh into it and run Linux commands, let's start with installing the requirements of the course to make it ready for development.

Installing Anaconda

  • Download the file in the VM:

  •       wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
    
  • Run the downloaded file:

  •       bash Anaconda3-2022.10-Linux-x86_64.sh
    
  • Keep pressing Enter to scroll down and enter yes to accept the license terms.

  • Press Enter to confirm the default location.

  • Enter yes to run the conda init when asked.

  • After the installation is complete, run source ~/.bashrc to apply the changes that conda init made to the .bashrc file. Alternatively, you can log out of the session using the logout command and then ssh back in for the changes to take effect.

You'll notice the anaconda environment name (base) in the shell prompt once the changes are applied. Running python --version should also confirm that Python was installed along with Anaconda.

Installing Docker

Run the following commands:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install docker.io

This will install Docker, but you won't be able to run it without sudo. Running a command such as docker run hello-world as a regular user throws a permission denied error.

To run docker without sudo, run the following commands:

sudo groupadd docker
sudo gpasswd -a $USER docker
sudo service docker restart

(Refer to this link for an explanation of the above commands.)

Log out of the ssh session and log back in to re-evaluate the group memberships and try running the docker run hello-world command.

This time it works!
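If it still fails, it helps to check whether the current session has actually picked up the new group. A small sketch, where in_docker_group is a helper name I made up for illustration:

```shell
# Returns 0 if the current shell session already belongs to the docker group.
# "id -nG" without a username reports the groups of the running session,
# which is what changes only after logging out and back in.
in_docker_group() {
  case " $(id -nG) " in
    *" docker "*) return 0 ;;
    *) return 1 ;;
  esac
}

if in_docker_group; then
  echo "docker group is active: docker should work without sudo"
else
  echo "docker group is not active yet: log out and back in, then try again"
fi
```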

Installing docker-compose

  • Go to the docker-compose GitHub repo and under the latest release, find the "docker-compose-linux-x86_64" asset and copy its link.

  • Create a new bin folder in the VM's home directory and download the asset into it:

  •       mkdir bin
          cd bin
          wget https://github.com/docker/compose/releases/download/v2.16.0/docker-compose-linux-x86_64 -O docker-compose
    
  • This will create a new file docker-compose in the bin folder.

  • Make this file executable:

  •       chmod +x docker-compose
    
  • Now, add the bin folder to the path:

    • Go to the home directory: cd

    • Open the .bashrc file: nano .bashrc

    • Paste the following to the end of the .bashrc file:

    •       export PATH="${HOME}/bin:${PATH}"
      
    • Press Ctrl+O > enter > Ctrl+X

    • Run source .bashrc for the changes to take effect.

Now, you have docker-compose installed as well.
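If you prefer not to edit .bashrc by hand, the PATH line can also be appended from the shell. This is a sketch rather than the course's method; append_once is a helper name I made up, and it only adds the line if it isn't already there, so running it twice doesn't duplicate the entry.

```shell
# append_once LINE FILE: add LINE to FILE unless an identical line already exists.
append_once() {
  touch "$2"
  # -x matches whole lines, -F treats the text literally (no regex).
  grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Single quotes keep ${HOME} and ${PATH} unexpanded, so they are
# evaluated each time .bashrc is loaded rather than right now.
append_once 'export PATH="${HOME}/bin:${PATH}"' "$HOME/.bashrc"
```

You still need to run source ~/.bashrc (or log back in) afterwards. The same helper works for any other export line that you want to keep across sessions.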

Installing Pgcli

Pgcli is used to connect to the Postgres database and execute queries. It can be installed using the following command:

pip install pgcli

Once the installation is complete, you can connect to a Postgres database using the command:

pgcli -h <hostname> -u <username> -d <database-name>

It will prompt for the password afterwards.

Installing Terraform

  • Go to Terraform's installation page.

  • Copy the link for Linux's AMD64 file.

  • Go to the bin folder created previously and download the file using wget:

  •     cd ~/bin
        wget https://releases.hashicorp.com/terraform/1.3.9/terraform_1.3.9_linux_amd64.zip
    
  • To unzip this file, install the unzip package:

  •     sudo apt-get install unzip
    
  • Now, unzip the file:

  •     unzip terraform_1.3.9_linux_amd64.zip
    
  • This extracts a terraform executable. You can now delete the zip file.

  • Since this file is in the bin folder and the bin folder is in the path, everything is set up.

  • Try running the following command to verify terraform's installation:

  •     terraform --version
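To check all the tools installed so far in one go, a small loop over the command names can help. This is a rough sketch: check_tool is a helper name I made up, and it only checks that each command is on the PATH, not that it runs correctly.

```shell
# check_tool NAME: report whether NAME resolves to a command on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok       $1 -> $(command -v "$1")"
  else
    echo "MISSING  $1"
    return 1
  fi
}

missing=0
for tool in conda python docker docker-compose pgcli terraform; do
  check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing tool(s) missing"
```

If docker-compose or terraform show up as missing, the bin folder is probably not on the PATH in the current session yet.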
    

Creating a service account

  • Go to IAM and Admin > Service Accounts

  • Click on the "Create Service Account" button at the top and provide the service account name.

  • Assign the following roles to this service account:

  • Click Done. The service account is now created.

  • To generate keys for this service account, click on the 3-dot menu and then on "manage keys"

  • On the following page, click on Add Key > Create new key > JSON format. Click Create.

  • This will download a JSON file. This is to be uploaded to the VM using SFTP.

  • Navigate to the path of the downloaded file on the terminal and open an SFTP connection by running the command: sftp <ssh-alias-name>.

  • Next, create a folder .gcp and upload the credentials file into it by running the following commands:

  •     mkdir .gcp
        put <credentials-file-name> .gcp/
    
  • This will upload the file to the VM.

Authenticate GCP using the service account credentials

To authenticate with GCP, we need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the service account JSON file (replace /home/aditya with your own home directory):

export GOOGLE_APPLICATION_CREDENTIALS=/home/aditya/.gcp/<credentials-filename>.json

Next, authenticate GCP using the following command:

gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
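A mistyped path only shows up later, when gcloud fails, so it can be worth checking the file first. Here's a small sketch of that; it uses the same placeholder file name and assumes the .gcp folder sits in your home directory.

```shell
# Placeholder path; replace <credentials-filename> with the real file name.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.gcp/<credentials-filename>.json"

if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
  # Stop early with a clear message instead of a gcloud error.
  echo "credentials file not found: $GOOGLE_APPLICATION_CREDENTIALS" >&2
elif command -v gcloud >/dev/null 2>&1; then
  gcloud auth activate-service-account --key-file "$GOOGLE_APPLICATION_CREDENTIALS"
else
  echo "gcloud is not on the PATH" >&2
fi
```

The export only lasts for the current session, so add it to .bashrc if you want it to survive a restart of the VM.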

Installing Pyspark

Instructions to install spark on Linux are well documented on the course's GitHub repository.

You can follow the below 2 files in the mentioned order, to install Pyspark:

  1. linux.md

  2. pyspark.md

Tip: You can write all the environment variables mentioned in the above 2 files to .bashrc and then run source .bashrc to make the variables permanent. Otherwise, they'll need to be set up every time you restart your VM.

Cloning the course repo

Clone the course repo into the home directory, so that it sits alongside the bin and .gcp folders created earlier.

Open a Remote Connection from Visual Studio Code

Install the "Remote-SSH" extension.

Press F1 and select "Remote SSH: Connect to Host". It will show you the hosts configured on your system.

Select the one associated with your Virtual Machine.

Open the course repo folder and you're all set to start your development!


Conclusion

You can also refer to the course video on setting up the development environment. Make sure to stop your VM once you're done with the setup or your development, so that you aren't charged for a running instance you aren't using.

Thanks for reading!
