It's no secret that data science tools like Jupyter, Apache Zeppelin, and the more recently launched Cloud Datalab and JupyterLab are must-knows for day-to-day work. So how can we combine the ease of developing models interactively with the computational power of a Big Data cluster? In this article, I will share a few simple steps to start using Jupyter notebooks for PySpark on a Dataproc cluster in GCP.
![15z8CZNVMvYh4VPiUuI8prg](https://www.datasciencecentral.com/wp-content/uploads/2021/10/15z8CZNVMvYh4VPiUuI8prg.jpg)
Final goal
![Image result for jupyter spark](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1AN87UxKNCm970Y2PsGYgpg.png)
Prerequisites
1. Have a Google Cloud account (just log in with your Gmail account and you automatically get $300 of credit for one year) [1]
2. Create a new project with your favorite name (a command-line alternative follows the screenshot)
![1GvOJhKTOJL2codcrTbfkXQ](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1GvOJhKTOJL2codcrTbfkXQ.png)
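If you prefer the command line, the project can also be created from Cloud Shell. A minimal sketch, reusing the project ID that appears later in this article (project IDs must be globally unique, so substitute your own):

```sh
# Create a new project; the ID below matches the one used later in this
# article, but yours must be globally unique.
gcloud projects create jupyter-cluster-223203 --name="jupyter-cluster"

# Point subsequent gcloud commands at the new project.
gcloud config set project jupyter-cluster-223203
```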
Steps
1. To make the deployment easier, I'm going to use a beta feature that can only be applied when creating a Dataproc cluster through Google Cloud Shell. For our cluster, we need to define several settings, such as the number of workers, the master's high availability, and the amount of RAM and hard drive space. To keep this simple, I recommend simulating the creation of the cluster in the UI. First, we need to enable Dataproc (figures 1 and 2; a command-line alternative follows figure 2).
![1E7O6NsguGWDjBqsee0Xflg](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1E7O6NsguGWDjBqsee0Xflg.jpg)
Figure 1 Enable Dataproc API I
![1z3uZ3pqD2kYZ2k61dpomLg](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1z3uZ3pqD2kYZ2k61dpomLg.jpg)
Figure 2 Enable Dataproc API II
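If you'd rather skip the UI clicks, the same API can be enabled from Cloud Shell with a one-liner:

```sh
# Enable the Dataproc API for the current project (equivalent to figures 1 and 2).
gcloud services enable dataproc.googleapis.com
```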
2. Get the equivalent command line by simulating the creation process with your own cluster size. I'm going to set basic specs:
- Region: global
- Cluster mode: Standard
- Master node: 2 vCPUs, 7.5 GB memory, and a 300 GB disk
- Worker nodes: 2 vCPUs, 7.5 GB memory, and a 200 GB disk each
![1IyXhskeyHl2RyJDjd-WF0Q](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1IyXhskeyHl2RyJDjd-WF0Q.jpg)
Simulate creating a cluster through the UI
![1kOUcMIleCAO4RZIbCCiQBQ](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1kOUcMIleCAO4RZIbCCiQBQ.png)
Basic specs
Important: You should click Advanced options and change the image to 1.3 Debian 9 to make the beta parameters work.
![1QW1bG2Dzmm9gH1HD6W5eBw](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1QW1bG2Dzmm9gH1HD6W5eBw.jpg)
To access it, click Advanced options
![1CYjXQt4wX0aodUmkpunJ3Q](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1CYjXQt4wX0aodUmkpunJ3Q.jpg)
Change to 1.3 Debian 9
3. Get the equivalent command line
![1mrfvwgaePdYCU1J1WwTrwA](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1mrfvwgaePdYCU1J1WwTrwA.png)
Click on command line
![1sGdxX9ufj16cuCQus8m2RQ](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1sGdxX9ufj16cuCQus8m2RQ.png)
Copy the gcloud command
4. Close the simulation and click Activate Cloud Shell
Activate Cloud Shell
5. Modify your command and run it (cluster creation can take several minutes; a quick verification command follows the screenshots).

Add

--optional-components=ANACONDA,JUPYTER

Change

gcloud dataproc clusters to gcloud beta dataproc clusters

Run

gcloud beta dataproc clusters create cluster-jupyter --subnet default --zone europe-west1-d --master-machine-type n1-standard-2 --master-boot-disk-size 300 --num-workers 2 --worker-machine-type n1-standard-2 --worker-boot-disk-size 200 --optional-components=ANACONDA,JUPYTER --image-version 1.3-deb9 --project jupyter-cluster-223203
![15JOaOqROIWior2Vugt91Tg](https://www.datasciencecentral.com/wp-content/uploads/2021/10/15JOaOqROIWior2Vugt91Tg.jpg)
running in shell
![1JQ9-g7bISKKEzv5uMGmkFQ](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1JQ9-g7bISKKEzv5uMGmkFQ.jpg)
cluster created
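Once the command finishes, a quick sanity check from Cloud Shell confirms the cluster exists (we created it in the global region, matching the command above):

```sh
# List clusters in the global region and inspect the new one.
gcloud dataproc clusters list --region global
gcloud dataproc clusters describe cluster-jupyter --region global
```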
6. Allow incoming traffic on the Jupyter port: search for firewall rules on the landing page and create a new rule.
![1FieyzH2edsTB7zzwNVghMw](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1FieyzH2edsTB7zzwNVghMw.jpg)
search for Firewall rules under VPC network
![1tdHu9LBjOzij-gobNIxRKA](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1tdHu9LBjOzij-gobNIxRKA.png)
click on create a rule
7. Define the firewall rule, opening port 8123, and save (an equivalent gcloud command follows the screenshots).
![1DJmZqrMbX1LOjIKyUp_inA](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1DJmZqrMbX1LOjIKyUp_inA.png)
parameters
![1P-DE75NkjPpXaT4dzIicXA](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1P-DE75NkjPpXaT4dzIicXA.png)
Rule working
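The same rule can also be created from Cloud Shell. A sketch with an illustrative rule name; note that 0.0.0.0/0 opens the port to the whole internet, so consider restricting the source range to your own IP:

```sh
# Allow inbound TCP traffic on the Jupyter port (8123) in the default network.
# The rule name "allow-jupyter" is illustrative.
gcloud compute firewall-rules create allow-jupyter \
    --network default \
    --direction INGRESS \
    --allow tcp:8123 \
    --source-ranges 0.0.0.0/0
```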
8. Open your Jupyter notebook! You need your master's external IP plus the default Jupyter port, e.g. http://30.195.xxx.xx:8123 (a gcloud one-liner for fetching the IP follows the screenshot).
![1dr4e7BD78v_zBYNmP61lzw](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1dr4e7BD78v_zBYNmP61lzw.jpg)
get the master's IP
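The external IP can also be fetched from Cloud Shell. A sketch that assumes the usual Dataproc naming convention, where the master node is called <cluster-name>-m:

```sh
# For our cluster the master node should be "cluster-jupyter-m",
# in the zone chosen when the cluster was created.
gcloud compute instances describe cluster-jupyter-m \
    --zone europe-west1-d \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)'
```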
9. Let's create our first PySpark notebook
![1lQ0Pi2vAPV-wumkMWtNa7A](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1lQ0Pi2vAPV-wumkMWtNa7A.jpg)
create the first PySpark notebook
10. Validate that it is running correctly (a minimal sanity check follows the screenshot)
![15z8CZNVMvYh4VPiUuI8prg](https://www.datasciencecentral.com/wp-content/uploads/2021/10/15z8CZNVMvYh4VPiUuI8prg.jpg)
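For the sanity check itself, a minimal sketch like the one below should run inside the notebook. The PySpark kernel on Dataproc typically predefines spark; the explicit builder is a harmless fallback, since getOrCreate reuses any existing session:

```python
from pyspark.sql import SparkSession

# The Dataproc PySpark kernel usually predefines `spark`; getOrCreate
# simply reuses the existing session if there is one.
spark = SparkSession.builder.appName("sanity-check").getOrCreate()

# Distribute a trivial computation across the workers.
rdd = spark.sparkContext.parallelize(range(1000))
print(rdd.sum())  # expect 499500 if the job ran correctly

# And a quick DataFrame round trip.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"]).show()
```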
Bonus: Check Spark UI
- To access the Spark UI you need to add another firewall rule, as in step 7. Open ports 8088, 4040, 4041, and 9870 (a command-line sketch follows the figure).
![12nZrpsG839qeOTmKHGDZrQ](https://www.datasciencecentral.com/wp-content/uploads/2021/10/12nZrpsG839qeOTmKHGDZrQ.png)
Create Spark UI rule
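A command-line sketch of the same rule (again, the rule name is illustrative and 0.0.0.0/0 is permissive):

```sh
# Open the four ports listed above: 8088 (YARN ResourceManager UI),
# 4040/4041 (Spark application UI), and 9870.
gcloud compute firewall-rules create allow-spark-ui \
    --network default \
    --direction INGRESS \
    --allow tcp:8088,tcp:4040,tcp:4041,tcp:9870 \
    --source-ranges 0.0.0.0/0
```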
- Click on the Spark UI link obtained in our first notebook. You will get an ERR_NAME_NOT_RESOLVED error; just replace the hostname in the URL with the master's external IP,
e.g. http://3x.xxx.xx.x:8088/proxy/application_1542773664669_0001
![19d5brAxPsskbJjSfOhXfUw](https://www.datasciencecentral.com/wp-content/uploads/2021/10/19d5brAxPsskbJjSfOhXfUw.jpg)
Spark UI
Conclusion
In this article, I showed how to deploy Jupyter on a Dataproc cluster, making it friendlier to use PySpark on a real cluster. Please feel free to reach out if you have questions or suggestions for future articles.
See you in the next article! Happy Learning!