It's no secret that Data Science tools like Jupyter, Apache Zeppelin, or the more recently launched Cloud Datalab and JupyterLab are must-knows for day-to-day work. So how can we combine the ease of developing models with the computing power of a Big Data cluster? In this article I will share a few simple steps to start using Jupyter notebooks for PySpark on a Dataproc cluster in GCP.

Final goal

Prerequisites
1. Have a Google Cloud account (just log in with your Gmail account and you automatically get $300 of credit for one year) [1]
2. Create a new project with your favorite name
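If you prefer the command line, a new project can also be created with the gcloud SDK. A minimal sketch, using a placeholder project ID (the ID must be globally unique):

gcloud projects create my-jupyter-project --name="My Jupyter Project"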

Steps
1. To make the deployment easier, I'm going to use a beta feature that can only be applied when creating a Dataproc cluster through Google Cloud Shell. For our cluster we need to define several settings, such as the number of workers, the master's high availability, and the amount of RAM and hard drive space. To make this easy, I recommend simulating the creation of the cluster through the UI. First, we need to enable Dataproc (figures 1 and 2); a command-line alternative is sketched after figure 2.

Figure 1 Enable Dataproc API I

Figure 2 Enable Dataproc API II
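If you prefer the shell, the Dataproc API can also be enabled directly from Cloud Shell instead of the UI. A minimal sketch, assuming gcloud is already pointed at your project:

gcloud services enable dataproc.googleapis.com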
2. Get the equivalent command line by simulating the creation process with your own cluster size. I'm going to set basic specs:
- Region: global
- Cluster mode: Standard
- Master node: 2 vCPUs, 7.5 GB memory, and 300 GB disk size
- Worker nodes: 2 vCPUs, 7.5 GB memory, and 200 GB disk size

Simulate creating a cluster through UI

Basic specs
Important: You should click Advanced options and change the image to 1.3 Debian 9 to make the beta parameters work.

To access, click Advanced options

Change to 1.3 Debian 9
3. Get the equivalent command line

Click on command line

Copy the gcloud command
4. Close the simulation and click Activate Cloud Shell
Activate Cloud Shell
5. Modify your command and run it (this could take several minutes).
Add
--optional-components=ANACONDA,JUPYTER
Change
gcloud dataproc clusters to gcloud beta dataproc clusters
Run
gcloud beta dataproc clusters create cluster-jupyter --subnet default --zone europe-west1-d --master-machine-type n1-standard-2 --master-boot-disk-size 300 --num-workers 2 --worker-machine-type n1-standard-2 --worker-boot-disk-size 200 --optional-components=ANACONDA,JUPYTER --image-version 1.3-deb9 --project jupyter-cluster-223203

running in shell

cluster created
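To double-check from the shell that the cluster is up, one option (assuming the cluster name and global region from the command above) is:

gcloud dataproc clusters describe cluster-jupyter --region=global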
6. Allow incoming traffic on the Jupyter port: search for the firewall rules on the landing page and create a rule.

search Firewall rules VPC network

click on create a rule
7. Define the firewall rule to open port 8123 and save it (a command-line equivalent is sketched after the figures).

parameters

Rule working
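If you'd rather skip the UI, a rough Cloud Shell equivalent could look like the rule below (the rule name is just an example; 0.0.0.0/0 opens the port to everyone, so in practice restrict the source range to your own IP):

gcloud compute firewall-rules create allow-jupyter --network=default --allow=tcp:8123 --source-ranges=0.0.0.0/0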
8. Enter your Jupyter notebook! (you need your master's IP plus the Jupyter default port, e.g. http://30.195.xxx.xx:8123 )

get master's IP
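You can also grab the master's external IP from Cloud Shell. A sketch, assuming the default instance name cluster-jupyter-m that Dataproc derives from the cluster name:

gcloud compute instances describe cluster-jupyter-m --zone=europe-west1-d --format='get(networkInterfaces[0].accessConfigs[0].natIP)'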
9. Let's create our first PySpark notebook

create the first PySpark notebook
10. Validate that it is running well
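If you also want a quick sanity check outside the notebook, you can SSH into the master from Cloud Shell and run a tiny job in the PySpark shell (the instance name and zone below are the ones assumed from the example cluster above):

gcloud compute ssh cluster-jupyter-m --zone=europe-west1-d
# once on the master, open the PySpark shell
pyspark
# >>> sc.parallelize(range(100)).sum()   # should print 4950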

Bonus: Check Spark UI
- To access the Spark UI you need to add another firewall rule, as in step 7, opening ports 8088, 4040, 9870, and 4041 (a command-line equivalent is sketched after the figure).

Create Spark UI rule
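Again, a rough Cloud Shell equivalent of that rule (the name is just an example; restrict the source range in real projects):

gcloud compute firewall-rules create allow-spark-ui --network=default --allow=tcp:8088,tcp:4040,tcp:4041,tcp:9870 --source-ranges=0.0.0.0/0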
- Click on the Spark UI link obtained in our first notebook. You will get an ERR_NAME_NOT_RESOLVED error; just replace the hostname in the URL with the master's IP,
e.g. http://3x.xxx.xx.x:8088/proxy/application_1542773664669_0001

Spark UI
Conclusion
In this article I showed how to deploy Jupyter on a Dataproc cluster, making it friendlier to use PySpark on a real cluster. Please feel free to reach out if you have questions or suggestions for future articles.
See you in the next article! Happy Learning!