Dataproc is Managed Hadoop, Pig, Hive, and Spark on GCP

Robert Thas John
2 min readJul 19, 2018


Creating Cloud Dataproc Clusters

If you have been managing your own Hadoop infrastructure, it may be good news that you can migrate your setup to Google Cloud Platform. A Dataproc cluster is a group of machines with a master node and worker nodes, and it comes with automated cluster management and resizing.

Dataproc supports HDFS, and it also lets you store your data on Google Cloud Storage. One advantage of this is that you can keep your data in Storage and spin up Dataproc clusters only when you need them. When working this way, you can keep all your data in a single region and connect to it from your cluster.
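As a sketch of this workflow (the bucket name, region, and file names below are placeholders, not from this article), you might stage your data in a regional Cloud Storage bucket with gsutil and then have your Spark jobs read it directly by its gs:// URI:

```shell
# Create a regional bucket and stage a dataset in it
# (bucket, region, and file names are placeholders).
gsutil mb -l us-central1 gs://my-dataproc-data
gsutil cp sales.csv gs://my-dataproc-data/input/

# A Spark job running on Dataproc can then read the file by URI, e.g.
#   spark.read.csv("gs://my-dataproc-data/input/sales.csv")
# so the data outlives any individual cluster.
```

Because the bucket persists independently of the cluster, you can delete the cluster when a job finishes and recreate it later without moving any data.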

Let’s walk through creating a Dataproc cluster and connecting it to Cloud SQL. If you have no experience with Cloud SQL, please follow this link.

From the Navigation menu, open Dataproc and click Create cluster. Ensure that the zone is the same as the one in which you configured your Cloud SQL instance; this will minimize network latency. You will also need to specify the machine type for both your master and worker nodes.
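The same cluster can be provisioned from the command line. This is a minimal sketch, assuming placeholder names for the cluster, zone, and Cloud SQL instance; the initialization action shown is Google's published Cloud SQL proxy script, which is one way to make the Cloud SQL instance reachable from the cluster:

```shell
# Create a cluster in the same zone as the Cloud SQL instance
# (cluster name, zone, machine types, and instance name are placeholders).
gcloud dataproc clusters create my-cluster \
    --zone us-central1-a \
    --master-machine-type n1-standard-2 \
    --worker-machine-type n1-standard-2 \
    --num-workers 2 \
    --initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
    --metadata "hive-metastore-instance=my-project:us-central1:my-sql-instance"
```

Drop the last two flags if you do not need the Cloud SQL connection; the console form in the screenshot below covers the same options.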

Provisioning a cluster

Proceed to create your cluster when done. Provisioning takes a few minutes, so please be patient at this step.

Once the cluster is running, you will be able to submit jobs. You specify the cluster, the job type, and the job file, and you will be notified when your job completes.
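Jobs can also be submitted from the command line. As a sketch (the job file and cluster name are placeholders), submitting a PySpark job stored in Cloud Storage looks like this:

```shell
# Submit a PySpark job to the running cluster
# (the script path and cluster name are placeholders).
gcloud dataproc jobs submit pyspark gs://my-dataproc-data/jobs/wordcount.py \
    --cluster my-cluster
```

The command streams the driver output back to your terminal, and the job also appears in the Jobs tab of the console, as in the screenshot below.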

Submit a job

Good luck using Cloud Dataproc.
