Why Cloud Dataproc?
Dataproc is Google’s managed offering for Hadoop clusters. The article for setting up Dataproc is available here.
We live in an age where data has grown to the point where we deal in terabytes and petabytes (a petabyte is 1,024 terabytes, which would take roughly 27 years to download over a 4G network, or about a day’s worth of uploads to YouTube). Just think of the number of credit card transactions processed daily, or the number of taxi rides taken.
At that scale, you start to ask yourself how long it would take to process the data on one computer. There is a limit to the number of records one processor can handle. While you could get a computer with more cores, there is a limit to how many cores you can get, and consequently to how quickly your jobs can be processed.
Hadoop provides infrastructure for taking a large file, distributing it across a number of computers, processing the chunks of the file in parallel, and then combining the results to produce a final answer. Those computers constitute a cluster.
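To make the idea concrete, here is a toy, single-machine sketch in plain Python of the split, process, and combine pattern that Hadoop automates across many machines. It counts words: each chunk is mapped to intermediate (word, count) pairs independently, and the intermediate results are then combined into one answer. The chunk strings and function names are purely illustrative.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of the file."""
    return [(word, 1) for word in chunk.split()]

def combine_phase(pairs):
    """Merge the per-chunk results into final counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Pretend each string is a chunk of a large file stored on a different machine.
chunks = ["to be or not to be", "that is the question"]

# Each chunk is processed independently (in Hadoop, in parallel on worker nodes),
# then the intermediate results are combined into a single result.
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
print(combine_phase(intermediate))  # e.g. {'to': 2, 'be': 2, 'or': 1, ...}
```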
It turns out that setting up and tuning a Hadoop cluster is a labour-intensive and time-consuming activity. A high-level overview of administering a Hadoop cluster includes the following:
- Resource provisioning
- Handling growing scale
- Reliability
- Deployment and configuration
- Improvements to utilization
- Performance tuning
- Monitoring
Setting up a Hadoop deployment involves the following activities:
- Obtaining your servers
- Configuring your servers
- Installing open source software
- Configuring the open source software for a particular task
- Optimizing the open source software for a particular task
- Processing data
It turns out that you can’t use one Hadoop cluster setup for every job; the setup needs to be configured for each job. You also can’t just scale a Hadoop cluster on a whim.
There are two ways of scaling:
- Vertically, by purchasing bigger hardware
- Horizontally, by adding a server to the cluster
Scaling vertically has its own challenges, which include deciding what to do with the old server after replacing it.
Scaling horizontally requires reconfiguring the open source software, as well as redistributing the data that needs to be processed across the new set of servers, a task known as sharding.
There is also the problem of what to do with your cluster when it is not in use. The truth is nobody utilizes a server 100% all of the time. Eventually, your job completes, and the server lies idle.
Cloud Dataproc lets you offload all of that work to Google Cloud Platform, so you can focus on your jobs themselves. A Dataproc deployment is a simple matter of cluster creation and configuration. When your job completes, you can simply terminate the cluster, freeing up resources for someone else to make use of.
A Dataproc cluster is made up of a master node and worker nodes. You can specify the geographical location of the cluster, which should be the same region where your files are stored. You can also specify the number of processors, the memory, and the storage for the master as well as the workers.
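As a rough illustration of that lifecycle, the sketch below creates and then deletes a small cluster using the google-cloud-dataproc Python client library. The project ID, region, cluster name, machine types, and disk sizes are placeholder values chosen for the example; check the current client documentation before relying on the exact request shapes.

```python
from google.cloud import dataproc_v1

project_id = "my-project"         # placeholder project ID
region = "us-central1"            # keep the cluster in the same region as your data
cluster_name = "example-cluster"  # placeholder cluster name

# The client must point at the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# One master plus two workers; machine type and boot disk are set per node role.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-2",
            "disk_config": {"boot_disk_size_gb": 100},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-2",
            "disk_config": {"boot_disk_size_gb": 100},
        },
    },
}

# Create the cluster and wait for the long-running operation to finish.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()

# ... submit your jobs here ...

# When the work is done, tear the cluster down so you stop paying for it.
client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```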
In conclusion, if you make use of Hadoop clusters, you should give Cloud Dataproc a good look, because it has the potential to save you both time and money.