Data Warehousing with BigQuery on GCP
A data warehouse is not used for processing transactions in real time. Instead, it is a repository of historical data that is queried to answer questions and drive business decisions.
Data is moved into warehouses in batches, often from several different sources. Analytics are then carried out on the data once it is in the warehouse.
BigQuery is Google’s offering for data warehousing. It is designed to store and query petabytes of data without requiring you to set up and manage any operational infrastructure. It is, however, not a transactional database, and it does not support high-throughput writes. For that, you should look at Bigtable, which supports very high throughput but is a NoSQL store rather than a relational database. Another alternative is Cloud SQL, which is good for storing sub-terabyte data for real-time transactional operations, or Spanner for large-scale relational storage and retrieval.
BigQuery supports standard SQL, so you will be right at home coming from Oracle, IBM, or Microsoft ecosystems. There are also a number of built-in functions that support modern-day requirements, and learning about them will make your job easier.
There is no operational infrastructure to worry about: not the size of storage, nor how much RAM is required to process your query, nor the number of processors on your server. The system scales automatically to run your queries, and then releases those resources when it is done. Google publishes sample datasets so you can practice on them. These are huge datasets, and the first time you query them, be sure to drop me a note if you don’t have a smile on your face. You can find them in the BigQuery public datasets collection. Be sure to be logged in to Google Cloud Platform.
One thing to note about BigQuery is that it helps to know how you will be querying your data ahead of time, so you can denormalize it before warehousing. It is still okay to store the data in individual tables and run joins later on. You are charged for the amount of data you process, so at this stage it should be a given that you select only the columns you need in a query.
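To make the idea of denormalizing before warehousing concrete, here is a minimal Python sketch. The customer and order records, and all field names, are invented for illustration:

```python
# Sketch: pre-joining ("denormalizing") orders with customer data
# before loading them into the warehouse. All names are illustrative.

customers = {
    1: {"name": "Asha", "country": "IN"},
    2: {"name": "Bola", "country": "NG"},
}

orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 2, "total": 40.0},
]

def denormalize(orders, customers):
    """Embed customer attributes into each order row, so queries
    against the warehouse need no join."""
    rows = []
    for o in orders:
        c = customers[o["customer_id"]]
        rows.append({**o,
                     "customer_name": c["name"],
                     "customer_country": c["country"]})
    return rows

flat = denormalize(orders, customers)
print(flat[0])  # each row now carries everything a typical report needs
```

Once each row carries everything a report needs, the warehouse query becomes a plain scan with no join.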
When working in the Cloud Console, you can export your data as CSV or JSON. There is also a REST API, as well as client libraries for popular languages, so you can connect and run your queries programmatically.
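As a sketch of programmatic access, here is a minimal example using the official `google-cloud-bigquery` Python client against one of Google's public sample datasets. It assumes the library is installed (`pip install google-cloud-bigquery`) and that application default credentials are configured:

```python
# Sketch: running a query from Python with the official client library.
# Assumes `google-cloud-bigquery` is installed and credentials exist
# (e.g. via `gcloud auth application-default login`).

SQL = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""

def top_words():
    """Run the query and return (word, total) pairs."""
    # Lazy import keeps this sketch loadable without the library installed.
    from google.cloud import bigquery
    client = bigquery.Client()      # project is taken from your credentials
    job = client.query(SQL)         # starts an asynchronous query job
    return [(row.word, row.total) for row in job.result()]

# Usage (with credentials configured):
#   for word, total in top_words():
#       print(word, total)
```

The client handles authentication, job submission, and paging through results for you; the same query could equally be issued over the raw REST API.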
If you make use of any enterprise-grade ETL tools, there is a good chance that they have BigQuery connectors.
BigQuery has the concept of datasets, rather than databases. These datasets belong to projects, which is how your billing is managed. Access management is performed at the dataset level, with the basic roles being Reader, Writer, and Owner. Tables and views belong to datasets.
In BigQuery, tables are columnar structures. Each column is stored separately, which is why naming only the columns you need in a query reduces your cost: you are charged only for the data that is actually read and processed. As a result, you do not need indexes or keys.
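The cost effect of columnar storage can be sketched with a toy model in plain Python. The table and its column sizes are invented, but the principle matches the one described above: only the columns a query names get read.

```python
# Toy model of columnar storage: each column lives in its own "file",
# so a query only reads the columns it names. Data is illustrative.
import json

columns = {
    "user_id": [1, 2, 3, 4],
    "name":    ["a", "b", "c", "d"],
    "payload": ["x" * 100] * 4,   # a wide column most queries don't need
}

def bytes_scanned(selected):
    """Approximate bytes read by a query touching only `selected` columns."""
    return sum(len(json.dumps(columns[c]).encode()) for c in selected)

narrow = bytes_scanned(["user_id"])
wide = bytes_scanned(["user_id", "name", "payload"])
print(narrow, wide)  # the narrow query scans far fewer bytes
```

A `SELECT *` in this model pays for every column file; a query naming one column pays only for that file, which is exactly why column pruning is the first cost lever in BigQuery.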
You can validate queries before running them. A dry run returns the amount of data the query would process, thus letting you manage cost. You can also analyze your queries for performance bottlenecks.
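A dry run is how you obtain that expected size programmatically. The sketch below pairs it with a small cost helper; the on-demand price used here ($6.25 per TiB) is an assumption, so check current BigQuery pricing for your region before relying on it.

```python
# Sketch: validating a query with a dry run, then estimating its cost.
# The price per TiB is an assumed example figure, not a quoted rate.

PRICE_PER_TIB = 6.25  # assumed on-demand USD price; verify against current pricing

def estimate_cost_usd(bytes_processed, price_per_tib=PRICE_PER_TIB):
    """Convert a dry run's byte count into an approximate dollar cost."""
    return bytes_processed / 2**40 * price_per_tib

def dry_run(sql):
    """Ask BigQuery how many bytes a query would process, without running it.
    Needs `google-cloud-bigquery` and configured credentials."""
    from google.cloud import bigquery  # lazy import keeps the sketch loadable
    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed

# A full 1 TiB scan at the assumed rate:
print(estimate_cost_usd(2**40))  # → 6.25
```

Running the dry run first, then the real query only if `estimate_cost_usd(dry_run(sql))` is acceptable, is a simple guard against accidentally scanning an entire petabyte-scale table.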
I hope you take some time out to have a look at BigQuery.