Google’s BigQuery

When I have heard first time the word ‘BigQuery’, it looked me if I need to write a long big SQL queries more than 20 or 30 lines or how will it do some wonders as its name or something else…?

It caused a curiosity, I started to explore it and happy to share my understanding with you.

BigQuery is a Software as a Service data warehouse solution offered by Google under cloud services which is categorized within analytics vertical.

Before diving into BigQuery, there are 2 more words which are relevant to have a better understanding of BigQuery — Data Lake and Data Warehouse.

Data Lake:

Image Source: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Data lake is a scalable and secure data platform that allows enterprises to ingest data, store, process (batch, streaming) and analyze (structured, semi-structured and unstructured data) of any type or any volume. Hence, it is a centralized repository that allows you to store all your structured and unstructured data at any scale.

Properties of data lakes:

Scalable, Supports 3Vs of BigData (Variety, Velocity and Volume of data)
Durable
Highly Available
Retain data in ‘AS Is ’its native / raw format

Data Warehouse:

Data Warehouse is needed when you have a defined business case. Data warehouse always contains organized and cleansed data. Data is only loaded when you have business use case and proper schema is defined for downstream application without any redundant information.

Properties of data warehouse:

Loaded only when its use is defined
Processed / organized / transformed / Cleansed steps are performed while loading the data.
Tends to have consistent schema shared across the downstream applications.
Provide faster insights like creation of charts, reports, dashboard for senior management.
Current / historical data for reporting

Consumers of Data Warehouse:

Reporting / Business Intelligence Analyst
Decision makers who rely on mass amount of data to make decisions.
Users who use customized, complex processes to obtain information from multiple data sources like finance, sales, promotion etc departments.
Reporting solution with fast performance on a huge amount of data for reports, grid, charts.
Machine Learning Engineers to discover ‘hidden patterns’ of data-flows , groupings and pulling out any other useful insights.
Downstream data warehouse teams

Factors to be considered while choosing a good data warehouse:

Assets: It requires expensive servers to compute and store data.
People: Skilled database administrators are needed to manage data integrity.
Cost: Interacting with big data is expensive, slow process.
Scale: How much data storage is needed and will storage needs change over time?
Resilient: How is data protected to ensure availability and durability?
Security: How is the data organized, catalogued and access controlled?

What makes “BigQuery” a viable data warehouse solution?

BigQuery is a fully-managed, server-less data warehouse that enables scalable, cost-effective and fast analysis over petabytes of data. It is a server-less Software as a Service that supports querying using ANSI SQL. It also has built-in machine learning capabilities.

Analyze petabytes of data using ANSI SQL at blazing-fast speeds, with zero operational overhead that is No-Ops service that is no infrastructure to manage and no need a database administrator to manage a complex infrastructure setup and software. Data is stored encrypted at rest (by google managed key or customer managed key) and controlled by secured governance process (IAM).

Key Features:

A fast Query Engine and fully managed data storage
Automatic resource allocation on demand (Storage & Compute)
Materialized Views, Server-less, Real-time analytics, Automatic high availability
Automatic backup & easy restore
Data transfer service
Integration with popular visualization tools
Federated BigQuery — Querying data from external sources like Cloud Storage, Google Sheets etc.
Public, Commercial datasets available for usage and learning
Support of Geospatial data types & functions
BigQuery Business Intelligence Engine
BigQuery Machine Learning for simple models in 3 steps
Flexible pricing model
Pay only for what you use
Flat-tier-pricing for dedicated use
No pay for automatic cached query results

How is Big Query accessed and integration with third party:

BigQuery web browser — Runs queries, import/exports data
bq command line tool — python-based tool that can access BigQuery from the command line
Service API
RESTful API to access BigQuery programmatically
Requires authorization by OAuth2
Google client libraries for Python, JavaScript etc.
Result Display: Integration Possible with Third party Tools
Visualization and Statistical Tools tools like Tableau, QlikView, R, Google Data Studio etc.
Export data in a .CSV file, JSON or to GC Storage

How does BigQuery Works?

BigQuery has 2 fully managed services that are serverless. The two services are connected by Google’s high-speed internal network (petabit network). It enables us to separate, compute and storage.

Fast SQL Query Engine: It uses
- Dremel is a query service that allows you to run SQL-like queries against very, very large data sets and get accurate results in mere seconds.
- BigQuery is the public implementation of Dremel.
- BigQuery provides the core set of features available in Dremel to third party developers.
- Borg is used for Cluster Management System.
Managed Storage for datasets
- The tables are stored as highly-compressed columns-”Native storage”.
- Each column of that table highly compressed and Google’s internal Colossus file system.
- CFS durable and global available a backend for Gmail, Google photos
- Storage service supports batch & streaming ingest.

BigQuery Organization:

Projects: Top-level containers in the Google Cloud Platform that store the data
Datasets: Project holds datasets and datasets hold zero or more tables of data.
Tables: Within datasets, tables are row-column structures that hold actual data.
Jobs: The tasks you are performing on the data, such as running queries, loading data, and exporting data.

There are many other features which make BigQuery lighting fast and a low cost data warehouse solution like Nested & Repeated data types, Columnar Storage, Partitioning & Clustering options for effective performance and cost optimization, simple learning curve because of using ANSI SQL and strong service API usage in multiple programming languages.

In 3 simple steps you can build a simple machine learning model on vast amount of train data and you don’t need to worry about data sampling activities’ etc which is a very labour intensive process.

I hope, this article will give a understanding of choosing right data warehouse solution for your organization. BigQuery has big number of clientele compared to its competitors.

Thanks for reading, please share views.

Please find its BigQuery Part-2

References:

https://cloud.google.com/bigquery/what-is-bigquery

https://cloud.google.com/bigquery

Search This Blog

Tech Learn

Google’s BigQuery

Comments

Post a Comment

Popular posts from this blog

Google's BigQuery - Part - 2