Google’s BigQuery

When I have heard first time the word ‘BigQuery’, it looked me if I need to write a long big SQL queries more than 20 or 30 lines or how will it do some wonders as its name or something else…?

It caused a curiosity, I started to explore it and happy to share my understanding with you.

BigQuery is a Software as a Service data warehouse solution offered by Google under cloud services which is categorized within analytics vertical.

Before diving into BigQuery, there are 2 more words which are relevant to have a better understanding of BigQuery — Data Lake and Data Warehouse.

Data Lake:

Image Source: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Image Source: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Data lake is a scalable and secure data platform that allows enterprises to ingest data, store, process (batch, streaming) and analyze (structured, semi-structured and unstructured data) of any type or any volume. Hence, it is a centralized repository that allows you to store all your structured and unstructured data at any scale.

Properties of data lakes:

  • Scalable, Supports 3Vs of BigData (Variety, Velocity and Volume of data)
  • Durable
  • Highly Available
  • Retain data in ‘AS Is ’its native / raw format

Data Warehouse:


Data Warehouse is needed when you have a defined business case. Data warehouse always contains organized and cleansed data. Data is only loaded when you have business use case and proper schema is defined for downstream application without any redundant information.

Properties of data warehouse:

  • Loaded only when its use is defined
  • Processed / organized / transformed / Cleansed steps are performed while loading the data.
  • Tends to have consistent schema shared across the downstream applications.
  • Provide faster insights like creation of charts, reports, dashboard for senior management.
  • Current / historical data for reporting

Consumers of Data Warehouse:

  • Reporting / Business Intelligence Analyst
  • Decision makers who rely on mass amount of data to make decisions.
  • Users who use customized, complex processes to obtain information from multiple data sources like finance, sales, promotion etc departments.
  • Reporting solution with fast performance on a huge amount of data for reports, grid, charts.
  • Machine Learning Engineers to discover ‘hidden patterns’ of data-flows , groupings and pulling out any other useful insights.
  • Downstream data warehouse teams

Factors to be considered while choosing a good data warehouse:

  • Assets: It requires expensive servers to compute and store data.
  • People: Skilled database administrators are needed to manage data integrity.
  • Cost: Interacting with big data is expensive, slow process.
  • Scale: How much data storage is needed and will storage needs change over time?
  • Resilient: How is data protected to ensure availability and durability?
  • Security: How is the data organized, catalogued and access controlled?

What makes “BigQuery” a viable data warehouse solution?

BigQuery is a fully-managedserver-less data warehouse that enables scalablecost-effective and fast analysis over petabytes of data. It is a server-less Software as a Service that supports querying using ANSI SQL. It also has built-in machine learning capabilities.

Analyze petabytes of data using ANSI SQL at blazing-fast speeds, with zero operational overhead that is No-Ops service that is no infrastructure to manage and no need a database administrator to manage a complex infrastructure setup and software. Data is stored encrypted at rest (by google managed key or customer managed key) and controlled by secured governance process (IAM).

Key Features:

  • A fast Query Engine and fully managed data storage
  • Automatic resource allocation on demand (Storage & Compute)
  • Materialized Views, Server-less, Real-time analytics, Automatic high availability
  • Automatic backup & easy restore
  • Data transfer service
  • Integration with popular visualization tools
  • Federated BigQuery — Querying data from external sources like Cloud Storage, Google Sheets etc.
  • Public, Commercial datasets available for usage and learning
  • Support of Geospatial data types & functions
  • BigQuery Business Intelligence Engine
  • BigQuery Machine Learning for simple models in 3 steps
  • Flexible pricing model
    Pay only for what you use
    Flat-tier-pricing for dedicated use
    No pay for automatic cached query results

How is Big Query accessed and integration with third party:


  1. BigQuery web browser — Runs queries, import/exports data
  2. bq command line tool — python-based tool that can access BigQuery from the command line
  3. Service API
    RESTful API to access BigQuery programmatically
    Requires authorization by OAuth2
    Google client libraries for Python, JavaScript etc.
  4. Result Display: Integration Possible with Third party Tools
    Visualization and Statistical Tools tools like Tableau, QlikView, R, Google Data Studio etc.
  5. Export data in a .CSV file, JSON or to GC Storage

How does BigQuery Works?

BigQuery has 2 fully managed services that are serverless. The two services are connected by Google’s high-speed internal network (petabit network). It enables us to separate, compute and storage.

  1. Fast SQL Query Engine: It uses
    Dremel is a query service that allows you to run SQL-like queries against very, very large data sets and get accurate results in mere seconds.
    - BigQuery is the public implementation of Dremel.
    - BigQuery provides the core set of features available in Dremel to third party developers.
    - Borg is used for Cluster Management System.
  2. Managed Storage for datasets
    The tables are stored as highly-compressed columns-”Native storage”.
    - Each column of that table highly compressed and Google’s internal Colossus file system.
    - CFS durable and global available a backend for Gmail, Google photos
    - Storage service supports batch & streaming ingest.

BigQuery Organization:


  • Projects: Top-level containers in the Google Cloud Platform that store the data
  • Datasets: Project holds datasets and datasets hold zero or more tables of data.
  • Tables: Within datasets, tables are row-column structures that hold actual data.
  • Jobs: The tasks you are performing on the data, such as running queries, loading data, and exporting data.

There are many other features which make BigQuery lighting fast and a low cost data warehouse solution like Nested & Repeated data types, Columnar Storage, Partitioning & Clustering options for effective performance and cost optimization, simple learning curve because of using ANSI SQL and strong service API usage in multiple programming languages.

In 3 simple steps you can build a simple machine learning model on vast amount of train data and you don’t need to worry about data sampling activities’ etc which is a very labour intensive process.

I hope, this article will give a understanding of choosing right data warehouse solution for your organization. BigQuery has big number of clientele compared to its competitors.

Thanks for reading, please share views.

Please find its BigQuery Part-2

References:

https://cloud.google.com/bigquery/what-is-bigquery

https://cloud.google.com/bigquery



Comments

Popular posts from this blog

Google's BigQuery - Part - 2