Data Platform in Google Cloud (BigQuery and Pub/Sub)

MallikaGaur4 · Oct 05, 2025

About This Presentation

BigQuery


Slide Content

Designing and Governing Hyper-Scale Data Platforms on Google Cloud
I. Strategic Overview and Reference Architecture
Modern processing of large datasets requires a cohesive architecture that leverages
scalable, managed services to handle ingestion, transformation, and analysis. Google Cloud
Platform (GCP) achieves this through the tight integration of four fundamental services:
Pub/Sub, Dataflow, Dataproc, and BigQuery. These components form the backbone of
hyperscale data platforms, enabling organizations to move beyond traditional batch latency
cycles and adopt sophisticated real-time processing capabilities.
A. Defining the Roles of the Core GCP Data Services
The success of a cloud-native data architecture hinges on defining clear responsibilities for
each component:
* Pub/Sub (The Ingestion Layer): Pub/Sub functions as the highly scalable, secure, and
"no-ops" messaging or queue system. It serves as the asynchronous distribution backbone
for events, decoupling data producers from consumers. Crucial for building reactive systems,
Pub/Sub provides at-least-once message delivery guarantees and offers always-on
encryption and fine-grained access controls suitable for ingesting sensitive data.
* Dataflow (The Unified Transformation Engine): Dataflow is a fully managed, serverless
execution service that simplifies pipeline development. It uses the Apache Beam SDK,
allowing it to apply complex data transformation and enrichment logic seamlessly across
both stream (real-time) and batch modes. This unified model significantly reduces
operational complexity compared to maintaining separate codebases for batch and
streaming pipelines.
* Dataproc (The Managed Ecosystem Engine): Dataproc provides a dedicated solution for
managing Apache Spark and Hadoop workloads in a controlled environment. It is primarily
utilized when organizations need to migrate existing big data investments to the cloud with
minimal re-architecting, offering direct control over cluster configuration, processing
frameworks, and specific libraries.
* BigQuery (The Analytical Data Warehouse): BigQuery acts as the serverless,
petabyte-scale column-store destination for analysis. Serving as the primary data lakehouse
or data mart, it provides not only storage and query capabilities but also built-in governance
features for managing, monitoring, and securely sharing data and AI assets.
B. The Unified Lambda/Kappa Architecture on GCP
The design of modern GCP data architectures favors agility, often prioritizing the principles
of the Kappa architecture where streaming is the primary integration pattern. This model
typically deploys a low-latency path using Pub/Sub for ingestion, Dataflow for stream
processing, and BigQuery for final analysis. The inherent security and no-ops scalability
offered by Pub/Sub, combined with Dataflow's ability to pipe data directly into BigQuery via
managed templates, significantly streamlines the construction of real-time data marts.
The platform's unified processing approach, built around Dataflow's Apache Beam model,
allows organizations to employ a hybrid strategy resembling both Lambda and Kappa
architectures. The same foundational code can be used for the streaming layer (real-time
data ingestion) and the batch layer (historical backfills or complex overnight aggregations),
simplifying system maintenance and operations. This elastic structure is founded on the
principle of decoupling, whereby services like BigQuery and Dataproc separate persistent
storage (Cloud Storage or BigQuery Storage) from compute resources (slots or VM
clusters). This decoupling is fundamental to GCP’s architecture, ensuring that the scaling of
compute components (Dataflow workers, BigQuery slots) remains independent of the
massive persistence layer, thereby eliminating bottlenecks common in legacy monolithic
systems.
II. Real-Time Data Ingestion and Stream Analytics
Processing unbounded data streams requires specialized capabilities to handle high volume,
low latency, and temporal challenges. Dataflow provides the robust environment necessary
for these advanced streaming pipelines.
A. Pub/Sub and Dataflow Streaming Pipelines
Pub/Sub acts as the robust entry point for analytical events, providing the necessary security
controls, including fine-grained access policies and continuous encryption, essential for
sensitive data. Dataflow consumes messages from a Pub/Sub subscription and facilitates
the streaming Extract, Load, and Transform (ELT) pattern. A foundational use case involves
Google-provided streaming templates that read messages (often JSON-encoded) from
Pub/Sub and write them directly into a BigQuery table.
Although the modern approach favors transformation in BigQuery (ELT), Dataflow enables
lightweight transformations and enrichment functions before the data is loaded. These
User-Defined Functions (UDFs) allow for cleansing or schema normalization on the ingested
message data, supporting critical ETL steps during the streaming load. To achieve optimal
throughput, Dataflow leverages the BigQuery Storage Write API and automatically handles
schema conversions between Apache Beam Row and BigQuery TableRow types. Before
committing to deploying a full Dataflow pipeline, however, architectural prudence requires
evaluating if a simple Pub/Sub BigQuery subscription with UDFs can satisfy requirements. If
simple parsing or filtering is sufficient, opting for the native Pub/Sub features is a superior
cost governance mechanism, as Dataflow is a continuous, resource-consuming service.
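To make the streaming ELT path concrete, the sketch below shows a minimal Pub/Sub-to-BigQuery pipeline written with the Apache Beam Python SDK. The project, subscription, table, and schema names are placeholders, and a production pipeline would add dead-lettering for messages that fail to parse; the Storage Write API method is selected because of the throughput benefit noted above.

```python
# Minimal sketch of a streaming Pub/Sub -> BigQuery pipeline (Apache Beam
# Python SDK). Project, subscription, table, and schema names are
# placeholders; production pipelines would add error handling and
# dead-lettering for malformed messages.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_message(message_bytes):
    """Decode a JSON-encoded Pub/Sub message and normalize its fields."""
    record = json.loads(message_bytes.decode("utf-8"))
    return {
        "event_id": record["id"],
        "event_ts": record["timestamp"],
        "payload": record.get("payload", ""),
    }


options = PipelineOptions(streaming=True)  # add DataflowRunner options in production

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "ParseJson" >> beam.Map(parse_message)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:analytics.events",
            schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
            # Storage Write API for higher throughput than streaming inserts.
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```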
B. Managing Temporal Complexity: Windowing Functions
Unbounded data streams must be segmented into finite, logical components for meaningful
aggregation. This is achieved through windowing functions, which group data based on the
timestamps of individual elements.
Dataflow supports three primary window types:
* Tumbling (Fixed) Windows: Used for simple, non-overlapping aggregation periods, such
as calculating metrics every hour.
* Hopping (Sliding) Windows: Employed for rolling calculations that require overlapping time
intervals, allowing continuous metric updates.
* Session Windows: Crucial for behavioral analytics, these windows group data based on
activity specific to a data key (e.g., a user ID). The window boundary is determined by a
minimum gap duration of idle time, allowing systems to aggregate meaningful clusters of
related events, such as a burst of user mouse clicks separated by long periods of inactivity.
C. Precision and Latency: Watermarks, Triggers, and State Management
In real-world systems, data arrival is not guaranteed to be in time order or at predictable
intervals. To manage this uncertainty, Dataflow utilizes Watermarks—a critical system
component that estimates when all data for a specific time window is expected to have
arrived. If the watermark progresses past the end of a window, any subsequent data arriving with a
timestamp within that window is considered late data. Handling late data requires careful
configuration to balance processing latency and data completeness. Dataflow employs
Triggers to emit preliminary, partial results for a window early, thereby improving end-to-end
latency. These results are then refined and updated if late data arrives before the configured
lateness threshold expires. This mechanism requires robust State Management within
Dataflow, where the window state must be retained until the maximum allowed lateness
period is reached. The engineering configuration of these watermarks, lateness thresholds,
and state retention policies represents a direct trade-off between latency and accuracy,
which maps explicitly to the organization’s Service Level Objectives (SLOs). A strict SLO
requiring low latency may mandate earlier triggering, while a higher tolerance for late data
arrival improves completeness but necessitates longer state retention and, consequently,
increased resource consumption.
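A short illustration of the window types and lateness controls described above, using the Apache Beam Python SDK, appears below. The toy input, key names, and all durations are assumptions chosen only to show the shape of the configuration.

```python
# Illustrative windowing, trigger, and lateness configuration (Apache Beam
# Python SDK). The toy input and all durations are assumptions.
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterProcessingTime,
    AfterWatermark,
)

with beam.Pipeline() as p:
    # Toy keyed events with event-time timestamps attached; a real pipeline
    # would read from Pub/Sub and derive timestamps from the payload.
    events = (
        p
        | beam.Create([("user-1", 0.0), ("user-1", 30.0), ("user-2", 5.0)])
        | beam.Map(lambda kv: window.TimestampedValue((kv[0], 1), kv[1]))
    )

    # Other window choices for comparison:
    #   window.FixedWindows(3600)                  -> tumbling hourly aggregates
    #   window.SlidingWindows(size=600, period=60) -> 10-minute rolling metric
    session_counts = (
        events
        | beam.WindowInto(
            window.Sessions(gap_size=15 * 60),  # close after 15 min of inactivity
            # Emit speculative (early) results, then refine at the watermark.
            trigger=AfterWatermark(early=AfterProcessingTime(30)),
            # Retain window state for 10 minutes of allowed lateness.
            allowed_lateness=10 * 60,
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        | beam.CombinePerKey(sum)  # per-user counts, refined if late data arrives
    )
```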
III. Data Integration Methodologies and Processing Engine Selection
The choice between processing engines and integration methodologies is a strategic
decision that governs platform maintainability, financial efficiency, and architectural flexibility.
A. Strategic Choice: When to Use ETL versus Modern ELT
Historically, Extract, Transform, Load (ETL) dominated data warehousing, performing
transformations outside the warehouse before loading. However, the modern cloud
environment, specifically utilizing BigQuery, heavily favors the Extract, Load, Transform
(ELT) pattern. In the ELT model, raw data is loaded directly into BigQuery, and subsequent
transformations are performed in situ using BigQuery's massive, scalable SQL compute
capability. This approach empowers data analysts by allowing them to develop integration
pipelines using standard SQL, accelerating time-to-value.
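As an illustration of the ELT pattern, the sketch below moves nothing out of the warehouse: it simply runs an in-place SQL transformation through the google-cloud-bigquery client. The project, dataset, and table names are placeholders.

```python
# ELT sketch: raw data is assumed to be loaded into BigQuery already, and the
# transformation runs entirely inside BigQuery as SQL. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

transform_sql = """
CREATE OR REPLACE TABLE analytics.daily_user_events AS
SELECT
  user_id,
  DATE(event_ts) AS event_date,
  COUNT(*)       AS event_count
FROM raw.events
WHERE event_ts IS NOT NULL
GROUP BY user_id, event_date
"""

# The query job executes on BigQuery's SQL compute; no data leaves the
# warehouse, which is the defining characteristic of ELT.
client.query(transform_sql).result()
```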
While pure ELT is the goal for complex analysis, most contemporary architectures employ a
hybrid approach. Initial transformations (E → T → L) required for cleansing,
normalization, or lightweight enrichment are often performed at the ingestion layer using
Dataflow for streaming or staging processes.
B. Processing Engine Deep Dive: Dataflow versus Dataproc
For new, scalable, cloud-native data pipelines, Dataflow should be the default choice. Its
serverless nature eliminates infrastructure management overhead, meaning engineering
teams do not need to configure or manage underlying clusters. Dataflow provides dynamic
autoscaling and automated work rebalancing, making it highly efficient for fluctuating
workloads.
In contrast, Dataproc is selected for specific compatibility needs. It is ideal for teams
migrating existing investments in Apache Spark, Hadoop, or other open-source big data
technologies, as it minimizes re-architecting. Dataproc provides direct control over the
cluster environment, which is necessary for complex batch processing requiring specific
custom configurations or libraries.
For ELT workloads, Dataproc can interact directly with the analytical layer. Dataproc uses
the spark-bigquery-connector (e.g., gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar for
Dataproc image 1.5+) to read data from BigQuery tables directly into a Spark DataFrame.
This allows Spark applications to perform heavy aggregations, joins, or preparation for
machine learning models using the established Spark ecosystem tools before writing results
back to BigQuery or Cloud Storage.
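The sketch below shows what such a Dataproc job might look like as a PySpark application that reads a BigQuery table through the spark-bigquery-connector, aggregates it in Spark, and writes the result back. The table names and the staging bucket are placeholders.

```python
# PySpark sketch for a Dataproc job: read a BigQuery table into a DataFrame
# via the spark-bigquery-connector, aggregate in Spark, and write the result
# back to BigQuery. Table names and the staging bucket are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bq-spark-aggregation").getOrCreate()

# Read directly from BigQuery storage into a Spark DataFrame.
events = (
    spark.read.format("bigquery")
    .option("table", "my-project.raw.events")
    .load()
)

daily = (
    events
    .groupBy("user_id", F.to_date("event_ts").alias("event_date"))
    .agg(F.count("*").alias("event_count"))
)

# Indirect writes stage data in Cloud Storage before loading into BigQuery.
(
    daily.write.format("bigquery")
    .option("table", "my-project.analytics.daily_user_events")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("overwrite")
    .save()
)
```

When the job is submitted, the connector JAR referenced above is typically attached to the Dataproc job (for example via the --jars flag) unless the cluster image already bundles it.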
C. Decision Framework: Choosing the Right Engine
The choice between Dataflow and Dataproc balances operational overhead against
flexibility. The decision matrix below summarizes the selection criteria.
Dataflow vs. Dataproc for Big Data Processing
| Decision Factor | Cloud Dataflow (Apache Beam) | Cloud Dataproc (Spark/Hadoop) |
|---|---|---|
| Underlying Model | Fully Managed, Serverless | Managed Cluster (VMs) |
| Best Use Case | Unified batch/streaming, low-latency stream analytics, dynamic scalability | Legacy migration, existing Spark/Hadoop investment, complex batch processing requiring specific libraries |
| Infrastructure Overhead | None (eliminates cluster management) | Requires cluster configuration and management (control over configs) |
| Cost Optimization Strategy | Pay-as-you-go, Flexible Resource Scheduling (FlexRS) for batch, autoscaling | Preemptible VMs, custom configurations, instance rightsizing |
| Expertise Required | Apache Beam SDK | Spark, Hadoop, cluster configuration management |
| Default Architectural Choice | Yes, for new cloud-native pipelines | Used for specific compatibility needs |
The strategic introduction of Dataflow’s Flexible Resource Scheduling (FlexRS) has
positioned it as the cost-optimized solution for most non-time-sensitive batch processing.
FlexRS reduces batch costs by finding the best time to start the job and combining
preemptible VM instances with standard VMs, ensuring pipeline progress while leveraging
the lower cost of preemptible resources. This capability directly mitigates Dataproc’s
historical cost advantage in batch processing, further establishing Dataflow as the preferred
low-overhead option. For complex enterprise environments, these tools are often utilized
together, with orchestration services like Cloud Composer managing the sequencing,
triggering Dataflow for real-time tasks and Dataproc for specialized batch or machine
learning preparation jobs.
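A hedged sketch of how a batch Beam pipeline might request FlexRS scheduling through Dataflow pipeline options follows; the project, region, bucket paths, and input pattern are placeholders.

```python
# Sketch of launching a batch pipeline with FlexRS cost-optimized scheduling.
# Project, region, and bucket values are placeholders; the flexrs_goal option
# asks Dataflow to delay the start and mix preemptible with standard workers.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",  # FlexRS: delayed, cost-optimized scheduling
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadArchive" >> beam.io.ReadFromText("gs://my-bucket/archive/*.json")
        | "CountLines" >> beam.combiners.Count.Globally()
        | "WriteCount" >> beam.io.WriteToText("gs://my-bucket/output/line_count")
    )
```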
IV. BigQuery: The Analytical Layer and Performance Engineering
As the analytical repository, BigQuery's operational cost model necessitates rigorous
optimization of both physical data structure and querying discipline. Since BigQuery costs
are determined primarily by the volume of data scanned, not the number of rows returned,
physical data modeling becomes a critical financial governance mechanism.
A. Cost-Efficient Data Storage and Querying
The primary techniques for optimizing BigQuery data access involve physical data modeling:
* Partitioning: This technique divides a table into distinct segments, typically based on date
or ingestion time. When a query filters on the partitioning column, BigQuery can prune
irrelevant partitions, drastically limiting the total data scanned and reducing costs.
* Clustering: Clustering refines the organization within partitions. By designating up to four
clustering columns (often high-cardinality identifiers like user or event IDs), BigQuery
efficiently micro-segments data blocks. When queries filter on these clustered columns,
BigQuery only scans the necessary blocks, which is the single most important technique for
optimizing how data is read and reducing costs for filtered queries.
Equally important is disciplined querying. Data minimization is a mandatory best practice:
queries should explicitly select only required columns. The practice of using SELECT * is
discouraged because it forces the query engine to scan all columns in the table, incurring
maximum cost, even if a LIMIT clause is applied. Engineers must always preview the query
processing estimate before execution to anticipate and avoid accidental high-cost
operations.
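The sketch below ties these practices together: it creates a date-partitioned, clustered table and then dry-runs a filtered query to preview the bytes that would be scanned before the query is actually executed. Project, dataset, and column names are placeholders.

```python
# Sketch of the physical-design and query-discipline practices above: create
# a date-partitioned, clustered table, then dry-run a query to preview bytes
# scanned before paying for it. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE TABLE IF NOT EXISTS analytics.events_partitioned
PARTITION BY DATE(event_ts)          -- prune partitions on date filters
CLUSTER BY user_id, event_type       -- micro-segment blocks within partitions
AS
SELECT event_id, user_id, event_type, event_ts
FROM raw.events
"""
client.query(ddl).result()

# Dry run: estimate scanned bytes without executing or incurring query cost.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
preview = client.query(
    """
    SELECT user_id, COUNT(*) AS events        -- select only needed columns
    FROM analytics.events_partitioned
    WHERE DATE(event_ts) = '2025-10-01'       -- partition filter
      AND event_type = 'click'                -- clustered-column filter
    GROUP BY user_id
    """,
    job_config=job_config,
)
print(f"Estimated bytes processed: {preview.total_bytes_processed}")
```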
BigQuery Physical Design and Cost Optimization Checklist
| Optimization Technique | Action | Primary Benefit |
|---|---|---|
| Data Minimization | Avoid SELECT *. Select only required columns. | Reduces data scanned; direct cost reduction. |
| Table Partitioning | Partition tables by ingestion date or time column. | Limits data scanned by pruning irrelevant partitions. |
| Table Clustering | Cluster by relevant high-cardinality keys (up to 4 fields). | Reduces data blocks scanned within partitions, improving retrieval efficiency. |
| Capacity Planning | Monitor slot usage; transition to Flex Slots or Reservations. | Provides cost predictability and potentially lower unit costs. |
| Query Preview | Always check the query processing estimate before running large queries. | Prevents accidental high-cost queries. |
B. Capacity Planning and Billing Models
For organizations with high or stabilizing BigQuery consumption, active capacity
management is required. Monitoring slot consumption using the BigQuery slot estimator is
essential to track on-demand usage. Once average usage consistently exceeds a specified
threshold (e.g., 100 slots), migrating to the Reservation pricing model provides predictable,
fixed capacity and improves budget adherence. Flex Slots offer a short-term, 1-minute
commitment option, providing flexibility while utilizing the reservation capacity model for
transitionary periods. Furthermore, mandatory administrative cost controls, including setting
up BigQuery budgets, alerts, and quota limits, must be implemented to manage spending
proactively.
BigQuery's robust sharing capabilities also position it as a secure data marketplace. It allows
secure, live, zero-copy sharing of diverse data and AI assets, including BigQuery datasets,
tables, SQL stored procedures, ML models, and even real-time streams via Pub/Sub topics.
This capability fundamentally enhances collaboration and reduces the need for duplicative
ETL processes and redundant storage across different organizational units.
V. Operational Excellence and Data Governance Framework
Data governance is achieved not by separate tools, but by intrinsic, integrated capabilities
within the GCP ecosystem, coordinated primarily through Dataplex. Governance
encompasses automated data lineage, consistent quality validation, and rigorous Identity
and Access Management (IAM) controls.
A. Establishing End-to-End Data Lineage via Dataplex Universal Catalog
Data lineage, tracked via the Dataplex Universal Catalog, records the relationships between
data assets and the processes (jobs) that created or transformed them. Tracking is
automatic once the Data Lineage API is enabled on a per-project basis, and it extends
coverage to Dataflow, Dataproc, and BigQuery jobs. For BigQuery, this automated tracking includes critical operations such as copy jobs, load
jobs using Cloud Storage URIs, and various Data Definition Language (DDL) and Data
Manipulation Language (DML) statements. Specifically, lineage is recorded for DDL
commands like CREATE TABLE AS SELECT and DML statements such as INSERT
SELECT, MERGE, UPDATE, and DELETE. BigQuery jobs are represented as "processes" in
the lineage graph, with each process containing the BigQuery job_id for traceability. This
integration confirms that governance and compliance are intrinsic to data processing
execution; since assets are auditable through specific SQL commands, impact analysis and
regulatory tracking are significantly simplified. Lineage information is viewable as an
interactive graph or a list, supporting both table-level and column-level traceability for
BigQuery processes.
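For illustration, the sketch below issues two statements of the kinds listed above (a CREATE TABLE AS SELECT and a MERGE) through the google-cloud-bigquery client; once the Data Lineage API is enabled for the project, each should appear as a lineage process keyed by its BigQuery job_id. Table names are placeholders.

```python
# Statements of the kinds listed above (CTAS and MERGE), for which lineage is
# recorded automatically once the Data Lineage API is enabled. Table names are
# placeholders; each job becomes a "process" in the lineage graph.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# CTAS: lineage links raw.events (upstream) to analytics.events_clean (downstream).
ctas_job = client.query("""
CREATE OR REPLACE TABLE analytics.events_clean AS
SELECT event_id, user_id, event_ts
FROM raw.events
WHERE event_id IS NOT NULL
""")
ctas_job.result()

# MERGE: lineage links the staging table to the curated target table.
merge_job = client.query("""
MERGE analytics.users AS target
USING staging.users_delta AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET target.email = source.email
WHEN NOT MATCHED THEN INSERT (user_id, email) VALUES (source.user_id, source.email)
""")
merge_job.result()

print("Lineage processes recorded for jobs:", ctas_job.job_id, merge_job.job_id)
```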
B. Security and Access Control for Traceability
Maintaining end-to-end data lineage across enterprise platforms, which often span multiple
projects, depends critically on synchronized IAM policy application. To view lineage, the
roles/datalineage.viewer role must be granted on both the projects where the lineage
information was recorded and the projects where the user is viewing the data. Further granular permissions are necessary for full contextual understanding: the
roles/dataplex.catalogViewer grants permission to view metadata, and permissions like
bigquery.jobs.get are needed to retrieve the underlying SQL query or job definition of the
process that generated the lineage. This dependence on cross-project IAM synchronization
mandates that architectural rigor must extend to multi-project policy management to ensure
comprehensive data stewardship and security compliance across the processing lifecycle.
C. Ensuring Data Quality and Integrity
Dataplex Auto Data Quality provides the mechanism to define, measure, and automate the
validation of data integrity within BigQuery tables. Organizations can utilize predefined rules,
rules generated via data profiling recommendations, or custom SQL rules. This enables
quality management to be treated as code, allowing data expectations to be validated
against rules in production pipelines, with integrated monitoring and alerting for failures.
D. Pipeline Resource and Cost Management Strategies
Operational discipline in managing compute resources is essential for cost control. For
Dataflow, optimization begins with defining Service Level Objectives (SLOs) for throughput
and latency. Engineers use monitoring tools, including the Dataflow job interface, Metrics
Explorer, and Cloud Profiler (for CPU-intensive stages), to identify performance bottlenecks.
Cost-efficiency is achieved through customizing worker resources via "right fitting" and
strategically using FlexRS for batch workloads. For Dataproc, security and operations rely on
isolating clusters within a Virtual Private Cloud (VPC), enforcing IAM role-based access
control, and defining initialization actions for environment tuning.
VI. Conclusion and Strategic Recommendations
The GCP suite of services (Pub/Sub, Dataflow, Dataproc, and BigQuery) provides a cohesive, serverless foundation for real-time ingestion, unified stream and batch processing, scalable analytics, and built-in governance across the data lifecycle.