Google BigQuery Radically Innovative Cloud Data Warehouse June 2021
Table of contents
• BigQuery overview
• BigQuery differentiators
• TCO advantage
• BigQuery roadmap
• BigQuery myths
• Snowflake limitations
• Case studies: Retail, Financial services, Others
Smart Analytics vision
Create a radically simple-to-use, intelligent data platform that provides actionable insights in real time to drive digital transformations across enterprises.
Serverless big data analytics
• Traditional data warehouses: Analysis and insights, Resource provisioning, Handling growing scale, Performance tuning, Monitoring, Reliability, Deployment & configuration, Utilization improvements
• Serverless data analytics: Analysis and insights
Google Cloud’s Smart Analytics Platform
Collect, process, store, analyze, and visualize data & insights
Smart Analytics as a Service: Fully managed. Serverless. Enterprise class. Globally distributed. Secure.
• Collect: Pub/Sub (Messaging), IoT Core, Data Transfer Service, Migration Service
• Process: Dataflow (Streaming), Dataproc (Hadoop/Spark), Dataprep (Wrangling), Data Fusion (Data Integration)
• Store: Databases, Cloud Storage, BigQuery
• Analyze: BigQuery (SQL), Dataproc (Spark)
• Understand: Looker, Partner BI Tools
• Across stages: Data Catalog (metadata management), Composer (workflow orchestration)
• Modes: Streaming, Batch
Top enterprises are experiencing data-driven business transformations
Industries: Retail, Healthcare & life sciences, Financial services, Media & entertainment, Gaming, Energy & manufacturing, Auto & transportation
Regions: AMER, EMEA, JAPAC
Google BigQuery
Cloud-scale enterprise data warehouse with customers ranging from TB to 100+ PB
• Standard SQL (ANSI 2011) with DML support
• Encrypted, durable, highly available
• Serverless platform (unique)
• Real-time insights (unique)
• Built-in ML (unique)
• Insights for everyone (unique)
A reliable data warehouse that protects data so our customers can operate with trust
• Reliable, with a 99.9% uptime SLA *
• Maximum data durability with data replication across multiple data centers
• Data governance and security with data access controls and regulatory compliance
• Built-in data protection with encryption, VPC Service Controls, and data replication
BigQuery third-party audits and certifications provide compliance assurance:
• Global: ISO 27001, ISO 27017, ISO 27018, SOC 1, SOC 2, SOC 3, PCI DSS, CSA STAR
• USA: HIPAA, FedRAMP
• Germany: BSI C5
• Singapore: MTCS Tier 3
* Google internal data, August 2020
BigQuery is in many places
Locations shown (rollout timeline: 2018, 2019, Q4 2019, 2020): Hong Kong, Los Angeles, Tokyo, London, Sydney, Singapore, Taiwan, Finland, N. Virginia, Mumbai, Montréal, Seoul, Netherlands, Belgium, Oregon, Frankfurt, São Paulo, Zurich, South Carolina, and the Asia multi-region
https://cloud.google.com/bigquery/docs/locations
Third-party audits and certifications
Global: ISO 27001, ISO 27017, ISO 27018, SOC 1, SOC 2, SOC 3, PCI DSS, CSA STAR, MPAA, Independent Security Evaluators Audit
Americas
• USA: HIPAA, HiTrust, FedRAMP, FIPS 140-2, COPPA, FERPA, NIST 800-53, NIST 800-171, Sarbanes-Oxley
• Canada: Personal Information & Electronic Documents Act
• Argentina: Personal Data Protection Law
Europe, Middle East, & Africa
• Europe: GDPR, EU Model Contract Clauses, Privacy Shield
• Germany: BSI C5
• South Africa: POPI
• UK: NCSC Cloud Security Principles
• Spain: Esquema Nacional de Seguridad
Asia Pacific
• Australia: Australian Privacy Principles, Australian Prudential Regulatory Authority Standards, IRAP Assessed
• Japan: FISC, My Number Act
• Singapore: MTCS Tier 3
Icons made by Freepik from www.flaticon.com
Smart Analytics: highly differentiated capabilities
• Secure data sharing
• Geospatial analytics
• Built-in intelligence
• Seamless real-time analytics
• Performance at scale
• Data lake interoperability
• Democratize insights (Sheets)
Decoupled Compute, Storage & State Reduce cost and seamless scale
BigQuery | Architectural Advantage
Decoupled storage and compute for maximum flexibility
• Highly available cluster compute (Dremel), SQL:2011 compliant
• Replicated, distributed storage (99.9999999999% durability)
• Distributed memory shuffle tier
• Petabit network
• Streaming ingest and free bulk loading
• Access through REST API, client libraries in 7 languages, web UI, and CLI
Performance at scale
• Only data warehouse that scales from a few terabytes to 100s of petabytes
• Custom file format (Capacitor), deeply integrated with the query engine
• RDMA-based shuffle technology using Google's petabit Jupiter network infrastructure
• Automatic reclustering and history-based optimization
• BI Engine, a fast, in-memory OLAP engine for sub-second queries over small to medium-sized datasets
BigQuery scale
• Concurrency in the 1000s per customer
• 250 petabytes of data stored by one BigQuery customer
• 4.5 million rows/second peak ingestion/insert rate for one BigQuery customer
• 100 trillion rows scanned by another BigQuery customer
Source: The Economic Advantages of Google BigQuery
Administration & Management Lowest TCO with Serverless offering
Enterprise-grade workload management with Reservations
BigQuery Reservations allows customers to:
• Control flat-rate spend
• Buy slots in the web UI in seconds
• Efficiently manage workloads in BigQuery
• Automatically share any unused capacity
BigQuery Reservations: customer benefits
Reservations makes it simple to plan your spending and manage your entire organization’s workloads.
• 100% predictable spend: leverage BigQuery flat-rate pricing; no surprises on your monthly BigQuery bill; mix-and-match on-demand and flat-rate
• Deploy in seconds: buy slots in the BigQuery UI; deployed in seconds; no instance pre-warming; peak performance immediately, with no cache hydration period
• Workload management: buy slots for the entire org; partition slots into workloads & departments; guarantee capacity for workloads; schedule changes programmatically
• No compute silos: idle slots are instantly available across the organization; no wastage or penalty for compartmentalization; economies of scale
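The capacity sharing that Reservations describes (guaranteed slots per workload, with idle slots made available to the rest of the organization) can be pictured with a toy model. The sketch below is not the Reservations API: `allocate_slots`, the reservation names, and the rule for handing out idle slots are hypothetical and only illustrate the idea.

```python
def allocate_slots(reserved, demand):
    """Toy model of slot sharing across reservations.

    reserved: dict of reservation name -> guaranteed slots (hypothetical numbers)
    demand:   dict of reservation name -> slots the workload currently wants
    Returns a dict of reservation name -> slots granted.
    """
    # Step 1: each workload gets up to its guaranteed capacity.
    granted = {name: min(cap, demand.get(name, 0)) for name, cap in reserved.items()}

    # Step 2: guaranteed slots that nobody is using form an idle pool.
    idle = sum(reserved.values()) - sum(granted.values())

    # Step 3: lend idle slots to workloads that still have unmet demand.
    # Visiting names in sorted order is an arbitrary rule chosen for this sketch.
    for name in sorted(reserved):
        unmet = max(demand.get(name, 0) - granted[name], 0)
        extra = min(idle, unmet)
        granted[name] += extra
        idle -= extra
    return granted


if __name__ == "__main__":
    reserved = {"etl": 500, "bi": 300, "adhoc": 200}
    demand = {"etl": 200, "bi": 450, "adhoc": 200}
    print(allocate_slots(reserved, demand))
```

In this model the "bi" workload borrows slots that "etl" is not using, so no purchased capacity sits idle while another workload is waiting.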
"The Slot Reservation API strikes a good balance between control and flexibility for managing BigQuery workloads. We're able to isolate expensive queries from each other without fearing that we're underutilizing BigQuery resources. The API has been remarkably easy to use and, in turn, has empowered us to optimize our workflows without needing to micromanage them." – Reddit Reservations were instrumental in helping us incrementally ramp up slot capacity as we migrated over from another data warehouse, greatly increasing our cost performance. The ability to share idle slot capacity across projects, workloads and users helps ensure our business critical workstreams stay online, while giving users the flexibility to run more complex workloads. – Discord "SKY has been using BigQuery's flat-rate for some time now. Taking advantage of BigQuery's flat-rate pricing has given SKY peace of mind when it comes to performance and our BigQuery bill. Reservations helped SKY rethink how to protect business critical workloads, while isolating lower-priority development projects and make sure we get the most of BigQuery's performance. " – Sky Customers love Reservations
Universal high performance analytical storage Any engine (Flexibility) + Any Storage (Federation)
BigQuery | Managed storage
Durable and persistent storage with automatic backup
• Tables are stored in an optimized columnar format
• Each table is compressed and encrypted on disk
• Storage is durable, and each table is replicated across datacenters
[Figure: Tables 1, 2, and 3, each replicated across Zone A, Zone B, and Zone C within a region]
BigQuery storage: Source for advanced analytics
Valuable data is accessible to all tools that can leverage it. Users should be able to access “warehouse” data easily with multiple tools.
BigQuery Storage API
• BI and ETL tools
• Unified batch and stream processing
• Hadoop and Spark workloads
• Machine learning
Data lake interoperability
• Federation of Cloud SQL, Parquet & ORC, Bigtable, and Sheets brings analysis to the data wherever it is
• The BigQuery Storage API treats the data warehouse like storage, so you can use BigQuery as both a data warehouse and a data lake
Secure data sharing and public datasets
• Securely share datasets with users within and outside the organization through virtual data marts
• Assign roles and permissions to data, including at the project, folder, and dataset level, using Cloud IAM
• Track and audit usage of data in detail
• 70+ public datasets accessible to everyone
Breaking down data silos
Reduce time to insights
Gain faster time to insight by breaking down data silos
• BigQuery Data Transfer Service for SaaS apps: data transfer into BigQuery from SFDC and 100+ business apps in clicks
• Data warehouse migration service: data and schema migration to BigQuery (Teradata to BigQuery, Redshift & S3 to BigQuery)
• Cloud Data Fusion: code-free ETL and data integration across on-prem and cloud sources
Automated data transfer and federation
Use BigQuery to analyze all data important to an organization. Users should be free to make optimal decisions for storage and compute.
• No-code, direct import: 100+ apps, AdWords, DoubleClick, YouTube, Google Analytics 360, Firebase
• Federated query: Cloud Bigtable, Cloud Spanner*, Cloud Storage, Google Drive
*in development
Real-time Data Warehousing Always fast always fresh
Seamless real-time analytics at scale
• BigQuery high-performance streaming makes data immediately available
• Pub/Sub and Dataflow integrations allow customers to build comprehensive batch and streaming pipelines
• BigQuery Streaming API increased streaming capacity by 10x
• Column-oriented, dynamic in-memory execution engine
• Horizontal scaling to support higher concurrency
• Native integration with BigQuery Streaming for real-time data refreshes
• Sub-second queries
BI Engine
• Built on existing BigQuery Storage
• Eliminates the need to manage BI servers, ETL pipelines, or complex extracts
• No need to build and manage traditional OLAP cubes
• Open API for partner integration (coming soon)
[Figure: BI Engine architecture. Labels: Data Studio, Sheets, Partners; SQL, API; Query Execution, Metadata, Reservations; Column-oriented, in-memory storage; BigQuery Storage; Streaming and Batch Ingest]
Intelligent Data Warehousing AI/ML Driven predictive analytics with SQL
Built-in intelligence
BigQuery ML: build custom models with standard SQL
1. Execute ML initiatives without moving data from BigQuery
2. Iterate on models in SQL in BigQuery to increase development speed
3. Automate common ML tasks and hyperparameter tuning
Google Cloud provides extensive integrated AI and ML services for data analytics, including BigQuery ML, AutoML, Cloud ML Engine, TensorFlow, and more.
Supported BigQuery ML models
• Classification: Logistic regression, DNN classifier, XGBoost classifier
• Regression: Linear regression, DNN regressor, XGBoost regressor
• Other models: k-means clustering; Recommendation: Matrix factorization
• Model import/export: TensorFlow and XGBoost models for prediction
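As a rough illustration of how these model families map onto BigQuery ML, the sketch below builds a `CREATE MODEL` statement from a model name. The helper `create_model_sql`, the dataset, table, and column names are hypothetical; the `model_type` strings are the values BigQuery ML uses for these families. Some model types (matrix factorization, for example) need further options that this sketch leaves out.

```python
# Model names from the slide mapped to BigQuery ML model_type values.
MODEL_TYPES = {
    "logistic regression": "logistic_reg",
    "linear regression": "linear_reg",
    "dnn classifier": "dnn_classifier",
    "dnn regressor": "dnn_regressor",
    "xgboost classifier": "boosted_tree_classifier",
    "xgboost regressor": "boosted_tree_regressor",
    "k-means clustering": "kmeans",
    "matrix factorization": "matrix_factorization",
}


def create_model_sql(model_name, model, source_table, label_col=None):
    """Return a CREATE MODEL statement (hypothetical helper, not a client API)."""
    key = model.lower()
    if key not in MODEL_TYPES:
        raise ValueError(f"unsupported model: {model}")
    options = [f"model_type = '{MODEL_TYPES[key]}'"]
    # Supervised models need a label column; clustering does not.
    if label_col is not None:
        options.append(f"input_label_cols = ['{label_col}']")
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS ({', '.join(options)}) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


if __name__ == "__main__":
    # Dataset, table, and column names below are placeholders.
    print(create_model_sql("mydataset.churn_model", "Logistic regression",
                           "mydataset.customers", label_col="churned"))
```

The generated statement would be run like any other query, which is the point of the slide: training stays in SQL and the data stays in BigQuery.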
Looker integration with BigQuery ML
• Explore data
• Create the model
• Operationalize the ML workflow
More partner tools coming soon
Geospatial analytics Bringing our 15 years of investments in Google Maps to BigQuery
Geospatial analytics
Analyze GIS data in BigQuery with familiar SQL
BigQuery GIS
• Accurate spatial analyses with the Geography data type over GeoJSON and WKT formats
• Support for core GIS functions (measurements, transforms, constructors, and more) using familiar SQL
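To make "core GIS functions using familiar SQL" concrete, here is a minimal sketch that builds a radius search from a constructor (`ST_GEOGPOINT`), a measurement (`ST_DISTANCE`), and a predicate (`ST_DWITHIN`). The helper `nearby_points_sql`, the table name, and the `name`, `longitude`, and `latitude` columns are assumptions made for the example.

```python
def nearby_points_sql(table, lon, lat, radius_m):
    """Build a query for rows within radius_m meters of (lon, lat).

    Hypothetical helper: assumes the table has name, longitude, latitude columns.
    """
    if not -180 <= lon <= 180 or not -90 <= lat <= 90:
        raise ValueError("longitude/latitude out of range")
    if radius_m < 0:
        raise ValueError("radius must be non-negative")
    # ST_GEOGPOINT takes longitude first, then latitude.
    origin = f"ST_GEOGPOINT({lon}, {lat})"
    return (
        "SELECT name,\n"
        f"       ST_DISTANCE(ST_GEOGPOINT(longitude, latitude), {origin}) AS meters\n"
        f"FROM `{table}`\n"
        f"WHERE ST_DWITHIN(ST_GEOGPOINT(longitude, latitude), {origin}, {radius_m})\n"
        "ORDER BY meters"
    )


if __name__ == "__main__":
    # Placeholder table name and coordinates.
    print(nearby_points_sql("mydataset.stores", -122.25, 37.46, 1000))
```

Data that already arrives as WKT or GeoJSON would typically be converted with `ST_GEOGFROMTEXT` or `ST_GEOGFROMGEOJSON` instead of being built from coordinate columns.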
Insights for everyone Every business decision powered by Insights
Connected Sheets No SQL or database knowledge needed. Easy to connect to, view, and understand data.
Democratize Insights with Connected Sheets
Work with billions of rows of data using Sheets with BigQuery
• Unlock insights by empowering anyone to easily connect to, view, and understand big data. SQL knowledge now optional.
• Use the familiarity of spreadsheets for self-serve exploration, pivoting, filtering, charting, and formula analysis.
• Boost data-driven decision making and collaboration while controlling permissions to limit who can view, edit, or share data.
Build beautiful, data-source-driven reports that can be shared from Sheets
Discovery & governance Data Catalog to discover and govern the data
Data Catalog
Discover, manage, and understand your data assets
• Fully managed & scalable: easy to get started, there's no infrastructure to set up or manage
• Simplified data discovery: simple and easy-to-use search interface, powered by the Google search technology that supports Gmail and Drive
• Built-in governance: Cloud DLP and Cloud IAM integrations provide a foundation for governance
Simple search interface for data discovery Data Catalog
• Auto-tagging of PII data using DLP
• Customizable schematized tags for business metadata
• Customizable schematized templates for business metadata
Data Catalog
Govern sensitive data by data classes

IncidentId  IncidentType   ReporterPhone  Position          Manifest
234698      Mooring        510-45-6789    40.44N, 73.59W    $10,000
089145      CocInspection  405-94-7201    37.46N, 122.25W   $25M

Data classes (tagged using DLP)
• PII: PhoneNum (ReporterPhone), Location (Position)
• Financial: $Amount (Manifest)
Roles
• Responder: first responders have access to PII (location and phone information) so that they can reach a boat in an emergency.
• Auditor: auditors have access to financial data (manifest information) so that customs can exert control.
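The governance idea on this slide, where columns carry a data class and roles are granted classes rather than individual columns, can be sketched in a few lines. This is an illustrative model only, not the Data Catalog or IAM API; `visible_row`, the `<hidden>` marker, and the choice to leave untagged columns visible to every role are assumptions.

```python
# Data class per column, following the slide's example table.
COLUMN_CLASS = {
    "ReporterPhone": "PII",
    "Position": "PII",
    "Manifest": "Financial",
}

# Data classes each role may read, following the slide's two roles.
ROLE_CLASSES = {
    "Responder": {"PII"},
    "Auditor": {"Financial"},
}


def visible_row(row, role):
    """Return a copy of row with columns the role may not read masked."""
    allowed = ROLE_CLASSES.get(role, set())
    masked = {}
    for column, value in row.items():
        data_class = COLUMN_CLASS.get(column)  # None means the column is untagged
        if data_class is None or data_class in allowed:
            masked[column] = value
        else:
            masked[column] = "<hidden>"
    return masked


if __name__ == "__main__":
    incident = {
        "IncidentId": "234698",
        "IncidentType": "Mooring",
        "ReporterPhone": "510-45-6789",
        "Position": "40.44N, 73.59W",
        "Manifest": "$10,000",
    }
    print(visible_row(incident, "Responder"))
    print(visible_row(incident, "Auditor"))
```

Granting by class means a newly tagged column is covered by existing role grants without editing each role.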
Partner ecosystem Reuse your investment in existing tools
Growing ecosystem of data analytics partners Data Processing Data Ingestion Databases BI/Analytics Management
One of the lowest TCOs in the industry
BigQuery’s economic advantages over AWS Redshift, Azure SQL DW, and Snowflake
3-Year Total Cost of Ownership for Modeled Scenario: 26%-34% lower TCO
• 26% lower TCO than AWS Redshift
• 27% lower TCO than Snowflake
• 34% lower TCO than MS Azure SQL DW
[Figure: bar chart, $0 to $4M, for Google BigQuery, AWS Redshift, Snowflake, and MS Azure SQL DW; cost components: up-front capital investment, on-demand cloud costs, administrative costs, support contracts]
Note: Snowflake support contracts included with on-demand cloud costs
Source: ESG 2019 report,
BigQuery Roadmap
Multi-Cloud BigQuery
• Fully managed and scalable: No infrastructure to set up or manage. Within seconds, you can run a massive query against your data in any major public cloud. BigQuery executes on your cloud of choice.
• Consistent and familiar: No need to maintain multiple copies of data to meet your data analytics needs. Write queries once, they run anywhere.
• Seamless path to migration: Users have the ability to seamlessly migrate their query results, or the data itself housed elsewhere, to Google Cloud.
BigQuery on AWS Compute
Fast follow with Azure
• BigQuery remains the entrypoint for customers to access their data.
• BigQuery runs the Query Engine on AWS as a managed service, which connects to the customer’s data stored on S3.
• Customers do not manage any Anthos Clusters or compute resources.
Query flow
• User runs query in the BigQuery UI and selects an AWS region
• Query transmitted securely to AWS
• BQ on AWS (Query Engine on Anthos Clusters managed by BQ) queries data directly from the customer's S3 buckets
• Query results passed to BQ (on Google Cloud), or query result stored in S3
• Queries on GCP data use the Query Engine and BQ Storage on Google Cloud
Roadmap
Enterprise-class data warehousing
• Elastic resource and workload management. Resources expand to your scale and automatically meet your budget.
• Unlimited, high-performance, fine-grained DML support.
• TurboSQL for low-latency queries, materialized views, both for managed data and federated sources.
Fine-grained security & governance controls
Providing customers the ability to further define their access control levels beyond organization, project, and datasets.
• Policy-based column ACLs
• Org-level visibility
• Column-level retention
• Row-level security
• Table ACLs
• Disaster resilience
BigQuery materialized views
• Zero maintenance: automatically synchronizes data refreshes with data changes in base tables. No user inputs required.
• Always fresh: always consistent with the base table. Querying a materialized view (MV) never returns stale data.
• Smart tuning: when a query targets the base table directly, BigQuery rewrites it to use the MV for better performance and/or efficiency.
Roadmap
Built-in intelligence
• BigQuery ML: adding support for more models, including k-means clustering, XGBoost, DNN, and more.
• Integrations: with AutoML Tables, AI APIs, and CMLE serving.
• Model interop: model export as TF SavedModel (Alpha) for online prediction and model tuning in Python/Java.
Seamless real-time analytics at scale
• BigQuery Streaming API enhancements to support high throughput (500GB/s per table).
• Write API with native Avro and Arrow support.
• Improved reliability of the Streaming API to minimize critical failure bottlenecks and lower resource consumption.
• In-memory storage tier with automatic management for low-latency queries over large datasets. Accelerated BI across first- and third-party tools.
Machine Learning with BQML
[Figure: pipeline with stages Collect, Train & predict, and Serve. Labels: Web App, Devices; BigQuery, Bigtable, GCS; Model retraining; Model export; BI Dashboard; Prediction, Online Prediction, Streaming Prediction]
Supported BigQuery ML models
• Classification: Logistic regression, DNN classifier (TensorFlow), XGBoost, AutoML Tables
• Regression: Linear regression, DNN regressor (TensorFlow), XGBoost, AutoML Tables
• Other models: k-means clustering; Recommendation: Matrix factorization
• Model import/export: TensorFlow and XGBoost models for prediction
Data QnA: natural language interface for BigQuery
1. Democratize insights through self-serve analysis using natural language
2. Increase BI team productivity by eliminating ad hoc reports
3. Access through multiple interfaces (roadmap): Sheets, Looker, voice, chatbots
Roadmap
Democratize insights
• Looker acquisition: provides a unified platform for business intelligence, data applications, and embedded analytics.
• Data QnA: natural language querying.
• Connected Sheets: empowers anyone to easily connect to, view, and understand big data. SQL knowledge now optional.
Data lake interoperability
• Query Parquet & ORC in Cloud Storage: expands federation capabilities and allows users to query Parquet & ORC files in Cloud Storage right from BigQuery.
• Enhancements to the Storage API, with Hive Metastore integration.
• Analyze data in AWS, Azure, and possibly on-prem: expands BigQuery’s capabilities, allowing customers to leverage the power of BigQuery on their data living on-prem or in other clouds.
BigQuery myths Don’t believe everything you hear
Competition is good
Vendors are forced to continuously invest in innovation and improve user experience. Users win.
However, when a competitor spreads demonstrable falsehoods and distorts facts, only that vendor wins. When falsehoods are spread, users lose, and the industry loses.
Always fact-check what vendors say about each other.
BigQuery, according to Snowflake
Snowflake tells its customers and prospects that BigQuery is expensive, limited, hard to use, and insecure. While BigQuery (and Snowflake) can improve, most of what Snowflake says about BigQuery is demonstrably false or misguided.
• Not a real data warehouse
• Poor performance
• Obscure unpredictable pricing
• Limitations and quotas
• Concurrency limits
• Not secure
• Black box
• No workload management
• No data sharing
• Bad for BI workloads
• Limited DML
• Limited clustering & partitions
• No time travel
• Inflexible ingest
• Google will cancel BigQuery
Let’s unpack each of these claims
BigQuery “is not a real data warehouse”
Snowflake asserts
Google BigQuery does not implement key features that are expected in a data warehouse, which means that a lot of database workloads will not work in BigQuery without non-trivial change. For example, BigQuery discourages JOINs at more than a small scale. By contrast, Snowflake is a full data warehouse capable of all of the things the people have come to expect of data warehouses: metadata management, a granular security model, broad SQL support, and more.
Reality: FALSE
• In our testing, BigQuery vastly outperforms Snowflake in JOIN-heavy TPC benchmarks at nearly all scales (100G to 100T). More on this later.
• BigQuery has metadata management (obviously).
• BigQuery has a granular security model.
• BigQuery supports ANSI-standard SQL, DML, and DDL.
• BigQuery also has partitions, a key data warehouse feature.
• Some of the largest ex-Teradata customers run on BigQuery, like Home Depot, Macy's, and Kohl's.
• BigQuery has several of the largest data warehouses on record, some of which are more than 200PB in size.
BigQuery “has poor performance”
Snowflake asserts
Google BigQuery does not perform well. BigQuery discourages JOINs beyond small scale. Because BigQuery doesn’t run on dedicated hardware, BI workloads don’t perform well.
Reality: FALSE
• BigQuery vastly outperforms Snowflake in JOIN-heavy TPC-* benchmarks at nearly all scales (100G to 100T). See next slide for details.
• BigQuery separates storage and compute. BigQuery doesn’t rely on local disk for maximum performance, which makes performance more stable.
• BigQuery is stateless, and its flat-rate capacity DOES offer dedicated compute.
• BI Engine is BigQuery’s unlimited-concurrency, sub-100ms in-memory accelerator for OLAP BI workloads.
BigQuery “obscure unpredictable pricing”
Snowflake asserts
BigQuery's per-query pricing makes it difficult to know what it will cost, meaning costs can add up quickly. Its "unlimited" pricing has lots of limitations and extra-charge scenarios that make it difficult to know what your cost will ultimately be. BigQuery runs the risk of exorbitant costs given its pricing structure. Yes you can scan 1PB fast with BigQuery, but that one query will also cost you $5,500.
Reality: FALSE
• BigQuery offers the flexibility of two models: serverless and reservations. Users can move between them, or use them together.
• Flat-rate pricing is 100% predictable. If you sign up to spend X per month, you will spend X per month. There are no extra charges.
• BigQuery also makes data ingest free, which is unique.
• BigQuery also makes automatic re-clustering free, which is unique.
• We hear from customers that Snowflake’s initial estimates are often exceeded in production, since Snowflake encourages you to spin up more and more data warehouses.
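The trade-off between the two pricing models comes down to simple arithmetic: on-demand cost grows with the data scanned, while flat-rate cost is fixed. The sketch below shows that calculation. The function names are hypothetical, and the per-TiB price, the monthly flat-rate commitment, and the TiB billing unit are placeholder assumptions supplied by the caller, not price quotes.

```python
TIB = 2 ** 40  # bytes per tebibyte; the billing unit is an assumption of this sketch


def on_demand_cost(bytes_scanned, price_per_tib):
    """Cost of scanning bytes_scanned at a per-TiB on-demand price."""
    return bytes_scanned / TIB * price_per_tib


def breakeven_tib(flat_rate_monthly, price_per_tib):
    """Monthly scan volume (TiB) at which on-demand spend equals the flat rate."""
    return flat_rate_monthly / price_per_tib


def cheaper_model(monthly_bytes_scanned, flat_rate_monthly, price_per_tib):
    """Name the cheaper pricing model for a given monthly scan volume."""
    on_demand = on_demand_cost(monthly_bytes_scanned, price_per_tib)
    return "on-demand" if on_demand < flat_rate_monthly else "flat-rate"


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    price, flat = 5.0, 2000.0
    print(breakeven_tib(flat, price))             # scan volume where the two meet
    print(cheaper_model(100 * TIB, flat, price))  # light usage
    print(cheaper_model(1000 * TIB, flat, price)) # heavy usage
```

A comparison like this ignores performance, since a flat-rate commitment also fixes how much compute is available, but it shows why heavy, steady workloads tend toward flat-rate and light or bursty ones toward on-demand.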
BigQuery “limitations and quotas”
Snowflake asserts
BigQuery has quotas on how many concurrent jobs can be run, how many queries can run per day, how much data can be processed at once, and more.
Reality: FALSE
• Two of the three quotas mentioned by Snowflake don’t even exist.
• BigQuery has a default per-project concurrency limit of 100, which can easily be raised by orders of magnitude with support. Customers can also use multiple projects.
• BigQuery is transparent and open about its quotas and limits. Every single service in the world has quotas and limits; do not trust a vendor who tells you they don’t have these.
• Snowflake has a policy of hiding theirs. Snowflake expects you to find these while you’re in production, well after your evaluation period, at which point it is too late. You should be asking Snowflake for the same level of transparency.
BigQuery “concurrency limits”
Snowflake asserts
BigQuery stalls when it hits concurrency limits. BigQuery can provide great performance on queries that simply require large amounts of scanning horsepower, plus BigQuery has made performance improvement with complex joins and data updates. However, BigQuery has architectural limits to concurrency. Performance will stall if simultaneous connections to the data hits 50.
Reality: FALSE
• BigQuery customers are already running 1000+ concurrent queries, and one customer has tested 10k concurrent queries.
• BigQuery has no limit on simultaneous connections.
• BigQuery has a default per-project concurrency limit of 100, not 50, and it can be raised well beyond that with a support ticket. Customers can also use multiple projects.
• Finally, BI Engine is an acceleration layer for BigQuery, specifically for high-concurrency, low-latency scenarios. Snowflake does not have an equivalent.
• Snowflake also has per-DW concurrency limits, which are documented only in an obscure forum post. These are not soft limits raisable via a support ticket, but architectural limits.
“BigQuery is not secure”
Snowflake asserts
VPS Security. For the most stringent security requirements, Snowflake offers its Virtual Private Snowflake (VPS) edition that provides customers the robust data warehouse-as-a-service Snowflake experience but within a dedicated, non-multitenant VPC. Customers' data is isolated from the rest of the Snowflake and AWS cloud infrastructure.
Reality: FALSE
• All BigQuery editions can be configured with Google Cloud’s VPC controls (joined together with other Google Cloud services, versus just Snowflake).
• All of BigQuery supports CMEK and frequent key rotation, not just a special edition.
• BigQuery leverages Google’s world-class security innovation. The Meltdown and Spectre vulnerabilities were discovered by Google. BigQuery was patched downtime-free, and without any user involvement, well before any other vendor knew that these vulnerabilities even existed.
• BigQuery runs on bare metal secured by crypto security chips and Google’s secure network.
• BigQuery is protected against DDoS attacks by the Google Cloud network.
• BigQuery does not run on public IaaS, and thus doesn’t have vendor or misconfiguration risks.
“BigQuery is a black box”
Snowflake asserts
BigQuery is a black box. You submit your job to BigQuery and it finishes when it finishes; users have no ability to control SLAs nor performance. By contrast, with Snowflake customers can easily make choices about the resources needed.
Reality: FALSE
• What Snowflake refers to as “black box” is serverless automation: BigQuery abstracts away the toil associated with data warehousing.
• One example is automatic re-clustering. BigQuery just does it, and for free. Snowflake requires you to pay them, and to administer this process.
• BigQuery Reservations enables advanced and efficient workload management and control over SLAs, without the inefficiencies of spinning up data warehouses and hydrating local storage to get reasonable performance.
• Snowflake’s multi-cluster data warehouse wastes resources because idle capacity is not shared across data warehouses and it takes minutes to spin up and shut down data warehouses.
“No workload management”
Snowflake asserts
Inability to isolate workloads. Because BigQuery was originally designed as a large-scale scan engine, it does not natively have the ability to isolate workloads and allow those workloads to run concurrently. Snowflake easily isolates workloads with virtual warehouses.
Reality: FALSE
• BigQuery Reservations makes it easy for you to predict and control your monthly BigQuery bill.
• BigQuery Reservations guarantees users 100% price predictability.
• BigQuery Reservations gives users the ability to perform enterprise-grade workload management.
• BigQuery Reservations is also efficient, being able to leverage any unused capacity, achieving economies of scale.
• BigQuery Reservations enables you to get capacity in seconds, with no slow local cache to hydrate to get maximum performance.
• Snowflake’s multi-cluster setup creates data silos and wastes resources.
• Snowflake requires you to move data onto local disk to get adequate performance.
“No data sharing”
Snowflake asserts
Granular, live data sharing. Customers can share live data, with better granularity than what's available with Google Cloud and with better ease-of-use.
Reality: FALSE
• BigQuery has supported data sharing since 2012. BigQuery Public Datasets use this exact capability at scale.
• BigQuery enables users to share data down to individual rows and cells.
• BigQuery’s public datasets program gives users real-time access to various free and open datasets, like weather, census, and key points of interest.
• BigQuery’s Data Transfer Service gives users access to hundreds of external data sources like Facebook and Salesforce, as well as users’ Google data like Google Analytics, AdWords, and YouTube.
“Bad for BI workloads”
Snowflake asserts
Variability in query time (due to BQ's non-dedicated compute) makes it a poor choice for customer-facing dashboards.
Reality: FALSE
• BigQuery Reservations does offer dedicated capacity.
• BI Engine is BigQuery’s acceleration layer specifically for BI-style workloads, with typical query latency measured at sub-100ms.
“Limited DML”
Snowflake asserts
Performance on update DML (UPDATEs and DELETEs) can be very poor because BigQuery needs to rewrite entire partitions when data is modified. Limitations of SQL syntax for DML operations (merges/updates).
Reality: Mostly FALSE
• BigQuery locks a partition during DML; it doesn’t have to overwrite the entire partition at all. Snowflake, meanwhile, doesn’t have a concept of partitions, and thus locks the entire table for DML.
• BigQuery has had MERGE and UPDATE statements for 3+ years.
• Despite Snowflake’s claims, atomic mutations are an anti-pattern in Snowflake due to their file-based architecture. Snowflake will not tell you this until you go to production, at which point it is too late.
“Limited clustering and partitions”
Snowflake asserts
BigQuery requires you to partition your data. Snowflake has automatic micro-partitions. BigQuery doesn’t have automatic reclustering, requiring you to overwrite your tables to properly optimize for clustering degradation.
Reality: FALSE
• BigQuery’s partitions are a net additional benefit to customers, allowing customers to manage their data via per-partition idempotency. Snowflake lacks this capability entirely.
• Snowflake’s micro-partitions are just a fancy way of saying that their storage is file-driven. BigQuery natively does the exact same things that Snowflake advertises this way.
• BigQuery does have automatic re-clustering, and it’s better than Snowflake’s in two ways. Unlike Snowflake, BigQuery charges you nothing. Unlike Snowflake, this is an intrinsic feature in BigQuery, so there is nothing to set up, monitor, and administer.
• In addition, BigQuery runs it for free.
“No time travel”
Snowflake asserts
Google does not have time travel.
Reality: FALSE
• Google BigQuery does have time travel (and has had it since 2010) via FOR SYSTEM_TIME AS OF, going back seven days, the ANSI SQL standard way.
• In addition, Snowflake customers tell us that Snowflake’s time travel feature doesn’t work well because it often requires Snowflake to query data that lives in object storage, rather than on local disk, which makes queries ineffective. BigQuery doesn’t have these architectural limitations.
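In BigQuery standard SQL, time travel is written with a `FOR SYSTEM_TIME AS OF` clause on the table reference. The sketch below builds such a query for a point some hours in the past. The helper `time_travel_sql` and the table name are hypothetical, and the seven-day check simply mirrors the window stated on the slide.

```python
MAX_HOURS = 7 * 24  # seven-day window, as stated on the slide


def time_travel_sql(table, hours_ago):
    """Build a query that reads `table` as it was `hours_ago` hours in the past."""
    if not isinstance(hours_ago, int) or not 0 <= hours_ago <= MAX_HOURS:
        raise ValueError("hours_ago must be an integer between 0 and 168")
    return (
        f"SELECT * FROM `{table}`\n"
        "FOR SYSTEM_TIME AS OF "
        f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours_ago} HOUR)"
    )


if __name__ == "__main__":
    # Placeholder table name.
    print(time_travel_sql("mydataset.orders", 24))
```

A query like this is a common way to inspect or recover rows after an accidental UPDATE or DELETE, for example by selecting the earlier state into a new table.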
“Inflexible ingest”
Snowflake asserts
BigQuery is bad at loading data or schema evolution.
Reality: Misleading
• Unlike Snowflake, BigQuery offers free loads.
• BigQuery also offers true streaming ingest that scales to millions of rows per second.
• BigQuery supports schema evolution.
• BigQuery is often used in conjunction with Dataflow, Dataproc, Data Fusion, Alooma, Data Loss Prevention API, or third-party tools like Matillion and Segment, and it supports Parquet, ORC, Avro, CSV, and Hive Metastore.
• BigQuery does lack a VARIANT data type but has rich JSON and string operations, as well as nested structures.
“Google will CANCEL BigQuery”
Snowflake asserts
Google will cancel BigQuery just like they did with Reader, and this is why Google is desperate to bring Snowflake on as a partner.
Reality: FALSE
• Dremel and BigQuery are both older than Snowflake.
• BigQuery and Dremel are critical both to Google Cloud and to Google’s internal analytics use cases.
Snowflake limitations Traditional EDW like Netezza in cloud
Snowflake: gaps & challenges
• Snowflake is just Netezza on the cloud; invest in the future with BigQuery, with real-time ingestion at scale and embedded AI/ML capabilities.
• Snowflake is more expensive than other solutions and will keep getting more expensive at scale.
• Snowflake is less secure for enterprises (does not support CMEK, VPC-SC, Domain Restricted Sharing).
• Lack of interoperability with the data lake creates silos.
• Snowflake’s unpredictable performance makes it unreliable when you need it most, impacting business outcomes.
• Snowflake drives vendor lock-in similar to Oracle.
The Home Depot drives operational efficiency and reduces costs. Reduce shelf-outs, which cost millions in lost sales, with better inventory management. Improve predictive accuracy by 2X with machine learning. Increase incremental revenue with shelf availability improvements. Google Cloud: Optimize on-shelf inventory. POS data; Inventory data. Real-time operations and inventory mgmt; Merchandising and assortment; Customer experience.
Teradata: ~8 hours, 100s TB data stored. BigQuery: ~5 minutes, 10s PB data stored. Challenge: To scale their data warehouse and meet the needs of data analysts; to reduce costs and complexity. How Google helped: Close, ongoing eng-to-eng relationship; created/managed a project plan (milestones, deliverables, owners, and more); identified/resolved challenges, themes, risks, and issues. Impact: As one of Teradata’s largest customers, The Home Depot will save $MM in licensing and maintenance fees, while moving towards a more reliable, scalable, and affordable solution on Google Cloud. “We need you because...you know [our] business, and you know Google internals and technology.” The Home Depot’s Enterprise Architect
Ocado improves forecasts and time to insight. Predict inventory and demand with 80X faster delivery of analytical insights. Improve scale and efficiency and reduce costs by 33% with the help of Google Cloud. Prioritize customer emails using machine learning tags for 4X faster response times. Google Cloud: Predict demand. Purchase order data; Ecommerce data; Customer email data; Telemetry data. Real-time operations and inventory mgmt; Merchandising and assortment; Customer experience.
Ulta increases customer retention and acquisition. Launched Virtual Beauty Advisor to make personalized recommendations. Analyze data from 30 million loyalty members to create unique user experiences. Google Cloud: Personalized guest experience. Sales data; Product reviews data; Purchase transaction data; Social media data. Real-time operations and inventory mgmt; Merchandising and assortment; Customer experience.
Industry: Retail. Country: Germany. Challenge: The B2B wholesaler maintains a broad presence in multiple markets globally. To better match customer needs and drive sales growth, the company wanted its ecommerce platform to fully embrace digitalization. Migration: Stacking tech and intelligence to help customers. With Compute Engine and Virtual Private Cloud, the company’s ecommerce platform benefits from easy VM management and lower operating costs. Solutions including BigQuery and TensorFlow support analytics that create sales intelligence. Results: Reduces instability rates of its ecommerce platform by up to 80%. Scales capacity to match a 45x increase in daily events during the data lake ramp-up phase. Lowers infrastructure costs by 30% to 50%. “When we looked at the data lake project, I recommended that we use managed services on Google Cloud. Google AI and machine learning is always going to be ahead of what we can build in-house.” Dr. Werner Rath, Unit Owner IT Operations, METRO
Customer Story: Zalando. BigQuery Data Transfer Service: Data Transfer Service automates data movement from SaaS applications like AdWords to Google BigQuery on a scheduled, managed basis. Your analytics team can lay the foundation for a data warehouse without writing a single line of code. BEFORE: No ability to run reporting across all AdWords accounts; struggled to use data to answer questions. NOW: Can run ad-hoc reports and answer valuable questions using data; can visualise data using BigQuery and Data Studio themselves. NEXT: Next-level automation using advanced search query analysis & entity recognition; use ML to gain deeper insights from the data and make more advanced business decisions; ingest data from third-party advertising channels, such as Bing, for holistic analysis.
Financial Services
HSBC uncovers insights faster and responds to customer demand. Respond faster to customer demands and improve customer service. Enhance global risk management with a better understanding of trading positions and risks. Run financial analytics 10X faster. Google Cloud: Addresses financial risk and improves customer service. Financial data; Risk data; Transaction data; Trade posting data. Customer support & operations; Customer nurture & relationship expansion; Customer acquisition & onboarding.
ANZ Bank deepens customer relationships and accelerates insights. Reduce credit analysis time from 5 days to 20 seconds with parallel systems. Simplify data movement and transformation to manage dependencies, including layers of the data pipeline. Deliver customized reports to institutional investment clients at scale and become a strategic partner to customers. Google Cloud: Customized reports that deepen customer relationships. Passenger shopping data; Transactional data; Retail customer loyalty data; Payment data; Supply chain data; Credit rating data. Customer support & operations; Customer nurture & relationship expansion; Customer acquisition & onboarding.
Others
How METRO AG uses Google Cloud machine learning technologies
Industry: Technology. Country: United States. Challenge: The fleet management provider helps businesses take care of more than 1.4 million vehicles. However, the company sought to streamline workflows, as on-premises hardware became increasingly complex to manage. Data analytics: Turning vehicle data into insights. BigQuery is a key solution for Geotab, allowing billions of data records to be efficiently stored and processed. BigQuery also improves search depth, enabling complex queries that utilize data such as vehicle movement patterns. Results: Expands capacity to reliably handle more than 3 billion records daily via BigQuery. Accomplishes raw sensor data analysis in only 5 to 10 seconds. Frees the team to focus on development versus maintenance through Google ease of use. “Google Cloud is helping us transform the way we create value for our customers. We can contextually benchmark data at scale, developing extensive data insights.” Mike Branch, VP of Data & Analytics, Geotab
From the largest Hadoop cluster in the EU to BigQuery. 50x increase in volume of insights. Hadoop: ~16 min, 15.5 TB processed. BigQuery: ~33 sec, 750 GB processed. >100 PB data stored in BigQuery.
Smart city innovations, enabled by real-time, geospatial, and integrated ML. Predicting hazardous driving behavior using BigQuery ML and BigQuery GIS. Weather datasets; External GIS data; BigQuery. 2.5B streaming inserts daily.