Bigquery run multiple queries python

Question

I've been working off of Google Cloud Platform's Python API library. I've had much success with these API samples out-of-the-box, but I'd like to streamline it a bit further by combining the three queries I need to run (and subsequent tables that will be created) into a single file. Although the documentation mentions being able to run multiple jobs asynchronously, I've been having trouble figuring out the best way to accomplish that.

Nội dung chính Show

Write, run, and save multi-statement queries
Write a multi-statement query
Run a multi-statement query
Save a multi-statement query
Use variables in a multi-statement query
Declare a user-created variable
Set a user-created variable
Set a system variable
Reference a user-created variable
Reference a system variable
Use temporary tables in a multi-statement query
Create a temporary table
Reference a temporary table
Delete temporary tables
View temporary table data
Qualify temporary tables with _SESSION
Collect information about a multi-statement query job
Return the last executed statement
Return all executed statements
Enumerate child jobs
Debug a multi-statement-query
Permissions
Security constraints
Configuration field limitations
How do I run multiple queries in BigQuery?
How do I run a query in BigQuery Python?
How many concurrent queries can BigQuery handle?
Can I use Python in BigQuery?

Thanks in advance!

asked Jul 26, 2016 at 19:40

MMMdataMMMdata

7886 silver badges19 bronze badges

1

The idea of running multiple jobs asynchronously is in creating/preparing as many jobs as you need and kick them off using jobs.insert API (important you should either collect all respective jobids or set you own - they just need to be unique). Those API returns immediately, so you can kick them all off "very quickly" in one loop

Meantime, you need to check repeatedly for status of those jobs (in loop) and as soon as job is done you can kick processing of result as needed

You can check for details in Running asynchronous queries

answered Jul 26, 2016 at 21:26

Mikhail BerlyantMikhail Berlyant

155k8 gold badges127 silver badges201 bronze badges

BigQuery jobs are always async by default; this being said, requesting the result of the operation isn't. As of Q4 2021, the Python API does not support a proper async way to collect results. Each call to job.result() blocks the thread, hence making it impossible to use with a single threaded event loop like asyncio. Thus, the best way to collect multiple job results is by using multithreading:

from typing import Dict
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client: bigquery.Client = bigquery.Client()

def run(name, statement):
    return name, client.query(statement).result() # blocks the thread

def run_all(statements: Dict[str, str]):
    with ThreadPoolExecutor() as executor:
        jobs = []
        for name, statement in statements.items():
            jobs.append(executor.submit(run, name, statement))
        result = dict([job.result() for job in jobs])
    return result

P.S.: Some credits are due to @Fredrik Håård for this answer :)

answered Nov 9, 2021 at 17:17

Philippe HebertPhilippe Hebert

1,3541 gold badge18 silver badges39 bronze badges

This document describes how to use multi-statement queries in BigQuery, such as how to write multi-statement queries, use temporary tables in multi-statement queries, reference variables in multi-statement queries, and debug multi-statement queries.

A multi-statement query is a collection of SQL statements that you can execute in one request. With multi-statement queries you can run multiple statements in a sequence, with shared state. Multi-statement queries can have side effects such as adding or modifying table data.

Multi-statement queries are often used in stored procedures and support procedural language statements, which let you do things like define variables and implement control flow.

Write, run, and save multi-statement queries

A multi-statement query consists of one or more SQL statements separated by semicolons. Any valid SQL statement can be used in a multi-statement query. Multi-statement queries can also include procedural language statements, which let you use variables or implement control flow with your SQL statements.

Write a multi-statement query

You can write a multi-statement query in BigQuery. The following multi-query statement query declares a variable and uses the variable inside an IF statement:

DECLARE day INT64;
SET day = (SELECT EXTRACT(DAYOFWEEK from CURRENT_DATE));
if day = 1 or day = 7 THEN
  SELECT 'Weekend';
ELSE
  SELECT 'Weekday';
END IF

BigQuery interprets any request with multiple statements as a multi-statement query, unless the statements consist entirely of CREATE TEMP FUNCTION statements followed by a single SELECT statement. For example, the following is not considered a multi-statement query:

CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y);
SELECT Add(3, 4);

Run a multi-statement query

You can run a multi-statement query in the same way as any other query, for example, in the Google Cloud console or using the bq command-line tool.

Save a multi-statement query

To save a multi-statement query, see Save and share queries.

Use variables in a multi-statement query

A multi-statement query can contain user-created variables and system variables.

You can declare user-created variables, assign values to them, and reference them throughout the query.
You can reference system variables in a query and assign values to some of them, but unlike user-defined variables, you don't declare them. System variables are built into BigQuery.

Declare a user-created variable

You must declare user-created variables either at the start of the multi-statement query or at the start of a BEGIN block. Variables declared at the start of the multi-statement query are in scope for the entire query. Variables declared inside a BEGIN block have scope for the block. They go out of scope after the corresponding END statement. The maximum size of a variable is 1 MB, and the maximum size of all variables used in a multi-statement query is 10 MB.

You can declare a variable with the DECLARE procedural statement like this:

DECLARE x INT64;

BEGIN
DECLARE y INT64;
-- Here you can reference x and y
END;

-- Here you can reference x, but not y

Set a user-created variable

After you declare a user-created variable, you can assign a value to it with the SET procedural statement like this:

DECLARE x INT64 DEFAULT 0;
SET x = 10;

Set a system variable

You don't create system variables, but you can override the default value for some of them like this:

SET @@dataset_project_id = 'MyProject';

You can also set and implicitly use a system variable in a multi-statement query. For example, in the following query you must include the project each time you wish to create a new table:

BEGIN
  CREATE TABLE MyProject.MyDataset.MyTempTableA (id STRING);
  CREATE TABLE MyProject.MyDataset.MyTempTableB (id STRING);
END;

If you don't want to add the project to table paths multiple times, you can assign the dataset project ID MyProject to the @@dataset_project_id system variable in the multi-statement query. This assignment makes MyProject the default project for the rest of the query.

SET @@dataset_project_id = 'MyProject';

BEGIN
  CREATE TABLE MyDataset.MyTempTableA (id STRING);
  CREATE TABLE MyDataset.MyTempTableB (id STRING);
END;

Similarly, you can set the @@dataset_id system variable to assign a default dataset for the query. For example:

SET @@dataset_project_id = 'MyProject';
SET @@dataset_id = 'MyDataset';

BEGIN
  CREATE TABLE MyTempTableA (id STRING);
  CREATE TABLE MyTempTableB (id STRING);
END;

You can also explicitly reference system variables like @@dataset_id in many parts of a multi-statement query. To learn more, see Reference a system variable.

Reference a user-created variable

After you have declared and set a user-created variable, you can reference it in a multi-statement query. If a variable and column share the same name, the column takes precedence.

This returns column x + column x:

DECLARE x INT64 DEFAULT 0;
SET x = 10;

WITH Numbers AS (SELECT 50 AS x)
SELECT (x+x) AS result FROM Numbers;

+--------+
| result |
+--------+
| 100    |
+--------+

This returns column y + variable x:

DECLARE x INT64 DEFAULT 0;
SET x = 10;

WITH Numbers AS (SELECT 50 AS y)
SELECT (y+x) AS result FROM Numbers;

+--------+
| result |
+--------+
| 60     |
+--------+

Reference a system variable

You can reference a system variable in a multi-statement query.

The following query returns the default time zone:

BEGIN
  SELECT @@time_zone AS default_time_zone;
END;

+-------------------+
| default_time_zone |
+-------------------+
| UTC               |
+-------------------+

You can use system variables with DDL and DML queries. For example, here are a few ways to use the system variable @@time_zone when creating and updating a table:

BEGIN
  CREATE TEMP TABLE MyTempTable
  AS SELECT @@time_zone AS default_time_zone;
END;

BEGIN
  CREATE OR REPLACE TABLE MyDataset.MyTable(default_time_zone STRING)
  OPTIONS (description = @@time_zone);
END;

BEGIN
  UPDATE MyDataset.MyTable
  SET default_time_zone = @@time_zone
  WHERE TRUE;
END;

There are some places where system variables cannot be used in DDL and DML queries. For example, you can't use a system variable as a project name, dataset, or table name. This produces an error when you attempt to include the @@dataset_id system variable in a table path:

BEGIN
  CREATE TEMP TABLE @@dataset_id.MyTempTable (id STRING);
END;

Use temporary tables in a multi-statement query

Temporary tables let you save intermediate results to a table. These temporary tables exist at the session level, so you don't need to save or maintain them in a dataset.

You can create and reference a temporary table in a multi-statement query. When you are finished with the temporary table, you can delete it manually or wait for BigQuery to delete it after 24 hours.

Create a temporary table

You can create a temporary table for a multi-statement query with the CREATE TABLE statement. The following example creates a temporary table to store the results of a query and uses the temporary table in a subquery:

-- Find the top 100 names from the year 2017.
CREATE TEMP TABLE top_names(name STRING)
AS
 SELECT name
 FROM `bigquery-public-data`.usa_names.usa_1910_current
 WHERE year = 2017
 ORDER BY number DESC LIMIT 100
;
-- Which names appear as words in Shakespeare's plays?
SELECT
 name AS shakespeare_name
FROM top_names
WHERE name IN (
 SELECT word
 FROM `bigquery-public-data`.samples.shakespeare
);

Other than the use of TEMP or TEMPORARY, the syntax is identical to the CREATE TABLE syntax.

When you create a temporary table, don't use a project or dataset qualifier in the table name. The table is automatically created in a special dataset.

You are not charged for storing temporary tables.

Reference a temporary table

You can refer to a temporary table by name for the duration of the current multi-statement query. This includes temporary tables created by a procedure within the multi-statement query. You cannot share temporary tables, and they are not visible using any of the standard list or other table manipulation methods.

Delete temporary tables

You can delete a temporary table explicitly before the multi-statement query completes by using the DROP TABLE statement:

CREATE TEMP TABLE table1(x INT64);
SELECT * FROM table1;  -- Succeeds
DROP TABLE table1;
SELECT * FROM table1;  -- Results in an error

After a multi-statement query finishes, the temporary table exists for up to 24 hours.

View temporary table data

After you create a temporary table, you can view the structure of the table and any data in it. To view the table structure and data, follow these steps:

In the Google Cloud console, go to the BigQuery Explorer page.
Go to Explorer
Click Query history.
Choose the query that created the temporary table.
In the Destination table row, click Temporary table.

Qualify temporary tables with _SESSION

When temporary tables are used together with a default dataset, unqualified table names refer to a temporary table if one exists, or a table in the default dataset. The exception is for CREATE TABLE statements, where the target table is considered a temporary table if and only if the TEMP or TEMPORARY keyword is present.

For example, consider the following multi-statement query:

-- Create table t1 in the default dataset
CREATE TABLE t1 (x INT64);

-- Create temporary table t1.
CREATE TEMP TABLE t1 (x INT64);

-- This statement selects from the temporary table.
SELECT * FROM t1;

-- Drop the temporary table
DROP TABLE t1;

-- Now that the temporary table is dropped, this statement selects from the
-- table in the default dataset.
SELECT * FROM t1;

You can explicitly indicate that you are referring to a temporary table by qualifying the table name with _SESSION:

-- Create a temp table
CREATE TEMP TABLE t1 (x INT64);

-- Create a temp table using the `_SESSION` qualifier
CREATE TEMP TABLE _SESSION.t2 (x INT64);

-- Select from a temporary table using the `_SESSION` qualifier
SELECT * FROM _SESSION.t1;

If you use the _SESSION qualifier for a query of a temporary table that does not exist, the multi-statement query throws an error indicating that the table does not exist. For example, if there is no temporary table named t3, the multi-statement query throws an error even if a table named t3 exists in the default dataset.

You cannot use _SESSION to create a non-temporary table:

CREATE TABLE _SESSION.t4 (x INT64);  -- Fails

Collect information about a multi-statement query job

A multi-statement query job contains information about a multi-statement query that has been executed. Some common tasks that you can perform with job data include returning the last statement executed with the multi-statement query or returning all statements executed with the multi-statement query.

Return the last executed statement

The jobs.getQueryResults method returns the query results for the last statement to execute in the multi-statement query. If no statement was executed, no results are returned.

Return all executed statements

To get the results of all statements in a multi-statement query, enumerate the child jobs and call jobs.getQueryResults on each of them.

Enumerate child jobs

Multi-statement queries are executed in BigQuery using jobs.insert, similar to any other query, with the multi-statement queries specified as the query text. When a multi-statement query runs, additional jobs, known as child jobs, are created for each statement in the multi-statement query. You can enumerate the child jobs of a multi-statement query by calling jobs.list, passing in the multi-statement query job ID as the parentJobId parameter.

Debug a multi-statement-query

Here are some tips for debugging multi-statement queries:

Use the ASSERT statement to assert that a Boolean condition is true.
Use BEGIN...EXCEPTION...END to catch errors and display the error message and stack trace.
Use SELECT FORMAT("....") to show intermediate results.
When you run a multi-statement query in the Google Cloud console, you can view the output of each statement in the multi-statement query. The bq command-line tool's 'bq query` command also shows the results of each step when you run a multi-statement query.
In the Google Cloud console, you can select an individual statement inside the query editor and run it.

Permissions

Permission to access a table, model, or other resource is checked at the time of execution. If a statement is not executed, or an expression is not evaluated, BigQuery does not check whether the user executing the multi-statement query has access to any resources referenced by it.

Within a multi-statement query, the permissions for each expression or statement are validated separately. For example:

SELECT * FROM dataset_with_access.table1;
SELECT * FROM dataset_without_access.table2;

If the user executing the query has access to table1 but does not have access to table2, the first query will succeed and the second query will fail. The multi-statement query job itself will also fail.

Security constraints

In multi-statement queries, you can use dynamic SQL to build SQL statements at runtime. This is convenient, but can offer new opportunities for misuse. For example, executing the following query poses a potential SQL injection security threat since the table parameter could be improperly filtered, allow access to, and be executed on unintended tables.

-- Risky query vulnerable to SQL injection attack.
EXECUTE IMMEDIATE CONCAT('SELECT * FROM SensitiveTable WHERE id = ', @id);

To avoid exposing or leaking sensitive data in a table or running commands like DROP TABLE to delete data in a table, BigQuery's dynamic procedural statements support multiple security measures to reduce exposure to SQL injection attacks, including:

An EXECUTE IMMEDIATE statement does not allow its query, expanded with query parameters and variables, to embed multiple SQL statements.
The following commands are restricted from being executed dynamically: BEGIN/END, CALL, CASE, IF, LOOP, WHILE, and EXECUTE IMMEDIATE.

Configuration field limitations

The following job configuration query fields cannot be set for a multi-statement query:

clustering
create_disposition
destination_table
destination_encryption_configuration
range_partitioning
schema_update_options
time_partitioning
user_defined_function_resources
write_disposition

Pricing

If you use on-demand billing, BigQuery charges based on the number of bytes processed during execution of the multi-statement queries.

For more information, see Query size calculation.

Quotas

For information about multi-statement query quotas, see Quotas and limits.

How do I run multiple queries in BigQuery?

Work with multi-statement queries.

Write a multi-statement query. You can write a multi-statement query in BigQuery. ... .

Declare a user-created variable. ... .

Set a user-created variable. ... .

Set a system variable. ... .

Reference a user-created variable. ... .

Reference a system variable. ... .

Create a temporary table. ... .

Enumerate child jobs..

How do I run a query in BigQuery Python?

Run the query.

Define a query string and use the client. ... .

Use the bigquery.Query() function to define a query and Query.Read() function to submit the query and get the results. ... .

Use the BigQuery.query() method to start the query..

How many concurrent queries can BigQuery handle?

Your project can run up to 100 concurrent interactive queries.

Can I use Python in BigQuery?

To access BigQuery using Python, you need to create a new project or select an existing one from your Google Cloud Console. Also, you need a service account that has the BigQuery API enabled. To learn how to set up all of these, check out this Python to Google Sheets tutorial.

programming python BigQuery multiple queries BigQuery multiple with BigQuery python transaction Bigquery best practices Python query BigQuery BigQuery transaction