How do I format data for use in Google BigQuery?
Google BigQuery is a powerful cloud-based data warehouse that allows users to store and query large datasets. To load data into BigQuery, it must be in a supported format such as CSV, newline-delimited JSON, Avro, Parquet, or ORC. A common approach is to stage a file in Google Cloud Storage and run a load job with the BigQuery Python client library; the separate BigQuery Data Transfer Service is designed for scheduled, recurring transfers from sources such as Cloud Storage and SaaS applications.
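Before loading, the file itself should follow BigQuery's CSV conventions: UTF-8 encoding, one record per line, and at most one header row. As a minimal sketch (the file name and the two-column layout here are illustrative assumptions matching the schema used below), Python's standard csv module can produce such a file:

```python
import csv

# Example rows matching a simple two-column schema (name STRING, age INTEGER)
rows = [
    {'name': 'Alice', 'age': 34},
    {'name': 'Bob', 'age': 29},
]

# Write a UTF-8 CSV with a single header row and one record per line
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age'])
    writer.writeheader()
    writer.writerows(rows)
```

A file like this, uploaded to a Cloud Storage bucket, is what the load job below expects.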
For example, if you have a CSV file stored in Google Cloud Storage, you can use the following code to load it into BigQuery:
# Imports the Google Cloud client library
from google.cloud import bigquery

# Instantiates a client
bigquery_client = bigquery.Client()

# The name for the new dataset
dataset_id = 'my_new_dataset'

# Prepares a reference to the new dataset
dataset_ref = bigquery.DatasetReference(bigquery_client.project, dataset_id)
dataset = bigquery.Dataset(dataset_ref)

# Creates the new dataset
dataset = bigquery_client.create_dataset(dataset)

# The name for the new table
table_id = 'my_new_table'

# Prepares a reference to the new table
table_ref = dataset_ref.table(table_id)
table = bigquery.Table(table_ref)

# Configures the schema of the table
schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('age', 'INTEGER'),
]
table.schema = schema

# Creates the new table
table = bigquery_client.create_table(table)

# Configures the load job
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1

# The source file for the load job
uri = 'gs://my-bucket/data.csv'

# Loads the CSV file into the table
load_job = bigquery_client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)

# Waits for the load job to complete
load_job.result()
This code creates a new BigQuery dataset and table, defines the table's schema, and then loads the CSV file from Google Cloud Storage into the table. Because the table is created with an explicit schema, the load job does not need schema autodetection; it only needs the source format and the number of header rows to skip.
Code explanation
- `from google.cloud import bigquery`: imports the Google Cloud client library.
- `bigquery_client = bigquery.Client()`: instantiates a client using your default project and credentials.
- `dataset_id = 'my_new_dataset'`: sets the name for the new dataset.
- `dataset_ref = bigquery.DatasetReference(bigquery_client.project, dataset_id)`: prepares a reference to the new dataset (the older `bigquery_client.dataset()` helper is deprecated).
- `dataset = bigquery_client.create_dataset(dataset)`: creates the new dataset.
- `table_id = 'my_new_table'`: sets the name for the new table.
- `table_ref = dataset_ref.table(table_id)`: prepares a reference to the new table.
- `schema = [bigquery.SchemaField('name', 'STRING'), bigquery.SchemaField('age', 'INTEGER')]`: defines the schema of the table.
- `table.schema = schema`: sets the schema on the table object.
- `table = bigquery_client.create_table(table)`: creates the new table.
- `job_config = bigquery.LoadJobConfig()`: configures the load job.
- `job_config.source_format = bigquery.SourceFormat.CSV`: sets the source format of the load job to CSV.
- `job_config.skip_leading_rows = 1`: skips the header row of the CSV file.
- `uri = 'gs://my-bucket/data.csv'`: sets the source file for the load job.
- `load_job = bigquery_client.load_table_from_uri(uri, table_ref, job_config=job_config)`: starts the load job.
- `load_job.result()`: waits for the load job to complete.
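CSV is not the only option: BigQuery also loads newline-delimited JSON (one JSON object per line), which can express nested and repeated fields that CSV cannot. As a minimal sketch of producing such a file with the standard json module (the file name and fields are illustrative assumptions):

```python
import json

# Example records matching the same two-field layout used above
records = [
    {'name': 'Alice', 'age': 34},
    {'name': 'Bob', 'age': 29},
]

# Each record becomes one JSON object on its own line (NDJSON);
# BigQuery expects exactly one object per line, no enclosing array
with open('data.json', 'w', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')
```

To load a file like this instead of CSV, set `job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON` and drop `skip_leading_rows`, since NDJSON has no header row.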