Install the Google Cloud CLI and Python packages


sudo apt install google-cloud-cli
python3 -m pip install pandas_gbq
python3 -m pip install google-cloud-bigquery

You will need to run `gcloud auth application-default login` in your terminal to create Application Default Credentials (ADC), and you should see output similar to:

Credentials saved to file: [/Users/parity/.config/gcloud/application_default_credentials.json]


gcloud auth application-default login

# or on a server without a browser, use the --no-launch-browser flag:
gcloud auth application-default login --no-launch-browser

# or to list all accounts
gcloud auth list
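
If you want to double check from Python that the credentials were picked up, here is a minimal sketch using the google-auth package (installed as a dependency of google-cloud-bigquery); it simply loads the ADC and prints the default project:

import google.auth

# Loads the Application Default Credentials created above;
# raises DefaultCredentialsError if none are found.
credentials, adc_project = google.auth.default()
print(adc_project)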

The full Google Cloud documentation can be found here.

Assuming that works, you can begin with the Python code by importing the necessary GCP module along with pandas:

from google.cloud import bigquery
import pandas as pd

Finally, you need to specify the GCP project ID:

project_id = "parity-data-infra-evaluation"
client = bigquery.Client(project=project_id)

Now you can start running some queries. The code to run a query is `client.query(sql).to_dataframe()`, where `sql` is the query string.

sql = """select * 
from `parity-data-infra-evaluation.dotlake_dev.acala`,
UNNEST(events) as e where event_id = 'Swap' 
limit 3;"""

client.query(sql).to_dataframe()

A second example:

sql = """select * 
from `parity-data-infra-evaluation.dotlake_dev.acala`,
UNNEST(extrinsics) as e 
where extrinsic_call_function = 'swap_with_exact_supply' 
limit 3;"""

client.query(sql).to_dataframe()

In the two examples above you will see nested JSON in the events and extrinsics columns; we were still able to query their contents thanks to the UNNEST function.
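
You do not have to pull back whole rows either; assuming events is an array of structs (which the nested JSON suggests), you can select just the unnested alias to get only the event fields:

sql = """select e.*
from `parity-data-infra-evaluation.dotlake_dev.acala`,
UNNEST(events) as e where event_id = 'Swap'
limit 3;"""

client.query(sql).to_dataframe()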

Example of saving a Pandas DataFrame to a BigQuery table

Here is Google's standard example of how to save data to a BigQuery table. The table name is stored in the table_id variable.

import datetime

from google.cloud import bigquery
import pandas
import pytz

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "parity-data-infra-evaluation.roger.example"

records = [
    {
        "title": "The Meaning of Life",
        "release_year": 1983,
        "length_minutes": 112.5,
        "release_date": pytz.timezone("Europe/Paris")
        .localize(datetime.datetime(1983, 5, 9, 13, 0, 0))
        .astimezone(pytz.utc),
        # Assume UTC timezone when a datetime object contains no timezone.
        "dvd_release": datetime.datetime(2002, 1, 22, 7, 0, 0),
    },
    {
        "title": "Monty Python and the Holy Grail",
        "release_year": 1975,
        "length_minutes": 91.5,
        "release_date": pytz.timezone("Europe/London")
        .localize(datetime.datetime(1975, 4, 9, 23, 59, 2))
        .astimezone(pytz.utc),
        "dvd_release": datetime.datetime(2002, 7, 16, 9, 0, 0),
    },
    {
        "title": "Life of Brian",
        "release_year": 1979,
        "length_minutes": 94.25,
        "release_date": pytz.timezone("America/New_York")
        .localize(datetime.datetime(1979, 8, 17, 23, 59, 5))
        .astimezone(pytz.utc),
        "dvd_release": datetime.datetime(2008, 1, 14, 8, 0, 0),
    },
    {
        "title": "And Now for Something Completely Different",
        "release_year": 1971,
        "length_minutes": 88.0,
        "release_date": pytz.timezone("Europe/London")
        .localize(datetime.datetime(1971, 9, 28, 23, 59, 7))
        .astimezone(pytz.utc),
        "dvd_release": datetime.datetime(2003, 10, 22, 10, 0, 0),
    },
]
dataframe = pandas.DataFrame(
    records,
    # In the loaded table, the column order reflects the order of the
    # columns in the DataFrame.
    columns=[
        "title",
        "release_year",
        "length_minutes",
        "release_date",
        "dvd_release",
    ],
    # Optionally, set a named index, which can also be written to the
    # BigQuery table.
    index=pandas.Index(
        ["Q24980", "Q25043", "Q24953", "Q16403"], name="wikidata_id"
    ),
)
job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected. For
        # example the "title" column uses pandas dtype "object", so its
        # data type is ambiguous.
        bigquery.SchemaField("title", bigquery.enums.SqlTypeNames.STRING),
        # Indexes are written if included in the schema by name.
        bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
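
As a quick sanity check you can read the freshly loaded table back with the same client, reusing the table_id from above:

# Read the loaded table back into a DataFrame.
sql = f"select * from `{table_id}` order by release_year;"
client.query(sql).to_dataframe()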

Now that I have shown you the hard way, I should also show you the easy way, which makes use of the pandas_gbq package's to_gbq function.


import pandas_gbq

outi = {'writing':1,
        'tutorials':2,
        'is':3,
        'fun':4}
tmp = pd.DataFrame(outi, index=[0])

pandas_gbq.to_gbq(tmp, 'parity-data-infra-evaluation.roger.example', if_exists="replace")
## 
  0%|          | 0/1 [00:00<?, ?it/s]
100%|##########| 1/1 [00:00<00:00, 16008.79it/s]

Note that while pd.DataFrame is pretty good at converting a JSON object into a DataFrame, in this case the JSON object contains only scalar values, so we had to add the index parameter to the DataFrame call.
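
To see the difference, compare a list of records, which converts without any extra arguments, with a dictionary of scalar values, which needs an explicit index:

# A list of records converts directly: each dict becomes one row.
pd.DataFrame([{'writing': 1, 'tutorials': 2}, {'writing': 10, 'tutorials': 20}])

# A dict of scalars raises an error unless you pass an index (or wrap the values in lists).
pd.DataFrame({'writing': 1, 'tutorials': 2}, index=[0])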

Finally, we show the easy way to read data using the read_gbq function:


import pandas_gbq

pandas_gbq.read_gbq("select * from `parity-data-infra-evaluation.roger.example`;", progress_bar_type=None)
##    writing  tutorials  is  fun
## 0        1          2   3    4
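
read_gbq also accepts an explicit project_id, which is handy if your ADC default project is not the one you want to bill the query to:

pandas_gbq.read_gbq(
    "select * from `parity-data-infra-evaluation.roger.example`;",
    project_id="parity-data-infra-evaluation",
    progress_bar_type=None,
)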

TODO: Nothing, we are pretty much done here.