Install the Google Cloud CLI and the Python packages
sudo apt install google-cloud-cli
python3 -m pip install pandas_gbq
python3 -m pip install google-cloud-bigquery
You will need to run `gcloud auth application-default login` in your terminal to create Application Default Credentials (ADC), and you should see output similar to:
Credentials saved to file: [/Users/parity/.config/gcloud/application_default_credentials.json]
gcloud auth application-default login
# or on a server without a browser use the special flag:
gcloud auth application-default login --no-launch-browser
# or to list all accounts
gcloud auth list
The full Google Cloud documentation can be found here.
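As an optional sanity check (not part of the original flow), you can confirm from Python that the ADC are picked up; the google.auth package is installed as a dependency of google-cloud-bigquery:

import google.auth

# Loads the Application Default Credentials created by gcloud above.
credentials, detected_project = google.auth.default()
print(detected_project)  # may be None if no quota project is associated with the ADC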
Assuming that works, you can begin with the Python code by importing the necessary modules:
from google.cloud import bigquery
import pandas as pd
Finally, you need to specify the GCP project ID and create a client:
= "parity-data-infra-evaluation"
project_id = bigquery.Client(project_id) client
Now you can start running some queries. The actual code to run a query is client.query(sql).to_dataframe(), where sql is the query string.
= """select *
sql from `parity-data-infra-evaluation.dotlake_dev.acala`,
UNNEST(events) as e where event_id = 'Swap'
limit 3;"""
client.query(sql).to_dataframe()
A second example:
= """select *
sql from `parity-data-infra-evaluation.dotlake_dev.acala`,
UNNEST(extrinsics) as e
where extrinsic_call_function = 'swap_with_exact_supply'
limit 3;"""
client.query(sql).to_dataframe()
In the two examples above you will see some nested JSON in the events and extrinsics columns, and we were still able to query the contents thanks to the UNNEST function.
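As a further sketch of the same idea (assuming the events column has the same structure as in the queries above), you can also aggregate over the unnested rows, for example to count events by event_id:

sql = """select event_id, count(*) as n
from `parity-data-infra-evaluation.dotlake_dev.acala`,
UNNEST(events) as e
group by event_id
order by n desc
limit 10;"""
client.query(sql).to_dataframe()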
Example of saving a Pandas DataFrame to a BigQuery table
Here is Google's standard example of how to save data to a BigQuery table. The table name is stored in the table_id variable.
import datetime

from google.cloud import bigquery
import pandas
import pytz

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "parity-data-infra-evaluation.roger.example"

records = [
    {
        "title": "The Meaning of Life",
        "release_year": 1983,
        "length_minutes": 112.5,
        "release_date": pytz.timezone("Europe/Paris")
        .localize(datetime.datetime(1983, 5, 9, 13, 0, 0))
        .astimezone(pytz.utc),
        # Assume UTC timezone when a datetime object contains no timezone.
        "dvd_release": datetime.datetime(2002, 1, 22, 7, 0, 0),
    },
    {
        "title": "Monty Python and the Holy Grail",
        "release_year": 1975,
        "length_minutes": 91.5,
        "release_date": pytz.timezone("Europe/London")
        .localize(datetime.datetime(1975, 4, 9, 23, 59, 2))
        .astimezone(pytz.utc),
        "dvd_release": datetime.datetime(2002, 7, 16, 9, 0, 0),
    },
    {
        "title": "Life of Brian",
        "release_year": 1979,
        "length_minutes": 94.25,
        "release_date": pytz.timezone("America/New_York")
        .localize(datetime.datetime(1979, 8, 17, 23, 59, 5))
        .astimezone(pytz.utc),
        "dvd_release": datetime.datetime(2008, 1, 14, 8, 0, 0),
    },
    {
        "title": "And Now for Something Completely Different",
        "release_year": 1971,
        "length_minutes": 88.0,
        "release_date": pytz.timezone("Europe/London")
        .localize(datetime.datetime(1971, 9, 28, 23, 59, 7))
        .astimezone(pytz.utc),
        "dvd_release": datetime.datetime(2003, 10, 22, 10, 0, 0),
    },
]
dataframe = pandas.DataFrame(
    records,
    # In the loaded table, the column order reflects the order of the
    # columns in the DataFrame.
    columns=[
        "title",
        "release_year",
        "length_minutes",
        "release_date",
        "dvd_release",
    ],
    # Optionally, set a named index, which can also be written to the
    # BigQuery table.
    index=pandas.Index(
        ["Q24980", "Q25043", "Q24953", "Q16403"], name="wikidata_id"
    ),
)
job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected. For
        # example the "title" column uses pandas dtype "object", so its
        # data type is ambiguous.
        bigquery.SchemaField("title", bigquery.enums.SqlTypeNames.STRING),
        # Indexes are written if included in the schema by name.
        bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
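If you want to sanity-check what was just loaded (this step is not part of Google's sample), you can read the table straight back with the client's list_rows method:

# Read the freshly loaded table back into a DataFrame for a quick look.
rows = client.list_rows(table_id)  # accepts the same "project.dataset.table" string
print(rows.to_dataframe().head())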
Now that I have shown you the hard way, I should also show you the easy way, which makes use of the pandas_gbq package's to_gbq function.
import pandas_gbq

outi = {'writing': 1,
        'tutorials': 2,
        'is': 3,
        'fun': 4}
tmp = pd.DataFrame(outi, index=[0])
pandas_gbq.to_gbq(tmp, 'parity-data-infra-evaluation.roger.example', if_exists="replace")
## 0%|          | 0/1 [00:00<?, ?it/s]
## 100%|##########| 1/1 [00:00<00:00, 16008.79it/s]
Note that while pd.DataFrame is pretty good at converting a JSON object into a DataFrame, in this case, since the values are all scalars, we had to add the index parameter to the DataFrame call.
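To illustrate the point with a toy example (not tied to BigQuery): a dict of scalars needs an explicit index, whereas a dict of lists does not:

pd.DataFrame({'writing': 1, 'tutorials': 2}, index=[0])  # scalar values: index required
pd.DataFrame({'writing': [1], 'tutorials': [2]})         # list values: index inferred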
Finally, we show the easy way to read data using the read_gbq function:
import pandas_gbq

pandas_gbq.read_gbq("select * from `parity-data-infra-evaluation.roger.example`;", progress_bar_type=None)
##    writing  tutorials  is  fun
## 0        1          2   3    4
TODO: Nothing, we are pretty much done here.