Live Ethereum and Bitcoin Data in Google BigQuery and Pub/Sub

Evgeny Medvedev
Google Cloud - Community
3 min readOct 7, 2019

--

The crypto_ethereum and crypto_bitcoin datasets in BigQuery are now updated using the streaming technology. You can also subscribe to public Pub/Sub topics that feed those tables.

The overall architecture is depicted below:

Blockchain ETL architecture

The following blockchains are covered:

  • Ethereum
  • Bitcoin
  • ZCash
  • Litecoin
  • Doge
  • Dash

We added delays for each blockchain that prevent streaming orphaned blocks resulting from chain reorganisations. You can look up how many blocks we lag behind the tip of the chain in the LAG_BLOCKS parameter in the configuration files in the Github repository https://github.com/blockchain-etl-streaming. Those values were calculated based on the longest orphaned chains within the last year, multiplied by a safety factor. For Ethereum we ended up with an offset equivalent to approximately 4-minute lag. For Bitcoin this value is equivalent to a 30-minute lag.

We are currently working on a solution that will handle chain reorganisations using additional message types and will allow subscribing to the most up-to-date data.

Subscribing to Live Data Feeds

Install Google Cloud SDK:

> curl https://sdk.cloud.google.com | bash
> exec -l $SHELL
> gcloud init

Create a Pub/Sub subscription for Ethereum internal transactions:

> gcloud pubsub subscriptions create crypto_ethereum.traces.test --topic=crypto_ethereum.traces --topic-project=crypto-public-data

Read a single message from the subscription to test it works:

> gcloud pubsub subscriptions pull crypto_ethereum.traces.test

Now you can run a subscriber and process the messages in the subscription, using this Python script:

subscribe.py

import timefrom google.cloud import pubsub_v1# TODO project_id = "Set Your Google Cloud Project ID Here"
subscription_name = "crypto_ethereum.traces.test"
subscriber = pubsub_v1.SubscriberClient()
# The `subscription_path` method creates a fully qualified identifier
# in the form `projects/{project_id}/subscriptions/{subscription_name}`
subscription_path = subscriber.subscription_path(
project_id, subscription_name)
def callback(message):
print('Received message: {}'.format(message))
message.ack()
subscriber.subscribe(subscription_path, callback=callback)# The subscriber is non-blocking. We must keep the main thread from
# exiting to allow it to process messages asynchronously in the background.
print('Listening for messages on {}'.format(subscription_path))
while True:
time.sleep(60)

Install the dependencies and run the script:

> pip install google-cloud-pubsub==1.0.1
> python subscribe.py
Listening for messages...
Received message: Message {
data: b'{"type": "trace", "transaction_index": 158, "from_...'
attributes: {
"item_id": "trace_call_0xce2ce80594f7601726d03114366161a0050d4a0beedd8628655f1a19319f203d_"
}
}
Received message: Message {
data: b'{"type": "trace", "transaction_index": 159, "from_...'
attributes: {
"item_id": "trace_call_0x7d121090c65c93ac6ba99764bae40ef384c388a139578194c503b92228ccfb3d_"
}
}
...

You can also use Go, Java, Node.js or C#: https://cloud.google.com/pubsub/docs/pull.

The first 10GB of data in Pub/Sub is free, after that you’ll pay ~$40/TB. There is on average 40GB of Ethereum traces per month, which would amount to ~$1/month.

You can also subscribe to topics with Bitcoin blocks and transactions and Ethereum blocks, transactions, logs, contracts, and tokens. The topic names follow the naming convention of the BigQuery tables so you can easily locate them: projects/crypto-public-data/topics/crypto_{chain}.{table_name}, where:

  • chain can be one of ethereum, bitcoin, zcash, litecoin, dogecoin, or dash.
  • table_name can be one of blocks or transactions. Additionally for Ethereum: logs, token_transfers, traces, contracts, andtokens.

Next up is an article about a Twitter bot that posts anomalous transactions using Dataflow and Cloud Functions.

Also read:

Follow us on Twitter: https://twitter.com/BlockchainETL

--

--