Streaming data to BigQuery with Pub/Sub

To create a streaming data pipeline on Google Cloud Platform, we are going to use Pub/Sub. We will simulate a streaming data source that publishes to a Pub/Sub topic, process the messages with Dataflow, and load the results into BigQuery for analysis. Let's start in the GCP Cloud Shell.

Prerequisites:

  1. Clone the script files from GitHub
git clone https://github.com/GaryYiu/gcp-resource.git
cd ~/gcp-resource/pub_sub_demo
  2. Download the sample data for streaming
gsutil cp gs://cloud-training-demos/sandiego/sensor_obs2008.csv.gz .

1. Create a Pub/Sub topic

First, we need to create a Pub/Sub topic. We will use Cloud Shell to create it.

gcloud pubsub topics create sandiego

2. Create a subscription to the topic

Then we also need to create a subscription so that we can receive data from the topic.

gcloud pubsub subscriptions create --topic sandiego mySub1

We can check in the Pub/Sub console to verify that the topic and the subscription both exist.
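Alternatively, as a quick sketch, we can verify both directly from Cloud Shell with the gcloud CLI:

# List the topics and subscriptions in the current project
gcloud pubsub topics list
gcloud pubsub subscriptions list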

3. Create the BigQuery dataset to store the streaming data

The command below creates a dataset named demos; the Dataflow job will write its output table into this dataset.

bq mk --dataset $DEVSHELL_PROJECT_ID:demos
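
As a quick check (assuming the same $DEVSHELL_PROJECT_ID variable is still set), we can confirm the dataset exists:

# Show details of the newly created dataset
bq show $DEVSHELL_PROJECT_ID:demos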

4. Create bucket for Dataflow staging

Dataflow needs a staging location in Cloud Storage to hold temporary and staging files before the data is loaded into BigQuery, so let's create a GCS bucket for that.

gsutil mb gs://$DEVSHELL_PROJECT_ID
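
Bucket names must be globally unique, which is why the project ID is reused here. If you prefer the bucket in a specific region, a sketch (assuming us-central1; adjust it to the region your Dataflow job will run in) would be:

# Create the staging bucket in a specific region
gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID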

5. Run Dataflow job to subscribe to the topic

We use the script provided in the repository to launch the Dataflow job. Before running it, make the script executable so it does not fail with a permission error. After the build completes, a success message appears in Cloud Shell.

sudo chmod +x run_oncloud.sh
./run_oncloud.sh $DEVSHELL_PROJECT_ID $DEVSHELL_PROJECT_ID AverageSpeeds
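
To confirm the job was launched, we can either open the Dataflow console or, as a sketch from Cloud Shell, list the active jobs:

# List currently running Dataflow jobs in the project
gcloud dataflow jobs list --status=active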

6. Start the streaming data simulation

Now we can start streaming by running the simulation script, which publishes sensor records to our topic roughly every 5 seconds. The Dataflow job then pulls the data from the topic via its subscription and writes the results into BigQuery.

python send_sensor_data.py --speedFactor=60 --project=gary-yiu-001
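
While the simulator is running, one way to sanity-check that messages are arriving is to pull a few of them from the mySub1 subscription we created in step 2 (a sketch; this only inspects mySub1, not the subscription the Dataflow job reads from):

# Pull and acknowledge a few messages to confirm data is flowing
gcloud pubsub subscriptions pull mySub1 --limit=3 --auto-ack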

7. View data in BigQuery

SELECT
  *
FROM
  `gary-yiu-001.demos.average_speeds`
LIMIT
  1000
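
The same query can also be run from Cloud Shell with the bq tool instead of the BigQuery console (a sketch, assuming the project used above):

# Query the table written by the Dataflow job
bq query --use_legacy_sql=false 'SELECT * FROM `gary-yiu-001.demos.average_speeds` LIMIT 1000'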

We have successfully created our streaming data pipeline from Pub/Sub to Dataflow to BigQuery. Now let's wrap up by terminating the Dataflow job and deleting our topic and subscription.
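
A sketch of the cleanup from Cloud Shell (the job ID and region below are placeholders; take the real values from the job list or the Dataflow console):

# Stop the sensor simulation script with Ctrl+C, then cancel the Dataflow job
gcloud dataflow jobs cancel <JOB_ID> --region=<REGION>

# Delete the subscription and the topic
gcloud pubsub subscriptions delete mySub1
gcloud pubsub topics delete sandiego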
