Transforming data in BigQuery

Running BigQuery DAG in Airflow locally with Docker

Build a data pipeline with Airflow and GCP

Dark Light

We will go through three different transformations on a dataset in BigQuery.

  1. Filter for unique data
  2. Using REGEX_CONTAINS to look for data with specific pattern
  3. Combining longtitude and latitude to a GEOPOINT field

From a previous post where we import a text file to a dataframe in python then import it to BigQuery as a dataset. We will use the same dataset to perform some transformation, turning the raw data into a report table.

This piece of data are scrapped from Rightmove containing house/apartment listing information from the website including details of each listing. The data schema looks like this:

And the data itself looks like this:

1. Check for uniqueness

Before we perform any transformation on the dataset, we first need to create a filter to generate a unique set of data, i.e. removing any duplicates rows. To achieve this we can use a common table expression in our query, using ROW_NUMBER function to get the unique-row row number partition it by a unique column in the dataset (in our case it’s the url column) , then we use QUALIFY ROW_NUMBER as our filter for unique rows.

WITH raw AS 
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY url ORDER BY full_description DESC) AS rn
  FROM
    `gary-yiu-001.testing.properties_raw` 
  QUALIFY ROW_NUMBER() OVER (PARTITION BY url ORDER BY full_description DESC) = 1
)

We name these filtered data as raw.

2. Using REGEX_CONTAINS to look for data with specific pattern

Now we can perform some transformation on our data. As we have a description column in our data, it’s likely we can get some useful data out of it. While the description column is in paragraph, we can use REGEX_CONTAINS to capture data which we know to have particular format. For example, we know there is flat size information included in the description, and it is likely to be in the format of some digits follow by some keywords like “square feet” or “Sqft”, etc. So we can setup a regular expression to capture the size information from the description.

CAST(REGEXP_EXTRACT(lower(REPLACE(full_description, ',', '')), '([0-9]{1,})(?: sq ft| sq feet| square feet| square foot| sqft|sq ft|sq feet|square feet|square foot|sqft)') AS INT64) AS size_ft

If you are not familar with regex, using regex101 is a great way to start writing regex and learn different rules in regex

3. Combining longtitude and latitude to a GEOPOINT field

With logititude and latitude information in the dataset, we can combine it into a GEOPOINT geogrpahy column so we can later visualize the data in a map.

ST_GEOGPOINT(CAST(longitude AS FLOAT64),CAST(latitude AS FLOAT64)) AS geo_pt

Combining with other transformation, we now have the following query which we use the raw data to perform transformation then put it as inter in another CTE. From inter, we then did some calculations to create several calculated columns. Finally we have our complete query.

WITH raw AS 
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY url ORDER BY full_description DESC) AS rn
  FROM
    `gary-yiu-001.testing.properties_raw` 
  QUALIFY ROW_NUMBER() OVER (PARTITION BY url ORDER BY full_description DESC) = 1
),
inter AS
(
  SELECT
    PARSE_DATE("%Y%m%d" ,added) AS added_date,
    REGEXP_EXTRACT(url, r'(\d+)') property_id,
    address,
    postcode,
    regexp_extract(postcode, '([A-Z0-9]{2,4}) (?:[A-Z0-9]{3})') district,
    propertyType,
    propertySubType,
    agent_name,
    agent_address,
    retirement,
    preOwned,
    number_bedrooms,
    maxSizeFt,
    price,
    ST_GEOGPOINT(CAST(longitude AS FLOAT64),
      CAST(latitude AS FLOAT64)) geo_pt,
    TRIM(full_description) description,
    REGEXP_CONTAINS(lower(full_description), 'video tour|virtual tour|virtual viewing|video viewing|video') has_video_tour,
    REGEXP_CONTAINS(lower(full_description), 'energy efficiency rating|energy performance rating|epc') has_epc_rating,
    REGEXP_EXTRACT(lower(REPLACE(REPLACE(full_description, ':', ''), '-', '')), '(?:efficiency rating |energy efficiency rating |energy performance rating |epc |epc rating )([a-gA-g]{1})') epc_rating,
    REGEXP_CONTAINS(lower(REPLACE(full_description, ',', '')), 'sq ft|sq feet|square feet|square foot|sqft') has_size,
    CAST(REGEXP_EXTRACT(lower(REPLACE(full_description, ',', '')), '([0-9]{1,})(?: sq ft| sq feet| square feet| square foot| sqft|sq ft|sq feet|square feet|square foot|sqft)') AS INT64) size_ft
  FROM
    raw
  ORDER BY
    1
)

SELECT 
  * EXCEPT (size_ft),
  IF(size_ft >= 100, size_ft, NULL) AS size_ft,
  price/CAST(IF(size_ft >= 100, size_ft, NULL) AS INT64) AS ft_price
FROM inter

Saving this query as a view, we can then setup a schedule query to run it regularly and store it in a reporting table for analysis and visualization.

Related Posts