Introduction to Linear Regression with BigQuery ML
Recently, Google announced the release of BigQuery ML, a service for building machine learning models using data stored in BigQuery.
BigQuery is a serverless data warehouse designed for speed and ease of use. It can easily hold petabytes of data.
The great thing about this product is that you don’t have to export data from BigQuery to model it. You can create your model where the data is.
This post will walk you through the process of fitting, evaluating and interpreting machine learning models using BigQuery ML, using publicly available datasets.
Let’s get started.
Predicting Birth Weight
Step 0: Setting up
Step 1: Understand the data
This example will use natality dataset. A table describing all registered births in the United States from 1969 to 2008. It comprises 21.9 GB of data, more than will fit in RAM in a Macbook pro.
Our goal will be to predict the birth weight of a baby. Low birth weight has been linked to a higher risk of diabetes and obesity.
We will use the following columns to make our model:
- mother_age: Reported age of the mother when giving birth,
- mother_married: Whether the mother was married when she gave birth,
- mother_race: The race of the mother,
- gestation_weeks: The number of weeks of the pregnancy,
- weight_gain_pounds: How much weight the mother gained during pregnancy,
- is_male: whether the child is male.
Step 2: Create the model
We can use the CREATE MODEL function to create our model. The only options we need to specify are the model type (linear_regression) and the target variable (weight_pounds).
The following code will create and train our model. It should take about 15 minutes to train.
# Provide name of model CREATE OR REPLACE MODEL `bigquery_ml_example.simple_natality_model`# Specify options OPTIONS (model_type='linear_reg', input_label_cols=['weight_pounds']) AS# Provide training data SELECT mother_age, mother_married, CAST(mother_race as STRING) as mother_race, # race is a category, not a number. gestation_weeks, weight_gain_pounds, is_male, weight_pounds FROM `bigquery-public-data.samples.natality` WHERE weight_pounds IS NOT NULL # Filter for rows containing data we want to predict.
Step 3: Evaluate the model
SELECT * FROM ML.EVALUATE( MODEL `bigquery_ml_example.simple_natality_model`, # Model name # Table to evaluate against ( SELECT mother_age, mother_married, CAST(mother_race as STRING) as mother_race, gestation_weeks, weight_gain_pounds, is_male, weight_pounds FROM `bigquery-public-data.samples.natality` WHERE weight_pounds IS NOT NULL ) )
From the table above average, the model will be accurate to within a pound.
Here’s how to interpret the columns.
mean_absolute_errorThe average distance from the predicted value to the actual value. Lower is better.
mean_squared_errorUsed for evaluating statistical significance. Lower is better.
mean_squared_log_errorUsed as a numerically stable cost function by Gradient Descent for training the model. Lower is better.
median_absolute_errorA measure more robust to outliers. Lower is better.
r2_scoreCoefficient of determination. Higher is better.
explained_varianceThe fraction of variance explained. Higher is better.
Step 4: Interpreting the model
BigQuery standardizes numerical inputs to a regression model before it fits the model. This means we have to be very careful in interpreting the weights of the model. We can use the ML.WEIGHTS function together with the ML.FEATURE_INFO function to piece together an interpretation.
Without taking into account preprocessing, interpretation of weights is meaningless.
The following code gives us the information we need.
SELECT * FROM # Info for destandardizing ML.FEATURE_INFO(MODEL `bigquery_ml_example.simple_natality_model`) LEFT JOIN # Model coefficients ML.WEIGHTS(MODEL `bigquery_ml_example.simple_natality_model`) ON input = processed_input
We can interpret the first row as saying
“All else being equal, for every 5.89 (stddev) years the mother is above 26.27 (mean) years old, we would expect to see the birth weight increase by 0.059 (weight) pounds.”
Interpreting category weights involves comparing different possible values within the category.
We would interpret the second row as saying
“All else being equal, if the mother is married we would expect to see an additional 0.224 (1.902–1.678) pounds compared to if a mother is unmarried.”
Interpreting the model weights of a regression can be challenging, but can provide useful insight.
Step 5: Prediction
Making a prediction with the model is as easy as calling ML.PREDICT
Ensure the table you’re calling ML.PREDICT with has a compatible schema.
SELECT * FROM ML.PREDICT(MODEL `bigquery_ml_example.simple_natality_model`, ( SELECT mother_age, mother_married, CAST(mother_race as STRING) as mother_race, gestation_weeks, weight_gain_pounds, is_male, weight_pounds FROM `bigquery-public-data.samples.natality` WHERE weight_pounds IS NOT NULL) ) LIMIT 100
Postscript: Using Data Science Responsibly
As Data Scientists, a way we can add a lot of value is through using our models to recommend some kind of policy. Often our interpretations of our model coefficients will lead to some sort of causal explanation, such as
“Wait longer to have a baby, since the age of the mother is linked to higher birth weight.”
This is bad reasoning and irresponsible recommendation. In fact the opposite recommendation should probably be given.
Here’s some considerations to keep in mind when extracting recommendations from your model.
- Understand what policy decisions might be made as the result of your findings, and how it might be interpreted. Call out misguided interpretations explicitly.
- Statistics does not give a license to be unethical. No statistical measure can justify a policy that discriminates against people based on age, class, race, religion etc.