Machine Learning at Scale with Amazon Redshift

Machine Learning using only SQL

4 min readDec 15, 2022

Machine learning (ML) is a rapidly growing field that is transforming many industries, from finance to healthcare to retail. Amazon Redshift is a popular cloud-based data warehousing platform that offers a range of powerful tools and features for working with large data sets.

In this article, I will explore the use of Amazon Redshift for machine learning tasks. I will discuss how Redshift can be used to train and deploy ML models, and how its unique features make it well-suited for large-scale ML projects. I will also provide an overview of the key tools and technologies involved in using Redshift for ML, and offer some tips and best practices for getting started with this exciting technology.

How to use Redshift for Machine Learning

One way to use Redshift for ML is to use it as a data source for training models. Redshift allows users to quickly load and query large data sets, which can then be used as input for training ML algorithms.

For example, a user could integrate data stored in Redshift with Amazon SageMaker. This allows users to take advantage of Redshift’s fast query performance and scalable storage capabilities to train ML models quickly and efficiently. Once a SageMaker model has been trained, it can be deployed on Redshift for use in production.

Redshift offers a number of tools and features that make it well-suited for deploying ML models at scale. For example, users can use Amazon Redshift Spectrum to query data stored in external data lakes, allowing them to integrate ML models into their existing data pipelines seamlessly. Additionally, Redshift offers a number of machine learning algorithms and functions as part of its built-in SQL language, allowing users to easily incorporate ML predictions into their queries. This makes it easy to deploy trained ML models on Redshift and use them to generate predictions without the data ever having to leave the Redshift cluster.

Built-in Machine Learning Functions

Amazon Redshift offers a number of built-in machine learning (ML) functions that can be used to train and evaluate ML models directly within the Redshift SQL language. These functions provide a convenient and powerful way to incorporate ML into Redshift data pipelines and workflows.

Some of the key ML functions available in Redshift include k-means, linear regression and support vector machines. Additionally, users can use the Redshift Spectrum feature to query external data sources, such as Amazon S3, and use those data sets to train and evaluate ML models. This allows users to leverage the power of Redshift for large-scale ML projects without having to load all of their data into the Redshift cluster.

Building Machine Learning Models in Redshift using SQL

In this section, we will look at some examples of the inbuilt ML capabilities of Redshift by using SQL queries to create and use machine learning models.

This approach supports many common ML use cases and means the data never has to leave Redshift, we can therefore make use of the distributed compute power already available within the Redshift cluster.

1. Train an ML model using K-means clustering

CREATE MODEL my_model
    (col1 INT, col2 VARCHAR, col3 FLOAT)
FROM my_data_table
FUNCTION predict_kmeans 
IAM_ROLE default 
AUTO OFF 
MODEL_TYPE KMEANS 
PREPROCESSORS 'none' 
HYPERPARAMETERS DEFAULT
EXCEPT
(K '5') 
SETTINGS (S3_BUCKET '<YOUR-BUCKET-NAME>')

This query creates a new ML model named my_model using the K-means clustering algorithm. The model is trained on the data in the my_data_table, using the columns col1, col2, and col3 as input features and creates a function called predict_kmeans that we will use in the next section.

2. Score data using the trained model

SELECT predict_kmeans(col1, col2, col3) AS clusters
FROM my_test_table

This query scores the data in the my_test_table using the trained my_model model. The predict_kmeans function is used to generate predictions for each row of the data, and the resulting cluster assignments are added as a new column named clusters.

3. Evaluate the performance of the model

SELECT
    clusters,
    COUNT(*) AS events
FROM (
      SELECT predict_kmeans(col1, col2, col3) AS clusters
      FROM my_test_table
     )
GROUP BY 1;

This query looks at the number of points in each of the clusters. You can also use the EXPLAIN MODEL function to describe the underlying features of the model.

These are just a few examples of the kinds of SQL queries that can be used to work with machine learning models in Amazon Redshift. There are many more functions and features available, allowing users to perform a wide range of tasks related to ML on Redshift.

Conclusion

Overall, Amazon Redshift offers a number of powerful tools and features that make it a great platform for training and deploying ML models. Its ability to handle large data sets, fast query performance and support for machine learning algorithms make it an attractive option for those looking to incorporate ML into their data warehousing pipelines.