Serverless Prediction at Scale: Custom Model Deployment on Google Cloud AI Platform
Deploy a real-world custom healthcare model to Google Cloud AI Platform, expose it as a secure REST API, and verify its scalability with load testing
Model deployment is a critical component of the machine learning model development and operations (MLOps) life cycle. In today’s rapidly growing cloud-native IT environment, serverless deployment in public clouds has become increasingly popular for putting a custom model into production. Serverless deployment offers many of the benefits of cloud computing, such as simplicity, cost efficiency, high availability, and scalability.
In this post, I will share a real-world custom model deployment experience with Google Cloud AI Platform (CAIP)[1]. This deployment leverages CAIP’s custom prediction routine[2], which can automatically package and deploy a model’s artifacts to a Kubernetes cluster. After deployment, the model is exposed as a predictive service via a REST API by means of a cloud function and an Apigee proxy. The exposed model API is load tested and shown to be highly scalable. For comparison, I also plan to share another experience applying CAIP’s custom container approach[3] to the same model in the near future.
Model Overview
The custom model I used for this experimentation is a prototype ML model built to predict adverse drug reaction (ADR) risks in polypharmacy. It is a real-world ML model with a level of complexity commonly expected from the modeling of healthcare problems. For the purpose of this post, it is not important to understand the details of how the model was built, how it functions, or the accuracy of its predictions. It simply serves as a realistic example to illustrate how a custom model can be deployed on CAIP in a serverless manner and provide online predictions at scale.
For readers who are interested in the model itself, this ADR risk model was developed based on an approach published in a recent Scientific Reports article by Valeanu et al.[4]. The artifacts of this ADR risk model consist of the following three custom model files and one feature transformation file:
- frequency_model.pkl: A statistical ADR frequency model serialized in Python pickle format
- hospitalization_model.h5: A neural network model built with TensorFlow and saved in h5 format. This model is used for the prediction of ADR hospitalization risk.
- mortality_model.h5: A similar neural network model for the prediction of ADR mortality risk
- transformer.pkl: A feature transformation file for preprocessing and encoding input data for the hospitalization and mortality models
The goal is to deploy these model artifacts to CAIP, tie them together with custom serving code that implements the specific business logic outlined in the paper[4], and then expose the model securely as a predictive service via a public REST API. The input data for the model API includes:
- Patient’s age and gender
- List of patient’s medical conditions
- List of drugs that the patient is taking
Example:
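A sketch of what a single input record might look like (the field names and values below are hypothetical, chosen only to illustrate the shape of the payload):

```json
{
  "age": 74,
  "gender": "F",
  "conditions": ["hypertension", "type 2 diabetes"],
  "drugs": ["metformin", "lisinopril", "warfarin"]
}
```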
The output from the model API consists of ADR risk scores at three hierarchical levels:
- Patient’s hospitalization, mortality and total risk scores
- Ranking of risk scores at the MedDRA System Organ Class (SOC) level
- Ranking of risk scores at each individual ADR level
Example:
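A sketch of what the corresponding output might look like (the field names and scores are hypothetical; the SOC and ADR labels follow MedDRA terminology):

```json
{
  "hospitalization_risk": 0.42,
  "mortality_risk": 0.18,
  "total_risk": 0.60,
  "soc_ranking": [
    {"soc": "Cardiac disorders", "score": 0.35},
    {"soc": "Gastrointestinal disorders", "score": 0.27}
  ],
  "adr_ranking": [
    {"adr": "Arrhythmia", "score": 0.21},
    {"adr": "Nausea", "score": 0.15}
  ]
}
```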
Custom Routine Predictor
Google CAIP offers two alternatives for deploying a custom model on its platform: the custom prediction routine and the custom container. The custom prediction routine is the simpler approach, requiring minimal effort to run custom serving code during prediction. The custom container, on the other hand, offers maximum flexibility by letting you build your own Docker container for the deployment.
The first step in a custom prediction routine deployment is to create a custom predictor class. This class is where the custom serving code lives. Two methods must be implemented in the predictor class:
- from_path(cls, model_dir): a class method to load the model artifacts
- predict(self, instances, **kwargs): an instance method that is invoked for every prediction request. The payload of the request object is passed into the method via the “instances” argument.
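The actual serving code is specific to the ADR model, but a minimal sketch of the predictor class looks like the following. Only the from_path()/predict() interface is prescribed by CAIP; the artifact file names match the list above, while the transformer call and the risk combination are placeholders for the real business logic:

```python
# predictor.py -- illustrative sketch of the custom predictor class
import os
import pickle

import tensorflow as tf


class ADR_Predictor(object):
    """Custom prediction routine for the ADR risk model."""

    def __init__(self, frequency_model, hosp_model, mort_model, transformer):
        self._frequency_model = frequency_model
        self._hosp_model = hosp_model
        self._mort_model = mort_model
        self._transformer = transformer

    def predict(self, instances, **kwargs):
        """Invoked for every prediction request; `instances` holds the
        deserialized list from the request body's "instances" element."""
        results = []
        for instance in instances:
            # Preprocess and encode the raw input (age, gender,
            # conditions, drugs); this sketch assumes the pickled
            # transformer accepts raw instance dicts.
            features = self._transformer.transform([instance])
            hosp_risk = float(self._hosp_model.predict(features)[0][0])
            mort_risk = float(self._mort_model.predict(features)[0][0])
            # Placeholder combination; the real serving code also uses
            # the frequency model to rank per-SOC and per-ADR scores
            # according to the business logic in [4].
            results.append({
                'hospitalization_risk': hosp_risk,
                'mortality_risk': mort_risk,
                'total_risk': hosp_risk + mort_risk,
            })
        return results

    @classmethod
    def from_path(cls, model_dir):
        """Loads the model artifacts that CAIP copies from GCS into
        model_dir inside the serving container."""
        with open(os.path.join(model_dir, 'frequency_model.pkl'), 'rb') as f:
            frequency_model = pickle.load(f)
        hosp_model = tf.keras.models.load_model(
            os.path.join(model_dir, 'hospitalization_model.h5'))
        mort_model = tf.keras.models.load_model(
            os.path.join(model_dir, 'mortality_model.h5'))
        with open(os.path.join(model_dir, 'transformer.pkl'), 'rb') as f:
            transformer = pickle.load(f)
        return cls(frequency_model, hosp_model, mort_model, transformer)
```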
Don’t worry if you don’t fully understand the implementation details of the above predict() method of the ADR_Predictor class. This code is very specific to serving the ADR model, and it uses several helper functions imported from a separate custom code module. What is important to understand is that the model artifacts are loaded from a local directory inside a running container on CAIP. These artifacts are copied from a Google Cloud Storage (GCS) location specified during the creation of the model version resource, as described in the section below.
I also recommend testing your predictor class locally before trying to deploy it to CAIP. It is much easier to debug and test your custom serving code in a local environment than on CAIP, and it is very straightforward to create a local tester for this predictor class, as sketched below.
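A minimal local tester, assuming the four artifact files sit in a local model_artifacts directory and the predictor lives in predictor.py (both names are illustrative):

```python
# local_test.py -- quick smoke test of the predictor outside CAIP
from predictor import ADR_Predictor

# Hypothetical input record matching the example payload shown earlier
test_instance = {
    'age': 74,
    'gender': 'F',
    'conditions': ['hypertension', 'type 2 diabetes'],
    'drugs': ['metformin', 'lisinopril', 'warfarin'],
}

# Load the artifacts from a local directory instead of the CAIP container
predictor = ADR_Predictor.from_path('./model_artifacts')
print(predictor.predict([test_instance]))
```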
Source Distribution Package
After the custom predictor class has been created and tested locally, the next step is to create a source distribution package for deployment. Google CAIP relies on Python’s standard setuptools for building this distribution package. In the setup script, I just need to specify the package name, version, and all of the custom scripts needed at serving time (including the predictor and helper script files), as sketched below:
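```python
# setup.py -- builds the source distribution for the serving code;
# package and script names are illustrative
from setuptools import setup

setup(
    name='adr_predictor',
    version='0.1',
    scripts=['predictor.py', 'adr_helpers.py'])
```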
Run the following command to execute the setup script:
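```bash
python setup.py sdist --formats=gztar
```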
It creates a source distribution sub-directory (dist) with a gzipped tarball package inside it. After this execution, my local directory structure looks like this:
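An illustrative layout (the dist and egg-info contents follow the package name and version set in setup.py):

```
.
├── setup.py
├── predictor.py
├── adr_helpers.py
├── adr_predictor.egg-info/
├── model_artifacts/
│   ├── frequency_model.pkl
│   ├── hospitalization_model.h5
│   ├── mortality_model.h5
│   └── transformer.pkl
└── dist/
    └── adr_predictor-0.1.tar.gz
```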
Model Deployment on CAIP
Before you start deploying your model to CAIP, ensure that you have installed Google’s Cloud SDK (gcloud) on your local machine and configured it with your user account, GCP project, and authentication.
The deployment on CAIP starts with the creation of a Google Cloud Storage (GCS) bucket and the upload of the model artifacts and source distribution package folders. For my ADR model, I run the following commands (the local folder names are those from the directory structure above):
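```bash
# Create the bucket, then upload the artifacts and the package
gsutil mb gs://<BUCKET_NAME>
gsutil cp -r model_artifacts gs://<BUCKET_NAME>/model_artifacts
gsutil cp -r dist gs://<BUCKET_NAME>/dist
```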
<BUCKET_NAME> is the GCS bucket name in my GCP project for hosting the model artifacts and distribution package. After I have verified that these folders have been successfully uploaded to the GCS bucket, I run the following commands to create the model and version resources on CAIP (the model name, region, and runtime/Python versions below are choices specific to this deployment):
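```bash
# Create the model resource
gcloud ai-platform models create adr_risk_model --regions us-central1

# Create the version resource with the custom prediction routine;
# pick runtime and Python versions that match the model's dependencies
gcloud beta ai-platform versions create v1 \
  --model adr_risk_model \
  --runtime-version 2.3 \
  --python-version 3.7 \
  --origin gs://<BUCKET_NAME>/model_artifacts/ \
  --package-uris gs://<BUCKET_NAME>/dist/adr_predictor-0.1.tar.gz \
  --prediction-class predictor.ADR_Predictor
```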
On the GCP console, I navigate to CAIP’s model panel and verify that the model and version resources have been successfully created:
From the model version’s “Test & Use” tab on the GCP console, I run a quick test with the sample input data and verify that the model produces the expected output. Note that the input data needs to be wrapped in an “instances” element, which CAIP requires as the root element of the request body, as in the example below.
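For example, using the same hypothetical fields as the earlier input example:

```json
{
  "instances": [
    {
      "age": 74,
      "gender": "F",
      "conditions": ["hypertension", "type 2 diabetes"],
      "drugs": ["metformin", "lisinopril", "warfarin"]
    }
  ]
}
```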
I also run a test remotely from my local machine by creating a test file with the same input data and using the following gcloud command. It produces the same output as the test from the GCP console:
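```bash
# test_input.json (an illustrative file name) holds one JSON instance
# per line; gcloud adds the "instances" wrapper itself
gcloud ai-platform predict \
  --model adr_risk_model \
  --version v1 \
  --json-instances test_input.json
```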
Cloud Function for Model Service
After deployment on CAIP, the ADR risk model can be exposed as a predictive service through a cloud function with an HTTP endpoint[5]. I create a cloud function named “adr_http” in a “main.py” file under a “cloud_function” sub-directory; it invokes the model’s predictive service. I then deploy the cloud function with an HTTP trigger:
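A minimal sketch of the function, following the standard googleapiclient pattern for AI Platform online prediction (the project ID is a placeholder, the model name matches the version resource created above, and google-api-python-client must be listed in cloud_function/requirements.txt):

```python
# cloud_function/main.py -- illustrative sketch
import googleapiclient.discovery
from flask import jsonify

PROJECT_ID = '<PROJECT_ID>'   # placeholder
MODEL_NAME = 'adr_risk_model'


def adr_http(request):
    """HTTP entry point: forwards the request body to the CAIP model
    and returns its predictions."""
    instances = request.get_json()['instances']
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)
    response = service.projects().predict(
        name=name, body={'instances': instances}).execute()
    return jsonify(response)
```

The deployment command (the runtime version is a choice; --no-allow-unauthenticated keeps the endpoint private):

```bash
gcloud functions deploy adr_http \
  --source cloud_function \
  --runtime python38 \
  --region <REGION> \
  --trigger-http \
  --no-allow-unauthenticated
```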
To secure the invocation of the “adr_http” cloud function, I create a service account in the GCP project, generate a key file for it, and then grant the service account the “Cloud Functions Invoker” role on the “adr_http” cloud function. This service account will be used by an Apigee proxy (see the next section) to invoke the “adr_http” cloud function:
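In sketch form (the service account and key file names are illustrative):

```bash
# Create the service account and a key file for it
gcloud iam service-accounts create adr-invoker

gcloud iam service-accounts keys create adr-invoker-key.json \
  --iam-account adr-invoker@<PROJECT_ID>.iam.gserviceaccount.com

# Grant the service account the invoker role on this function only
gcloud functions add-iam-policy-binding adr_http \
  --region <REGION> \
  --member serviceAccount:adr-invoker@<PROJECT_ID>.iam.gserviceaccount.com \
  --role roles/cloudfunctions.invoker
```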
<REGION> and <PROJECT_ID> in the above commands need to be replaced by the actual GCP region and project ID. After this, I can test the cloud function’s HTTP endpoint with the following gcloud and curl commands:
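For example (test_request.json is a hypothetical file containing the “instances”-wrapped input shown earlier):

```bash
# Activate the service account locally so gcloud can mint its identity token
gcloud auth activate-service-account --key-file adr-invoker-key.json

curl -X POST "https://<REGION>-<PROJECT_ID>.cloudfunctions.net/adr_http" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d @test_request.json
```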
Note that, in the above curl command, the HTTP authorization is provided by the identity token of the service account activated via gcloud auth. The model input data needs to be wrapped in an “instances” array in the JSON body of the HTTP request.
Apigee Proxy for Model API
While a GCP cloud function can be exposed as a public HTTP endpoint by itself, the best practice is to keep the cloud function endpoint private and wrap it with an Apigee proxy that exposes it as a public REST API[6]. Apigee offers many built-in features and policies for public API development, such as authentication, logging, tracing, and spike arrest.
For this ADR risk model, I create an Apigee proxy to expose it as a public REST API. The integration between the Apigee proxy and the cloud function is done through a Google Cloud Functions extension connector. First, I create an Apigee extension using the service account credentials from the key file generated in the previous step. Then, I create an Apigee proxy named “adr_model” and add the following three policies in the default PreFlow steps of this proxy:
- OA-VerifyToken: an OAuth2.0 token validation policy for the API call
- ADR-Ext-CF: a ConnectorCallout policy using the Google cloud function extension
- AM-AssignResponse: an AssignMessage policy to retrieve the message from the cloud function response
With this Apigee proxy in place, I can now access the ADR risk model API securely with an access token generated from my Apigee client credentials.
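As an illustration of the client’s view (the hostname and token endpoint path below are hypothetical and depend on how the OAuth proxy is configured in your Apigee organization):

```bash
# Fetch an access token using the client_credentials grant
TOKEN=$(curl -s -X POST "https://<APIGEE_HOST>/oauth/token" \
  -u "<CLIENT_ID>:<CLIENT_SECRET>" \
  -d "grant_type=client_credentials" | jq -r '.access_token')

# Call the model API through the Apigee proxy
curl -X POST "https://<APIGEE_HOST>/adr_model" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @test_request.json
```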
Load Testing of Model API
To verify the API performance and scalability of the deployed ADR risk model, I configure two load test cases in SoapUI[7]. One simulates a single user (thread) making the API calls, and the other simulates 100 concurrent users. Both test cases are configured to run for 10 minutes with a 500-millisecond delay between requests, and each load test is run three times in sequence. The results are captured in the screenshots below. For a single user, the API’s average response time is about 2.5 seconds at a very low transaction volume of 0.34 requests/second. For 100 concurrent users, the average response time is about 4.2 seconds at a transaction volume of about 22 requests/second. As the results show, when the transaction volume increases by a factor of 65, the API response time degrades by only 1.7 seconds, a clear indication of the high scalability of the deployed model API.
Load Test 1: Single User
Load Test 2: 100 Concurrent Users
The diagrams below capture the cloud function execution time and the number of active cloud function instances spawned by the 100-concurrent-user test case. To handle all 100 users’ concurrent requests, GCP activates about 70 cloud function instances. The majority of user requests (95%) are executed in less than 4.2 seconds, but it should also be noted that some requests take much longer to respond at the beginning of each test run. This initial delay is mainly due to:
- Warm-up time needed for the newly spawned cloud function instances to handle the request volume
- Auto-scaling of the model containers in CAIP’s Kubernetes cluster
Conclusions
In this post, I shared an experience of deploying a real-world healthcare ML model on Google Cloud AI Platform (CAIP). The model was exposed as a secure REST API and verified to be highly scalable for serving online predictions. Overall, I feel that Google CAIP’s custom prediction routine provides a very simple and convenient method for the serverless deployment of a custom ML model on GCP.
One lesson I learned from this experimentation is that the VPC service controls of a GCP project can have a significant impact on the availability of CAIP’s custom prediction routine. I had to switch to a different GCP project with fewer VPC service controls in order to complete the deployment of this ADR risk model. If you run into a similar issue with regional endpoint availability for CAIP’s custom prediction routine, check the VPC service controls on your GCP project.
Acknowledgements
I appreciate the support from Google customer engineers, Brendan Doohan and Nathan Hodson, in this experimentation.
References
[1] Google, https://cloud.google.com/ai-platform
[2] Google, https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines
[3] Google, https://cloud.google.com/architecture/ai-platform-prediction-custom-container-concepts
[4] Valeanu, Andrei, et al., “The development of a scoring and ranking strategy for a patient-tailored adverse drug reaction prediction in polypharmacy”, Scientific Reports, Volume 10, Article number 9552 (2020)
[5] Google, https://cloud.google.com/functions
[6] Apigee, https://docs.apigee.com/
[7] SoapUI, https://www.soapui.org/