Using SageMaker Studio for Beginners - Simple Classification Application

Introduction

In this comprehensive tutorial, you'll learn how to build, train, deploy, and evaluate a machine learning model using Python, scikit-learn, and Amazon SageMaker. We'll walk through each step in the machine learning pipeline, providing detailed explanations for all the code snippets involved.

For demonstration purposes, we will use the Iris dataset, a simple but widely-used dataset for classification problems. Our goal is to predict the species of Iris flowers based on four features: sepal length, sepal width, petal length, and petal width.

We will begin by loading and exploring the dataset using Python's scikit-learn library. Afterward, we'll split the data into training and testing sets and train a Decision Tree model as a baseline. We'll then move the trained model to Amazon SageMaker, a managed machine learning service, to see how it scales and performs in a cloud environment. Finally, we'll deploy the model and make real-time predictions.

Step 1: Import Dependencies

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sagemaker
import boto3
from sagemaker import get_execution_role

Purpose:

The purpose of this code snippet is to import all the necessary Python libraries that will be used throughout the tutorial. These libraries provide the tools we need for data manipulation, machine learning, and cloud services.

Description:

  • pandas: This is a powerful data manipulation and analysis library. We'll be using it for creating data frames and handling CSV files.
  • numpy: Stands for 'Numerical Python' and is used for numerical operations. It's especially good for handling arrays.
  • train_test_split: A utility function from scikit-learn to split datasets into training and test sets.
  • DecisionTreeClassifier: This is the machine learning model we'll initially use to classify Iris flowers locally. It is a part of scikit-learn.
  • sagemaker: Amazon SageMaker's Python SDK. This will be used to interact with the SageMaker service for training, deploying models, etc.
  • boto3: The AWS SDK for Python. It allows you to create, configure, and manage AWS services.
  • get_execution_role: A utility function from SageMaker's Python SDK to retrieve the execution role for training and deploying models.

Each of these imported modules and functions will serve specific roles as we proceed with the tutorial.

Step 2: Load and Prepare the Dataset

from sklearn.datasets import load_iris

iris_data = load_iris()

X, y = iris_data['data'], iris_data['target']

Purpose:

The purpose of this code snippet is to load the Iris dataset into the Python environment, and then separate its features and target labels into two separate variables (X and y). This is a necessary step before any data preprocessing, model training, or model evaluation activities can take place.

Description:

from sklearn.datasets import load_iris

This line imports the load_iris function from the sklearn.datasets module. The load_iris function provides a convenient way to load the Iris flower dataset, a standard machine learning dataset used for classification tasks.

iris_data = load_iris()

Here, load_iris() is called, and its return value is stored in the iris_data variable. The load_iris() function returns a dictionary-like object with several fields:

  • data: The features for each sample (an array of shape [n_samples, n_features]).
  • target: The labels for each sample (an array of shape [n_samples]).
  • feature_names: The names of the dataset columns.
  • target_names: The names of the target classes (species of Iris flowers in this case).
  • DESCR: A full description of the dataset.

X, y = iris_data['data'], iris_data['target']

This line unpacks the data and target fields from the iris_data object into X and y, respectively.

  • X: A 2D array containing the features for each sample in the dataset. Each row corresponds to a sample, and each column corresponds to a feature (e.g., sepal length, sepal width, petal length, petal width). The shape of X is [n_samples, n_features].
  • y: A 1D array containing the target labels for each sample in the dataset. The labels are integers that correspond to the species of Iris flower that each sample represents. The shape of y is [n_samples].

By separating the data into X and y, you set up the conventional representation for features and target labels used in supervised machine learning. This will make it easier to perform subsequent steps like data splitting and model training.
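
If you want to confirm what was loaded before moving on, a quick optional check like the one below prints the feature names, the class names, and the array shapes (the Iris dataset contains 150 samples with 4 features each).

# Optional sanity check: inspect the loaded dataset
print(iris_data['feature_names'])   # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris_data['target_names'])    # ['setosa' 'versicolor' 'virginica']
print(X.shape, y.shape)             # (150, 4) (150,)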

Step 3: Splitting the Dataset into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Purpose:

The purpose of this line of code is to divide the Iris dataset into two subsets: a training set and a testing set. This is a crucial step in the data preparation process for supervised machine learning, as it allows you to evaluate how well your trained model will generalize to new, unseen data.

Description:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Let's break down this function call and its arguments:

  • train_test_split: This function comes from Scikit-Learn's sklearn.model_selection module. It is commonly used for quickly and easily dividing datasets into training and testing subsets.
  • X, y: These are the feature and target variables that we previously extracted from the iris_data object. X contains the features, and y contains the labels.
  • test_size=0.2: This argument specifies that 20% of the data will be reserved for the test set, and the remaining 80% will be used for training. You can also use an integer to specify the absolute number of test samples.
  • random_state=42: This is the seed used by the random number generator for shuffling the data before the split. Setting random_state to an integer will make the output deterministic, meaning that running the code multiple times will produce the same train/test split. This is often useful for reproducibility in scientific research.

The function returns four variables:

  • X_train: The subset of the features used for training. This is a 2D array with shape [n_training_samples, n_features].
  • X_test: The subset of the features used for testing. This is a 2D array with shape [n_testing_samples, n_features].
  • y_train: The subset of labels corresponding to the X_train features. This is a 1D array with shape [n_training_samples].
  • y_test: The subset of labels corresponding to the X_test features. This is a 1D array with shape [n_testing_samples].

By performing this split, you're setting up a more robust evaluation procedure for your machine learning model. After training your model on X_train and y_train, you can evaluate it on X_test and y_test to get an unbiased estimate of its performance on unseen data.
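
As a quick optional check, you can print the resulting shapes to confirm the 80/20 split of the 150 Iris samples.

# Optional: confirm the 80/20 split
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)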

Step 4: Training and Evaluating a Local Model Using a Decision Tree Classifier

local_model = DecisionTreeClassifier()
local_model.fit(X_train, y_train)

# Test the model locally
local_model.score(X_test, y_test)

Purpose:

The purpose of this section is to create a simple local machine learning model using a Decision Tree Classifier. It's beneficial to train a local model before deploying it to a more complex environment like AWS SageMaker to verify that the dataset is correctly prepared and to get a baseline measure of performance.

Description:

local_model = DecisionTreeClassifier()

  • DecisionTreeClassifier: This is a class from Scikit-Learn's sklearn.tree module that allows you to create a decision tree classifier object.
  • local_model: This is the variable where we're storing the classifier instance. At this point, the classifier is not yet trained—it's just an empty model waiting to be trained on a dataset.

local_model.fit(X_train, y_train)

  • fit(): This method trains the Decision Tree Classifier on the dataset provided. It modifies the local_model object in-place, optimizing its parameters based on the training data.
  • X_train, y_train: These are the subsets of the features and labels that were allocated for training during the train-test split.

After this line, local_model becomes a trained decision tree classifier.

local_model.score(X_test, y_test)

  • score(): This method evaluates the trained model on a testing set and returns the mean accuracy of the model. The mean accuracy is the fraction of correctly classified samples.
  • X_test, y_test: These are the subsets of the features and labels that were allocated for testing during the train-test split.

After running this line, you get a single floating-point number between 0 and 1. This number represents the proportion of test instances that were correctly classified by local_model. If the score is close to 1, the model has high accuracy on the test set; if it's closer to 0, the model performs poorly.

By using this local model, you get a quick and easy way to assess the quality of your data, the appropriateness of your features, and the effectiveness of the Decision Tree Classifier for this specific problem before you move on to more complex and resource-intensive tasks such as training the model in SageMaker.
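
If you want more detail than a single accuracy number, you can optionally compute per-class metrics with scikit-learn before moving to SageMaker. This is a small addition to the tutorial flow, not a required step.

from sklearn.metrics import classification_report

# Optional: per-class precision, recall, and F1 on the test set
y_pred = local_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris_data['target_names']))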

Step 5: Combining Labels and Features Before Saving to CSV

# Save train and test data to CSV
# Combine labels and features before saving
train_data = np.column_stack((y_train, X_train))
test_data = np.column_stack((y_test, X_test))  # Optional: include labels in the test data as well

Purpose:

The purpose of this section is to combine the labels (y_train, y_test) and the features (X_train, X_test) into single data arrays, with the label in the first column. This label-first layout matters because SageMaker's built-in XGBoost algorithm expects CSV training data with the target in the first column and no header row. The combined data will be written to CSV files in the next step and uploaded to Amazon S3 so that it can be accessed by the SageMaker training job.

Description:

train_data = np.column_stack((y_train, X_train))

np.column_stack(): This NumPy function stacks arrays as columns of a 2-D array; 1-D inputs (like y_train) become single columns, while 2-D inputs (like X_train) keep their existing columns. Here it combines the labels (y_train) and the features (X_train) for the training data into a single 2-D array.

  • The first argument y_train consists of the labels or target values for the training data.
  • The second argument X_train consists of the features for the training data.
  • train_data: This variable stores the resulting 2-D array where each row represents a single sample and the first column consists of the label for that sample.

test_data = np.column_stack((y_test, X_test))

np.column_stack(): Similar to above, this function is used to stack 1-D arrays as columns into a 2-D array. This time it's combining the labels (y_test) and the features (X_test) for the testing data into a single 2-D array.

  • The first argument y_test consists of the labels or target values for the testing data.
  • The second argument X_test consists of the features for the testing data.
  • test_data: This variable stores the resulting 2-D array where each row represents a single sample and the first column consists of the label for that sample.

Note: Including labels in the test data is optional and depends on how you're going to use the test set. If you're only using it to make predictions and evaluate the model manually, including the labels can be helpful. Otherwise, you may choose to exclude them.

After running this section, you will have two 2-D NumPy arrays, train_data and test_data, each containing both features and labels, ready to be saved into CSV files and used by SageMaker.
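
As an optional sanity check, the combined training array should have five columns per row, with the label first.

# Optional: verify the combined shape and that the label sits in the first column
print(train_data.shape)   # (120, 5): label + 4 features per row
print(train_data[0])      # first value is the label, followed by the four measurements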

Step 6: Saving the Combined Datasets to CSV Files using Pandas

pd.concat([pd.DataFrame(y_train), pd.DataFrame(X_train)], axis=1).to_csv('train_data.csv', header=False, index=False)
pd.concat([pd.DataFrame(y_test), pd.DataFrame(X_test)], axis=1).to_csv('test_data.csv', header=False, index=False)

Purpose:

The purpose of this code section is to save the training and testing data into CSV files using Pandas. These CSV files will later be uploaded to Amazon S3 to be accessible for the SageMaker training job. The training and testing data are saved in a manner where the first column is the label, followed by the feature columns.

Description:

pd.concat([pd.DataFrame(y_train), pd.DataFrame(X_train)], axis=1).to_csv('train_data.csv', header=False, index=False)

pd.DataFrame(y_train), pd.DataFrame(X_train): The labels and the features for the training data are first converted into Pandas DataFrames.

  • pd.DataFrame(y_train) creates a new DataFrame containing the labels (y_train).
  • pd.DataFrame(X_train) creates a new DataFrame containing the features (X_train).
  • pd.concat([..., ...], axis=1): The pd.concat function is used to concatenate the two DataFrames along axis=1, which means the concatenation happens column-wise. This results in a DataFrame where the first column is the label (y_train), followed by the features (X_train).

.to_csv('train_data.csv', header=False, index=False): Finally, this DataFrame is saved to a CSV file called train_data.csv.

  • header=False specifies that the CSV file should not include header information.
  • index=False specifies that row indices should not be saved into the CSV file.

pd.concat([pd.DataFrame(y_test), pd.DataFrame(X_test)], axis=1).to_csv('test_data.csv', header=False, index=False)

pd.DataFrame(y_test), pd.DataFrame(X_test): Similar to the training data, labels and features for the testing data are converted into Pandas DataFrames.

  • pd.DataFrame(y_test) creates a new DataFrame containing the labels (y_test).
  • pd.DataFrame(X_test) creates a new DataFrame containing the features (X_test).
  • pd.concat([..., ...], axis=1): As before, pd.concat is used to concatenate these DataFrames along axis=1 to produce a single DataFrame with the first column as the label, followed by the feature columns.

.to_csv('test_data.csv', header=False, index=False): This DataFrame is saved as test_data.csv.

  • The flags header=False and index=False serve the same purpose as described above for the training data.

After this step, you will have two CSV files: train_data.csv and test_data.csv, each containing the label as the first column followed by feature columns. These files are ready to be uploaded to Amazon S3.
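
To confirm the files were written in the expected headerless, label-first layout, you can optionally read one back with Pandas.

# Optional: read the file back to verify its layout (no header, label in column 0)
check = pd.read_csv('train_data.csv', header=None)
print(check.shape)     # (120, 5)
print(check.head())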

Step 7: Creating an Amazon S3 Bucket using boto3

import boto3

def create_bucket(bucket_name, region=None):
    """Create an S3 bucket in a specified region.

    :param bucket_name: Bucket to create
    :param region: String region to create bucket in, e.g., 'us-west-2'
    :return: True if bucket created, else False
    """
    s3_client = boto3.client('s3', region_name=region)

    try:
        if region is None or region == 'us-east-1':
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name,
                                    CreateBucketConfiguration=location)
    except Exception as e:
        print(f"An error occurred: {e}")
        return False
    return True

# Set the bucket name and optionally a region
bucket_name = 'test-sterling-bucket-newcollar' 
region = boto3.Session().region_name  # or set it to a specific region as a string like 'us-east-1'

# Create the bucket
if create_bucket(bucket_name, region):
    print(f"S3 bucket {bucket_name} created successfully.")
else:
    print(f"Failed to create S3 bucket {bucket_name}.")

Purpose:

The purpose of this code section is to programmatically create an Amazon S3 bucket, which will be used to store the training and testing data as well as the output model artifacts generated by the Amazon SageMaker training job.

Description:

import boto3

  • This imports the boto3 library, which allows you to create, configure, and manage AWS services.

def create_bucket(bucket_name, region=None):

  • This defines a function named create_bucket that takes two parameters: bucket_name and region. bucket_name is the name you want to give to the S3 bucket, and region is the AWS region where you want to create the bucket. The default value for region is None.

s3_client = boto3.client('s3', region_name=region)

  • This creates an S3 client object from boto3. The client will operate in the region specified by the region parameter. If region is None, the client will use the default region.

try: ... except Exception as e:

  • This try-except block attempts to create the S3 bucket and captures any exceptions that may occur, printing an error message if something goes wrong.

s3_client.create_bucket(...)

  • This is the command that actually attempts to create the S3 bucket.
  • If region is either None or 'us-east-1', it creates the bucket without a CreateBucketConfiguration, because us-east-1 does not accept a LocationConstraint. For any other region, it passes the LocationConstraint for that region.

bucket_name = 'test-sterling-bucket-newcollar'

  • This sets the name of the S3 bucket to 'test-sterling-bucket-newcollar'. Note that S3 bucket names must be globally unique across all AWS accounts, so you should replace this with your own unique name.

region = boto3.Session().region_name

  • This gets the region of the current boto3 session. Alternatively, you could set region to a specific AWS region as a string.

if create_bucket(bucket_name, region): ... else: ...

  • This calls the create_bucket function and checks if it returns True (indicating that the bucket was created successfully) or False (indicating that the bucket creation failed).

After this step, you should have an S3 bucket ready, where you can upload your training and testing data as well as store the model artifacts generated by SageMaker.
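
If you want to confirm the bucket is reachable before uploading anything, an optional head_bucket call like the one below will raise an exception if the bucket does not exist or you lack permission to access it.

# Optional: verify the bucket is accessible; head_bucket raises an exception otherwise
s3 = boto3.client('s3')
s3.head_bucket(Bucket=bucket_name)
print(f"Bucket {bucket_name} is accessible.")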

Step 8: Initialize SageMaker Session, Upload Data to S3

# Initialize a SageMaker session
sagemaker_session = sagemaker.Session()

# Get the role
role = get_execution_role()

# Upload the dataset to an S3 bucket
input_train = sagemaker_session.upload_data(path='train_data.csv', bucket=bucket_name, key_prefix='iris/data')

Purpose:

This section of code prepares for the SageMaker training job by performing several tasks:

  1. Initializing a SageMaker session that provides methods for interacting with SageMaker and other AWS services.
  2. Retrieving the IAM (Identity and Access Management) role that SageMaker can assume to carry out tasks on your behalf.
  3. Uploading the training data to the previously created S3 bucket.

Description:

sagemaker_session = sagemaker.Session()

  • Initializes a new SageMaker session using the default settings. This session will be used for operations such as uploading data to S3 and launching training jobs. You can also use it to fetch details about your SageMaker environment.

role = get_execution_role()

  • Calls the get_execution_role() method from the sagemaker package. This retrieves the IAM role that you've set up for your SageMaker service. This role is crucial as it defines permissions for SageMaker, such as access to S3 buckets, launching training instances, etc.

input_train = sagemaker_session.upload_data(...)

  • Invokes the upload_data method on the sagemaker_session object.
  • path='train_data.csv': Indicates the local file that you want to upload to S3. This should be your training data saved in a CSV format.
  • bucket=bucket_name: Specifies the name of the S3 bucket to which the data should be uploaded. This should be the same bucket that you created earlier.
  • key_prefix='iris/data': This is a directory-like path in your bucket where the data file will be saved. Think of it as creating a folder called 'iris' and within that, another folder called 'data' in your S3 bucket.

The method returns the S3 URL of the uploaded data, which is stored in the input_train variable. This URL will be used later when specifying the input for the SageMaker training job.

By the end of this step, you will have an initialized SageMaker session, a retrieved IAM role, and your training data uploaded to an S3 bucket, all of which are prerequisites for launching a SageMaker training job.
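
As an optional check, you can print the returned S3 URI and list the objects under the key prefix to confirm the upload succeeded.

# Optional: confirm where the training data landed in S3
print(input_train)  # e.g. s3://<bucket>/iris/data/train_data.csv

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket_name, Prefix='iris/data')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])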

Step 9: Configure and Train SageMaker XGBoost Model

# Import helpers for retrieving the built-in XGBoost container and describing training input
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

# Note: newer SageMaker SDK releases may reject version='latest' and ask for a pinned
# version such as '1.5-1'; adjust this argument if that happens.
container = image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

output_path = f"s3://{bucket_name}/iris/output"

# Initialize the SageMaker estimator
estimator = sagemaker.estimator.Estimator(container,
                                          role,
                                          instance_count=1,
                                          instance_type='ml.m4.xlarge',
                                          output_path=output_path,
                                          sagemaker_session=sagemaker_session)

# Set the XGBoost hyperparameters including the num_round
estimator.set_hyperparameters(objective='multi:softmax', num_class=3, num_round=10)

# Specify the training data and its type
input_data = {
    'train': TrainingInput(input_train, content_type='text/csv')
}

# Fit the model
estimator.fit(input_data)

Purpose:

This section of the code sets up, configures, and trains an XGBoost model using Amazon SageMaker. It performs multiple tasks:

  1. Imports necessary modules and retrieves the XGBoost container.
  2. Sets the output path where the model artifacts will be stored.
  3. Initializes the SageMaker estimator with the training configurations.
  4. Sets hyperparameters for the XGBoost model.
  5. Specifies the training data.
  6. Finally, trains the model using the data.

Description:

from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

  • Imports the image_uris module and the TrainingInput class from the SageMaker SDK. image_uris resolves the Docker image for a built-in algorithm, and TrainingInput describes where the training data lives and how it is formatted.

container = image_uris.retrieve(...)

  • This function retrieves the image URI for the specified ML framework (XGBoost in this case). The function takes the current AWS region, the framework's name, and version. It returns the URI of the Docker container to be used.

output_path = f"s3://{bucket_name}/iris/output"

  • Specifies where to save the trained model artifacts in the S3 bucket. This path will hold the output of the training job.

estimator = sagemaker.estimator.Estimator(...)

  • Initializes the SageMaker estimator, which is a high-level object that handles end-to-end Amazon SageMaker training and deployment tasks.
  • container: The Docker image URI that was retrieved earlier.
  • role: The IAM role with permissions to launch the SageMaker training job.
  • instance_count: Number of EC2 instances to use for training.
  • instance_type: The type of EC2 instance to use.
  • output_path: Where to save the output (model artifacts).
  • sagemaker_session: The SageMaker session object created earlier.

estimator.set_hyperparameters(...)

  • Sets the hyperparameters for the XGBoost model.
  • objective='multi:softmax': The learning objective. multi:softmax performs multiclass classification and returns the predicted class index directly.
  • num_class=3: Number of unique classes in the label.
  • num_round=10: Number of rounds for boosting.

input_data = {'train': TrainingInput(input_train, content_type='text/csv')}

  • Prepares the training data input for the training job. TrainingInput is used to specify the source and type of the training data.
  • input_train: The S3 location of the uploaded training data.
  • content_type='text/csv': Specifies that the content is in CSV format.

estimator.fit(input_data)

  • Finally, this line triggers the training of the model. It specifies the input data and kicks off the model training, using all the configurations and data specified.

This step will output a trained model artifact, which will be saved in the specified S3 bucket. This trained model can then be deployed to make predictions.
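
Once fit() completes, the estimator records where the model artifact was written; printing it is an optional but convenient way to confirm the training job finished and to locate the output.

# Optional: S3 location of the trained model artifact (a model.tar.gz file)
print(estimator.model_data)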

Step 10: Deploy the Trained Model

predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Purpose:

This section deploys the trained machine learning model to a real-time endpoint, allowing you to make predictions against the model.

Description:

predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

This line deploys the trained model to a SageMaker endpoint.

  • initial_instance_count=1: Specifies that the endpoint should initially be backed by a single instance. You can increase this number (or configure autoscaling) for high-availability or high-throughput scenarios.
  • instance_type='ml.m4.xlarge': Specifies the type of machine that you want to deploy your model to. This is typically the same or similar to the instance used for training, although it doesn't have to be.

The deploy method returns a Predictor object (predictor in this case), which you can use to invoke the endpoint and make real-time predictions. The SageMaker endpoint is a managed HTTPS API that you can call to get predictions from your model, and it can be scaled out by adding instances or configuring autoscaling.

Once this line is executed, behind the scenes, SageMaker will do several things:

  • Provision the specified EC2 instance(s).
  • Deploy the Docker container onto the instance(s).
  • Deserialize the saved model artifacts from the S3 bucket.
  • Load the model into memory on the instance(s).
  • Finally, expose an HTTPS endpoint for making model predictions.

You can then call this endpoint to make real-time predictions by sending a POST request that contains input data. The trained model at the endpoint will process the data and return predictions.
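
For illustration, here is a sketch of how you could invoke the same endpoint outside the SageMaker Python SDK, using boto3's sagemaker-runtime client. The feature values in the payload are just an example Iris measurement, and the raw response format depends on the XGBoost container version.

# Sketch: calling the endpoint directly with boto3 (alternative to the Predictor object)
runtime = boto3.client('sagemaker-runtime')

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,   # name assigned when the model was deployed
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2'                  # one sample: four comma-separated feature values
)
print(response['Body'].read())              # raw prediction bytes returned by the model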

Step 11: Make Predictions Using the Deployed Model

from sagemaker.serializers import CSVSerializer

predictor.serializer = CSVSerializer()
result = predictor.predict(X_test[0:2].tolist())
print("Predictions: ", result)

Purpose:

The purpose of this code block is to demonstrate how to use the deployed model to make real-time predictions. Here, you make predictions on a small sample of the test dataset (the first two records) and print the results.

Description:

from sagemaker.serializers import CSVSerializer

  • Import the CSVSerializer class from SageMaker's serializers module. This will be used to serialize Python lists into the CSV string format before making HTTP POST requests to the endpoint.

predictor.serializer = CSVSerializer()

  • Assign an instance of CSVSerializer to the serializer attribute of the predictor object. This means that any Python list passed to predictor.predict will automatically be converted (serialized) into a CSV-formatted string before being sent to the SageMaker endpoint.

result = predictor.predict(X_test[0:2].tolist())

  • The predict method makes a real-time inference request to the SageMaker endpoint.
  • X_test[0:2].tolist() slices the first two records from the test data and converts them into a list. These two records are what we want to get predictions for.

print("Predictions: ", result)

  • Print the prediction results. The predict method returns the model's prediction as a bytestring, which is then printed to the console. In the context of the Iris dataset, these predictions will indicate the type of Iris flower that each record most likely represents.

After running this block, you will see the model's predictions for the first two test samples. You can then compare these predictions to the actual labels to see how well your model is performing.
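
As an optional final step, you can decode the returned bytestring, compare the predicted class indices with the true labels, and delete the endpoint when you are finished so it stops incurring charges. The exact response format (comma- or newline-separated values) can vary with the XGBoost container version, so the parsing below handles both.

# Optional: decode the predictions and compare them with the true labels
values = result.decode('utf-8').strip().replace('\n', ',').split(',')
predicted = [int(float(v)) for v in values if v]
print("Predicted:", predicted)
print("Actual:   ", y_test[0:2].tolist())

# Clean up: delete the endpoint when finished so it no longer incurs charges
predictor.delete_endpoint()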

Conclusion

Congratulations! You have successfully navigated through the entire machine learning workflow, from data preparation and local model training to deploying your model on Amazon SageMaker. By following this tutorial, you've gained a solid understanding of several crucial machine learning and cloud computing concepts.

This guide aimed to provide you with an end-to-end understanding, making you well-equipped to scale this process for more complex datasets and algorithms. You learned how to use scikit-learn for initial data processing and model training and then shifted to Amazon SageMaker for more scalable training and deployment. You also got a glimpse of how to evaluate your model by making real-time predictions.

Remember, the journey doesn't stop here. There are many more algorithms to try, hyperparameters to tune, and techniques to learn to improve your model's performance further. Happy machine learning!