Detecting playful animal behavior in videos using Amazon Rekognition Custom Labels

Humans have long observed animal behavior and applied those observations for different purposes. In animal ecology, for example, researchers record how often behaviors occur, when they occur, and whether they differ between individuals. However, identifying and monitoring these behaviors and movements is difficult and time-consuming. To automate this workflow, an agile team from a pharmaceutical customer (Sumitomo Dainippon Pharma Co., Ltd.) and AWS Solutions Architects created a solution with Amazon Rekognition Custom Labels. Amazon Rekognition Custom Labels makes it easy to label specific movements in images, and to train and build a model that detects these movements.

In this post, we show you how machine learning (ML) can help automate this workflow in a fun and simple way. We trained a custom model that detects playful behaviors of cats in a video using Amazon Rekognition Custom Labels. We hope to contribute to the aforementioned fields, such as biology, by publicizing the architecture, our building process, and the source code for this solution.

About Amazon Rekognition Custom Labels

Amazon Rekognition Custom Labels is an automated ML feature that enables you to quickly train your own custom models for detecting business-specific objects and scenes from images—no ML experience required. For example, you can train a custom model to find your company logos in social media posts, identify your products on store shelves, or classify unique machine parts in an assembly line.

Amazon Rekognition Custom Labels builds on the existing capabilities of Amazon Rekognition, which is already trained on tens of millions of images across many categories. Instead of thousands of images, you only need to upload a small set of training images (typically a few hundred images or fewer) that are specific to your use case. If your images are already labeled, Amazon Rekognition Custom Labels can begin training in just a few clicks. If not, you can label them directly within the Amazon Rekognition Custom Labels labeling interface, or use Amazon SageMaker Ground Truth to label them for you.

After Amazon Rekognition begins training from your image set, it can produce a custom image analysis model for you in just a few hours. Amazon Rekognition Custom Labels automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then use your custom model via the Amazon Rekognition Custom Labels API and integrate it into your applications.

Solution overview

The following diagram shows the architecture of the solution. When you have a model in place, the whole process of detecting specific behaviors in a video is automated; all you need to do is upload a video file (.mp4).

The workflow contains the following steps:

  1. You upload a video file (.mp4) to Amazon Simple Storage Service (Amazon S3), which invokes AWS Lambda, which in turn starts an Amazon Rekognition Custom Labels inference endpoint and sends a message to Amazon Simple Queue Service (Amazon SQS). It takes about 10 minutes to launch the inference endpoint, so we use Amazon SQS to defer the next step.
  2. Amazon SQS invokes a Lambda function to do a status check of the inference endpoint, and launches Amazon Elastic Compute Cloud (Amazon EC2) if the status is Running.
  3. Amazon CloudWatch Events detects the Running status of Amazon EC2 and invokes a Lambda function, which runs a script on Amazon EC2 using AWS Systems Manager Run Command.
  4. On Amazon EC2, the script calls the inference endpoint of Amazon Rekognition Custom Labels to detect specific behaviors in the video uploaded to Amazon S3 and writes the video with the inferred results to Amazon S3.
  5. When the inferred result file is uploaded to Amazon S3, a Lambda function launches to stop Amazon EC2 and the Amazon Rekognition Custom Labels inference endpoint.
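
Steps 1 and 2 above could be sketched roughly as follows. This is a minimal illustration, not the code deployed by the solution: the function names, the message body, and the idea of passing the clients in as arguments are assumptions made for the example. The client methods themselves (start_project_version, describe_project_versions, send_message, start_instances) are the standard boto3 calls for Amazon Rekognition, Amazon SQS, and Amazon EC2.

```python
import json


def handle_video_upload(rekognition, sqs, project_version_arn, queue_url,
                        video_key, delay_seconds=600):
    """Step 1 (sketch): start the model and queue a delayed status check."""
    # StartProjectVersion requires MinInferenceUnits; 1 is the smallest value.
    rekognition.start_project_version(
        ProjectVersionArn=project_version_arn,
        MinInferenceUnits=1,
    )
    # Amazon SQS caps the per-message delay at 900 seconds.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"video_key": video_key}),  # hypothetical body
        DelaySeconds=min(delay_seconds, 900),
    )


def check_model_and_start_instance(rekognition, ec2, project_arn,
                                   version_name, instance_id):
    """Step 2 (sketch): start the EC2 instance only if the model is RUNNING."""
    response = rekognition.describe_project_versions(
        ProjectArn=project_arn,
        VersionNames=[version_name],
    )
    status = response["ProjectVersionDescriptions"][0]["Status"]
    if status == "RUNNING":
        ec2.start_instances(InstanceIds=[instance_id])
        return True
    return False
```

In a Lambda function, the clients would typically come from boto3.client('rekognition'), boto3.client('sqs'), and boto3.client('ec2'). Passing them in as arguments keeps the logic easy to exercise with stand-in objects.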


For this walkthrough, you should have the following prerequisites:

  • An AWS account – You can create a new account if you don’t have one yet.
  • A key pair – You need a key pair to log in to the EC2 instance that uses Amazon Rekognition Custom Labels to detect specific behaviors. You can either use an existing key pair or create a new key pair. For more information, see Amazon EC2 key pairs and Linux instances.
  • A video for inference – This solution uses a video (.mp4 format) for inference. You can use your own video or the one we provide in this post.

Launching your AWS CloudFormation stack

Launch the provided AWS CloudFormation template.

After you launch the template, you’re prompted to enter the following parameters:

  • KeyPair – The name of the key pair used to connect to the EC2 instance
  • ModelName – The model name used for Amazon Rekognition Custom Labels
  • ProjectARN – The project ARN used for Amazon Rekognition Custom Labels
  • ProjectVersionARN – The model version name used for Amazon Rekognition Custom Labels
  • YourCIDR – The CIDR including your public IP address
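
If you prefer to launch the stack programmatically, the five parameters above could be assembled as in the following sketch. The helper name build_stack_parameters is hypothetical. The ParameterKey/ParameterValue list is the shape that the boto3 CloudFormation create_stack call accepts for its Parameters argument. The CIDR check uses only the Python standard library.

```python
import ipaddress


def build_stack_parameters(key_pair, model_name, project_arn,
                           project_version_arn, your_cidr):
    """Return the stack parameters in the list form CloudFormation expects."""
    # Fail early on a malformed CIDR block. strict=False tolerates host bits,
    # so 203.0.113.7/24 is normalized to 203.0.113.0/24.
    network = ipaddress.ip_network(your_cidr, strict=False)
    values = {
        "KeyPair": key_pair,
        "ModelName": model_name,
        "ProjectARN": project_arn,
        "ProjectVersionARN": project_version_arn,
        "YourCIDR": str(network),
    }
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in values.items()
    ]
```

For a single workstation, a /32 block such as 203.0.113.7/32 (a documentation address used here as a placeholder) limits access to that one IP address.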

For this post, we use the following video to detect whether a cat is punching or not. For our object detection model, we prepared an annotated dataset and trained it in advance, as shown in the following section.

This solution uses the US East (N. Virginia) Region, so make sure to work in that Region when following along with this post.

Adding annotations to images from the video

To annotate your images, complete the following steps:

  1. To create images that the model uses for learning, you need to split the video into a series of still images. For this post, we prepared 377 images (the ratio of normal images to punching images is about 2:1) and annotated them.
  2. Store the series of still images in Amazon S3 and annotate them. You can use Ground Truth to annotate them.
  3. Because we’re creating an object detection model, select Bounding box for the Task type.
  4. For our use case, we want to tell if a cat is punching or not in the video, so we create a labeling job using two labels: normal to define basic sitting behavior, and punch to define playful behavior.
  5. For annotation, you should surround the cat with the normal label bounding box when the cat isn’t punching, and surround the cat with the punch label bounding box when the cat is punching.

When the cat is punching, the image of the cat’s paws should look blurred, so based on how blurred the image is, you can determine whether the cat is punching or not and annotate the image.
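
Step 1 above, splitting the video into still images, could be done with OpenCV along the following lines. This is a sketch under assumptions: the sampling interval, the output file naming, and both function names are illustrative, and the post does not say how its 377 images were actually extracted.

```python
def sample_frame_indices(total_frames, fps, every_seconds=1.0):
    """Pick one frame index per sampling interval."""
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


def extract_frames(video_path, output_dir, every_seconds=1.0):
    """Save the sampled frames of a video as JPEG files."""
    import os
    import cv2  # opencv-python; imported here so the helper above needs no extras

    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    saved = []
    for index in sample_frame_indices(total, fps, every_seconds):
        # Jump to the sampled frame and decode it.
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if not ok:
            break
        path = os.path.join(output_dir, "frame_{:06d}.jpg".format(index))
        cv2.imwrite(path, frame)
        saved.append(path)
    cap.release()
    return saved
```

A shorter interval yields more near-duplicate frames to annotate, while a longer one may miss brief movements such as a punch, so the interval is worth tuning for your footage.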

Training a custom ML model

To start training your model, complete the following steps:

  1. Create an object detection model using Amazon Rekognition Custom Labels. For instructions, see Getting Started with Amazon Rekognition Custom Labels.
  2. When you create a dataset, choose Import images labeled by SageMaker Ground Truth for Image location.
  3. Set the output.manifest file path that was output by the Ground Truth labeling job.

To find the path of the output.manifest file, on the Amazon SageMaker console, on the Labeling jobs page, choose your video. The information is located on the Labeling job summary page.

  4. When the model has finished learning, save the ARN listed in the Use your model section at the bottom of the model details page. We use this ARN later on.
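
Before importing the output.manifest file from step 3, it can be useful to check how many boxes each label received. The sketch below assumes the JSON Lines layout that Ground Truth documents for bounding box jobs: each line carries a label attribute with an annotations list, plus a matching "-metadata" entry with a class-map. The attribute name in your manifest depends on your labeling job, so the code looks it up rather than hard-coding it. The function name count_labels is hypothetical.

```python
import json
from collections import Counter


def count_labels(manifest_lines):
    """Count bounding boxes per label name across manifest lines."""
    counts = Counter()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, metadata in record.items():
            if not key.endswith("-metadata"):
                continue
            # The label attribute shares its name with the metadata key.
            attribute = key[: -len("-metadata")]
            class_map = metadata.get("class-map", {})
            for box in record.get(attribute, {}).get("annotations", []):
                label = class_map.get(str(box["class_id"]), "unknown")
                counts[label] += 1
    return counts
```

For a local copy of the file, something like count_labels(open("output.manifest")) would give the per-label totals, which you can compare against the roughly 2:1 normal-to-punch ratio mentioned earlier.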

For reference, the F1 score for normal and punch was above 0.9 in our use case.
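
As a reminder of what this metric means, the F1 score is the harmonic mean of precision and recall. The following sketch shows the arithmetic; the precision and recall values in the usage note are made-up illustrations, not results from our model.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, f1_score(0.9, 0.95) is about 0.924. Because the harmonic mean is pulled toward the smaller value, an F1 score above 0.9 implies that neither precision nor recall is very low.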

Uploading a video for inference on Amazon S3

You can now upload your video for inference.

  1. On the Amazon S3 console, navigate to the bucket you created with the CloudFormation stack (it should include rekognition in the name).
  2. Choose Create folder.
  3. Create the folder inputMovie.
  4. Upload the file you want to infer.
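
The same upload can also be scripted instead of using the console. The sketch below assumes a boto3 S3 client is passed in; the helper names are hypothetical, while upload_file(Filename, Bucket, Key) is the standard boto3 S3 client method.

```python
import os


def input_key_for(local_path, folder="inputMovie"):
    """Build the S3 object key for a video placed in the input folder."""
    name = os.path.basename(local_path)
    if not name.lower().endswith(".mp4"):
        raise ValueError("expected an .mp4 file: " + name)
    return folder + "/" + name


def upload_video(s3_client, bucket_name, local_path):
    """Upload the video so that it lands under inputMovie/ in the bucket."""
    key = input_key_for(local_path)
    s3_client.upload_file(local_path, bucket_name, key)
    return key
```

With s3_client = boto3.client('s3'), calling upload_video(s3_client, '<BucketName>', 'cat.mp4') would create the object inputMovie/cat.mp4. Amazon S3 has no real folders, so the key prefix is what the console displays as the folder.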

Setting up a script on Amazon EC2

This solution calls the Amazon Rekognition API to infer the video on Amazon EC2, so you need to set up a script on Amazon EC2.

  1. Log in to the EC2 instance via SSH with the following command and the key pair you created:
ssh -i <Your key Pair> ubuntu@<EC2 IPv4 Public IP>
Are you sure you want to continue connecting (yes/no)? yes
Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-1065-aws x86_64)
ubuntu@ip-10-0-0-207:~$ cd code/
ubuntu@ip-10-0-0-207:~/code$ vi

It takes approximately 30 minutes to install and build the necessary libraries.

  2. Copy the following code into the script file and replace <BucketName> with the name of the S3 bucket created by AWS CloudFormation. This code uses OpenCV to split the video into frames and sends each frame to the inference endpoint of Amazon Rekognition Custom Labels to perform behavior detection. It then draws the inferred behavior detection result on each frame and puts the frames back together to reconstruct a video.
import boto3
import cv2
import json
import os
import ffmpeg


def get_parameters(param_key):
    # Read a single value from AWS Systems Manager Parameter Store.
    ssm = boto3.client('ssm', region_name='us-east-1')
    response = ssm.get_parameters(Names=[param_key])
    return response['Parameters'][0]['Value']


def analyzeVideo():
    s3 = boto3.resource('s3')
    rekognition = boto3.client('rekognition', region_name='us-east-1')
    # The parameter holds the S3 key of the uploaded video.
    parameter_value = get_parameters('/Movie/<BucketName>')
    dirname, video = os.path.split(parameter_value)
    bucket = s3.Bucket('<BucketName>')
    bucket.download_file(parameter_value, video)
    # Look up the model version ARN once instead of once per frame.
    project_version_arn = get_parameters('ProjectVersionArn')

    customLabels = []
    cap = cv2.VideoCapture(video)
    frameRate = cap.get(cv2.CAP_PROP_FPS)
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    # The output video is written at a fixed 18 frames per second.
    writer = cv2.VideoWriter(video + '-output.avi', fourcc, 18, (int(width), int(height)))

    # Read the video frame by frame until no frame is returned.
    while cap.isOpened():
        frameId = cap.get(cv2.CAP_PROP_POS_FRAMES)
        print("Processing frame id: {}".format(frameId))
        ret, frame = cap.read()
        if (ret != True):
            break
        # Encode the frame as JPEG so it can be sent as image bytes.
        hasFrame, imageBytes = cv2.imencode(".jpg", frame)

        if hasFrame:
            response = rekognition.detect_custom_labels(
                Image={
                    'Bytes': imageBytes.tobytes(),
                },
                ProjectVersionArn=project_version_arn
            )

            for output in response["CustomLabels"]:
                Name = output['Name']
                Confidence = str(output['Confidence'])
                # Bounding box values are ratios of the frame size; convert to pixels.
                w = output['Geometry']['BoundingBox']['Width']
                h = output['Geometry']['BoundingBox']['Height']
                left = output['Geometry']['BoundingBox']['Left']
                top = output['Geometry']['BoundingBox']['Top']
                w = int(w * width)
                h = int(h * height)
                left = int(left * width)
                top = int(top * height)

                # Position of the frame in the video, in milliseconds.
                output["Timestamp"] = (frameId / frameRate) * 1000
                # 'Moving' must match a label name that your model returns.
                # Assumption: that label is drawn in red and any other label in green.
                if Name == 'Moving':
                    color = (0, 0, 255)
                else:
                    color = (0, 255, 0)
                cv2.putText(frame, Name + ":" + Confidence + "%", (left, top), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1, cv2.LINE_AA)
                # Assumption: the box itself is drawn from the pixel values computed above.
                cv2.rectangle(frame, (left, top), (left + w, top + h), color, 2)
                customLabels.append(output)

        writer.write(frame)

    writer.release()
    cap.release()

    # Save the raw detection results and upload them to Amazon S3.
    with open(video + ".json", "w") as f:
        f.write(json.dumps(customLabels))
    bucket.upload_file(video + ".json", 'output-json/ec2-output.json')

    # Convert the annotated AVI file to H.264 MP4 and upload it.
    stream = ffmpeg.input(video + '-output.avi')
    stream = ffmpeg.output(stream, video + '-output.mp4', pix_fmt='yuv420p', vcodec='libx264')
    stream = ffmpeg.overwrite_output(stream)
    ffmpeg.run(stream)
    bucket.upload_file(video + '-output.mp4', 'output/' + video + '-output.mp4')


analyzeVideo()
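
The coordinate and timestamp arithmetic in the script above can be isolated into two small helpers, which makes it easier to check by hand. This is only an illustrative refactoring with hypothetical function names. It relies on the fact that Amazon Rekognition returns BoundingBox values as ratios of the image width and height.

```python
def to_pixel_box(bounding_box, frame_width, frame_height):
    """Convert a ratio-based BoundingBox into (left, top, width, height) pixels."""
    left = int(bounding_box["Left"] * frame_width)
    top = int(bounding_box["Top"] * frame_height)
    width = int(bounding_box["Width"] * frame_width)
    height = int(bounding_box["Height"] * frame_height)
    return left, top, width, height


def frame_timestamp_ms(frame_id, frame_rate):
    """Position of a frame in the video, in milliseconds."""
    return (frame_id / frame_rate) * 1000
```

For a 640 x 480 frame, a box with Left 0.25, Top 0.5, Width 0.5, and Height 0.25 maps to a rectangle whose top-left corner is at (160, 240) and whose size is 320 x 120 pixels.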



Stopping the EC2 instance

Stop the EC2 instance after you create the script in it. The EC2 instance is automatically launched when a video file is uploaded to Amazon S3.

The solution is now ready for use.

Detecting movement in the video

To implement your solution, upload a video file (.mp4) to the inputMovie folder you created. This launches the endpoint for Amazon Rekognition Custom Labels.

When the status of the endpoint changes to Running, Amazon EC2 launches and performs behavior detection. A video containing behavior detection data is uploaded to the output folder in Amazon S3.

When you log in to the EC2 instance, you can see that a video file with the inferred results merged in was created under the code folder.

The video file is stored in the output folder created in Amazon S3. This causes the endpoint for Amazon Rekognition Custom Labels and Amazon EC2 to stop.
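
The shutdown step could look roughly like the following sketch. It assumes that the Lambda function is triggered by an Amazon S3 event and that objects under the output/ prefix signal a finished run. Both are assumptions for illustration, as are the function names. stop_project_version and stop_instances are the standard boto3 calls.

```python
def is_result_object(s3_event, prefix="output/"):
    """Return True if the S3 event refers to an object under the result prefix."""
    keys = [
        record["s3"]["object"]["key"]
        for record in s3_event.get("Records", [])
    ]
    return any(key.startswith(prefix) for key in keys)


def stop_inference_resources(rekognition, ec2, project_version_arn, instance_id):
    """Stop the Custom Labels model and the EC2 instance to avoid idle charges."""
    rekognition.stop_project_version(ProjectVersionArn=project_version_arn)
    ec2.stop_instances(InstanceIds=[instance_id])
```

Stopping the model matters for cost because a running Amazon Rekognition Custom Labels model is billed for the time it is running, whether or not it receives requests.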

The following video is the result of detecting a specific movement (punch) of the cat:

Cleaning up

To avoid incurring future charges, delete the resources you created.

Conclusion and next steps

This solution automates detecting specific actions in a video. In this post, we created a model to detect specific cat behaviors using Amazon Rekognition Custom Labels, but you can also use custom labels to identify cells in images (such data is abundant in the research field). For example, the following screenshot shows the inferred results of a model trained to detect leukocytes, erythrocytes, and platelets. We had the model learn from 20 datasets, and it can now detect cells with distinctive features that are visible to the human eye. Its accuracy can increase as more high-resolution data is added and as annotations are done more carefully.

Amazon Rekognition Custom Labels has a wide range of use cases in the research field. If you want to try this in your organization and have any questions, please reach out to us or your Solutions Architects team and they will be excited to assist you.

About the Authors

Hidenori Koizumi is a Solutions Architect in Japan’s Healthcare and Life Sciences team. He is good at developing solutions in the research field based on his scientific background (biology, chemistry, and more). His specialty is machine learning, and he has recently been developing applications using React and TypeScript. His hobbies are traveling and photography.





Mari Ohbuchi is a Machine Learning Solutions Architect at Amazon Web Services Japan. She worked on developing image processing algorithms for about 10 years at a manufacturing company before joining AWS. In her current role, she supports the implementation of machine learning solutions and creating prototypes for manufacturing and ISV/SaaS customers. She is a cat lover and has published blog posts, hands-on content, and other content that involves both AWS AI/ML services and cats.

