Scenario: Detecting objects in videos

The goal of this scenario is to create a deep learning model to monitor traffic on a busy road.

This scenario uses a video that shows traffic on the road during the day. The video is used to determine how many cars are on the road and what the peak traffic times are.

The video file that is used in this scenario is available for download here: Download video file.

Follow these steps to create a deep learning model:

  1. Import a video.
  2. Label objects in a video.
  3. Train a model.
  4. Deploy a trained model.
  5. Automatically label frames in a video.

Step 1: Import a video and create a data set

Create a data set and add videos to it.

  1. Log in to Maximo® Visual Inspection.
  2. Click Data Sets in the side bar to open the Data Sets page. You can choose from several ways to create a new data set. For this example, create a new, empty data set.
  3. From the Data set page, click the icon and name the data set Traffic Video.
  4. To add a video to the data set, click the Traffic Video data set and click Import file or drag the video to the + area.
Important: Do not navigate away from Maximo Visual Inspection, close the tab or window, or refresh the page until the upload completes. However, you can go to different pages within Maximo Visual Inspection during the upload.

Step 2: Label objects in a video

Label objects in the video. For object detection, you must have a minimum of five labels for each object. Create "Car" and "Motorcycle" objects, then label at least five frames that contain cars and at least five frames that contain motorcycles.

  1. Select the video from your data set and select Label Objects.
  2. Capture frames by using one of these methods:
    • Click Auto capture frames and specify a value for Capture Interval (Seconds) that results in at least five frames. For this scenario, select this option and specify 10 seconds. Depending on the length and size of the video and the capture interval that you specify, the process to capture frames can take several minutes.
    • Click Capture frame to manually capture frames. If you use this option, you must capture a minimum of five frames from the video.
  3. If you used Auto capture frames, verify that the captured frames contain enough of each object type. If they do not, capture more frames manually: play the video, click pause when the frame that you want is displayed, and then click Capture Frame. In this scenario, the motorcycle appears in only one automatically captured frame, at 40 seconds, so you must capture at least four more frames that contain it. The motorcycle comes into view at 36.72 seconds. To correctly capture the motorcycle in motion, create extra frames at 37.79 seconds, 41.53 seconds, and 42.61 seconds.
  4. Create new object labels for the data set by clicking Add new by the Objects list. Enter Car and click Add. Then, enter Motorcycle and click OK. If you later want to delete a label, you must delete it at the data set level; you cannot delete it from an individual frame or image.
  5. Label the objects in the frames:
    • Select the first frame in the carousel.
    • Select the correct object label, for example, "Car".
    • Choose Box or Polygon, depending on the shape that you want to draw around each object. Boxes are faster to label and train, but less accurate. Only Detectron or High resolution models support polygons. However, if you label your objects with polygons and then use the data set to train a model that does not support polygons, bounding boxes are derived from the polygons and used instead. Draw the appropriate shape around the object. While Box or Polygon is selected, hold down the Alt key for non-drawing interactions in the image, such as selecting, moving, or editing previously drawn shapes, or panning the image with the mouse. To return to the normal mouse interactions, deselect the Box or Polygon button.
    For more information about identifying and drawing objects in video frames, see Guidelines for identifying and drawing objects.
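The "at least five frames" requirement can be sanity-checked with a little arithmetic before you run auto capture. The sketch below assumes a frame is grabbed at time zero and at every interval thereafter; the 50-second duration is illustrative, not the scenario video's exact length.

```python
import math

def max_interval(duration_s: float, min_frames: int = 5) -> float:
    """Largest capture interval (seconds) that still yields min_frames frames."""
    return duration_s / min_frames

def frames_captured(duration_s: float, interval_s: float) -> int:
    """Frames produced by auto capture at a fixed interval, counting t=0."""
    return math.floor(duration_s / interval_s) + 1

print(max_interval(50))          # 10.0
print(frames_captured(50, 10))   # 6
```

If the computed interval produces too few frames of a rarer object, such as the motorcycle in this scenario, manual captures make up the difference.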

The following figure displays the captured video frame at 41.53 seconds with object labels of Car and Motorcycle. Figure 1 also displays a box around the five frames (four of the frames were added manually) in the carousel that required object labels for the motorcycle that is in each frame.

Figure 1. Labeling objects in Maximo Visual Inspection
The image displays the Maximo Visual Inspection GUI: a screen capture of the video frame with object labels for the cars and motorcycle, followed by an image carousel that contains frames from the video with timestamps.

Step 3: Train a model

With all the object labels that are identified in your data set, you can now train your deep learning model. To train a model, complete the following steps:

  1. From the Data set page, click Train.
  2. Complete the fields on the Train Data set page, ensuring that you select Object Detection. For Model selection, choose Accuracy (faster R-CNN).
  3. Click Train.
  4. (Optional; supported only when training for object detection.) Stop the training process by clicking Stop training > Keep Model > Continue. You can wait for the entire training process to complete, but you can also stop it when the lines in the training graph start to flatten out, as shown in Figure 2. Improvements in training quality tend to plateau over time, so the fastest way to deploy a model and refine the data set is to stop the process before quality stops improving. Use the early-stop function carefully when you train segmented object detection models (such as models that use the Detectron or High resolution model types): larger iteration counts and longer training times can improve accuracy even when the graph indicates that accuracy is plateauing, and the precision of the label can continue to improve even after the accuracy of identifying the object location stops improving.
    Figure 2. Model training graph
    The image shows loss on the vertical axis and iterations on the horizontal axis. As more iterations occur, the loss line converges to a flat line.
    If the training graph converges quickly and has 100% accuracy, the data set does not have enough information. The same is true if the accuracy of the training graph fails to rise or the errors in the graph do not decrease at the end of the training process. For example, a model with high accuracy might be able to discover all instances of different race cars. However, the same model might be unable to differentiate between specific race cars or cars that have different colors. In this situation, add more images, video frames, or videos to the data set. Then, label those objects and try the training again.
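The "lines flatten out" condition from Figure 2 can be made concrete: treat the loss as plateaued when its improvement over a recent window falls below a small tolerance. The window size, tolerance, and loss readings below are made-up illustrative values, not output from Maximo Visual Inspection.

```python
def has_plateaued(losses: list[float], window: int = 5, tol: float = 0.01) -> bool:
    """True when loss improved by less than tol over the last `window` readings."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return (recent[0] - recent[-1]) < tol

# Sample loss readings: steep early drop, then a flat tail.
losses = [2.0, 1.2, 0.7, 0.45, 0.32, 0.30, 0.295, 0.294, 0.293, 0.293]
print(has_plateaued(losses))  # True
```

As the text notes, apply this kind of early-stop heuristic cautiously for segmented models, where accuracy can keep improving past an apparent plateau.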

Step 4: Deploy a trained model

GPU usage depends on the model type:

  • Each High resolution, Structured segment network (SSN), Anomaly optimized, or custom deployed model takes one GPU. The GPU group is listed as '-', which indicates that this model uses a full GPU and does not share the resource with any other deployed models.
  • Multiple Faster R-CNN, GoogLeNet, SSD, YOLO v3, Tiny YOLO v3, and Detectron2 models can be deployed to a single GPU. That is, each model is deployed to the GPU that has the most models deployed on it, if sufficient memory is available on that GPU. Use the GPU group to determine which deployed models share a GPU resource. To free up a GPU, all deployed models in a GPU group must be deleted or undeployed. IBM® Maximo Visual Inspection leaves a variably sized buffer on the GPU; the buffer size depends on the combination of models that are currently deployed.

To deploy the trained model, complete the following steps.

  1. Click Models from the menu.
  2. Select the model that you created in the previous section and click Deploy.
  3. Specify a name for the model, and click Deploy. The Deployed Models page is displayed, and the model is deployed when the status column displays Ready.
  4. Double-click the deployed model to get the API endpoint and test other videos or images against the model.
    Note: For more information about APIs, see REST APIs.
Note: Because High resolution models are compute-intensive, they take much longer than other models to perform video and image inference.
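After testing a video or image against the API endpoint, the per-frame car counts that this scenario is after can be tallied from the inference response. The response shape below, a "classified" list of detections with "label" and "confidence" fields, is an assumption for illustration; verify the actual schema in the REST APIs documentation.

```python
from collections import Counter

def count_objects(response: dict, min_confidence: float = 0.5) -> Counter:
    """Count detections per label, keeping only sufficiently confident ones."""
    return Counter(
        det["label"]
        for det in response.get("classified", [])
        if det.get("confidence", 0.0) >= min_confidence
    )

# Example inference response for a single frame (made-up values):
sample = {
    "classified": [
        {"label": "Car", "confidence": 0.97},
        {"label": "Car", "confidence": 0.91},
        {"label": "Motorcycle", "confidence": 0.88},
        {"label": "Car", "confidence": 0.31},  # below threshold, ignored
    ]
}
print(count_objects(sample))  # Counter({'Car': 2, 'Motorcycle': 1})
```

Filtering on a confidence threshold keeps low-quality detections, such as partially visible vehicles at the frame edge, out of the traffic counts.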

Step 5: Automatically label frames in a video

You can use the auto-label function to automatically identify objects in the frames of a video after you deploy a model.

In this scenario, you have only nine frames. To improve the accuracy for your deep learning model, you can add more frames to the data set. Remember, you can rapidly iterate by stopping the training on a model and checking the results of the model against a test data set. You can also use the model to auto-label more objects in your data set. This process improves the overall accuracy of your final model.

To use the auto-label function, complete the following steps:

Note: Any frames that were previously captured by using auto-capture but were not manually labeled are deleted before auto-labeling. Deleting these frames helps avoid labeling duplicate frames. Manually captured frames are not deleted.
  1. Click Data sets from the menu, and select the data set that you used to create the previously trained model.
  2. Select the video in the data set that had nine frames, and click Label Objects.
  3. Click Auto label.
  4. Specify how often you want to capture and automatically label frames, select the name of the trained model that you deployed in step 3 of the deployment phase, and then click Auto label. In this scenario, you previously captured frames every 10 seconds. To improve the accuracy of the deep learning model by capturing and labeling more frames, specify 6 seconds.
  5. After the auto-label process completes, the new frames are added to the carousel. Click the new frames and verify that the objects have the correct labels. Object labels that were added automatically are green; object labels that you added manually are blue. In this scenario, the carousel now has 17 frames.

Next steps

You can continue to refine the data set as much as you want. When you are satisfied with the data set, you can retrain the model by repeating the first three phases (import, label, train).

This time when you retrain the model, you might want to train the model for a longer time to improve the overall accuracy of the model. The goal is for the loss lines in the training model graph to converge to a stable flat line. The lower the line value, the better.

After the training is completed, you can redeploy the model by completing steps 1 - 3 of the deployment phase. You can double-click the deployed model to get the API endpoint and test other videos or images against the model.
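Once the model is redeployed, the scenario's second goal, finding peak traffic times, can be sketched by ranking per-frame car counts by timestamp. The timestamps and counts below are made-up illustrative data, not results from the scenario video.

```python
def peak_times(counts_by_time: dict[float, int], top: int = 2) -> list[float]:
    """Return the timestamps (seconds) with the highest car counts."""
    ranked = sorted(counts_by_time, key=counts_by_time.get, reverse=True)
    return ranked[:top]

# Hypothetical car counts keyed by frame timestamp in seconds:
frame_counts = {0.0: 3, 10.0: 7, 20.0: 9, 30.0: 4, 40.0: 8}
print(peak_times(frame_counts))  # [20.0, 40.0]
```

For a real monitoring deployment, the same ranking would run over counts aggregated per hour of the day rather than per frame of a short clip.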