Human pose estimation is a fairly new but quickly evolving technology that plays a significant part in fitness and dance applications and, by letting us place digital content over the real world, in augmented reality.
In short, human pose estimation is a computer-vision technology that detects and processes human posture. Its central component is human body modeling. Three body models are most prominent in current human pose estimation systems: skeleton-based, contour-based, and volume-based.
Skeleton-based model
This model consists of a set of joints (keypoints), such as the knees, ankles, wrists, elbows, and shoulders, together with the orientation of the body's limbs. It is notable for its flexibility and is therefore suitable for both 2-dimensional and 3-dimensional human pose estimation. With 3-dimensional modeling, the solution takes an RGB image and finds the X, Y, and Z coordinates of the joints; with 2-dimensional modeling, the same analysis yields only the X and Y coordinates.
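The skeleton-based representation can be made concrete with a small data structure. This is an illustrative sketch, not any library's actual API; the field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Keypoint:
    """A single body joint. z stays None for 2D estimation."""
    name: str
    x: float
    y: float
    z: Optional[float] = None

# A 2D detection carries only X and Y; a 3D estimate adds Z.
knee_2d = Keypoint("left_knee", x=0.42, y=0.71)
knee_3d = Keypoint("left_knee", x=0.42, y=0.71, z=-0.15)
```

A full skeleton is then just a list of such keypoints, one per joint, plus the limb connections between them.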
Contour-based model
This model uses the contours of the body's torso and limbs, as well as their rough width. Here, the solution takes the silhouette of the body frame and renders body parts as rectangles and boundaries within that framework.
Volume-based model
This model generally uses a series of 3D scans to capture the shape of the body and converts it into a framework of shapes and geometric meshes, which together form a 3D series of poses and body representations.
How 3D Human Pose Estimation Works
Fitness applications tend to rely on 3-dimensional human pose estimation. For these apps, the more information on the human pose, the better. With this technique, the user of the app will record themselves participating in an exercise or workout routine. The app will then analyze the user’s body movements, offering corrections for mistakes or inaccuracies.
This type of app’s flowchart typically follows this pattern:
- First, gather data on the user’s movements while they perform the exercise.
- Next, determine how correct or incorrect the user’s movements were.
- Finally, show the user via the interface what mistakes they may have made.
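The three steps above can be sketched in code. Everything here is a hypothetical stand-in, not a real fitness app's logic: step 1 (gathering pose data) would come from the pose estimator, so we fake its output as per-joint angles, and the tolerance threshold is invented for illustration:

```python
# Hypothetical sketch of the app flow: gather pose data, score it
# against a reference execution, and report mistakes to the user.

def find_mistakes(user_angles, reference_angles, tolerance_deg=15.0):
    """Steps 2-3: compare measured joint angles to a reference and
    return human-readable feedback for the interface."""
    feedback = []
    for joint, ref in reference_angles.items():
        measured = user_angles.get(joint)
        if measured is None:
            feedback.append(f"{joint}: not detected")
        elif abs(measured - ref) > tolerance_deg:
            feedback.append(f"{joint}: off by {abs(measured - ref):.0f} degrees")
    return feedback

# Step 1 would come from the pose estimator; here we fake its output.
reference = {"left_knee": 90.0, "left_hip": 100.0}
user = {"left_knee": 112.0, "left_hip": 104.0}
print(find_mistakes(user, reference))  # ['left_knee: off by 22 degrees']
```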
Right now, the de facto standard in human pose technology is the COCO topology, which consists of 17 landmarks across the body, from the face to the arms and legs. Note that COCO is not the only human body pose framework, merely the most commonly used one.
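For reference, these are the 17 landmarks of the COCO keypoint topology, in their standard annotation order:

```python
# The 17 landmarks of the COCO keypoint topology.
COCO_KEYPOINTS = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]
assert len(COCO_KEYPOINTS) == 17
```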
This process typically uses deep learning to extract the joints when estimating the user's pose, then applies geometry-based algorithms to make sense of what it has found, i.e., to analyze the relative positions of the detected joints. When the source data is video, the system can draw its keypoints from a series of frames rather than a single image. The result is a far more accurate rendering of the user's real movements, since information from adjacent frames helps resolve any uncertainty about the position of the human body in the current frame.
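A typical geometry-based step is computing the angle at a joint from the detected keypoints, for example the knee angle from the hip, knee, and ankle positions. A minimal sketch using plain 2D vector math:

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees), formed by segments b->a and b->c.
    Each point is an (x, y) keypoint from the 2D detector."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Knee angle from hip, knee, and ankle keypoints: a straight leg gives 180.
hip, knee, ankle = (0.5, 0.2), (0.5, 0.5), (0.5, 0.8)
print(joint_angle(hip, knee, ankle))  # 180.0
```

An app can compare such angles against a reference execution of the exercise to decide whether the movement was performed correctly.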
Of the current techniques for using 3D pose estimation in fitness applications, the most accurate approach is first to apply a model that detects 2D keypoints, and then to process those 2D detections with a second model that converts them into 3D keypoint predictions.
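The data flow of this two-stage approach can be sketched as follows. Both models here are stand-ins with placeholder outputs; a real system would run a CNN 2D detector and a learned 2D-to-3D lifter:

```python
# Data-flow sketch of the two-stage approach; both stages are stand-ins.

def detect_2d(frame):
    """Stage 1 stand-in: return 17 (x, y) keypoints for one frame."""
    return [(0.5, 0.5)] * 17  # placeholder coordinates

def lift_to_3d(window_2d):
    """Stage 2 stand-in: convert a temporal window of 2D poses into
    one 3D pose for the window's center frame."""
    center = window_2d[len(window_2d) // 2]
    return [(x, y, 0.0) for (x, y) in center]  # z here is a placeholder

frames = [f"frame{i}" for i in range(9)]   # e.g. 9 video frames
poses_2d = [detect_2d(f) for f in frames]  # stage 1: per frame
pose_3d = lift_to_3d(poses_2d)             # stage 2: over the whole window
assert len(pose_3d) == 17 and len(pose_3d[0]) == 3
```

The point of the sketch is the interface: stage 2 consumes a sequence of 2D poses, not a single frame, which is what makes temporal models like the one below possible.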
In the research we posted recently, a single video source was used, and convolutional neural networks with dilated temporal convolutions were applied to perform the 2D-to-3D keypoint conversion.
After analyzing the models currently available, we determined that VideoPose3D is the solution best tailored to the needs of most AI-driven fitness applications. In this system, a model pre-trained on the COCO 2017 dataset serves as the 2D detector, producing the set of 2D keypoints that is fed into the 3D model.
For the most precise prediction of the position of a given joint or keypoint, VideoPose3D can use 2D pose information from multiple frames across a short window of time.
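The dilated temporal convolutions mentioned above are what let the model see many frames at once: each layer widens the temporal receptive field by (kernel size − 1) × dilation. A small calculation, assuming a VideoPose3D-style stack of width-3 filters with dilations growing as powers of three:

```python
def receptive_field(kernel_size, dilations):
    """Number of input frames visible to one output of a stack of
    1D dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Width-3 filters with dilations 1, 3, 9, 27, 81:
print(receptive_field(3, [1, 3, 9, 27, 81]))  # 243 frames per 3D prediction
```

So a five-layer stack already lets a single 3D prediction draw on several seconds of video.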
To further boost the accuracy of 3D pose estimation, more than one camera can capture alternate viewpoints of the user performing the same exercise or routine. Note, however, that handling multiple video stream inputs requires greater processing power as well as a specialized model architecture.
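The geometric core of multi-camera fusion is triangulation: each calibrated camera turns its 2D detection into a 3D ray, and the keypoint is placed where the rays (nearly) meet. A minimal pure-geometry sketch, taking the midpoint of the shortest segment between two rays; a real system would work from full camera calibration matrices:

```python
def triangulate(p1, d1, p2, d2):
    """Midpoint of the shortest segment between rays p1+s*d1 and p2+t*d2.
    Assumes the rays are not parallel."""
    sub = lambda a, b: tuple(x - y for x, y in zip(a, b))
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w0 = sub(p1, p2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b  # zero only for parallel rays
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1 = tuple(p + s * v for p, v in zip(p1, d1))
    q2 = tuple(p + t * v for p, v in zip(p2, d2))
    return tuple((x + y) / 2 for x, y in zip(q1, q2))

# Two camera rays that actually meet at (1, 1, 1):
print(triangulate((0, 0, 0), (1, 1, 1), (2, 0, 0), (-1, 1, 1)))  # (1.0, 1.0, 1.0)
```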
Recently, Google unveiled BlazePose, a model for estimating human pose on mobile devices that increases the number of keypoints analyzed to 33, a superset of the COCO keypoint set and of two other topologies, BlazePalm and BlazeFace. As a result, the BlazePose model can produce pose predictions consistent with hand models and face models by articulating body semantics.
Each component within a machine-learning-based human pose estimation system needs to be fast, taking a maximum of a couple of milliseconds per frame for pose detection and tracking models.
Because the BlazePose pipeline (which includes pose estimation and tracking components) has to operate on a variety of mobile devices in real time, each individual part of the pipeline is designed to be very computationally efficient and to run at 200-1,000 FPS.
Pose estimation and tracking in video, where it is not known whether or where a person is present, is typically done in two stages.
In the first stage, an object detection model is run to determine whether a person is present and, if so, where. Once the person has been detected, the pose estimation module can process the localized area containing them and predict the positions of the keypoints.
A downside of this setup is that both the object detection and pose estimation modules must run for every frame, which consumes extra computational resources. The authors of BlazePose, however, devised a clever way around this issue, one they also employ in other keypoint detection modules such as FaceMesh and MediaPipe Hand.
The idea is to use an object detection module (a face detector, in the case of BlazePose) only to kickstart pose tracking on the first frame; subsequent tracking of the person can be done using the pose predictions alone, after a pose alignment whose parameters are predicted by the pose estimation model.
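The detector-to-tracker handoff can be sketched as a simple loop. All functions and the confidence threshold are illustrative stand-ins, not BlazePose's actual API; the point is that the detector runs only when there is no pose to track:

```python
# Hedged sketch of the handoff: detect once, then track from pose to pose.

CONFIDENCE_THRESHOLD = 0.5  # illustrative value

def run_pipeline(frames, detect_person, estimate_pose):
    poses, region, detector_calls = [], None, 0
    for frame in frames:
        if region is None:                 # first frame, or tracking lost:
            region = detect_person(frame)  # fall back to the detector
            detector_calls += 1
        pose, confidence = estimate_pose(frame, region)
        # The predicted pose itself seeds where to look on the next frame.
        region = pose if confidence >= CONFIDENCE_THRESHOLD else None
        poses.append(pose)
    return poses, detector_calls

# Stand-in models: detection succeeds and tracking never loses the person,
# so the detector runs exactly once over the whole 100-frame clip.
poses, calls = run_pipeline(
    frames=range(100),
    detect_person=lambda f: "full-frame box",
    estimate_pose=lambda f, r: ("pose", 0.9),
)
print(calls)  # 1
```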
For the neural network, the face produces the strongest signal about the torso's position, thanks to the relatively small variance in its appearance and the high contrast of its features. Consequently, it is possible to build a fast, low-overhead pose detection system on a justifiable assumption: that the human head will be visible in every personal use case.
Overcoming Challenges of Human Pose Estimation
Using pose estimation in fitness apps runs up against the sheer range of human poses, for instance, the hundreds of asanas in most yoga regimens.
Further, the body will sometimes block certain limbs from any given camera's view, and users may wear varied outfits that obscure body features and personal appearance.
When using any pre-trained model, note that unusual body movements or odd camera angles can lead to errors in human pose estimation. We can mitigate this problem to a certain extent by using synthetic data rendered from a 3D human body model, or by fine-tuning with data specific to the domain in question.
The good news is that the majority of these weaknesses can be avoided or mitigated. The key is choosing the right training data and model architecture. Moreover, the direction of development in human pose estimation suggests that some of the issues we face now will be less relevant in the coming years.
The final word
Human pose estimation holds a variety of potential uses beyond fitness apps and tracking human movements, from gaming to animation to augmented reality to robotics. That is not a full list of the possibilities, but it highlights some of the areas where human pose estimation is most likely to contribute to our digital landscape.