Worker Posture Analysis
Evaluating computer vision-based markerless motion capture systems for real-world worker posture analysis and ergonomic assessment.
Evaluation of Computer Vision-Based Markerless Motion Capture for Worker Posture Assessment
Wang J., Guzowski T., Barnett A., Fethke N., Baek S.

Figure: Multi-camera pose estimation comparison
Overview
Can markerless motion capture replace traditional marker-based systems for workplace ergonomic assessment?
Challenges
1. Limited accuracy of pose estimation in real-world conditions
2. Lack of ground truth data for industrial settings
3. Varying camera configurations and viewpoints
4. Occlusions and complex movements in workplace environments
Methodology
The study compared three camera configurations against a gold-standard 10-camera optical motion capture system in a controlled reaching task with 46 participants.
Participants
Twenty-two males (age: 34.8 ± 15.5 years, height: 1.79 ± 0.08 m) and 24 females (age: 31.5 ± 10.9 years, height: 1.64 ± 0.07 m) were recruited from the University of Iowa community. All were at least 18 years of age and had no recent upper extremity injury or pain.
Experimental Procedure
Participants performed a repetitive reaching task with the dominant arm, placing a pin into each of 16 holes spaced at 22.5° intervals around a ring. Posture variation was defined as "low" (47 cm ring at 50% reach, waist height) or "high" (85–100 cm ring at 75% reach, knee height). Two trials per condition were performed in randomized sequences.
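The ring geometry can be sketched in a few lines. This is an illustrative sketch only: the function name `ring_hole_positions`, the coordinate frame, and the example radius (0.235 m, which assumes the 47 cm figure is a diameter) are assumptions, not details from the study.

```python
import math

def ring_hole_positions(n_holes, radius_m):
    """Return (angle_deg, x_m, y_m) for holes spaced evenly around a ring.

    The ring is centered at the origin of a hypothetical 2D frame;
    hole 0 sits at 0 degrees on the positive x-axis.
    """
    step_deg = 360.0 / n_holes  # 16 holes -> 22.5 degrees apart
    positions = []
    for i in range(n_holes):
        angle = i * step_deg
        x = radius_m * math.cos(math.radians(angle))
        y = radius_m * math.sin(math.radians(angle))
        positions.append((angle, x, y))
    return positions

# Example radius is an assumption (47 cm read as a diameter).
holes = ring_hole_positions(n_holes=16, radius_m=0.235)
```

Sixteen holes at 22.5° intervals close the full circle (16 × 22.5° = 360°), so one pass around the ring takes the arm through every direction in the plane of the ring.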
Camera Systems
Three configurations were evaluated: (1) monocular: MediaPipe (v0.10.10) applied to a Zed 2i camera for 3D measurement; (2) stereo: Stereolabs' proprietary 3D body tracking (SDK v3.8.2) on the Zed 2i; (3) monocular+depth: MediaPipe applied to an Intel RealSense D455 for 2D coordinates augmented with depth. A 10-camera OptiTrack Flex 13 system sampling at 120 Hz provided the reference data. Cameras were positioned at 45° angles on the dominant and non-dominant sides, at a distance of 2.4 m.
Data Processing
Joint angles for the trunk, shoulder, elbow, and knee were calculated using the cosine rule from the key points generated by each system. Motion capture data were low-pass filtered (4th-order Butterworth, 6 Hz cutoff) and down-sampled to 30 Hz. Synchronization was achieved by manually labeling start and end frames, then refining the alignment by minimizing RMSE.
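The processing steps can be illustrated with a minimal sketch using only the Python standard library. The function names, the key point layout, and the lag-search form of the RMSE refinement are assumptions made for illustration; the study's actual implementation is not described at this level of detail. The Butterworth filter is omitted here (in practice a library such as SciPy would typically be used for it).

```python
import math

def joint_angle_deg(a, b, c):
    """Angle at vertex b (degrees) between segments b->a and b->c,
    computed with the cosine rule from three 3D key points."""
    ab = math.dist(a, b)
    cb = math.dist(c, b)
    ac = math.dist(a, c)
    if ab == 0 or cb == 0:
        raise ValueError("degenerate segment: coincident key points")
    cos_theta = (ab**2 + cb**2 - ac**2) / (2 * ab * cb)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))

def downsample(samples, factor):
    """Keep every `factor`-th sample, e.g. 120 Hz -> 30 Hz with factor 4.
    Assumes the signal was already low-pass filtered."""
    return samples[::factor]

def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)) / len(x))

def best_lag(reference, estimate, max_lag):
    """Hypothetical refinement step: try integer frame shifts around a
    manually labeled start and return (lag, rmse) with the lowest RMSE
    over the overlapping frames."""
    best = None
    for lag in range(-max_lag, max_lag + 1):
        ref = reference[lag:] if lag >= 0 else reference
        est = estimate if lag >= 0 else estimate[-lag:]
        n = min(len(ref), len(est))
        if n == 0:
            continue
        err = rmse(ref[:n], est[:n])
        if best is None or err < best[1]:
            best = (lag, err)
    return best

# Elbow-like example: shoulder, elbow, wrist forming a right angle.
elbow = joint_angle_deg((1, 0, 0), (0, 0, 0), (0, 1, 0))
lag, err = best_lag([0, 0, 0, 1, 2, 3, 2, 1, 0, 0], [1, 2, 3, 2, 1, 0, 0], 4)
```

In the example, the three points form a right angle at the middle key point, and the estimated trace matches the reference exactly once it is shifted by three frames. A real pipeline would likely also enforce a minimum overlap length so that very short overlaps cannot win the search by chance.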
Approach
1. Compared multiple state-of-the-art pose estimation algorithms
2. Evaluated single-camera vs multi-camera configurations
3. Tested in both controlled lab and real workplace settings
4. Developed metrics for biomechanical accuracy assessment
Results & Demos

Figure: Real-world workplace analysis

Figure: Posture tracking results
Findings
Analysis of 3,832 observations revealed significant effects of camera configuration, posture variation, and viewpoint on joint angle measurement accuracy.
Overall Error
Across all observations, the median RMSE was 15.07° (IQR: 9.63°–26.49°). All main effects (posture variation, camera configuration, view) and their two-way interactions were statistically significant (p < 0.01 for all effects).
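As a sketch of how such a summary can be produced, the snippet below computes one RMSE per observation and then the median and interquartile range across observations. The function names and the quartile method (`"inclusive"`) are assumptions; the source does not say which quantile definition was used, and the numbers in the example are made up for illustration.

```python
import math
import statistics

def rmse_deg(estimated, reference):
    """RMSE (degrees) between an estimated and a reference joint-angle trace."""
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))

def median_iqr(values):
    """Return (median, (q1, q3)) for a list of per-observation RMSE values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)

# Illustrative values only, not data from the study.
one_observation = rmse_deg([10.0, 20.0, 30.0], [12.0, 18.0, 30.0])
summary = median_iqr([1.0, 2.0, 3.0, 4.0, 5.0])
```

Reporting the median and IQR rather than the mean and standard deviation is a sensible choice when the error distribution is right-skewed, as the wide upper quartile here suggests.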
RMSE by Camera Configuration
The monocular configuration produced the lowest median RMSE, while the monocular+depth configuration showed the highest. Stereo was comparable to monocular in median RMSE, though its wider IQR indicates a larger spread of errors.
| Configuration | Median RMSE | IQR |
|---|---|---|
| Monocular (MediaPipe) | 12.96° | 8.72°–19.70° |
| Stereo (Zed 2i) | 13.33° | 8.80°–26.17° |
| Monocular+Depth (RealSense) | 21.07° | 13.11°–33.48° |
Posture Variation Effects
Higher posture variation increased median RMSE from 11.83° (low) to 17.54° (high) across all configurations. The difference was smallest for monocular (~2°), followed by stereo (~7°) and monocular+depth (~12°). Mean-adjusted analysis revealed a three-fold increase in effect size, indicating camera-specific baselines had masked the true impact of movement complexity.
Viewpoint Effects
The dominant-side camera view produced a median RMSE of 13.27° versus 16.67° for the non-dominant view. The non-dominant view increased RMSE by ~4° for monocular and ~9° for monocular+depth, but less than 0.2° for stereo.
Joint-Specific Error
Joint type was the largest source of error variability. After mean adjustment, the hierarchy was: elbow (24.82°) > trunk (9.51°) > shoulder (7.62°) > knee (6.45°). Camera configuration showed a modest decrease in effect size with mean adjustment, suggesting ~11% of apparent performance differences were attributable to systematic biases.
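The source does not spell out how the mean adjustment was computed. One common reading, assumed here, is that the constant offset (bias) between the estimated and reference angle traces is removed before the RMSE is taken. Under that assumption, the sketch below separates an error trace into its systematic and residual parts; the function name and the example values are hypothetical.

```python
import math

def error_components(estimated, reference):
    """Return (bias, rmse, mean_adjusted_rmse) in degrees.

    bias               : mean signed error (systematic offset)
    rmse               : RMSE of the raw errors
    mean_adjusted_rmse : RMSE after subtracting the bias from every error
    """
    errors = [e - r for e, r in zip(estimated, reference)]
    n = len(errors)
    bias = sum(errors) / n
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    adjusted = math.sqrt(sum((d - bias) ** 2 for d in errors) / n)
    return bias, rmse, adjusted

# Illustrative values only: a trace that reads about 5 degrees too high.
bias, rmse, adjusted = error_components([15.0, 17.0, 13.0, 15.0],
                                        [10.0, 10.0, 10.0, 10.0])
```

The three quantities are linked by the identity rmse² = bias² + adjusted², so a large gap between the raw and mean-adjusted RMSE points to a systematic offset rather than frame-to-frame tracking noise.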
Key Outcomes
- Identified optimal camera configurations for workplace deployment
- Quantified accuracy limitations of current pose estimation methods
- Established guidelines for practical implementation
- Published findings in peer-reviewed venues
Discussion
The results provide practical guidance for deploying computer vision-based posture assessment in occupational settings.
Monocular as Default
Despite the intuitive expectation that stereo or depth-augmented systems would outperform monocular approaches, the monocular MediaPipe configuration produced the lowest overall error. This suggests that mature 2D-to-3D lifting algorithms can compensate for the absence of direct depth measurement, at least for the joint angles and movement patterns evaluated in this study.
Camera Placement Matters
The dominant-side oblique view produced lower errors than the non-dominant view, although the size of the benefit depended on the configuration. This finding has direct implications for workplace deployment: positioning the camera on the worker's active side at a 45° angle gives the best joint angle estimates, particularly for the MediaPipe-based monocular and monocular+depth configurations, which were far more sensitive to viewpoint (~4° and ~9°, respectively) than stereo (less than 0.2°).
Practical Deployment Considerations
Off-the-shelf computer vision systems can support occupational posture assessment when deployed with appropriate expectations. The overall median RMSE of roughly 15° is within acceptable ranges for many ergonomic screening applications, though joint-specific limitations (particularly at the elbow) should be considered. Movement complexity significantly affects accuracy, suggesting that dynamic, high-variation tasks may require more robust tracking solutions or multi-camera setups.