
free keypoint position measurements, and grows only
slightly faster than linearly with the level of keypoint
noise. For measurement error of 1 mm, joint angle estimation error on the order of 0.21° can be expected.
This suggests that the proposed method, although cer-
tainly not as accurate as dedicated angular encoders
on the joints, could be used for configuration estima-
tion for the purposes of controlling the mechanism.
Ultimately, the most important question is how
measurement noise in the 3D positions of keypoints
affects the performance of a PBVS controller using
the proposed OBIK scheme for configuration esti-
mation. To investigate this, a proportional-derivative
(PD) controller was applied to the 2-DOF mechanism
with the objective of moving it from initial configura-
tion (π/4, π/4) to goal configuration (−π/4,−π/4).
The PD controllers for the two joints were independent, with proportional gains Kp1 = Kp2 = 1 and derivative gains Kd1 = 1 and Kd2 = 0.5 for links 1 and 2, respectively. The keypoint positions corresponding
to the goal configuration were used as reference for
the OBIK method, meaning that the PBVS PD con-
troller was essentially trying to bring the estimated
joint configuration to the origin, subject to measure-
ment noise in the keypoint positions at each control
step. Joint angle trajectories for the true angles of
both joints, as recorded by the physics engine, are
shown in Fig. 7 for several levels of measurement
noise. The controller reaches the setpoint reliably and
smoothly even for significant noise, as high as 20 mm.
For noise on the order of 5 mm, which is already more than what is typical of modern RGB-D cameras, even in the depth dimension, the joint trajectories are virtually indistinguishable from those obtained with no measurement noise.
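The control loop described above can be sketched as follows. This is a simplified illustration, not the paper's exact setup: it assumes decoupled, unit-inertia joint dynamics (the real 2-DOF mechanism has coupled dynamics simulated by a physics engine), and it adds Gaussian noise directly to the joint angle estimates as a stand-in for the OBIK estimate computed from noisy keypoint positions.

```python
import numpy as np

def simulate(noise_std_rad=0.0, dt=0.01, steps=2000):
    """Independent PD control of a toy 2-DOF system with noisy state estimates."""
    kp = np.array([1.0, 1.0])                 # proportional gains Kp1 = Kp2 = 1
    kd = np.array([1.0, 0.5])                 # derivative gains Kd1 = 1, Kd2 = 0.5
    q = np.array([np.pi / 4, np.pi / 4])      # initial configuration
    q_goal = np.array([-np.pi / 4, -np.pi / 4])
    qd = np.zeros(2)
    rng = np.random.default_rng(0)
    for _ in range(steps):
        # noisy configuration estimate (plays the role of the OBIK output)
        q_est = q + rng.normal(0.0, noise_std_rad, size=2)
        # PD law driving the estimated error to zero
        tau = -kp * (q_est - q_goal) - kd * qd
        qd += tau * dt                        # unit inertia: tau = q''
        q += qd * dt                          # semi-implicit Euler step
    return q

print(simulate(noise_std_rad=0.0))            # settles near the goal configuration
```

With the stated gains, joint 1 is more heavily damped than joint 2, so joint 2 shows more oscillation before settling, mirroring the qualitative behavior reported for the trajectories in Fig. 7.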
6 CONCLUSION AND FUTURE WORK
We introduced a method for learning compact repre-
sentations of the configuration of articulated mecha-
nisms from sequences of keypoint positions tracked
in camera images. The approach relies on analyz-
ing temporal variations in pairwise distances between
keypoints to statistically determine which ones satisfy
the RBA, thereby identifying groups that belong to
the same rigid body. By examining the rank of matri-
ces that capture the translational and rotational com-
ponents of estimated poses over time, the algorithm
infers the kinematic chain and the types of joints in it.
The constructed configuration vector is as compact as
that of the actual joint positions, effectively function-
ing as a joint observer without requiring prior knowl-
edge of the mechanism’s kinematics or appearance.
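The rigid-body grouping step can be illustrated with a small sketch. This is an assumption-laden toy version, not the paper's statistical procedure: it thresholds the temporal standard deviation of pairwise keypoint distances with a hypothetical tolerance `tol`, and merges keypoints that pass the rigidity test using union-find.

```python
import numpy as np

def rigid_groups(tracks, tol=1e-3):
    """Group keypoints into rigid bodies.

    tracks: (T, N, 3) array of N keypoint positions over T frames.
    Returns a list of index groups, one per detected rigid body.
    """
    T, N, _ = tracks.shape
    # pairwise distances in every frame, then their std over time
    d = np.linalg.norm(tracks[:, :, None, :] - tracks[:, None, :, :], axis=-1)
    rigid = d.std(axis=0) < tol               # (N, N) boolean rigidity test

    # union-find: merge keypoints connected by a rigid link
    parent = list(range(N))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path compression
            i = parent[i]
        return i
    for i in range(N):
        for j in range(i + 1, N):
            if rigid[i, j]:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(N):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

For example, two keypoints fixed to one link and two fixed to another, moving link will be separated into two groups, since only the within-link distances stay constant over time.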
In future work, we aim to apply this observer to
real-time monitoring and control of robotic systems
and other articulated mechanisms. We also plan to in-
vestigate its robustness to noise and keypoint tracking
errors, as in a real environment, changes in illumina-
tion and color, as well as lack of texture can lead to
false matches and imprecise measurements.
Observation-Based Inverse Kinematics for Visual Servo Control