
To address current limitations, including the reliance on predefined voice commands and the lack of environmental awareness, future work will focus on integrating semantic mapping frameworks (Rollo et al., 2023a) to enable contextual understanding and support advanced loco-manipulation skills (Rollo et al., 2024a). In addition, incorporating natural language understanding techniques is expected to enhance communication flexibility and user intuitiveness. These developments aim to evolve the framework into a more scalable, autonomous, and comprehensive HRI solution for multi-robot collaboration in complex and dynamic environments.
REFERENCES

Amadio, F., Donoso, C., Totsila, D., Lorenzo, R., Rouxel, Q., Rochel, O., Hoffman, E. M., Mouret, J.-B., and Ivaldi, S. (2024). From vocal instructions to household tasks: The Inria TIAGo++ in the euROBIN service robots coopetition. arXiv preprint arXiv:2412.17861.

Belpaeme, T., Vogt, P., Van den Berghe, R., Bergmann, K., Göksun, T., De Haas, M., Kanero, J., Kennedy, J., Küntay, A. C., Oudgenoeg-Paz, O., et al. (2018). Guidelines for designing social robots as second language tutors. International Journal of Social Robotics, 10:325–341.

Budiharto, W., Cahyani, A. D., Rumondor, P. C., and Suhartono, D. (2017). EduRobot: Intelligent humanoid robot with natural interaction for education and entertainment. Procedia Computer Science, 116:564–570.

Carr, C., Wang, P., and Wang, S. (2023). A human-friendly verbal communication platform for multi-robot systems: Design and principles. In UK Workshop on Computational Intelligence, pages 580–594. Springer.

Dahiya, A., Aroyo, A. M., Dautenhahn, K., and Smith, S. L. (2023). A survey of multi-agent human–robot interaction systems. Robotics and Autonomous Systems, 161:104335.

Del Bianco, E., Torielli, D., Rollo, F., Gasperini, D., Laurenzi, A., Baccelliere, L., Muratore, L., Roveri, M., and Tsagarakis, N. G. (2024). A high-force gripper with embedded multimodal sensing for powerful and perception driven grasping. In 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids), pages 149–156. IEEE.

Heppner, G., Oberacker, D., Roennau, A., and Dillmann, R. (2024). Behavior tree capabilities for dynamic multi-robot task allocation with heterogeneous robot teams. arXiv preprint arXiv:2402.02833.

Kumatani, K., Arakawa, T., Yamamoto, K., McDonough, J., Raj, B., Singh, R., and Tashev, I. (2012). Microphone array processing for distant speech recognition: Towards real-world deployment. In Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, pages 1–10. IEEE.

Marin Vargas, A., Cominelli, L., Dell'Orletta, F., and Scilingo, E. P. (2021). Verbal communication in robotics: A study on salient terms, research fields and trends in the last decades based on a computational linguistic analysis. Frontiers in Computer Science, 2:591164.

Muratore, L., Laurenzi, A., De Luca, A., Bertoni, L., Torielli, D., Baccelliere, L., Del Bianco, E., and Tsagarakis, N. G. (2023). A unified multimodal interface for the RELAX high-payload collaborative robot. Sensors, 23(18):7735.

Padmanabha, A., Yuan, J., Gupta, J., Karachiwalla, Z., Majidi, C., Admoni, H., and Erickson, Z. (2024). VoicePilot: Harnessing LLMs as speech interfaces for physically assistive robots. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pages 1–18.

Papavasileiou, A., Nikoladakis, S., Basamakis, F. P., Aivaliotis, S., Michalos, G., and Makris, S. (2024). A voice-enabled ROS2 framework for human–robot collaborative inspection. Applied Sciences, 14(10):4138.

Pyo, S., Lee, J., Bae, K., Sim, S., and Kim, J. (2021). Recent progress in flexible tactile sensors for human-interactive systems: From sensors to advanced applications. Advanced Materials, 33(47):2005902.

Rizk, Y., Awad, M., and Tunstel, E. W. (2019). Cooperative heterogeneous multi-robot systems: A survey. ACM Computing Surveys (CSUR), 52(2):1–31.

Rogowski, A. (2022). Scenario-based programming of voice-controlled medical robotic systems. Sensors, 22(23):9520.

Rollo, F., Raiola, G., Tsagarakis, N., Roveri, M., Hoffman, E. M., and Ajoudani, A. (2024a). Semantic-based loco-manipulation for human-robot collaboration in industrial environments. In European Robotics Forum 2024, pages 55–59. Springer Nature Switzerland.

Rollo, F., Raiola, G., Zunino, A., Tsagarakis, N., and Ajoudani, A. (2023a). Artifacts mapping: Multi-modal semantic mapping for object detection and 3D localization. In 2023 European Conference on Mobile Robots (ECMR), pages 1–8. IEEE.

Rollo, F., Zunino, A., Raiola, G., Amadio, F., Ajoudani, A., and Tsagarakis, N. (2023b). FollowMe: A robust person following framework based on visual re-identification and gestures. In 2023 IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO), pages 84–89. IEEE.

Rollo, F., Zunino, A., Tsagarakis, N., Hoffman, E. M., and Ajoudani, A. (2024b). Continuous adaptation in person re-identification for robotic assistance. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 425–431. IEEE.

Stone, P. and Veloso, M. (2000). Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8:345–383.

Su, H., Qi, W., Chen, J., Yang, C., Sandoval, J., and Laribi, M. A. (2023). Recent advancements in multimodal human–robot interaction. Frontiers in Neurorobotics, 17:1084000.
A Scalable Robot-Agnostic Voice Control Framework for Multi-Robot Systems