Trajectory Prediction in First-Person Video: Utilizing a Pre-Trained
Bird’s-Eye View Model
Masashi Hatano, Ryo Hachiuma and Hideo Saito
Graduate School of Science and Technology, Keio University, Yokohama, Japan
Keywords:
Trajectory Prediction, Egocentric Video.
Abstract:
In recent years, much attention has been paid to the prediction of pedestrian trajectories, as it is a key component of applications that benefit society, such as autonomous driving, guidance for blind people, and social robots interacting with humans. Many methods have been proposed to tackle this task, but few work from the first-person perspective because of the lack of a publicly available dataset. We therefore propose a method that uses egocentric vision and does not need to be trained on a first-person video dataset; it makes it possible to utilize existing methods that predict from a bird's-eye view. In addition, we propose a novel way of considering semantic information without changing the shape of the input, so that it applies to all existing bird's-eye-view methods that use only past trajectories. As a result, there is no need to create a new dataset from egocentric vision. The experimental results demonstrate that the proposed method makes it possible to predict from an egocentric view via existing bird's-eye-view methods, that it qualitatively improves trajectory predictions without degrading quantitative accuracy, and that predicting the trajectories of multiple people simultaneously is effective.
1 INTRODUCTION
Although future trajectory prediction is a challenging task, it is a key component of applications that benefit society, such as autonomous driving, guidance for blind people, and social robots interacting with humans. This task has attracted many researchers (Bhattacharyya
et al., 2019; Gupta et al., 2018; Sadeghian et al., 2019;
Deo and Trivedi, 2020; Lee et al., 2017; Liang et al.,
2020; Mangalam et al., 2020; Zhao et al., 2019). See
(Rudenko et al., 2020) for an overview of future tra-
jectory prediction. Various solutions from different
perspectives have been proposed; however, few have
tried to predict the future location of pedestrians from
egocentric vision because the field of egocentric fu-
ture prediction is very young and still developing, as
mentioned in (Rodin et al., 2021).
The existing methods for predicting pedestrian trajectories can roughly be divided into two categories: those that predict from a third-person or bird's-eye viewpoint (Alahi et al., 2016; Gupta et al., 2018; Zhang et al., 2019; Pang et al., 2021), and those that predict from a first-person perspective (Yagi et al., 2018; Qiu et al., 2021; Huynh and Alaghband, 2020). The former can accurately use past trajectories in the world coordinate system, the most important information for predicting future trajectories, as inputs, but it cannot exploit detailed information such as a pedestrian's posture. In addition, the former is, in practice, limited by the ubiquitousness and availability of surveillance cameras, which provide images from the third-person perspective. The latter can use detailed information such as human poses; however, models that predict trajectories from a first-person perspective must be trained with an egocentric vision dataset. To the best of our knowledge, there are few publicly available datasets for trajectory prediction from egocentric vision, as first-person video contains private information such as pedestrians' faces. In our work, we make it possible to apply existing bird's-eye-view methods to the egocentric view without creating a dataset or training a model.
Moreover, none of the previous work (Liang et al., 2019; Kosaraju et al., 2019; Sadeghian et al., 2019; Salzmann et al., 2020) tried to change the shape of the input to adapt to the model structure. Instead, researchers changed the structure of the model, as it is easier to extract information and integrate it with the prediction model. Nonetheless, this way of considering semantic information requires training with a dataset. Publicly available datasets exist for bird's-eye-view models; therefore, a
prediction model that considers segmentation can be obtained without a problem. In contrast, a model for egocentric vision faces a dataset problem. We address this issue by proposing a novel method of considering semantic information that transforms the boundary information, obtained from the first-person viewpoint image, into the same shape as the inputs of an existing model that predicts from a bird's-eye view. This makes training with a first-person dataset unnecessary, while qualitatively enhancing the prediction and maintaining quantitative accuracy.
To demonstrate the effectiveness of the proposed method, three main comparisons were conducted. The first compares the prediction accuracy with and without the homographic transformation, the second compares predictions with and without terrain information, and the third compares predictions with and without social pooling (SP). As the socially acceptable predictor for these comparisons, the pre-trained SocialGAN model (Gupta et al., 2018) was used.
This paper makes three main contributions:
- We make it possible to apply existing methods that take bird's-eye coordinates as input to forecasting from a first-person viewpoint, allowing us to benefit from the multi-person prediction of existing methods that consider human interaction.
- We propose a novel way of taking semantic information into account that applies to any socially acceptable trajectory predictor using only past trajectories, resulting in qualitatively improved trajectory predictions.
- We realize the proposed method without training or fine-tuning on an egocentric vision dataset.
2 RELATED WORK
Trajectory prediction can be used for a variety of purposes. One is autonomous driving: an autonomous vehicle must predict the trajectories of pedestrians to avoid colliding with them. Recently, researchers (Cai et al., 2022; Rasouli et al., 2019; Marchetti et al., 2020; Poibrenski et al., 2020; Makansi et al., 2020) have been tackling this issue, and most methods predict from a driving car's camera, as the goal is to apply them to intelligent driving. In this domain, a significant number of approaches have been introduced because many publicly available datasets exist. In contrast, in this paper we focus on predicting from a pedestrian's head-mounted camera, because the objective is application to assistive technologies and social robots that interact with humans. This task is much more challenging than its counterpart in autonomous driving because few public datasets are provided and head-mounted cameras are unsteady compared with driving cameras.
Turning to human trajectory prediction from the first-person viewpoint, few researchers (Yagi et al., 2018; Qiu et al., 2021; Huynh and Alaghband, 2020) have addressed this problem, which aims at socially acceptable and efficient navigation. Future person localization (FPL) (Yagi et al., 2018) was the first work to predict the future locations of people (other than the camera-wearer) in egocentric videos from wearable cameras. Then, indoor future person localization (IFPL) (Qiu et al., 2021), the latest work on this task, was introduced to adapt it to indoor scenes.
Both methods take detailed information such as hu-
man body pose into account and predict points of fu-
ture locations or the entire bounding box for pedestri-
ans. Training an end-to-end model for trajectory pre-
diction from an egocentric view is difficult due to the
lack of a dataset. Both sets of authors created datasets
and provide them on request, but these datasets are
less diverse in terms of scene diversity and scale than
existing datasets in areas such as automated driving.
The proposed method avoids these issues by performing a coordinate transformation using homography. This transformation enables us to use existing bird's-eye-view methods that rely only on past trajectories; therefore, there is no need to train a model with first-person videos, which raises privacy issues.
Many more researchers have been attracted to predicting from a third-person or bird's-eye viewpoint. These approaches can be divided into two types: those that use rich information such as a semantic map of the image (Liang et al., 2019; Kosaraju et al., 2019; Sadeghian et al., 2019; Salzmann et al., 2020) and those that use only the past trajectory (Gupta et al., 2018; Alahi et al., 2016; Pang et al., 2021; Amirian et al., 2019; Choi and Dariush, 2019; Zhang et al., 2019; Katyal et al., 2020). Most of the previous work uses LSTM-based (Hochreiter and Schmidhuber, 1997) or Generative Adversarial Network-based (Goodfellow et al., 2014) models. A third-person or bird's-eye viewpoint is helpful because it provides accurate past trajectories in world coordinates without information missing due to occlusion. Nevertheless, such methods are impractical given the limited availability and ubiquitousness of bird's-eye viewpoint videos. In contrast, although the proposed method builds on these third-person viewpoint predictors, it predicts from an egocentric view and is therefore practical. As for
the consideration of semantic information, we devel-
oped the novel idea of considering boundary informa-
tion.
3 METHOD
3.1 Problem Formulation
Let $p_i^t \in \mathbb{R}^2$ denote the position of pedestrian $i$ at time $t$ in a single image frame, where there are $n$ pedestrians in total. The past trajectory of pedestrian $i$ is $p_i = \{p_i^t,\ t = 1, 2, ..., t_{past}\}$, and $P = \{p_i,\ i = 1, 2, ..., n\}$ represents the past trajectories of all pedestrians in a scene. Similarly, the future trajectory of pedestrian $i$ at time $t$ is denoted as $q_i^t$; $q_i = \{q_i^t,\ t = t_{past}+1, ..., t_{pred}\}$ and $Q = \{q_i,\ i = 1, 2, ..., n\}$ represent the future trajectory of pedestrian $i$ and all future trajectories, respectively. In addition, $\hat{q}_i = \{\hat{q}_i^t,\ t = t_{past}+1, ..., t_{pred}\}$ and $\hat{Q} = \{\hat{q}_i,\ i = 1, 2, ..., n\}$ denote the estimated future trajectory of pedestrian $i$ and all estimated future trajectories in a scene, obtained with an existing socially acceptable trajectory prediction model $f_\theta$ that uses only the past trajectories of all pedestrians:

$$\hat{Q} = f_\theta(P), \qquad (1)$$

where $\theta$ denotes the parameters of a pre-trained trajectory prediction model. In this work, $\theta$ remains unchanged.

In addition, $e_i^t \in \mathbb{R}^2$ denotes the position of fictitious pedestrian $i$ (defined in Section 3.3) at time $t$, which is the position of an extracted point depicted in the blue rectangle in Figure 1. The past trajectory of fictitious pedestrian $i$ is $e_i = \{e_i^t,\ t = 1, 2, ..., t_{past}\}$, and $E = \{e_i,\ i = 1, 2, ..., m\}$ represents the past trajectories of all fictitious pedestrians in a scene, where there are $m$ extracted points in total. As the fictitious pedestrians are assumed to be at a standstill, the well-formed formula

$$\forall i,\ \exists c \in \mathbb{R}^2,\ \forall t,\quad e_i^t = c, \qquad (2)$$

holds, where $c$ is a two-dimensional bird's-eye coordinate. Then, the estimated future trajectories that consider semantic segmentation information are

$$\hat{Q} = f_\theta(P, E). \qquad (3)$$
In the following sections, we detail the coordinate transformation and the boundary extraction via a semantic segmentation map. As our objective is to utilize an existing pre-trained model without additional training, owing to the lack of first-person videos for trajectory prediction, the trajectory prediction model itself is not explained.
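To make Equations (1) and (3) concrete, the snippet below is a minimal sketch of how the inputs to a frozen, pre-trained predictor might be assembled; the array layout and the name pretrained_predictor are illustrative assumptions, not the interface of any particular model.

```python
import numpy as np

def build_inputs(P, boundary_points):
    """Assemble the inputs of Eq. (3) for a frozen predictor f_theta.

    P: past trajectories of the n real pedestrians, shape (n, t_past, 2).
    boundary_points: m extracted bird's-eye points; each becomes a
    fictitious pedestrian that stands still (Eq. (2)), so its trajectory
    is the same coordinate repeated over all t_past steps.
    """
    t_past = P.shape[1]
    E = np.repeat(np.asarray(boundary_points)[:, None, :], t_past, axis=1)
    return np.concatenate([P, E], axis=0)   # shape (n + m, t_past, 2)

# Q_hat = pretrained_predictor(build_inputs(P, boundary_points))  # Eq. (3)
```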
3.2 Coordinate Transformation
Transforming the coordinates is important in the proposed method because existing methods that use bird's-eye coordinates cannot be applied to inference from the first-person perspective without it, as shown in Section 4.2. One advantage of using methods that predict in the world coordinate system is that past trajectories are well taken into account.
Coordinate transformation is applied to the pedestrians' foot coordinates and to the points extracted using semantic segmentation, as shown in Figure 1. For each image frame, object detection (Carion et al., 2020; Redmon et al., 2016) is applied to detect the pedestrians in the image, and the center of the two bottom corners of each bounding box is regarded as the detected person's foot coordinate in the screen coordinate system.
After the screen coordinates are found, Equation 4 is
applied to transform the coordinates from an image
screen system to a world system:
$$[R|t]^{-1} K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}, \qquad (4)$$
where $K$ is the intrinsic matrix of the camera, $R$ is the rotation matrix, $t$ is the translation vector, $u$ and $v$ are the two-dimensional coordinates in the image, and $X$, $Y$, and $Z$ are the three-dimensional coordinates in the world. The height $Y$ in the world coordinate system is not needed as input, because we use a trajectory predictor that forecasts from a bird's-eye view.
By performing this transformation for each image
frame with the corresponding camera parameters, the
input data for trajectory prediction is prepared. In the
same way, the extracted points, which are detailed in
Section 3.3, are also transformed to bird’s-eye coordi-
nates. This is done every d frames, unlike for the detected pedestrians, as a single image provides boundary information within 15 m of the camera-wearer.
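As an illustration of this step, the sketch below back-projects a pixel onto the ground plane using the per-frame camera parameters. Intersecting the viewing ray with a known ground plane is an assumption introduced here to resolve the scale that Equation 4 leaves implicit, and the helper name screen_to_world is ours.

```python
import numpy as np

def screen_to_world(u, v, K, R, t, ground_y=0.0):
    """Back-project pixel (u, v) onto the ground plane Y = ground_y.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation
    (e.g., as provided by ARKit for each frame).
    """
    # Viewing ray of the pixel, expressed in world coordinates.
    ray = R.T @ np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Camera center in world coordinates.
    center = -R.T @ np.asarray(t).reshape(3)
    # Scale the ray so that it reaches the ground plane.
    s = (ground_y - center[1]) / ray[1]
    X, _, Z = center + s * ray
    # Only the horizontal components are needed for the bird's-eye input.
    return X, Z

# Foot coordinate of a detected pedestrian: center of the two bottom
# corners of its bounding box (x1, y1, x2, y2), transformed per frame.
# foot_xz = screen_to_world((x1 + x2) / 2, y2, K, R, t)
```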
3.3 Boundary Extraction
Normally, semantic segmentation (Xie et al., 2021; Yuan et al., 2020) is integrated with the prediction module when semantic information should be considered; in the proposed method, however, semantic information is obtained by converting the inference results of semantic segmentation into the same input shape as that of the trajectory predictor, as shown in Figure 2. This aims to prevent predicted trajectories from entering inappropriate areas, such as areas off the road or on a wall. To do so, we regard extracted boundary points, which separate appropriate and inappropriate areas, as standing pedestrians.
Figure 1: An overview of the proposed method for a single agent (pipeline: per-frame pedestrian detection, foot-point extraction as the center of the two bottom bounding-box corners, and coordinate transformation to the bird's-eye view; semantic segmentation, binarization, boundary extraction, and coordinate transformation every d frames; trajectory prediction; inverse coordinate transformation). The part surrounded by a red rectangle at the top left shows how the past human trajectories are transformed into bird's-eye coordinates. The part surrounded by a blue rectangle at the bottom left shows how the boundary information is transformed and compressed into bird's-eye coordinates. (SS stands for semantic segmentation.)
Figure 2: The concept of the proposed consideration of
boundary information. The boundary information is ob-
tained by regarding each extracted point as a standing ficti-
tious pedestrian.
We call them fictitious pedestrians because there are no actual pedestrians on the boundary in the real world. As the existing trajectory predictor we use accounts for collision avoidance between pedestrians, the predicted trajectories remain in the appropriate area once these fictitious standing pedestrians are added.
As can be seen in the blue rectangle in Figure 1, a first-person view image is passed through a semantic segmentation network and segmented into several classes, including roads and sidewalks. From the segmented image, a mask image indicating whether each class is a walkable area is generated; in our case, the road and sidewalk classes are walkable areas, and all other classes are unwalkable. Then, many points on the boundary between prohibited and permitted areas are extracted from the mask image, and each point is transformed into the world coordinate system and given a specific identification number. In this way, boundary information is transformed into the same shape as the input of simple trajectory predictors such as SocialGAN.
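A minimal sketch of this boundary-extraction step is given below, reusing the screen_to_world helper sketched earlier; the walkable class ids, the boundary detector (a morphological gradient), and the subsampling stride are illustrative assumptions.

```python
import cv2
import numpy as np

WALKABLE_IDS = (0, 1)   # assumed ids for the road and sidewalk classes

def extract_fictitious_pedestrians(seg_map, K, R, t, stride=20):
    """Convert a segmentation map into bird's-eye boundary points.

    seg_map: (H, W) class-id map from any semantic segmentation network.
    Each returned (X, Z) point is later treated as a standing fictitious
    pedestrian with its own identification number.
    """
    # Binary mask of the walkable area (road and sidewalk).
    mask = np.isin(seg_map, WALKABLE_IDS).astype(np.uint8)
    # Thin contour between walkable and prohibited regions.
    boundary = cv2.morphologyEx(mask, cv2.MORPH_GRADIENT,
                                np.ones((3, 3), np.uint8))
    vs, us = np.nonzero(boundary)
    points = [screen_to_world(u, v, K, R, t)            # to world coordinates
              for u, v in list(zip(us, vs))[::stride]]  # subsample the contour
    return points
```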
As the trajectory prediction method, we adopt the same setting as SocialGAN: 3.2 s of observation and the same amount of time for future prediction. One time step is defined as 0.4 s; therefore, there are eight time steps each for observation and prediction, so 16 time steps are needed to evaluate a predicted trajectory.
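Concretely, the temporal setup can be written as follows; the 30 fps capture rate is an assumption used only to illustrate how frames map to time steps.

```python
FPS = 30                                            # assumed camera frame rate
STEP_SECONDS = 0.4                                  # one time step
OBS_STEPS = PRED_STEPS = round(3.2 / STEP_SECONDS)  # 8 steps each
FRAMES_PER_STEP = round(FPS * STEP_SECONDS)         # sample every 12th frame
TOTAL_STEPS = OBS_STEPS + PRED_STEPS                # 16 steps per evaluated sequence
```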
4 EXPERIMENTS
4.1 Overview
To prove the effectiveness of the proposed method, we compared the prediction accuracy with and without coordinate transformation, with and without consideration of boundary information, and with and without consideration of social interaction. The first and third comparisons use two standard trajectory prediction metrics, the average displacement error (ADE) and the final displacement error (FDE), whereas the second comparison additionally uses the number of times a pedestrian is predicted to enter a prohibited area.
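For reference, a minimal sketch of the two displacement metrics, assuming predictions and ground truth are stored as arrays of bird's-eye coordinates, is given below.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement errors.

    pred, gt: arrays of shape (num_pedestrians, pred_steps, 2) holding
    bird's-eye coordinates over the prediction horizon.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)   # per-step Euclidean error
    return dist.mean(), dist[:, -1].mean()      # ADE, FDE
```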
For these experiments, we collected the dataset ourselves, using an application that provides image sequences and camera parameters via ARKit (https://developer.apple.com/documentation/arkit/).
Figure 3: Qualitative results of trajectory predictions with and without coordinate transformation.
The dataset has a variety of scenes in terms of the number of pedestrians and the type of prohibited area around them, such as bushes next to roads or building walls. It contains 10 different scenes and around 100 input sequences in total.
4.2 Necessity of Coordinate
Transformation
To accurately transform the coordinates from the screen system to the world system, we utilized the intrinsic and extrinsic matrices provided by ARKit. As shown in (Seiskari et al., 2022), the performance of ARKit is comparable to the state of the art; thus, it enables us to obtain precise camera parameters. However, even with precise intrinsic, rotation, and translation matrices, it is difficult to obtain exact world coordinates for points that are far from the camera-wearer: the transformation becomes noticeably less accurate when object points are more than 15 m away. Although this has a negative effect on the input data for trajectory prediction, the transformation is almost perfect for object points within 15 m of the camera.
For this comparison, only the two-dimensional input coordinates differ; the model and dataset are identical. Without coordinate transformation, the pedestrian's foot coordinate in the image frame is used directly as the input to the trajectory predictor. The results are summarized in Table 1, and qualitative results are shown in Figure 3.
As Table 1 indicates, the average and final errors are huge if the coordinate transformation is not applied. The trajectory prediction without coordinate transformation is inconsistent because the sequence of two-dimensional coordinates also lacks consistency, as can be seen in Figure 3: each coordinate in the image depends on where the camera-wearer is and which direction the camera faces. In this sense, the coordinate transformation from the screen coordinate system to the world coordinate system helps maintain consistency between image frames, as the world coordinates are independent of the camera's translation and rotation.
Figure 4: Qualitative results of trajectory predictions with and without the addition of fictitious pedestrians on the boundaries between appropriate and inappropriate areas.
4.3 Effectiveness of Distributing
Fictitious Pedestrians
Similar to the previous experiment, the comparison with and without boundary information was performed in the same setting except for the input of the trajectory predictor. Both inputs contain the same real pedestrians, but the former additionally contains the world coordinates of the fictitious pedestrians. The results are summarized in Table 1, which also reports how often pedestrians were predicted to penetrate prohibited areas (we call this percentage the "area error") in addition to the quantitative metrics; qualitative results are shown in Figure 4.
As can be seen in Table 1, the quantitative metrics ADE and FDE are neither improved nor degraded; however, the area error is substantially reduced when the boundary information is added to the input: the number of times pedestrians are forecast to be in an inappropriate area is halved.
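One possible way of computing the area error described above is sketched below; the binary walkable grid and the hypothetical world_to_cell helper that maps bird's-eye coordinates to grid indices are assumptions for illustration.

```python
import numpy as np

def area_error(pred_points, walkable_grid, world_to_cell):
    """Fraction of predicted positions that fall in a prohibited area.

    pred_points: (num_points, 2) predicted bird's-eye coordinates.
    walkable_grid: binary occupancy grid, 1 = walkable, 0 = prohibited,
        built from the extracted boundary / segmentation information.
    world_to_cell: hypothetical helper mapping (X, Z) to grid indices.
    """
    rows, cols = world_to_cell(pred_points)
    return float((walkable_grid[rows, cols] == 0).mean())
```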
4.4 Effectiveness of Predicting Multiple
Pedestrians Simultaneously
To show the effectiveness of considering multiple pedestrians at once, we compared the prediction model with and without social pooling. As can be seen in Table 1, ADE and FDE are improved when social interaction in a scene is considered at once. In addition, as shown in Figure 5, the model that considers social interaction improves trajectory prediction qualitatively: a collision can be observed if social pooling is not used. Social pooling was introduced several years ago; however, most previous first-person work (Yagi et al., 2018; Qiu et al., 2021) does not consider social interaction because the authors created their datasets on their own.
Table 1: Results for the comparison in terms of three metrics: ADE, FDE, and area error.

             Naive SocialGAN   SocialGAN w/o SP   SocialGAN (Gupta et al., 2018)   Ours
ADE          27.22             10.65              9.07                             8.93
FDE          50.08             13.12              9.34                             9.33
area error   -                 -                  0.417                            0.202
Table 2: Comparison of the techniques used by each method.

                            Naive SocialGAN   SocialGAN w/o SP   SocialGAN   Ours
Coordinate transformation                     ✓                  ✓           ✓
Social pooling              ✓                                    ✓           ✓
Boundary information                                                         ✓
Figure 5: Qualitative results for trajectory predictions with and without social pooling.
However, the annotation was done for only one pedestrian per scene, even when several pedestrians were present. In this sense, the proposed method, which utilizes existing bird's-eye-view methods, is much less expensive, as it realizes multi-pedestrian prediction without high annotation costs.
5 CONCLUSIONS
In this work, we present a method that takes egocentric video as input yet applies an existing predictor that operates in bird's-eye coordinates, thereby avoiding privacy issues and exploiting currently available, fairly accurate predictors. This approach addresses the drawbacks of the first- and third-person perspectives: the huge cost of creating an egocentric vision dataset and the limited ubiquitousness and availability of surveillance cameras, respectively. Moreover, we propose a novel method for considering boundary information, which halves the percentage of predictions in which pedestrians end up in an inappropriate area. In addition, our method can predict multiple pedestrians at once from the first-person perspective, leading to better prediction accuracy and the avoidance of collisions among pedestrians.
REFERENCES
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-
Fei, L., and Savarese, S. (2016). Social lstm: Human
trajectory prediction in crowded spaces. In 2016 IEEE
Conference on Computer Vision and Pattern Recogni-
tion.
Amirian, J., Hayet, J.-B., and Pettré, J. (2019). Social ways: Learning multi-modal distributions of pedestrian trajectories with gans. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
Bhattacharyya, A., Hanselmann, M., Fritz, M., Schiele, B.,
and Straehle, C.-N. (2019). Conditional flow vari-
ational autoencoders for structured sequence predic-
tion. arXiv.
Cai, Y., Dai, L., Wang, H., Chen, L., Li, Y., Sotelo, M. A.,
and Li, Z. (2022). Pedestrian motion trajectory pre-
diction in intelligent driving from far shot first-person
perspective video. IEEE Transactions on Intelligent
Transportation Systems, 23(6):5298–5313.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. In 2020 European Confer-
ence on Computer Vision.
Choi, C. and Dariush, B. (2019). Learning to infer relations
for future trajectory forecast. In 2019 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
Workshops.
Deo, N. and Trivedi, M. M. (2020). Trajectory forecasts
in unknown environments conditioned on grid-based
plans. arXiv.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. Advances
in neural information processing systems, 27.
Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., and Alahi,
A. (2018). Social gan: Socially acceptable trajecto-
ries with generative adversarial networks. In 2018
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Huynh, M. and Alaghband, G. (2020). AOL: adaptive on-
line learning for human trajectory prediction in dy-
namic video scenes. In 31st British Machine Vision
Conference 2020.
Katyal, K. D., Hager, G. D., and Huang, C.-M. (2020).
Intent-aware pedestrian prediction for adaptive crowd
navigation. In 2020 IEEE International Conference
on Robotics and Automation.
Kosaraju, V., Sadeghian, A., Martín-Martín, R., Reid, I., Rezatofighi, H., and Savarese, S. (2019). Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H. S.,
and Chandraker, M. (2017). Desire: Distant future
prediction in dynamic scenes with interacting agents.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition.
Liang, J., Jiang, L., and Hauptmann, A. (2020). Simaug:
Learning robust representations from simulation for
trajectory prediction. In 2020 European Conference
on Computer Vision, Cham. Springer International
Publishing.
Liang, J., Jiang, L., Niebles, J. C., Hauptmann, A. G., and
Fei-Fei, L. (2019). Peeking into the future: Predict-
ing future person activities and locations in videos. In
2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition.
Makansi, O., Cicek, O., Buchicchio, K., and Brox, T.
(2020). Multimodal future localization and emergence
prediction for objects in egocentric view with a reach-
ability prior. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition.
Mangalam, K., Girase, H., Agarwal, S., Lee, K.-H., Adeli,
E., Malik, J., and Gaidon, A. (2020). It is not the
journey but the destination: Endpoint conditioned tra-
jectory prediction. In 2020 European Conference on
Computer Vision, Cham. Springer International Pub-
lishing.
Marchetti, F., Becattini, F., Seidenari, L., and Del Bimbo,
A. (2020). Multiple trajectory prediction of mov-
ing agents with memory augmented networks. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence.
Pang, B., Zhao, T., Xie, X., and Wu, Y. N. (2021). Trajec-
tory prediction with latent belief energy-based model.
In 2021 IEEE/CVF Conference on Computer Vision
and Pattern Recognition.
Poibrenski, A., Klusch, M., Vozniak, I., and Müller, C. (2020). M2P3: Multimodal Multi-Pedestrian Path Prediction by Self-Driving Cars with Egocentric Vision, pages 190–197. Association for Computing Machinery, New York, NY, USA.
Qiu, J., Lo, F. P.-W., Gu, X., Sun, Y., Jiang, S., and Lo, B.
(2021). Indoor future person localization from an ego-
centric wearable camera. In 2021 IEEE/RSJ Interna-
tional Conference on Intelligent Robots and Systems.
Rasouli, A., Kotseruba, I., Kunic, T., and Tsotsos, J. (2019).
Pie: A large-scale dataset and models for pedestrian
intention estimation and trajectory prediction. In 2019
IEEE/CVF International Conference on Computer Vi-
sion.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time ob-
ject detection. In 2016 IEEE Conference on Computer
Vision and Pattern Recognition.
Rodin, I., Furnari, A., Mavroeidis, D., and Farinella, G. M.
(2021). Predicting the future from first person (ego-
centric) vision: A survey. Computer Vision and Image
Understanding, 211:103252.
Rudenko, A., Palmieri, L., Herman, M., Kitani, K. M.,
Gavrila, D. M., and Arras, K. O. (2020). Human mo-
tion trajectory prediction: a survey. The International
Journal of Robotics Research, 39(8):895–935.
Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N.,
Rezatofighi, H., and Savarese, S. (2019). Sophie: An
attentive gan for predicting paths compliant to social
and physical constraints. In 2019 IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition.
Salzmann, T., Ivanovic, B., Chakravarty, P., and Pavone,
M. (2020). Trajectron++: Dynamically-feasible tra-
jectory forecasting with heterogeneous data. In 2020
European Conference on Computer Vision, Cham.
Springer International Publishing.
Seiskari, O., Rantalankila, P., Kannala, J., Ylilammi, J.,
Rahtu, E., and Solin, A. (2022). Hybvio: Pushing the
limits of real-time visual-inertial odometry. In Pro-
ceedings of the IEEE/CVF Winter Conference on Ap-
plications of Computer Vision.
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M.,
and Luo, P. (2021). Segformer: Simple and efficient
design for semantic segmentation with transformers.
In Advances in Neural Information Processing Sys-
tems, volume 34.
Yagi, T., Mangalam, K., Yonetani, R., and Sato, Y. (2018).
Future person localization in first-person videos. In
2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition.
Yuan, Y., Chen, X., and Wang, J. (2020). Object-contextual
representations for semantic segmentation. In 2020
European Conference on Computer Vision.
Zhang, P., Ouyang, W., Zhang, P., Xue, J., and Zheng, N.
(2019). Sr-lstm: State refinement for lstm towards
pedestrian trajectory prediction. In 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion.
Zhao, T., Xu, Y., Monfort, M., Choi, W., Baker, C., Zhao,
Y., Wang, Y., and Wu, Y. N. (2019). Multi-agent ten-
sor fusion for contextual trajectory prediction. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition.