LAMV: Learning to Predict Where Spectators Look in Live Music Performances

Arturo Fuentes 1,2 (https://orcid.org/0000-0002-1813-4766), F. Javier Sánchez 1 (https://orcid.org/0000-0002-9364-3122), Thomas Voncina 2 and Jorge Bernal 1 (https://orcid.org/0000-0001-8493-9514)

1 Computer Vision Center and Computer Science Department, Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), 08193, Barcelona, Spain
2 Lang Iberia, Carrer Can Pobla, 3, 08202, Sabadell, Spain

Keywords: Object Detection, Saliency Map, Broadcast Automation, Spatio-temporal Texture Analysis.
Abstract:
The advent of artificial intelligence has brought about an evolution in how different daily work tasks are performed. The analysis of cultural content has received a huge boost from the development of computer-assisted methods that allow easy and transparent data access. In our case, we deal with the automation of the production of live shows, such as music concerts, aiming to develop a system that can indicate to the producer which camera to broadcast based on what each of them is showing. In this context, we consider it essential to understand where spectators look and what they are interested in, so that the computational method can learn from this information. The work we present here shows the results of a first preliminary study in which we compare areas of interest defined by human beings with those indicated by an automatic system. Our system is based on the extraction of motion textures from dynamic Spatio-Temporal Volumes (STV), whose patterns are then analyzed by means of texture analysis techniques. We validate our approach over several video sequences that have been labeled by 16 different experts. Our method is able to match the relevant areas identified by the experts, achieving recall scores higher than 80% when a distance of 80 pixels between method and ground truth is considered. Current performance shows promise for detecting abnormal peaks and movement trends.
1 INTRODUCTION
The commercial segment of live performance production is undergoing a transformation due to the growing automation of work tools. More and more frequently, television studios are incorporating robotic cameras and replacing camera operators with automation managers. This offers greater versatility and provides more complete motion control. In addition, it also implies a saving in production costs, as a single operator can control several cameras.
This is different from current practice, where a camera operator is required for each of the installed cameras. Nevertheless, these advances have a counterpart: the absence of human operators could lead to the loss of the additional, subjective information that operators bring to the filmmaker.
The scope of our work is the automatic live production of cultural events; in this case, we focus on music concerts. This kind of project has already been
tackled by other researchers in different domains,
such as the work of (Chen et al., 2013), which aims at the live production of soccer matches, similar to the system that Mediapro uses when broadcasting the Spanish National Soccer League. Outside cultural events, the work of (Pang et al., 2010) deals with the automatic selection of the camera to show in videoconferences so that the speaker is always on screen.
The domain in which our work is framed has the advantage of having additional sources of information that can help in building the automatic system; we can integrate both video and audio analysis to determine which areas of the image should be shown on the main screen. It has to be noted that we aim to go beyond simple frame selection: our objective is to develop a system that can also generate a dynamic in terms of take selection, composition and length. Such a system, apart from its potential commercial future, allows us to generate knowledge on how audiovisual narrative should work from a computational perspective, which could potentially be extended to other domains.
Related to this, we cannot forget the spectator, who is the one that receives the information; this information should contemplate both enunciative and explicative components, so that the viewer can understand what is happening and the way the information is explained can generate some kind of engagement in the spectator. Our first approach, named Live Automation of Musical Videos (LAMV), deals with the enunciative analysis of the action.
We rely on the use of texture descriptors computed over several patches of the image along several frames, searching for those signatures that allow us to identify the image areas where big changes (associated with movements) are produced.
The objectives of this paper are two-fold: 1) identifying where spectators look and 2) searching for signatures. The first one deals with the acquisition of several annotations by experts and their integration to define which areas of the image are more relevant from an expert point of view. The second one is related to the analysis of texture information to define those signatures that can be attributed to relevant image areas.
2 RELATED WORK
The first part of our system must deal with finding what is important in the image. In this case, we should not only look for what is relevant from a computational point of view; we also have to consider what would be important for a given person.
This is covered in the literature under the research domain known as visual saliency (Cazzato et al., 2020). The salience (also called saliency) of an item - be it an object, a person, a pixel, etc. - is the state or quality by which it stands out from its neighbors.
Saliency detection is considered to be a key at-
tentional mechanism that facilitates learning and sur-
vival by enabling organisms to focus their limited per-
ceptual and cognitive resources on the most pertinent
subset of the available sensory data. When attention
deployment is driven by stimuli, it is considered to be
bottom-up, memory-free, and reactive. Conversely,
attention can also be guided by top-down, memory-dependent, or anticipatory mechanisms. There are several computational saliency models already proposed in the literature, such as the works of Itti et al. (Borji et al., 2012; Itti et al., 1998), Tsotsos et al. (Bruce and Tsotsos, 2009) or Seo et al. (Seo and Milanfar, 2009), just to mention a few of them.
In our case, we must deal with the difficulty of defining the salient regions in an image where all the elements (i.e., music performers) look alike.
To solve this, we propose to incorporate the defini-
tion of visual differences between frames to our def-
inition of saliency. The use of this type of saliency
has already been proposed in the literature, as shown
in the works of Rudoy et al. (Rudoy et al., 2013). In
this case, the authors use the fixations from previous
frames to predict the gaze in the new frame. The nov-
elty here is the combination of within frame saliency
with dynamics of human gaze transitions.
As mentioned before, we will also have to deal
with temporal coherence of system output. The in-
corporation of temporal information has attracted the
attention of several researchers of different domains,
from the analysis of colonoscopy sequences (Anger-
mann and Bernal, 2017) to tracking moving objects
(such as driving lanes) in the development of assisted
driving systems (Sotelo et al., 2004).
Regarding the domain of application, one of the things to take into account is that the events we are dealing with are live. This means that, despite the fact that there are many rehearsals (their amount depends on the available resources), there is always a certain amount of improvisation associated with making decisions in very short periods of time and the necessity of adapting to unexpected events.
There are several available solutions in the sports
domain such as the Multicamba system (Yus et al.,
2015) used for instance in sailing to track the boats.
With respect to soccer matches, Mediapro has developed Automatic TV (automatic.tv), a complete commercial solution that allows selecting the best shot and sending it directly to the viewers.
It has to be noted that the context we are dealing with presents some differences with respect to sports. The technical director of a live concert has to integrate both video and audio information, which adds more complexity to the task.
As LAMV will also integrate music channel processing, works regarding polyphonic music recognition (Gururani et al., 2018; Toghiani-Rizi and Windmark, 2017; Costa et al., 2017; Costa et al., 2011), score image recognition (Dorfer et al., 2016) and natural synchronization of the visual and sound modalities (Zhao et al., 2018) should be taken into account.
3 METHODOLOGY
We present in this section the two main parts of this contribution: the definition of the relevant areas of the image as a result of the analysis of the input provided by several experts, and the automatic extraction of relevant features from the video sequence to estimate the key areas to look at.

Figure 1: Flow chart of our first approach to LAMV, which shows the process followed to evaluate the minimum distance between the areas selected by the subjects and our candidate.
The whole pipeline of our methodology can be observed in Fig. 1. As can be seen, we have two different sources of data: the areas selected by the subjects and the input video. With respect to the former, we analyze a 5-second interval and calculate the mean vertical value of the selections, which will be used as a seed in the image analysis module.
The second source of data consists of two modes of the 5-second input video - the original frames and the difference between consecutive frames (t, t-1) - from which two STT slices of 125 pixels in height are generated and analyzed through GLCM, considering two descriptors: contrast and dissimilarity. The point candidate is given by the horizontal position of the maximum value of each descriptor (the x value) and by the pre-calculated mean vertical value (the y value). The minimum distance between the candidate and the closest ground truth candidate is used to assess the performance of the LAMV system.
3.1 Where Do Spectators Look?
To gather as much useful information as possible, we ran an online experiment (available at https://videoexp.dtabarcelona.com) in which several experts were asked to observe different sequences (extracted from real music concerts) and indicate which image area was more relevant to them.
We established two different scenarios. The first one proposed to the participants five predetermined regions of interest, whereas the second one allowed them to freely browse the scene in search of the area that attracted their attention.
In both cases, the subject had to move a green box
over the image and locate it over the area of interest,
making the spectator act as a live producer. The vol-
unteers were required to select a box (or a position)
at least once every 10 seconds. Contrary to the au-
tomatic system, the participants also had the help of
the audio channel to assist them in their decision. In
our case, we are only interested (as of now) in finding those visual cues that could also attract the attention of the potential spectator.
The aggregation of the selections made by the individual experts is the basis for generating the ground truth, which will be used to assess the performance of our method. In this case, we add each individual fixation as an individual box over the image and we select as the relevant area of the image the one in which more of these individual selections coincide.
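The following is a minimal sketch of this aggregation step (our own Python, not taken from the paper; the box size and the (x, y) centre format of each selection are assumptions made for illustration):

```python
import numpy as np

def aggregate_selections(selections, frame_shape, box_size=100):
    """Accumulate the experts' box selections into a heatmap and
    return the point where most selections overlap."""
    h, w = frame_shape
    heatmap = np.zeros((h, w), dtype=np.int32)
    half = box_size // 2
    for x, y in selections:  # one (x, y) box centre per expert fixation
        x0, x1 = max(0, x - half), min(w, x + half)
        y0, y1 = max(0, y - half), min(h, y + half)
        heatmap[y0:y1, x0:x1] += 1
    # Pixel with the highest number of overlapping selections
    peak_y, peak_x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return heatmap, (peak_x, peak_y)

# Example: three experts clicking around the same area of a 720p frame
heatmap, peak = aggregate_selections([(640, 300), (660, 320), (700, 310)], (720, 1280))
print(peak)
```

The mean y-value of the accumulated selections is what later serves as the seed for the STT analysis.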
3.2 Defining Signatures
Still frame analysis can be useful to detect which areas are more salient in an individual image but, in our case, we are more interested in those sudden changes in the image that could be associated, for instance, with an instrument that has just started to be played or any unexpected event on stage.
To build this automatic relevant region extraction module, we propose the use of the so-called Spatio-Temporal Texture (STT) slices. To generate them, we first set a vertical position in the image to be used as a reference. This position is set (in this experiment) as the mean y-value of the participant selections for this particular set of frames, but the analysis that we present in this work could easily be extended to any given vertical position. In any case, as fixations can vary over time (and they should, as relevant events can appear in other image positions), we recalculate this vertical position every 5 seconds.
Once this vertical position is set, we next generate the slice by rendering the information of the pixels along the horizontal axis; the concatenation of this information over time generates the final slice. It has to be noted that we generate two types of images, depending on the source of information.
The first one directly renders the grey-scale values of the pixels that share the same vertical coordinate; before doing this, the contrast of the image is adjusted so that the dynamic range of the image is reduced.
Figure 2: Graphical example of the automatic extraction of image patches. The image at the top shows the five predetermined regions that were shown to the volunteers. The blue horizontal line represents the mean value of the vertical coordinate of the selections over 5 seconds of video. The image in the middle represents the STT volume, whereas the image at the bottom displays the 27 uniformly distributed patches over the volume.
The second one uses as source information the difference between consecutive frames; apart from the previously mentioned contrast adjustment, Gabor filtering is applied to improve efficiency and motion fidelity (Hao et al., 2017). While our method has similarities to that work, we add as a novelty the calculation of the slice image from the difference of frames.
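A minimal sketch of how such slices could be generated with OpenCV/NumPy follows (our own code, assuming 25 fps so that 125 frames cover 5 seconds; histogram equalization stands in for the contrast adjustment and the Gabor filtering step is omitted):

```python
import cv2
import numpy as np

def build_stt_slices(video_path, y_row, n_frames=125):
    """Build two Spatio-Temporal Texture (STT) slices at row y_row:
    one from the grey-level frames and one from frame differences."""
    cap = cv2.VideoCapture(video_path)
    rows_gray, rows_diff, prev = [], [], None
    while len(rows_gray) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.equalizeHist(gray)  # stand-in for the contrast adjustment
        rows_gray.append(gray[y_row, :])
        if prev is not None:
            # Difference slice: same row taken from |frame_t - frame_{t-1}|
            rows_diff.append(cv2.absdiff(gray, prev)[y_row, :])
        prev = gray
    cap.release()
    # Each slice has one image row per frame, stacked over time (time x width)
    return np.vstack(rows_gray), np.vstack(rows_diff)
```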
Considering these two different sources, we build our hypothesis on the fact that the image regions that present a higher variation over time would correspond to the most interesting regions in the opinion of the volunteers. It has to be noted that, if we analyze the STT slices vertically, we are able to explore variations over time in a specific region of the image.
To assist us in the definition of image signatures, we propose the use of the Grey Level Co-occurrence Matrix (GLCM) (Gao and Hui, 2010), as it is meant to be useful in detecting information levels with high semantic content, which is what we are looking for. In this case, GLCM does not suffer from illumination changes, as the lighting is uniform over the whole scene; we are dealing with pre-recorded musical concerts where the illumination conditions have been previously set and tested to allow a clear view of the music sheets by the musicians.
From the several descriptors that could be used, we have chosen dissimilarity and contrast as the ones that could most easily lead us to determine areas with high change (and, therefore, prone to attract the attention of the viewer). It has to be noted that dissimilarity increases linearly with differences in grey level, whereas contrast increases quadratically.
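For reference, the standard definitions of both descriptors over a normalized co-occurrence matrix P(i, j) are the following; the quadratic weight of contrast versus the linear weight of dissimilarity is what makes the former react more strongly to large grey-level jumps:

```latex
\mathrm{Contrast} = \sum_{i,j} P(i,j)\,(i-j)^{2},
\qquad
\mathrm{Dissimilarity} = \sum_{i,j} P(i,j)\,\lvert i-j \rvert
```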
We apply these descriptors over predetermined regions of the STT; in our experiments, we have defined 27 different patches of size 50x50 pixels that are spread equally over the slice. We calculate both descriptors for each of the patches and we select as the final relevant region proposal the one where their maximum is achieved. We show in Fig. 2 a graphical example of how these patches are created.
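A possible implementation of this patch analysis with scikit-image is sketched below (our own assumptions: the 27 patches are laid out as a 3x9 grid over a 125-pixel-high slice, which is not necessarily the exact tiling used in the paper; the slice is assumed to be an 8-bit grey-level image):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def best_patch(stt_slice, patch=50, grid=(3, 9)):
    """Compute GLCM contrast and dissimilarity over a grid of 50x50 patches
    spread over an STT slice and return, for each descriptor, the top-left
    corner of the patch where that descriptor is maximal."""
    h, w = stt_slice.shape
    ys = np.linspace(0, h - patch, grid[0]).astype(int)
    xs = np.linspace(0, w - patch, grid[1]).astype(int)
    best = {"contrast": (None, -np.inf), "dissimilarity": (None, -np.inf)}
    for y in ys:
        for x in xs:
            win = stt_slice[y:y + patch, x:x + patch].astype(np.uint8)
            glcm = graycomatrix(win, distances=[1], angles=[0],
                                levels=256, symmetric=True, normed=True)
            for prop in best:
                value = graycoprops(glcm, prop)[0, 0]
                if value > best[prop][1]:
                    best[prop] = ((x, y), value)
    return best
```

The horizontal position of the winning patch then provides the x coordinate of the candidate point described in the pipeline.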
4 EXPERIMENTAL SETUP
We validate our first iteration of the LAMV system by
comparing the regions of interest that it automatically
provides with the ones labelled by the participants in
the online experiment.
Volunteers had two sets of sequences to analyze:
one containing predetermined RoIs and another one
where they could freely explore the image. The total length of the video was 5 minutes and 30 seconds; spectators had to indicate a region in the image at least once every 10 seconds.
It is important to mention that both sets were extracted from different music concerts, recorded in April 2019 and January 2020 at Auditori Sant Cugat and L'Auditori de Barcelona, respectively.
Each of these selections was saved with a time stamp. The ground truth generation experiment was implemented with the support of the p5.js JavaScript library.
As of now, 16 different individuals have taken part
in the annotation experiment. We show in Fig. 4 an
example of ground truth creation for a given frame.
Figure 3: Example of the annotation experiment. Image at the top shows an annotation task where the user has to place the
box (in green) over five predetermined boxes. The red box represents a selection by the user. The image at the bottom presents
a full scene, using the same notation to display user selection.
Figure 4: Example of ground truth creation from the annotation experiment.
We propose Recall as the validation metric, defined as the percentage of correct RoI detections over the total number of RoIs in the sequence. We calculate the RoI provided by LAMV as follows: we select 125 lines of the STT slice, which correspond to 5 seconds of the video sequence. Over this sub-volume, we define the aforementioned 27 patches of 50x50 pixels and we calculate the dissimilarity and contrast descriptors over them (using as source image the original one and the difference of images).
The comparison of the values of the descriptors over these 27 patches provides a candidate. To assess whether this candidate corresponds to the actual RoI defined by the experts, we calculate the Euclidean distance between the centroid of the candidate and the maximum of the ground truth for this particular set of frames. To label a detection as a True Positive, we set a distance threshold that must not be surpassed.

Figure 5: Performance of the LAMV system over video sequences with predetermined Regions of Interest.

Figure 6: Performance of the LAMV system over video sequences without predetermined Regions of Interest.
Given that we analyze temporal windows of 5 seconds, the maximum number of TP over a video sequence is $\mathrm{max}_{TP} = vl / 5$, where the video length $vl$ is expressed in seconds.
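The evaluation itself reduces to one distance check per 5-second window; a minimal sketch of the recall computation follows (our own Python; candidate and ground-truth centroids are assumed to be (x, y) tuples, one per window, and the 80-pixel threshold matches the value reported in the abstract):

```python
import numpy as np

def recall(candidates, ground_truth, threshold=80):
    """Fraction of 5-second windows whose candidate centroid lies within
    `threshold` pixels of the ground-truth centroid (a True Positive)."""
    tp = sum(
        np.hypot(cx - gx, cy - gy) <= threshold
        for (cx, cy), (gx, gy) in zip(candidates, ground_truth)
    )
    return tp / len(ground_truth)
```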
5 RESULTS
Figures 5 and 6 show the overall performance of the system. Several conclusions can be extracted from the analysis of these graphs.
First, there appears to be a big difference in performance associated with the way information is presented to the experts. Surprisingly, the method matches the ground truth better when the experts are allowed to freely select the region of interest.
It has to be noted that the sequences differed in complexity, it being easier to determine a sudden change in the scene when we have the whole picture. In the predetermined region analysis, a small change in a reduced region of the image weighs more than in the full image, benefiting isolated, sudden and small changes rather than actual changes in the scene.
This difference in complexity also affects the recall scores and the impact of the different descriptors. We associate this with the following: when dealing with predetermined regions, where small changes weigh more, the value of the contrast descriptor is more sensitive to these slight changes because of its quadratic profile, whereas, for the analysis of the full images, these small changes tend to be mitigated.
We do not appreciate remarkable differences among descriptors or between image sources, though it is interesting to observe the increase in relevance of the dissimilarity descriptor when larger distances between centroids are allowed.
As can be seen from these results, and though the performance is promising, there are some cases in which our method fails to correctly determine the region of interest. We show in Fig. 7 an example of a patch which shows a trivial, routine movement of a wind instrument. In this case the system confuses the preparation movement of the musician (just before actually starting to play the instrument) with an actual relevant movement in the scene. This shows that more effort should be made to incorporate semantics (and also to characterize particular movements such as the one shown in the example) in order to improve the performance of LAMV.
Figure 7: Example of one of the erroneous detections provided by the system.
We want to highlight some of the findings that we have observed through the analysis of the fixations provided by the volunteers.
Fig. 8 shows the difference in how selections are performed depending on the content of the video sequence, more precisely, depending on the intensity of the movements/actions that happen in different time intervals of the same concert.
The blue distribution of selections is associated with an interval of the concert where the level of action is low, whereas the orange distribution corresponds to a part of the concert where many movements are happening at once (for instance, some of the musicians start to play their instruments).
The image on the left shows the number of clicks during both periods; we can observe that the distributions present clear differences, with the number of clicks being higher for the more intense interval. By looking at the image on the right, we can clearly make an association between the number of clicks and the image area covered by them.
Figure 8: Example of the difference in the distribution of the number of selections and their coverage over the scene in two different time intervals.

With respect to click occurrences, our hypothesis was that when the music is more intense, the scene becomes more complex from a visual point of view, so a higher number of different areas are selected.
are selected. We tested this using a paired sample t-
test and the results confirmed our initial assumption,
yielding a p-value < 0.05.
Regarding selection distribution over the image,
our hypothesis was that, apart from a higher number
of selections, these selections would be more spread
over the image. Again we tested this using a paired
sample t-test and the results showed that, in fact,
there are statistically significant differences between
the two distributions (p-value < 0.05).
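Both comparisons can be reproduced with a standard paired-sample t-test; the sketch below uses SciPy with illustrative per-volunteer click counts rather than the actual study data:

```python
from scipy import stats

# Number of clicks per volunteer in the low-intensity and the
# high-intensity interval (illustrative values, not the study data)
clicks_low = [3, 4, 2, 5, 3, 4, 3, 2]
clicks_high = [7, 6, 8, 9, 5, 7, 6, 8]

t_stat, p_value = stats.ttest_rel(clicks_high, clicks_low)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> significant difference
```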
6 CONCLUSIONS
6.1 Main Conclusions
We have presented in this paper the first module of the LAMV system. This system aims to assist technicians in the live production of a music concert. The objective of the system is to indicate which areas of the image are more relevant (not only in terms of visual information) in order for them to be broadcast.
Though the complete system will incorporate audio information, which is key in this context, we first wanted to explore whether the analysis of pure image information could be useful to indicate those relevant image areas.
To do so, we have prepared a first experiment which compares the areas provided automatically by our system with those indicated by several volunteers who took part in an online experiment. The LAMV system determines the area of interest by integrating image information over time using an STT volume and then extracting features from several patches.
Preliminary results show a good correspondence between the LAMV system and the ground truth, though they also show that semantic information should be added to improve its performance.
6.2 Future Work
With respect to the next steps in the development of the LAMV system, it has to be noted that the work in this article corresponds to one of the parts that compose the LAMV system, specifically the pre-attentional analysis of the scene, which is capable of detecting rapid changes that may arise in the scene as well as the imbalance between image areas, especially those with low activity.
At its current stage, the system uses as a starting seed for the analysis a vertical position calculated from the acquired ground truth. In the final system this seed will not be provided, and several values for the vertical position will be considered.
Preliminary results show that LAMV is a good candidate to point out where to add more processing in specific areas of the scene, but its current analysis is mainly based on changes over time. However, we cannot forget the role that audio information might have played in the generation of the ground truth by the experts. We have observed that experts tend to correctly select the areas where musicians are playing the predominant instruments, which can be clearly heard in the audio channel but cannot be easily seen.
This clearly indicates that part of the effort in the next steps should be devoted to the selection of descriptors with a greater semantic feature structure, which can help to locate the instruments even if they are not completely visible. This would certainly require more advanced image processing techniques to perform a more accurate object detection.
In this context, we plan to study the incorporation of trending techniques such as Convolutional Neural Networks integrating optical flow estimation, such as FlowNet (Dosovitskiy et al., 2015). We would also like to explore the potential of recent methodologies such as LSTMs or RNNs, paying special attention to the Transformer architecture and Vision Transformers for image recognition (https://github.com/lucidrains/vit-pytorch).
Finally, we would also like to build a robust validation framework for the whole pipeline, which would require acquiring and annotating more concerts.
ACKNOWLEDGEMENTS
This work was supported by the Secretaria d'Universitats i Recerca de la Generalitat de Catalunya (SGR-2017-1669), by the Agència de Gestió d'Ajuts Universitaris i de Recerca (AGAUR) through the Doctorat Industrial programme and by the CERCA Programme/Generalitat de Catalunya.
The authors want to thank the Auditori de Sant Cugat and L'Auditori de Barcelona for granting us access to recordings of music concerts.
REFERENCES
Angermann, Q., Bernal, J., et al. (2017). Towards real-
time polyp detection in colonoscopy videos: Adapting
still frame-based methodologies for video sequences
analysis. In Computer Assisted and Robotic En-
doscopy and Clinical Image-Based Procedures, pages
29–41. Springer.
Borji, A., Sihite, D. N., and Itti, L. (2012). Quantitative
analysis of human-model agreement in visual saliency
modeling: A comparative study. IEEE Transactions
on Image Processing, 22(1):55–69.
Bruce, N. D. and Tsotsos, J. K. (2009). Saliency, attention,
and visual search: An information theoretic approach.
Journal of vision, 9(3):5–5.
Cazzato, D., Leo, M., Distante, C., and Voos, H. (2020).
When i look into your eyes: A survey on computer
vision contributions for human gaze estimation and
tracking. Sensors, 20(13):3739.
Chen, C., Wang, O., Heinzle, S., Carr, P., Smolic, A., and
Gross, M. (2013). Computational sports broadcast-
ing: Automated director assistance for live sports. In
2013 IEEE International Conference on Multimedia
and Expo (ICME), pages 1–6. IEEE.
Costa, Y. M., Oliveira, L. S., Koerich, A. L., and Gouyon, F.
(2011). Music genre recognition using spectrograms.
In 2011 18th International Conference on Systems,
Signals and Image Processing, pages 1–4. IEEE.
Costa, Y. M., Oliveira, L. S., and Silla Jr, C. N. (2017). An
evaluation of convolutional neural networks for music
classification using spectrograms. Applied soft com-
puting, 52:28–38.
Dorfer, M., Arzt, A., and Widmer, G. (2016). Towards
score following in sheet music images. arXiv preprint
arXiv:1612.05050.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas,
C., Golkov, V., Van Der Smagt, P., Cremers, D., and
Brox, T. (2015). Flownet: Learning optical flow with
convolutional networks. In Proceedings of the IEEE
international conference on computer vision, pages
2758–2766.
Gao, C.-C. and Hui, X.-W. (2010). Glcm-based texture fea-
ture extraction. Computer Systems & Applications,
6(048).
Gururani, S., Summers, C., and Lerch, A. (2018). In-
strument activity detection in polyphonic music using
deep neural networks. In ISMIR, pages 569–576.
Hao, Y., Xu, Z., Wang, J., Liu, Y., and Fan, J. (2017). An
effective video processing pipeline for crowd pattern
analysis. In 2017 23rd International Conference on
Automation and Computing (ICAC), pages 1–6. IEEE.
Itti, L., Koch, C., and Niebur, E. (1998). A model of
saliency-based visual attention for rapid scene anal-
ysis. IEEE Transactions on pattern analysis and ma-
chine intelligence, 20(11):1254–1259.
Pang, D., Madan, S., Kosaraju, S., and Singh, T. V. (2010).
Automatic virtual camera view generation for lecture
videos. In Tech report. Stanford University.
Rudoy, D., Goldman, D. B., Shechtman, E., and Zelnik-
Manor, L. (2013). Learning video saliency from hu-
man gaze using candidate selection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1147–1154.
Seo, H. J. and Milanfar, P. (2009). Nonparametric bottom-
up saliency detection by self-resemblance. In 2009
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition Workshops, pages 45–
52. IEEE.
Sotelo, M. A., Rodriguez, F. J., Magdalena, L., Bergasa,
L. M., and Boquete, L. (2004). A color vision-based
lane tracking system for autonomous driving on un-
marked roads. Autonomous Robots, 16(1):95–116.
Toghiani-Rizi, B. and Windmark, M. (2017). Musical in-
strument recognition using their distinctive charac-
teristics in artificial neural networks. arXiv preprint
arXiv:1705.04971.
Yus, R., Mena, E., Ilarri, S., Illarramendi, A., and Bernad,
J. (2015). Multicamba: a system for selecting camera
views in live broadcasting of sport events using a dy-
namic 3d model. Multimedia Tools and Applications,
74(11):4059–4090.
Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., Mc-
Dermott, J., and Torralba, A. (2018). The sound of
pixels. In Proceedings of the European conference on
computer vision (ECCV), pages 570–586.