
 
2 LANGUAGE MODEL TUNING 
Each sport is unique, with its own specific 
expressions, phrases and commentator's manner of 
speech. A language model should therefore be based 
on a large number of transcriptions of the TV 
commentary of the given sport. 
We manually transcribed 90 ice-hockey 
matches, both international and Czech league 
matches. These transcriptions contain the names of 
the players and teams involved in each match, but also 
other names that the commentators mentioned. To 
make a general language model suitable for all ice-
hockey matches, we would need to add the names of 
all ice-hockey players in the world. This would swell 
the vocabulary of the recognition system, slow 
the system down, and moreover reduce transcription 
accuracy. The only feasible approach is to prepare 
a language model specifically for each match by 
adding only the names of the players of the two 
competing teams. A class-based language model 
plays the key role in this task. 
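The idea behind class-based training can be illustrated with a minimal sketch: in-match names in a transcript are replaced by class tags before n-gram counting, so the model learns patterns such as "<PLAYER> passes to <PLAYER>" rather than memorising specific names. The names, tags and the example sentence below are purely illustrative, not taken from the corpus.

```python
def tag_transcript(tokens, players, teams):
    """Replace in-match names with semantic class tags (illustrative sketch)."""
    tagged = []
    for tok in tokens:
        if tok in players:
            tagged.append("<PLAYER>")
        elif tok in teams:
            tagged.append("<TEAM>")
        else:
            # other words, including names unrelated to the match, stay as-is
            tagged.append(tok)
    return tagged

tokens = "Novak passes to Svoboda and Sparta scores".split()
print(tag_transcript(tokens, players={"Novak", "Svoboda"}, teams={"Sparta"}))
```

The n-gram model is then trained on the tagged token stream, and the class tags are expanded with the actual line-ups before each match.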
During the manual transcription of the TV ice-hockey 
commentaries, some words were labelled with 
tags representing several semantic classes. The first 
class represents the names of the players taking part in 
the match. The second class is used for the names of the 
competing teams or countries, and the next class for 
the designations of sport venues (stadium, arena, etc.). 
Names that do not relate to the transcribed ice-
hockey match (for example, legendary players 
such as "Jagr") were not labelled, because they are 
more or less independent of the match. Since the 
Czech language is highly inflectional, a further 27 
classes were used for the names in other 
grammatical cases and for their possessive forms. 
Finally, using the above-mentioned tags in place 
of the individual names, two class-based 
trigram language models were trained: one for in-
game commentary and one for studio interviews. 
The manual transcriptions of the 90 commentaries 
contain 750k tokens with 25k unique words. These 
data alone cannot cover the commentary of forthcoming 
ice-hockey matches. To make the vocabulary and 
language model more robust, additional data from 
newspapers (175M tokens) and TV news 
transcriptions (9M tokens) were used. Only data 
whose topic was automatically detected as sport 
(Skorkovská et al., 2011) were used; they were mixed 
with the ice-hockey commentaries using interpolation 
weights set according to the perplexity of the 
test data. For the in-game language model, the weights 
were 0.65 for the ice-hockey commentaries, 0.30 for the 
newspaper data and 0.05 for the TV news transcriptions, 
while for studio interviews the weights were 0.20, 
0.65 and 0.15, respectively. The final vocabulary of 
the recognition system contains 455k unique words 
(517k baseforms). 
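Linear interpolation of the three component models with the quoted in-game weights (0.65 / 0.30 / 0.05) can be sketched as follows; the component trigram probabilities below are made-up numbers for illustration only.

```python
def interpolate(p_components, weights):
    """Mix component LM probabilities with fixed interpolation weights."""
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to one
    return sum(w * p for w, p in zip(weights, p_components))

# hypothetical trigram probabilities from the three component models
p_hockey, p_news, p_tv = 0.020, 0.004, 0.001

# in-game weights from the text: commentaries / newspapers / TV news
p_mixed = interpolate([p_hockey, p_news, p_tv], [0.65, 0.30, 0.05])
print(p_mixed)
```

In practice the weights themselves are chosen so that the interpolated model gives the lowest perplexity on held-out data, as described above.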
Finally, before the recognition of each ice-hockey 
match, the language model classes have to be filled 
with the actual words. The names of the players of the 
two competing teams (line-ups) are acquired and 
automatically declined into all possible word forms. 
Since a player can be referred to by his full name 
or by his surname only, both representations are 
generated. The remaining language model classes are 
filled with the names of the teams and the designations 
of the sport venues corresponding to the given ice-
hockey match. 
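The class-filling step can be sketched as below: for each player in the line-up, both the full-name and surname-only variant are generated for every word form. Real Czech declension requires a morphological generator; the `decline` function here is a stand-in placeholder (an assumption, not the tool used in the paper) that returns only the base form.

```python
def decline(name):
    """Placeholder for an automatic Czech declension tool (assumption).

    A real morphological generator would return all case forms and
    possessive forms of the name; here only the base form is returned.
    """
    return [name]

def fill_player_class(first_name, surname):
    """Generate the word sequences that expand one <PLAYER> class entry."""
    variants = []
    for form in decline(surname):
        variants.append(form)                    # surname-only reference
        variants.append(f"{first_name} {form}")  # full-name reference
    return variants

print(fill_player_class("Jaromir", "Novak"))  # illustrative name
```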
3 DIRECT RECOGNITION 
The acoustic data for direct subtitling (subtitling 
from the original audio track) were collected over 
several years, especially from the Ice-hockey World 
Championships as well as from the Winter Olympic 
Games and Czech Ice-hockey League matches. 
All these matches were broadcast by Czech 
Television. Sixty-nine matches were transcribed for 
acoustic modelling purposes. All of them 
were manually annotated and carefully revised 
(using the annotation software Transcriber). The total 
amount of data was more than 100 hours of speech. 
The analogue signal was digitised at a 44.1 kHz 
sample rate with 16-bit resolution. 
The front-end processor was based on PLP 
parameterization (Hermansky, 1990) with 27 band-
pass filters and 16 cepstral coefficients with both 
delta and delta-delta sub-features. One 
feature vector therefore contains 48 coefficients. The 
feature vectors were calculated every 10 ms. Many 
noise-reduction techniques were tested to compensate 
for the very intense background noise; J-RASTA 
(Koehler et al., 1994) proved to be the best noise-
reduction technique in our case (see Psutka 
et al., 2003 for details). 
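The front-end dimensions quoted above can be verified with a quick back-of-the-envelope check: 16 cepstral coefficients plus their delta and delta-delta sub-features give a 48-dimensional vector, computed every 10 ms (i.e. 100 vectors per second).

```python
N_CEPSTRA = 16                         # static cepstral coefficients
FEATURE_DIM = N_CEPSTRA * 3            # static + delta + delta-delta
FRAME_SHIFT_MS = 10                    # one feature vector every 10 ms
frames_per_second = 1000 // FRAME_SHIFT_MS

print(FEATURE_DIM)        # 48 coefficients per feature vector
print(frames_per_second)  # 100 feature vectors per second
```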
The basic speech unit in all our 
experiments was represented by a three-state HMM 
with a continuous output probability density 
function assigned to each state. Because the number of 
possible Czech triphones is too large, phonetic 
decision trees were used to tie the states of the 
triphones. Several experiments were performed to 
determine the best recognition results as a function of 
the number of clustered states and of the 
number of mixture components per state. The best 
recognition results were achieved with 16 multivariate 
Gaussian mixture components for each of the 7700 
states (see Psutka, 2007 for the methodology). 
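The figures above imply a roughly twelve-million-parameter acoustic model, as the sketch below shows. Diagonal covariance matrices are an assumption here (a common choice for GMM-HMM systems); the text does not state the covariance type, so the count is only indicative.

```python
N_STATES = 7700   # tied triphone states
N_MIX = 16        # Gaussian mixture components per state
DIM = 48          # feature vector dimension

# assumption: diagonal covariances, so each Gaussian has a mean vector
# and a variance vector of length DIM
params_per_gaussian = 2 * DIM
params_per_state = N_MIX * (params_per_gaussian + 1)  # +1 mixture weight
total = N_STATES * params_per_state
print(total)
```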
SIGMAP 2013 - International Conference on Signal Processing and Multimedia Applications