Authors:
Zhi-Chen Yan¹ and Stephanie A. Yu²
Affiliations:
¹ Facebook Research, 1 Hacker Way, Menlo Park, CA 94025, U.S.A.
² West Island School, 250 Victoria Road, Pokfulam, Hong Kong, China
Keywords: Attention, Convolution, Deep Learning, LSTM, Text Recognition.
Abstract:
Recognizing text in real-world scenes is an important research topic in computer vision, and many deep-learning-based techniques have been proposed for it. Such techniques typically follow an encoder-decoder architecture and use a sequence of feature vectors as the intermediate representation. In this approach, useful 2D spatial information in the input image may be lost due to vector-based encoding. In this paper, we formulate scene text recognition as a spatiotemporal sequence translation problem and introduce a novel attention-based spatiotemporal decoding framework. We first encode an image as a spatiotemporal sequence, which is then translated into a sequence of output characters using the aforementioned decoder. Our encoding and decoding stages are integrated to form an end-to-end trainable deep network. Experimental results on multiple benchmarks, including IIIT5k, SVT, ICDAR and RCTW-17, indicate that our method significantly outperforms conventional attention frameworks.
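The abstract's central idea is to attend over a 2D spatial feature map rather than a flattened 1D vector sequence, so that spatial layout is preserved at decoding time. The following is a minimal sketch of such spatial attention, not the authors' implementation; all names, shapes, and the additive-scoring form are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of additive attention
# over a 2D encoder feature map: the decoder state attends to every (h, w)
# position of the grid, preserving spatial information that flattening to
# a 1-D vector sequence would lose.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention(features, state, W_f, W_s, v):
    """features: (H, W, C) encoder map; state: (D,) decoder hidden state.
    Returns a context vector (C,) and the (H, W) attention map."""
    H, Wd, C = features.shape
    flat = features.reshape(-1, C)                  # (H*W, C) grid positions
    scores = np.tanh(flat @ W_f + state @ W_s) @ v  # (H*W,) additive scores
    alpha = softmax(scores)                         # weights over positions
    context = alpha @ flat                          # (C,) weighted sum
    return context, alpha.reshape(H, Wd)

# Toy usage with random parameters (hypothetical sizes)
rng = np.random.default_rng(0)
H, Wd, C, D, A = 4, 8, 16, 32, 24
feats = rng.standard_normal((H, Wd, C))
state = rng.standard_normal(D)
ctx, att = spatial_attention(feats, state,
                             rng.standard_normal((C, A)),
                             rng.standard_normal((D, A)),
                             rng.standard_normal(A))
```

At each decoding step the context vector would be fed, together with the previous character embedding, into the recurrent decoder that emits the next output character.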