AutoML cannot beat humans in situations in which 
extraordinary results are required. This is further 
supported by our finding that, in the cases in which 
AutoML outperforms humans, the margin is rather 
small and, in most of these cases, only one or two 
AutoML tools manage to do so (the average of 2.5 is 
attributable to the first task, where all four AutoML 
tools beat human performance).  
Finally, we want to address the usability of the 
considered AutoML tools and the AutoML benchmark. 
Occasionally, substantial human intervention is 
required to make them work properly. This is 
understandable, as AutoML in general is a relatively 
new field and some of the tools are still at an early 
stage of development. However, it contradicts the idea 
of automated machine learning, and we see great 
potential for improvement regarding stability, 
reliability and range of functionality. 
5 CONCLUSION AND OUTLOOK 
The present work contributes to the state of 
knowledge concerning AutoML performance in text 
classification. Our research interest was two-fold: a 
comparison of performance among AutoML tools, 
and a confrontation of AutoML with human 
performance. The results show that, in most cases, 
AutoML is not able to outperform humans in text 
classification. However, there are text classification 
tasks that AutoML tools can solve equally well or 
better. With automated approaches becoming 
increasingly sophisticated, we expect this disparity to 
shrink in the future. 
We see great potential in the future development of 
dedicated text classification modules within AutoML 
tools. Such modules would further facilitate the use of 
machine learning by beginners and the establishment 
of baselines by advanced users. 
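The kind of workflow such a module could automate can be sketched with a minimal, illustrative model-selection loop. This is a toy stand-in, not the interface of any of the evaluated tools: a unigram naive Bayes text classifier whose smoothing parameter is chosen automatically on held-out data, mirroring how AutoML tools search a configuration space to establish a baseline without manual tuning. The tiny corpus below is invented for illustration.

```python
from collections import Counter
import math

# Invented toy stand-in for an SMS spam corpus; labels: 1 = spam, 0 = ham.
train = [
    ("win a free prize now", 1),
    ("free cash claim your prize", 1),
    ("call now to win cash", 1),
    ("see you at lunch today", 0),
    ("are we still on for dinner", 0),
    ("meeting moved to friday", 0),
]
valid = [
    ("claim your free cash", 1),
    ("lunch on friday then", 0),
]

def train_nb(data, alpha):
    """Unigram naive Bayes with Laplace smoothing strength `alpha`."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in data:
        docs[label] += 1
        counts[label].update(text.split())
    vocab = set(counts[0]) | set(counts[1])

    def predict(text):
        scores = {}
        for label in (0, 1):
            total = sum(counts[label].values())
            # Log-prior plus smoothed log-likelihood of each token.
            score = math.log(docs[label] / sum(docs.values()))
            for word in text.split():
                score += math.log((counts[label][word] + alpha)
                                  / (total + alpha * len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    return predict

def accuracy(model, data):
    return sum(model(text) == label for text, label in data) / len(data)

# The "automated" part: search a small configuration space and keep
# the configuration that performs best on held-out data.
best_alpha, best_model, best_acc = None, None, -1.0
for alpha in (0.1, 0.5, 1.0, 2.0):
    model = train_nb(train, alpha)
    acc = accuracy(model, valid)
    if acc > best_acc:
        best_alpha, best_model, best_acc = alpha, model, acc

print(best_alpha, best_acc)
```

Real AutoML tools search far larger spaces (algorithms, embeddings, ensembles), but the principle of replacing manual trial-and-error with an automated search over validated configurations is the same.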
In the future, we will focus on investigating the 
impact of different pre-processing techniques for text 
(including more embedding types) on AutoML 
performance. Evidently, there are further AutoML 
tools that should be evaluated as well.  
Furthermore, testing AutoML on other NLP tasks 
such as named entity recognition is an interesting 
topic for further research. Additionally, we will 
analyse the performance of commercial cloud 
services that come with ready-to-use text 
classification functionality. 