August 10, 2021

University of Pécs enables text and speech processing in Hungarian, builds a BERT-large model for just 1,000 euros with Azure

Everyone prefers to use their mother tongue when communicating with chat agents and other automated services. However, for languages like Hungarian, spoken by only about 15 million people, the market is often viewed as too small for large companies to create software, tools, or applications that can process the language. Recognizing this need, the Applied Data Science and Artificial Intelligence team at the University of Pécs decided to step up. Using Microsoft AI solutions and ONNX Runtime, it built and trained its own Hungarian BERT-large model in under 200 hours, at a total build cost of under 1,000 euros.

University of Pécs

Since partnering with Microsoft’s AI Knowledge Center in 2019, the University of Pécs has been focusing on artificial intelligence and cloud-based education. It was looking to create natural language processing (NLP) applications that could process large amounts of Hungarian-language data. The solution came in the form of a Hungarian-language BERT-large model, HILBERT. BERT is an open-source machine learning framework designed to help computers understand the meaning of ambiguous language in text by using the surrounding words to establish context.
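To illustrate what this means in practice, the short sketch below runs masked-word prediction with the Hugging Face transformers library: the model ranks candidate words for a blanked-out position using the surrounding sentence. The model identifier is a placeholder, since the story does not name a published HILBERT checkpoint.

```python
# Minimal sketch of BERT-style masked-word prediction.
# "university-of-pecs/hilbert-large" is a placeholder model ID,
# not a confirmed public checkpoint for HILBERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="university-of-pecs/hilbert-large")

# The model uses the surrounding words to rank candidates for [MASK].
for candidate in fill_mask("Budapest Magyarország [MASK]."):  # "Budapest is Hungary's [MASK]."
    print(candidate["token_str"], round(candidate["score"], 3))
```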

The team decided to use the Microsoft AI Solutions environment to create their own Hungarian BERT-large model. “Microsoft holds the leading position in training language models. Naturally, we wanted to work with the top technology for this project,” says Robert Hajdu, Artificial Intelligence Architect, Applied Data Science and Artificial Intelligence Centre at the University of Pécs. The team’s existing familiarity with Microsoft AI Solutions was another reason to choose it for this project.

The team wanted to ensure that they had high-quality text to train their Hungarian-language BERT-large model. Instead of collecting low-quality data from the internet, they sought the help of the Research Institute for Linguistics to prepare the corpus. Researchers from the institute annotated the corpus and checked it based on their knowledge of the language.

“As we are a small group, we didn’t want to invest in expensive hardware. Instead, we accessed resources and working hardware through Azure. It made everything easier and faster for us.”

Robert Hajdu, Artificial Intelligence Architect, Applied Data Science and Artificial Intelligence Centre, University of Pécs

Accelerating machine learning training

For training the model, the team wanted a fast and cost-effective solution. University of Pécs opted for Microsoft’s ONNX Runtime library with DeepSpeed to train the model and ran it on the Azure Machine Learning (AML) platform. The AML platform allowed them to build, deploy, manage, and track their AI models effectively and efficiently. This freed the team to focus on other things such as data processing. “Azure Machine Learning makes it very easy to train and prepare data sets and create textual sources,” says Hajdu. Although the team was working from home due to the pandemic, they faced no issues throughout the BERT-large training process on Azure. 
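As a rough illustration of this workflow, the sketch below submits a training script to an Azure Machine Learning GPU cluster using the Azure ML Python SDK (v1-style API). The workspace, cluster, environment, and script names are hypothetical; the team’s actual job configuration is not described in this story.

```python
# Hypothetical sketch: submitting a pre-training script to an Azure
# Machine Learning compute cluster (Azure ML Python SDK v1 style).
# All names below (cluster, experiment, script, environment) are placeholders.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads a local config.json describing the workspace

run_config = ScriptRunConfig(
    source_directory="./src",      # folder containing the training code
    script="pretrain_bert.py",     # hypothetical training entry point
    compute_target="gpu-cluster",  # name of a multi-GPU AML compute cluster
    environment=Environment.get(ws, name="pytorch-gpu-env"),  # placeholder environment
)

run = Experiment(ws, name="hilbert-pretraining").submit(run_config)
run.wait_for_completion(show_output=True)  # stream logs while the job runs
```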

“As we are a small group, we didn’t want to invest in expensive hardware,” explains Hajdu. “Instead, we accessed resources and working hardware through Azure. It made everything easier and faster for us.” 

“The training was completed in 200 hours, and it is the cheapest BERT-large in the world, built for under 1,000 euros,” shares a proud Dr. Ádám Feldmann, Head of the Data Science and AI research group at the University of Pécs. “Without ONNX Runtime, training our HILBERT-large model would have taken 1,500 hours, which is approximately two months.”

“The training was completed in 200 hours, and it is the cheapest BERT-large in the world, built for under 1,000 euros.”

Dr. Ádám Feldmann, Head of Data Science and AI research group, University of Pécs

Data preparation:

1. A corpus of at least 3.5 billion words of running text, as required for the BERT model, was compiled from the following sources:

  • MNSZ: the Hungarian National Corpus, prepared at the Research Institute for Linguistics, containing six text styles (press, fiction, scientific, official, personal, spoken) and divided into five regional language varieties.
  • Subcorpus: covering selected material from MR1 Kossuth Radio from 2004 to 2012.
  • JSI: news collected from internet sources by the Slovenian Jožef Stefan Institute since 2013.
  • NOL: Népszabadság online material received from Mediaworks.
  • Ancestor: the Hungarian part of the freely accessible subtitle database opensubtitles.org.
  • KM: a significant amount of text from public social media posts and comments, provided by Neticle Ltd.
  • Vocabulary: for the internal representation of words, the model uses a vocabulary in which words are statistically broken down into subword units, in extreme cases down to individual characters.

2. The raw dataset was cleaned so that only alphanumeric characters and punctuation remained.

3. Each sentence was placed on its own line.

4. Entries such as paragraphs or articles were separated by a blank line.

5. The following steps were taken to bring the dataset into a form compatible with the training script:

  • Formatting
  • Tokenizing
  • Filtering
  • Creating vocab
  • Sharding
  • Creating binaries

6. To use the converted files for training, they were uploaded to Azure Blob Storage (see the sketch after this list).
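A rough Python sketch of steps 2-4 and 6 is shown below. The character filter, the naive sentence splitter, and the storage names are illustrative assumptions; the team’s exact cleaning rules and tooling are not published in this story.

```python
# Illustrative sketch of data-preparation steps 2-4 and 6.
# The cleaning rules, sentence splitter, and storage names are
# assumptions, not the team's actual pipeline.
import re

from azure.storage.blob import BlobServiceClient

# Step 2: keep only word characters, whitespace, and common punctuation.
DISALLOWED = re.compile(r"[^\w\s.,;:!?()\"'-]", re.UNICODE)

def clean(text: str) -> str:
    return DISALLOWED.sub("", text)

# Steps 3-4: one sentence per line, entries separated by a blank line.
# A naive splitter; a real pipeline would use a proper sentence segmenter.
def to_training_format(documents):
    blocks = []
    for doc in documents:
        sentences = re.split(r"(?<=[.!?])\s+", clean(doc).strip())
        blocks.append("\n".join(s for s in sentences if s))
    return "\n\n".join(blocks)

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write(to_training_format(["Első mondat. Második mondat.", "Új bekezdés."]))

# Step 6: upload the prepared file to Azure Blob Storage.
service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
blob = service.get_blob_client(container="training-data", blob="corpus.txt")
with open("corpus.txt", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```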

Training: 
For training the AI model, the University of Pécs used the ONNX Runtime-based solution with DeepSpeed training optimizations, which was the fastest and cheapest option available to the team. The training ran on Azure Machine Learning, leveraging cost-effective multi-GPU clusters.

With AML tools, the team was able to build an integrated environment with several acceleration solutions, which made it possible to complete BERT-large training in a short timeframe.
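The pattern below sketches one documented way to combine the two libraries: wrap a PyTorch model in ONNX Runtime’s ORTModule so ONNX Runtime handles the forward and backward computation, then hand it to DeepSpeed for distributed training. Whether this matches the team’s exact setup is an assumption, and all configuration values are placeholders.

```python
# Hedged sketch: ONNX Runtime's ORTModule combined with DeepSpeed.
# Configuration values are placeholders; the team's actual settings
# are not published in this story.
import deepspeed
from torch_ort import ORTModule           # from the torch-ort package
from transformers import BertConfig, BertForPreTraining

model = BertForPreTraining(BertConfig())  # BERT-large would use a larger config
model = ORTModule(model)                  # ONNX Runtime runs forward/backward passes

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},            # mixed precision to cut memory and time
    "zero_optimization": {"stage": 1},    # ZeRO stage-1 optimizer state partitioning
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Per batch: loss = engine(**batch).loss; engine.backward(loss); engine.step()
```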

New opportunities and perspectives

The University of Pécs’ Hungarian BERT-large model is set to create significant opportunities in text and speech processing, intelligent search, entity detection, and document classification. HILBERT will also be useful in building robust, high-performance chatbots and dialog agents. This will help give Hungarians access to relevant, easy-to-understand information, especially in the fight against misinformation about COVID-19. Several stakeholders in the healthcare and government sectors have already shown interest in the HILBERT-large model.

The University of Pécs’ team has several use cases in mind for HILBERT-large, such as:

  • A search engine
  • A tool for named entity recognition
  • A Q&A engine
  • Text classification
  • An extractive summarizer
  • An abstractive summarizer
  • Text anonymization

“Our plan is to build a Generative Pre-trained Transformer 2 (GPT-2) or Russian GPT-3 model, also with ONNX Runtime,” said Feldmann. “These models will also be based on Microsoft solutions.” Hungarian NLP will go a long way toward supporting the scientific and research community, as well as the public at large.
