This post summarizes our journey through 2+ years of building an accurate and scalable model that doesn’t require a fortune to train or use in inference.
This is not a guide on how to train or deploy ML models. Instead, it’s a guide of guides: a collection of the findings we made along the road, and where we made them, shared with the community in the hope of making the journey shorter for others.
Sentiment analysis is the task of analyzing a piece of text (a document, article, or sentence) and classifying the polarity of its content. It has long been a prominent task in NLP and has been used to benchmark models’ accuracy, as it can be defined with clear metrics and has real-life applications in many industries.
Historically, sentiment analysis was solved with traditional ML methods like SVMs or Naive Bayes (NB), using either a lexicon that labels words as negative, neutral, or positive, or word embeddings that assign each word a vector representation.
In recent years, advances in deep learning have accelerated progress on sentiment analysis: RNNs, LSTMs, and more recently Transformer models have come to dominate NLP and have been shown to achieve better results, especially on challenging data.
At AIM Technologies our data sources are very diverse: they vary in length, ranging from long articles to short tweets, and in dialect (MSA, Egyptian, Maghrebi, Gulf, Levantine).
Using deep learning models rather than lexicon-based ones was essential for handling the dialectal Arabic text coming from social media.
DL models have the advantage of understanding not only words but also their context, allowing them to handle words with multiple meanings or new words they have never seen before.
Our models need to understand the nuances of the Arabic language in all of its forms, whether MSA (Modern Standard Arabic, used in formal mediums like news websites) or dialectal Arabic (used on social media, and which can differ greatly from one Arab country to another). Our data also varies by source: social media, reviews from websites, and news articles. Our training/testing data therefore needs to contain samples of all of these forms. We had two types of data (unlabeled and labeled); we will cover why later in the post. Our unlabeled training data contained more than 9 billion words from a multitude of domains and even historical periods.
For labeled data, we labeled our own training/dev/test sets in-house. This allowed us to ensure the quality of the data and its labels. We managed to collect data that spans all the types of text we would encounter in real life, such as social media (Facebook, Twitter, YouTube, Instagram) and reviews of products, books, and hotels, in all Arabic dialects, while also ensuring the data is representative of the real-life split of sentiments (positive, negative, neutral).
A big challenge is keeping our data up to date, especially for social media: ensuring not only that our data captures the latest trends but also that our users’ feedback on model predictions is incorporated into future versions.
Text Preprocessing & Tokenization:
Text sourced from dialectal social media can be messy, and users write in many different forms. Normalizing these differences goes a long way toward making the model’s life easier and significantly reducing our vocabulary.
Some highlights of our preprocessing pipeline include:
Normalizing similar characters for example: (أ,إ,ا) should all be (ا).
Removing tashkeel for example (“وَصيَّة”) should be (“وصية”).
Normalizing mentions and links to a standard form for example: (@vodafone سعر الباقة كام؟) should be (XmentionX سعر الباقة كام؟).
Removing unnecessary or repeated punctuation or characters for example: (!!! جداااااا) should be (! جدا).
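As a sketch, the normalization steps above can be expressed with a few regular expressions. The patterns below follow the examples given (the placeholder names like XmentionX come from the text); the exact rules of our production pipeline are more involved, so treat this as an illustrative assumption:

```python
import re

# Illustrative normalization rules (not the exact production pipeline)
TASHKEEL = re.compile(r"[\u064B-\u0652]")            # diacritics incl. shadda/sukun
ALEF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")  # أ إ آ -> ا
MENTION = re.compile(r"@\w+")
LINK = re.compile(r"https?://\S+")
REPEATED = re.compile(r"(.)\1{2,}")                  # 3+ repeats of any char -> 1

def normalize(text: str) -> str:
    text = TASHKEEL.sub("", text)                # وَصيَّة -> وصية
    text = ALEF_VARIANTS.sub("\u0627", text)     # normalize to plain alef
    text = MENTION.sub("XmentionX", text)
    text = LINK.sub("XlinkX", text)
    text = REPEATED.sub(r"\1", text)             # !!! جداااااا -> ! جدا
    return text

print(normalize("@vodafone سعر الباقة كام؟"))    # XmentionX سعر الباقة كام؟
```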
Many parts of text preprocessing are common across languages, but there are also language-specific parts that require familiarity with the language and the writing styles of your sourced text.
Arabic poses an interesting challenge: words like (أكل, أكلة, أكلت) should ideally all be mapped to the same word (اكل). This can be done in multiple ways; a popular method is lemmatization (a popular Arabic lemmatizer is Farasa). Lemmatization can remove prefixes and suffixes, but as we noted before, dialectal Arabic doesn’t adhere to many of the rules of Arabic grammar, so a word like (ماقولتلهومش) wouldn’t be handled by a traditional lemmatizer.
That’s where SentencePiece comes in:
Quoting from their GitHub Readme:
SentencePiece is an unsupervised text tokenizer that implements subword units by training directly from raw sentences.
We use a unigram language model for subword units, trained on Wikipedia + dialectal Arabic, that learns a vocabulary of tokens together with their probability of occurrence. It assumes that tokens occur independently (hence the unigram in the name). During tokenization, this method finds the most probable segmentation into tokens from the vocabulary.
Subwords more easily represent inflections, including common prefixes and suffixes, and are thus well-suited for morphologically rich languages.
Subword tokenization is a good fit for open-vocabulary problems and eliminates out-of-vocabulary tokens, as coverage is close to 100%.
Using SentencePiece, we can tokenize the words (ماقولتش, ماقولتلهومش) so that both share the subword (ماقولت).
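To illustrate the “most probable segmentation” idea behind the unigram model, here is a toy pure-Python sketch. The English vocabulary and its probabilities are invented for the example; real SentencePiece learns both from the corpus:

```python
import math

# Toy vocabulary of subword pieces with made-up probabilities
VOCAB = {"un": 0.1, "related": 0.05, "unrelated": 0.2, "re": 0.1,
         "lated": 0.02, "u": 0.01, "n": 0.01}

def segment(word):
    # best[i] = (log-prob, tokens) for the best segmentation of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in VOCAB and best[j][1] is not None:
                # unigram assumption: pieces are independent, so log-probs add
                score = best[j][0] + math.log(VOCAB[piece])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[len(word)][1]

print(segment("unrelated"))  # ['unrelated'] (the whole word is more probable)
```

Because the whole word is in the toy vocabulary with a higher probability than any split, the segmenter keeps it as one piece; an unseen word would instead be broken into known subwords.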
For our model, we opted for the Universal Language Model Fine-tuning (ULMFiT) architecture. You can read the ULMFiT paper, or read these excellent blog posts [Understanding building blocks of ULMFIT, ULMFiT State-of-the-Art in Text Analysis] detailing how it works.
We won’t go into detail about ULMFiT, but in summary, it is an architecture and a transfer-learning method that can be applied to NLP tasks. It uses the AWD-LSTM architecture for its representations.
ULMFIT Architecture source: https://humboldt-wi.github.io/blog/research/information_systems_1819/group4_ulmfit
ULMFiT has proved to match or exceed Transformer models on text classification, which we will cover later in the post.
We used FastAI to train the model in four steps:
1- Creating a tokenizer:
We train a SentencePiece tokenizer on a mix of MSA and dialectal Arabic and use it in all subsequent steps.
2- Language Model pretraining:
LM pretraining is like teaching the model how to understand language: we provide the model with a huge corpus of text, and it has to learn to predict the next word given the words that came before it.
After this step the model should be able to generate coherent text given a prompt:
So a prompt like:
عمرو دياب هو مطرب مصري
might be completed as:
عمرو دياب هو مطرب مصري الجنسيه. ولد في 26 اب اغسطس 1948 م
ودرس في جامعه القاهره وتخرج منها سنه 1973 م
وحصل علي شهاده الليسانس من مدرسه الفنون الجميله بمرتبه الشرف في نفس السنه
The text is factually incorrect but it’s coherent and adheres to Arabic grammar rules.
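As a toy illustration of the next-word objective (the real model is an AWD-LSTM neural network; this is just counting bigrams in an invented English corpus):

```python
from collections import Counter, defaultdict

# Count which word follows which in a tiny toy corpus
corpus = "the cat sat on the mat the cat ate".split()
nxt = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    nxt[a][b] += 1

def predict(word):
    # return the most frequently observed next word
    return nxt[word].most_common(1)[0][0]

print(predict("the"))  # 'cat' (follows "the" twice, vs "mat" once)
```

A neural LM does the same thing in spirit, but generalizes from representations instead of raw counts, which is what lets it complete prompts it has never seen verbatim.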
For the LM pretraining step, using only longer MSA text (Wikipedia articles, books) proved to give much better results than using all of our data; social media and dialectal text are very noisy and short, which makes it hard for the model to learn useful representations for each token.
But our model still needs to see the other forms of data we have; we address that in the next step.
3- Finetuning with unlabeled data:
Fine-tuning the LM with our dialectal data is where the model gets to know our other forms of data. It still has the same task (predicting the next word), but to mitigate the risk of catastrophic forgetting (the model forgetting what it learned in the pretraining step) we use gradual unfreezing.
Rather than fine-tuning all layers at once, ULMFiT gradually unfreezes the model starting from the last layer, as this contains the least general knowledge. First, the last layer is unfrozen and all unfrozen layers are fine-tuned for one epoch. Then the next frozen layer group is unfrozen and fine-tuned, and this repeats until all layers are fine-tuned to convergence in the last iteration.
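The unfreezing schedule described above can be sketched as follows (the function and field names are illustrative, not FastAI’s actual API):

```python
# Sketch of gradual unfreezing: layer_groups[0] is closest to the input
# (most general knowledge); the head is last (least general).
def gradual_unfreeze(layer_groups, train_one_epoch):
    n = len(layer_groups)
    for k in range(1, n + 1):
        # unfreeze the last k groups, keep the earlier ones frozen
        for i, group in enumerate(layer_groups):
            group["trainable"] = i >= n - k
        train_one_epoch([g for g in layer_groups if g["trainable"]])

# Record which groups get trained at each stage
schedule = []
groups = [{"name": f"layer{i}", "trainable": False} for i in range(3)]
gradual_unfreeze(groups, lambda gs: schedule.append([g["name"] for g in gs]))
print(schedule)
# [['layer2'], ['layer1', 'layer2'], ['layer0', 'layer1', 'layer2']]
```

In FastAI this is driven by the learner’s freeze/unfreeze controls over layer groups; the sketch only shows the schedule’s shape.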
The output of this step is a language model that knows how to represent words in both MSA and dialectal Arabic.
From this step, we export the encoder part (the embedding layer and the 3 LSTMs); this is the part that contains the weights learned during training.
4- Training a classifier with labeled MSA and dialectal data:
The exported encoder from the previous step is topped with 2 fully connected layers. This bootstraps our classifier: the model already knows how to represent words, so it only has to learn how to associate labels with sentences by training on our labeled data.
As stated earlier, Transformer models have recently achieved SOTA results on many NLP tasks, so naturally we compared this ULMFiT model to our in-house Transformer models (DistilBERT, RoBERTa, ELECTRA). Surprisingly, the best of them (the RoBERTa model) scored 1% lower than our ULMFiT model. This has also been observed by multiple researchers: https://twitter.com/jeremyphoward/status/1222157485085089793.
We believe that Transformer models are more suitable for tasks that require deeper understanding of the language; our in-house NER model was a testament to the power of Transformer models (more in a coming blog post).
We believe that ULMFiT strikes an excellent balance between accuracy and complexity, allowing us to have a SOTA model that can be trained on a single modern GPU in less than 12 hours.
In the next blog post we will cover how we optimized our model to reduce its CPU inference time by 70%.
AIM Technologies is the first Middle East-based customer experience platform to introduce a multi-lingual text analytics solution with the world’s highest accuracy in the Arabic language, and the first end-to-end automated customer research tool. AIM Technologies’ vision is to harness the power of AI to build a fully automated customer insights and actions platform, helping brands enhance their customers’ experience.
To learn more about the products we’ve built using our AI models, you can reach out to us here.
We are also hiring for our Data Science team; feel free to send us your CV at firstname.lastname@example.org with the subject "DS candidate".