1 Abstract

In this report, we apply text mining methods to predict company ratings from Glassdoor reviews.

First, we treat the positive and negative parts of each review separately. As part of our preprocessing, we apply techniques such as stopword removal and lemmatization, and we perform an exploratory analysis.

Second, through the application of the Latent Dirichlet Allocation (LDA) method, we observe that it is difficult to distinguish which tokens are specific to each topic, and that no topic is specific to any single company.

Finally, we conclude our work with supervised learning methods and perform a grid search to find the best model and hyperparameters for predicting the rating. After comparison, we conclude that the best model is the one with the lowest RMSE: a random forest using LSA on the TF-IDF matrix, with 30 topics for the positive part of the reviews and 20 for the negative part.

2 Introduction

As part of a Text Mining course taught at HEC Lausanne by Professor Marc-Olivier Boldi, we carry out the following project: predicting company ratings from Glassdoor reviews.

Glassdoor is a website where current and former employees anonymously evaluate their salary, work environment and company. As the reference for information that companies do not make public, the website has a large audience; current and former employees are very active on the platform, so it hosts a large number of reviews of different companies.

The aim of our project is to predict the company ratings of five banks (JP Morgan, Deutsche Bank, TD, HSBC Holdings and UBS) based on reviews from the Glassdoor website. We want to see whether there are differences and similarities between the five banks, in order to determine whether some factors are more responsible for the success of one company than another. Are there patterns that human resources departments could use to increase competitiveness? Are there specific characteristics that recur across the industry? In addition, knowing how Glassdoor reviews work can give us insights into the job market and, above all, into corporate culture, which matters to us since we are currently looking for internships and would like to join a good company.

To achieve our objective, we apply different techniques to extract valuable information from the texts. After performing an exploratory analysis, we carry out a semantic analysis. We then fit a topic model using the Latent Dirichlet Allocation (LDA) method. Finally, we frame rating prediction as a regression task and compare machine learning models to find the one that predicts the rating best.

3 Data

3.1 Data Acquisition

Our process of data acquisition was composed of the following steps:

  1. First, we identified the HTML tags containing the information relevant to this analysis and prepared a script (scripts/glassdoor-html) in charge of extracting them. We then combined all of the reviews for a company into a single tibble. We also tried to account for any parsing errors, as the website's HTML tags were updated regularly.

  2. Next, we used another script (scripts\web-scraper) and created a vector with the URLs of the five desired companies. Once the minimum number of required reviews was defined (5,010 per company), we looped over each review page and combined the results for the five banks into a single tibble. To avoid issues when accessing the website repeatedly (i.e. being blocked), we defined an rnorm_sleep_generator that introduces random pauses between visits, replicating a human's browsing behavior.

  3. Finally, we removed any duplicates, parsed the review times and finished by writing a CSV file with the banks' processed reviews. A sample of the website and its tags during the scraping process can be found in data\glass-door-sample.html.
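The pause logic in step 2 can be sketched as follows. The report's scraper is written in R; this is only an illustrative Python equivalent of the rnorm_sleep_generator idea, and the mean, standard deviation and floor are made-up values.

```python
import random
import time

def rnorm_sleep_generator(mean=4.0, sd=1.5, minimum=1.0):
    """Draw a pause length (in seconds) from a normal distribution,
    floored at `minimum`, to mimic a human reader's irregular pace."""
    return max(minimum, random.gauss(mean, sd))

def polite_get(fetch_page, url):
    """Sleep a random amount before each request to avoid being blocked."""
    time.sleep(rnorm_sleep_generator())
    return fetch_page(url)
```

A fixed sleep would be easy for the server to detect; drawing from a distribution makes the request pattern look more like a person browsing.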

3.2 Preprocessing

3.2.1 Definitions

Corpus = All reviews on Glassdoor for five major banks

Text = The reviews for each bank fall into this category. In the EDA section, we used the five banks as documents; in the unsupervised and supervised learning parts, however, we consider each individual review rather than aggregating them.

Tokens = Words in each review

3.2.2 Data structure

page review_id company review_title employee_role employee_history employer_pros employer_cons employer_rating work_life_balance culture_values diversity_inclusion career_opportunities compensation_and_benefits senior_management review_time
1 empReview_38007833 J.P. Morgan Cool place Current Employee - Vice President I have been working at J.P. Morgan full-time All Benefits are very good! Diversity within certain lines of business 4 5 4 3 4 4 NA 2020-11-04 18:57:34
1 empReview_38038804 J.P. Morgan Fun Former Employee - Teller I worked at J.P. Morgan part-time for less than a year It was a great learning experience No cons really it was good 4 4 4 NA 4 3 4 2020-11-05 12:08:31
1 empReview_37961239 J.P. Morgan Great workplace Former Employee - Analyst I worked at J.P. Morgan full-time smart people, good hours, good pay bureaucracy, not challenging enough, very corporate 5 NA NA NA NA NA NA 2020-11-03 14:17:03
1 empReview_37986045 J.P. Morgan Great bank Current Employee - Personal Banker II I have been working at J.P. Morgan full-time for more than a year Great commissions and opportunity to grow. You can get switched to another branch out of nowhere. 4 3 5 5 5 5 NA 2020-11-04 07:41:18
1 empReview_37926635 J.P. Morgan Fantastic company, wanting a change Current Employee - Trial Consultant I have been working at J.P. Morgan full-time positive work environment and potential for increased salary Stressful work conditions and not enough review on performance and room for improvement 5 NA NA NA NA NA NA 2020-11-02 20:33:56


The dataset covers 5 banks: J.P. Morgan, TD, UBS, HSBC Holdings and Deutsche Bank. We also note that the employer rating is always filled in, but this is not true of the other ratings, such as work-life balance, which employees may choose not to answer.

3.2.3 Checking for NAs and duplicates

Table 3.1: Overview of the missing values
Number of missing values %
diversity_inclusion 24155 98
culture_values 5596 23
senior_management 4900 20
work_life_balance 4115 17
compensation_and_benefits 4110 17
career_opportunities 4096 17
employee_role 2749 11
review_title 29 0
page 0 0
review_id 0 0
company 0 0
employee_history 0 0
employer_pros 0 0
employer_cons 0 0
employer_rating 0 0
review_time 0 0


As mentioned above, not all ratings are mandatory; for example, the diversity and inclusion rating is mostly left blank. This indicates that we won't be able to use all of these variables for further classification tasks.


For unknown reasons, some reviews appear to have been duplicated, approximately 429 of them. We remove these instances from our dataset to ensure its quality.
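Deduplication of this kind takes only a few lines; the report's pipeline is in R, so the pandas snippet below, with a hypothetical four-row frame, is purely illustrative.

```python
import pandas as pd

# Hypothetical frame in which empReview_2 was scraped twice.
reviews = pd.DataFrame({
    "review_id": ["empReview_1", "empReview_2", "empReview_2", "empReview_3"],
    "employer_rating": [4, 5, 5, 3],
})

# Keep the first occurrence of each review_id and drop the repeats.
deduplicated = reviews.drop_duplicates(subset="review_id", keep="first")
```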

3.2.4 General overview

Table 3.2: Overview of average scores
Bank name Rating Work life balance Culture values Diversity inclusion Career opportunities Compensation and benefits Senior management Number of reviews
Deutsche Bank 3.39 3.43 3.24 3.82 3.22 3.35 2.87 4893
HSBC Holdings 3.53 3.57 3.49 3.78 3.31 3.53 3.01 4900
J.P. Morgan 3.77 3.43 3.62 4.05 3.63 3.74 3.25 4839
TD 3.55 3.32 3.73 4.39 3.46 3.45 3.11 4792
UBS 3.44 3.43 3.32 3.88 3.21 3.29 2.98 4917
Table 3.3: Pros and cons review word length: comparison between banks
metric Deutsche Bank HSBC Holdings J.P. Morgan TD UBS
employer_cons_1st_Qu. 6.0 6.0 6.0 6.0 7.0
employer_cons_3rd_Qu. 22.0 16.0 17.0 23.0 24.0
employer_cons_Mean 21.0 17.0 18.0 23.6 21.0
employer_cons_Median 10.0 8.0 9.0 10.0 11.0
employer_pros_1st_Qu. 6.0 6.0 6.0 6.0 6.0
employer_pros_3rd_Qu. 17.0 12.0 12.0 16.0 20.0
employer_pros_Mean 13.9 11.9 11.3 13.9 14.8
employer_pros_Median 9.0 7.0 8.0 8.0 9.0


Regarding the number of words per review for each bank, we cannot identify any interesting patterns. Looking at the median and mean number of words in the cons reviews, we do not find results that mirror the scores. It would have been interesting to see longer cons reviews correlated with low scores, as angry employees might emphasize the company's negative points over the positive ones, but this is not the case here. We can use this information later in the modeling part to work only with reviews that are long enough, rather than a few perfunctory words, enriching our analysis.


Both graphs above clearly show that the number of words per review is concentrated around 5-6 words. This result is not surprising, as every new Glassdoor member must write pro and con reviews of a minimum length to validate their account, and most of them write very short reviews.

4 Exploratory Analysis

We begin with a general EDA of the reviews. Note that here we treat each company as a document. Our dataset is composed of 23,341 reviews.

4.1 Tokenization

Text mining results are largely affected by the tokenization process. After testing different options, we decided to use lemmatization as the most promising one.

We create the function EDA_handler, which handles every aspect of tokenization: the text is split into word tokens, stopwords are removed, and all tokens are lowercased so that case is ignored.
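As a rough illustration of what EDA_handler does (the actual function is written in R), the steps can be sketched in Python; the stopword set and lemma map below are tiny toy stand-ins for the real resources.

```python
# Toy stopword list and lemma map; the real pipeline uses full resources.
STOPWORDS = {"the", "a", "and", "at", "are", "is", "of", "to"}
LEMMAS = {"benefits": "benefit", "hours": "hour", "managers": "manager"}

def eda_handler(text):
    # Strip punctuation, lowercase, drop stopwords, then lemmatize.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]

eda_handler("The benefits and hours are great!")
# -> ['benefit', 'hour', 'great']
```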

4.2 Wordclouds

4.2.1 Wordcloud for Pros

First, we generate the wordcloud for the most common words found in the employer_pros column.

4.2.2 Wordcloud for Cons

Next, we can do the same for the employer_cons column.

4.2.3 Most frequent Pro words

The graph below shows the most frequent words per company, since the grouping was done on the company rather than on individual reviews. Regarding the pros, the words benefit and people come up most often; it is also interesting to see that culture and life appear, perhaps indicating that most people care about these values when describing work positively.

4.2.4 Most frequent Cons Words

Most of the cons are associated with the word management, followed by employees, hour and time. Furthermore, people appears among the cons as well, meaning that we have to put our analysis into context and use valence shifters to see whether reviewers mention something good or bad about these people.

4.3 Semantic Analysis

First, we perform a sentiment analysis with two dictionaries, NRC and AFINN. Then, we introduce valence shifters in the subsequent part.

4.3.1 NRC & AFINN

#> # A tibble: 13,901 x 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 abacus      trust    
#>  2 abandon     fear     
#>  3 abandon     negative 
#>  4 abandon     sadness  
#>  5 abandoned   anger    
#>  6 abandoned   fear     
#>  7 abandoned   negative 
#>  8 abandoned   sadness  
#>  9 abandonment anger    
#> 10 abandonment fear     
#> # ... with 13,891 more rows
#> # A tibble: 2,477 x 2
#>    word       value
#>    <chr>      <dbl>
#>  1 abandon       -2
#>  2 abandoned     -2
#>  3 abandons      -2
#>  4 abducted      -2
#>  5 abduction     -2
#>  6 abductions    -2
#>  7 abhor         -3
#>  8 abhorred      -3
#>  9 abhorrent     -3
#> 10 abhors        -3
#> # ... with 2,467 more rows


4.3.1.1 Plot of NRC dictionary

The sentiment profiles of the banks are very close to one another under the NRC method. Unfortunately, we won't be able to use the NRC dictionary for classification.

4.3.1.2 Plot of AFINN dictionary


Using the AFINN dictionary, the sentiment analysis looks more promising, which really emphasizes the importance of choosing the right dictionary. AFINN contains over 3,300 words, each with an associated polarity score.
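AFINN-style scoring is just a lexicon look-up followed by a sum. A minimal sketch, with a toy four-word lexicon standing in for the full AFINN dictionary:

```python
# Toy stand-in for the AFINN lexicon (word -> polarity score).
AFINN = {"great": 3, "good": 3, "bad": -3, "stressful": -2}

def afinn_score(tokens):
    # Words outside the lexicon contribute 0.
    return sum(AFINN.get(t, 0) for t in tokens)

afinn_score(["great", "pay", "but", "stressful", "hours"])  # -> 1
```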

4.3.2 More advanced methods with valence shifters

Table 4.1: Employer rating: Comparison with 2 sentiment analysis methods
Bank name Actual employer rating Valence shifter score Afinn sentiment score
J.P. Morgan 3.77 0.563 1.162
TD 3.55 0.522 1.086
HSBC Holdings 3.53 0.527 1.237
UBS 3.45 0.512 1.212
Deutsche Bank 3.40 0.499 0.962

We obtain a ranking in which the sentiment corresponds to the actual score 4 out of 5 times. More importantly, it highlights the differences between the scores, with J.P. Morgan clearly in the lead. Consequently, this valence-shifter score can be used as a feature in the supervised learning part, as it seems to carry a lot of relevant information for predicting the rating of a review.
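The advantage of valence shifters over a plain lexicon is that negators and amplifiers modify the polarity of nearby words. The following toy sketch, a much simplified stand-in for the sentimentr-style logic used in the report, illustrates the idea:

```python
# Tiny toy lexicons; real valence-shifter tables are much larger.
POLARITY = {"good": 1.0, "bad": -1.0}
NEGATORS = {"not", "no", "never"}
AMPLIFIERS = {"very": 1.8, "really": 1.8}

def valence_score(tokens):
    score, weight = 0.0, 1.0
    for t in tokens:
        if t in NEGATORS:
            weight *= -1.0            # flip polarity of the next polar word
        elif t in AMPLIFIERS:
            weight *= AMPLIFIERS[t]   # intensify the next polar word
        elif t in POLARITY:
            score += weight * POLARITY[t]
            weight = 1.0              # shifters only affect the nearest word
    return score

valence_score(["not", "very", "good"])  # -> -1.8
```

A plain AFINN look-up would score "not very good" as positive; the shifter-aware version correctly flips and amplifies it.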

4.4 Job positions analysis

In order to build the most accurate classifier, we need to look at every available variable. During the scraping process, we also managed to extract the job position, so it makes sense to check whether the employee's role has an impact on the company review score.

We also explored the job positions to identify potential patterns. After removing inaccurate job titles, we observe from the top 5 most frequent positions per bank that some positions seem specific to certain banks. For instance, UBS seems to hire a larger number of interns for its operations.

For a more interpretable view, we used LSA techniques to create a two-dimensional biplot of the top 50 job positions. Interestingly, JP Morgan, HSBC and UBS seem to have similar job structures. On the other hand, TD offers more customer-oriented representative positions, whereas Deutsche Bank hires more analysts.

4.5 Compare the reviews in terms of lexical diversity

Lexical diversity can also be an interesting tool for classification. However, a usable result seems unlikely here; the only plausible signal would be that some banks hire more non-native English speakers, whose proficiency in the language may be lower.

We use both Yule's index and the type-token ratio (TTR) to show that neither is a candidate for classification: the two graphs tell very different stories, and we cannot correlate these results with the employer rating.
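For reference, the two diversity measures can be computed as follows (an illustrative Python sketch of the standard formulas; the report's computation is done in R):

```python
from collections import Counter

def ttr(tokens):
    """Type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)

def yules_k(tokens):
    """Yule's K; higher values indicate lower lexical diversity."""
    n = len(tokens)
    # freq_of_freqs[i] = number of distinct words occurring exactly i times.
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 1e4 * (s2 - n) / (n * n)

tokens = ["good", "pay", "good", "team", "good", "hours"]
diversity = (ttr(tokens), yules_k(tokens))
```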

5 Unsupervised Topic Analysis

5.1 Preprocessing

To start our analysis, we tokenized our data and removed the stopwords. To make the analysis relevant, we manually removed some words that, from our point of view, would bias it. The stopword list mainly includes uninformative words, such as quantifiers or company names, that appear in every topic. Defining the stopwords was an iterative process: as we carried out the unsupervised and supervised analyses, we refined the list gradually.

We selected the pros and cons reviews with at least five tokens, because Glassdoor users cannot submit a review of fewer than five words.

5.2 Latent Dirichlet Allocation Analysis

The LDA method represents each topic as a mixture of words, and each document as a mixture of topics. We created two functions, for pros and cons, that display our DFM in an LDA plot.

5.2.1 Dimensions testing

After several trials, we decided to fix the number of topics at 3 and the number of terms at 10 for both pros and cons reviews. A small number of terms on the LDA plots allows us to visualize the topics better, and adding more topics does not bring additional insight.

5.2.2 Topic by word

Here we can see that, for both pros and cons, it is very difficult to distinguish which words are specific to each topic, because the three topics share many common words.

5.2.3 Topic by Company

Here we wanted to see the probability of each text belonging to a topic. To do so, we attached each document to its company and then averaged the gamma values by company, which shows that no company is closer to one topic than another. In fact, this graph emphasizes that using the LDA gammas is irrelevant in this case.

6 Supervised Learning

In this supervised learning part, we apply machine learning techniques to predict the rating of a review as a regression task. We proceeded as follows: we filtered out duplicated entries and computed sentiments separately for pros and cons, which we summed to obtain a total sentiment for each review. We also used the word counts of the pros and cons to obtain the total number of words in a review, and used get_sentences to avoid warnings. Furthermore, after lemmatizing with token_replace, we kept only reviews containing at least 10 tokens. The motivation behind this filter is a preference for quality reviews, which here depends strongly on length, as many users write extremely short reviews just to gain access to the information provided by other users on the website.

Given that we are in a regression context, we employed linear regression and random forest models combined with LSA or word embeddings. These dimensionality-reduction methods were applied on top of either the bag-of-words document-term matrix or the TF-IDF matrix.

For supervised learning, we adopted a different strategy from the unsupervised part. Instead of fixing three topics for both pros and cons, we assigned a separate number of topics to each of the two categories (e.g. 5 topics for the pros and 20 for the cons) in order to achieve better predictions. We computed the RMSE for each method as a function of the number of topics for the positive and negative reviews.
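One cell of this grid can be sketched as follows (a scikit-learn stand-in for the R pipeline; the six synthetic reviews and the dimension pair (2, 2) are placeholders for the real data and the {5, 20, 30, 50} grid):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic placeholder data: one pros text, one cons text and a rating per review.
pros = ["good pay", "great culture", "nice people", "good benefits", "great hours", "good team"]
cons = ["long hours", "poor management", "heavy bureaucracy", "low pay", "bad culture", "slow growth"]
rating = np.array([4, 5, 4, 3, 5, 3])

def lsa_features(texts, n_dims):
    """TF-IDF followed by LSA (truncated SVD) with n_dims dimensions."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    return TruncatedSVD(n_components=n_dims, random_state=42).fit_transform(tfidf)

# One (pros_dims, cons_dims) combination; the grid search loops over all pairs.
X = np.hstack([lsa_features(pros, 2), lsa_features(cons, 2)])
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, rating)
# The report scores on held-out data; training data is reused here only
# to keep the sketch short.
rmse = mean_squared_error(rating, model.predict(X)) ** 0.5
```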

6.1 Document Term Matrix

The following table shows that the lowest RMSE is obtained by using 50 dimensions for the pros and only 5 for the cons. The best result (RMSE = 0.913) is given by a linear model using DTM + LSA.

Table 6.1: Document-term frequency: RMSE for LM and RF
Method Input # of dimensions pros # of dimensions cons Linear model Random forest
LSA DTM 5 5 0.932 0.923
LSA DTM 5 20 0.944 0.919
LSA DTM 5 30 0.944 0.920
LSA DTM 5 50 0.959 0.928
LSA DTM 20 5 0.917 0.920
LSA DTM 20 20 0.929 0.916
LSA DTM 20 30 0.928 0.917
LSA DTM 20 50 0.941 0.921
LSA DTM 30 5 0.920 0.923
LSA DTM 30 20 0.934 0.913
LSA DTM 30 30 0.934 0.918
LSA DTM 30 50 0.948 0.918
LSA DTM 50 5 0.913 0.927
LSA DTM 50 20 0.926 0.919
LSA DTM 50 30 0.926 0.918
LSA DTM 50 50 0.939 0.919

6.2 TF-IDF

The following table shows that the lowest RMSE is obtained by using 30 dimensions for the pros and 20 for the cons. The best result (RMSE = 0.900) is given by a random forest using TF-IDF + LSA.

Table 6.2: TF-IDF: RMSE for LM and RF
Method Input # of dimensions pros # of dimensions cons Linear model Random forest
LSA TFIDF 5 5 0.972 0.907
LSA TFIDF 5 20 0.993 0.907
LSA TFIDF 5 30 1.012 0.911
LSA TFIDF 5 50 1.011 0.913
LSA TFIDF 20 5 0.969 0.901
LSA TFIDF 20 20 0.976 0.902
LSA TFIDF 20 30 0.997 0.906
LSA TFIDF 20 50 0.980 0.904
LSA TFIDF 30 5 0.978 0.904
LSA TFIDF 30 20 0.981 0.900
LSA TFIDF 30 30 1.006 0.910
LSA TFIDF 30 50 0.990 0.907
LSA TFIDF 50 5 0.971 0.919
LSA TFIDF 50 20 0.981 0.914
LSA TFIDF 50 30 1.010 0.916
LSA TFIDF 50 50 0.988 0.910

6.3 Combining the results

The following table ranks all models. The lowest RMSE (0.900) is obtained by a random forest using TF-IDF + LSA with 30 dimensions for the pros and 20 for the cons, the best result of the three approaches.
Table 6.3: MIXED: RMSE for LM and RF
Method Input # of dimensions pros # of dimensions cons Learner RMSE
LSA TFIDF 30 20 RF 0.900
LSA TFIDF 20 5 RF 0.901
LSA TFIDF 20 20 RF 0.902
LSA TFIDF 20 50 RF 0.904
LSA TFIDF 30 5 RF 0.904
LSA TFIDF 20 30 RF 0.906
LSA TFIDF 5 5 RF 0.907
LSA TFIDF 5 20 RF 0.907
LSA TFIDF 30 50 RF 0.907
LSA TFIDF 30 30 RF 0.910
LSA TFIDF 50 50 RF 0.910
LSA TFIDF 5 30 RF 0.911
LSA TFIDF 5 50 RF 0.913
LSA DTM 50 5 LM 0.913
LSA DTM 30 20 RF 0.913
LSA TFIDF 50 20 RF 0.914
LSA TFIDF 50 30 RF 0.916
LSA DTM 20 20 RF 0.916
LSA DTM 20 5 LM 0.917
LSA DTM 20 30 RF 0.917
LSA DTM 30 30 RF 0.918
LSA DTM 50 30 RF 0.918
LSA DTM 30 50 RF 0.918
LSA DTM 50 20 RF 0.919
LSA DTM 50 50 RF 0.919
LSA DTM 5 20 RF 0.919
LSA TFIDF 50 5 RF 0.919
LSA DTM 5 30 RF 0.920
LSA DTM 30 5 LM 0.920
LSA DTM 20 5 RF 0.920
LSA DTM 20 50 RF 0.921
LSA DTM 5 5 RF 0.923
LSA DTM 30 5 RF 0.923
LSA DTM 50 30 LM 0.926
LSA DTM 50 20 LM 0.926
LSA DTM 50 5 RF 0.927
LSA DTM 20 30 LM 0.928
LSA DTM 5 50 RF 0.928
LSA DTM 20 20 LM 0.929
LSA DTM 5 5 LM 0.932
LSA DTM 30 30 LM 0.934
LSA DTM 30 20 LM 0.934
LSA DTM 50 50 LM 0.939
LSA DTM 20 50 LM 0.941
LSA DTM 5 20 LM 0.944
LSA DTM 5 30 LM 0.944
LSA DTM 30 50 LM 0.948
LSA DTM 5 50 LM 0.959
LSA TFIDF 20 5 LM 0.969
LSA TFIDF 50 5 LM 0.971
LSA TFIDF 5 5 LM 0.972
LSA TFIDF 20 20 LM 0.976
LSA TFIDF 30 5 LM 0.978
LSA TFIDF 20 50 LM 0.980
LSA TFIDF 50 20 LM 0.981
LSA TFIDF 30 20 LM 0.981
LSA TFIDF 50 50 LM 0.988
LSA TFIDF 30 50 LM 0.990
LSA TFIDF 5 20 LM 0.993
LSA TFIDF 20 30 LM 0.997
LSA TFIDF 30 30 LM 1.006
LSA TFIDF 50 30 LM 1.010
LSA TFIDF 5 50 LM 1.011
LSA TFIDF 5 30 LM 1.012

6.4 Best model (TF-IDF + LSA)

Here we added the valence-shifter score (the sentiment function mentioned above) and the review lengths to our best model. Despite these new features, the RMSE (0.900) does not improve, and the results obtained previously are all better. However, the model does quite well when it has to predict twos and threes, which could be explained by the fact that more reviews are rated with a two or a three.

6.4.1 Variable importance

Since our best model is a random forest on LSA features, we can extract the variable importances and assess whether a topic is more or less important in accurately predicting the ratings.

Here we can see that the lengths of the pros and cons are important, and that the sentiment score contributes greatly to the prediction. We also observe that cons.3 seems to be an important topic.
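Extracting importances from a fitted random forest is a one-liner; the sketch below uses synthetic data in which only the first feature drives the outcome, standing in for features like cons.3, the review lengths and the sentiment score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                           # e.g. cons.3, pros length, sentiment
y = 3 + 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters here

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = rf.feature_importances_                   # sum to 1; higher = more important
```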

In the table below we investigate the words in cons.3. The two most heavily weighted variables constituting cons.3 are people and team, which have quite similar meanings.

Table 6.4: The most important variable seems related to colleagues
Key words Value Absolute value
people 0.153 0.153
team 0.116 0.116
day -0.094 0.094
service -0.103 0.103
graduate -0.113 0.113
goal -0.120 0.120
open -0.124 0.124
call -0.127 0.127
schedule -0.128 0.128
employee -0.129 0.129
teller -0.132 0.132
store -0.135 0.135
sale -0.210 0.210
branch -0.277 0.277
customer -0.399 0.399

Now let's look at a pro topic. We can clearly see that pros.2 is related to vacation, with words such as day, year, vacation and time. It could also be related to compensation in case of sickness (sick, pay).

Table 6.5: Topic pro 2 - Related to holidays
Key words Value Absolute value
day 0.306 0.306
year 0.218 0.218
week 0.202 0.202
vacation 0.181 0.181
time 0.174 0.174
pay 0.166 0.166
3 0.161 0.161
401k 0.146 0.146
sick 0.129 0.129
4 0.126 0.126
match 0.120 0.120
2 0.119 0.119
5 0.119 0.119
leave 0.118 0.118
opportunity -0.118 0.118

These results may suggest that, in general, employees are satisfied with their vacation and days off, and dissatisfied with internal management methods.

6.5 Word Embedding & Glove

To complete our supervised learning, we used word embeddings with GloVe to model the data in a different way. We computed the RMSE for each combination of numbers of topics for pros and cons; when building our data frame, we took the mean of the word vectors instead of the sum, as it worked better.
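Averaging word vectors into one review-level feature vector can be sketched as follows; the two-dimensional embedding table is a toy stand-in for trained GloVe vectors.

```python
import numpy as np

# Toy embedding table standing in for trained GloVe vectors.
embeddings = {
    "good": np.array([0.2, 0.8]),
    "pay": np.array([0.6, 0.4]),
    "hours": np.array([0.1, 0.3]),
}

def review_vector(tokens, dim=2):
    """Mean of the embeddings of known tokens; zero vector if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

review_vector(["good", "pay", "unknown"])  # mean of the "good" and "pay" vectors
```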

Table 6.6: GLOVE: RMSE results
Method Input # of dimensions pros # of dimensions cons Learner RMSE
Glove FCM 50 50 RF 0.942
Glove FCM 20 20 RF 0.943
Glove FCM 50 20 RF 0.950
Glove FCM 5 20 RF 0.953
Glove FCM 20 5 RF 0.956
Glove FCM 5 50 RF 0.959
Glove FCM 20 50 RF 0.960
Glove FCM 50 5 RF 0.963
Glove FCM 5 5 RF 0.964

Finally, the results are clear-cut: GloVe produces worse results than those obtained with LSA.

7 Conclusion, Limits and Recommendations

In conclusion, our exploratory and unsupervised analyses allow us to state that there seem to be no big differences in working environment across the banking industry. However, the exploratory analysis of the positions occupied by employees enabled us to identify recruitment strategies specific to certain banks; for example, UBS mainly uses interns to carry out operations. In addition, through supervised learning we determined that TF-IDF + LSA with 30 topics for the pros and 20 for the cons yields the best predictions, with an RMSE of 0.900.

However, our model has limits. We performed a regression rather than a classification task, which may not always make sense since users give only integer ratings.

To improve the project, we recommend scraping more data across various banks, in order to determine an ideal threshold for the minimum number of words needed for a high-quality review. We also recommend finding a way to predict ratings from the combined pros and cons. In addition, taking into account whether the user is a current or former employee could help improve our results.