1 Abstract

In this report, we apply text mining methods to predict company ratings from Glassdoor reviews.

First, we treat the positive and negative parts of each review separately. As part of our preprocessing, we apply techniques such as stopword removal and lemmatization, and we perform an exploratory analysis.

Second, through the application of the Latent Dirichlet Allocation (LDA) method, we observe that it is difficult to distinguish which tokens are specific to each topic, and that no topic is specific to any single company.

Finally, we conclude our work with supervised learning methods and perform a grid search to find the best model and hyperparameters for predicting the rating. After comparison, we conclude that the best model is the one with the lowest RMSE: a random forest using LSA on the TF-IDF matrix, with 30 topics for the positive part of the reviews and 20 for the negative part.

2 Introduction

As part of a Text Mining course taught at HEC Lausanne by Professor Marc-Olivier Boldi, we carry out the following project: predicting company ratings from Glassdoor reviews.

Glassdoor is a website where current and former employees anonymously evaluate their salary, work environment and company. As the reference for information that companies do not make public, the website has a large audience; current and former employees are very active on the platform, so it hosts a large number of reviews of different companies.

The aim of our project is to predict the company ratings of five banks (JP Morgan, Deutsche Bank, TD, HSBC Holdings and UBS) based on reviews from the Glassdoor website. We want to see whether there are differences and similarities between the five banks, in order to determine whether some factors are more responsible for the success of one company than another. Are there patterns that human resources departments could use to increase competitiveness? Are there specific characteristics that recur across the industry? In addition, knowing how Glassdoor reviews work can give us insights into the job market and, above all, into corporate culture, which matters to us since we are currently looking for internships and would like to join a good company.

To achieve our objective, we apply different techniques to extract valuable information from the texts. After performing an exploratory analysis, we carry out a semantic analysis. We then fit a topic model using the Latent Dirichlet Allocation (LDA) method. Finally, we frame rating prediction as a regression task and compare machine learning models to find the one that predicts the rating best.

3 Data

3.1 Data Acquisition

Our process of data acquisition was composed of the following steps:

  1. First, we identified the HTML tags containing the information relevant to this analysis and prepared a script (scripts/glassdoor-html) in charge of extracting them. We then combined all of the reviews for a company into a single tibble. We also tried to account for any parsing errors, as the website's HTML tags were updated regularly.

  2. Next, we used another script (scripts\web-scraper) and created a vector with the URLs of the five desired companies. Once the minimum number of required reviews was defined (5,010 per company), we looped over each review page and combined the results for the five banks into a single tibble. To avoid issues when accessing the website repeatedly (i.e. being blocked), we defined an rnorm_sleep_generator that introduces random pauses between visits, replicating a human's browsing behavior.

  3. Finally, we removed any duplicates, parsed the review times and finished by writing a CSV file with the banks' processed reviews. A sample of the website and its tags during the scraping process can be found in data\glass-door-sample.html.
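The pause logic in step 2 can be sketched as follows. The report's scraper is written in R; this is only an illustrative Python equivalent of the rnorm_sleep_generator idea, and the mean, standard deviation and floor are made-up values.

```python
import random
import time

def rnorm_sleep_generator(mean=4.0, sd=1.5, minimum=1.0):
    """Draw a pause length (in seconds) from a normal distribution,
    floored at `minimum`, to mimic a human reader's irregular pace."""
    return max(minimum, random.gauss(mean, sd))

def polite_get(fetch_page, url):
    """Sleep a random amount before each request to avoid being blocked."""
    time.sleep(rnorm_sleep_generator())
    return fetch_page(url)
```

A fixed sleep would be easy for the server to detect; drawing from a distribution makes the request pattern look more like a person browsing.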

3.2 Preprocessing

3.2.1 Definitions

Corpus = All reviews on Glassdoor for five major banks

Text = The reviews for each bank fall into this category. In the EDA section, we used the five banks as documents; in the unsupervised and supervised learning parts, however, we consider each individual review rather than aggregating them.

Tokens = Words in each review

3.2.2 Data structure

page review_id company review_title employee_role employee_history employer_pros employer_cons employer_rating work_life_balance culture_values diversity_inclusion career_opportunities compensation_and_benefits senior_management review_time
1 empReview_38007833 J.P. Morgan Cool place Current Employee - Vice President I have been working at J.P. Morgan full-time All Benefits are very good! Diversity within certain lines of business 4 5 4 3 4 4 NA 2020-11-04 18:57:34
1 empReview_38038804 J.P. Morgan Fun Former Employee - Teller I worked at J.P. Morgan part-time for less than a year It was a great learning experience No cons really it was good 4 4 4 NA 4 3 4 2020-11-05 12:08:31
1 empReview_37961239 J.P. Morgan Great workplace Former Employee - Analyst I worked at J.P. Morgan full-time smart people, good hours, good pay bureaucracy, not challenging enough, very corporate 5 NA NA NA NA NA NA 2020-11-03 14:17:03
1 empReview_37986045 J.P. Morgan Great bank Current Employee - Personal Banker II I have been working at J.P. Morgan full-time for more than a year Great commissions and opportunity to grow. You can get switched to another branch out of nowhere. 4 3 5 5 5 5 NA 2020-11-04 07:41:18
1 empReview_37926635 J.P. Morgan Fantastic company, wanting a change Current Employee - Trial Consultant I have been working at J.P. Morgan full-time positive work environment and potential for increased salary Stressful work conditions and not enough review on performance and room for improvement 5 NA NA NA NA NA NA 2020-11-02 20:33:56


The dataset covers 5 banks: J.P. Morgan, TD, UBS, HSBC Holdings and Deutsche Bank. We also note that the employer rating is always filled in, but this is not true of the other ratings, such as work-life balance, which employees may choose not to answer.

3.2.3 Checking for NAs and duplicates

Table 3.1: Overview of the missing values
Number of missing values %
diversity_inclusion 24155 98
culture_values 5596 23
senior_management 4900 20
work_life_balance 4115 17
compensation_and_benefits 4110 17
career_opportunities 4096 17
employee_role 2749 11
review_title 29 0
page 0 0
review_id 0 0
company 0 0
employee_history 0 0
employer_pros 0 0
employer_cons 0 0
employer_rating 0 0
review_time 0 0


As mentioned above, not all ratings are mandatory; for example, the diversity and inclusion rating is mostly left blank. This indicates that we won't be able to use all of these variables for further classification tasks.


For unknown reasons, some reviews appear to have been duplicated, approximately 429 of them. We remove these instances from our dataset to ensure its quality.
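Deduplication of this kind takes only a few lines; the report's pipeline is in R, so the pandas snippet below, with a hypothetical four-row frame, is purely illustrative.

```python
import pandas as pd

# Hypothetical frame in which empReview_2 was scraped twice.
reviews = pd.DataFrame({
    "review_id": ["empReview_1", "empReview_2", "empReview_2", "empReview_3"],
    "employer_rating": [4, 5, 5, 3],
})

# Keep the first occurrence of each review_id and drop the repeats.
deduplicated = reviews.drop_duplicates(subset="review_id", keep="first")
```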

3.2.4 General overview

Table 3.2: Overview of average scores
Bank name Rating Work life balance Culture values Diversity inclusion Career opportunities Compensation and benefits Senior management Number of reviews
Deutsche Bank 3.39 3.43 3.24 3.82 3.22 3.35 2.87 4893
HSBC Holdings 3.53 3.57 3.49 3.78 3.31 3.53 3.01 4900
J.P. Morgan 3.77 3.43 3.62 4.05 3.63 3.74 3.25 4839
TD 3.55 3.32 3.73 4.39 3.46 3.45 3.11 4792
UBS 3.44 3.43 3.32 3.88 3.21 3.29 2.98 4917
Table 3.3: Pros and cons review word length: comparison between banks
metric Deutsche Bank HSBC Holdings J.P. Morgan TD UBS
employer_cons_1st_Qu. 6.0 6.0 6.0 6.0 7.0
employer_cons_3rd_Qu. 22.0 16.0 17.0 23.0 24.0
employer_cons_Mean 21.0 17.0 18.0 23.6 21.0
employer_cons_Median 10.0 8.0 9.0 10.0 11.0
employer_pros_1st_Qu. 6.0 6.0 6.0 6.0 6.0
employer_pros_3rd_Qu. 17.0 12.0 12.0 16.0 20.0
employer_pros_Mean 13.9 11.9 11.3 13.9 14.8
employer_pros_Median 9.0 7.0 8.0 8.0 9.0


Regarding the number of words per review for each bank, we cannot identify any interesting patterns. Looking at the median and mean number of words in the cons reviews, we do not find results that mirror the scores. It would have been interesting to see longer cons reviews correlated with low scores, as angry employees might emphasize the company's negative points over the positive ones, but this is not the case here. We can use this information later in the modeling part to work only with reviews that are long enough, rather than a few perfunctory words, enriching our analysis.


Both graphs above clearly show that the number of words per review is concentrated around 5-6 words. This result is not surprising, as every new Glassdoor member must write pro and con reviews of a minimum length to validate their account, and most of them write very short reviews.

4 Exploratory Analysis

We begin with a general EDA of the reviews. Note that here we treat each company as a document. Our dataset is composed of 23,341 reviews.

4.1 Tokenization

Text mining results are largely affected by the tokenization process. After testing different options, we decided to use lemmatization as the most promising one.

We create the function EDA_handler, which handles every aspect of tokenization: the text is split into word tokens, stopwords are removed, and all tokens are lowercased so that case is ignored.
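As a rough illustration of what EDA_handler does (the actual function is written in R), the steps can be sketched in Python; the stopword set and lemma map below are tiny toy stand-ins for the real resources.

```python
# Toy stopword list and lemma map; the real pipeline uses full resources.
STOPWORDS = {"the", "a", "and", "at", "are", "is", "of", "to"}
LEMMAS = {"benefits": "benefit", "hours": "hour", "managers": "manager"}

def eda_handler(text):
    # Strip punctuation, lowercase, drop stopwords, then lemmatize.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]

eda_handler("The benefits and hours are great!")
# -> ['benefit', 'hour', 'great']
```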

4.2 Wordclouds

4.2.1 Wordcloud for Pros

First, we generate the wordcloud for the most common words found in the employer_pros column.

4.2.2 Wordcloud for Cons

Next, we can do the same for the employer_cons column.

4.2.3 Most frequent Pro words

The graph below shows the most frequent words per company, since the grouping was done on the company rather than on individual reviews. Regarding the pros, the words benefit and people come up most often; it is also interesting to see that culture and life appear, perhaps indicating that most people care about these values when describing work positively.

4.2.4 Most frequent Cons Words

Most of the cons are associated with the word management, followed by employees, hour and time. Furthermore, people appears among the cons as well, meaning that we have to put our analysis into context and use valence shifters to see whether reviewers mention something good or bad about these people.

4.3 Semantic Analysis

First, we perform a sentiment analysis with two dictionaries, NRC and AFINN. Then, we introduce valence shifters in the subsequent part.

4.3.1 NRC & AFINN

#> # A tibble: 13,901 x 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 abacus      trust    
#>  2 abandon     fear     
#>  3 abandon     negative 
#>  4 abandon     sadness  
#>  5 abandoned   anger    
#>  6 abandoned   fear     
#>  7 abandoned   negative 
#>  8 abandoned   sadness  
#>  9 abandonment anger    
#> 10 abandonment fear     
#> # ... with 13,891 more rows
#> # A tibble: 2,477 x 2
#>    word       value
#>    <chr>      <dbl>
#>  1 abandon       -2
#>  2 abandoned     -2
#>  3 abandons      -2
#>  4 abducted      -2
#>  5 abduction     -2
#>  6 abductions    -2
#>  7 abhor         -3
#>  8 abhorred      -3
#>  9 abhorrent     -3
#> 10 abhors        -3
#> # ... with 2,467 more rows


4.3.1.1 Plot of NRC dictionary

The sentiment profiles of the banks are very close to one another under the NRC method. Unfortunately, we won't be able to use the NRC dictionary for classification.

4.3.1.2 Plot of AFINN dictionary


Using the AFINN dictionary, the sentiment analysis looks more promising, which really emphasizes the importance of choosing the right dictionary. AFINN contains over 3,300 words, each with an associated polarity score.
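AFINN-style scoring is just a lexicon look-up followed by a sum. A minimal sketch, with a toy four-word lexicon standing in for the full AFINN dictionary:

```python
# Toy stand-in for the AFINN lexicon (word -> polarity score).
AFINN = {"great": 3, "good": 3, "bad": -3, "stressful": -2}

def afinn_score(tokens):
    # Words outside the lexicon contribute 0.
    return sum(AFINN.get(t, 0) for t in tokens)

afinn_score(["great", "pay", "but", "stressful", "hours"])  # -> 1
```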

4.3.2 More advanced methods with valence shifters

Table 4.1: Employer rating: Comparison with 2 sentiment analysis methods
Bank name Actual employer rating Valence shifter score Afinn sentiment score
J.P. Morgan 3.77 0.563 1.162
TD 3.55 0.522 1.086
HSBC Holdings 3.53 0.527 1.237
UBS 3.45 0.512 1.212
Deutsche Bank 3.40 0.499 0.962

We obtain a ranking in which the sentiment corresponds to the actual score 4 out of 5 times. More importantly, it highlights the differences between the scores, with J.P. Morgan clearly in the lead. Consequently, this valence-shifter score can be used as a feature in the supervised learning part, as it seems to carry a lot of relevant information for predicting the rating of a review.
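The advantage of valence shifters over a plain lexicon is that negators and amplifiers modify the polarity of nearby words. The following toy sketch, a much simplified stand-in for the sentimentr-style logic used in the report, illustrates the idea:

```python
# Tiny toy lexicons; real valence-shifter tables are much larger.
POLARITY = {"good": 1.0, "bad": -1.0}
NEGATORS = {"not", "no", "never"}
AMPLIFIERS = {"very": 1.8, "really": 1.8}

def valence_score(tokens):
    score, weight = 0.0, 1.0
    for t in tokens:
        if t in NEGATORS:
            weight *= -1.0            # flip polarity of the next polar word
        elif t in AMPLIFIERS:
            weight *= AMPLIFIERS[t]   # intensify the next polar word
        elif t in POLARITY:
            score += weight * POLARITY[t]
            weight = 1.0              # shifters only affect the nearest word
    return score

valence_score(["not", "very", "good"])  # -> -1.8
```

A plain AFINN look-up would score "not very good" as positive; the shifter-aware version correctly flips and amplifies it.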

4.4 Job positions analysis

In order to build the most accurate classifier, we need to look at every available variable. During the scraping process, we also managed to extract the job position, so it makes sense to check whether the employee's role has an impact on the company review score.

We also explored the job positions to identify potential patterns. After removing inaccurate job titles, we observe from the top 5 most frequent positions per bank that some positions seem specific to certain banks. For instance, UBS seems to hire a larger number of interns for its operations.

For a more interpretable view, we used LSA techniques to create a two-dimensional biplot of the top 50 job positions. Interestingly, JP Morgan, HSBC and UBS seem to have similar job structures. On the other hand, TD offers more customer-oriented representative positions, whereas Deutsche Bank hires more analysts.

4.5 Compare the reviews in terms of lexical diversity

Lexical diversity can also be an interesting tool for classification. However, a usable result seems unlikely here; the only plausible signal would be that some banks hire more non-native English speakers, whose proficiency in the language may be lower.

We use both Yule's index and the type-token ratio (TTR) to show that neither is a candidate for classification: the two graphs tell very different stories, and we cannot correlate these results with the employer rating.
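For reference, the two diversity measures can be computed as follows (an illustrative Python sketch of the standard formulas; the report's computation is done in R):

```python
from collections import Counter

def ttr(tokens):
    """Type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)

def yules_k(tokens):
    """Yule's K; higher values indicate lower lexical diversity."""
    n = len(tokens)
    # freq_of_freqs[i] = number of distinct words occurring exactly i times.
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 1e4 * (s2 - n) / (n * n)

tokens = ["good", "pay", "good", "team", "good", "hours"]
diversity = (ttr(tokens), yules_k(tokens))
```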

5 Unsupervised Topic Analysis

5.1 Preprocessing

To start our analysis, we tokenized our data and removed the stopwords. To make the analysis relevant, we manually removed some words that, from our point of view, would bias it. The stopword list mainly includes uninformative words, such as quantifiers or company names, that appear in every topic. Defining the stopwords was an iterative process: as we carried out the unsupervised and supervised analyses, we refined the list gradually.

We selected the pros and cons reviews with at least five tokens, because Glassdoor users cannot submit a review of fewer than five words.

5.2 Latent Dirichlet Allocation Analysis

The LDA method represents each topic as a mixture of words, and each document as a mixture of topics. We created two functions, for pros and cons, that display our DFM in an LDA plot.

5.2.1 Dimensions testing

After several trials, we decided to fix the number of topics at 3 and the number of terms at 10 for both pros and cons reviews. A small number of terms on the LDA plots allows us to visualize the topics better, and adding more topics does not bring additional insight.

5.2.2 Topic by word

Here we can see that, for both pros and cons, it is very difficult to distinguish which words are specific to each topic, because the three topics share many common words.

5.2.3 Topic by Company

Here we wanted to see the probability of each text belonging to a topic. To do so, we attached each document to its company and then averaged the gamma values by company, which shows that no company is closer to one topic than another. In fact, this graph emphasizes that using the LDA gammas is irrelevant in this case.

6 Supervised Learning

In this supervised learning part, we apply machine learning techniques to predict the rating of a review as a regression task. We proceeded as follows: we filtered out duplicated entries and computed sentiments separately for pros and cons, which we summed to obtain a total sentiment for each review. We also used the word counts of the pros and cons to obtain the total number of words in a review, and used get_sentences to avoid warnings. Furthermore, after lemmatizing with token_replace, we kept only reviews containing at least 10 tokens. The motivation behind this filter is a preference for quality reviews, which here depends strongly on length, as many users write extremely short reviews just to gain access to the information provided by other users on the website.

Given that we are in a regression context, we employed linear regression and random forest models combined with LSA or word embeddings. These dimensionality-reduction methods were applied on top of either the bag-of-words document-term matrix or the TF-IDF matrix.

For supervised learning, we adopted a different strategy from the unsupervised part. Instead of fixing three topics for both pros and cons, we assigned a separate number of topics to each of the two categories (e.g. 5 topics for the pros and 20 for the cons) in order to achieve better predictions. We computed the RMSE for each method as a function of the number of topics for the positive and negative reviews.
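One cell of this grid can be sketched as follows (a scikit-learn stand-in for the R pipeline; the six synthetic reviews and the dimension pair (2, 2) are placeholders for the real data and the {5, 20, 30, 50} grid):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic placeholder data: one pros text, one cons text and a rating per review.
pros = ["good pay", "great culture", "nice people", "good benefits", "great hours", "good team"]
cons = ["long hours", "poor management", "heavy bureaucracy", "low pay", "bad culture", "slow growth"]
rating = np.array([4, 5, 4, 3, 5, 3])

def lsa_features(texts, n_dims):
    """TF-IDF followed by LSA (truncated SVD) with n_dims dimensions."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    return TruncatedSVD(n_components=n_dims, random_state=42).fit_transform(tfidf)

# One (pros_dims, cons_dims) combination; the grid search loops over all pairs.
X = np.hstack([lsa_features(pros, 2), lsa_features(cons, 2)])
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, rating)
# The report scores on held-out data; training data is reused here only
# to keep the sketch short.
rmse = mean_squared_error(rating, model.predict(X)) ** 0.5
```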

6.1 Document Term Matrix

The following table shows that the lowest RMSE is obtained by using 50 dimensions for the pros and only 5 for the cons. The best result (RMSE = 0.913) is given by a linear model using DTM + LSA.

Table 6.1: Document-term frequency: RMSE for LM and RF
Method Input # of dimensions pros # of dimensions cons Linear model Random forest
LSA DTM 5 5 0.932 0.923
LSA DTM 5 20 0.944 0.919
LSA DTM 5 30 0.944 0.920
LSA DTM 5 50 0.959 0.928
LSA DTM 20 5 0.917 0.920
LSA DTM 20 20 0.929 0.916
LSA DTM 20 30 0.928 0.917
LSA DTM 20 50 0.941 0.921
LSA DTM 30 5 0.920 0.923
LSA DTM 30 20 0.934 0.913
LSA DTM 30 30 0.934 0.918
LSA DTM 30 50 0.948 0.918
LSA DTM 50 5 0.913 0.927
LSA DTM 50 20 0.926 0.919
LSA DTM 50 30 0.926 0.918
LSA DTM 50 50 0.939 0.919

6.2 TF-IDF

The following table shows that the lowest RMSE is obtained by using 30 dimensions for the pros and 20 for the cons. The best result (RMSE = 0.900) is given by a random forest using TF-IDF + LSA.

Table 6.2: TF-IDF: RMSE for LM and RF
Method Input # of dimensions pros # of dimensions cons Linear model Random forest
LSA TFIDF 5 5 0.972 0.907
LSA TFIDF 5 20 0.993 0.907
LSA TFIDF 5 30 1.012 0.911
LSA TFIDF 5 50 1.011 0.913
LSA TFIDF 20 5 0.969 0.901
LSA TFIDF 20 20 0.976 0.902
LSA TFIDF 20 30 0.997 0.906
LSA TFIDF 20 50 0.980 0.904
LSA TFIDF 30 5 0.978 0.904
LSA TFIDF 30 20 0.981 0.900
LSA TFIDF 30 30 1.006 0.910
LSA TFIDF 30 50 0.990 0.907
LSA TFIDF 50 5 0.971 0.919
LSA TFIDF 50 20 0.981 0.914
LSA TFIDF 50 30 1.010 0.916
LSA TFIDF 50 50 0.988 0.910

6.3 Combining the results

The following table ranks all models. The lowest RMSE (0.900) is obtained by a random forest using TF-IDF + LSA with 30 dimensions for the pros and 20 for the cons, the best result of the three approaches.
Table 6.3: MIXED: RMSE for LM and RF
Method Input # of dimensions pros # of dimensions cons Learner RMSE
LSA TFIDF 30 20 RF 0.900
LSA TFIDF 20 5 RF 0.901
LSA TFIDF 20 20 RF 0.902
LSA TFIDF 20 50 RF 0.904
LSA TFIDF 30 5 RF 0.904
LSA TFIDF 20 30 RF 0.906
LSA TFIDF 5 5 RF 0.907
LSA TFIDF 5 20 RF 0.907
LSA TFIDF 30 50 RF 0.907
LSA TFIDF 30 30 RF 0.910
LSA TFIDF 50 50 RF 0.910
LSA TFIDF 5 30 RF 0.911
LSA TFIDF 5 50 RF 0.913
LSA DTM 50 5 LM 0.913
LSA DTM 30 20 RF 0.913
LSA TFIDF 50 20 RF 0.914
LSA TFIDF 50 30 RF 0.916
LSA DTM 20 20 RF 0.916
LSA DTM 20 5 LM 0.917
LSA DTM 20 30 RF 0.917
LSA DTM 30 30 RF 0.918
LSA DTM 50 30 RF 0.918
LSA DTM 30 50 RF 0.918
LSA DTM 50 20 RF 0.919
LSA DTM 50 50 RF 0.919
LSA DTM 5 20 RF 0.919
LSA TFIDF 50 5 RF 0.919
LSA DTM 5 30 RF 0.920
LSA DTM 30 5 LM 0.920
LSA DTM 20 5 RF 0.920
LSA DTM 20 50 RF 0.921
LSA DTM 5 5 RF 0.923
LSA DTM 30 5 RF 0.923
LSA DTM 50 30 LM 0.926
LSA DTM 50 20 LM 0.926
LSA DTM 50 5 RF 0.927
LSA DTM 20 30 LM 0.928
LSA DTM 5 50 RF 0.928
LSA DTM 20 20 LM 0.929
LSA DTM 5 5 LM 0.932
LSA DTM 30 30 LM 0.934
LSA DTM 30 20 LM 0.934
LSA DTM 50 50 LM 0.939
LSA DTM 20 50 LM 0.941
LSA DTM 5 20 LM 0.944
LSA DTM 5 30 LM 0.944
LSA DTM 30 50 LM 0.948
LSA DTM 5 50 LM 0.959
LSA TFIDF 20 5 LM 0.969
LSA TFIDF 50 5 LM 0.971
LSA TFIDF 5 5 LM 0.972
LSA TFIDF 20 20 LM 0.976
LSA TFIDF 30 5 LM 0.978
LSA TFIDF 20 50 LM 0.980
LSA TFIDF 50 20 LM 0.981
LSA TFIDF 30 20 LM 0.981
LSA TFIDF 50 50 LM 0.988
LSA TFIDF 30 50 LM 0.990
LSA TFIDF 5 20 LM 0.993
LSA TFIDF 20 30 LM 0.997
LSA TFIDF 30 30 LM 1.006
LSA TFIDF 50 30 LM 1.010
LSA TFIDF 5 50 LM 1.011
LSA TFIDF 5 30 LM 1.012

6.4 Best model (TF-IDF + LSA)

Here we added the valence-shifter score (the sentiment function mentioned above) and the review lengths to our best model. Despite these new features, the RMSE (0.900) does not improve, and the results obtained previously are all better. However, the model does quite well when it has to predict twos and threes, which could be explained by the fact that more reviews are rated with a two or a three.

6.4.1 Variable importance

Since our best model is a random forest on LSA features, we can extract the variable importances and assess whether a topic is more or less important in accurately predicting the ratings.

Here we can see that the lengths of the pros and cons are important, and that the sentiment score contributes greatly to the prediction. We also observe that cons.3 seems to be an important topic.
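Extracting importances from a fitted random forest is a one-liner; the sketch below uses synthetic data in which only the first feature drives the outcome, standing in for features like cons.3, the review lengths and the sentiment score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                           # e.g. cons.3, pros length, sentiment
y = 3 + 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters here

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = rf.feature_importances_                   # sum to 1; higher = more important
```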

In the table below we investigate the words in cons.3. The two most heavily weighted variables constituting cons.3 are people and team, which have quite similar meanings.

Table 6.4: The most important variable seems related to colleagues
Key words Value Absolute value
people 0.153 0.153
team 0.116 0.116
day -0.094 0.094
service -0.103 0.103
graduate -0.113 0.113
goal -0.120 0.120
open -0.124 0.124
call -0.127 0.127
schedule -0.128 0.128
employee -0.129 0.129
teller -0.132 0.132
store -0.135 0.135
sale -0.210 0.210
branch -0.277 0.277
customer -0.399 0.399

Now let's look at a pro topic. We can clearly see that pros.2 is related to vacation, with words such as day, year, vacation and time. It could also be related to compensation in case of sickness (sick, pay).

Table 6.5: Topic pro 2 - Related to holidays
Key words Value Absolute value
day 0.306 0.306
year 0.218 0.218
week 0.202 0.202
vacation 0.181 0.181
time 0.174 0.174
pay 0.166 0.166
3 0.161 0.161
401k 0.146 0.146
sick 0.129 0.129
4 0.126 0.126
match 0.120 0.120
2 0.119 0.119
5 0.119 0.119
leave 0.118 0.118
opportunity -0.118 0.118

These results may suggest that, in general, employees are satisfied with their vacation and days off, and dissatisfied with internal management methods.

6.5 Word Embedding & Glove

To complete our supervised learning, we used word embeddings with GloVe to model the data in a different way. We computed the RMSE for each combination of numbers of topics for pros and cons; when building our data frame, we took the mean of the word vectors instead of the sum, as it worked better.
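Averaging word vectors into one review-level feature vector can be sketched as follows; the two-dimensional embedding table is a toy stand-in for trained GloVe vectors.

```python
import numpy as np

# Toy embedding table standing in for trained GloVe vectors.
embeddings = {
    "good": np.array([0.2, 0.8]),
    "pay": np.array([0.6, 0.4]),
    "hours": np.array([0.1, 0.3]),
}

def review_vector(tokens, dim=2):
    """Mean of the embeddings of known tokens; zero vector if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

review_vector(["good", "pay", "unknown"])  # mean of the "good" and "pay" vectors
```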

Table 6.6: GLOVE: RMSE results
Method Input # of dimensions pros # of dimensions cons Learner RMSE
Glove FCM 50 50 RF 0.942
Glove FCM 20 20 RF 0.943
Glove FCM 50 20 RF 0.950
Glove FCM 5 20 RF 0.953
Glove FCM 20 5 RF 0.956
Glove FCM 5 50 RF 0.959
Glove FCM 20 50 RF 0.960
Glove FCM 50 5 RF 0.963
Glove FCM 5 5 RF 0.964

Finally, the results are clear-cut: GloVe produces worse results than those obtained with LSA.

7 Conclusion, Limits and Recommendations

In conclusion, our exploratory and unsupervised analyses allow us to state that there seem to be no big differences in working environment across the banking industry. However, the exploratory analysis of the positions occupied by employees enabled us to identify recruitment strategies specific to certain banks; for example, UBS mainly uses interns to carry out operations. In addition, through supervised learning we determined that TF-IDF + LSA with 30 topics for the pros and 20 for the cons yields the best predictions, with an RMSE of 0.900.

However, our model has limits. We performed a regression rather than a classification task, which may not always make sense since users give only integer ratings.

To improve the project, we recommend scraping more data across various banks, in order to determine an ideal threshold for the minimum number of words needed for a high-quality review. We also recommend finding a way to predict ratings from the combined pros and cons. In addition, taking into account whether the user is a current or former employee could help improve our results.