Our process of data acquisition was composed of the following steps:
First, we identified the HTML tags containing the information needed for this analysis and prepared a script (`scripts/glassdoor-html`) in charge of extracting the desired tags. After that, we combined all of the reviews for a company into a single tibble. We also tried to account for any parsing errors, since the HTML tags of the website were updated regularly.

Next, another script (`scripts/web-scraper`) created a vector with the URLs of the five selected companies. After defining the minimum number of required reviews, 5,010 per company, we looped over each review page and combined the results for the five banks into a single tibble. To avoid being blocked by the website after many successive requests, we defined an `rnorm_sleep_generator` function that introduces random pauses between visits, replicating a human's browsing behavior (a sketch is given below).

Finally, we removed any duplicates, parsed the review time, and finished by writing the processed bank reviews to a CSV file. A sample of the website and its tags during the scraping process can be found in `data/glass-door-sample.html`.
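As an illustration only, a minimal sketch of such a sleep generator in base R; the mean and standard deviation here are assumptions, not necessarily the values used in `scripts/web-scraper`:

```r
# Hypothetical sketch of rnorm_sleep_generator: draw a normally distributed
# pause (floored at 1 second) and sleep, mimicking human browsing rhythm.
rnorm_sleep_generator <- function(mean_sec = 5, sd_sec = 2) {
  pause <- max(1, rnorm(1, mean = mean_sec, sd = sd_sec))
  Sys.sleep(pause)
}
```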
Corpus = all reviews on Glassdoor for the five major banks.

Text = the reviews for each bank; in the EDA section, we use the five banks as the documents. However, in the unsupervised and supervised learning parts, we consider each individual review as a document rather than aggregating them.

Tokens = the words in each review.
page | review_id | company | review_title | employee_role | employee_history | employer_pros | employer_cons | employer_rating | work_life_balance | culture_values | diversity_inclusion | career_opportunities | compensation_and_benefits | senior_management | review_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | empReview_38007833 | J.P. Morgan | Cool place | Current Employee - Vice President | I have been working at J.P. Morgan full-time | All Benefits are very good! | Diversity within certain lines of business | 4 | 5 | 4 | 3 | 4 | 4 | NA | 2020-11-04 18:57:34 |
1 | empReview_38038804 | J.P. Morgan | Fun | Former Employee - Teller | I worked at J.P. Morgan part-time for less than a year | It was a great learning experience | No cons really it was good | 4 | 4 | 4 | NA | 4 | 3 | 4 | 2020-11-05 12:08:31 |
1 | empReview_37961239 | J.P. Morgan | Great workplace | Former Employee - Analyst | I worked at J.P. Morgan full-time | smart people, good hours, good pay | bureaucracy, not challenging enough, very corporate | 5 | NA | NA | NA | NA | NA | NA | 2020-11-03 14:17:03 |
1 | empReview_37986045 | J.P. Morgan | Great bank | Current Employee - Personal Banker II | I have been working at J.P. Morgan full-time for more than a year | Great commissions and opportunity to grow. | You can get switched to another branch out of nowhere. | 4 | 3 | 5 | 5 | 5 | 5 | NA | 2020-11-04 07:41:18 |
1 | empReview_37926635 | J.P. Morgan | Fantastic company, wanting a change | Current Employee - Trial Consultant | I have been working at J.P. Morgan full-time | positive work environment and potential for increased salary | Stressful work conditions and not enough review on performance and room for improvement | 5 | NA | NA | NA | NA | NA | NA | 2020-11-02 20:33:56 |
The dataset covers five banks: J.P. Morgan, TD, UBS, HSBC Holdings, and Deutsche Bank. We also note that the overall employer rating is always filled in, but this is not true for the other ratings, such as work-life balance, which employees may choose not to answer.
Variable | Number of missing values | % |
---|---|---|
diversity_inclusion | 24155 | 98 |
culture_values | 5596 | 23 |
senior_management | 4900 | 20 |
work_life_balance | 4115 | 17 |
compensation_and_benefits | 4110 | 17 |
career_opportunities | 4096 | 17 |
employee_role | 2749 | 11 |
review_title | 29 | 0 |
page | 0 | 0 |
review_id | 0 | 0 |
company | 0 | 0 |
employee_history | 0 | 0 |
employer_pros | 0 | 0 |
employer_cons | 0 | 0 |
employer_rating | 0 | 0 |
review_time | 0 | 0 |
As mentioned above, not all ratings are mandatory; for example, the diversity and inclusion rating is almost never filled in (98% missing). This means we will not be able to use all of these variables for further classification tasks.

For unknown reasons, approximately 429 reviews appear to be duplicated. We remove these instances from our dataset to ensure its quality, as sketched below.
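A minimal sketch of the deduplication, assuming the combined reviews sit in a tibble named `reviews`:

```r
library(dplyr)

# Drop the ~429 duplicated reviews: review_id uniquely identifies a review,
# so we keep only the first occurrence of each id.
reviews <- reviews %>% distinct(review_id, .keep_all = TRUE)
```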
Bank name | Rating | Work life balance | Culture values | Diversity inclusion | Career opportunities | Compensation and benefits | Senior management | Number of reviews |
---|---|---|---|---|---|---|---|---|
Deutsche Bank | 3.39 | 3.43 | 3.24 | 3.82 | 3.22 | 3.35 | 2.87 | 4893 |
HSBC Holdings | 3.53 | 3.57 | 3.49 | 3.78 | 3.31 | 3.53 | 3.01 | 4900 |
J.P. Morgan | 3.77 | 3.43 | 3.62 | 4.05 | 3.63 | 3.74 | 3.25 | 4839 |
TD | 3.55 | 3.32 | 3.73 | 4.39 | 3.46 | 3.45 | 3.11 | 4792 |
UBS | 3.44 | 3.43 | 3.32 | 3.88 | 3.21 | 3.29 | 2.98 | 4917 |
metric | Deutsche Bank | HSBC Holdings | J.P. Morgan | TD | UBS |
---|---|---|---|---|---|
employer_cons_1st_Qu. | 6.0 | 6.0 | 6.0 | 6.0 | 7.0 |
employer_cons_3rd_Qu. | 22.0 | 16.0 | 17.0 | 23.0 | 24.0 |
employer_cons_Mean | 21.0 | 17.0 | 18.0 | 23.6 | 21.0 |
employer_cons_Median | 10.0 | 8.0 | 9.0 | 10.0 | 11.0 |
employer_pros_1st_Qu. | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 |
employer_pros_3rd_Qu. | 17.0 | 12.0 | 12.0 | 16.0 | 20.0 |
employer_pros_Mean | 13.9 | 11.9 | 11.3 | 13.9 | 14.8 |
employer_pros_Median | 9.0 | 7.0 | 8.0 | 8.0 | 9.0 |
Regarding the number of words per review for each bank, we cannot identify any interesting pattern. Comparing the median and mean word counts of the cons reviews with the rating scores does not show a consistent relationship. It would have been interesting to find that longer cons reviews correlate with lower scores, as angry employees might emphasize the negative points of the company rather than the positive ones, but this is not the case here. We can still use this information later, in the modeling part, to work only with reviews that are long enough, rather than a few perfunctory words, enriching our analysis.

Both graphs above clearly show that the number of words per review is concentrated around 5-6 words. This result is not surprising, as every new Glassdoor member needs to write minimum-length pros and cons reviews to validate their account, and most of them write a very short review.
A general EDA of the reviews follows. Note that here we treat each company as a document. Our dataset is then composed of 24,341 reviews (the sum of the per-bank counts above).
Text-mining results are largely impacted by the tokenization process. After testing different options, we decided to use lemmatization, as it is the most promising option.
We create the function `EDA_handler`, which handles every aspect of the tokenization: the text is split into word tokens, stopwords are removed, and all tokens are lowercased so that case is not taken into account. A sketch follows.
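A minimal sketch of what `EDA_handler` could look like with tidytext, assuming the text sits in a single column; the actual implementation may differ:

```r
library(dplyr)
library(tidytext)
library(textstem)

# Sketch of the tokenization: unnest_tokens() splits the text into lowercase
# word tokens, stopwords are dropped, and each token is lemmatized.
EDA_handler <- function(df, text_col) {
  df %>%
    unnest_tokens(word, {{ text_col }}) %>%
    anti_join(stop_words, by = "word") %>%
    mutate(word = lemmatize_words(word))
}
```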
First, we generate the wordcloud of the most common words found in the `employer_pros` column.
Next, we do the same for the `employer_cons` column.
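For illustration, such a wordcloud can be drawn with the wordcloud package; `pros_tokens` is an assumed name for the tokenized `employer_pros` column:

```r
library(dplyr)
library(wordcloud)

# Count token frequencies and plot the 100 most common words.
pros_counts <- count(pros_tokens, word, sort = TRUE)
wordcloud(words = pros_counts$word, freq = pros_counts$n, max.words = 100)
```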
The graph below shows the most frequent words per company, as the grouping here is done at the company level rather than the review level.
Regarding the pros, the words `benefit` and `people` come up most often. It is also interesting to see that the words `culture` and `life` appear as well, perhaps indicating that most people care about these values when describing work positively.
We can see that most cons are associated with the word `management`, followed by `employees`, `hour`, and `time`. Furthermore, the word `people` also appears among the cons, meaning we will have to put our analysis into context and use valence shifters to determine whether reviews say something good or something bad about these people.
First, we perform a sentiment analysis with the two dictionaries `nrc` and `afinn`. Then we introduce valence shifters in the subsequent part.
#> # A tibble: 13,901 x 2
#> word sentiment
#> <chr> <chr>
#> 1 abacus trust
#> 2 abandon fear
#> 3 abandon negative
#> 4 abandon sadness
#> 5 abandoned anger
#> 6 abandoned fear
#> 7 abandoned negative
#> 8 abandoned sadness
#> 9 abandonment anger
#> 10 abandonment fear
#> # ... with 13,891 more rows
#> # A tibble: 2,477 x 2
#> word value
#> <chr> <dbl>
#> 1 abandon -2
#> 2 abandoned -2
#> 3 abandons -2
#> 4 abducted -2
#> 5 abduction -2
#> 6 abductions -2
#> 7 abhor -3
#> 8 abhorred -3
#> 9 abhorrent -3
#> 10 abhors -3
#> # ... with 2,467 more rows
The feelings for the banks are very close to one another with the `nrc` method. Unfortunately, we will not be able to use the nrc dictionary for classification.

With the Afinn dictionary, the sentiment analysis looks more promising, which really emphasizes the importance of choosing the correct dictionary. As shown above, this version of the Afinn dictionary contains 2,477 words, each associated with a polarity score. A sketch of the per-bank scoring follows.
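A sketch of the AFINN scoring, assuming a tibble `tokens` with one word per row along with its company:

```r
library(dplyr)
library(tidytext)

# Join tokens against the AFINN dictionary and average polarity per bank.
afinn_by_bank <- tokens %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(company) %>%
  summarise(afinn_sentiment = mean(value))
```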
Bank name | Actual employer rating | Valence shifter score | Afinn sentiment score |
---|---|---|---|
J.P. Morgan | 3.77 | 0.563 | 1.162 |
TD | 3.55 | 0.522 | 1.086 |
HSBC Holdings | 3.53 | 0.527 | 1.237 |
UBS | 3.45 | 0.512 | 1.212 |
Deutsche Bank | 3.40 | 0.499 | 0.962 |
For four of the five banks, the ranking by sentiment score matches the ranking by actual rating. More importantly, it highlights the differences in the scores, with J.P. Morgan clearly in the lead. Consequently, this new score obtained through the valence shifters can be used as a feature in the supervised learning part, as it seems to provide a lot of relevant information for predicting the score of a review.
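A sketch of the valence-shifter score, assuming the sentimentr package (consistent with the `get_sentences` call mentioned later):

```r
library(sentimentr)

# sentiment_by() accounts for negators and amplifiers ("not good",
# "very bad") that a plain dictionary join would miss.
text <- paste(reviews$employer_pros, reviews$employer_cons)
reviews$valence_score <- sentiment_by(get_sentences(text))$ave_sentiment
```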
In order to build the most accurate classifier, we need to look at every available variable. During the scraping process, we also managed to extract the job position, so it makes sense to examine whether the employee role has an impact on the company review score.

We explored the job positions to identify potential patterns. After removing inaccurate job titles, the top five most frequent positions per bank show that some positions are more specific to certain banks. For instance, UBS seems to hire a larger number of interns for its operations.

To obtain a more interpretable insight, we used LSA techniques to create a two-dimensional biplot of the 50 most frequent job positions. Interestingly, J.P. Morgan, HSBC, and UBS seem to have a similar job structure. On the other hand, TD offers more customer-facing representative positions, whereas Deutsche Bank hires more analysts.
Lexical diversity could also be an interesting tool for classification, but it seems unlikely to produce a usable result here. The only explanation would be that some banks hire more non-native English speakers, whose command of the language may be lower.

We use both Yule's index and the TTR index to demonstrate that they are not candidates for classification. The two graphs tell very different stories, and we cannot correlate their results with the `employee_review`.
To start our analysis, we tokenized our data and removed the stopwords. To make the analysis relevant, we manually removed some words that, from our point of view, would bias it. These stopwords mainly include uninformative words, such as quantifiers and company names, which appear in every topic. Defining the stopwords was an iterative process: as we performed the unsupervised and supervised analyses, we refined the list gradually.

We selected the pros and cons reviews with at least five tokens, because Glassdoor users cannot write a review with fewer than five words.
The LDA method represents each topic as a combination of words, and each document as a combination of topics. We created two functions, for pros and cons, that display our DFM in an LDA plot.

After several trials, we fixed the number of topics at 3 and the number of terms at 10 for both pros and cons reviews. A small number of terms on the LDA plots allows us to visualize the topics better, and adding more topics does not bring further insight. A sketch of the fit follows.
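A minimal sketch of the LDA fit, assuming a quanteda DFM named `dfm_pros` and the topicmodels package:

```r
library(topicmodels)

# Convert the quanteda dfm to the format topicmodels expects, then fit
# a 3-topic LDA and list the top 10 terms per topic (as in the plots).
dtm_pros <- quanteda::convert(dfm_pros, to = "topicmodels")
lda_pros <- LDA(dtm_pros, k = 3, control = list(seed = 1234))
terms(lda_pros, 10)
```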
For both pros and cons, it is very difficult to distinguish which words are specific to each topic, because the three topics share a lot of common words.

Here, we wanted to see the probability of each text belonging to a topic. To do so, we attached each document to its company, then averaged the gamma values by company, which shows that no company is closer to one topic than another. In fact, this graph emphasizes that using the gamma values of the LDA is irrelevant in this case.
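A sketch of the gamma averaging, assuming a lookup table `doc_company` mapping each document id to its company:

```r
library(dplyr)
library(tidytext)

# Extract per-document topic probabilities (gamma) and average by company.
tidy(lda_pros, matrix = "gamma") %>%
  left_join(doc_company, by = "document") %>%
  group_by(company, topic) %>%
  summarise(mean_gamma = mean(gamma), .groups = "drop")
```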
In this supervised learning part, we applied machine learning techniques to our data in order to predict the rating of the reviews as a regression task. We proceeded as follows: we filtered out the duplicated entries, computed the sentiment separately for the pros and the cons, and summed the two to obtain a total sentiment per review. We also computed the word counts of the pros and the cons and summed them to obtain the total number of words per review. We used `get_sentences` to avoid warnings. Furthermore, after lemmatizing with the function `token_replace`, we kept only reviews containing at least 10 tokens. The motivation behind this filter is a preference for quality reviews, which here depends heavily on length, as many individuals write extremely short reviews just to gain access to the information provided by other users on the website. A sketch of the feature construction follows.
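A sketch of the feature construction, under the assumption that `reviews` holds the raw pros and cons text:

```r
library(dplyr)
library(sentimentr)

# Per-review sentiment and word counts for pros and cons, summed into
# the review-level totals used as model inputs.
features <- reviews %>%
  mutate(
    sent_pros  = sentiment_by(get_sentences(employer_pros))$ave_sentiment,
    sent_cons  = sentiment_by(get_sentences(employer_cons))$ave_sentiment,
    words_pros = lengths(strsplit(employer_pros, "\\s+")),
    words_cons = lengths(strsplit(employer_cons, "\\s+")),
    total_sentiment = sent_pros + sent_cons,
    total_words     = words_pros + words_cons
  )
```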
Given that we are in a regression context, we employed linear regression and random forest models combined with either LSA or word embeddings. These dimension-reduction methods were applied on top of either the bag-of-words document-term matrix or the TF-IDF matrix.

For supervised learning, we applied a different strategy from the unsupervised part. Instead of fixing three topics for both pros and cons, we allowed a different number of dimensions for each of the two categories (e.g. 5 dimensions for the pros and 20 for the cons) in order to achieve better predictions. We computed the RMSE for each method as a function of the number of dimensions kept for the positive and negative reviews, as sketched below.
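A sketch of the grid search, assuming quanteda DFMs `dfm_pros` and `dfm_cons`; the `cbind(pros = ..., cons = ...)` prefixes give feature names like the `pros.*` and `cons.*` topics referenced later:

```r
library(quanteda.textmodels)

dims <- c(5, 20, 30, 50)
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

for (k_pros in dims) {
  for (k_cons in dims) {
    # Reduce each DFM to k dimensions with LSA and bind the document scores.
    lsa_p <- textmodel_lsa(dfm_pros, nd = k_pros)
    lsa_c <- textmodel_lsa(dfm_cons, nd = k_cons)
    x <- cbind(pros = as.data.frame(lsa_p$docs),
               cons = as.data.frame(lsa_c$docs))
    # ...fit lm() and a random forest on x with the rating as the target,
    # then record rmse() on a held-out set for this (k_pros, k_cons) pair.
  }
}
```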
The following table shows that the lowest RMSE is obtained with 50 dimensions for the pros and only 5 for the cons. The best result (RMSE = 0.913) is given by a linear model using DTM + LSA with 50 pros dimensions and 5 cons dimensions.
Method | Input | # of dimensions pros | # of dimensions cons | Linear model | Random forest |
---|---|---|---|---|---|
LSA | DTM | 5 | 5 | 0.932 | 0.923 |
LSA | DTM | 5 | 20 | 0.944 | 0.919 |
LSA | DTM | 5 | 30 | 0.944 | 0.920 |
LSA | DTM | 5 | 50 | 0.959 | 0.928 |
LSA | DTM | 20 | 5 | 0.917 | 0.920 |
LSA | DTM | 20 | 20 | 0.929 | 0.916 |
LSA | DTM | 20 | 30 | 0.928 | 0.917 |
LSA | DTM | 20 | 50 | 0.941 | 0.921 |
LSA | DTM | 30 | 5 | 0.920 | 0.923 |
LSA | DTM | 30 | 20 | 0.934 | 0.913 |
LSA | DTM | 30 | 30 | 0.934 | 0.918 |
LSA | DTM | 30 | 50 | 0.948 | 0.918 |
LSA | DTM | 50 | 5 | 0.913 | 0.927 |
LSA | DTM | 50 | 20 | 0.926 | 0.919 |
LSA | DTM | 50 | 30 | 0.926 | 0.918 |
LSA | DTM | 50 | 50 | 0.939 | 0.919 |
The following table shows the results with TF-IDF as input. The best result (RMSE = 0.900) is given by a random forest using TF-IDF + LSA with 30 pros dimensions and 20 cons dimensions.
Method | Input | # of dimensions pros | # of dimensions cons | Linear model | Random forest |
---|---|---|---|---|---|
LSA | TFIDF | 5 | 5 | 0.972 | 0.907 |
LSA | TFIDF | 5 | 20 | 0.993 | 0.907 |
LSA | TFIDF | 5 | 30 | 1.012 | 0.911 |
LSA | TFIDF | 5 | 50 | 1.011 | 0.913 |
LSA | TFIDF | 20 | 5 | 0.969 | 0.901 |
LSA | TFIDF | 20 | 20 | 0.976 | 0.902 |
LSA | TFIDF | 20 | 30 | 0.997 | 0.906 |
LSA | TFIDF | 20 | 50 | 0.980 | 0.904 |
LSA | TFIDF | 30 | 5 | 0.978 | 0.904 |
LSA | TFIDF | 30 | 20 | 0.981 | 0.900 |
LSA | TFIDF | 30 | 30 | 1.006 | 0.910 |
LSA | TFIDF | 30 | 50 | 0.990 | 0.907 |
LSA | TFIDF | 50 | 5 | 0.971 | 0.919 |
LSA | TFIDF | 50 | 20 | 0.981 | 0.914 |
LSA | TFIDF | 50 | 30 | 1.010 | 0.916 |
LSA | TFIDF | 50 | 50 | 0.988 | 0.910 |
Method | Input | # of dimensions pros | # of dimensions cons | Learner | RMSE |
---|---|---|---|---|---|
LSA | TFIDF | 30 | 20 | RF | 0.900 |
LSA | TFIDF | 20 | 5 | RF | 0.901 |
LSA | TFIDF | 20 | 20 | RF | 0.902 |
LSA | TFIDF | 20 | 50 | RF | 0.904 |
LSA | TFIDF | 30 | 5 | RF | 0.904 |
LSA | TFIDF | 20 | 30 | RF | 0.906 |
LSA | TFIDF | 5 | 5 | RF | 0.907 |
LSA | TFIDF | 5 | 20 | RF | 0.907 |
LSA | TFIDF | 30 | 50 | RF | 0.907 |
LSA | TFIDF | 30 | 30 | RF | 0.910 |
LSA | TFIDF | 50 | 50 | RF | 0.910 |
LSA | TFIDF | 5 | 30 | RF | 0.911 |
LSA | TFIDF | 5 | 50 | RF | 0.913 |
LSA | DTM | 50 | 5 | LM | 0.913 |
LSA | DTM | 30 | 20 | RF | 0.913 |
LSA | TFIDF | 50 | 20 | RF | 0.914 |
LSA | TFIDF | 50 | 30 | RF | 0.916 |
LSA | DTM | 20 | 20 | RF | 0.916 |
LSA | DTM | 20 | 5 | LM | 0.917 |
LSA | DTM | 20 | 30 | RF | 0.917 |
LSA | DTM | 30 | 30 | RF | 0.918 |
LSA | DTM | 50 | 30 | RF | 0.918 |
LSA | DTM | 30 | 50 | RF | 0.918 |
LSA | DTM | 50 | 20 | RF | 0.919 |
LSA | DTM | 50 | 50 | RF | 0.919 |
LSA | DTM | 5 | 20 | RF | 0.919 |
LSA | TFIDF | 50 | 5 | RF | 0.919 |
LSA | DTM | 5 | 30 | RF | 0.920 |
LSA | DTM | 30 | 5 | LM | 0.920 |
LSA | DTM | 20 | 5 | RF | 0.920 |
LSA | DTM | 20 | 50 | RF | 0.921 |
LSA | DTM | 5 | 5 | RF | 0.923 |
LSA | DTM | 30 | 5 | RF | 0.923 |
LSA | DTM | 50 | 30 | LM | 0.926 |
LSA | DTM | 50 | 20 | LM | 0.926 |
LSA | DTM | 50 | 5 | RF | 0.927 |
LSA | DTM | 20 | 30 | LM | 0.928 |
LSA | DTM | 5 | 50 | RF | 0.928 |
LSA | DTM | 20 | 20 | LM | 0.929 |
LSA | DTM | 5 | 5 | LM | 0.932 |
LSA | DTM | 30 | 30 | LM | 0.934 |
LSA | DTM | 30 | 20 | LM | 0.934 |
LSA | DTM | 50 | 50 | LM | 0.939 |
LSA | DTM | 20 | 50 | LM | 0.941 |
LSA | DTM | 5 | 20 | LM | 0.944 |
LSA | DTM | 5 | 30 | LM | 0.944 |
LSA | DTM | 30 | 50 | LM | 0.948 |
LSA | DTM | 5 | 50 | LM | 0.959 |
LSA | TFIDF | 20 | 5 | LM | 0.969 |
LSA | TFIDF | 50 | 5 | LM | 0.971 |
LSA | TFIDF | 5 | 5 | LM | 0.972 |
LSA | TFIDF | 20 | 20 | LM | 0.976 |
LSA | TFIDF | 30 | 5 | LM | 0.978 |
LSA | TFIDF | 20 | 50 | LM | 0.980 |
LSA | TFIDF | 50 | 20 | LM | 0.981 |
LSA | TFIDF | 30 | 20 | LM | 0.981 |
LSA | TFIDF | 50 | 50 | LM | 0.988 |
LSA | TFIDF | 30 | 50 | LM | 0.990 |
LSA | TFIDF | 5 | 20 | LM | 0.993 |
LSA | TFIDF | 20 | 30 | LM | 0.997 |
LSA | TFIDF | 30 | 30 | LM | 1.006 |
LSA | TFIDF | 50 | 30 | LM | 1.010 |
LSA | TFIDF | 5 | 50 | LM | 1.011 |
LSA | TFIDF | 5 | 30 | LM | 1.012 |
Here we added the valence-shifter score (the sentiment function mentioned above) and the length of the reviews to our best model. Despite these additions, the RMSE does not improve on the 0.900 obtained previously. However, we can see that our model does quite well when it has to predict twos and threes, which could be explained by the fact that more reviews are rated with a two or a three.

Since our best model uses a random forest with LSA, it is possible to extract the variable importances and assess whether a topic is more or less important for accurately predicting the ratings.
Here we can see that the lengths of the pros and cons are important and that the sentiment score greatly helps the prediction. We can also observe that `cons.3` seems to be an important topic.
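As a sketch, assuming the forest was fit with ranger on a training frame `train` whose target column is `rating`:

```r
library(ranger)

# Impurity-based importance ranks the features; in our case the pros/cons
# lengths, the sentiment score, and cons.3 come out on top.
rf <- ranger(rating ~ ., data = train, importance = "impurity")
sort(importance(rf), decreasing = TRUE)
```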
In the table below we investigate the words in `cons.3`. The two words with the largest positive loadings in `cons.3` are `people` and `team`, which convey much the same meaning.
Key words | Value | Absolute value |
---|---|---|
people | 0.153 | 0.153 |
team | 0.116 | 0.116 |
day | -0.094 | 0.094 |
service | -0.103 | 0.103 |
graduate | -0.113 | 0.113 |
goal | -0.120 | 0.120 |
open | -0.124 | 0.124 |
call | -0.127 | 0.127 |
schedule | -0.128 | 0.128 |
employee | -0.129 | 0.129 |
teller | -0.132 | 0.132 |
store | -0.135 | 0.135 |
sale | -0.210 | 0.210 |
branch | -0.277 | 0.277 |
customer | -0.399 | 0.399 |
`pros.2` is related to vacations, with words such as day, year, vacation, and time. It could also be related to compensation in case of sickness (sick, pay).
Key words | Value | Absolute value |
---|---|---|
day | 0.306 | 0.306 |
year | 0.218 | 0.218 |
week | 0.202 | 0.202 |
vacation | 0.181 | 0.181 |
time | 0.174 | 0.174 |
pay | 0.166 | 0.166 |
3 | 0.161 | 0.161 |
401k | 0.146 | 0.146 |
sick | 0.129 | 0.129 |
4 | 0.126 | 0.126 |
match | 0.120 | 0.120 |
2 | 0.119 | 0.119 |
5 | 0.119 | 0.119 |
leave | 0.118 | 0.118 |
opportunity | -0.118 | 0.118 |
These results may suggest that, in general, employees are satisfied with their vacation and days off, and dissatisfied with internal management methods.
To complete our supervised learning, we used word embeddings, namely `GloVe`, to model the data in a different way. We computed the RMSE for each combination of the number of dimensions for pros and cons; when building our data frame, we took the mean of the word vectors instead of the sum, as it worked better.
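A sketch of the GloVe pipeline, assuming quanteda tokens `toks_pros` and the text2vec package:

```r
library(quanteda)
library(text2vec)

# Build a feature co-occurrence matrix and fit 50-dimensional GloVe vectors;
# each review is then represented by the mean of its word vectors.
fcm_pros <- fcm(toks_pros, context = "window", window = 5)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(fcm_pros, n_iter = 10)
word_vectors <- wv_main + t(glove$components)
```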
Method | Input | # of dimensions pros | # of dimensions cons | Learner | RMSE |
---|---|---|---|---|---|
Glove | FCM | 50 | 50 | RF | 0.942 |
Glove | FCM | 20 | 20 | RF | 0.943 |
Glove | FCM | 50 | 20 | RF | 0.950 |
Glove | FCM | 5 | 20 | RF | 0.953 |
Glove | FCM | 20 | 5 | RF | 0.956 |
Glove | FCM | 5 | 50 | RF | 0.959 |
Glove | FCM | 20 | 50 | RF | 0.960 |
Glove | FCM | 50 | 5 | RF | 0.963 |
Glove | FCM | 5 | 5 | RF | 0.964 |
Finally, the results are clear-cut: `GloVe` produces worse results than those obtained through `LSA`.