Debiasing Career Recommendations with Neural Fair Collaborative Filtering

A growing proportion of human interactions are digitized on social media platforms and subjected to algorithmic decision-making, and it has become increasingly important to ensure fair treatment from these algorithms. In this work, we investigate gender bias in collaborative-filtering recommender systems trained on social media data. We develop neural fair collaborative filtering (NFCF), a practical framework for mitigating gender bias in recommending career-related sensitive items (e.g. jobs, academic concentrations, or courses of study) using a pre-training and fine-tuning approach to neural collaborative filtering, augmented with bias correction techniques. We show the utility of our methods for gender de-biased career and college major recommendations on the MovieLens dataset and a Facebook dataset, respectively, and achieve better performance and fairer behavior than several state-of-the-art models.


INTRODUCTION
There is increasing awareness that machine learning (ML) algorithms can affect people in unfair ways with legal or ethical consequences when used to automate decisions [2,3], for example, exhibiting discrimination towards certain demographic groups. Systemic bias, which has long been the concern of civil rights and feminist scholars and activists [1,14,15,42,49], in turn affects data, and hence ML algorithms trained on data [3]. The need to connect the fairness and bias demonstrated in ML algorithms with the broader context of fairness and bias in society is increasingly well understood [37,46]. Structural disadvantages and systems of oppression in our society such as sexism and racism can lead individuals from marginalized groups to perform below their true potential. For example, these issues can reduce the available cognitive bandwidth required for academic success [50] or increase the probability and length of incarceration [1,16] for minority groups. It is important to ensure that these patterns are not replicated or amplified by ML models which are used to make consequential decisions [12].
As social media platforms are a major contributor to the number of automated data-driven decisions that we as individuals are subjected to, it is clear that such ML fairness issues in social media can potentially cause substantial societal harm. Recommender systems are the primary method for a variety of ML tasks for social media data, e.g. suggesting targets of advertisements, products, friends, web pages, and potentially consequential suggestions such as romantic partners or even career paths.
Despite the practical challenges of labor market dynamics [36], job recommendations on professional networking sites [4,23,24] are helpful for job seekers and employers. However, biases inherent in social media data can potentially lead recommender systems to produce unfair suggestions [54]. Many studies have demonstrated demographic biases in different aspects of the job market. For example, racial discrimination was shown in the recruitment process of the labor market [6], and a similar study [47] confirmed the presence of discrimination in a Canadian job market with respect to both race and ethnicity. A recent study of XING, a job platform similar to LinkedIn, demonstrated that it ranked less qualified male candidates higher than more qualified female candidates [41]. Recommendations for educational and career choices are another important application for fair recommender systems. Students' academic choices can have significant impacts on their future careers and lives. An earlier study illustrated that the screening process of a medical school in London was highly biased [44] against women and members of ethnic minorities. In 2010, women accounted for only 18% of the bachelor's degrees awarded in computer science [9], and interventions to bridge this gap are crucial to support the economic competitiveness and level of innovation of the United States [5].
Recommender systems can reinforce this disparity, or, potentially, help to mitigate it. We envision an ML-based career counseling tool which makes personalized data-driven recommendations regarding important career choices such as profession, college major, certifications, or jobs to apply for, while ensuring that the recommendations do not perpetuate systemic bias or harmful stereotypes which are damaging both for our society and for the individuals who use the system. Such a tool could support young people in consequential life decisions in partnership with their parents and counselors, as well as professionals who aim to make smart career moves. Social media data is readily available to support personalized recommendations, as long as bias issues are adequately countered.
We propose a practical technique to mitigate gender bias in sensitive item (e.g. college major or career path) recommendations. Our approach, which we call neural fair collaborative filtering (NFCF), achieves accurate predictions while addressing sensitive data sparsity (e.g., users typically have only one or two college majors or occupations) by pre-training a deep neural network on big implicit feedback data for non-sensitive items (e.g. "liked" Facebook pages, movies or music), and then fine-tuning the neural network for sensitive item recommendations. We perform two bias corrections, to address (1) bias in the input embeddings due to the non-sensitive items, and (2) bias in the prediction outputs due to the sensitive items. An ablation study shows that both interventions are important for fairness. We demonstrate the utility of our method on two datasets: MovieLens (non-sensitive movie ratings and sensitive occupations), and a Facebook dataset (non-sensitive Facebook page "likes" and sensitive college majors). Our main contributions include:
• We develop a pre-training + fine-tuning neural network method for fair recommendations on social media data.
• We propose two de-biasing methods for this task: 1) de-biasing latent embeddings, and 2) learning with a fairness penalty. We also develop two simpler model variants.
• We perform extensive experiments showing both fairness and accuracy benefits over baselines on two datasets.

BACKGROUND
In this section we formalize the problem, and discuss collaborative filtering with implicit data, and fairness metrics.

Problem Formulation
Let M and N denote the number of users and items, respectively (see Table 1 for relevant notation). Suppose we are given a user-item interaction matrix Y ∈ R^{M×N} of implicit feedback from users, e.g. social media "likes," defined as

  y_ui = 1 if an interaction between user u and item i is observed; 0 otherwise.   (1)

Here, y_ui = 1 when there is an interaction between user u and item i, e.g. when u "likes" Facebook page i. In this setting, a value of 0 does not necessarily mean u is not interested in i, as it can be that the user is not yet aware of it, or has not yet interacted with it. While interacted entries reflect users' interest in items, the unobserved entries may just be missing data. Therefore, there is a natural scarcity of strong negative feedback. The collaborative filtering (CF) problem with implicit feedback is formulated as the problem of predicting scores of unobserved entries, which can be used for ranking the items. The CF model outputs ŷ_ui = f(u, i | Θ), where ŷ_ui denotes the estimated score of interaction y_ui, Θ denotes model parameters, and f denotes the function that maps model parameters to the estimated score. If we constrain ŷ_ui to the range [0, 1] and interpret it as the probability of an interaction, we can learn Θ by minimizing the following negative log-likelihood objective function:

  L_{χ∪χ⁻}(Θ) = − Σ_{(u,i)∈χ} log ŷ_ui − Σ_{(u,i)∈χ⁻} log (1 − ŷ_ui) ,   (2)

where χ represents the set of interacted user-item pairs, and χ⁻ represents the set of negative instances, which can be all (or a sample of) unobserved interactions. In our setting, we further suppose that items i are divided into non-sensitive items (i_n) and sensitive items (i_s). For example, the i_n's can be Facebook pages where user preferences may reasonably be influenced by a protected attribute such as gender, and the user's "likes" of the pages are the implicit feedback.
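To make the objective concrete, the negative log-likelihood over observed positives and sampled negatives can be sketched as follows (an illustrative NumPy sketch, not the paper's released implementation):

```python
import numpy as np

def implicit_nll(y_hat_pos, y_hat_neg, eps=1e-12):
    """Negative log-likelihood for implicit feedback (Eq. 2 style):
    observed interactions should score near 1, sampled negatives near 0."""
    y_hat_pos = np.asarray(y_hat_pos, dtype=float)
    y_hat_neg = np.asarray(y_hat_neg, dtype=float)
    # eps guards against log(0) for saturated predictions
    return float(-(np.sum(np.log(y_hat_pos + eps))
                   + np.sum(np.log(1.0 - y_hat_neg + eps))))
```

A model that scores positives high and negatives low attains a lower loss, which is the behavior the optimizer rewards.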
Since each user u can (and often does) "like" many pages, u's observed non-sensitive data (u-i_n) is typically large. On the other hand, i_s may be the user's occupation or academic concentration provided in their social media profile. We desire that recommendations of i_s to new users be unrelated to the users' gender (or other protected attribute). Since each user u may typically be associated with only a single occupation (or other sensitive personal data rarely disclosed), the data sparsity in the observed sensitive item interactions (u-i_s) is a major challenge. Typical collaborative filtering methods can suffer from overfitting in this scenario, which often amplifies unfairness or demographic bias in the data [22,57]. Alternatively, the non-sensitive interactions u-i_n can be leveraged, but these will by definition encode biases that are unwanted for predicting the sensitive items. For example, liking the Barbie doll Facebook page may be correlated with being female and negatively correlated with computer science, thus implicitly encoding societal bias in the career recommendations.

Neural Collaborative Filtering
Matrix factorization (MF) models [39] map both users and items to a joint latent factor space of dimensionality v such that user-item interactions are modeled as inner products in that space. Each item i and user u are associated with vectors q_i ∈ R^v and p_u ∈ R^v, respectively, with

  ŷ_ui = μ + b_i + b_u + q_i^T p_u ,   (3)

where μ is the overall average rating, and b_u and b_i indicate the deviations of user u and item i from μ, respectively. Neural collaborative filtering (NCF) [27] replaces the inner products in MF with a deep neural network (DNN) which learns the user-item interactions. In the input layer, the users and items are typically one-hot encoded, then mapped into the latent space with an embedding layer. NCF combines the latent features of users p_u and items q_i by concatenating them. Complex non-linear interactions are modeled by stacking hidden layers on the concatenated vector, e.g. using a standard multi-layer perceptron (MLP).
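For illustration, this embed-concatenate-MLP architecture can be sketched in PyTorch as follows (a simplified sketch; the layer sizes here are arbitrary placeholders, not the configuration used in the experiments):

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """NCF sketch: embed users and items, concatenate the embeddings,
    and score the interaction probability with an MLP + sigmoid."""
    def __init__(self, n_users, n_items, dim=8, hidden=(16, 8)):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)   # p_u
        self.item_emb = nn.Embedding(n_items, dim)   # q_i
        layers, in_dim = [], 2 * dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]
        self.mlp = nn.Sequential(*layers)

    def forward(self, users, items):
        # z_1: concatenation of user and item latent vectors
        z1 = torch.cat([self.user_emb(users), self.item_emb(items)], dim=-1)
        return self.mlp(z1).squeeze(-1)   # scores in [0, 1]
```

The sigmoid output lets the score be interpreted as an interaction probability, matching the implicit-feedback objective above.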

Fairness Metrics
We consider several existing fairness metrics which are applicable for collaborative filtering problems.
2.3.1 Differential Fairness. The differential fairness [21,22] metric aims to ensure equitable treatment for all protected groups, and it provides a privacy interpretation of disparity, and economic guarantees. Let M(x) be an algorithmic mechanism (e.g. a recommender system) which takes an individual's data x and assigns them an outcome y (e.g. a class label or whether a user-item interaction is present). The mechanism M(x) is ϵ-differentially fair (DF) with respect to (A, Θ) if

  e^{−ϵ} ≤ P_{M,θ}(M(x) = y | s_i, θ) / P_{M,θ}(M(x) = y | s_j, θ) ≤ e^{ϵ}   (4)

for all θ ∈ Θ with x ∼ θ, all y ∈ Range(M), and all (s_i, s_j) ∈ A×A where P(s_i | θ) > 0, P(s_j | θ) > 0. Here, s_i, s_j ∈ A are tuples of all protected attribute values, e.g. male and female, and Θ, the set of data generating distributions, is typically a point estimate of the data distribution. If all of the P_{M,θ}(M(x) = y | s, θ) probabilities are equal for each group s, across all outcomes y and distributions θ, then ϵ = 0; otherwise ϵ > 0. [22] proved that a small ϵ guarantees similar utility per protected group, and ensures that protected attributes cannot be inferred based on outcomes. For gender bias in our recommender (assuming a gender binary), we can estimate ϵ-DF per sensitive item i by verifying that

  e^{−ϵ_i} ≤ [(Σ_{u:A=m} ŷ_ui + α) / (N_m + 2α)] · [(N_f + 2α) / (Σ_{u:A=f} ŷ_ui + α)] ≤ e^{ϵ_i} ,   (5)

where the scalar α is each entry of the parameter of a symmetric Dirichlet prior with concentration parameter 2α, i is an item, and N_A is the number of users of gender A (m or f).
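The smoothed per-item ϵ-DF estimate can be sketched as follows (an illustrative NumPy sketch; `scores_m` and `scores_f` are assumed to hold the predicted interaction scores ŷ_ui of the item for male and female users, respectively):

```python
import numpy as np

def epsilon_df_item(scores_m, scores_f, alpha=1.0):
    """Smoothed per-item epsilon-DF estimate for a binary gender attribute.
    alpha parameterizes a symmetric Dirichlet prior (concentration 2*alpha)."""
    p_m = (np.sum(scores_m) + alpha) / (len(scores_m) + 2 * alpha)
    p_f = (np.sum(scores_f) + alpha) / (len(scores_f) + 2 * alpha)
    # epsilon must bound the log probability ratio for both outcomes (y=1, y=0)
    return max(abs(np.log(p_m / p_f)),
               abs(np.log((1 - p_m) / (1 - p_f))))
```

When both groups receive identical average scores the estimate is zero, and it grows as the score distributions diverge.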

2.3.2 Absolute Unfairness. The absolute unfairness (U_abs) metric for recommender systems measures the discrepancy between the predicted behavior for disadvantaged and advantaged users [54]. It measures differences in absolute estimation error across user types:

  U_abs = (1/N) Σ_{j=1}^{N} | |E_D[ŷ]_j − E_D[r]_j| − |E_A[ŷ]_j − E_A[r]_j| | ,   (6)

where, for N items, E_D[ŷ]_j is the average predicted score for the j-th item for disadvantaged users, E_A[ŷ]_j is the average predicted score for advantaged users, and E_D[r]_j and E_A[r]_j are the average true scores for the disadvantaged and advantaged users, respectively.
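A minimal sketch of computing U_abs, assuming the per-item group-average predicted and true scores have already been computed (illustrative, not the metric's reference implementation):

```python
import numpy as np

def absolute_unfairness(pred_d, true_d, pred_a, true_a):
    """U_abs: mean over items of the difference in absolute estimation
    error between disadvantaged (d) and advantaged (a) groups.
    Inputs are length-N arrays of per-item group-average scores."""
    err_d = np.abs(np.asarray(pred_d) - np.asarray(true_d))
    err_a = np.abs(np.asarray(pred_a) - np.asarray(true_a))
    return float(np.mean(np.abs(err_d - err_a)))
```

A value of 0 means the model is equally accurate (or equally inaccurate) for both groups; larger values indicate one group is served systematically worse.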

NEURAL FAIR CF
Due to biased data which encode harmful human stereotypes in our society, typical social media-based collaborative filtering (CF) models can encode gender bias and make unfair decisions. In this section, we propose a practical framework to mitigate gender (or other demographic) bias in CF recommendations, which we refer to as neural fair collaborative filtering (NFCF) as shown in Figure 1.
The main components in our NFCF framework are as follows: an NCF model, pre-training user and non-sensitive item embeddings, de-biasing pre-trained user embeddings, and fine-tuning with a fairness penalty. We use NCF as the CF model because of its flexible network structure for pre-training and fine-tuning. We will show the value of each component below with an ablation study (Table 4). Similarly to [27], the DNN under the NCF model can be defined as:

  z_1 = ϕ_1(p_u, q_i) = [p_u ; q_i] ,
  ϕ_l(z_{l−1}) = a_l(W_l^T z_{l−1} + b_l) , l = 2, …, L ,
  ŷ_ui = σ(W_{L+1}^T z_L + b_{L+1}) ,   (7)

where z_l, ϕ_l, W_l, b_l, and a_l denote the neuron values, mapping function, weight matrix, intercept term, and activation function for the l-th layer's perceptron, respectively. The DNN is applied to z_1 to learn the user-item latent interactions.
In the first step of our NFCF method, pre-training user and item embeddings, NCF is trained to predict users' interactions with non-sensitive items (e.g. "liked" social media pages) via back-propagation. This leverages plentiful non-sensitive social media data to learn user embeddings encoding the user's preferences or profile, along with network weights, but may introduce demographic bias due to correlations between non-sensitive items and demographics. E.g., liking the Barbie doll page typically correlates with user gender. These correlations are expected to result in systematic differences in the embeddings for different demographics, which in turn can lead to systematic differences in sensitive item recommendations. Our aim is to leverage the valuable signal of the user's preferences for sensitive item recommendations, while also addressing the bias it carries. In step two, the user embeddings from step one are de-biased. Our method to de-bias user embeddings adapts very recent work on attenuating bias in word vectors [18] to the task of collaborative filtering. Specifically, [18] propose to de-bias word vectors using a linear projection of each word embedding w orthogonally onto a bias vector v_B, which identifies the "bias component" of w; the bias component is then removed. To adapt this method to CF, the main challenge is to find the proper bias direction v_B. [18] construct v_B based on word embeddings for gender-specific names, which are not applicable for CF. We instead use CF embeddings for users from each protected group. We first compute a group-specific bias direction for female users as

  v_female = (f_1 + f_2 + … + f_{n_f}) / n_f ,   (8)

where f_1, f_2, … are the embedding vectors for each female user, and n_f is the total number of female users. We similarly compute a bias direction v_male for male users.
Finally, we compute the overall gender bias vector:

  v_B = (v_female − v_male) / ‖v_female − v_male‖ .   (9)

We then de-bias each user embedding p_u by subtracting its component in the direction of the bias vector:

  p_u' = p_u − (p_u · v_B) v_B .   (10)

As we typically do not have demographic attributes for items, we only de-bias user embeddings and not item embeddings. In the third step, we transfer the de-biased user embeddings and pre-trained DNN parameters to a model for recommending sensitive items, which we fine-tune for this task. During fine-tuning, a fairness penalty is added to the objective function to address a second source of bias: demographic bias in the sensitive items. E.g., more men than women choose computer science careers [9], and this should be corrected [5]. We penalize the mean of the per-item ϵ's:

  ϵ_mean = (ϵ_1 + ϵ_2 + … + ϵ_{n_s}) / n_s ,   (11)

where ϵ_1, ϵ_2, … ϵ_{n_s} are the DF measures for the sensitive items and ϵ_mean is the average across the ϵ's for each item. Following [22], our learning algorithm for fine-tuning uses the fairness cost as a regularizer to balance the trade-off between fairness and accuracy. Using back-propagation, we minimize the loss function L_{χ∪χ⁻}(W) from Equation 2 for model parameters W plus a penalty on ϵ_mean, weighted by a tuning parameter λ > 0:

  min_W L_{χ∪χ⁻}(W) + λ R_χ(ϵ_mean) ,   (12)

where R_χ(ϵ_mean) = max(0, ϵ_mean^{M_W(χ)} − ϵ_mean^0) is the fairness penalty term, and ϵ_mean^{M_W(χ)} is the ϵ_mean for the CF model M_W(χ), while χ and χ⁻ are the sets of interacted and not-interacted user-item pairs, respectively. In our experiments, we use ϵ_mean^0 = 0 to encourage demographic parity. Pseudo-code is given in Algorithm 1.
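The bias-vector computation and orthogonal-projection de-biasing steps can be sketched as follows (an illustrative NumPy sketch of Equations 8-10, not the released implementation):

```python
import numpy as np

def gender_bias_vector(female_emb, male_emb):
    """Bias direction v_B: normalized difference of the two groups'
    mean user embeddings (inputs are (n_users x dim) arrays)."""
    v = np.mean(female_emb, axis=0) - np.mean(male_emb, axis=0)
    return v / np.linalg.norm(v)

def debias_users(user_emb, v_b):
    """Remove each user embedding's component along the unit bias
    direction, leaving the embeddings orthogonal to v_B."""
    user_emb = np.asarray(user_emb, dtype=float)
    return user_emb - np.outer(user_emb @ v_b, v_b)
```

After this projection, the de-biased embeddings carry no linear component in the gender-bias direction, which is exactly the property the visualization in Figure 5 illustrates.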

Variants of NFCF Model
We also consider two variants of our method which are simplifications of the NFCF model.
NFCF_embd.
The NFCF_embd variant applies only the de-biasing embeddings steps: we compute the gender bias vector v_B using Equations 8 and 9, de-bias each user embedding using Equation 10, and then fine-tune for sensitive items without the fairness penalty. Since there is no additional fairness penalty in the objective function, this algorithm converges faster. There is also no requirement to tune the λ hyperparameter.

Projection-based CF.
In the Projection-based CF algorithm, our approach is to learn an NCF model for non-sensitive u-i n interactions (using Equation 7), and then debias the user embeddings using the linear projection technique in Equation 10. Finally, we learn a classifier such as k-nearest neighbors or logistic regression on the de-biased user embeddings to predict sensitive items (i s ). There is no fine-tuning to address overfitting for sensitive items or fairness penalty-based bias correction in this approach. We previously presented this simpler model as a non-archival extended abstract at a workshop [29].
Since a user usually interacts with a single sensitive item (e.g. occupation), it is tempting to use a classifier, as in the Projection-based CF method, to predict the sensitive items such as careers, viewing them as discrete class labels. However, our experiments will show that classification approaches, including Projection-based CF and a deep neural network classifier, are suboptimal. The intuition is that even though the output for sensitive items is like classification (a single label), the input data is like recommendation (interactions of users with other items), and the overall system hence benefits from an end-to-end collaborative filtering approach as in NFCF.

EXPERIMENTS
In this section, we validate and compare our model with multiple baselines for recommending careers and academic concentrations using social media data. Our implementation's source code is provided on GitHub.

Datasets
We evaluate our models on two datasets: MovieLens, a public dataset which facilitates research reproducibility, and a Facebook dataset which is larger and provides a more realistic setting for a fair social media-based recommender system.

MovieLens Data.
We analyzed the widely-used MovieLens dataset, which contains 1 million ratings of 3,900 movies by 6,040 users who joined MovieLens [25], a noncommercial movie recommendation service operated by the University of Minnesota. We used gender as the protected attribute, self-reported occupation as the sensitive item (with one occupation per user), and movies as the non-sensitive items. Since we focus on implicit feedback, which is common in a social media setting (e.g. page "likes"), we follow [27,38] in converting explicit movie ratings to binary implicit feedback, where a 1 indicates that the user has rated the item. We discarded movies that were rated fewer than 5 times, and users who declared their occupation as "K-12 student," "retired," "unemployed," or "unknown or not specified" were discarded for career recommendation. A summary of the pre-processed dataset is shown in Table 2.
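This binarization-and-filtering pre-processing can be sketched as follows (illustrative only; we assume the ratings arrive as (user, item, rating) triples):

```python
from collections import Counter

def preprocess(ratings, min_count=5):
    """Convert explicit ratings to binary implicit feedback (a pair is
    present iff the user rated the item) and drop rarely-rated items."""
    counts = Counter(item for _, item, _ in ratings)
    return {(user, item) for user, item, _ in ratings
            if counts[item] >= min_count}
```

The rating value itself is discarded: under implicit feedback, only the fact that the user interacted with the item is kept.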

Facebook Data.
The Facebook dataset was collected as part of the myPersonality project [40]. The data for research were collected with opt-in consent. We used gender as the protected attribute, college major as the sensitive items (at most one per user), and user-page interaction pairs as the non-sensitive items. A user-page interaction occurs when a user "likes" a Facebook page. We discarded pages that occurred in fewer than 5 user-page interactions. See Table 2 for a summary of the dataset after pre-processing. In Figure 2, we show disparities in the gender distributions of 10 example careers and college majors for the MovieLens and Facebook datasets, respectively. For example, 97% of the users associated with the occupation homemaker are women in the MovieLens data, while there are only 27% women among the users associated with the computer science major in the Facebook data. As a qualitative illustration, we also show the gender distribution of top-1 recommendations from our proposed NFCF model. NFCF mitigated gender bias for most of these sensitive items. In the above examples, NFCF decreased the percentage of women for homemaker from 97% to 50%, while increasing the percentage of women for computer science from 27% to 48%.

Baselines
We compare our methods to the following "typical" baseline models:
• NCF and MF, w/ and w/o Pre-train. NCF [27] and MF [39] models trained for i_s recommendations, either directly on the u-i_s interactions (w/o Pre-train) or pre-trained on the u-i_n interactions and then fine-tuned on the u-i_s interactions (w/ Pre-train), in both cases without any bias correction.
• DNN Classifier. A simple baseline where we train a DNN-based classifier to predict i_s given the u-i_n interactions as features (i.e. binary features, one per user-page "like" or one per user-movie "rating"). No user embeddings are learned.
• BPMF. Bayesian probabilistic MF (BPMF) via MCMC [48] is also used, since it typically has good performance with small data. BPMF is trained with the u-i interactions for i_s recommendations, where i contains both i_n and i_s.
We also use the following fair baseline models:
• MF-U_abs. The objective of the MF w/o Pre-train model is augmented with a smoothed variant of U_abs [54] using the Huber loss [28], weighted by a tuning parameter λ.
• Resampling for Balance. This method [19] involves pre-processing by resampling the u-i data to produce a gender-balanced version of the training data. First, we extract the u-i data for users with known gender and randomly sample the same number of male and female users without replacement, where i includes both i_n and i_s. Finally, NCF and MF are trained on the gender-balanced u-i interactions for i_s recommendations.
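The gender-balanced resampling step of the last baseline can be sketched as follows (illustrative; `users_by_gender` is a hypothetical mapping from gender label to user ids, not a structure from the paper):

```python
import random

def gender_balanced_users(users_by_gender, seed=0):
    """Sample the same number of male and female users without
    replacement, yielding a gender-balanced user pool for training."""
    rng = random.Random(seed)
    n = min(len(users_by_gender["m"]), len(users_by_gender["f"]))
    return (rng.sample(users_by_gender["m"], n)
            + rng.sample(users_by_gender["f"], n))
```

Training data is then restricted to interactions from the returned user pool, so both genders contribute equally many users.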

Experimental Settings
All models were trained via adaptive gradient descent optimization (Adam) with learning rate 0.001 using PyTorch, where we sampled 5 negative instances per positive instance. The mini-batch size for all models was set to 2048 and 256 for the user-page and user-career data, respectively, while the embedding size for users and items was set to 128. The DNN under NFCF and NFCF_embd had 4 hidden layers with 256, 64, 32, and 16 neurons in each successive layer, with ReLU activations for the hidden layers and a sigmoid activation for the output layer. We used the same DNN architecture for the NCF and DNN Classifier models. For the Facebook dataset, we held out 1% and 40% of the user-page and user-college major data, respectively, as the test set, using the remainder for training. Since there are fewer users in the MovieLens dataset, we held out 1% and 30% of the user-movie and user-career data, respectively, as the test set, using the remainder for training. We further held out 1% and 20% of the training u-i_n and u-i_s data, respectively, as the development set for each dataset. The fairness penalty was computed for each mini-batch during training. Note that the tuning parameter λ must be chosen as a trade-off between accuracy and fairness [22]. We chose λ = 0.1 for NFCF and MF-U_abs via a grid search on the development set, using similar criteria to [22]: optimizing fairness while allowing up to 2% degradation in accuracy (i.e. NDCG) from the corresponding typical model (NCF w/ Pre-train and MF w/o Pre-train, respectively).
To evaluate the performance of item recommendation on the test data, since it is too time-consuming to rank all items for every user during evaluation [27], we followed a common strategy in the literature [20]. For non-sensitive items, we randomly sampled 100 items with which the user had not interacted for each test instance, and ranked the test instance among those 100 items. For sensitive item recommendations, in the case of the Facebook data we similarly randomly sampled 100 college majors. For the MovieLens data, there are only 17 unique careers, so we used the remaining 16 careers when ranking the test instance. The performance of a ranked list is measured by the average Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) [26]. The HR measures whether the test item is present in the top-K list, while the NDCG accounts for the position of the hit by assigning higher scores to hits at top ranks. We calculated both metrics for each test user-item pair and report the average score. Finally, we computed ϵ_mean and U_abs on the user-sensitive item test data to measure the fairness of the models for career and college major recommendations.
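The per-instance ranking metrics can be sketched as follows (illustrative; `ranked_items` stands for the model's ranking of the held-out item among the sampled candidates):

```python
import numpy as np

def hit_ratio_at_k(ranked_items, test_item, k):
    """HR@K: 1 if the held-out test item appears in the top-K list."""
    return int(test_item in list(ranked_items)[:k])

def ndcg_at_k(ranked_items, test_item, k):
    """NDCG@K for a single held-out item: 1/log2(rank + 2) if the item
    is hit at 0-based position `rank` in the top-K, else 0."""
    topk = list(ranked_items)[:k]
    if test_item in topk:
        return 1.0 / np.log2(topk.index(test_item) + 2)
    return 0.0
```

Averaging these per-instance values over all test user-item pairs yields the reported HR@K and NDCG@K scores.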

Validation of NFCF Model Design
Before comparing to fair recommendation baseline models, we systematically validate our modeling choices for NFCF.
Pre-training Task Performance: We first study the performance for NCF and MF models at the pre-training task, Facebook page and movie recommendations (Table 3). NCF had substantially and consistently better performance compared to MF on the larger Facebook dataset, and similar overall performance on MovieLens (better in 2 of 4 metrics).
Fine-Tuning Performance: We fine-tuned these models on users' interactions with the sensitive items for career and college major recommendations on the MovieLens and Facebook datasets, respectively. Figure 3 shows top-K recommendations from the 17 and 169 unique careers and college majors using several "typical" baseline models that do not involve any fairness constraints, where K ranges from 1 to 5 and 1 to 25 for the MovieLens and Facebook datasets, respectively. NCF w/ Pre-train had the best NDCG among the baselines, while our proposed NFCF and NFCF_embd performed approximately on par with NCF w/ Pre-train on both datasets. Of the typical baselines, MF w/o Pre-train and NCF w/o Pre-train performed second best for the MovieLens and Facebook datasets, respectively. For the MovieLens dataset, MF w/o Pre-train performed better than MF w/ Pre-train, presumably due to the relatively small dataset and having relatively few parameters to fine-tune, unlike the DNN-based NCF model. BPMF performed poorly despite using Bayesian inference for scarce data, perhaps due to [48]'s initialization via older MF methods.
Visualization of Embedding De-biasing: We visualized the PCA projections of the male and female vectors (Equation 8) before and after the linear projection-based de-biasing embeddings method, where PCA was performed based on all the embeddings. Figure 5 shows that the male and female vectors have very different directions and magnitudes. After de-biasing, the male and female vectors had a more similar direction and magnitude to each other.
Ablation Study: Finally, we conducted an ablation study in which the components of the method were removed one at a time. As shown in Table 4, there was a large degradation in the performance of NFCF when pre-training was removed (the de-biasing embeddings step was also removed, since there was no pre-trained user vector), or when NCF was replaced by MF. Removing the de-biased embedding method led to better HR and NDCG scores, but with an increase in the gender bias metrics. Similarly, learning without the fairness penalty led to similar HR and NDCG performance, but greatly increased gender bias. Therefore, both of the bias correction methods in the NFCF model are necessary to achieve the best level of fairness while maintaining high recommendation accuracy.

Performance for Mitigating Gender Bias in Sensitive Item Recommendations
We evaluated performance for career and college major recommendations in terms of accuracy (HR and NDCG) and fairness (ϵ mean and U abs ). In Figure 4, we show that our proposed NFCF and NFCF_embd models clearly outperformed all the fair baseline models in terms of NDCG, regardless of the cut-off K. Another variant of our proposed method, Projection-based CF, performed the second best on both datasets out of all of the fair models.
In Table 5, we show detailed results for the top-5 and top-7 recommendations on MovieLens and for the top-10 and top-25 recommendations on the Facebook dataset. Our proposed NFCF model was the most fair career and college major recommender in terms of ϵ_mean, while NFCF_embd was the most fair in terms of U_abs on the Facebook dataset. On the MovieLens dataset, our NFCF model was the most fair recommender in terms of both fairness metrics. NCF w/ Pre-train performed best on the HR and NDCG metrics for both datasets. NFCF and NFCF_embd had nearly as good HR and NDCG performance as NCF w/ Pre-train, while also mitigating gender bias. We also found that our proposed fair models NFCF and NFCF_embd sometimes outperformed NCF w/ Pre-train in terms of HR and NDCG; for example, NFCF achieved a higher HR@5 on MovieLens, and NFCF_embd a higher NDCG@10 on the Facebook data. This counter-intuitive result is presumably due to the regularization behavior of the fairness penalty on the objective, which can sometimes lead fair models to reduce overfitting to some extent compared to the typical model, a phenomenon which has previously been observed by [45].
As expected, we also found that the pre-training and fine-tuning approach reduced overfitting for NCF w/ Pre-train, and thus improved the fairness metrics by reducing bias amplification. NCF w/ Pre-train outperforms most of the fair baselines in terms of both fairness and accuracy-based measures, which validates the effectiveness of the pre-training and fine-tuning neural method for career recommendations. This was not the case for MF w/ Pre-train, presumably due to the limited number of pre-trained parameters to fine-tune. Projection-based CF and MF-U_abs also showed relatively good performance in mitigating bias in terms of U_abs compared to the typical models, but at a large cost in accuracy. Similarly, NCF via Resampling and MF via Resampling had poor accuracy, but improved fairness to some extent over their corresponding "typical" models, NCF w/o Pre-train and MF w/o Pre-train, respectively. Although it is intuitive to use a classification-based method to recommend sensitive items which typically occur only once, such as careers, the results show that our NFCF and NFCF_embd methods comprehensively outperformed Projection-based CF and the DNN Classifier in terms of all measures on both datasets. As a further qualitative experiment, we recommended the top-1 career and college major to each test male and female user via the NFCF and NCF w/o Pre-train models. In Table 6, we show the 5 and 10 most frequent recommendations to the male and female users among the 17 and 169 unique careers and majors for the MovieLens and Facebook datasets, respectively. NFCF was found to recommend similar careers to both male and female users on average for both datasets, while NCF w/o Pre-train encoded societal stereotypes in its recommendations.
For example, NCF w/o Pre-train recommends computer science to male users and nursing to female users on the Facebook dataset while it recommends executive/managerial to male users and customer service to female users on the MovieLens dataset.

DISCUSSION AND FUTURE WORK
In this paper, we investigated gender bias in recommender systems trained on social media data for suggesting sensitive items (e.g. college majors or career paths). For social media data, we typically have abundant implicit feedback on user preferences for various non-sensitive items in which gender disparities are acceptable, or even desirable (e.g. "liked" Facebook pages, movies, or music), but limited data on the sensitive items (e.g., users typically have only one or two college majors or occupations). User embeddings learned from the non-sensitive data can help recommend the sparse sensitive items, but may encode harmful stereotypes, as has been observed for word embeddings [8]. Furthermore, the distribution of sensitive items typically introduces further unwanted bias due to societal disparities in academic concentrations and career paths, e.g. from the "leaky pipeline" in STEM education [5].
We developed a practical solution for gender de-biased career recommendations while resolving the above challenges. Although we generally aimed to predict discrete class labels such as college majors, we intentionally framed the fair career recommendation task as a recommender system problem rather than a classification problem. We took this approach because, as our results in Table 5 show, the personalized predictions in this task benefited from collaborative filtering, which outperformed the classification baselines. Furthermore, the components of our proposed method, such as debiasing embeddings, pre-training, fine-tuning, and fairness interventions via a penalty term, can potentially be transferred to other models, e.g. neural graph collaborative filtering [52], and applied directly to mitigate other demographic biases, e.g. race, age, and nationality.
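To make the penalty-term component concrete, a fine-tuning objective can combine the usual binary cross-entropy on implicit feedback with a weighted fairness penalty. The sketch below uses a simple demographic-parity-style gap purely for illustration (the penalty actually used in this work is the differential fairness term of [22]); all names and the weight `lam` are hypothetical:

```python
import numpy as np

def fairness_penalty(scores, is_female):
    # Illustrative demographic-parity-style gap between the groups'
    # mean predicted scores for a sensitive item.
    return abs(scores[is_female].mean() - scores[~is_female].mean())

def finetune_loss(scores, labels, is_female, lam=0.1):
    # Binary cross-entropy on the implicit-feedback labels ...
    eps = 1e-12
    bce = -np.mean(labels * np.log(scores + eps)
                   + (1 - labels) * np.log(1 - scores + eps))
    # ... plus the weighted fairness penalty.
    return bce + lam * fairness_penalty(scores, is_female)
```

With `lam = 0` this reduces to the standard recommendation loss; increasing `lam` trades accuracy for parity between the protected groups.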
In general, the disparate behavior of typical recommendation systems (e.g. see NCF w/o Pre-train in Table 6) may partly reflect legitimately differing real-world preferences in career choices by women and men. However, according to a report by the US Department of Commerce [5], gender disparity in STEM jobs can also be attributed to factors such as strong gender stereotypes and a lack of female role models, and reducing this gender disparity is an untapped opportunity to improve the economic competitiveness and innovative capacity of the USA, and to decrease the gender wage gap. The equitable predictions produced by NFCF are one step in this direction.
The main limitation of our approach is that it is designed and evaluated for a single protected attribute, i.e. gender. For multiple protected attributes, although it is straightforward to compute the differential fairness-based penalty term [22], it is not clear how the bias direction can be computed accurately in the intermediate step between pre-training and fine-tuning. In future work, inspired by [43,51,56], we plan to address this limitation by including an adversarial network in the fine-tuning step, with the aim of making the user embeddings independent of multiple protected attributes simultaneously.

RELATED WORK
Table 5: Comparison of proposed models with the baselines in career and college major recommendations on MovieLens (17 careers) and Facebook (169 majors). Higher is better for HR and NDCG; lower is better for ϵ_mean and U_abs. NFCF greatly improves fairness metrics and beats all baselines at recommendation except for NCF w/ Pre-train, a variant of NFCF without its fairness correction.

The recommender systems research community has begun to consider issues of fairness in recommendation. A frequently practiced strategy for encouraging fairness is to enforce demographic parity among different protected groups. Demographic parity aims to ensure that the set of individuals in each protected group have similar overall distributions over outcomes [55]. Some authors have addressed the unfairness issue in recommender systems by adding a regularization term that enforces demographic parity [7,31-35]. However, demographic parity is only appropriate when user preferences have no legitimate relationship to the protected attributes. In recommendation systems for typical items such as movies, user preferences are indeed often influenced by protected attributes such as gender, race, and age [13]. Therefore, enforcing demographic parity may significantly damage the quality of recommendations. Fair recommendation systems have also been proposed by penalizing disparate distributions of prediction error [54], by making recommended items independent from protected attributes such as gender, race, or age [30], and by isolating protected attributes in tensor-based recommendations [58]. In addition, [10,11] taxonomize fairness objectives and methods based on which set of stakeholders in the recommender system are being considered, since it may be meaningful to consider fairness among many different groups. Pareto efficiency-based fairness-aware group recommendation [53] was also proposed; however, this method is not effective for personalized fair recommendations.
Furthermore, a simple technique using fair tf-idf was recently proposed [17] to mitigate demographic bias in the AI-based resume screening process. Unlike previous methods, we develop a neural network method for fair collaborative filtering on social media data that focuses on mitigating bias in career recommendations.

CONCLUSION
We investigated gender bias in social-media-based collaborative filtering. To address this problem, we introduced Neural Fair Collaborative Filtering (NFCF), a pre-training and fine-tuning method which corrects gender bias when recommending sensitive items such as careers or college majors, with little loss in performance. On the MovieLens and Facebook datasets, we achieved better performance and fairness compared to an array of state-of-the-art models.