As pointed out by Barr,
1 the standard analysis of computer science (CS) degree data does not account for the changing demographics of the undergraduate population in terms of overall numbers and relative proportion of federally designated gender, race, and ethnicity groupings.
a While the standard analysis does indicate what a student experiences walking into a classroom, and is somewhat reflective of overall current demographics and historical marginalization, a new framework is necessary to evaluate longitudinal change for each demographic group. A second issue we observe is that most of the literature on broadening participation in computing (BPC) reports data on gender
or on race/ethnicity, omitting data on students’ intersectional identities. Using a single axis of analysis
5 at a time (gender or race/ethnicity) leads to an incorrect understanding of both the data and the challenges we face as a field. When used as a framework,
intersectional analysis, a term coined by Crenshaw,
5 allows us to expose the multidimensionality of lived experience, such as the everyday experiences of Black women examined in Crenshaw’s work. In computing, a number of researchers
13,15–18,20 have been using intersectionality as a framework for analyzing efforts to broaden participation in computing.
To truly assess the effectiveness of curricular, pedagogic, and institutional interventions, we must use multiple data-analysis methods, each of which presents a different perspective on the situation and the improvements achieved. These different analyses allow us to distinguish between the experience a student may have walking into a CS classroom at a particular institution relative to their experience walking into a non-CS classroom, the extent to which the CS department at institution X represents the demographics of students across all disciplines at X, and the extent to which CS as a field is attracting and retaining students of different identities.
Cohort Analysis of Longitudinal Degree Data
Discussions of diversity in computing typically look at the degrees earned by subgroups as a percentage of the whole. For example, women’s participation in computing is typically assessed by examining the percentage of total CS degrees earned by women each year, as shown in Figure
1.
b The data in this graph is from the Integrated Postsecondary Education Data System (IPEDS) Completions dataset.
19 IPEDS data is divided by Classification of Instructional Programs (CIP) codes. Computer science data can be found under federal CIP code 11. However, for some universities, CIP-11 includes Information Technology (IT) and other similarly named programs, so care must be taken in analyzing results because IT graduates are often more diverse than CS graduates.
As this article will later discuss in detail, the standard analysis (see Figure
1) does not support an accurate analysis of longitudinal trends. It does, however, provide a realistic picture of the experience an individual student has as they go through their CS studies. For example, in 2020, 21% of CS degrees were awarded to women, which means that a woman CS major walking into a CS classroom of 100 CS seniors would on average see 20 other women.
Similarly, the standard analysis of CS degrees as reported in IPEDS by race and ethnicity categories, shown in Figure
2, illustrates that in 2020, a Black CS major in a class of 100 students would see 8 other Black students, 57 White students, and so on.
The standard analysis is otherwise problematic, particularly for longitudinal analysis of change over time. It does not, for example, account for significant demographic changes that have taken place in the college-going population over time. In 1966 (leftmost data point in Figure
1), women made up 42% of the U.S. undergraduate population, but by 2020 (rightmost data point in the figure), women made up 57% of the U.S. undergraduate population. Thus, although Figure
1 gives us men’s and women’s participation relative to each other, it does not show shifts in interest by either group over time—what looks like a sudden drop in women’s degrees in the 1980s might actually be a sudden increase in interest by men while women’s interest stayed steady. The relative nature of the data presentation obscures the actual interest in CS indicated by the data.
We see this phenomenon clearly when we use the standard analysis to compare women’s CS degrees to women’s math degrees. Figure
3 shows the percentage of CS degrees and the percentage of math degrees earned by women from 1966 to 2020. From this graph, we might conclude that women study math at a significantly higher rate than they study CS because they earn a much higher percentage of math degrees than they do CS degrees. Yet, this view of the data completely hides the extent to which students do or do not study each field, making a relative comparison inappropriate and erroneous.
Figure
4 changes the computation, showing women’s CS degrees and women’s math degrees, each as a percentage of women’s degrees across
all fields. Figure
4 is an accurate representation of the extent to which each field attracts women, independent of how many men study the field. The story told by Figure
4 is quite different from that told by Figure
3. Figure
4 shows that women’s pursuit of math degrees fell off by 1980 and is currently below women’s pursuit of CS degrees. Figure
3 cannot correctly show this reality because it is distorted by the fact that men study math at a much lower rate than they study CS, making women’s interest in math
appear higher than it actually is. Figure
5 shows men’s and women’s CS degrees as a percentage of all men and women graduates and makes clear that, despite increased interest in the field by both groups, men’s pursuit of CS degrees has increased far more rapidly than has women’s.
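The difference between the two computations can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the degree counts are hypothetical (they are not IPEDS data), chosen so that women's pursuit of CS stays constant while men's doubles.

```python
def standard_share(group_cs_degrees, total_cs_degrees):
    """Standard analysis: a group's CS degrees as a percentage of all CS degrees."""
    return 100.0 * group_cs_degrees / total_cs_degrees


def cohort_share(group_cs_degrees, group_all_degrees):
    """Cohort analysis: a group's CS degrees as a percentage of that group's degrees in all fields."""
    return 100.0 * group_cs_degrees / group_all_degrees


# Hypothetical counts (NOT IPEDS data) for two years.
# Women's CS degrees stay at 2% of women's degrees; men's rate doubles from 4% to 8%.
years = {
    "year A": {"women_cs": 1_000, "women_all": 50_000, "men_cs": 2_000, "men_all": 50_000},
    "year B": {"women_cs": 1_000, "women_all": 50_000, "men_cs": 4_000, "men_all": 50_000},
}

results = {}
for year, d in years.items():
    total_cs = d["women_cs"] + d["men_cs"]
    results[year] = {
        "standard_women": standard_share(d["women_cs"], total_cs),
        "cohort_women": cohort_share(d["women_cs"], d["women_all"]),
        "cohort_men": cohort_share(d["men_cs"], d["men_all"]),
    }
    print(year, results[year])
```

In this made-up scenario the standard analysis shows women's share of CS degrees falling from about 33% to 20%, even though the cohort analysis shows women earning CS degrees at the same 2% rate in both years; the apparent drop comes entirely from the change in men's behavior.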
As another example of the importance of cohort analysis, we examine CS degrees earned by Hispanic and Black students. Figure
6 shows the standard analysis with Hispanic CS degrees and Black CS degrees as a percentage of total CS degrees. One might conclude from this figure that there was a sharp increase in participation in CS on the part of Hispanic students with a concomitant decrease in participation of Black students. Yet, this conclusion is incorrect; the growth in Hispanic CS degrees is likely also driven by the overall demographic shift in the country’s college-going population. Figure
7 provides a more accurate picture of the extent to which each group is pursuing CS degrees. In this figure, we look at Hispanic CS degrees as a percentage of total Hispanic degrees and Black CS degrees as a percentage of total Black degrees, showing clearly that both groups began a steady increase in CS as a percentage of their cohort degrees as of 2010.
It is critically important that we look at data intersectionally. Figure
7 shows an increase in CS degrees for Black students but does not address the question of whether that increase is reflected in both Black women’s degrees and Black men’s degrees. Figure
8 reports Black CS degrees as a percentage of all Black degrees, Black men’s CS degrees as a percentage of all Black men’s degrees, and Black women’s CS degrees as a percentage of all Black women’s degrees. This clearly indicates that while there is an overall increase in the extent to which Black men are being attracted to and retained by CS, there is no analogous increase in the participation of Black women.
These examples show that the standard analysis used to analyze degree data, namely examining a group’s CS degrees as a percentage of all CS degrees, is faulty because the relative size of the component groups changes over time. The standard analysis, therefore, can falsely indicate a negative trend that does not actually exist and, conversely, hide progress that is occurring. Cohort analysis (across gender, race and ethnicity, and intersectional identities) provides an accurate picture of the extent to which groups are attracted into the field. Interested readers can carry out intersectional cohort analysis for U.S. nationwide CIP-11 data or for any college or university via a Web app available at
https://aiice.shinyapps.io/AiiCE/. We next explore the importance of examining CS degree data in the context of university degree data.
The Importance of University Context
Computing departments often have no control over who attends their university, but they can influence who can discover computing, feel a sense of belonging, and persist to graduation. Comparing their department’s intersectional data with the university’s overall data lets them see their “opportunity gap.” Many departments struggle to gain access to the demographic data they need to track students by their intersectional identity as they make their way through the CS degree.
c However, all departments have access to their graduation data via IPEDS.
19 To understand the importance of reporting intersectional data in the context of the university’s data, we first look at graduation data for the entire U.S. and then examine the opportunity gap for a Hispanic Serving Institution (HSI) in California.
Figure
9 shows the 2021 national computing graduation rates for the intersection of gender and race/ethnicity captured by IPEDS as the solid bar (black and in the foreground), and the graduation rates for all degrees as the shaded bar (gray and behind the solid bar). For each race/ethnicity category tracked by IPEDS, the bar on the left represents men and the bar on the right represents women. To understand this data, we focus on a particular intersectional identity. In 2021, 8% of all computing graduates in the U.S. identified as Hispanic men, whereas only 2% identified as Hispanic women. In contrast, 6% of graduates from university (in any field) identified as Hispanic men and 9% as Hispanic women. Thus, with respect to who graduated from university in the U.S. in 2021, Hispanic men are over-represented in computing (8% vs. 6%) and Hispanic women are underrepresented (2% vs. 9%).
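The comparison in this paragraph amounts to subtracting each group's share of all graduates from its share of computing graduates. The sketch below uses only the four percentages quoted above; the function name `representation_gap` and the choice to express the gap in percentage points are illustrative assumptions, not a metric defined in the article.

```python
# Shares quoted in the text for 2021 U.S. graduates (percent of graduates).
computing_share = {"Hispanic men": 8.0, "Hispanic women": 2.0}
all_degree_share = {"Hispanic men": 6.0, "Hispanic women": 9.0}


def representation_gap(field_share, reference_share):
    """Field share minus reference share, in percentage points, for each group."""
    return {group: field_share[group] - reference_share[group] for group in field_share}


gaps = representation_gap(computing_share, all_degree_share)
for group, gap in gaps.items():
    if gap > 0:
        label = "over-represented"
    elif gap < 0:
        label = "underrepresented"
    else:
        label = "at parity"
    print(f"{group}: {gap:+.1f} points ({label})")
```

A department could run the same comparison with its own CS shares against its university's shares to quantify its opportunity gap for each intersectional group.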
Computer science departments can understand their opportunity gap by looking at their own data in comparison to the data across their university. In Figure
10, we show the 2021 graduation data for an HSI in California. What is particularly striking is that out of the more than 2,300 Hispanic women who graduated from this university in 2021, only five graduated with a CS degree. Indeed, this problem persists across women of all races and ethnicities at this university; in 2021, 59% of its graduates were women, whereas only 19% of CS graduates were women.
Comparing diversity in computing across different institutions is best performed in the context of their university demographics. In Table
1, we use the standard analysis to compare two institutions, both public universities. Institution 1 is an HSI whereas Institution 2 is not dominated by any single race or ethnicity grouping. Unsurprisingly, because Hispanic students make up the majority of the student body at Institution 1, they earn the majority of its CS degrees. Similarly, because the Hispanic student body is very large at Institution 1, Hispanic women make up a much larger proportion of total CS degrees there than they do at Institution 2. Yet, when we apply a cohort analysis approach, we generate a picture of these two institutions that points more clearly to where existing interventions may be effective and what new interventions might be useful.
As we can see in Table
2, the percentage of Hispanic students’ degrees that are earned in CS is similar at both institutions (3.2% vs. 3.4%). Yet, interesting differences arise when we consider the intersection of gender with ethnicity. At Institution 1, 0.9% of all Hispanic women’s degrees are earned in CS, whereas at Institution 2, 2.7% of all Hispanic women’s degrees are earned in CS. In contrast, Institution 1 does a better job drawing Hispanic men into CS (6.4% of Hispanic men’s degrees) compared to Institution 2 (4.3% of Hispanic men’s degrees). This may indicate that Institution 2 has strong interventions designed to recruit and retain women in computing, with derivative impact on Hispanic women students, but does not necessarily have efforts targeting students from historically marginalized race and ethnicity groups. By the same token, it would appear that Institution 1 should consider developing interventions focused on its women students.
The examples in this section illustrate the utility of looking at the intersectional graduation data of a CS department in the context of the overall demographics of its university. This comparison reveals the opportunity gap the department can tackle. For example, a CS department that awards 30% of degrees to women is a stunning success in a technical university where the representation of women across
all degrees is 30% but represents an opportunity gap in a university that is 57% women.
d In the next section, we examine the utility of summary statistics for measuring diversity using entropy-based measures/metrics.
Entropy-based Diversity Metrics
There is a large body of literature on the use of entropy-based metrics for measuring diversity in populations. Perhaps the most analogous application of these metrics to BPC is the work done on measuring residential segregation. Massey and Denton, for example, discuss many metrics used to measure residential segregation.
10 They proposed that residential segregation could be described by five dimensions: evenness, exposure, concentration, centralization, and clustering. Their analysis has been replicated and widely discussed,
4,9–11,21–22 and it serves as a good baseline for measuring diversity in computing programs.
e One of these dimensions, evenness, has a good parallel with our analysis of diversity in computing. Recently, Kelly lamented that there was “no composite institution-level measure for ethnic diversity”
6 and proposed to use the Shannon index as a way to measure ethnic distribution in academic programs. Kelly goes on to state, “What is needed is a single index that does more than simply count how many ethnicities exist in a dataset, but instead takes account of the relative population size of those different ethnicities.”
6 The Shannon information index,
22 or the Entropy index, has been commonly used for such purposes. This measure is defined as:

$$H = -\sum_{i=1}^{k} p_i \ln(p_i)$$

where k is the number of groups in the analysis, N_i is the number of students in group i, N is the total number of students in the population, and p_i = N_i/N (that is, the proportion of group i in the population). When all k groups are equally represented, H is maximized. For our purposes, we will use the normalized version, called the Shannon Equitability Index (see Kelly
6), which is computed as

$$E_H = \frac{H}{\ln(k)}$$

and produces values between 0 (no diversity) and 1 (all groups are equally represented).
EH (often called “Evenness”) represents the degree to which all groups are equally proportioned in the population of study. When this value is represented as a percent (0% to 100%), it can be interpreted as the percent of a uniform distribution that a particular distribution represents.
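As a rough illustration of how EH = H/ln(k) behaves, the following Python sketch computes the Shannon index and the Shannon Equitability Index from a list of group counts. The function names and the example counts are hypothetical, and the convention that an empty group contributes zero to H is the usual one for entropy but is an assumption here, since the text does not discuss empty groups.

```python
import math


def shannon_index(counts):
    """H = -sum(p_i * ln(p_i)) over groups; groups with zero members contribute 0."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def equitability(counts):
    """E_H = H / ln(k), where k is the number of groups in the analysis (empty ones included)."""
    k = len(counts)
    if k < 2:
        raise ValueError("need at least two groups")
    return shannon_index(counts) / math.log(k)


# Hypothetical group counts (NOT IPEDS data).
even_split = equitability([50, 50])      # two equally sized groups
skewed_split = equitability([68, 32])    # a 68% / 32% split
single_group = equitability([100, 0])    # everyone in one group

print(f"{even_split:.3f} {skewed_split:.3f} {abs(single_group):.3f}")
```

An even split gives the maximum value of 1, a 68%/32% split gives roughly 0.90, and a population concentrated in a single group gives 0, which matches the interpretation of EH as the fraction of a uniform distribution that a given distribution represents.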
To illustrate the strengths and weaknesses of
EH for our purposes, we apply this measure to the IPEDS data of the 12 universities in North Carolina that graduated the largest number of students with CS degrees in 2020. These institutions are listed in Table
3 and include public and private, urban and rural, Historically Black Colleges and Universities (HBCUs), as well as different levels in the Carnegie Classification.
f Figure
11 shows
EH for the CS program at Univ-11 for the years 2010-2019. It shows three different calculations of
EH:
1. Gender.
2. Race/Ethnicity (American Indian or Alaska Native, Asian, Black or African American, Hispanic or Latino, Native Hawaiian or Other Pacific Islander, White, and two or more races).
3. Intersectionality, which includes all combinations of gender and race.
For this institution, EH calculated for gender has not changed much, rising from 65% in 2010 to 67.2% in 2019. In contrast, EH calculated for race rose from 36.5% in 2010 to 67.3% in 2019, which is a significant improvement. This example shows that EH can be used to track diversity in a single institution over time. The next analysis illustrates the weaknesses of EH as a measure to compare diversity across institutions.
Figure
12 shows a dumbbell graph of
EH for all institutions shown in Table
3 for gender (circle), race/ethnicity (triangle), and the intersection of race and gender (diamond) of the 2020 graduation data across all degrees for each university as reported in IPEDS. As shown in Figure
12, nearly all institutions are close to gender parity with 90.6% as the lowest
EH value among this group. Indeed, for these universities, women made up between 49% and 68% of all 2020 graduates. Note that a student body that is 50% female and 50% male would yield 100% for the
EH metric.
The EH metric for race tells a different story. EH calculated using race ranges from 23% to 66% for this set of institutions. It is worth noting that the three institutions with the lowest value for the race EH metric are all HBCUs, where the representation of African American students ranges between 81% and 93%. EH measures how close a population is to a uniform distribution, ignoring the context of the institution. Indeed, it is expected that HBCUs would have a low value in the EH metric given the mission and composition of HBCUs. Therefore, we must be careful when using EH to compare across institutions because it ignores institutional context.
As we saw previously in this article, to truly understand participation of a particular group within a particular computing program, we must consider the representation of sub-populations in the context of the larger reference group (that is, cohort analysis).
EH as a measure of evenness ignores the size of the reference group. For that, we turn to the Jensen-Shannon divergence,
8 which measures the similarity between two probability distributions. It is based on the Kullback–Leibler divergence,
7 with some notable (and useful) differences, including that it is symmetric, and it always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen–Shannon (JS) distance.
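To make the definition concrete, here is a minimal Python sketch of the JS distance between two discrete distributions, such as the intersectional shares of CS degrees and of all degrees at one institution. The function names and the example distributions are hypothetical, and the use of base-2 logarithms (which bounds the distance between 0 and 1) is an assumption; the text does not state which base was used.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between distributions p and q."""
    # m is the midpoint distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the divergence stays finite even when one distribution has empty groups.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding error


# Hypothetical shares over four groups (NOT IPEDS data); each list sums to 1.
all_degrees = [0.40, 0.30, 0.20, 0.10]
cs_degrees = [0.10, 0.20, 0.30, 0.40]

identical = js_distance(all_degrees, all_degrees)
forward = js_distance(cs_degrees, all_degrees)
backward = js_distance(all_degrees, cs_degrees)
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])

print(identical, forward == backward, disjoint)
```

Identical distributions give a distance of 0, swapping the arguments does not change the result, and two distributions with no overlap give the maximum of 1 under base-2 logarithms. SciPy's `scipy.spatial.distance.jensenshannon` computes the same quantity and could be used instead of the hand-written functions.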
g Figure
13 shows the JS distance between the intersectional distribution of CIP 11 degrees awarded and all degrees awarded for each of the institutions listed in Table
3. A value of zero means that the two distributions are identical. To understand this metric, we look at Univ-5, a private HBCU, which has the highest JS distance of the 12 universities. Figure
14 shows the intersectional breakdown for CIP 11 degrees and all degrees for Univ-5. Although most degrees awarded (62%) by this institution went to Black women, the institution awarded
zero CIP 11 degrees to Black women. Furthermore, Hispanic men and women are overrepresented in CIP 11 with regard to all degrees awarded on this campus.
In this section, we examined the use of two commonly applied entropy-based measures for evaluating the demographic diversity of a population. The first,
EH, is maximized when all sub-populations are uniformly distributed. The second, the JS distance, measures how different the population of CS is compared to the reference population and can be seen as a summary statistic of the data presented in Figure
9. Both are useful summary statistics to track over time to see if representation is increasing overall, but they are best used in combination with the other, more detailed analysis methods presented.
Conclusion
In this article, we have pointed out the limitations of looking at diversity and assessing BPC efforts via the single metric of the percentage of each sub-population’s degree attainment as a proportion of the total degrees in the field. We make three recommendations for quantitative data analysis of BPC efforts. First, we need to examine cohort-based data to evaluate each group’s interest in computing, independent of larger demographic shifts in the student population. Second, the field as a whole needs to adopt the norm of always reporting intersectional data, rather than just looking at men/women and race/ethnicity separately. Third, university demographic context must be considered when evaluating how well a computing department is doing to broaden participation, thereby also accounting for shifts in the overall university population. Cohort-based analysis, intersectionality, and entropy-based measures provide different insights that are necessary to fully understand the challenges and successes of BPC activities.
We conclude with one final observation about what data to analyze. In this article, we analyzed IPEDS graduation data, but there are additional facets to the challenges faced by students not captured in this data, including differential exposure to computing, recruitment, mentoring, retention, and institutional barriers students face in discovering and majoring in computing.
2,3 Graduation data is not sufficient for monitoring the impact of BPC activities, particularly on the introductory sequence of courses in the computing major. Thus, we recommend tracking the intersectional demographics of students’ drop/fail/withdraw rates in the introductory sequence classes by professor
every semester/quarter to uncover opportunities for change in the curriculum, the co-curricular elements, and so on.
12 We note that it can be difficult for computing departments to obtain the often centrally held student demographic data. The Center for Inclusive Computing has, to date, successfully helped more than 60 U.S. universities provide this intersectional data to their computing departments. Finally, we would be remiss in not pointing out that, in addition to quantitative analysis, we must also examine qualitative data and student survey data to better understand the experiences of students and the opportunities for our programs to be more inclusive.