by Eric Chu
4 September, 2017 - 10 minute read

At Winton, we often find that our greatest challenge with data is not its size or the selection of good models; our greatest challenge is with the unstated preconceptions of the data scientist. While data might not lie, it is all too easy to construct narratives around the data that reinforce preexisting worldviews.

Even rigor and the language of science can be hijacked and data selected to reinforce specific perspectives. Without careful thought or proper human systems in place, it is all too easy for those of us in positions of responsibility to be convinced of the wrong thing simply because we already conceived it to be true.

By way of illustration, let’s examine a dataset that purports to measure the quality of a university website. We at Winton Data have been wondering whether the quality of a university’s website is correlated with said university’s world ranking.

Websites are, at one level, a marketing tool. So for the best-known and less tech-focused universities (those with the strongest brands), one preconception is that they have less need of a high-quality website, given their ability to attract applicants on the strength of their name alone. For others, however, it seems a reasonable assumption that the website is part of a university’s public image and is therefore likely to be well-tended. This makes a good test case to see whether either of those potentially contradictory prejudices is borne out.

(In the interest of full disclosure: the author is a Stanford grad and also would like to know how Stanford’s website fares in comparison to Berkeley’s.)

To help us answer this question, we use Google’s Lighthouse and data from the US Department of Education.

Google Lighthouse

Google recently released an open-source tool, Lighthouse, to help web developers determine if their websites meet certain best practices and performance metrics. Using this tool alongside a handy Chrome browser, we are able to measure four different metrics for a web page:

  1. its adherence to Google’s progressive web app checklist,
  2. its loading and rendering performance,
  3. its accessibility, and
  4. its adherence to modern web best practices.

Here’s what these metrics say about Winton’s site:

Winton Lighthouse scores -- pwa: 55, performance: 83, access: 86, best practices: 85

Note that the notion of website quality is defined by the folks on the Chrome Dev Tools developer team. A different set of metrics may produce a different set of results. Furthermore, the progressive web app metric is a fairly new metric that only applies to applications hosted on a website (e.g., a to-do list) and does not apply to the vast majority of websites, which are purely informational. Even though Winton’s score for a progressive web app is 55, a quick visit to Google shows that they, too, have a score of 55. Most sites have a score sitting around 50 or 60.

University websites

The US Department of Education maintains a list of universities, their accreditation status and their websites. Unfortunately, it does not contain any ranking data. We obtain the world university ranking data from Kaggle, which has the Times World Ranking data from 2012 to 2015.

We join the two datasets together using the institution name and restrict ourselves to the rankings as of 2015. The resulting dataset contains 153 universities, their world ranking, and their websites (as reported to the Department of Education). We did notice one mistake: The University of Colorado, Denver had their website reported as instead of
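The join described above can be sketched in plain Python. This is only an illustration: the record layout, the field names, the `normalize` helper, and the toy URLs are assumptions, not the actual Department of Education or Kaggle schema, and the original does not say whether names were normalized before matching.

```python
def normalize(name):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def join_on_name(websites, rankings, year=2015):
    """Inner-join website records with ranking records for a single year."""
    # Index the website records by normalized institution name.
    site_by_name = {normalize(w["institution"]): w["website"] for w in websites}
    joined = []
    for r in rankings:
        key = normalize(r["institution"])
        # Keep only rows for the requested year that appear in both datasets.
        if r["year"] == year and key in site_by_name:
            joined.append({
                "institution": r["institution"],
                "world_rank": r["world_rank"],
                "website": site_by_name[key],
            })
    return joined


# Toy records with made-up names and URLs (hypothetical, for illustration only).
websites = [
    {"institution": "Example State University", "website": "https://example.edu"},
    {"institution": "Sample  College", "website": "https://sample.example"},
]
rankings = [
    {"institution": "example state university", "year": 2015, "world_rank": 10},
    {"institution": "Example State University", "year": 2014, "world_rank": 12},
    {"institution": "Unlisted Institute", "year": 2015, "world_rank": 20},
]
joined = join_on_name(websites, rankings)
```

Joining on free-text names silently drops any institution whose name is spelled differently in the two sources, so it is worth checking the unmatched rows by hand.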

Lighthouse scores

Because of the variations in measuring performance scores with Lighthouse, we execute three runs of Lighthouse against all 153 websites and report the average score. We then rank the universities by each of the four metrics, and also by their world ranking. The following table contains five rows of our results ordered by the university’s world rank:

| Institution | Progressive Web App | Performance | Accessibility | Best Practices | World Rank |
|---|---|---|---|---|---|
| Harvard University | 36.4 | 78.7 | 91.4 | 69.2 | 1 |
| Stanford University | 54.5 | 96.9 | 94.3 | 92.3 | 2 |
| Massachusetts Institute of Technology | 45.5 | 99.6 | 92.4 | 76.9 | 3 |
| University of California, Berkeley | 36.4 | 96.7 | 97.1 | 61.5 | 7 |
| University of Chicago | 45.5 | 77.7 | 85.7 | 69.2 | 8 |
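The averaging and ranking step described above might look like the following sketch. The site names and per-run scores are made up, and the tuple layout and function names are assumptions for illustration; the original does not describe how the Lighthouse output was stored.

```python
from statistics import mean

# Order of the four Lighthouse metrics within each run tuple (an assumed layout).
METRICS = ("pwa", "performance", "accessibility", "best_practices")


def average_runs(runs):
    """Average each metric over repeated runs; each run is a tuple in METRICS order."""
    return {m: mean(run[i] for run in runs) for i, m in enumerate(METRICS)}


def rank_by(scores_by_site, metric):
    """Return site names ordered from best to worst on a single metric."""
    return sorted(scores_by_site, key=lambda s: scores_by_site[s][metric], reverse=True)


# Three hypothetical runs per site; only the performance score varies between runs.
runs_by_site = {
    "site-a.example": [(40.0, 80.0, 90.0, 70.0), (40.0, 90.0, 90.0, 70.0), (40.0, 100.0, 90.0, 70.0)],
    "site-b.example": [(50.0, 70.0, 95.0, 60.0), (50.0, 72.0, 95.0, 60.0), (50.0, 74.0, 95.0, 60.0)],
}
averaged = {site: average_runs(runs) for site, runs in runs_by_site.items()}
by_performance = rank_by(averaged, "performance")
by_pwa = rank_by(averaged, "pwa")
```

Even in this toy example the two sites swap places depending on which metric is used for the ordering.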

Here are the score distributions across all 153 websites:

University website performance distributions

Here are the rankings based solely on connection performance:

University website performance rankings

There are several things to note in this result:

These observations suggest that rankings based on a single metric give a poor sense of the quality of the site.

(Also, the astute reader may realize that ranking by performance means Berkeley is better than Stanford, so the post must go on!)

While we could form a weighted combination of the four scores, it is unclear what the proper weights should be. Furthermore, the weights betray individual biases (i.e., you might care more about accessibility, while I care more about performance) and are ultimately subjective. We thus take a more data-driven approach and consider two different scoring mechanisms.

First, we use principal component analysis (PCA) to create a low-dimensional embedding of the data. This approach also yields axes that explain the vast majority of the variance in the data, thus unambiguously lending credence to the choice of weights. One caveat, however, is that the PCA approach is likely to be unstable: new data points may significantly change the weights. We do not perform any sensitivity analysis on these weights.

The second is to rank universities based on the best and worst percentile of their four metrics. This approach is simpler and does not require any choice of weights.

Alternative ranking methods

Low-dimensional Lighthouse scores via PCA

Using PCA to embed the four metrics into a two-dimensional space, we obtain the following principal components:

\begin{array}{lcl} x & = & 0.36 m_{\mathrm{prog}} + 0.88 m_{\mathrm{perf}} + 0.10 m_{\mathrm{access}} + 0.29 m_{\mathrm{best}} - 122.41\\ y & = & 0.74 m_{\mathrm{prog}} - 0.47 m_{\mathrm{perf}} + 0.003 m_{\mathrm{access}} + 0.47 m_{\mathrm{best}} - 22.64 \end{array}

where \( m_{\mathrm{prog}} \) is the progressive web app metric, \(m_\mathrm{perf}\) is the performance metric, \(m_\mathrm{access}\) is the accessibility metric, and \(m_\mathrm{best}\) is the best practices metric. These two principal components explain about 75% of the variance in the data.
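The coefficient vectors in the two equations above have roughly unit length and are roughly orthogonal, which is consistent with PCA applied to centered but unscaled metrics; under that reading, each constant term is simply minus the weight vector dotted with the column means. The NumPy sketch below illustrates that structure on the five rows shown in the table earlier. It is an assumption about the procedure, not the authors' code, and with only five sites it will not reproduce the published coefficients. The function and variable names are hypothetical.

```python
import numpy as np

# The five rows from the table above: PWA, performance, accessibility, best practices.
M = np.array([
    [36.4, 78.7, 91.4, 69.2],  # Harvard University
    [54.5, 96.9, 94.3, 92.3],  # Stanford University
    [45.5, 99.6, 92.4, 76.9],  # Massachusetts Institute of Technology
    [36.4, 96.7, 97.1, 61.5],  # University of California, Berkeley
    [45.5, 77.7, 85.7, 69.2],  # University of Chicago
])


def pca(M, k=2):
    """Return the top-k unit weight vectors, their constant offsets, and explained variance ratios."""
    mean = M.mean(axis=0)
    centered = M - mean
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    W = Vt[:k]
    # w . (m - mean) = w . m - w . mean, so the constant term is -w . mean.
    offsets = -W @ mean
    explained = (S ** 2 / np.sum(S ** 2))[:k]
    return W, offsets, explained


W, offsets, explained = pca(M)
# Each row of coords is (x, y) for one site: a weighted sum of its metrics plus an offset.
coords = M @ W.T + offsets
```

The sign of each row of `W` is arbitrary, so a component may need to be flipped before the plot can be read as "up and to the right is better".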

Plotting the university web sites along these two principal components and normalizing, we obtain the following plot.

University website PCA embedding

Because principal components are unique up to a sign, we choose the signs on the two components such that points that fall “up” and to the “right” of the plot generally have higher scores across all four metrics while those that fall “down” and to the “left” generally have lower scores across all four metrics. Interestingly, both Stanford University (upper right) and Sofia University (lower left) are in Palo Alto, California.

| Institution | Progressive Web App | Performance | Accessibility | Best Practices | World Rank |
|---|---|---|---|---|---|
| Stanford University | 54.5 | 96.9 | 94.3 | 92.3 | 2 |
| Sofia University | 27.3 | 75.4 | 82.9 | 53.8 | 845 |

As PCA has already decorrelated the data along these two axes, we can quantify this intuition by projecting all points onto the \(x = y\) line. Intuitively, the choice of this line means that we weight both principal components equally as the embedding has already been normalized to have unit variance. Projecting onto any other line would suggest that we have prior information about the distribution that is not captured in the data. In this case, we should reweight the data according to our prior beliefs or use a technique different from PCA.

Note that this process differs from taking the average of all four metrics. An average of the four metrics implies that all four metrics are equally important in distinguishing website quality; whereas this approach uses PCA to determine weights (from the data) for the four metrics automatically. The weights obtained by PCA are axes that contain the most variance in the data: these are axes that cause our data points to be the most dispersed.

Because all the operations are linear, our final score is a linear combination of the four metrics.

\begin{array}{lcl} s & = & 0.10 m_{\mathrm{prog}} + 0.008 m_{\mathrm{perf}} + 0.01 m_{\mathrm{access}} + 0.07 m_{\mathrm{best}} - 11.12 \end{array}
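As a rough sketch of where these coefficients could come from, assume that the normalization divides each component by its standard deviation (written here as \(\sigma_x\) and \(\sigma_y\), symbols that do not appear in the original) and that \(s\) is the scalar projection onto the unit vector along the \(x = y\) line:

\begin{array}{lcl} s & = & \frac{1}{\sqrt{2}} \left( \frac{x}{\sigma_x} + \frac{y}{\sigma_y} \right) \end{array}

Under that assumption, each coefficient of \(s\) is the same fixed combination \(a w_x + b w_y\) of the corresponding coefficients of \(x\) and \(y\), with \(a = 1/(\sqrt{2}\sigma_x)\) and \(b = 1/(\sqrt{2}\sigma_y)\). Solving the progressive web app and performance coefficients for the two factors gives roughly \(a \approx 0.065\) and \(b \approx 0.10\), and these values also reproduce the best practices coefficient: \(0.29 \times 0.065 + 0.47 \times 0.10 \approx 0.07\).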

Ordering the universities by their score, \(s\):

University PCA ordering

It is now trivial to game the university rankings by focusing specifically on the metrics with highest weights (the progressive web app metric and the best practices metric).

Best and worst percentiles

With any linear approach, we must be disciplined with our choice of weights. A relative approach, however, such as the one we are about to present, avoids the discussion of weights and instead ranks websites according to their position in their peer group.

For each university, we compute its percentile for each metric. We then report both its best and worst percentile. Here are the top five universities:

| Institution | Best Percentile | Worst Percentile | World Rank |
|---|---|---|---|
| Harvard University | 0.50 | 0.17 | 1 |
| Stanford University | 0.99 | 0.64 | 2 |
| Massachusetts Institute of Technology | 0.90 | 0.55 | 3 |
| University of California, Berkeley | 0.83 | 0.22 | 7 |
| University of Chicago | 0.66 | 0.14 | 8 |
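The best and worst percentile computation can be sketched as follows. The percentile definition used here (the fraction of other sites that a site strictly beats on a metric) is an assumption; the original does not spell out how ties are handled. The sketch uses only the five sites from the earlier score table as the peer group, so its numbers differ from the table above, which is computed against all 153 sites.

```python
# Averaged scores from the earlier table: PWA, performance, accessibility, best practices.
scores = {
    "Harvard University": (36.4, 78.7, 91.4, 69.2),
    "Stanford University": (54.5, 96.9, 94.3, 92.3),
    "Massachusetts Institute of Technology": (45.5, 99.6, 92.4, 76.9),
    "University of California, Berkeley": (36.4, 96.7, 97.1, 61.5),
    "University of Chicago": (45.5, 77.7, 85.7, 69.2),
}


def percentile_ranks(values):
    """For each value, the fraction of the other sites that it strictly beats."""
    n = len(values)
    return [sum(other < v for other in values) / (n - 1) for v in values]


def best_and_worst(scores):
    """Map each site to (best percentile, worst percentile) across its metrics."""
    names = list(scores)
    n_metrics = len(next(iter(scores.values())))
    # One list of percentiles per metric, in the same order as `names`.
    per_metric = [
        percentile_ranks([scores[name][j] for name in names])
        for j in range(n_metrics)
    ]
    return {
        name: (max(p[i] for p in per_metric), min(p[i] for p in per_metric))
        for i, name in enumerate(names)
    }


result = best_and_worst(scores)
# Pessimistic ordering: rank sites by their worst percentile, highest first.
best_overall = sorted(result, key=lambda name: result[name][1], reverse=True)
```

Because only the ordering within each metric matters, this approach needs no weights, at the cost of ignoring how far apart the underlying scores are.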

To determine which university websites are worst overall, we use their best percentile to rank them. This ranking method tries to be optimistic and ranks university websites using the best score. A score of 0.2 means that, of the four metrics, the site’s best metric is better than 20% of other sites. Its other metrics are ranked even lower.

University worst ordering

Despite its best efforts, Sofia University has one of the worst overall scores. Note that Sofia University also has one of the worst scores according to PCA.

To determine which university websites are best overall, we use their worst percentile to rank them. This ranking method tries to be pessimistic and ranks university websites using their worst score. A score of 0.8 means that, of the four metrics, the site’s worst metric is better than 80% of other sites. Its other metrics are ranked even higher. This ranking mechanism prevents any one metric with a high score from dominating the ranking, although it will penalize universities with one metric with a very low score.

University best ordering

Closing thoughts

We originally set out to determine if a university’s world ranking correlated with the quality of their website. The following chart has the universities ordered by their world ranking while the bar heights correspond to their position in quality ranking, as determined by PCA (longer bars have better quality).

University PCA ordering vs world ranking

The correlation coefficient of the PCA rank and the world ranking is 0.28; using percentile rankings for this plot does not significantly change its appearance, and the correlation coefficient is 0.19.
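The original does not say which correlation coefficient was used. One plausible reading, sketched below as an assumption, is a Pearson correlation computed on the two rank lists (which is Spearman's rank correlation when there are no ties). The quality scores in the example are made up, and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def to_ranks(values, higher_is_better=True):
    """Convert scores to ranks starting at 1 (ties are broken by position, not averaged)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


# World ranks from the table earlier, paired with made-up quality scores.
world_rank = [1, 2, 3, 7, 8]
quality_score = [0.2, 0.9, 0.7, 0.5, 0.1]
quality_rank = to_ranks(quality_score)
rho = pearson(to_ranks(world_rank, higher_is_better=False), quality_rank)
```

A value near 1 would mean that the better-ranked universities also tend to have the better-ranked websites; values near 0, such as those reported above, indicate little relationship.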

While it is perhaps unsurprising that university rankings are not very correlated with university website quality, this process did highlight some university pages that were not yet using HTTPS to serve their sites. Rice and Harvard University are two examples. Harvard’s site actually redirects to the corresponding HTTP site. However, Harvard makes amends elsewhere: for instance, its alumni donation sites are secured by HTTPS.

Lastly, while both the PCA and percentile ranking approaches suggest that Stanford’s website is of better overall quality than Berkeley’s, these metrics can be chosen such that the results agree with our personal biases. (Remember, this post is a bit longer than it needs to be because Berkeley scores better on the performance metric alone.)

While data might never lie, it is easy to construct narratives around the data that reinforce preexisting worldviews. While university website rankings may not have profound implications, many other studies do have far-reaching implications. It is therefore important for us to be aware of our personal biases - especially for those of us who are stewards of others’ wealth - and create processes designed to limit the impact of wishful thinking.