I’ve always found some of the results driving the development of the weightwatcher library fascinating. I know they have several papers and explain what they do on their website, but here I want to try to re-confirm one of their core observations from memory, without double-checking it against their write-ups first.

Specifically, I faintly recall the claim that the linear layers of trained neural networks have power-law-distributed singular values, as opposed to the distributions you would expect from random matrices, as derived in random matrix theory (RMT).
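
If I remember the claim right, it concerns the empirical spectral density of the layer correlation matrix $W^T W$: rather than the bulk shape RMT predicts for an i.i.d. random matrix, its tail is supposed to look roughly like a power law, $\rho(\lambda) \sim \lambda^{-\alpha}$, where the eigenvalues $\lambda$ are the squared singular values of $W$.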

I want to see if my recollection is correct, and so:

import matplotlib.pyplot as plt
import torch
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")

# Singular values of each encoder layer's FFN intermediate weight matrix (shape 3072 x 768)
svds = [layer.intermediate.dense.weight.svd().S.detach().numpy() for layer in bert.encoder.layer]

# One histogram of singular values per encoder layer (bert-base has 12 layers, matching the 3 x 4 grid)
fig, axes = plt.subplots(3, 4, figsize=(12, 9))
for i, ax in enumerate(axes.flatten()):
    ax.hist(svds[i], bins=100, color="blue", alpha=0.7)
    ax.set_title(f"Layer {i+1}")

plt.suptitle("FFN Layers", fontsize=16);

[figure: per-layer histograms of FFN weight singular values]

And for attention components:

# Singular values of each layer's query projection weight W_Q (shape 768 x 768)
Q_svds = [layer.attention.self.query.weight.svd().S.detach().numpy() for layer in bert.encoder.layer]

fig, axes = plt.subplots(3, 4, figsize=(12, 9))

for i, ax in enumerate(axes.flatten()):
    ax.hist(Q_svds[i], bins=100, color='blue', alpha=0.7)
    ax.set_title(f'Layer {i+1}')
plt.suptitle("$W_Q$ Parameters", fontsize=16);

[figure: per-layer histograms of $W_Q$ singular values]

Interesting! There seems to be a qualitative difference between the spectral density of the $W_Q$ parameters and that of their FFN counterparts. Granted, the matrices are different sizes ($768 \times 768$ for $W_Q$ versus $3072 \times 768$ for the FFN intermediate weight), but we can still see clear differences in the shapes of their distributions.
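
Since the claim I'm trying to check is about heavy, power-law-like tails, a quick way to eyeball it is a log-log survival plot (CCDF) of the squared singular values. This is just a rough sketch of my own, not the fitting procedure weightwatcher actually uses, and it reuses the svds and Q_svds lists from above:

import numpy as np

# Empirical CCDF of the squared singular values (eigenvalues of W^T W) on
# log-log axes; a roughly straight tail hints at power-law behaviour.
# This is only an eyeball check, not a proper power-law fit.
def loglog_ccdf(vals, ax, label=None):
    x = np.sort(vals)
    ccdf = 1.0 - np.arange(1, len(x) + 1) / len(x)
    ax.loglog(x[:-1], ccdf[:-1], label=label)  # drop the last point, where the CCDF is 0

fig, ax = plt.subplots(figsize=(6, 4))
loglog_ccdf(svds[0] ** 2, ax, label="FFN layer 1")
loglog_ccdf(Q_svds[0] ** 2, ax, label="$W_Q$ layer 1")
ax.set_xlabel("eigenvalue of $W^T W$")
ax.set_ylabel("CCDF")
ax.legend();

A roughly straight tail on these axes is what a power law would look like, whereas a sharply truncated edge is what RMT predicts for pure noise.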

What do the singular values of a Gaussian random matrix of the same size look like?

plt.title("Random matrix singular values - FFN")
plt.hist(torch.randn_like(bert.encoder.layer[0].intermediate.dense.weight).svd().S, color="blue", alpha=0.7, bins=100);

[figure: singular value histogram of a random matrix with the FFN weight's shape]

plt.title("Random matrix singular values - FFN");
plt.hist(torch.randn_like(bert.encoder.layer[0].attention.self.query.weight).svd().S, color="blue", alpha=0.7, bins=100);

[figure: singular value histogram of a random matrix with the $W_Q$ weight's shape]

In both cases, the random-matrix spectra look qualitatively different from the trained-matrix results.
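
For reference, RMT makes this comparison precise: the singular values of an $m \times n$ matrix with i.i.d. Gaussian entries should follow the Marchenko-Pastur law (the quarter-circle law in the square case), with a hard spectral edge and no heavy tail. Here is a minimal sketch of overlaying that prediction on the random-matrix histogram; the only inputs are the FFN weight shape from bert-base ($3072 \times 768$), and mp_singular_density is just my own helper for the change of variables from eigenvalues to singular values:

import numpy as np

def mp_singular_density(s, m, n):
    # Marchenko-Pastur prediction for the singular values of an m x n matrix with
    # i.i.d. N(0, 1) entries, via a change of variables from the eigenvalue density
    # of (1/N) X^T X, where N = max(m, n) plays the role of the sample count.
    N, p = max(m, n), min(m, n)
    lam = p / N                                   # aspect ratio
    lo, hi = (1 - np.sqrt(lam)) ** 2, (1 + np.sqrt(lam)) ** 2
    x = s ** 2 / N                                # eigenvalue corresponding to singular value s
    f_x = np.zeros_like(x)
    inside = (x > lo) & (x < hi)
    f_x[inside] = np.sqrt((hi - x[inside]) * (x[inside] - lo)) / (2 * np.pi * lam * x[inside])
    return f_x * 2 * s / N                        # Jacobian of the map s -> s^2 / N

m, n = 3072, 768                                  # FFN intermediate weight shape in bert-base
sv = torch.linalg.svdvals(torch.randn(m, n)).numpy()
s_grid = np.linspace(0.0, sv.max() * 1.1, 500)

plt.hist(sv, bins=100, density=True, color="blue", alpha=0.7)
plt.plot(s_grid, mp_singular_density(s_grid, m, n), color="red")
plt.title("Gaussian random matrix vs. Marchenko-Pastur prediction");

If my recollection of the claim is right, the random matrix should hug this curve closely, while the trained weights should deviate from it, especially in the upper tail.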

Overall, this kind of thing begs to be explored further, and I’m sure the weightwatcher folks have done so and will continue to. I just wish it were discussed more often.