Great paper. Your stability-quality paradox has a clean explanation through renormalization group (RG) flow, which reframes it as a convergence-correctness distinction.
Each expert learns an energy landscape with its own attractor basins — its own "universality class" in physics language. A landscape expert's attractors encode horizons and sky gradients; a portrait expert's encode facial geometry. These are genuinely different fixed points.
Full ensemble averaging blends velocity fields that point toward different fixed points. The result is smooth (low Jacobian norms, good convergence) because contradictory predictions partially cancel — but the blended field guides the system toward a centroid of all attractor basins, a point corresponding to no coherent image. You converge reliably to the wrong place.
Sparse routing preserves the flow toward the correct fixed point by ensuring only experts whose attractors match the emerging image contribute velocity predictions. Noisier trajectory, right destination.
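A toy calculation makes the centroid failure concrete. Below is a minimal 1-D sketch (illustrative, not from the paper): two linear "expert" velocity fields with attractors at ±1. Uniform averaging flows to 0, the centroid between basins, while top-1 routing flows to a genuine attractor.

```python
import numpy as np

# Toy 1-D illustration (hypothetical, not the paper's model): two "experts,"
# each a linear velocity field pulling toward its own attractor.
attractors = np.array([-1.0, +1.0])  # e.g. landscape vs. portrait fixed points

def expert_velocity(x, a):
    """Velocity field of one expert: flow toward its attractor a."""
    return a - x

def dense_average(x):
    """Uniform ensemble average of all expert velocities."""
    return np.mean([expert_velocity(x, a) for a in attractors])

def sparse_route(x):
    """Top-1 routing: only the expert whose attractor is nearest contributes."""
    a = attractors[np.argmin(np.abs(attractors - x))]
    return expert_velocity(x, a)

def integrate(field, x0, steps=200, dt=0.05):
    """Euler-integrate dx/dt = field(x) from x0."""
    x = x0
    for _ in range(steps):
        x = x + dt * field(x)
    return x

x0 = 0.3                             # start slightly biased toward +1
print(integrate(dense_average, x0))  # -> ~0.0: the centroid, no coherent image
print(integrate(sparse_route, x0))   # -> ~+1.0: the correct fixed point
```

Note the averaged field is perfectly smooth (here it is just -x), which is exactly why it converges so reliably: the contradictory pulls cancel into a well-behaved flow toward the wrong destination.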
The denoising timestep itself plays the role of RG scale: early steps (high noise) are "infrared" where coarse structure forms and expert differences barely matter; late steps are "ultraviolet" where fine detail emerges and expert specialization is critical. This suggests expert-data alignment matters most at late denoising steps — which your cluster distance analysis at different timesteps could test directly.
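Here's a hypothetical sketch of that test, with made-up stand-ins for the per-expert cluster centroids and a toy variance-preserving noising step (none of these names come from the paper). The prediction: the gap between the nearest and runner-up expert should typically widen as noise decreases, i.e. specialization is a late-step ("ultraviolet") effect.

```python
import numpy as np

# Illustrative sketch only: does expert-cluster alignment sharpen at late
# (low-noise) denoising steps? Centroids and the schedule are toy stand-ins.
rng = np.random.default_rng(0)

def noised(x0, t):
    """Toy forward-diffusion of a clean sample x0 to noise level t in [0, 1]."""
    return np.sqrt(1 - t) * x0 + np.sqrt(t) * rng.standard_normal(x0.shape)

centroids = rng.standard_normal((4, 16))            # 4 experts, 16-d feature space
x0 = centroids[2] + 0.1 * rng.standard_normal(16)   # sample near expert 2's cluster

for t in [0.9, 0.5, 0.1]:  # early (high noise) -> late (low noise)
    xt = noised(x0, t)
    d = np.linalg.norm(centroids - xt, axis=1)      # distance to each expert's cluster
    margin = np.sort(d)[1] - np.sort(d)[0]          # gap: best vs. runner-up expert
    print(f"t={t:.1f}  nearest expert={d.argmin()}  margin={margin:.2f}")
```

At high noise the sample sits roughly equidistant from all clusters (small, noisy margin), so misrouting early should be nearly free; at low noise the correct expert separates cleanly, which is where your cluster distance analysis should see the signal if the RG picture is right.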
More on the RG flow view of neural network forward passes here: https://www.symmetrybroken.com/transformer-as-renormalization-group-flow/
and https://arxiv.org/abs/2507.17912
Hey Michael, thanks for the comment. Your explanation and intuition are on point!
Congrats 👏