Everything Bagel

WorldDiT: A Unified Architecture for World and Action Modeling

Sen Wang — Tue, 28 Jul 2026 18:56:36 GMT

We’re releasing WorldDiT, a Pareto-frontier model architecture that brings world modeling and robot action modeling into one compact diffusion transformer, without relying on a large pretrained VLM action backbone.

WorldDiT in action

Three reasons WorldDiT is different

At 399 million total parameters, WorldDiT lies on the reported LIBERO Pareto frontier for model size and mean success.
Among publicly released methods in the comparison that do not use a large pretrained VLM action backbone, it reports the highest mean success.
WorldDiT unifies future visual world state prediction with continuous robot action generation in a single diffusion transformer.

Reported LIBERO success against total model parameters for 24 methods.

🤗 Run WorldDiT from Hugging Face with the released checkpoints and complete inference runtime.

📄 Learn more about the architecture, training, and results in our technical report on arXiv.

If you use WorldDiT in your research, feel free to cite the paper.

@article{260723909,
  title={{WorldDiT: A Unified Diffusion Architecture for World and Action Modeling}},
  author={Sen Wang and R. Gnana Praveen and Bidhan Roy and Marcos Villagra},
  year={{2026}},
  eprint={{2607.23909}},
  archivePrefix={{arXiv}}
}

Introducing Paris 2.0

Ali Rouzbayani — Thu, 28 May 2026 16:53:47 GMT

Today we’re releasing Paris 2.0, the first video generation model pre-trained across heterogeneous GPU types distributed across regions.

Three things make Paris 2.0 unique.

The model was trained on a deeply heterogeneous pool of GPUs across generations and vendors.
Paris 2.0 training ran across geographically distributed regions and clouds, instead of shared datacenters.
In low-resolution text-to-video training, against a monolithic baseline model at a matched total compute budget, Paris 2.0 cuts Fréchet Video Distance (FVD) from 561.04 to 279.01, a ~2.0× improvement.
Relative improvement over the monolithic baseline, where a taller bar marks a larger gain. Paris 2.0 roughly halves FVD and lifts CLIP text-video and aesthetic scores.
Samples
A woman with long, blond, wavy hair is speaking directly to the camera. She is wearing a red sweater. While talking, her facial expressions changing as she speaks. The background is a cluttered room.
A person’s hands performing a paper-folding craft on a green cutting mat with a grid. The person uses a black marker to make a small mark on a piece of purple paper that has already been folded into a specific shape.
A pair of hands interacting with a translucent, gelatinous slime. The slime is a vibrant blue color. The hands are seen stretching, squeezing, and folding the slime, demonstrating its gooey and pliable texture.
Get Started

📄 → Read the Full Technical Report on arXiv
🤗 → Download Weights on Hugging Face

Heterogeneous Decentralized Diffusion Models

Gin Jiang — Wed, 11 Mar 2026 15:16:10 GMT

Based on Bagel Labs' CVPR 2026 paper, Heterogeneous Decentralized Diffusion Models (arXiv:2603.06741).

Decentralized Diffusion Models (DDMs) train independent experts on disjoint data partitions and combine them at inference time. Existing DDM frameworks assume all experts share the same training objective. We relax this constraint. In our setup, some experts train with DDPM (ε-prediction) and others with Flow Matching (velocity-prediction), then unify at inference through a deterministic conversion into a common velocity space. No retraining, no fine-tuning, no coordination during training.

Heterogeneous experts trained independently on single GPUs, unified at inference through velocity conversion.

Under aligned inference settings, this heterogeneous ensemble (2 DDPM + 6 FM experts) achieves better FID (11.88 vs. 12.45) and higher intra-prompt diversity (LPIPS 0.631 vs. 0.617) than a homogeneous ensemble of 8 FM experts. Relative to the training scale reported for prior DDM work, our framework reduces compute from 1176 to 72 A100-days (16×) and data from 158M to 11M images (14×), with each expert requiring only 24–48 GB VRAM on a single GPU, making decentralized diffusion training accessible on commodity hardware.

Background: Decentralized Flow Matching

DDMs decompose a generative model into K experts, each trained on a semantically coherent subset of the data. Following prior DDM work, the marginal velocity field is expressed as a weighted combination of per-expert conditional flows:

uₜ(xₜ) = Σₖ pₜ(k | xₜ) · uₜ⁽ᵏ⁾(xₜ),  k = 1, …, K

where uₜ⁽ᵏ⁾(xₜ) is the velocity predicted by expert k trained on cluster Sₖ, and pₜ(k | xₜ) is a posterior weight from a learned router.

Training

We partition the dataset into K semantic clusters using DINOv2 features (1024-dimensional representations with hierarchical k-means). This produces semantically coherent partitions, e.g. portraits, landscapes, architecture. Each expert θₖ trains exclusively on its assigned cluster Sₖ with zero communication between experts. No shared parameters, no gradient synchronization, no activation passing.

Routing

A lightweight router (DiT-B, 129M parameters) learns to predict the cluster posterior pₜ(k | xₜ) from noisy inputs. It is trained with cross-entropy loss against ground-truth cluster labels. At inference, the router dynamically selects and weights experts based on the current noisy state and timestep. We support three selection modes: Top-1 (single best expert), Top-K (weighted ensemble of K highest-probability experts), and Full Ensemble (all experts weighted by router probabilities). As shown in our prior work, Top-2 routing consistently outperforms Full Ensemble because sparse routing maintains expert-data alignment, selecting experts that are in-distribution for the current denoising state.
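The three selection modes can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names `route_weights` and `combine` are hypothetical, router probabilities and velocities are plain Python lists, and renormalizing the kept weights so they sum to 1 is an assumption for Top-K.

```python
def route_weights(probs, mode="top-k", k=2):
    """Turn router probabilities into per-expert weights.

    probs: one router probability per expert (assumed to sum to 1).
    mode:  "top-1", "top-k", or "full" (hypothetical mode names).
    """
    if mode == "full":
        # Full Ensemble: every expert keeps its router probability.
        return list(probs)
    if mode == "top-1":
        k = 1
    elif mode != "top-k":
        raise ValueError(f"unknown mode: {mode}")
    # Indices of the k highest-probability experts.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    # Renormalize over the kept experts (assumption); all others get zero weight.
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def combine(weights, velocities):
    """Weighted sum of per-expert velocity vectors (lists of floats)."""
    dim = len(velocities[0])
    return [sum(w * v[d] for w, v in zip(weights, velocities)) for d in range(dim)]
```

With Top-2, experts outside the two most probable receive zero weight, so their out-of-distribution predictions never enter the combined field.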

Heterogeneous Objectives

Previous DDM work requires all experts to share the same training objective. This is a coordination requirement that may be impractical when contributors operate independently. We remove this constraint.

Two objectives, different emphasis

We train n experts with Flow Matching loss and m experts with DDPM loss. DDPM experts predict the noise ε added during the forward process:

L_DDPM = E[ ‖ε - ε_θ(xₜ, t)‖² ],  with xₜ = αₜ x₀ + σₜ ε

Flow Matching experts predict the velocity field directly:

L_FM = E[ ‖v_θ(xₜ, t) - (ε - x₀)‖² ],  with xₜ = (1-t)x₀ + tε

Both expectations are taken over x₀ ∈ Sₖ, ε ~ N(0, I), and t. Here x₀ ∈ Sₖ means expert k trains only on clean samples from its assigned cluster Sₖ, ε ~ N(0, I) is Gaussian noise, t is the sampled timestep, αₜ and σₜ are the schedule coefficients controlling signal and noise strength, and xₜ = (1-t)x₀ + tε is the linear interpolation between clean data and noise.

Both objectives model the same generative process through different parameterizations. But they weigh errors differently across timesteps, and this difference is the mechanism behind heterogeneous ensembles.

Complementary loss weighting

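We can write both losses in terms of the squared clean-sample estimation error ‖x̂₀ - x₀‖² (the detailed derivation follows the variational diffusion models analysis). Under a variance-preserving schedule (αₜ² + σₜ² = 1):

L_ε = E[ w_ε(t) ‖x̂₀ - x₀‖² ],  w_ε(t) = αₜ² / σₜ²
L_v = E[ wᵥ(t) ‖x̂₀ - x₀‖² ],  wᵥ(t) = 1 / σₜ²

The ratio between them is:

wᵥ(t) / w_ε(t) = 1 / αₜ²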

Here w_ε(t) and wᵥ(t) are the effective per-timestep weights after rewriting each loss in terms of clean-sample estimation error. Since αₜ ≤ 1 and decays toward 0 at high noise, this ratio diverges. Velocity-prediction experts receive relatively stronger gradients at high-noise timesteps (global structure), while ε-prediction experts are relatively upweighted at low noise (fine details). In the paper, we derive this under a variance-preserving schedule and then note that linear interpolation recovers the same 1 / αₜ² structure. So the complementary weighting pattern holds both for the VP analysis and for the linear FM path used here.

DDPM concentrates learning signal near clean images. Flow Matching emphasizes high-noise timesteps. Their blind spots are complementary.

The implication is that each objective has a “blind spot” region where its effective weight is relatively low, and these blind spots are complementary. Where DDPM under-trains (high noise), FM trains hardest. Where FM under-trains (low noise), DDPM concentrates its signal. Mixing both in an ensemble lets each objective cover the other’s blind spots, providing more uniform coverage across the full denoising trajectory.
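To see the complementary weighting numerically, the sketch below tabulates both effective weights along the linear path αₜ = 1 - t, σₜ = t. It assumes the weights take the form w_ε(t) = αₜ² / σₜ² and wᵥ(t) = 1 / σₜ², which is what rewriting each loss in terms of clean-sample error gives and which yields the 1 / αₜ² ratio; the function name `effective_weights` is ours.

```python
def effective_weights(t):
    """Effective weights on the clean-sample error at time t in (0, 1).

    Assumes the linear path alpha_t = 1 - t, sigma_t = t.
    """
    alpha, sigma = 1.0 - t, t
    w_eps = alpha**2 / sigma**2  # epsilon-prediction (DDPM) weight
    w_v = 1.0 / sigma**2         # velocity-prediction (FM) weight
    return w_eps, w_v


# Low noise (t near 0) to high noise (t near 1).
for t in (0.1, 0.5, 0.9):
    w_eps, w_v = effective_weights(t)
    print(f"t={t:.1f}  w_eps={w_eps:9.3f}  w_v={w_v:9.3f}  ratio={w_v / w_eps:9.3f}")
```

The printed ratio equals 1 / (1 - t)², so it stays near 1 at low noise and grows without bound as t approaches 1.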

Inference-Time Unification

The central technical challenge is that DDPM experts output noise predictions ε_θ(xₜ, t) while FM experts output velocity predictions v_θ(xₜ, t). These live in different spaces. You cannot average them directly.

We unify all expert predictions into a common velocity space through a deterministic, schedule-aware conversion. The derivation proceeds in three steps.

The conversion pipeline: DDPM noise predictions are converted to velocity through deterministic algebra. FM predictions pass through unchanged.

Step 1: Recover the clean-image estimate

From the DDPM forward process xₜ = αₜ x₀ + σₜ ε, invert the linear map using the model’s noise prediction:

x̂₀ = (xₜ - σₜ ε_θ(xₜ, t)) / αₜ

Step 2: Derive the velocity

Treating x̂₀ and ε_θ as fixed at their current-timestep values defines a deterministic path x̃ₜ = αₜ x̂₀ + σₜ ε_θ. Differentiating with respect to t:

v(xₜ, t) = dx̃ₜ/dt = α̇ₜ x̂₀ + σ̇ₜ ε_θ

where α̇ₜ and σ̇ₜ are the time derivatives of the schedule coefficients. Under linear interpolation (αₜ = 1-t, σₜ = t), the schedule derivatives are -1 and +1, so this simplifies to:

v(xₜ, t) = ε_θ - x̂₀

This is exactly the form of the FM velocity target v = ε - x₀.

Step 3: Combine

FM experts already output velocity, so they pass through unchanged. All predictions are now in the same space. The router assigns per-expert weights, and we take a weighted combination to form the ensemble field uₜ, which drives a standard ODE integration step (x_{t−Δt} = xₜ − uₜ · Δt).

Numerical stability

The conversion requires dividing by αₜ, which approaches zero at high noise. We apply three safeguards: (1) clamp x̂₀ to [-20, 20] for VAE latents, (2) use α_safe = max(αₜ, 0.01) in the denominator, and (3) apply adaptive velocity scaling that dampens converted predictions at elevated noise levels where schedule derivatives become large. These are simple clamps with no learned parameters.

The entire conversion is closed-form algebra. No learned components, no fine-tuning, no additional training of any kind.
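The conversion and the first two safeguards can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the released code: it assumes the linear schedule αₜ = 1 - t, σₜ = t, represents latents as flat lists of floats, and omits the adaptive velocity scaling because its exact form is not given here. The names `ddpm_to_velocity` and `euler_step` are hypothetical.

```python
ALPHA_FLOOR = 0.01  # alpha_safe = max(alpha_t, 0.01)
X0_CLAMP = 20.0     # clean-latent estimate is clamped to [-20, 20]


def ddpm_to_velocity(x_t, eps_pred, t):
    """Convert a DDPM noise prediction into a velocity at time t.

    Assumes alpha_t = 1 - t and sigma_t = t (linear interpolation).
    x_t and eps_pred are lists of floats of equal length.
    """
    alpha, sigma = 1.0 - t, t
    alpha_safe = max(alpha, ALPHA_FLOOR)
    velocity = []
    for x, eps in zip(x_t, eps_pred):
        # Step 1: recover the clean estimate, with the safe denominator.
        x0_hat = (x - sigma * eps) / alpha_safe
        x0_hat = max(-X0_CLAMP, min(X0_CLAMP, x0_hat))
        # Step 2: d(alpha)/dt = -1 and d(sigma)/dt = +1, so v = eps - x0_hat.
        velocity.append(eps - x0_hat)
    return velocity


def euler_step(x_t, u_t, dt):
    """One ODE step toward lower noise: x_{t-dt} = x_t - u_t * dt."""
    return [x - u * dt for x, u in zip(x_t, u_t)]
```

When the noise prediction is exact and neither safeguard triggers, the converted velocity equals ε - x₀, and a single Euler step of size t lands back on x₀.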

Efficient Training

Prior DDM work required 1176 A100-days on 158M images. We achieve competitive quality with 72 A100-days on 11M images, a 16× reduction in compute and 14× in data. Three techniques enable this.

Resource comparison: our approach requires a fraction of the compute and data of prior DDM work.

Pretrained checkpoint conversion

We initialize experts from ImageNet-pretrained DiT checkpoints. Patch embeddings, positional embeddings, and all transformer blocks are fully transferred. Only the final projection layer (which differs between ε- and velocity-prediction targets) and the text projection (new modality) are reinitialized. Class-conditional embeddings from ImageNet pretraining are removed.

A key technical detail is timestep compatibility. DiT models expect discrete timesteps t ∈ {0, 1, …, 999} while Flow Matching uses continuous t ∈ [0, 1]. We handle this with runtime conversion (t_DiT = round(999t)) rather than modifying pretrained weights, preserving the learned timestep embedding structure.
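The runtime mapping is a one-liner. The sketch below is only an illustration of t_DiT = round(999t); the function name `to_dit_timestep` and the range check are our additions.

```python
def to_dit_timestep(t):
    """Map continuous t in [0, 1] to a discrete DiT timestep in {0, ..., 999}."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    return round(999 * t)
```

Because the conversion happens at call time, the pretrained timestep embedding table is used unchanged.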

Converted checkpoints reach validation loss parity 1.2× faster than training from scratch.

Efficient architecture

Each expert uses DiT with PixArt-α’s AdaLN-Single conditioning. Rather than computing adaptive layer normalization parameters with per-block MLPs, a single global MLP produces all modulation signals, which are reshaped into per-block slices plus learned per-block embeddings Eᵦ. This reduces parameters by roughly 30% (891M to 605M for DiT-XL/2) while maintaining generation quality.

True isolation

Each expert trains on its own semantic cluster on a single GPU requiring 20–48GB VRAM. No gradient synchronization, no activation passing, no pipeline coordination, no parameter servers. This is not data parallelism with relaxed communication. There is literally zero inter-expert communication during training.

Experiments

We train on 11M LAION-Aesthetics images. For a high-quality subset of 3.9M images, we use LLaVA to generate improved captions. We evaluate using FID-50K on a held-out 50K test set. We train at two scales: DiT-B/2 (129M parameters per expert) and DiT-XL/2 (605M parameters per expert).

Our standard configuration uses K=8 experts. Experts 0 and 3 train with DDPM objectives (assigned to clusters containing high-fidelity subjects where ε-prediction excels at detail preservation). The remaining six use Flow Matching.

Monolithic versus decentralized

We first validate that decentralized training works. Using DiT-B/2 experts trained from scratch on LAION-Art (3.9M images), all with Flow Matching:

Top-2 achieves 22.60, outperforming the monolithic baseline by 23.7%. Full Ensemble underperforms dramatically (47.89), consistent with our prior finding that indiscriminate combination introduces prediction conflicts from out-of-distribution experts. Selective expert activation is essential.

Homogeneous versus heterogeneous

To isolate the effect of objective heterogeneity, we compare homogeneous and heterogeneous 8-expert DiT-XL/2 models under matched inference settings.

Under aligned settings (first and last rows), heterogeneous experts improve FID from 12.45 to 11.88. Scaling from 1 to 2 DDPM experts improves FID from 19.75 to 15.09 under the conversion setting, suggesting the optimal DDPM:FM ratio deserves careful tuning per domain.

For diversity, we measure intra-prompt LPIPS by generating 10 images per prompt for 100 held-out prompts. Heterogeneous experts achieve 0.631 (± 0.078) vs. homogeneous 0.617 (± 0.074). Objective heterogeneity produces more varied outputs for identical prompts.

Heterogeneous ensembles improve both image quality (lower FID) and output diversity (higher LPIPS) over homogeneous baselines.

Conversion quality

We evaluate the DDPM → FM conversion in isolation, using experts trained on the same data cluster (to isolate objective effects from data distribution differences). Both DDPM and FM experts use DiT-XL/2 with identical hyperparameters.

Three findings emerge. First, the conversion works. DDPM → FM improves over native DDPM (FID 25.61 vs. 27.04) while preserving semantic coherence (CLIP 0.319 vs. 0.316). The conversion is most valuable as a compatibility mechanism rather than a lossless objective replacement.

Second, combined experts achieve higher output diversity (LPIPS 0.782) than single FM (0.752), approaching native DDPM levels (0.787). Heterogeneous objectives create complementary generation patterns.

Third, using the same cosine schedule for both objectives yields marginally better FID than different schedules (32.67 vs. 33.29), suggesting schedule alignment facilitates smoother expert transitions. But both combinations show similar diversity gains, indicating that objective heterogeneity drives the primary benefit.

Routing threshold analysis

For combined DDPM+FM experts, a deterministic router switches between them at a threshold τ: DDPM handles timesteps t ≤ τ (low noise), FM handles t > τ (high noise).

Impact of router threshold on generation quality. Different thresholds produce a clear quality-diversity trade-off.

Lower thresholds (0.2–0.3) favor quality (FM-dominated denoising, optimal FID). Mid-range thresholds (0.4–0.5) favor diversity (balanced workload, highest LPIPS). Extreme thresholds (0.7) degrade both metrics, confirming that both expert types contribute essential complementary capabilities.
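The deterministic switch can be written down directly. The sketch below is a hypothetical illustration: the names `pick_expert` and `schedule` are ours, and the sampler is assumed to walk from t = 1 (pure noise) down toward t = 0 in uniform steps.

```python
def pick_expert(t, tau):
    """Return which expert handles timestep t for switching threshold tau.

    DDPM handles low noise (t <= tau); FM handles high noise (t > tau).
    """
    return "ddpm" if t <= tau else "fm"


def schedule(num_steps, tau):
    """List (t, expert) pairs for a sampler stepping from t=1 toward t=0."""
    steps = []
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        steps.append((t, pick_expert(t, tau)))
    return steps
```

Lowering the threshold shrinks the share of steps handled by the converted DDPM expert, which is the FM-dominated regime described above.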

Effects of expert ordering and router thresholds

Expert ordering also matters. In a 2-expert setup (1 converted DDPM + 1 FM), we vary the ordering — DDPM→FM versus FM→DDPM — and the switching threshold τ ∈ {0.3, 0.5, 0.7}.

Expert ordering and router threshold effects. FM→DDPM ordering produces more stable, coherent results, while DDPM→FM shows higher sensitivity to threshold selection.

The results reveal a striking asymmetry. FM→DDPM (bottom row) produces consistently cleaner images across all thresholds. DDPM→FM (top row) degrades at higher thresholds, with blocky artifacts and oversaturation. The reason: when DDPM operates first at high noise (αₜ → 0), the conversion x̂₀ = (xₜ - σₜ ε_θ)/αₜ is numerically unstable, and errors early in the trajectory get baked into the image structure. Letting FM handle high-noise timesteps first avoids this entirely — the converted DDPM expert then refines at low noise where conversion is stable.

Takeaway: DDPM-to-velocity conversion should be restricted to low-noise regimes (t < 0.5).

Discussion

Resource efficiency in context

Our results should be interpreted carefully relative to prior DDM work. The DDM FID range of 5.5–10.5 was achieved at substantially larger training scale (1176 A100-days, 158M images). Our numbers are not directly comparable in absolute FID terms. What they do show is that competitive generation quality is attainable at a fraction of the resources, and that heterogeneous objectives provide an additional quality gain at no extra training cost.

Limitations

We evaluate only a narrow set of DDPM-to-FM ratios (1:7 and 2:6). The ideal allocation likely depends on the data distribution and downstream requirements. The deterministic conversion relies on hand-tuned numerical safeguards; a more robust conversion mechanism that generalizes across arbitrary schedules would strengthen applicability. We consider only ε- and velocity-prediction; extending to x₀-prediction or consistency objectives could further diversify expert specialization but would require generalizing the conversion and routing mechanisms.

What this enables

The practical upshot is that decentralized diffusion training no longer requires coordinated infrastructure or agreement on training objectives. A contributor with a single GPU can train a DDPM expert on portraits. Another can train an FM expert on landscapes using different hardware. These experts combine at inference time without either contributor needing to know what the other was doing.

Citation

@inproceedings{jiang2026heterogeneous,
  title     = {Heterogeneous Decentralized Diffusion Models},
  author    = {Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
  journal = {arXiv preprint arXiv:2603.06741},
  year    = {2026}
}

Stability–Quality Paradox in Decentralized Diffusion Models

Marcos Villagra — Fri, 06 Feb 2026 15:56:46 GMT

In Decentralized Diffusion Models (DDMs), denoising is routed through independently trained experts at inference time. These experts can strongly disagree in their denoising predictions. What actually governs the quality of generations in such a system? We present the first ever systematic interpretability study of this question.

The natural expectation is that minimizing denoising trajectory sensitivity — minimizing how perturbations amplify during sampling — should govern generation quality. It doesn’t. Full ensemble routing, which combines all expert predictions at each step, achieves the most stable sampling dynamics and the best numerical convergence. It also produces the worst generation quality (FID 47.9 vs. 22.6 for sparse Top-2 routing). We call this the stability–quality paradox.

Instead, we identify expert-data alignment as the governing principle. Generation quality depends on routing inputs to experts whose training distribution covers the current denoising state, even when doing so makes the trajectory unstable. For DDM deployment, routing should prioritize expert-data alignment over the commonly used numerical stability metrics.

Decentralized Diffusion Models

DDMs aren’t Mixture-of-Experts layers. They’re ensembles of independently trained models. In standard MoE architectures, experts are FFN layers within a shared backbone, trained jointly with load balancing losses, and routed at the token level. DDM experts are complete diffusion models, and routing occurs at the input level (entire noisy generations) rather than token level.

Training

Each expert trains in isolation on a disjoint data partition. The training data is partitioned into K clusters (e.g., using k-means on DINOv2 embeddings), and each expert sees only its assigned cluster. So one might train exclusively on landscapes, another on portraits. No shared parameters or gradient communication. Experts only collaborate at inference time.

MoE experts, therefore, operate within a shared representational framework that limits their differences. In contrast, DDM experts are unrestricted and can generate vastly different outputs from the same input.

Routing

At inference time, a lightweight router predicts a weight for each expert at each denoising step. The routed velocity field is the sum of the expert velocity predictions, each scaled by its router weight.

The critical design question is how many experts should contribute at each step. Three natural strategies emerge.

Top-1 commits fully to the single most relevant expert. Every prediction comes from one model. If the router picks poorly, there’s no fallback.

Top-2 blends the two most relevant experts after renormalizing their weights. This allows experts to cross-check each other while still filtering out the majority.

Full Ensemble weights all experts by their router probability. Every expert contributes to every step, giving the mathematically complete combination.

Standard numerical analysis would suggest that including more experts should help. Averaging reduces variance, smooths the velocity field, and stabilizes the ODE integration. We tested whether this intuition holds.

The Stability Paradox

Numerical stability has been the default lens for optimizing diffusion sampling. The foundational probability-flow ODE formulation frames Lipschitz constants and discretization error as determining solver accuracy. Recent work develops stabilized Runge-Kutta methods for stiff diffusion ODEs and studies Lipschitz singularities and their effects on sampling, and the entire solver design space from Euler to Heun to DPM++ is organized around stability-accuracy tradeoffs. These analyses target single-model diffusion; DDM-specific stability analysis did not exist. We provide the first systematic test of whether this framework transfers to distributed training systems.

We evaluated on Paris, the world’s first publicly released DDM, comprising 8 experts trained on LAION-Aesthetics. We tracked trajectory sensitivity, measuring how strongly the velocity field responds to input perturbations, and step-refinement disagreement, the difference between images generated with N and 2N steps. The results reveal a stability–quality paradox.

Full Ensemble achieves the lowest trajectory sensitivity. It converges the most cleanly. And it produces the worst images, with FID nearly double that of Top-2.

The cumulative IQR measures variability across denoising trajectories. Full Ensemble shows the lowest variability, indicating consistent, well-behaved numerical integration. Top-2 shows higher variability, yet produces superior images. For full experimental details, see the full paper.

Why does averaging more experts, which stabilizes the denoising trajectory, degrade image quality?

Expert-Data Alignment Is The Governing Principle

The answer is not in how smooth the path is, but where it leads. Full ensemble averaging reduces Jacobian spectral norms by cancelling variance across expert predictions, which is exactly why it wins on stability metrics. But each expert is trained on a disjoint data cluster. When all experts contribute to every input, most of them are producing velocity predictions for inputs that lie far outside their training distribution. The resulting velocity field is smooth because the out-of-distribution errors partially cancel, but it points toward a weighted average of all cluster centers rather than the data manifold.

In single-model diffusion, a smoother velocity field means cleaner integration means better samples. DDMs break this. The smoothing that lowers trajectory sensitivity is not coming from a better-conditioned ODE. It is coming from averaging contradictory predictions, which suppresses variance at the cost of introducing systematic bias away from any individual data cluster’s learned distribution. Sparse routing avoids this by selecting only the experts whose training data clusters are close to the current input in embedding space, keeping each active expert within its training distribution. The velocity field is noisier but points in the right direction.

We call this governing principle expert-data alignment. Generation quality depends on routing inputs to experts whose training clusters cover the current denoising state, even when doing so reduces the trajectory stability.

If expert-data alignment governs quality, three predictions should hold. Sparse routing should achieve higher alignment (selected experts have lower data-cluster distance). Selected experts should produce superior velocity predictions. And expert disagreement should correlate with quality degradation. We tested all three.

Experimental Validation

Cluster Distance Analysis

Does sparse routing actually select experts whose training distributions match the input?

Using the Paris DDM (K=8 experts, DiT-XL/2 architecture with ~606M parameters each), we extracted DINOv2-ViT-L/14 embeddings at timesteps t in {0.3, 0.5, 0.7} during sampling for 500 samples. For each state, we computed the Euclidean distance from the embedding to each of the 8 cluster centroids used during expert training, then ranked the experts by this distance.
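The ranking step can be sketched as below. This is a minimal illustration with hypothetical names (`expert_ranks`, `mean_selected_rank`): embeddings and centroids are assumed to be precomputed lists of floats, and averaging the ranks of all selected experts to obtain the mean rank is our assumption about how the summary number is formed.

```python
import math


def expert_ranks(embedding, centroids):
    """Rank experts by Euclidean distance from an embedding to their centroids.

    Returns ranks where ranks[k] is the 1-based rank of expert k (1 = closest).
    """
    dists = [math.dist(embedding, c) for c in centroids]
    order = sorted(range(len(centroids)), key=lambda k: dists[k])
    ranks = [0] * len(centroids)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    return ranks


def mean_selected_rank(ranks_per_state, selected_per_state):
    """Average rank of the router-selected experts over all sampled states."""
    picked = [
        ranks[k]
        for ranks, selected in zip(ranks_per_state, selected_per_state)
        for k in selected
    ]
    return sum(picked) / len(picked)
```

A mean rank close to 1 means the router almost always picks the expert whose training cluster is nearest to the current state.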

The router consistently selects experts whose training distributions are closest to the current denoising state. Top-1 achieves a mean rank of 1.54 (near-optimal selection), while Top-2 maintains strong alignment at 1.96. Full Ensemble, by construction, averages across all experts regardless of relevance.

The router is picky, and that’s the point. Sparse routing filters out experts that are statistically likely to produce poor predictions because they’ve never seen similar data.

Per-Expert Prediction Quality

Do selected experts actually produce better predictions?

For each step in Top-2 generation, we computed velocity vectors from all experts and measured angular deviation from the final blended velocity (which successfully guides the image to completion).
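Angular deviation between one expert's velocity and the blended velocity can be computed from the cosine of the angle between them. The sketch below is illustrative only: the function name `angular_deviation` is ours, vectors are flattened lists of floats, and reporting the angle in degrees is an assumption.

```python
import math


def angular_deviation(v, ref):
    """Angle in degrees between velocity v and a reference velocity ref.

    Both inputs are non-zero vectors given as lists of floats.
    """
    dot = sum(a * b for a, b in zip(v, ref))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in ref))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))
```

Comparing this value for selected and non-selected experts at each step gives the per-expert gap discussed here.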

Selected experts consistently produce predictions that align more closely with the successful trajectory. The gap widens with specialization: the MNIST system (each expert trained on a single digit) shows a 43% difference versus 29% for Paris. When experts are highly specialized, the cost of including an inappropriate expert increases. A landscape expert might still contribute useful texture when generating a portrait. A “zero” expert offers actively harmful gradients when drawing a “seven.”

Expert Disagreement Analysis

Does expert disagreement predict quality degradation?

We computed trajectory-integrated disagreement: the average pairwise Euclidean distance between expert velocity predictions, summed over the denoising trajectory. We sorted generated images into quartiles by disagreement and measured LPIPS (perceptual distance to reference).
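The disagreement metric can be sketched as follows, assuming each expert's velocity at each step is available as a flat list of floats. The function names `step_disagreement` and `trajectory_disagreement` are hypothetical.

```python
import math
from itertools import combinations


def step_disagreement(velocities):
    """Average pairwise Euclidean distance between expert velocities at one step."""
    pairs = list(combinations(velocities, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def trajectory_disagreement(velocities_per_step):
    """Sum the per-step disagreement over the whole denoising trajectory."""
    return sum(step_disagreement(v) for v in velocities_per_step)
```

Sorting generated images by this value and splitting them into quartiles gives the Q1 to Q4 grouping used in the analysis.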

Images in the high-disagreement quartile (Q4) exhibit worse LPIPS scores than those in the low-disagreement quartile (Q1). The relationship is monotonic.

This is the Full Ensemble failure mode. When experts agree, the average produces reasonable results. When experts disagree (which happens frequently because most experts are out-of-distribution), the average becomes an incoherent compromise. It’s like asking a portrait painter and a landscape painter to collaborate on a cityscape. Sparse routing avoids this by silencing experts that are likely to disagree.

Is Stability Still Useful?

If numerical stability doesn’t govern quality, is it still useful? Yes, but it just measures the wrong thing for generation quality.

Stability Measures Convergence, Not Correctness

Step-refinement disagreement measures whether the solver converges consistently. Full Ensemble achieves excellent convergence, with step-refinement disagreement of approximately 0.020: doubling the number of sampling steps barely changes the output. Top-2 exhibits more numerical noise, with step-refinement disagreement of approximately 0.051.

But it’s possible to converge perfectly to a blurry, incoherent average. Stability metrics indicate how easily the solver finds a solution, not whether that solution is good.

Within-Strategy Diagnostics

Trajectory sensitivity may still work as a within-strategy diagnostic. If practitioners are using Top-2 routing, a sudden spike in sensitivity for a specific input might flag a “hard” sample that needs more inference steps, even though sensitivity doesn’t predict quality across routing strategies.

Discussion

Limitations

We validate our hypothesis on two DDM systems (Paris with 8 experts, MNIST with 10 experts). The pattern is consistent across both, but additional systems would strengthen the conclusions; that work is in progress at Bagel Labs. The relationship between cluster distance and prediction quality could be confounded by other factors, such as experts trained on larger clusters being more robust. We control for this by using the same embedding space (DINOv2) that was used during expert training.

Implications

For practitioners building DDM systems:

1. Routing should prioritize alignment over stability. Standard numerical stability metrics don’t indicate system health in DDMs. A “smooth” sampler may simply be averaging away useful signal.

2. Sparse routing is preferable. Top-2 routing achieves a favorable tradeoff: it maintains expert-data alignment while allowing experts to cross-reference each other. Top-1 may be too aggressive; Full Ensemble destroys alignment.

3. Monitor expert-data alignment directly. Track data-cluster distance ranks and expert disagreement during development, not just final FID scores.

Conclusion

Numerical stability doesn’t govern generation quality in Decentralized Diffusion Models. Expert-data alignment does. It means routing inputs to experts trained on similar data.

DDM systems should be evaluated and optimized for alignment rather than stability. Sparse routing succeeds not because it produces stable trajectories, but because it ensures each active expert operates within its domain of competence.

This post presents findings from our research. For full experimental details, methodology, and additional analyses, see:

Expert-Data Alignment Governs Generation Quality in Decentralized Diffusion Models

@misc{villagra_expertdataalignment_2026,
  author       = {Marcos Villagra and Bidhan Roy and Raihan Seraj and Zhiying Jiang},
  title        = {{Expert-Data Alignment Governs Generation Quality in Decentralized Diffusion Models}},
  howpublished = {\url{https://arxiv.org/abs/2602.02685}},
  note         = {arXiv:2602.02685 • accessed DD Mon YYYY},
  year         = {2026}
}

Introducing Paris

Gin Jiang — Tue, 07 Oct 2025 16:11:16 GMT

We’re releasing Paris, the world’s first decentrally trained open-weight diffusion model. The model is open for research and commercial use under the MIT license.

We named it Paris after the city that has always been a refuge for those creating without permission. Two remarkable facts make Paris the first of its kind:

It’s a combination of smaller expert diffusion models pre-trained from scratch across different continents in complete isolation. The experts required zero gradient, parameter, or intermediate activation synchronization among each other during training.
This zero communication protocol achieves comparable quality to SOTA distributed approaches using 14× less data and 16× less compute.

How? Full technical report and model weights below.

Full Technical Report: https://github.com/bageldotcom/paris/blob/main/paper.pdf
Model Weights: https://huggingface.co/bageldotcom/paris

We believe we can scale this approach to global state-of-the-art results. But that requires solving some more really, really hard problems. If you’re an ML researcher or engineer interested in helping us achieve this while doing the best open-source work of your career, come work with us: jobs.bagel.com.

Tiny Tool Use

Raihan Seraj — Tue, 01 Jul 2025 14:03:17 GMT

Bagel Labs is launching Tiny Tool Use, an intentionally tiny but production-grade open-source library designed to simplify the process of training open-source LLMs for tool use.
Tool-aware LLMs turn text into real-world actions, which unlocks autonomous decision making for robotics and general infrastructure use.
Tiny Tool Use distills the latest advances in tool-use RL, SFT and evaluation into easy-to-use templates, letting teams train and evaluate tool-calling models without extra scaffolding.

It is fully open source : https://github.com/bagel-org/bagel-RL

Tiny Tool Use ships with:

Interchangeable Training Algorithms – swap SFT, Direct Preference Optimization (DPO), synthetic teacher signals and more with a single config change.
Configuration‑only workflows – every experiment, tool schema, and hyper‑parameter lives in a JSON file, so training with a different configuration is easy.
First‑class evaluation support – TensorBoard dashboards for training visualization and integration with Berkeley Function Calling Leaderboard scripts.
Dataset flexibility – plug in real data, generate synthetic traces, or compose both without touching core code.

Training Example Using Qwen3 Models

This section walks through training Qwen3 models with the library. We use Low-Rank Adaptation (LoRA) to customize Qwen3 models on the ToolBench dataset. The library ships with an example configuration, configs/sft_toolbench_config.json, which downloads the data, extracts it, and uses the processed data for training.

To run the training code with the Qwen3-0.6B model, use the following command:

python train.py --config configs/sft_toolbench_config.json --output-dir lora_sft_qwen3/

The script starts by downloading and unzipping the ToolBench dataset, which will take several minutes given the size of ToolBench.

The command above starts the training procedure using LoRA adapters. The configuration file can be edited to run full training instead of LoRA adapters, and the adapter configuration can be changed as well.

Evaluation and Benchmarking

Beyond training, the Tiny Tool Use library offers a framework for evaluating the general tool-use capabilities of an adapted model. This includes the ability to compare evaluation results directly with established benchmarks, such as the Berkeley Function Calling Leaderboard.

The training statistics can be visualized by running TensorBoard with the following command:

tensorboard --logdir lora_sft_qwen3/

Performance of the adapted model on ToolBench data as training progresses, as displayed in the TensorBoard dashboard for the Qwen3-0.6B model. The Tiny Tool Use library offers clear and interpretable training and evaluation metrics along with improved model capability for function calling.

When the training is complete, the adapters can be merged and saved using the following command

python save_merge_model.py \
--base_model Qwen/Qwen3-0.6B \
--adapter_path lora_sft_qwen3/ \
--output_dir merged_model/ \
--trust_remote_code

A model adapted with a subset of ToolBench data can be obtained from the following link: Qwen3-0.6B-ToolBench

BFCL Leaderboard

The BFCL evaluation of the model can be run with the following commands, which generate model responses for the selected test categories.

export BFCL_PROJECT_ROOT=/path/to/your/desired/project/directory

bfcl generate --model Qwen/Qwen3-0.6B --local-model-path merged_model/ \
--test-category simple,parallel,multiple,multi_turn

Finally, to score the generated model responses, run the following command, which saves the scores as a CSV file.

bfcl evaluate --model Qwen/Qwen3-0.6B \
--test-category simple,parallel,multiple,multi_turn

The Bagel Labs team will continue to improve the library to adapt it for broader tool use, with an emphasis on distributed learning algorithms. We welcome contributions, feature requests, and issues on our fully open-source repository: https://github.com/bagel-org/bagel-RL

Return on Experience (RoE)

Bidhan Roy — Thu, 01 May 2025 13:31:07 GMT

Reinforcement learning (RL) evolves at headline speed. Each month brings a new “state‑of‑the‑art.” But we had to pause to ask one important question:

Can a single number line up every milestone so far and hint at the next leap?

We think maybe. We call that number Return on Experience (RoE). It’s an early phase benchmark.

What you read below is a lab notebook, handed down as working scripture. Feel free to share your peer review, or join the loop (what’s loop? more on that below).

New experience costs more than extra GPU time

When power is cheap, we can keep feeding a language model massive amounts of internet text at little extra cost.

But a surgical robot, a self-driving car, or an ICU-triage policy cannot just “scrape more experience”.

Every new trial burns tires, takes staff time, or waits for legal clearance.

In these settings the scarce resource is fresh experience - each environment step that has to happen in the real world.

Sample-efficiency is the art of squeezing the most learning from the fewest such experiences, and RoE is a meter for that exchange rate.

Recent AI breakthroughs point in an interesting direction

DeepSeek-R1-Zero improves reasoning capability through millions of GRPO self-play prompt-episodes. In other words, the model practically “argues with itself” millions of times, turning each prompt-response into a tiny lesson, without requiring tons of human text.

The OpenAI o3 tool-use agent was trained to ask whether the next tool call (browser, code runner, or image) will actually move the answer forward. Each tool use carries an explicit cost during training, so the model invokes a tool only when the expected reward of that experience outweighs the cost.

DreamerV3 learns a world-model first, then practices inside that imagined space, harvesting thousands of virtual trials for every real one.

Different domains, same theme. Progress comes from squeezing more value out of each new experience. The next question is, how much result do we get per unit of experience? That ratio is RoE.

RoE compresses sample-efficiency into a single number that rises roughly logarithmically with progress.

The Formula

Let’s turn that idea into one formal equation.

Return on Experience (RoE) tells us how much “win” an RL agent buys per interaction it pays for. In other words, if you spend one unit of real world experience, how far does your performance score climb?

One step, one score, divided by weighted cost. Where,

headline score - the score a paper or a model brags about - Atari score, math-test accuracy, etc.

count - how many times the RL agent interacted with something while learning.

weight - Not all interactions are equal. Weight adds a rough price tag for each kind of interaction because some are costly and some are cheap.

real-world action = 1 (full price)
calling an external tool (e.g. OpenAI o3) ≈ 0.01
self-play or a step inside a learned simulator (e.g. DeepSeek-R1-Zero) ≈ 0.001

We think of the weights as discounts. Synthetic or simulated steps cost about a thousandth of a real-world step, and tool calls cost about a hundredth.
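The equation itself is not reproduced in this text. Reading the three definitions above literally, one plausible form is the following. It is an assumption reconstructed from the prose, not the post's exact notation:

```latex
\mathrm{RoE} \;=\; \frac{\text{headline score}}{\sum_{k} w_k \, n_k}
```

Here k ranges over interaction types, n_k is the count of interactions of type k, and w_k is the weight for that type. Any constant scaling factor (for example, reporting the ratio per million weighted interactions) cannot be recovered from the text alone.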

This way, RoE lets us put a language model, a robot arm, and a video-game agent on the same chart. High RoE means the agent climbs the leaderboard quickly and keeps the cost for fresh experience low. Low RoE means it wastes a lot of real-world practice for each bump in score.

A short history of RoE Loops

Below we cast some historical milestones of reinforcement learning into the RoE benchmark, to show that RoE could predict their performance.

We call each big RoE jump a loop - one full turn of experience → insight → back to model/agent. Like the closed-loop theorem (or a bagel ring).

But first, some info on the classic benchmarks used to calculate RoE,

Atari - 57 classic video games used as a standard obstacle course for RL.
- Human-normalised score takes a human play-through on each game, sets that to 1.0, then shows how far the agent climbs above it.
- Median is the middle game after ranking all 57 scores; a quick “overall” health check.
- Atari-80k / Atari-100k mean looks at the average score but limits the agent to only 80k or 100k game frames, so we can see who learns fastest, not just who learns best.
GSM8K - 8.5k grade-school word problems. For each problem an RL agent writes one answer. Pass@1 is simply the percentage of problems it gets right on that very first try.

Here is a worked example of RoE calculation with OpenAI o3,

For o3, headline_score = 0.982, weight = 0.01, interaction count = 2M.

So, according to our RoE equation above,
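As a rough sketch of that arithmetic: the function below divides the headline score by the weighted interaction count. The name `return_on_experience` and the per-million scaling are assumptions, chosen so that the o3 inputs above land near the ~50 figure quoted for o3 in this post. They are not taken from the original equation.

```python
def return_on_experience(headline_score, interactions, per=1_000_000):
    """Headline score per `per` weighted interactions.

    `interactions` is a list of (weight, count) pairs, one per interaction type.
    The per-million scaling is an assumption, not part of the original post.
    """
    weighted_cost = sum(weight * count for weight, count in interactions)
    return headline_score * per / weighted_cost


# o3 inputs from the text: score 0.982, tool-call weight 0.01, 2M interactions.
o3_roe = return_on_experience(0.982, [(0.01, 2_000_000)])
print(round(o3_roe, 1))
```

Without the scaling, the raw ratio is 0.982 / (0.01 × 2,000,000) ≈ 4.9 × 10⁻⁵; per million weighted interactions it is ≈ 49.1.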

Observations,

Rainbow quadrupled RoE over DQN by upgrading loss functions and replay, proving that better optimization alone can unlock large efficiency gains from the same amount of experience.
EfficientZero learned to plan inside a latent world model instead of in the real world. With most interactions being synthetic, RoE passed 10.
DreamerV3 pushed the same idea further. Over 90% of the experiences were simulated. Real-world performance climbed with very little new real-world experience.
DeepSeek-R1 brought model self-play into the limelight. The model debates with itself on each problem, boosting GSM8K accuracy while keeping the ledger of new, costly experience small.
Cost-aware tool use has been another RL win of 2025. OpenAI o3 trained the RL agent to accrue a cost while calling a tool. That optimized RoE to ~50.

Prediction for the next 5 years

RoE has grown on a log curve - 0.023 for Rainbow (2018) up to about 50 for OpenAI o3 (2025). There’s no sign that the slope is flattening. World-model research, sample-efficient policy optimization and a coming surge in cheap compute all point the same way.

Below is a 5-year prediction of RoE numbers, with examples of what that could mean, inspired by peer-reviewed academic work,

(We count the past jumps - Rainbow through o3 - as Loops 1-5.)

Loop 6 - Year 2026 - RoE ≈ 150

World-model research will slip from Atari into the operating room. Hospitals will train surgical robots almost fully in world-model simulators such as Surgical Gym and ORBIT-Surgical. A bot will practise thousands of virtual stitches before doing a real one.

Loop 7 - Year 2027 - RoE ≈ 1000

Scaling labs will run foundation world-models such as Genie on ~30x cheaper compute hardware. Agents will write and test entire NeurIPS papers inside those generated 3D sandboxes, requiring only high-level edits from humans. RoE will move past the four-digit mark.

Loop 8 - Year 2028 - RoE ≈ 3000

Self-play on synthetic molecule sets will reduce wet-lab screening from tens of thousands of assays to a few hundred. Drug-discovery will be massively accelerated. RoE will keep its log-pace and enter “several thousand” territory.

Loop 9 - Year 2029 - RoE ≈ 8000

Regulators will approve the first cargo aircraft whose flight-control laws are proved largely in simulation-based certification workflows. One hour of real flight data will be enough to sign off.

Loop 10 - Year 2030 - RoE ≈ 12000

Companies will field “digital graduates” - ~30B parameter reasoning models trained with Dr. GRPO style length-penalised objectives. Each will master a new technical field from a few annotated pages, then draft patents, close mergers and rewrite tax law overnight. One human auditor will review the output of a thousand such agents.

ZKLoRA

Bidhan Roy — Tue, 21 Jan 2025 15:50:33 GMT

In 2024, zero knowledge verifiability for machine learning seemed impossible. The latency overhead was too high. We did a research report on it here.

But, it is 2025. The year of AGI. And, we have made it possible.

Today, we’re open sourcing a piece of frontier research: ZKLoRA, a zero-knowledge protocol that allows verification of LoRA fine-tuning of open source AI models in 1-2 seconds.

And not only for toy models, but for current state-of-the-art open source models such as Llama 3.3, with tens or hundreds of billions of parameters.

Want to try it yourself? Here’s the code : https://github.com/bagel-org/ZKLoRA

Want to see the benchmarks and curious how it works? Read the full research paper here : https://github.com/bagel-org/ZKLoRA/blob/main/paper.pdf

We are Bagel Labs, a distributed machine learning research lab.

We believe ZKLoRA kickstarts a new era for verifiable model training across untrusted networks.

Train Fast, But Think Slow

Bidhan Roy — Wed, 06 Nov 2024 14:59:55 GMT

AI is like fire.

We have had radical technological advancements in recent history. Social media, augmented reality, platform shifts like web, mobile. But AI is way more significant of a technology. It is as significant as the discovery of fire. It has the potential to change the trajectory of the evolution of our species.

One of the holy grails of unlocking this potential of AI is to build systems that can reason like humans. That means improving the ability of AI, and Large Language Models in particular, to break down complex problems and apply logical steps.

Bagel's research team has been exploring this problem. Analyzing LLM building techniques, especially fine-tuning techniques, to allow Large Language Models to evolve from pattern-recognizing prediction agents to true cognitive agents. Our deep research spanned three major types of reasoning, aka intelligence: arithmetic, commonsense, and symbolic.

Today, we're sharing our findings. This research targets the core of what we believe to be the ultimate AI evolution, human-level reasoning. Or beyond (God level?).

We have explored techniques for the training and fine-tuning phases of model development. We have also ventured into the absolutely fascinating world of inference-time reasoning. This is where LLMs can be built or fine-tuned to generate novel solutions during inference, even if the solutions aren't part of their training dataset.

Dive in. And if you're in a rush, we have a TLDR at the end.

Types of Reasoning

Varied types of reasoning tasks stretch AI's abilities. First, let's understand how they're defined.

Arithmetic reasoning pushes machine learning to test problem-solving in a clear way. It forces models to break down problems. Choose from many strategies. Connect steps to find solutions. This makes math different. It shows exactly how well models can grasp details. And use the right solution steps in order.

Commonsense reasoning upends our expectations. Models must understand the strange logic of everyday life. The challenges emerge when systems face the quirks of human interactions. The implicit rules we take for granted. For example, a door opens before you walk through. Time flows forward not backward. Water makes things wet. These obvious truths become complex puzzles for artificial systems to unravel.

Symbolic reasoning flips the script on traditional machine learning. While neural networks excel at fuzzy pattern matching, symbols demand precision. Models must follow strict rules. Manipulate abstract concepts. Chain logical steps. Like a careful mathematician rather than an intuitive artist. The symbols hold no inherent meaning. Yet through them, we build towers of logic that reach toward human-level reasoning.

Beyond these core types, reasoning takes many forms. Logical deduction draws rigid conclusions while induction makes creative leaps. Causal reasoning traces the hidden threads between actions and consequences. Multimodal reasoning juggles text, images, and data in a complex combination of understanding. Knowledge graphs map the relationships between facts. Yet all serve one goal - moving AI from pattern matching toward true comprehension. From memorized responses to novel insights. From prediction to understanding.

Below, we look into training time and inference time approaches to enhance these types of reasoning.

1. Training Time Approaches

1.1. Fine-Tuning Approaches

Parameter Efficient Fine-Tuning (PEFT)

How it works: PEFT reverses traditional model adaptation (Hu et al. 2023). Four methods reveal new techniques.

Prompt-based learning embeds adjustable signals into frozen models. Prefix-tuning and P-tuning introduce small changes. These changes alter outputs without altering the main model.

Reparametrization methods like LoRA simplify complex weight matrices. They turn large updates into efficient low-rank forms. LoRA captures patterns from high-dimensional spaces with minimal adjustment.
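To make the low-rank idea concrete, here is a minimal pure-Python sketch. The frozen weight matrix stays untouched and only two small matrices are trainable. The names `lora_forward`, `lora_a`, `lora_b` and the toy dimensions are illustrative assumptions, not any library's API.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Compute x @ W + scale * (x @ A @ B) without forming the full update."""
    base = matmul(x, w_frozen)                      # frozen pretrained path
    low_rank = matmul(matmul(x, lora_a), lora_b)    # trainable rank-r path
    return [[b + scale * l for b, l in zip(b_row, l_row)]
            for b_row, l_row in zip(base, low_rank)]


d_in, d_out, rank = 8, 8, 2
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
a = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
b = [[0.0] * d_out for _ in range(rank)]   # zero init: the adapter starts as a no-op

x = [[1.0] * d_in]
full_params = d_in * d_out             # parameters in a full update
lora_params = rank * (d_in + d_out)    # parameters in the low-rank update
print(full_params, lora_params)
```

With one factor initialized to zero, the adapted layer reproduces the base model exactly at the start of training. The savings grow with layer size, since the adapter scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.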

Adapters create extra neural pathways. Series adapters stack, each layer adjusting outputs gradually. Parallel adapters develop side skills, keeping the base intact.

Adapter placement is key. Series adapters fit after MLP layers. Parallel adapters excel within them. LoRA touches both attention and MLP layers. Each method targets the right spot.

Why it's useful: PEFT reduces resource demands. Large models gain new abilities without major changes. PEFT preserves the base while adding specialized skills. Hardware that struggled with fine-tuning now handles complex updates.

Tradeoffs: Not all tasks fit PEFT. Some models need deeper changes. Base model limitations still exist. Combining methods is tricky. PEFT may struggle with very complex tasks.

WizardMath

How it works: WizardMath learns in three distinct steps (Luo et al., 2023).

First is supervised fine-tuning. Here, the model picks up raw mathematical patterns. It starts recognizing basic structures. Patterns get mapped to solutions. This step builds intuition for common operations. The foundation is set.

Next, instruction reward models refine the process. They judge both answers and methods. These models look for efficiency. They guide the model toward elegant solutions. The focus shifts from correctness to quality.

Finally, PPO-based reinforcement learning enhances problem-solving. The model tests ideas, adapts, and improves. Evol-Instruct feedback loops refine its logic with each run (Xu et al. 2023). It gets better at selecting strategies.

Source Luo et al. (2023)

Why it's Useful: Most models just match patterns. WizardMath thinks in logical steps. It breaks down problems like a mathematician. It selects methods based on understanding, not memory. This leads to solutions that are both effective and precise.

Tradeoffs: Training WizardMath takes heavy computational resources. Its deep math focus limits general use. Low-quality data can introduce errors. Practical solutions can sometimes lose to elegant ones.

Divergent Chain of Thought (DCoT)

How it works: DCoT breaks the single-path approach (Puerto et al. 2024). Multiple paths form at once. Each one tackles the problem differently. Yet all conclude in a single inference.

Zero-shot generation creates diverse solutions. Every path seeks the truth. Each follows its own course. Some are direct. Others are more complex. All valid. The model acts like a group of experts. Each path offers a different view.

These paths then interact. Strong strategies merge. Weak points become clear. The model learns to assess its own reasoning. It compares methods. It blends insights. All this happens without extra training.

Why it's Useful: Multiple paths offer built-in validation. When paths align, certainty rises. When they don’t, issues appear. Different views reveal hidden details. Diversity deepens understanding.

Tradeoffs: More paths need more computation. Balancing variety and consistency is tricky. Conflicting paths need resolving. For simple tasks, it's overkill. A group isn't always better than one.

1.2. Pre-training and Knowledge Transfer

Continued Pre-training

How it works: Models like Galactica (Taylor et al. 2022) and MINERVA (Lewkowycz et al. 2022) go beyond standard training. They learn from over 100 billion tokens of scientific data. This includes mathematical papers, scientific articles, and technical documentation. Raw data is converted into structured knowledge.

Galactica includes tokens for specific scientific terms. It treats citations as part of the vocabulary. Chemical formulas become meaningful. Mathematical symbols are treated like tools. It learns the language of science.

MINERVA focuses on quantitative reasoning. It answers natural language questions in physics, chemistry, and economics. Converts questions into math formulas. Uses LaTeX to present detailed solutions. It performs the calculations on its own.

Why it's useful: Smaller models can surpass larger ones in specific fields. They grasp complex math. Work with technical notation naturally. The gap between general models and experts shrinks.

Tradeoffs: Training costs rise. Each field requires massive new data. As new knowledge grows, old knowledge fades. Balancing focus and breadth is hard. It might be great at physics but weak in other areas.

Curriculum Learning

How it Works: Learning transforms from random sampling to structured progression (Maharana & Bansal 2022). Like evolution, but guided. Deliberate. Purposeful.

A teacher network ranks training samples. Easy concepts come first. Complex ideas build on simple ones. The pacing function controls the flow of knowledge. Sometimes fixed. Sometimes adaptive. Responds to the model's growing understanding.

Three methods measure sample difficulty. Question Answering Probability tracks how often the model succeeds. Model Variability watches for consistent responses. Energy-based scoring identifies outliers and edge cases. The curriculum adapts based on these signals.
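The ranking and pacing described above can be sketched in a few lines. This is a toy illustration under stated assumptions: difficulty is approximated by a teacher's probability of answering correctly, the pacing function is a simple linear schedule, and all function names and scores are hypothetical.

```python
def rank_by_difficulty(samples, teacher_confidence):
    """Order samples easy-to-hard by the teacher's probability of a correct answer."""
    return sorted(samples, key=lambda s: -teacher_confidence[s])


def linear_pacing(step, total_steps, start_fraction=0.2):
    """Fraction of the ranked data available at a given training step."""
    progress = min(step / total_steps, 1.0)
    return start_fraction + (1.0 - start_fraction) * progress


def curriculum_pool(ranked, step, total_steps):
    """Return the easiest slice of data the model may sample from at this step."""
    cutoff = max(1, round(linear_pacing(step, total_steps) * len(ranked)))
    return ranked[:cutoff]


# Hypothetical teacher confidences for five questions.
confidence = {"q1": 0.95, "q2": 0.40, "q3": 0.75, "q4": 0.10, "q5": 0.60}
ranked = rank_by_difficulty(list(confidence), confidence)

for step in (0, 5, 10):
    print(step, curriculum_pool(ranked, step, total_steps=10))
```

An adaptive pacing function would replace `linear_pacing` with one that reads the model's current validation performance instead of the step count.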

Why it's Useful: Models learn more efficiently. They build strong foundations before tackling complexity. Understanding grows naturally. Organically. Each concept reinforces the last. Difficult ideas become manageable when approached in sequence.

Tradeoffs: Designing effective curricula challenges even experts. Learning time stretches longer. Some concepts resist ordered progression. The path from simple to complex isn't always clear. Sometimes chaos teaches better than order.

CoT Knowledge Distillation

How it Works: Large models become teachers. Small models become students. Knowledge transfers through carefully curated examples (Magister et al. 2023).

The process splits into two phases. First, generate Chain of Thought data. Large models solve problems step by step. Show their work. Create a roadmap of reasoning. Only correct solutions make the cut. Quality matters more than quantity.

Then comes student fine-tuning. Small models learn from these examples. They see not just answers but thinking processes. The target answer guides early steps. This prevents small errors from derailing entire solutions. Teacher forcing ensures the student stays on track.
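The first phase, keeping only correct teacher solutions, might look like the sketch below. `toy_teacher` is a stand-in for a large model and its canned outputs are invented for illustration; a real pipeline would call an LLM there.

```python
def build_distillation_set(problems, teacher_generate):
    """Keep only teacher chains whose final answer matches the gold answer."""
    examples = []
    for problem in problems:
        rationale, answer = teacher_generate(problem["question"])
        if answer == problem["answer"]:
            examples.append({
                "input": problem["question"],
                "target": f"{rationale} The answer is {answer}.",
            })
    return examples


def toy_teacher(question):
    """Hypothetical stand-in for a large teacher model."""
    canned = {
        "What is 2 + 3?": ("2 plus 3 equals 5.", "5"),
        "What is 4 * 6?": ("4 times 6 equals 26.", "26"),  # deliberately wrong
    }
    return canned[question]


problems = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 4 * 6?", "answer": "24"},
]
dataset = build_distillation_set(problems, toy_teacher)
print(dataset)
```

The resulting input/target pairs are what the student would then be fine-tuned on, so a wrong chain never reaches the student.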

Why it's Useful: Advanced reasoning becomes accessible to smaller models. Complex problem-solving skills transfer efficiently. Small models learn to think clearly with limited resources. They gain the wisdom of larger models without the computational burden.

Tradeoffs: Some sophistication gets lost in translation. Students never quite match their teachers. The distillation process demands careful curation. Bad examples can teach bad habits. The balance between compression and comprehension remains delicate.

2. Inference Time Approaches

2.1. Chain-Based Methods

Chain of Thought (CoT)

How it works: Wei et al. redefined reasoning with their 2022 paper. They introduced language models to step-by-step problem-solving using just eight examples. These guided models unlock hidden potential.

With precise prompts, models show their internal reasoning. No need for new training or changes to the model. This latent capability is accessed by using strategic examples.

The models learn to break down problems into logical steps that mimic human thinking. Each step becomes clear. The internal thought process shifts from a black box to a visible sequence.
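A few-shot chain-of-thought prompt is just worked examples followed by the new question. The sketch below assembles one; the helper name `build_cot_prompt` and the "Q:/A:" layout are assumptions, and a single exemplar is shown where the paper used eight.

```python
def build_cot_prompt(exemplars, question):
    """Join worked examples (question, reasoning, answer) ahead of a new question."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    parts.append(f"Q: {question}\nA:")   # left open for the model to continue
    return "\n\n".join(parts)


exemplars = [{
    "question": ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                 "Each can has 3 tennis balls. How many tennis balls does he have now?"),
    "reasoning": ("Roger started with 5 balls. 2 cans of 3 tennis balls each is "
                  "6 tennis balls. 5 + 6 = 11."),
    "answer": "11",
}]

prompt = build_cot_prompt(exemplars, "A baker has 23 rolls and sells 9. How many are left?")
print(prompt)
```

Because the exemplar shows its reasoning before the answer, the model tends to continue the final "A:" in the same step-by-step style.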

This approach scaled well. PaLM, with chain-of-thought prompting, hit 75.6% on StrategyQA. Even sports questions saw 95.4% accuracy, surpassing human experts. Complex math problems were solved with clear, step-by-step reasoning. In commonsense tasks, hidden assumptions surfaced in natural language. Symbolic problems became easy to follow.

Why it's useful: Wei et al.'s work showed breakthroughs across fields. In their analysis of LaMDA 137B, 96% of the correct answers came with sound reasoning. Problem-solving became transparent. Larger models produced more coherent explanations.

Tradeoffs: Reasoning sometimes fails. Models can get confused. Wei et al.'s analysis showed that 46% of wrong answers had minor mistakes, while 54% had major logical errors. Sequential reasoning can hit barriers. Complex tasks push models to their limits.

Program of Thought (PoT)

How it works: Chen et al.'s 2022 work changed how models approach math. They turned natural language into executable programs that solve complex problems with machine-level precision.

The process is seamless. Word problems convert directly into Python code. Variables capture key details from the text. Functions embody solution strategies. Algorithms emerge from simple descriptions. The model coordinates external tools with precision.
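As an illustration of that flow, the sketch below runs a small program of the kind a model might write for a word problem and reads the result from a variable. The word problem, the `generated` string, and the `ans` convention are assumptions made for this example. Emptying the builtins is only a minimal guard, not a security sandbox.

```python
def run_generated_program(program, result_name="ans"):
    """Execute model-written code in a bare namespace and return `result_name`.

    Emptying builtins limits accidental misuse but is NOT a sandbox;
    untrusted code should run in an isolated process or container.
    """
    namespace = {"__builtins__": {}}
    exec(program, namespace)
    return namespace[result_name]


# Problem: "Pens cost $3 each. How much do 12 pens cost with a 25% discount?"
# Hypothetical model output: variables capture the details, arithmetic is left to Python.
generated = """
price_per_pen = 3
pens = 12
discount = 0.25
ans = price_per_pen * pens * (1 - discount)
"""

result = run_generated_program(generated)
print(result)
```

The model only has to get the structure of the computation right; the interpreter does the arithmetic.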

PoT set new records, improving math benchmarks by 8% in few-shot settings. In zero-shot, the gains were 12%. The code tells a story with structured logic. Control flows mirror human thought. Programs serve as both solution and explanation.

PAL expanded on this. Gao et al. in 2023 showed how models could use Python interpreters for better reasoning. Complex calculations became sharper. Formal math operations translated into natural expression.

Why it's useful: Precision dominates. Math problems flow into code. The model combines high-level reasoning with computational accuracy. It's like a mathematician working alongside a supercomputer.

Tradeoffs: Some problems don't translate well into code. Executing programs raises security concerns. The model must handle both natural language and code, increasing the risk of errors.

2.2. Consistency and Verification Methods

Self-Consistency (SC)

How it works: Wang et al. introduced SC in 2022, shifting from greedy decoding to statistical sampling. This method changes inference entirely.

Instead of one solution, each step produces multiple paths. SC explores various reasoning attempts at once. The decoder samples different trajectories in the probability space. Errors are reduced by repeating steps, leading to validation through sampling.

SC’s statistical foundation is strong. It marginalizes over samples to minimize errors in individual paths. Think of it like quantum mechanics: multiple paths exist, and truth emerges from the statistical patterns.

Their approach was groundbreaking. The decoder generates n unique reasoning chains, each following a different probability path. Final answers come from majority voting, but the process goes beyond simple counts.
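The voting step can be sketched as follows. `sample_answer` stands in for one sampled reasoning chain reduced to its final answer; the canned answers here are invented for illustration.

```python
from collections import Counter


def self_consistency(sample_answer, question, n_samples=5):
    """Sample several reasoning chains and return the most common final answer."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Hypothetical final answers from five sampled chains for the same question.
canned = iter(["18", "18", "26", "18", "17"])
winner, agreement = self_consistency(lambda q: next(canned), "toy question")
print(winner, agreement)
```

The agreement ratio doubles as a rough confidence signal: a low value means the sampled chains disagree and the answer deserves a second look.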

Wang's team tested models from UL2-20B to PaLM-540B. Accuracy increased across the board. Smaller models showed the most improvement, indicating SC unlocks hidden potential in models of all sizes.

Why it's useful: Numbers tell the story. Multiple paths automatically validate answers. Different paths catch edge cases. Robustness increases as more paths are explored. Quantity becomes quality.

Tradeoffs: Computation grows costly. Each path demands resources. Memory use spikes. Contradictory paths sometimes arise. Resolving these conflicts adds complexity.

Self-Endorsement (SE)

How it works: Wang et al.’s 2024 paper introduced SE, a new verification method. The system generates diverse responses and then analyzes them. Facts are extracted, labeled, and compared. Cross-response validation assigns endorsement scores to each fact.

SE uses advanced fact extraction algorithms. Neural retrieval identifies key claims, and automatic cross-referencing helps the model distinguish strong facts from weaker ones. This statistical validation process drives the system.

High-scoring facts shape future outputs, while low-scoring ones lead to re-evaluation. Each pass refines the model’s response through consistency.
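A heavily simplified sketch of cross-response scoring is shown below. It assumes the facts have already been extracted as normalized strings and uses exact string matching in place of whatever comparison the actual method performs. The function name, the example facts, and the 0.5 threshold are all illustrative.

```python
from collections import Counter


def endorsement_scores(response_facts):
    """Score each fact by the fraction of the other responses that also state it."""
    n = len(response_facts)
    counts = Counter(fact for facts in response_facts for fact in set(facts))
    return {fact: (count - 1) / (n - 1) for fact, count in counts.items()}


# Hypothetical facts extracted from three sampled responses.
responses = [
    {"Paris is the capital of France", "The Eiffel Tower opened in 1889"},
    {"Paris is the capital of France", "The Eiffel Tower opened in 1889"},
    {"Paris is the capital of France", "The Eiffel Tower opened in 1901"},
]

scores = endorsement_scores(responses)
endorsed = sorted(fact for fact, score in scores.items() if score >= 0.5)
print(endorsed)
```

Facts that only one response asserts score zero and would be dropped or re-checked before the final answer is written.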

The fact extraction pipeline is highly technical. Named entity recognition identifies key elements, and relation extraction maps connections. All of this occurs without human input.

Why it's useful: Accuracy improves. Hallucinations drop. The system validates its own facts. Confidence scores make responses more reliable.

Tradeoffs: Processing takes longer. Fact extraction sometimes fails. Complex statements resist simple validation. Some valid facts get rejected if they don’t fit the statistical pattern.

Least-to-Most Prompting (LM)

How it works: Zhou et al. introduced LM in 2022, a system that breaks tasks into smaller parts and solves them step by step.

The process follows phases. First, the model analyzes the input. Next, it identifies sub-tasks. Then, it solves each part. Finally, it combines the results. Each phase builds on the previous one.

For example, in the last-letter task with "cat dog bird," the model processes each word separately. It finds ‘t’ from "cat," ‘g’ from "dog," and ‘d’ from "bird." Then, it combines them into "tgd." The model achieved 94% accuracy with four words and 74% even with twelve.
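The decomposition in that example can be traced with a short deterministic sketch. In practice each sub-problem would be a separate model call; here plain string operations stand in for the model, and the function names are illustrative.

```python
def decompose(words):
    """Stage 1: list sub-problems from the shortest prefix to the full list."""
    return [words[: i + 1] for i in range(len(words))]


def solve_incrementally(words):
    """Stage 2: solve each sub-problem by extending the previous answer."""
    answer = ""
    trace = []
    for prefix in decompose(words):
        answer = answer + prefix[-1][-1]   # reuse the prior result, add one last letter
        trace.append((" ".join(prefix), answer))
    return answer, trace


final, trace = solve_incrementally(["cat", "dog", "bird"])
for sub_problem, partial in trace:
    print(f"{sub_problem!r} -> {partial!r}")
```

Each step only has to append one letter to an answer it already has, which is why the approach extends to lists longer than the examples in the prompt.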

Errors are predictable. Sometimes letters drop during connection. Sometimes extras appear. But it rarely confuses the final letter of each word.

Why it's useful: LM is highly efficient. It only needs two examples to work well. It uses fewer tokens than traditional methods, achieving equal or better results.

Scaling is impressive. The model handles sequences four times longer than its training examples without losing accuracy. Standard methods fail on long sequences, scoring 31.8% on twelve-word tests. LM hits 74%, with a growing advantage on harder tasks.

Tradeoffs: Some tasks don't split easily. Certain problems need a different approach. The method requires more steps, which adds processing time.

Technical limits arise. The model must track partial solutions, and memory usage grows with longer sequences. Some tasks need several attempts to find the best split.

Careful planning is essential. The order of sub-tasks affects accuracy, and managing information efficiently becomes critical. The system must adapt its splitting strategy for different problems.

How to Test for Reasoning

Cognitive sciences have studied human reasoning since experimental psychology emerged in the late 19th century. This field has been crucial for technological development, education improvement, cognitive disorder treatment, and better decision-making. Scientists use various tools to study reasoning, including problem-solving tasks, computational models, brain imaging (fMRI and EEG), and behavioral measurements like eye-tracking. These combined methods help researchers understand how humans reason.

Analogously, AI researchers have invented reasoning tasks to test the reasoning capabilities of LLMs in the form of special datasets. Because AI is more of an engineering and computer science field, these datasets provide a rigorous benchmark to test AI systems. This allows researchers to measure a model’s accuracy and identify areas where it may be falling short.

A dataset for testing one type of reasoning should be diverse within that type, in order to cover the various complexities and nuances of its tasks. For example, to evaluate language models' commonsense capabilities, a dataset like ARC is used. The figure below shows a ranking of the best LLMs for the ARC-challenge dataset taken from different sources.

Inference-time techniques appear in green, training-time techniques appear in orange, and standard base models appear in blue.

In the image above, the best performing techniques are inference-time approaches. In particular, SC has a clear advantage over standard CoT. The fine-tuning approaches cannot match the inference-time approaches on this benchmark.

TLDR

Our research focuses on strengthening reasoning in Large Language Models (LLMs) in three ways. First, arithmetic reasoning - approaching math problems logically. Next, commonsense reasoning - grasping everyday situations and drawing conclusions. Finally, symbolic reasoning - handling abstract symbols by strict logic.

We explore two strategies to push these areas forward. The first is training-time methods. These adjust the model's learning process, tailoring it to specific tasks at the cost of time and computing power. For example, WizardMath teaches detailed problem-solving for math, while PEFT (Parameter Efficient Fine-Tuning) builds skills without huge resources. DCoT (Divergent Chain of Thought) allows AI to consider multiple solutions simultaneously.

The second approach is inference-time methods. These enhance existing models without retraining, bringing quick improvements, though sometimes with less depth. Chain of Thought (CoT) prompts AI to explain each step it takes. Program of Thought (PoT) has AI write and run code to boost accuracy. Self-Consistency (SC) checks multiple paths to ensure reliable answers.
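To make the Self-Consistency idea concrete, here is a minimal sketch assuming a majority vote over the final answers of several sampled reasoning paths. The `sample_answer` callable is a hypothetical stand-in for a model call that returns one final answer per sampled path; the canned sampler exists only to keep the example deterministic.

```python
from collections import Counter
from typing import Callable, Iterable


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    n_samples: int = 5,
) -> str:
    """Sample several answers to the same question and return the most frequent one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(question) for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)] for the top answer.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def make_canned_sampler(canned: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical sampler that replays fixed answers instead of calling a model."""
    iterator = iter(canned)

    def sample(_question: str) -> str:
        return next(iterator)

    return sample


# Five sampled paths: three agree on "18", two disagree with it.
sampler = make_canned_sampler(["18", "17", "18", "18", "20"])
final_answer = self_consistency(sampler, "toy arithmetic question", n_samples=5)
```

In a real setup, each call would sample a full chain of thought at a non-zero temperature and extract only the final answer before voting.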

The table below summarizes our findings.

Techniques for Enhancing AI Reasoning

By open-sourcing our research on AI reasoning, our team at Bagel aims to collaborate with the Open Source AI community to forge humanity's next chapter.
