Handling Missing Modalities in Centralised and Decentralised Environments
Jack Geraghty
University College Dublin
Supervisor: Dr Fatemeh Golpayegani · Co-supervisor: Dr Andrew Hines
11 May 2026
| Dataset | Modality | Cohen's d | p-value | Better |
|---|---|---|---|---|
| AVMNIST | Audio | −3.34 | 0.014 | Zero |
| ↳ | Image | −3.33 | 0.014 | Zero |
| MM-IMDb | Text | −1.73 | 0.031 | Zero |
| ↳ | Image | −0.41 | 0.34 | n.s. |
| MOSEI | Audio | +3.54 | <0.05 | Noise |
| ↳ | Video | +1.85 | <0.05 | Noise |
| ↳ | Text | +4.22 | <0.05 | Noise |
Cohen's d sign: negative ⇒ zero better; positive ⇒ noise better. All significant effects are large (|d| ≥ 1.73).
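
As a reading aid for the effect sizes above, here is a minimal sketch of Cohen's d with a pooled standard deviation. The sign convention (noise minus zero) is an assumption chosen to match the caption, and the per-seed scores are hypothetical, not values from the study.

```python
import math
from statistics import mean, variance


def cohens_d(zero_scores, noise_scores):
    """Cohen's d using a pooled sample standard deviation.

    Assumed sign convention: (mean(noise) - mean(zero)) / pooled_sd,
    so a negative value means the zero condition scored higher.
    """
    n_zero, n_noise = len(zero_scores), len(noise_scores)
    pooled_var = (
        (n_zero - 1) * variance(zero_scores)
        + (n_noise - 1) * variance(noise_scores)
    ) / (n_zero + n_noise - 2)
    return (mean(noise_scores) - mean(zero_scores)) / math.sqrt(pooled_var)


# Hypothetical per-seed accuracies, not values from the study.
zero_runs = [0.70, 0.72, 0.71]
noise_runs = [0.60, 0.62, 0.61]
d = cohens_d(zero_runs, noise_runs)
print(f"d = {d:.2f}")  # negative here, i.e. zero better under the assumed convention
```

With only a handful of runs per condition, d can be very large in magnitude, so it is best read alongside the p-value column.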

| Cond. | Random Init Δ (Has0 Acc) | Random Init Δ (Has0 F1W) | Fine-tuned Δ (Has0 Acc) | Fine-tuned Δ (Has0 F1W) |
|---|---|---|---|---|
| A | −0.0400 | −0.0543 | +0.0459 | +0.0284 |
| V | +0.0065 | +0.0047 | +0.0642 | +0.0545 |
| T | −0.0194 | −0.0081 | −0.0012 | +0.0020 |
| AV | −0.0015 | −0.0018 | −0.0089 | −0.0027 |
| AT | −0.0044 | −0.0034 | −0.0017 | −0.0024 |
| VT | +0.0035 | +0.0042 | +0.0011 | +0.0002 |
Δ = difference vs. frozen encoder baseline. Has0 = missing-modality condition. Negative = worse than frozen.
| Condition | Random Init | Fine-tuned |
|---|---|---|
| A | −0.0270 | +0.0400 |
| V | +0.0106 | +0.0451 |
| T | −0.0142 | −0.0019 |
| AV | +0.0105 | +0.0059 |
| AT | −0.0067 | −0.0050 |
| VT | +0.0046 | +0.0017 |
Mean Δ across Has0 Accuracy and Has0 F1 Weighted. Negative = worse than frozen baseline.
| Model & C-MAM | MAE | MSE |
|---|---|---|
| KS — Audio → Video | 3.82 | 32.79 |
| KS — Video → Audio | 0.69 | 1.32 |
| MM-IMDb — Image → Text | 0.30 | 0.16 |
| MM-IMDb — Text → Image | 0.29 | 0.14 |
| UTT-Fusion — AT → Video | 0.07 | 0.009 |
| UTT-Fusion — Audio → Video | 0.22 | 0.11 |
| UTT-Fusion — Audio → Text | 0.23 | 0.19 |
| UTT-Fusion — VT → Audio | 0.10 | 0.02 |
KS (Kinetics-Sounds) audio→video has the highest reconstruction error in the study (MSE = 32.79). High reconstruction error correlates with near-orthogonal encoder spaces, not decoder failure.
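
For reference, the two reconstruction metrics in the table can be sketched as below. The vectors are hypothetical stand-ins for a target embedding and a reconstructed one. How the study aggregates over dimensions and samples is not stated, so a plain mean over elements is assumed.

```python
def mean_absolute_error(target, predicted):
    """Mean of |target - predicted| over all elements."""
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)


def mean_squared_error(target, predicted):
    """Mean of (target - predicted)^2 over all elements."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


# Hypothetical 2-dimensional embeddings, not values from the study.
target_embedding = [0.0, 0.0]
reconstructed = [1.0, 3.0]
print(mean_absolute_error(target_embedding, reconstructed))  # 2.0
print(mean_squared_error(target_embedding, reconstructed))   # 5.0
```

MSE weights large deviations more heavily than MAE, which is one reason a row can show a moderate MAE next to a much larger MSE.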
| Dataset | MI Reduction adding 2nd modality | Modality Dominance | PMI co-occurrence | Cosine Similarity | Sym. KL Divergence | C-MAM Recovery |
|---|---|---|---|---|---|---|
| MOSEI | Moderate | Text | Low (0.096) | Moderate 0.28–0.38 | 0.338 | Moderate–High (text-bounded) |
| Kinetics-Sounds | Strong ↓ MI | Video | Very low (0.006) | Near-orthogonal −0.040 | 2.954 | Low–Moderate (encoder bottleneck) |
| AVMNIST | Weak | Image | High (6.894) | Near-orthogonal 0.006 | 0.852 | High (task redundancy) |
Kinetics-Sounds: high KL + near-orthogonal cosine = encoders learned competing decision boundaries. AVMNIST: near-orthogonal cosine BUT high PMI — task redundancy compensates for geometric misalignment.
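
The two geometric diagnostics in the table can be sketched as follows. This assumes cosine similarity is taken between embedding vectors and the symmetric KL divergence between discrete distributions, defined here as KL(p‖q) + KL(q‖p). Whether the study uses this sum or half of it is not stated, and the inputs are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Assumed definition: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical inputs: two orthogonal embeddings and two opposed class distributions.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(symmetric_kl([0.9, 0.1], [0.1, 0.9]))
```

A cosine near zero with a high symmetric KL is the Kinetics-Sounds pattern described in the caption above.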
| Mapping | Model | RMSE | √Var(Y\|X) | Δ | H1 |
|---|---|---|---|---|---|
| A→T ★ | C-MAM | 0.588 | 0.154 | +0.434 | No |
| ↳ | MMIN | 0.171 | 0.095 | +0.076 | No |
| ↳ | RedCore | 0.816 | 0.662 | +0.154 | No |
| V→T | C-MAM | 0.262 | 0.155 | +0.107 | No |
| ↳ | MMIN | 0.144 | 0.052 | +0.092 | No |
| ↳ | RedCore | 0.884 | 0.722 | +0.162 | No |
| AT→V ★ | C-MAM | 0.213 | 0.161 | +0.052 | No |
| ↳ | MMIN | 0.419 | 0.167 | +0.252 | No |
| ↳ | RedCore | 0.813 | 0.638 | +0.175 | No |
★ A→T: C-MAM has the largest gap to the bound (RMSE 0.588, Δ = +0.434) yet the best behavioural recall (0.762); MMIN and RedCore are closer geometrically (smaller Δ) but far worse behaviourally. AT→V: C-MAM is nearest to the bound (Δ = +0.052). AV→T, T→V and VT→A are omitted; all show no support for H1.
| Mapping | Model | RMSE | √Var(Y\|X) | Δ | H1 |
|---|---|---|---|---|---|
| A→V | C-MAM | 0.126 | 0.071 | +0.056 | No |
| ↳ | MMIN | 0.419 | 0.167 | +0.252 | No |
| ↳ | RedCore | 0.765 | 0.700 | +0.064 | No |
| AT→V ★ | C-MAM | 0.115 | 0.071 | +0.045 | Yes |
| ↳ | MMIN | 0.144 | 0.052 | +0.092 | No |
| ↳ | RedCore | 0.813 | 0.638 | +0.175 | No |
| TV→A ★ | C-MAM | 0.173 | 0.128 | +0.045 | Yes |
| ↳ | MMIN | 0.209 | 0.121 | +0.087 | No |
| ↳ | RedCore | 0.840 | 0.674 | +0.166 | No |
★ C-MAM satisfies H1 in 2/6 mappings (AT→V, TV→A; Δ = +0.045 each). MMIN and RedCore fail all six. Remaining mappings (V→A, AV→T, T→A) all show No support.
| Mapping | Model | RMSE | √Var(Y\|X) | Δ | H1 |
|---|---|---|---|---|---|
| A→T | C-MAM | 0.168 | 0.105 | +0.062 | No |
| ↳ | MMIN | 0.184 | 0.097 | +0.087 | No |
| ↳ | RedCore | 0.821 | 0.742 | +0.079 | No |
| V→A ★ | C-MAM | 0.139 | 0.116 | +0.024 | Yes |
| ↳ | MMIN | 0.201 | 0.122 | +0.079 | No |
| ↳ | RedCore | 0.824 | 0.782 | +0.042 | Yes |
| TV→A ★ | C-MAM | 0.137 | 0.116 | +0.021 | Yes |
| ↳ | MMIN | 0.158 | 0.117 | +0.040 | Yes |
| ↳ | RedCore | 0.887 | 0.753 | +0.134 | No |
★ C-MAM supports H1 in 5/6 mappings (only A→T fails, Δ = +0.062). MMIN supports in 2 (T→V, TV→A). RedCore supports in 1 (V→A). Strongest H1 result across all three datasets.
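
The Δ column in the H1 tables is the gap between a model's RMSE and the √Var(Y|X) bound. The sketch below recomputes it for a few rows of the table above. The decision rule that turns Δ into a Yes/No for H1 is not stated here, so it is deliberately left out.

```python
# (mapping, model, RMSE, sqrt_var_bound, published_delta), copied from the table above.
rows = [
    ("A→T", "C-MAM", 0.168, 0.105, 0.062),
    ("V→A", "C-MAM", 0.139, 0.116, 0.024),
    ("TV→A", "C-MAM", 0.137, 0.116, 0.021),
    ("TV→A", "MMIN", 0.158, 0.117, 0.040),
]


def gap_to_bound(rmse, sqrt_var_bound):
    """Delta = RMSE - sqrt(Var(Y|X)): how far a model sits above the bound."""
    return rmse - sqrt_var_bound


for mapping, model, rmse, bound, published in rows:
    delta = gap_to_bound(rmse, bound)
    print(f"{mapping:5s} {model:6s} recomputed Δ = {delta:+.3f} (table: {published:+.3f})")
```

Recomputed values can differ from the table by about 0.001, presumably because the table rounds RMSE and the bound separately.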
| Mapping | C-MAM Neutral Recall | MMIN Neutral Recall | RedCore Neutral Recall | C-MAM Bal. F1 | MMIN Bal. F1 | RedCore Bal. F1 | H2.1 | H2.2 |
|---|---|---|---|---|---|---|---|---|
| A→T | 0.762 | 0.114 | 0.000 | 0.365 | 0.382 | 0.334 | Yes | Yes |
| V→T | 0.743 | 0.072 | 0.008 | 0.376 | 0.395 | 0.377 | Yes | Yes |
| AT→V | 0.378 | 0.382 | 0.450 | 0.615 | 0.622 | 0.601 | No | No |
| T→V | 0.434 | 0.327 | 0.446 | 0.619 | 0.616 | 0.600 | Yes | No |
| AV→T | 0.298 | 0.140 | 0.025 | 0.381 | 0.410 | 0.384 | No | Yes |
| TV→A | 0.233 | 0.321 | 0.397 | 0.586 | 0.617 | 0.596 | Yes | No |
H2.1: C-MAM ≥95% parity in 4/6 mappings. H2.2: RedCore recall < MMIN recall in 3/6 (A→T, V→T, AV→T); collapses to near-zero in unimodal text conditions.
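
The H2.2 column can be reproduced from the recall columns, reading H2.2 as "RedCore neutral recall is below MMIN neutral recall", as the caption states. A minimal sketch using the values from the table above:

```python
# mapping -> (MMIN neutral recall, RedCore neutral recall), copied from the table above.
neutral_recall = {
    "A→T": (0.114, 0.000),
    "V→T": (0.072, 0.008),
    "AT→V": (0.382, 0.450),
    "T→V": (0.327, 0.446),
    "AV→T": (0.140, 0.025),
    "TV→A": (0.321, 0.397),
}

# H2.2 holds for a mapping when RedCore recalls fewer neutral samples than MMIN.
h2_2_support = {
    mapping: redcore < mmin for mapping, (mmin, redcore) in neutral_recall.items()
}
supported = [mapping for mapping, holds in h2_2_support.items() if holds]
print(f"H2.2 supported in {len(supported)}/{len(h2_2_support)}: {supported}")
```

This yields the same three mappings the caption lists (A→T, V→T, AV→T).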
| Mapping | C-MAM Neutral Recall | MMIN Neutral Recall | RedCore Neutral Recall | C-MAM Bal. F1 | MMIN Bal. F1 | RedCore Bal. F1 | H2.1 | H2.2 |
|---|---|---|---|---|---|---|---|---|
| A→V | 0.477 | 0.330 | 0.091 | 0.411 | 0.419 | 0.363 | Yes | Yes |
| AT→V | 0.610 | 0.588 | 0.441 | 0.603 | 0.580 | 0.528 | Yes | Yes |
| V→A | 0.116 | 0.047 | 0.009 | 0.461 | 0.476 | 0.469 | Yes | Yes |
| AV→T | 0.429 | 0.408 | 0.242 | 0.583 | 0.558 | 0.518 | Yes | Yes |
| T→A | 0.422 | 0.432 | 0.378 | 0.525 | 0.527 | 0.498 | Yes | Yes |
| TV→A | 0.467 | 0.541 | 0.438 | 0.647 | 0.558 | 0.506 | No | Yes |
H2.1: C-MAM ≥95% parity in 5/6 mappings (TV→A only failure: recall 0.467 vs. MMIN 0.541). H2.2: RedCore underperforms MMIN on neutral recall in all 6/6 mappings.
| Mapping | C-MAM Neutral Recall | MMIN Neutral Recall | RedCore Neutral Recall | C-MAM Bal. F1 | MMIN Bal. F1 | RedCore Bal. F1 | H2.1 | H2.2 |
|---|---|---|---|---|---|---|---|---|
| A→V | 0.270 | 0.457 | 0.419 | 0.448 | 0.550 | 0.534 | No | Yes |
| AT→V | 0.595 | 0.640 | 0.599 | 0.718 | 0.727 | 0.671 | No | Yes |
| V→A | 0.336 | 0.490 | 0.380 | 0.429 | 0.470 | 0.472 | No | Yes |
| AV→T | 0.452 | 0.489 | 0.427 | 0.629 | 0.640 | 0.625 | No | Yes |
| T→A | 0.597 | 0.645 | 0.599 | 0.561 | 0.640 | 0.640 | No | Yes |
| TV→A | 0.679 | 0.544 | 0.512 | 0.667 | 0.647 | 0.626 | Yes | Yes |
H2.1: C-MAM meets parity threshold in 1/6 mappings (TV→A only). H2.2: RedCore underperforms MMIN on neutral recall in all 6/6 mappings.
| Client modalities | Base model | FedC-MAMs | Δ |
|---|---|---|---|
| A only | 0.251 | 0.693 | +0.442 |
| V only | 0.343 | 0.661 | +0.318 |
| AV | 0.403 | 0.668 | +0.265 |
| T only | 0.667 | 0.730 | +0.063 |
| AT | 0.740 | 0.763 | +0.023 |
| TV | 0.717 | 0.751 | +0.034 |
Yellow: text-absent clients (A, V, AV), with gains statistically significant (Welch t > 30, p < 0.001). Text-containing clients: small, non-significant deltas.
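
The significance claims rest on Welch's unequal-variance t-test. A minimal sketch of the statistic and its Welch–Satterthwaite degrees of freedom follows. The per-run scores are hypothetical, and the p-value would additionally need a t-distribution CDF (for example `scipy.stats.ttest_ind(a, b, equal_var=False)`).

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t_stat, dof


# Hypothetical per-simulation scores, not values from the study.
fed_cmam_runs = [0.69, 0.70, 0.68, 0.71, 0.69]
base_runs = [0.25, 0.26, 0.24, 0.25, 0.26]
t_stat, dof = welch_t(fed_cmam_runs, base_runs)
print(f"t = {t_stat:.1f}, df = {dof:.1f}")
```

A large positive t here would correspond to FedC-MAMs outscoring the base model, matching the direction of the Δ column.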
| Client modalities | Base model | FedC-MAMs | Δ |
|---|---|---|---|
| A only | 0.362 | 0.396 | +0.034 |
| V only | 0.290 | 0.427 | +0.137 |
| AV | 0.500 | 0.502 | +0.002 |
| T only | 0.294 | 0.465 | +0.171 |
| AT | 0.570 | 0.442 | −0.128 |
| TV | 0.419 | 0.593 | +0.174 |
Yellow: modality-limited gains, significant (p < 0.001). Red: AT boundary case; reconstruction detrimental (p < 0.001).
| Dataset | MMIN | RedCore | C-MAMs |
|---|---|---|---|
| MOSEI | 8.19 | 19.70 | 4.03 |
| MSP-IMPROV | 16.78 | 41.14 | 6.63 |
kWh at 0.03 kWh GB⁻¹ · 1000 rounds · 20 clients · 100 simulations.
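
Because the caption fixes the conversion at 0.03 kWh per GB, the table can be turned back into implied data volumes and relative costs. The sketch below assumes energy scales linearly with data transferred. Whether each figure is a per-simulation mean or a total is not stated, so only volumes and ratios are derived.

```python
KWH_PER_GB = 0.03  # conversion factor from the caption

# Energy in kWh, copied from the table above.
energy_kwh = {
    "MOSEI": {"MMIN": 8.19, "RedCore": 19.70, "C-MAMs": 4.03},
    "MSP-IMPROV": {"MMIN": 16.78, "RedCore": 41.14, "C-MAMs": 6.63},
}


def implied_traffic_gb(kwh):
    """Data volume implied by an energy figure under the linear conversion."""
    return kwh / KWH_PER_GB


for dataset, by_model in energy_kwh.items():
    baseline = by_model["C-MAMs"]
    for model, kwh in by_model.items():
        print(
            f"{dataset:11s} {model:8s} "
            f"{implied_traffic_gb(kwh):8.1f} GB  "
            f"{kwh / baseline:4.2f}x C-MAMs"
        )
```

On these figures MMIN costs roughly 2.0x to 2.5x the energy of C-MAMs, and RedCore roughly 4.9x to 6.2x.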