Learning to Associate:

Handling Missing Modalities in Centralised and Decentralised Environments

Jack Geraghty

University College Dublin

Supervisor: Dr Fatemeh Golpayegani · Co-supervisor: Dr Andrew Hines

11 May 2026

Prelude

Housekeeping Prelude

Six datasets

AVMNIST · Kinetics-Sounds · MM-IMDb
CMU-MOSEI · MSP-IMPROV · IEMOCAP

Not all appear in every result.

Scope of results

Core narrative only.
Top-level metrics per RQ are skipped.
They show things work or don't. Not why.

No literature slide

You've read the thesis.

Just a few housekeeping notes before we start.

▶ Six datasets — not all appear everywhere

Six datasets were used across the thesis: AVMNIST, Kinetics-Sounds, MM-IMDb, CMU-MOSEI, MSP-IMPROV, and IEMOCAP. You'll see different subsets across the results — that's deliberate, not an oversight. Some experiments were only run on a subset, and presenting all six for every claim would be both time-consuming and visually overwhelming.

▶ Not all results from the thesis are covered

The presentation covers the core narrative and the key findings. What I've cut are the top-level performance metrics per RQ — degradation baselines in RQ1, C-MAM recovery ratios in RQ2, comparisons to MMIN and RedCore in RQ3. These results are important but they're the surface layer: they show that something is broken or working, but not why or how. The deeper analysis comes after them and that's what we're here for.

▶ No dedicated literature slide

There's no dedicated literature slide. You've read the thesis, so I don't need to walk you through it again. The context that matters will surface naturally in the first content slide — missing modalities in real-world systems, asymmetric degradation, and the retraining problem.

Problem

Missing Modalities Break Multimodal Systems Problem

Multimodal models assume joint availability of all modalities. In practice, this assumption is frequently violated.

Modalities may be absent

Sensor failure, privacy constraints, and bandwidth limitations.

Behaviour changes

Performance degradation is not uniform. Confidence and calibration may shift.

Retraining is often impractical

Systems may already be deployed or data unavailable.

When modalities are absent, behaviour changes, and retraining is impractical, how can these effects be understood and handled?

Problem

Multimodal systems are built on the assumption that all modalities will be present at inference time. That assumption is wrong more often than the literature acknowledges.

▶ Modalities may be absent

The reasons are simple: a sensor fails, a user opts out for privacy, a bandwidth constraint drops a stream. None of these are edge cases — they are normal conditions for any deployed system.

▶ Behaviour changes — animation starts

On the right is a multimodal sentiment analysis model that takes in audio, video, and text. The animation steps through different missing-modality configurations — video removed, audio and video removed, text removed, audio and text removed. The same model, the same task, but the output changes depending on what combination of modalities is actually available at inference time. That variability is the core of the problem.

▶ Retraining is often impractical

And you can't always just retrain. The model may already be deployed, the original training data may no longer be available, and retraining a large multimodal model is computationally expensive even when neither of those is a problem. You would also need to build in some technique for handling missing modalities, and as I will show, that is not always simple, or even the best thing to do.

▶ Closing question

So the question is: when modalities are absent, behaviour changes, and retraining is impractical — how can those effects be understood, and how can they be handled?

Measurement

Evaluating Multimodal Models Measurement

Understanding the impact of missing modalities requires considering how multimodal models are evaluated.

Common evaluation perspectives

Full-modality performance, modality ablation, and attribution.

What these capture

Performance reflects outcomes. Ablation reflects dependence. Attribution reflects contribution.

Limitation

These signals are typically analysed independently.

If contribution and dependence are analysed separately, can they explain behaviour under missing modalities?

Modality Contribution

Average marginal utility across modality subsets

Estimated via MM-SHAP

Modality Dependence

Effect of removing a modality

Estimated via ablation

Measurement

Before getting into the research questions, it's worth being precise about how multimodal models are evaluated — because the thesis works across three different evaluation perspectives and the distinction matters.

▶ Common evaluation perspectives

The three perspectives are full-modality performance, modality ablation, and attribution. These are the standard tools available for assessing how a multimodal model behaves.

▶ What these capture

Each one answers a different question. Performance reflects outcomes — what the model produces. Ablation reflects dependence — what the model loses when a modality is removed. Attribution reflects contribution — how much each modality influenced the output when all were present.

▶ Modality Contribution card appears

Modality contribution is estimated via MM-SHAP, which is derived from Shapley value theory. It quantifies the average marginal utility of a modality across all possible subsets of the available modalities — not just when all inputs are present.
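To make "average marginal utility across subsets" concrete, here is a minimal sketch of exact Shapley averaging at the modality level. It is not the MM-SHAP implementation used in the thesis; the `TOY_UTILITY` table and the function name `shapley_contributions` are made up for illustration, and the utility values are placeholders rather than results.

```python
from itertools import combinations
from math import factorial

def shapley_contributions(modalities, value):
    """Exact Shapley value per modality for a set-valued utility `value`."""
    n = len(modalities)
    contrib = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Standard Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal utility of adding modality m to coalition s.
                total += weight * (value(s | {m}) - value(s))
        contrib[m] = total
    return contrib

# Hypothetical utilities (e.g. accuracy) for each available-modality subset.
TOY_UTILITY = {
    frozenset(): 0.50,
    frozenset({"audio"}): 0.55,
    frozenset({"text"}): 0.80,
    frozenset({"audio", "text"}): 0.85,
}

contributions = shapley_contributions(["audio", "text"], TOY_UTILITY.__getitem__)
```

In this toy case the contributions sum to the gap between the full-modality utility and the empty-set utility, which is the efficiency property of Shapley values.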

▶ Modality Dependence card appears

Modality dependence is measured by removing a modality and observing the effect. It tells you how much the model relies on that modality functionally.

These two look like they should agree. The thesis shows they often don't — and that gap is where the work begins.
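The ablation-based dependence measure can be sketched as the accuracy lost when one modality is zeroed out. This is an illustrative sketch under assumptions: `toy_predict` is a hypothetical stand-in for a trained model, the samples are invented, and the thesis may define the dependence score differently.

```python
def accuracy(predict, samples, labels):
    correct = sum(predict(x) == y for x, y in zip(samples, labels))
    return correct / len(labels)

def zero_mask(sample, modality):
    """Return a copy of `sample` with one modality replaced by zeros."""
    masked = dict(sample)
    masked[modality] = [0.0] * len(sample[modality])
    return masked

def modality_dependence(predict, samples, labels, modality):
    """Accuracy with all modalities minus accuracy with `modality` zeroed."""
    full = accuracy(predict, samples, labels)
    ablated = accuracy(predict, [zero_mask(s, modality) for s in samples], labels)
    return full - ablated

# Hypothetical toy model: the prediction is driven almost entirely by "image".
def toy_predict(sample):
    score = 1.0 * sum(sample["image"]) + 0.01 * sum(sample["audio"])
    return int(score > 0)

# Invented samples: image tracks the label, audio does not.
samples = [
    {"image": [1.0, 0.5], "audio": [0.2]},
    {"image": [-1.0, -0.5], "audio": [0.2]},
    {"image": [0.8, 0.1], "audio": [-0.3]},
    {"image": [-0.7, -0.2], "audio": [-0.1]},
]
labels = [1, 0, 1, 0]

image_dep = modality_dependence(toy_predict, samples, labels, "image")
audio_dep = modality_dependence(toy_predict, samples, labels, "audio")
```

Here removing image halves the accuracy while removing audio changes nothing, which is the kind of asymmetry a dependence score is meant to expose.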

▶ Limitation

The problem is that these signals are typically analysed independently. Contribution is reported. Dependence is reported. But they're rarely examined together, and the relationship between them is assumed rather than tested.

▶ Closing question

If contribution and dependence are analysed separately, can they actually explain what happens to a model when a modality goes missing? That's the question this slide sets up — and the answer, as we'll see, is no.

RQ1 — Modality Absence and Behavioural Degradation

RQ1 — Modality Absence and Behavioural Degradation RQ1

Research question

How and to what extent does model performance change when an entire modality is unavailable at inference time, and how reliably can this change be anticipated?

— Under Review with TBD

Sub-questions

RQ1.1

How closely do MM-SHAP contributions align with accuracy drop, confidence change, and calibration shift once the modality is gone?

RQ1.2

Does Gaussian noise masking improve, worsen, or leave unchanged the baseline drop observed with zero masking?

RQ1.3

Can validation-time indicators or MM-SHAP contributions predict a model's dependence on a modality before removal?

Approach

Systematic modality ablation across datasets and models · Interpretable Signals vs. Effective Utility

Comparison of attribution and observed behavioural changes; evaluation of masking strategies (zero and Gaussian noise) · Representing Missingness: Zero-Masking vs Gaussian Noise

Tracking training-time modality dependence against validation behaviour · Training Dynamics and Early Indicators

The first research question asks how model performance changes when a modality is entirely absent at inference time, and whether that change can be anticipated. It covers three sub-questions, each targeting a different part of that problem.

▶ RQ1.1 + Approach

The first asks how well MM-SHAP contribution scores align with the actual behavioural shifts — accuracy drop, confidence change, and calibration shift — that occur once a modality is removed. The approach is systematic ablation across all six datasets and models, comparing attribution signals against observed behavioural changes.
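One common way to quantify a calibration shift is the change in expected calibration error (ECE) between the full-modality and ablated conditions. The sketch below assumes ECE with equal-width confidence bins; the thesis may use a different calibration metric, and the confidences and correctness flags are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_acc = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(avg_acc - avg_conf)
    return ece

# Hypothetical predictions: confidence stays high, but fewer are correct
# once a modality has been removed.
conf = [0.9, 0.9, 0.9, 0.9]
full_correct = [1, 1, 1, 1]
ablated_correct = [1, 0, 0, 1]

calibration_shift = (expected_calibration_error(conf, ablated_correct)
                     - expected_calibration_error(conf, full_correct))
```

A positive shift here means the model has become more overconfident after removal, even though its reported confidence did not move.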

▶ RQ1.2 + Approach

The second asks whether the way absence is represented matters — specifically, does Gaussian noise masking produce different results than zero masking? The approach compares both masking strategies across datasets and models to assess whether the choice of absence representation affects measured robustness.

▶ RQ1.3 + Approach

The third asks whether modality dependence can be detected before removal — using validation-time indicators or MM-SHAP scores during training. The approach tracks modality dependence trajectories across epochs and compares them against validation behaviour.

RQ1.1

RQ1 — Modality Absence and Behavioural Degradation RQ1.1 Results

RQ1.1 — Attribution vs dependence

MM-SHAP contributions do not consistently reflect functional dependence under modality removal.

AVMNIST False High

Balanced attribution masks extreme functional reliance on visual features; highly attributed modalities can be removed with zero accuracy impact.

Kinetics-Sounds Attribution Holds

MM-SHAP correctly reflects the asymmetry — audio scores near zero, video near 1.0 — one of the few cases where attribution and functional dependence align.

CMU-MOSEI Text Dominant

Text drives accuracy and calibration, while MM-SHAP over-attributes redundant audio/video streams that retain high confidence despite accuracy collapse.

MM-IMDb Masked Dominance

Symmetrical attribution masks structural text dominance; confidence shifts vary across the SHAP spectrum, showing attribution captures engagement rather than functional sensitivity.

What this shows

Modality contribution and modality dependence frequently disagree despite both aiming to quantify the same thing.

Balanced contribution scores mask skewed dependence across multiple datasets, and deeper behavioural probes expose structural fragility that contribution scores alone cannot surface.

Why it matters

MM-SHAP captures surface-level engagement, not functional necessity.

A complete evaluation requires measuring both contribution and dependence; neither in isolation tells the full story.

RQ1.1

Attribution and functional dependence are both trying to quantify the same thing, but the experiments show they frequently disagree — and the disagreement is informative.

RQ1.1 — Attribution vs Dependence

The answer is that MM-SHAP contributions do not consistently reflect functional dependence under modality removal. The figure panel on the right pairs a modality dependence plot with the MM-SHAP score distribution for each dataset — they should track each other, and they often don't.

▶ AVMNIST — False High

In AVMNIST, attribution scores are balanced across audio and image, suggesting roughly equal contribution. But functionally, image dominates — remove it and accuracy collapses; remove audio and the impact is minimal. The contribution score gave no indication that asymmetry was there.

▶ Kinetics-Sounds — Attribution Holds

Kinetics-Sounds shows the opposite pattern. Video removal causes performance to collapse while audio is largely dispensable, and MM-SHAP attribution reflects this directly: audio scores cluster near zero while video scores concentrate near 1.0, making this one of the few cases where attribution and functional dependence align.

▶ CMU-MOSEI — Text Dominant

MOSEI has a known characteristic: text drives almost everything. But MM-SHAP over-attributes audio and video, which retain high model confidence even as accuracy collapses when text is removed. Attribution inflates the apparent importance of modalities that are functionally redundant.

▶ MM-IMDb — Masked Dominance

MM-IMDb tells a similar story. Symmetric attribution masks structural text dominance. The confidence shifts under ablation vary across the SHAP distribution — showing that MM-SHAP captures engagement with the output, not sensitivity to removal.

▶ What this shows

Contribution and dependence frequently disagree. Balanced attribution scores can mask skewed dependence, and deeper behavioural probes — confidence and calibration — expose structural fragility that contribution scores alone cannot surface.

▶ Why it matters

MM-SHAP captures surface-level engagement, not functional necessity. A complete evaluation requires both contribution and dependence — neither on its own tells the full story.

RQ1.2

RQ1 — Modality Absence and Behavioural Degradation RQ1.2 Results

RQ1.2 — Zero masking vs Gaussian noise

Gaussian noise does not consistently improve on zero masking; the optimal absence representation is encoder-specific and must be validated empirically.

AVMNIST Zero better

Structured nulls are in-distribution; encoders were trained with silence and blank frames as the normal zero state.

MM-IMDb Mixed

Text favours stable nulls; image masking has negligible effect, as the visual stream is functionally weak and redundant.

MOSEI Noise better

Zero injects misleading semantic structure; noise better represents true absence for all three modalities.

Cross-dataset Cohen's d: zero vs. Gaussian noise masking

Dataset	Modality	Cohen's d	p-value	Better
AVMNIST	Audio	−3.34	0.014	Zero
↳	Image	−3.33	0.014	Zero
MM-IMDb	Text	−1.73	0.031	Zero
↳	Image	−0.41	0.34	n.s.
MOSEI	Audio	+3.54	<0.05	Noise
↳	Video	+1.85	<0.05	Noise
↳	Text	+4.22	<0.05	Noise

Cohen's d sign: negative => zero better; positive => noise better. All significant effects are large (|d| ≥ 1.73).

What this shows

How absence is represented changes how the model behaves. Zero-masking and Gaussian noise produce inconsistent results across datasets.

Neither is a safe universal default, and models respond in ways that cannot be predicted from contribution scores alone.

Why it matters

Masking strategy is not a neutral implementation detail; it is a design decision that directly shapes measured robustness.

Credible evaluation requires understanding which representation of absence a given architecture is sensitive to.

RQ1.2

How absence is encoded at inference time is not a neutral choice — zero masking and Gaussian noise make different implicit claims about what missing looks like, and models respond to them inconsistently across datasets.
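The two absence representations can be sketched as follows. The noise mean, standard deviation, and seed handling are assumptions for illustration, not the settings used in the experiments.

```python
import random

def zero_mask(features):
    """Represent absence as an all-zero vector of the same length."""
    return [0.0] * len(features)

def gaussian_mask(features, mean=0.0, std=1.0, seed=None):
    """Represent absence as i.i.d. Gaussian noise of the same length."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in features]

# Placeholder feature vector for one modality.
audio = [0.3, -1.2, 0.7, 0.05]
audio_zero = zero_mask(audio)
audio_noise = gaussian_mask(audio, seed=0)
```

Both functions discard the original features entirely; they differ only in what the downstream encoder sees in their place.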

RQ1.2 — Zero Masking vs Gaussian Noise

The answer is that Gaussian noise does not consistently improve on zero masking. The table shows Cohen's d for each dataset and modality, comparing the two strategies. All significant effects are large — this isn't noise in the results, it's a real difference in how models respond to the two representations of absence.
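For reference, a minimal sketch of Cohen's d with a pooled standard deviation. The sign convention (noise minus zero, so a negative value means zero masking scored higher) is chosen to match the table's footnote and is an assumption about how the thesis computed it; the per-seed scores are invented.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical per-seed accuracies under each masking strategy.
noise_runs = [0.60, 0.62, 0.61]
zero_runs = [0.70, 0.72, 0.71]

d = cohens_d(noise_runs, zero_runs)  # negative => zero masking scored higher
```

Because d is scaled by the run-to-run spread, a small absolute accuracy gap can still give a very large effect size when the runs are tightly clustered.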

▶ AVMNIST — Zero better (+ table appears)

In AVMNIST, zero masking significantly outperforms Gaussian noise for both audio and image, with Cohen's d around −3.3. The encoders were trained on data where silence and blank frames are the natural zero state, so a structured null is interpretable. Random noise is not.

▶ MM-IMDb — Mixed

MM-IMDb is mixed. Zero masking is better for text — a large effect. For image, the difference is not significant: the visual stream is functionally weak anyway, so the choice of masking barely matters when that modality is already dispensable.

▶ MOSEI — Noise better

MOSEI reverses the pattern completely. For all three modalities, Gaussian noise produces better performance than zero masking, with large positive Cohen's d values. In this model, zero vectors inject misleading semantic structure — silence or a blank frame carries meaning — and noise better represents true absence.

▶ What this shows

How absence is encoded changes how the model behaves. The two strategies produce inconsistent and sometimes opposing results across datasets. Neither is a safe universal default.

▶ Why it matters

Masking strategy is not a neutral implementation detail. It is a design decision that directly determines which lower bound on performance you are measuring. Credible robustness evaluation requires knowing which representation of absence a given architecture is sensitive to.

RQ1.3

RQ1 — Modality Absence and Behavioural Degradation RQ1.3 Results

RQ1.3 — Can dependence be anticipated?

Yes, but not from raw validation alone: modality dependence emerges early in training and remains broadly stable, while unimodal validation can misrepresent true reliance.

AVMNIST Early Lock-in

Image dependence appears within the first few epochs and remains consistently higher than audio, even when both unimodal metrics improve.

MM-IMDb Text Dominant

Both modalities begin comparably, then text quickly comes to dominate; partial rebalancing appears later, but validation F1 alone does not reveal how much the full model still relies on text.

MOSEI Deceptive Metrics

Audio-only and video-only can show moderate early performance, yet dependence remains low; text alone shows rising and sustained functional importance.

AVMNIST accuracy modality dependence per epoch

AVMNIST F1-weighted modality dependence per epoch

MM-IMDb F1 micro modality dependence per epoch

MM-IMDb F1 macro modality dependence per epoch

MOSEI Has0 training dynamics vs modality dependence

MOSEI Non0 training dynamics vs modality dependence

What this shows

Modality dependence emerges as a stable trajectory early in training, well before convergence.

Standard validation metrics miss this; a modality can appear performant while being effectively redundant in the multimodal context.

Why it matters

Dependence can be diagnosed during training, not just tested after it.

Early detection creates a window to intervene before over-reliance on a single modality becomes structural.

RQ1.3

Modality dependence is not a fixed property that appears at convergence — it emerges as a trajectory early in training, often well before validation metrics give any indication of what the model is actually relying on.

RQ1.3 — Can Dependence Be Anticipated?

The answer is yes, but not from raw validation metrics alone. The figures on the right show modality dependence tracked per epoch during training. The key result is that dependence trajectories emerge early and stabilise well before convergence — but validation metrics can actively mislead.

▶ AVMNIST — Early Lock-in

In AVMNIST, image dependence appears within the first few epochs and stays consistently higher than audio throughout training. This happens even when both unimodal metrics are improving. The model is settling on visual features early — that trajectory is visible in the dependence score before any ablation is applied.

▶ MM-IMDb — Text Dominant

MM-IMDb starts with both modalities contributing comparably, then text quickly comes to dominate. There is some partial rebalancing later in training, but validation F1 alone does not reveal how much the full multimodal model still relies on text. The dependence score surfaces what accuracy hides.

▶ MOSEI — Deceptive Metrics

MOSEI is the clearest case of misleading validation. Audio-only and video-only can show moderate early performance, which might suggest those modalities are doing real work. But their dependence scores remain low throughout. Text dependence rises and stays high. The unimodal metrics make the model look more balanced than it actually is.

▶ What this shows

Modality dependence emerges as a stable trajectory early in training, well before convergence. Standard validation metrics miss this — a modality can look performant in isolation while being effectively redundant in the multimodal context.

▶ Why it matters

Dependence can be diagnosed during training. Detecting over-reliance early creates a window to intervene, through regularisation or rebalancing, before it becomes structural. It would also allow early stopping in research that attempts to balance the influence of each modality.

RQ1 synthesis

RQ1 — Modality Absence and Behavioural Degradation Synthesis

RQ1 synthesis

How and to what extent does model performance change when an entire modality is unavailable at inference time, and how reliably can this change be anticipated?

Resolved Sub-Questions

RQ1.1

How closely do MM-SHAP contributions align with accuracy drop, confidence change, and calibration shift once the modality is gone?

They do not align reliably: attribution under full input often diverges from functional dependence under removal.

RQ1.2

Does Gaussian noise masking improve, worsen, or leave unchanged the baseline drop observed with zero masking?

There is no consistent winner: zero and Gaussian masking can produce different, sometimes opposing, behaviours across datasets.

RQ1.3

Can validation-time indicators or MM-SHAP contributions predict a model's dependence on a modality before removal?

Yes, but only partially: dependence emerges early and remains stable, yet raw validation behaviour alone can be misleading.

Answer to RQ1 Multimodal models show substantial, asymmetric degradation when a modality is removed, and this behaviour cannot be reliably predicted by attribution tools like MM-SHAP, which conflates representational participation with functional dependence.

Failures are non-random and masking-strategy-dependent, yet traceable to stable behavioural trajectories established early in training.

RQ1 Synthesis

So what did we learn?

▶ RQ1.1

Attribution and functional dependence do not align reliably. MM-SHAP scores frequently diverge from the behavioural shifts observed when a modality is actually removed. Contribution captures participation, not necessity.

▶ RQ1.2

There is no consistent winner between zero and Gaussian masking. The two strategies can produce large, opposing effects depending on the dataset and architecture. The choice of how to represent absence directly shapes what robustness you measure.

But it is a really fun question to ask! What does it mean to represent nothing? And, funnily enough, what are the consequences?

▶ RQ1.3

Dependence can be anticipated, but only partially. It emerges as a stable trajectory early in training, yet raw validation behaviour alone can be misleading — a modality can appear important in isolation while being redundant in the multimodal context. What the model locks onto in those early epochs becomes the structural constraint on what any post-training intervention can achieve — a connection that becomes central in RQ2.

▶ Answer to RQ1

Taken together: multimodal models show substantial and asymmetric degradation when a modality is removed. That behaviour cannot be reliably predicted by MM-SHAP, which conflates representational participation with functional dependence. The failures are non-random and masking-strategy-dependent, but they are traceable — to stable behavioural trajectories that are established early in training.

Transition

Fragility is established.

Can we fix it? Yes, we can!

RQ2 — Post Hoc Modular Reconstruction

RQ2 — Post Hoc Modular Reconstruction RQ2

Research question

To what extent can a modular, lightweight, and post-training feature reconstruction approach mitigate performance degradation caused by fully missing modalities during multimodal model inference?

— Published in ACM Transactions on Intelligent Systems and Technology (2025)

Sub-questions

RQ2.1

How does varying the amount of available training data affect the reconstruction capabilities and inference performance of C-MAMs?

RQ2.2

To what extent does the complexity of the chosen loss function influence the quality of modality embedding reconstruction and subsequent inference performance?

RQ2.3

How does fine-tuning pre-trained modality-specific encoders during C-MAM training influence inference-time reconstruction performance?

RQ2.4

What is the relationship between the quality of reconstructed modality embeddings and the degree of performance recovery achieved during inference?

RQ2.5

How do inherent modality interactions affect the capability of C-MAMs to reconstruct missing modality embeddings?

Approach

Train modular C-MAMs independently; systematically vary training data availability · Minimum Training Requirements

Compare reconstruction objectives, including MSE and alternative losses · Impact of Loss Function on Reconstruction Quality

Evaluate frozen versus fine-tuned encoder configurations · The Role of Modality-Specific Encoders in C-MAMs

Relate embedding reconstruction quality to downstream performance recovery · Statistical Analysis of Reconstructed Embeddings

Analyse how modality interactions influence reconstruction success · Contrastive Information in Multimodal Models

RQ2 asks to what extent a modular, lightweight, post-training reconstruction approach can mitigate the performance degradation caused by missing modalities. This is published work — it appeared in ACM Transactions on Intelligent Systems and Technology in 2025. There are five sub-questions, each targeting a different practical constraint on the approach.

▶ RQ2.1 + Approach

The first asks how much training data is actually needed. C-MAMs are trained independently per modality mapping; the approach varies the fraction of available data systematically to find the minimum viable requirement.

▶ RQ2.2 + Approach

The second asks whether the choice of loss function matters. Twelve different reconstruction objectives are compared — from simple MSE to distributional losses like MMD — to see if complexity in the training objective produces better downstream performance.

▶ RQ2.3 + Approach

The third asks whether fine-tuning the underlying encoders during C-MAM training helps. The approach compares frozen encoders, randomly initialised encoders, and fine-tuned encoders.

▶ RQ2.4 + Approach

The fourth asks whether better embedding reconstruction leads to better performance recovery. The approach correlates geometric similarity metrics against downstream inference performance.

▶ RQ2.5 + Approach

The fifth asks how the relationship between modalities in the base model limits what reconstruction can achieve. The approach analyses contrastive information and cross-modal alignment to understand where recovery has a hard ceiling.

RQ2.1

RQ2 — Post Hoc Modular Reconstruction RQ2.1 Results

RQ2.1 — Training data availability

C-MAMs achieve strong recovery with as little as 10% of training data when inter-modality redundancy is high; weaker-structure reconstructions benefit from more data but remain viable with limited samples.

AVMNIST Data Efficient

Training on just 10% of the dataset yields performance within 2–5% of the full-data baseline; the base model's latent space provides a robust foundation for mapping associations with minimal supervision.

MM-IMDb Higher Variance

F1 scores within 10% of full-data baselines using only 10% of training samples, though using the weaker image modality as input introduces higher variance during early convergence.

MOSEI Modality Dependent

Text reconstruction remains robust even under severe data constraints, but the framework requires larger datasets to stabilise reconstruction for weaker modalities like audio and video.

AVMNIST audio-to-image performance vs training data fraction

MOSEI video-to-text reconstruction vs training data size

MOSEI video-to-audio reconstruction vs training data size

What this shows

C-MAMs are data-efficient. Training on just 10% of available data keeps performance within 2–5% of the full-data baseline.

The base model's latent space provides the foundation; the C-MAM only needs to learn the associations between existing embeddings.

Why it matters

Low data requirements make C-MAMs viable in resource-constrained and privacy-sensitive deployments where large centralised datasets are not available.

RQ2.1

C-MAMs need far less data than you might expect — because the base model has already done the hard work, and the C-MAM only needs to learn a mapping, not a new representation.
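The idea of learning a mapping between existing embeddings can be sketched with a toy example. This is not the C-MAM architecture from the thesis: a single linear map trained by plain SGD stands in for the network, the function names are hypothetical, and the "frozen encoder" embeddings are synthetic lists in which the image embedding is an exact linear function of the audio embedding.

```python
import random

def apply_map(W, x):
    """Apply the linear map W (rows = output dims) to the input embedding x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def train_linear_map(src_embs, tgt_embs, epochs=200, lr=0.05, seed=0):
    """Fit W so that W @ src approximates tgt by SGD on the MSE loss."""
    rng = random.Random(seed)
    d_in, d_out = len(src_embs[0]), len(tgt_embs[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    for _ in range(epochs):
        for x, y in zip(src_embs, tgt_embs):
            pred = apply_map(W, x)
            for i in range(d_out):
                err = pred[i] - y[i]
                for j in range(d_in):
                    # Gradient of the per-sample MSE with respect to W[i][j].
                    W[i][j] -= lr * 2 * err * x[j] / d_out
    return W

# Synthetic stand-ins for frozen-encoder outputs (high inter-modality redundancy).
data_rng = random.Random(1)
audio_embs = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
image_embs = [[a[0] + a[1], a[1] - a[2]] for a in audio_embs]

W = train_linear_map(audio_embs, image_embs)
reconstructed = apply_map(W, audio_embs[0])
recon_error = mse(reconstructed, image_embs[0])
```

The encoders never appear in the training loop: only the mapping's weights are updated, which is why the amount of data needed depends on how predictable one embedding is from the other.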

RQ2.1 — Training Data Availability

The answer is that C-MAMs are data-efficient. The figures on the right show performance and validation loss across data fractions for each dataset.

▶ AVMNIST — Data Efficient

In AVMNIST, training on just 10% of the data yields performance within 2–5% of the full-data baseline. The base model's latent space already encodes structured representations; the C-MAM only needs to learn the associations between them, and that mapping converges fast with very little data.

▶ MM-IMDb — Higher Variance

MM-IMDb achieves F1 scores within 10% of the full-data baseline at 10% training data, but shows higher variance during early convergence when the weaker image modality is used as input. The signal from image to text is noisier, so the C-MAM takes longer to stabilise.

▶ MOSEI — Modality Dependent

MOSEI shows a split. Text reconstruction is robust even under severe data constraints. But audio and video, as weaker modalities, need more data to stabilise. Recovery is possible across all three, but the minimum viable data requirement depends on the modality being reconstructed.

▶ What this shows

C-MAMs need very little data. Training on 10% keeps performance within 2–5% of the full-data baseline. The base model's latent structure does the heavy lifting; the C-MAM learns the translation.

▶ Why it matters

Low data requirements make C-MAMs viable where large centralised datasets are not available — resource-constrained deployments, privacy-sensitive settings, or situations where the original training data no longer exists. The data efficiency is itself a consequence of the same principle that runs through all of RQ2: the base model has already learned the representations; the C-MAM learns a translation, not a language.

RQ2.2

RQ2 — Post Hoc Modular Reconstruction RQ2.2 Results

RQ2.2 — Loss function complexity

Loss function complexity has marginal influence: all 12 objectives converge to similar recovery levels across datasets. The objective is not the bottleneck — the encoder representations are.

AVMNIST Loss Insensitive

Across twelve loss configurations, the narrow performance spread shows the classifier is largely insensitive to the specific reconstruction objective once basic representational alignment is achieved.

MM-IMDb Geometry Bottleneck

Pointwise and distributional losses (such as MMD) yield nearly identical predictive outcomes; reconstruction efficacy is bottlenecked by the base model's geometric structure, not the optimisation objective.

MOSEI Diminishing Returns

The choice of loss function has marginal influence on final classification behaviour; complex or moment-based losses offer no meaningful gains over simple MSE.

What this shows

Across twelve loss configurations, downstream performance is largely insensitive to the choice of loss function.

All reasonable formulations converge to similar levels, with only marginal gains from combining cosine similarity and MSE in certain tasks.

Why it matters

The base model's latent space geometry is the dominant constraint on reconstruction quality, not the optimisation objective.

Simple losses like MSE are sufficient; complexity in the loss function yields diminishing returns.

RQ2.2

The base model's representational geometry is the dominant constraint — once that's fixed, the specific loss objective has relatively little room to make a difference.

RQ2.2 — Loss Function Complexity

The answer is that loss function complexity has marginal influence. Twelve objectives were tested across all three datasets and the downstream performance spread is narrow — all of them converge to similar recovery levels.

▶ AVMNIST — Loss Insensitive

In AVMNIST, performance across twelve loss configurations barely varies. Once basic representational alignment is achieved, the classifier is largely insensitive to which specific reconstruction objective was used to get there.

▶ MM-IMDb — Geometry Bottleneck

MM-IMDb shows the same pattern. Pointwise losses and distributional losses like MMD produce nearly identical predictive outcomes. The binding constraint is the geometric structure of the base model's latent space, not the optimisation objective used to learn the mapping.

▶ MOSEI — Diminishing Returns

MOSEI confirms it. Complex or moment-based losses offer no meaningful gain over simple MSE. Adding sophistication to the objective does not improve the outcome.

▶ What this shows

Across twelve configurations, performance is largely insensitive to the loss function. All reasonable formulations converge to similar levels, with only minor gains in specific settings from combining cosine similarity and MSE.

▶ Why it matters

The base model's latent geometry is the dominant constraint on reconstruction quality. Simple losses like MSE are sufficient — complexity in the training objective yields diminishing returns and is not worth the engineering overhead.
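For reference, "combining cosine similarity and MSE" can be read as a weighted sum of a pointwise term and an angular term. The sketch below is a minimal illustration of that reading, not the thesis implementation: the function names and the weighting parameter `alpha` are assumptions, since the slides do not say how the two terms were weighted.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_distance(pred, target):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm

def combined_loss(pred, target, alpha=0.5):
    """Hypothetical weighted sum of a pointwise (MSE) and an angular (cosine) term."""
    return alpha * mse(pred, target) + (1.0 - alpha) * cosine_distance(pred, target)

target = [1.0, 0.0]
scaled = [2.0, 0.0]    # same direction, wrong magnitude: only MSE penalises it
rotated = [0.0, 1.0]   # same magnitude, orthogonal direction: both terms penalise it

print(combined_loss(scaled, target), combined_loss(rotated, target))
```

The two terms penalise different kinds of error, which is why a combination can help in some tasks even when neither term alone changes the outcome much.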

RQ2.3

RQ2 — Post Hoc Modular Reconstruction RQ2.3 Results

RQ2.3 — Encoder fine-tuning

Frozen encoders are the correct default: training from scratch degrades key conditions, fine-tuning is inconsistent, and only frozen encoders guarantee a stable representation space for C-MAMs to operate within.

MOSEI Frozen = Best

Frozen encoders reused from the base model provide the best balance of performance and efficiency. Retraining from scratch frequently degrades accuracy by disrupting the established alignment with the classification head.

MOSEI: per-condition performance (Has0 Accuracy / Has0 F1 Weighted)

Cond.	Random Init Δ Has0 Acc	Random Init Δ Has0 F1W	Fine-tuned Δ Has0 Acc	Fine-tuned Δ Has0 F1W
A	−0.0400	−0.0543	+0.0459	+0.0284
V	+0.0065	+0.0047	+0.0642	+0.0545
T	−0.0194	−0.0081	−0.0012	+0.0020
AV	−0.0015	−0.0018	−0.0089	−0.0027
AT	−0.0044	−0.0034	−0.0017	−0.0024
VT	+0.0035	+0.0042	+0.0011	+0.0002

Δ = difference vs. frozen encoder baseline. Has0 = missing-modality condition. Negative = worse than frozen.
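One quick way to read the per-condition table is to count how often each strategy ends up below the frozen baseline. The sketch below does that for the Has0 Accuracy column only, using the values from the table above; the dictionary layout and the helper name `count_worse` are illustrative choices, not part of the thesis code.

```python
# Δ Has0 Accuracy vs. the frozen baseline, copied from the MOSEI table above.
delta_acc = {
    "A":  {"random_init": -0.0400, "fine_tuned": +0.0459},
    "V":  {"random_init": +0.0065, "fine_tuned": +0.0642},
    "T":  {"random_init": -0.0194, "fine_tuned": -0.0012},
    "AV": {"random_init": -0.0015, "fine_tuned": -0.0089},
    "AT": {"random_init": -0.0044, "fine_tuned": -0.0017},
    "VT": {"random_init": +0.0035, "fine_tuned": +0.0011},
}

def count_worse(deltas, strategy):
    """Number of conditions where the strategy scores below the frozen encoder."""
    return sum(1 for row in deltas.values() if row[strategy] < 0)

worse_random = count_worse(delta_acc, "random_init")
worse_finetuned = count_worse(delta_acc, "fine_tuned")
print(worse_random, worse_finetuned)  # 4 3
```

Random initialisation is worse in four of six conditions and fine-tuning in three of six, which is the inconsistency the slide refers to.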

Mean Δ across all conditions (Has0 Accuracy & Has0 F1 Weighted)

Condition	Random Init	Fine-tuned
A	−0.0270	+0.0400
V	+0.0106	+0.0451
T	−0.0142	−0.0019
AV	+0.0105	+0.0059
AT	−0.0067	−0.0050
VT	+0.0046	+0.0017

Mean Δ across Has0 Accuracy and Has0 F1 Weighted. Negative = worse than frozen baseline.

What this shows

Frozen encoders match or outperform fine-tuned alternatives in most conditions (audio-only and video-only are the exceptions), at significantly lower cost.

Retraining from scratch actively degrades performance by disrupting alignment with the base model's classification head.

Why it matters

C-MAMs do not require encoder retraining to function. Task-relevant structure is already present; it only needs to be re-associated.

This preserves the lightweight, post-hoc modularity that makes the framework practical.

RQ2.3

Frozen encoders are the correct default: touching them disrupts the alignment with the classification head, and the task-relevant structure is already there to be re-associated, not re-learned.

RQ2.3 — Encoder Fine-Tuning

The answer is that frozen encoders are the correct default. The tables show per-condition and mean delta relative to the frozen baseline across MOSEI.

▶ MOSEI results appear

Frozen encoders reused from the base model give the best balance of performance and efficiency across most conditions. Retraining from scratch — random initialisation — actively degrades performance in several conditions: audio is down by 0.04 in Has0 accuracy, text is down by 0.019. These are not marginal differences. Disrupting the encoder breaks the alignment between encoder representations and the classification head that was established during base model training.

Fine-tuning is more mixed. For audio and video it can help, with mean deltas of +0.04 and +0.045 respectively. But for text — the dominant modality — fine-tuning is essentially neutral or slightly negative. And the inconsistency across conditions makes it unreliable as a general strategy.

▶ What this shows

Frozen encoders match or outperform fine-tuned alternatives at significantly lower compute cost. Retraining from scratch is actively harmful. The task-relevant structure is already in the representations — the C-MAM just needs to re-associate it.

▶ Why it matters

C-MAMs do not require encoder retraining. This preserves the lightweight, post-hoc, modular character of the framework. You don't touch the base model; you add a small module alongside it. That's what makes the approach practical.

RQ2.4

RQ2 — Post Hoc Modular Reconstruction RQ2.4 Results

RQ2.4 — Reconstruction quality vs performance recovery

Reconstruction geometry and performance recovery are decoupled: what determines recovery is inter-modal representational alignment, not geometric precision in embedding space.

AVMNIST Near-Baseline

Cosine 0.883. Most dimensions not significantly different. Recovery matches baseline without geometric precision.

MM-IMDb Semantic Alignment

Cosine 0.051, near-orthogonal, yet CRR reaches 1.08. Looks nothing like the original; the classifier cannot tell the difference.

MOSEI Utility vs Fidelity

Cosine 0.79, many significant dimensions, but Cohen's d near 0. Statistical significance is not practical significance.

Kinetics-Sounds Alignment Barrier

MSE 32.79, CRR only 6%. Cosine approx. -0.04. The ceiling is set by encoder alignment, not decoder quality.

AVMNIST Audio to Image statistical analysis

MM-IMDb Image to Text statistical analysis

MOSEI Audio to Text statistical analysis

Reconstruction Error — MAE / MSE

Model & C-MAM	MAE	MSE
KS — Audio → Video	3.82	32.79
KS — Video → Audio	0.69	1.32
MM-IMDb — Image → Text	0.30	0.16
MM-IMDb — Text → Image	0.29	0.14
UTT-Fusion — AT → Video	0.07	0.009
UTT-Fusion — Audio → Video	0.22	0.11
UTT-Fusion — Audio → Text	0.23	0.19
UTT-Fusion — VT → Audio	0.10	0.02

KS audio→video is the study maximum (MSE=32.79). High reconstruction error correlates with near-orthogonal encoder spaces, not decoder failure.

What this shows

Geometric fidelity and functional utility are decoupled.

Cosine similarity 0.051 (near-orthogonal) still yields near-baseline or above-baseline accuracy in MM-IMDb. High reconstruction error does not prevent meaningful performance gains where encoder spaces are compatible.

Why it matters

The goal of reconstruction is functional sufficiency, not geometric replication.

Evaluation criteria based on embedding similarity alone are the wrong measure of whether a reconstruction module is working.

RQ2.4 — Reconstruction Quality vs Performance Recovery

The central finding: geometric fidelity and functional utility are decoupled. A reconstruction can be near-orthogonal to the original in vector space and still preserve everything the downstream task needs.

The right panel is a carousel. Each card on the left advances to the corresponding statistical analysis figure. Kinetics-Sounds shows the MAE/MSE error table instead, since no comparable statistical analysis plot exists for that dataset.

▶ AVMNIST — Near-Baseline

The statistical analysis figure shows cosine similarity mean=0.883, median=0.924. Bonferroni-corrected t-tests show most embedding dimensions are NOT significantly different from the originals. Cohen's d near 0.004 — negligible. The reconstruction is geometrically faithful and the task recovers near-baseline performance. This is the straightforward end of the spectrum.

▶ MM-IMDb — Semantic Alignment

Image-to-Text is the most striking case. The statistical analysis shows mean cosine similarity of just 0.051 — the reconstructed embeddings are nearly orthogonal to the originals. Despite this, downstream classification closely matches the full-modality baseline. Effect sizes remain close to zero. A cosine loss ablation confirms that forcing better angular alignment does NOT improve F1. The C-MAM captures task-relevant structure without replicating geometric direction.

▶ MOSEI — Utility vs Fidelity

Audio-to-Text achieves cosine similarity of 0.79 — apparently strong alignment. The statistical analysis shows many dimensions flagged as statistically significant. But Cohen's d is near 0.006 — negligible. Statistical significance and practical significance are different things. Many dimensions deviate measurably but not meaningfully for the task. Performance recovery is strong regardless.
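The gap between statistical and practical significance can be illustrated with a small sketch. It assumes a two-sample Student t statistic with pooled variance and the usual pooled-standard-deviation form of Cohen's d; the exact test configuration used in the thesis may differ, and the data generator below is purely synthetic.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_var(a) + (nb - 1) * sample_var(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled)

def t_statistic(a, b):
    """Two-sample Student t statistic with pooled variance."""
    return cohens_d(a, b) / math.sqrt(1.0 / len(a) + 1.0 / len(b))

def make_dimension(n, shift):
    """Synthetic values for one embedding dimension: alternating 0/1 plus a shift."""
    return [(i % 2) + shift for i in range(n)]

for n in (100, 10_000, 100_000):
    original = make_dimension(n, 0.0)
    reconstructed = make_dimension(n, 0.01)
    d = cohens_d(reconstructed, original)
    t = t_statistic(reconstructed, original)
    print(n, round(d, 3), round(t, 2))
```

The effect size stays at roughly 0.02 for every sample size, while the t statistic grows with the square root of n. With enough samples, a negligible shift becomes statistically detectable without becoming meaningful for the task.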

▶ Kinetics-Sounds — Alignment Barrier

This is the failure case. The table shows Audio→Video MSE=32.79 — the study maximum — with CRR=0.061, meaning only 6% of lost performance is recovered. The modalities are near-orthogonal (cosine near -0.04) with KL divergence of 2.95. The encoder spaces are essentially incompatible. This is the ceiling imposed by encoder alignment — no decoder improvement will fix it.
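The reading of CRR used here, the share of lost performance that reconstruction wins back, can be written as a simple ratio. The sketch assumes CRR = (reconstructed − missing) / (full − missing); the thesis definition may differ in detail, and the accuracy values below are made up solely to produce ratios of the same size as those on the slide.

```python
def crr(full, missing, reconstructed):
    """Assumed recovery ratio: share of the full-vs-missing gap closed by reconstruction."""
    gap = full - missing
    if gap <= 0:
        raise ValueError("no performance was lost, so the ratio is undefined")
    return (reconstructed - missing) / gap

# Hypothetical accuracies, not thesis results.
low_recovery = crr(full=0.80, missing=0.30, reconstructed=0.33)
above_baseline = crr(full=0.60, missing=0.50, reconstructed=0.608)
print(round(low_recovery, 2), round(above_baseline, 2))
```

Under this reading, a value near 0 means reconstruction barely helps, 1 means the full-modality score is matched, and a value above 1 means the reconstructed input scores higher than the original.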

▶ Why it matters

On their own, cosine similarity and MSE are the wrong measures of reconstruction quality. What matters is whether the reconstructed embedding preserves the semantic subspace relevant to the task. That is only measurable through task-level evaluation.
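A toy example makes the decoupling concrete. It assumes a hypothetical linear classification head that reads only the projection of an embedding onto a weight vector `w`; the vectors are invented for illustration and do not come from any of the models in the thesis.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical linear head: the logit depends only on the projection onto w.
w = [1.0, 1.0]
original = [1.0, 0.0]
reconstructed = [0.0, 1.0]   # orthogonal to the original embedding

similarity = cosine(original, reconstructed)
logit_original = dot(w, original)
logit_reconstructed = dot(w, reconstructed)
print(similarity, logit_original, logit_reconstructed)  # 0.0 1.0 1.0
```

The two embeddings have zero cosine similarity, yet the head produces the same logit for both. Geometric distance in directions the classifier ignores has no effect on the prediction.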

RQ2.5

RQ2 — Post Hoc Modular Reconstruction RQ2.5 Results

RQ2.5 — Modality interactions and reconstruction ceiling

The reconstruction ceiling is set by how well the encoder spaces align — contrastive or orthogonal encoder representations impose a hard limit that no decoder design, loss function, or additional data can overcome.

AVMNIST Task Redundancy

High task redundancy and dataset co-occurrence allow significant performance recovery even though base model embeddings are geometrically contrastive — task redundancy compensates for geometric misalignment.

Kinetics-Sounds Alignment Bottleneck

Extreme contrastiveness and near-orthogonality of audio and video embeddings define a structural boundary making post-hoc reconstruction inherently difficult — a hard limit no C-MAM design can overcome.

MOSEI Bounded Recovery

Moderate cross-modal alignment facilitates robust reconstruction, yet the C-MAM's corrective capacity is strictly bounded by the semantic correspondences established during the base model's initial training.

Contrastive information analysis — representational alignment across datasets predicts recovery ceiling

Dataset	MI Reduction adding 2nd modality	Modality Dominance	PMI co-occurrence	Cosine Similarity	Sym. KL Divergence	C-MAM Recovery
MOSEI	Moderate	Text	Low (0.096)	Moderate 0.28–0.38	0.338	Moderate–High (text-bounded)
Kinetics-Sounds	Strong ↓ MI	Video	Very low (0.006)	Near-orthogonal −0.040	2.954	Low–Moderate (encoder bottleneck)
AVMNIST	Weak	Image	High (6.894)	Near-orthogonal 0.006	0.852	High (task redundancy)

Kinetics-Sounds: high KL + near-orthogonal cosine = encoders learned competing decision boundaries. AVMNIST: near-orthogonal cosine BUT high PMI — task redundancy compensates for geometric misalignment.

What this shows

Reconstruction capability is bounded by inter-modal alignment.

Semantically aligned modalities recover well; contrastive or orthogonal modalities, such as Kinetics-Sounds, are significantly harder to reconstruct.

Why it matters

Reconstructability is an intrinsic property of how the base model encoded its modalities, not of the reconstructor itself.

This makes the impact of a sensor failure predictable in advance, before it occurs.

RQ2.5

The reconstruction ceiling is set by the inter-modal alignment the base model established during training — not by C-MAM design or capacity, and not fixable post-hoc.

RQ2.5 — Modality Interactions and Reconstruction Ceiling

The answer is that the reconstruction ceiling is set by how well the encoder spaces align. The table shows contrastive information analysis across datasets — mutual information reduction, PMI co-occurrence, cosine similarity, and symmetric KL divergence — alongside C-MAM recovery outcomes.

▶ AVMNIST — Task Redundancy

AVMNIST has near-orthogonal cosine similarity between audio and image embeddings — geometrically they are not aligned. But the PMI co-occurrence is very high at 6.894, meaning the two modalities are highly correlated at the task level. That task redundancy compensates for the geometric misalignment, and C-MAM achieves high recovery as a result.

▶ Kinetics-Sounds — Alignment Bottleneck

Kinetics-Sounds is the hard case. Near-orthogonal cosine similarity at −0.040, symmetric KL divergence of 2.954 — the audio and video encoders learned competing decision boundaries. They don't share meaningful structure. That defines a structural boundary that no C-MAM design, no loss function, and no additional data can overcome. It is a property of the base model, not of the reconstruction module.

▶ MOSEI — Bounded Recovery

MOSEI sits between these extremes. Moderate cross-modal alignment supports reasonable recovery, but the C-MAM's ceiling is strictly bounded by the semantic correspondences the base model established during training.

▶ What this shows

Reconstruction capability is bounded by inter-modal alignment in the base model. Semantically aligned modalities recover well; contrastive or weakly aligned ones hit a hard limit.

▶ Why it matters

Reconstructability is an intrinsic property of how the base model encoded its modalities, not of the reconstructor itself. This makes failure cases predictable in advance — before deployment, before a sensor ever fails.
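As a rough illustration of the symmetric KL column, the sketch below compares two pairs of discrete distributions. It assumes the symmetric form KL(p‖q) + KL(q‖p); some works use half of that sum, and how the distributions were obtained from the encoders is not stated on the slide. The probability vectors are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Assumed symmetric form: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

# Hypothetical class-probability vectors from two unimodal branches.
agreeing = ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
competing = ([0.9, 0.05, 0.05], [0.05, 0.05, 0.9])

low = symmetric_kl(*agreeing)
high = symmetric_kl(*competing)
print(round(low, 3), round(high, 3))
```

Branches that broadly agree give a value close to zero, while branches that place their mass on different classes give a value well above 1, which is the kind of separation the Kinetics-Sounds row reflects.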

RQ2 synthesis

RQ2 — Post Hoc Modular Reconstruction Synthesis

RQ2 synthesis

To what extent can a modular, lightweight, and post-training feature reconstruction approach mitigate performance degradation caused by fully missing modalities during multimodal model inference?

Resolved Sub-Questions

RQ2.1

How does varying training data availability affect C-MAMs' reconstruction and inference performance?

C-MAMs achieve substantial recovery even with limited data; gains plateau early when inter-modality structure is strong.

RQ2.2

How does loss function complexity influence reconstruction quality and inference performance?

Simple MSE-based objectives are competitive; more complex losses do not consistently improve downstream performance.

RQ2.3

How does fine-tuning pre-trained encoders during C-MAM training affect reconstruction?

Fine-tuning does not reliably improve reconstruction; frozen encoders achieve comparable recovery in most settings.

RQ2.4

What is the relationship between embedding reconstruction quality and performance recovery?

Performance recovery does not scale linearly with geometric reconstruction quality; high error can coexist with substantial inference improvement.

RQ2.5

How do inherent modality interactions affect C-MAMs' reconstruction capability?

Recovery correlates with inter-modality structure; contrastive or weakly aligned modalities constrain the reconstruction ceiling.

Answer to RQ2

Post-training feature reconstruction via C-MAMs can recover between 50% and over 90% of the performance lost to missing modalities, requiring minimal data and no architectural changes.

Efficacy is bounded by the base model's latent structure, succeeding when modalities are semantically aligned but struggling when they are contrastive or orthogonal.

RQ2 Synthesis

▶ RQ2.1

C-MAMs achieve substantial recovery even with limited data. Gains plateau early when inter-modality structure is strong, because the latent space already encodes the associations the C-MAM needs to learn.

▶ RQ2.2

Simple MSE-based objectives are competitive with every alternative tested. More complex losses do not consistently improve downstream performance. The objective is not the bottleneck.

▶ RQ2.3

Frozen encoders achieve comparable recovery to fine-tuned alternatives in most settings, at far lower cost. Fine-tuning is not reliably beneficial and retraining from scratch is actively harmful.

▶ RQ2.4

Performance recovery does not scale with geometric reconstruction quality. High embedding error can coexist with substantial inference improvement. Functional sufficiency is what matters, not geometric precision.

▶ RQ2.5

Recovery correlates with inter-modality structure in the base model. Contrastive or weakly aligned modalities impose a hard ceiling that no reconstructor can overcome.

▶ Answer to RQ2

Post-training feature reconstruction via C-MAMs can recover between 50% and over 90% of the performance lost to missing modalities, requiring minimal data and no changes to the underlying architecture. Efficacy is bounded by the base model's latent structure: it succeeds when modalities are semantically aligned and struggles when they are contrastive or orthogonal. That ceiling is a property of how the model was trained, not of the reconstruction approach.

Transition

Recovery is possible.

But is everything behaving correctly?

RQ2 answered the question of whether C-MAMs can recover missing modalities. They can, substantially, with minimal data, without touching the base model. But performance recovery measures the output. It does not tell us whether the model is doing the right thing for the right reasons.

A model that recovers accuracy might still be overconfident. It might handle majority classes well and fail silently on the minority ones. RQ3 moves below the surface. It asks whether the behaviour restored by reconstruction is actually faithful: calibrated, stable across classes, and honest in its confidence estimates. This is also the point to bring in other reconstruction methods, so that C-MAMs can be compared against them on these deeper questions of behaviour. The shift in dataset focus to MSP-IMPROV and IEMOCAP is intentional: these datasets have the class imbalance and bimodal structure needed to test calibration and minority-class reliability, properties that AVMNIST and MM-IMDb are not well suited to probe.

RQ3 — Behavioural Fidelity of Reconstruction

RQ3 — Behavioural Fidelity of Reconstruction RQ3

Research question

How do reconstructed modality embeddings affect the decision behaviour of multimodal models relative to missing-modality baselines, and how do different reconstruction methods compare in terms of information recovery, calibration behaviour, and class-conditional predictive structure?

— Under Review with TBD

Sub-questions

RQ3.1

To what extent do reconstructed modality embeddings recover the information that is theoretically recoverable from the available inputs?

RQ3.2

How does reconstruction model complexity affect the generalisability of recovered predictive behaviour under missing-modality inference?

RQ3.3

Do reconstructed embeddings induce class-specific behavioural biases, and how do these manifest across reconstruction methods in confidence and calibration?

Approach

Compare reconstructed and ground-truth embeddings using geometric similarity and information recovery metrics (H1)

Analyse generalisation across reconstruction methods and model capacity; evaluate predictive divergence (H2.1 · H2.2)

Assess class-conditional confidence and calibration behaviour; quantify behavioural variance relative to full-modality baselines (H3 · H4 · H5)

RQ3 asks how reconstructed embeddings affect model decision behaviour, and how different reconstruction methods compare in information recovery, calibration, and class-conditional predictive structure. Where RQ2 asked whether C-MAMs can recover performance, RQ3 asks whether the behaviour they restore is the right behaviour — whether the model's decisions are not just correct but well-calibrated and reliable.

▶ RQ3.1 + Approach

The first sub-question asks how much of the theoretically recoverable signal a reconstruction method actually captures. The approach compares reconstructed embeddings against the conditional variance bound — the maximum predictability given the available inputs — using geometric and information-theoretic metrics.

▶ RQ3.2 + Approach

The second asks whether more complex reconstruction models generalise better. The approach evaluates predictive divergence across reconstruction methods and model capacities, focusing on minority-class generalisation under modality absence.

▶ RQ3.3 + Approach

The third asks whether reconstruction methods induce class-specific behavioural biases — in confidence and calibration. The approach assesses class-conditional reliability using reliability diagrams and ECE across C-MAMs, MMIN, and RedCore.

RQ3.1

RQ3 — Behavioural Fidelity of Reconstruction RQ3.1 Results

RQ3.1 — Information recoverability

Lightweight C-MAM decoders more frequently match conditional variance bounds, recovering the full predictable cross-modal signal, while higher-capacity models often fall short, confirming that reconstruction fidelity is constrained by encoder representations rather than decoder capacity.

CMU-MOSEI Partial Saturation

Only a widened C-MAM for the audio-to-text mapping satisfies the conditional variance bound, while MMIN and RedCore fail to reach saturation across all tested configurations.

MSP-IMPROV Bimodal Saturation

C-MAM successfully recovers the information-theoretic limit in two bimodal mappings, whereas MMIN and RedCore's reconstruction errors remain significantly above the conditional variance threshold.

IEMOCAP 5/6 Saturated

C-MAM demonstrates the strongest information recovery, saturating five out of six mappings, while MMIN and RedCore succeed in only two and one cases, respectively.

Mapping	Model	RMSE	√Var(Y|X)	Δ	H1
A→T ★	C-MAM	0.588	0.154	+0.434	No
	MMIN	0.171	0.095	+0.076	No
	RedCore	0.816	0.662	+0.154	No
V→T	C-MAM	0.262	0.155	+0.107	No
	MMIN	0.144	0.052	+0.092	No
	RedCore	0.884	0.722	+0.162	No
AT→V ★	C-MAM	0.213	0.161	+0.052	No
	MMIN	0.419	0.167	+0.252	No
	RedCore	0.813	0.638	+0.175	No

★ A→T: C-MAM has the largest gap to the bound (Δ = +0.434) yet the best behavioural recall (0.762); MMIN and RedCore sit closer to the bound but behave far worse. AT→V: C-MAM is nearest to the bound (Δ = +0.052). AV→T, T→V, VT→A omitted; all show No support.

Mapping	Model	RMSE	√Var(Y|X)	Δ	H1
A→V	C-MAM	0.126	0.071	+0.056	No
	MMIN	0.419	0.167	+0.252	No
	RedCore	0.765	0.700	+0.064	No
AT→V ★	C-MAM	0.115	0.071	+0.045	Yes
	MMIN	0.144	0.052	+0.092	No
	RedCore	0.813	0.638	+0.175	No
TV→A ★	C-MAM	0.173	0.128	+0.045	Yes
	MMIN	0.209	0.121	+0.087	No
	RedCore	0.840	0.674	+0.166	No

★ C-MAM satisfies H1 in 2/6 mappings (AT→V, TV→A; Δ = +0.045 each). MMIN and RedCore fail all six. Remaining mappings (V→A, AV→T, T→A) all show No support.

Mapping	Model	RMSE	√Var(Y|X)	Δ	H1
A→T	C-MAM	0.168	0.105	+0.062	No
	MMIN	0.184	0.097	+0.087	No
	RedCore	0.821	0.742	+0.079	No
V→A ★	C-MAM	0.139	0.116	+0.024	Yes
	MMIN	0.201	0.122	+0.079	No
	RedCore	0.824	0.782	+0.042	Yes
TV→A ★	C-MAM	0.137	0.116	+0.021	Yes
	MMIN	0.158	0.117	+0.040	Yes
	RedCore	0.887	0.753	+0.134	No

★ C-MAM supports H1 in 5/6 mappings (only A→T fails, Δ = +0.062). MMIN supports in 2 (T→V, TV→A). RedCore supports in 1 (V→A). Strongest H1 result across all three datasets.

What this shows

C-MAMs reach the conditional variance bound, the information-theoretic limit on reconstruction quality, more often than higher-capacity models such as MMIN and RedCore.

Decoder complexity is not the constraint.

Why it matters

When a lightweight decoder saturates the recoverable signal, further architectural scaling is futile.

Poor reconstruction performance points to the encoder, not the decoder.

RQ3.1 — Information Recoverability

The answer is that lightweight C-MAM decoders more frequently reach the conditional variance bound — the information-theoretic limit on what is recoverable — while higher-capacity models like MMIN and RedCore frequently do not. The tables show RMSE against the conditional variance floor for each modality mapping across three datasets.
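The Δ column in these tables is the reconstruction RMSE minus the conditional standard deviation √Var(Y|X). The sketch below recomputes it for a few rows taken from the tables. The tolerance of 0.05 is a hypothetical threshold, chosen only because it is consistent with the Yes/No flags shown; the thesis may use a different criterion for H1.

```python
# (dataset, mapping, model, RMSE, sqrt(Var(Y|X))) rows copied from the slide tables.
rows = [
    ("CMU-MOSEI", "A→T", "C-MAM", 0.588, 0.154),
    ("CMU-MOSEI", "A→T", "MMIN", 0.171, 0.095),
    ("CMU-MOSEI", "A→T", "RedCore", 0.816, 0.662),
    ("CMU-MOSEI", "AT→V", "C-MAM", 0.213, 0.161),
    ("IEMOCAP", "TV→A", "C-MAM", 0.137, 0.116),
]

TOLERANCE = 0.05  # hypothetical threshold, not stated in the source

def gap_to_bound(rmse, cond_std):
    """Excess error above the conditional-variance floor (the Δ column)."""
    return rmse - cond_std

def saturates(rmse, cond_std, tol=TOLERANCE):
    """True when the decoder's error is within `tol` of the floor."""
    return gap_to_bound(rmse, cond_std) <= tol

for dataset, mapping, model, rmse, floor in rows:
    print(dataset, mapping, model, f"{gap_to_bound(rmse, floor):+.3f}", saturates(rmse, floor))
```

A decoder whose error sits at the floor has nothing left to recover from the available inputs, so a large Δ signals unexploited predictable structure rather than a limit of the data.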

▶ CMU-MOSEI — Partial Saturation

MOSEI is the hardest case. No standard C-MAM configuration satisfies the bound. Only a widened C-MAM on the audio-to-text mapping gets close. MMIN and RedCore fail across all tested configurations — and notably, the A→T mapping has MMIN closer to the bound geometrically while C-MAM is further away, yet C-MAM achieves far better behavioural recall. RMSE ranking and behavioural quality are not the same thing.

▶ MSP-IMPROV — Bimodal Saturation

MSP-IMPROV shows C-MAM satisfying the bound in two of the six mappings, both with bimodal inputs: AT-to-video and TV-to-audio. MMIN and RedCore fail all six. Where the C-MAM reaches saturation, it has extracted the full predictable cross-modal signal available. There is nothing more to recover given the encoder representations.

▶ IEMOCAP — 5/6 Saturated

IEMOCAP is the strongest result. C-MAM satisfies the bound in five out of six mappings. MMIN satisfies two; RedCore satisfies one. The pattern across all three datasets is consistent: decoder capacity does not predict saturation — the C-MAM's simplicity is not a limitation here.

▶ What this shows

C-MAMs reach the information-theoretic limit more often than MMIN and RedCore. Decoder complexity is not the constraint on how much signal can be recovered.

▶ Why it matters

When a lightweight decoder saturates the recoverable signal, further architectural scaling is futile. Poor reconstruction performance points to the encoder, not the decoder. The base model never learned representations that support better recovery. This is the information-theoretic confirmation of what RQ2.5 established from the contrastive analysis — the ceiling was always in the encoder, and no reconstruction architecture changes that.

RQ3.2

RQ3 — Behavioural Fidelity of Reconstruction RQ3.2 Results

RQ3.2 — Reconstruction model complexity

Increased reconstruction complexity undermines rather than improves minority-class generalisation. Lightweight C-MAMs achieve at least 95% behavioural parity with MMIN in most settings on CMU-MOSEI and MSP-IMPROV (H2.1), while RedCore, the highest-capacity model, underperforms MMIN on minority-class recall in every mapping on MSP-IMPROV and IEMOCAP and in three of six on CMU-MOSEI (H2.2).

CMU-MOSEI

H2.1: 4/6 H2.2: 3/6

C-MAM meets the 95% parity threshold in four of six mappings (H2.1). RedCore underperforms MMIN on neutral recall in three mappings, with near-zero recall when text is reconstructed from a single modality (H2.2).

MSP-IMPROV

H2.1: 5/6 H2.2: 6/6

C-MAM meets the 95% behavioural parity threshold in five of six mappings (H2.1). RedCore underperforms MMIN on neutral recall in all six mappings (H2.2 fully supported).

IEMOCAP

H2.1: 1/6 H2.2: 6/6

C-MAM meets the 95% parity threshold in only one mapping on this dataset (H2.1: 1/6). RedCore underperforms MMIN on neutral recall in all six mappings (H2.2 fully supported).

Mapping	C-MAM Neutral Recall	MMIN Neutral Recall	RedCore Neutral Recall	C-MAM Bal. F1	MMIN Bal. F1	RedCore Bal. F1	H2.1	H2.2
A→T	0.762	0.114	0.000	0.365	0.382	0.334	Yes	Yes
V→T	0.743	0.072	0.008	0.376	0.395	0.377	Yes	Yes
AT→V	0.378	0.382	0.450	0.615	0.622	0.601	No	No
T→V	0.434	0.327	0.446	0.619	0.616	0.600	Yes	No
AV→T	0.298	0.140	0.025	0.381	0.410	0.384	No	Yes
TV→A	0.233	0.321	0.397	0.586	0.617	0.596	Yes	No

H2.1: C-MAM ≥95% parity in 4/6 mappings. H2.2: RedCore recall < MMIN recall in 3/6 (A→T, V→T, AV→T); RedCore collapses to near-zero when text is reconstructed from a single modality (A→T, V→T).
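The H2.2 flags can be recomputed directly from the neutral-recall columns, reading the hypothesis as "RedCore recall is lower than MMIN recall". The sketch below does this for the CMU-MOSEI table and also shows what per-class recall measures; the helper `recall_for_class` and its toy labels are illustrative, not taken from the thesis code.

```python
def recall_for_class(y_true, y_pred, cls):
    """Fraction of samples whose true label is `cls` that were predicted as `cls`."""
    support = sum(1 for t in y_true if t == cls)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / support if support else 0.0

# Neutral-class recall for MMIN and RedCore, copied from the CMU-MOSEI table above.
neutral_recall = {
    "A→T":  {"MMIN": 0.114, "RedCore": 0.000},
    "V→T":  {"MMIN": 0.072, "RedCore": 0.008},
    "AT→V": {"MMIN": 0.382, "RedCore": 0.450},
    "T→V":  {"MMIN": 0.327, "RedCore": 0.446},
    "AV→T": {"MMIN": 0.140, "RedCore": 0.025},
    "TV→A": {"MMIN": 0.321, "RedCore": 0.397},
}

# H2.2 is supported for a mapping when RedCore recalls fewer neutral samples than MMIN.
supported = [m for m, r in neutral_recall.items() if r["RedCore"] < r["MMIN"]]
print(supported)  # ['A→T', 'V→T', 'AV→T']

# Toy labels: one of two neutral samples is recovered.
toy = recall_for_class(["neutral", "neutral", "pos"], ["neutral", "pos", "pos"], "neutral")
```

All three supported mappings are the ones in which text is the reconstructed modality.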

Mapping	C-MAM Neutral Recall	MMIN Neutral Recall	RedCore Neutral Recall	C-MAM Bal. F1	MMIN Bal. F1	RedCore Bal. F1	H2.1	H2.2
A→V	0.477	0.330	0.091	0.411	0.419	0.363	Yes	Yes
AT→V	0.610	0.588	0.441	0.603	0.580	0.528	Yes	Yes
V→A	0.116	0.047	0.009	0.461	0.476	0.469	Yes	Yes
AV→T	0.429	0.408	0.242	0.583	0.558	0.518	Yes	Yes
T→A	0.422	0.432	0.378	0.525	0.527	0.498	Yes	Yes
TV→A	0.467	0.541	0.438	0.647	0.558	0.506	No	Yes

H2.1: C-MAM ≥95% parity in 5/6 mappings (TV→A only failure: recall 0.467 vs. MMIN 0.541). H2.2: RedCore underperforms MMIN on neutral recall in all 6/6 mappings.

Mapping	C-MAM Neutral Recall	MMIN Neutral Recall	RedCore Neutral Recall	C-MAM Bal. F1	MMIN Bal. F1	RedCore Bal. F1	H2.1	H2.2
A→V	0.270	0.457	0.419	0.448	0.550	0.534	No	Yes
AT→V	0.595	0.640	0.599	0.718	0.727	0.671	No	Yes
V→A	0.336	0.490	0.380	0.429	0.470	0.472	No	Yes
AV→T	0.452	0.489	0.427	0.629	0.640	0.625	No	Yes
T→A	0.597	0.645	0.599	0.561	0.640	0.640	No	Yes
TV→A	0.679	0.544	0.512	0.667	0.647	0.626	Yes	Yes

H2.1: C-MAM meets parity threshold in 1/6 mappings (TV→A only). H2.2: RedCore underperforms MMIN on neutral recall in all 6/6 mappings.

What this shows

H2.1 supported on MOSEI and MSP-IMPROV but not on IEMOCAP: C-MAMs reach 95% parity with MMIN on minority-class recall and balanced F1 in 4/6, 5/6, and 1/6 mappings respectively.

H2.2 confirmed: RedCore systematically underperforms MMIN on neutral recall, fully across MSP-IMPROV and IEMOCAP and partially on MOSEI.

Why it matters

Complexity introduces overfitting to training-time modality patterns, increasing brittleness under genuine modality absence.

For safety-critical systems, minimal and modular decoders offer more stable and predictable generalisation.

RQ3.2 — Reconstruction Model Complexity

The answer is that increased reconstruction complexity undermines rather than improves minority-class generalisation. The tables show neutral-class recall and balanced F1 for C-MAM, MMIN, and RedCore across each modality mapping and dataset.

▶ CMU-MOSEI — H2.1: 4/6, H2.2: 3/6

In MOSEI, C-MAM meets the 95% behavioural parity threshold with MMIN in four of six mappings. The striking result is in the text-reconstruction conditions — A-to-text and V-to-text — where C-MAM achieves neutral recall of 0.762 and 0.743 respectively, while MMIN sits at 0.114 and 0.072, and RedCore collapses to effectively zero. RedCore, the highest-capacity model, produces the worst minority-class recall on the conditions where text is being reconstructed.

▶ MSP-IMPROV — H2.1: 5/6, H2.2: 6/6

MSP-IMPROV is even cleaner. C-MAM meets parity in five of six mappings; TV-to-audio is the only exception, where its neutral recall trails MMIN (0.467 vs. 0.541). RedCore underperforms MMIN on neutral recall in all six, fully supporting the hypothesis that complexity hurts minority-class generalisation.

▶ IEMOCAP — H2.1: 1/6, H2.2: 6/6

IEMOCAP is the mixed case. C-MAM only meets parity in one mapping — TV-to-audio, where it leads. In the others, MMIN is stronger. But RedCore underperforms MMIN on neutral recall in all six mappings without exception. The second hypothesis is fully confirmed; the first is dataset-dependent.

▶ What this shows

C-MAMs match or exceed MMIN on minority-class recall in most settings. RedCore consistently underperforms MMIN across MSP-IMPROV and IEMOCAP, and partially on MOSEI.

▶ Why it matters

Complexity introduces overfitting to training-time modality patterns, making the model more brittle under genuine modality absence. Lightweight modular decoders generalise more stably — which matters especially in safety-critical systems where minority-class errors have consequences.

RQ3.3

RQ3 — Behavioural Fidelity of Reconstruction RQ3.3 Results

RQ3.3 — Class-specific behavioural biases

C-MAMs provide superior geometric fidelity and calibration compared to the overconfident behaviour of end-to-end baselines, which frequently achieve accuracy through distorted embeddings. This confirms that predictive accuracy and representational faithfulness are decoupled properties, not jointly guaranteed by model complexity.

CMU-MOSEI Fidelity Leader

C-MAM dominates in both geometric fidelity and calibration, whereas RedCore induces substantial output distribution shifts and severe overconfidence.

MSP-IMPROV Mixed Calibration

While C-MAM maintains geometric precision, MMIN and RedCore show better behavioural stability; RedCore surprisingly provides better calibration, likely due to its overparameterisation smoothing noisy signals.

IEMOCAP Best Calibration

C-MAM achieves the highest joint geometric-behavioural fidelity and decisive calibration advantages, with predictions closely tracing the diagonal while baselines deviate toward overconfidence.

MSP-IMPROV AT→V reliability diagram class 0

MSP-IMPROV AT→V reliability diagram class 1

IEMOCAP AT→V reliability diagram class 0

IEMOCAP AT→V reliability diagram class 1

What this shows

No reconstruction method eliminates class drift, but C-MAMs provide the highest geometric fidelity and, critically, the most reliable calibration.

RedCore and MMIN exhibit overconfidence, maintaining high confidence as correctness collapses.

Why it matters

Accuracy alone is not a sufficient measure of reconstruction quality.

Miscalibrated confidence under modality absence is a practical liability. C-MAMs ensure confidence remains an honest reflection of correctness.

RQ3.3 — Class-Specific Behavioural Biases

The answer is that C-MAMs provide better geometric fidelity and calibration than end-to-end baselines. The figures on the right show reliability diagrams — class-conditional confidence versus actual accuracy. A well-calibrated model tracks the diagonal; overconfident models sit above it.
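For reference, the ECE figure that accompanies a reliability diagram can be computed as below. This is a minimal sketch of the standard equal-width-bin formulation; the bin count and the toy predictions are assumptions, and restricting the inputs to one class's samples gives a class-conditional variant like the one described here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins: weighted mean of |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Hypothetical predictions: both models are right half the time.
honest = expected_calibration_error([0.5] * 4, [1, 0, 1, 0])
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 1, 0])
print(round(honest, 2), round(overconfident, 2))
```

Both toy models have the same accuracy, but the one reporting 95% confidence carries a large calibration error, which is the failure mode that accuracy alone hides.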

▶ CMU-MOSEI — Fidelity Leader

In MOSEI, C-MAM dominates in both geometric fidelity and calibration. RedCore induces substantial output distribution shifts and severe overconfidence — it maintains high confidence as accuracy collapses under modality absence. MMIN is better than RedCore but still drifts from the diagonal.

▶ MSP-IMPROV — Mixed Calibration

MSP-IMPROV is more mixed. C-MAM maintains geometric precision, but MMIN and RedCore show better behavioural stability in some configurations. RedCore surprisingly provides better calibration in this dataset, likely due to overparameterisation smoothing out noisy reconstruction signals rather than genuinely improving fidelity. It is not a principled advantage.

▶ IEMOCAP — Best Calibration

IEMOCAP is C-MAM's strongest calibration result. Predictions closely trace the diagonal while baselines deviate toward overconfidence. The calibration advantage is decisive and consistent across class conditions.

▶ What this shows

No reconstruction method eliminates class drift entirely, but C-MAMs provide the highest geometric fidelity and the most reliable calibration. RedCore and MMIN maintain high confidence as correctness collapses — a structural overconfidence problem.

▶ Why it matters

Accuracy alone is not sufficient. A model that is highly confident while being wrong is a practical liability. C-MAMs ensure confidence remains an honest reflection of correctness, which is what you need when a modality goes missing in a deployed system.

RQ3 synthesis

RQ3 — Behavioural Fidelity of Reconstruction Synthesis

Resolved Sub-Questions

RQ3.1

To what extent do reconstructed embeddings recover theoretically recoverable information?

Reconstructed embeddings substantially recover task-relevant information, but geometric similarity does not reliably predict how much behaviour is restored.

RQ3.2

How does reconstruction model complexity affect the generalisability of recovered predictive behaviour?

Lightweight decoders often outperform higher-capacity alternatives in behavioural stability; capacity does not guarantee fidelity.

RQ3.3

Do reconstructed embeddings induce class-specific behavioural biases across methods?

Yes: calibration benefits are clearest in bimodal reconstruction; unimodal conditions expose class-specific failures not visible in aggregate metrics.

Answer to RQ3

Reconstruction architecture shapes decision behaviour beyond accuracy, with lightweight modular decoders like C-MAMs outperforming complex end-to-end alternatives in geometric fidelity and confidence calibration. Behavioural fidelity is not a function of model capacity but of encoder-decoder alignment; inductive simplicity generalises better and avoids the overfitting that plagues overparameterised approaches.

RQ3 Synthesis

▶ RQ3.1

Reconstructed embeddings substantially recover task-relevant information, but geometric similarity does not reliably predict how much behaviour is restored. Lightweight C-MAMs saturate the information-theoretic bound more frequently than higher-capacity alternatives.

▶ RQ3.2

Lightweight decoders often outperform higher-capacity alternatives in behavioural stability. Capacity does not guarantee fidelity: RedCore provides the strongest evidence here, underperforming MMIN on minority-class recall across MSP-IMPROV and IEMOCAP in every mapping tested.

▶ RQ3.3

Calibration benefits are clearest in bimodal reconstruction. Unimodal conditions expose class-specific failures not visible in aggregate metrics. C-MAMs produce the most honest confidence estimates; baselines are structurally overconfident. The one exception, RedCore's apparent calibration gain on MSP-IMPROV, is likely an artefact of overparameterisation smoothing rather than genuine fidelity, which reinforces rather than undermines the finding.

▶ Answer to RQ3

Reconstruction architecture shapes decision behaviour beyond accuracy. Lightweight modular decoders outperform complex end-to-end alternatives in geometric fidelity and confidence calibration. Behavioural fidelity is not a function of model capacity — it is a function of encoder-decoder alignment. Inductive simplicity generalises better and avoids the overfitting that causes overparameterised approaches to fail on minority classes and miscalibrate their confidence.

Transition

Centralised robustness holds.

Does it survive federated deployment?

RQ4 — Federated and Decentralised Reconstruction

RQ4 — Federated and Decentralised Reconstruction RQ4

Research question

Can modular reconstruction methods be adapted effectively for robust multimodal learning within incongruent federated systems with heterogeneous modality availability and local data access constraints?

Sub-questions

RQ4.1

Do modular C-MAMs improve global and client-level predictive performance when clients observe disjoint subsets of modalities?

RQ4.2

What are the communication and compute costs of deploying modular C-MAMs in federated multimodal systems?

Approach

Simulate federated settings with heterogeneous modality availability; train and distribute C-MAMs alongside a shared global model; evaluate global and client-level performance under modality incongruence (H1)

Measure communication and computation costs of modular reconstruction; analyse behavioural variance and investigate selective aggregation effects (H2)

RQ4 asks whether modular reconstruction can be extended to federated settings — specifically, systems where clients have heterogeneous modality availability and cannot share raw data. The question is whether the modularity that makes C-MAMs practical in centralised settings also works in a decentralised environment.

▶ RQ4.1 + Approach

The first sub-question asks whether FedC-MAMs improve performance for clients that can only observe a subset of the available modalities. The approach simulates federated settings with heterogeneous modality availability, trains and distributes C-MAMs alongside a shared global model, and evaluates both global and client-level performance.

▶ RQ4.2 + Approach

The second asks what the communication and compute costs actually are. The approach measures cumulative communication volume and energy cost across training rounds, comparing FedC-MAMs against MMIN and RedCore, and analyses how selective aggregation — only sharing modules relevant to a client's modality set — affects both cost and variance.
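As a rough sketch of what selective aggregation could look like, the example below averages each module only over the clients that actually submitted it, weighting by local sample count in the style of FedAvg. The function name, the module keys such as "A->T", and the flattened-list representation of weights are assumptions made for illustration; the thesis implementation may differ.

```python
from typing import Dict, List, Tuple

Weights = List[float]                          # flattened parameters of one module
ClientUpdate = Tuple[int, Dict[str, Weights]]  # (local sample count, modules trained locally)


def aggregate_selective(updates: List[ClientUpdate]) -> Dict[str, Weights]:
    """Sample-weighted average per module, over contributing clients only."""
    sums: Dict[str, Weights] = {}
    counts: Dict[str, int] = {}
    for n_samples, modules in updates:
        for name, weights in modules.items():
            if name not in sums:
                sums[name] = [0.0] * len(weights)
                counts[name] = 0
            sums[name] = [s + n_samples * w for s, w in zip(sums[name], weights)]
            counts[name] += n_samples
    return {name: [s / counts[name] for s in sums[name]] for name in sums}


# Hypothetical round: one client holds a single module, the other holds two.
round_updates = [
    (10, {"A->T": [1.0, 2.0]}),
    (30, {"A->T": [3.0, 4.0], "V->T": [5.0]}),
]
global_modules = aggregate_selective(round_updates)
print(global_modules)
```

A client that never trains a given module neither uploads it nor dilutes its average, which is the property that ties communication cost to the client's modality set.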

RQ4.1

RQ4 — Federated and Decentralised Reconstruction RQ4.1 Results

RQ4.1 — C-MAMs in federated settings

FedC-MAMs restore substantial performance for modality-limited clients without destabilising global optimisation, but corrective benefit is strictly conditional on meaningful baseline degradation. Text-dominant configurations exhibit a ceiling effect where reconstruction offers negligible or detrimental returns.

CMU-MOSEI Substantial Recovery

FedC-MAMs substantially restored accuracy for modality-limited clients, with gains of up to +0.44 in audio-only settings, while maintaining stability for text-dominant configurations.

MSP-IMPROV Boundary Case (AT)

Most groups experienced significant performance recovery, though the AT configuration emerged as a boundary case where reconstruction was detrimental due to low baseline degradation and conflicting gradients.

CMU-MOSEI client-level performance by modality set

Client modalities	Base model	FedC-MAMs	Δ
A only	0.251	0.693	+0.442
V only	0.343	0.661	+0.318
AV	0.403	0.668	+0.265
T only	0.667	0.730	+0.063
AT	0.740	0.763	+0.023
TV	0.717	0.751	+0.034

Yellow: text-absent clients (A, V, AV), with gains statistically significant (Welch t > 30, p < 0.001). Text-containing clients: small, non-significant deltas.

MSP-IMPROV client-level performance by modality set

Client modalities	Base model	FedC-MAMs	Δ
A only	0.362	0.396	+0.034
V only	0.290	0.427	+0.137
AV	0.500	0.502	+0.002
T only	0.294	0.465	+0.171
AT	0.570	0.442	−0.128
TV	0.419	0.593	+0.174

Yellow: modality-limited gains, significant (p < 0.001). Red: AT boundary case; reconstruction detrimental (p < 0.001).

What this shows

FedC-MAMs restore substantial performance for modality-limited clients without destabilising the global model.

Gains are largest where local disadvantage is greatest, with a ceiling effect where available modalities are already highly informative.

Why it matters

Federated robustness does not require retraining the global backbone.

Local, modular reconstruction is sufficient to address client-level modality incongruence, validating modularity as a practical design principle for heterogeneous deployments.

RQ4.1 — FedC-MAMs in Federated Settings

The answer is that FedC-MAMs restore substantial performance for modality-limited clients without destabilising the global model. But the benefit is conditional — it scales with how much the baseline degrades, and text-dominant configurations show a ceiling effect.

▶ CMU-MOSEI

The MOSEI table is the clearest result. Clients with audio only go from 0.251 to 0.693 — a gain of +0.442. Video-only clients gain +0.318. AV clients gain +0.265. All highlighted in yellow — all statistically significant with Welch t > 30, p < 0.001. The gains are large and real. Text-dominant clients — T, AT, TV — gain a little, but far less, because the baseline was already high. There is a ceiling where further reconstruction offers negligible return.
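For readers unfamiliar with the test quoted here, this is a minimal sketch of Welch's t statistic for two independent samples with unequal variances, together with the Welch-Satterthwaite degrees of freedom. The function name and the toy inputs are assumptions for illustration; the per-run scores behind the reported t > 30 are not reproduced here.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(base: Sequence[float], treated: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for treated versus base."""
    se2_base = variance(base) / len(base)          # squared standard error, sample variance
    se2_treated = variance(treated) / len(treated)
    t_stat = (mean(treated) - mean(base)) / sqrt(se2_base + se2_treated)
    dof = (se2_base + se2_treated) ** 2 / (
        se2_base ** 2 / (len(base) - 1) + se2_treated ** 2 / (len(treated) - 1)
    )
    return t_stat, dof


# Toy numbers, chosen only to exercise the function; not thesis data.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(f"t={t_stat:.3f}, dof={dof:.1f}")
```

Unlike Student's t-test, this form does not assume the base and FedC-MAM runs share a variance, which is a sensible default when one condition is much noisier than the other.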

▶ MSP-IMPROV

MSP-IMPROV largely confirms the pattern. Most groups recover meaningfully. But the AT configuration emerges as a boundary case — reconstruction is actually detrimental there, with a delta of −0.128. This is the case where two already reasonably informative modalities are present, the baseline degradation is low, and the additional gradients from reconstruction interfere with the existing optimisation. Reconstruction is not always free. When there is nothing to recover, it can hurt.
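The deltas quoted for both datasets can be recomputed directly from the two slide tables. The sketch below does that and flags any client group where reconstruction lowers performance; the dictionary layout and function names are my own, while the numbers are copied from the tables.

```python
from typing import Dict, List, Tuple

# (base model, FedC-MAMs) per client modality set, copied from the slide tables.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "CMU-MOSEI": {
        "A": (0.251, 0.693), "V": (0.343, 0.661), "AV": (0.403, 0.668),
        "T": (0.667, 0.730), "AT": (0.740, 0.763), "TV": (0.717, 0.751),
    },
    "MSP-IMPROV": {
        "A": (0.362, 0.396), "V": (0.290, 0.427), "AV": (0.500, 0.502),
        "T": (0.294, 0.465), "AT": (0.570, 0.442), "TV": (0.419, 0.593),
    },
}


def deltas(dataset: str) -> Dict[str, float]:
    """FedC-MAMs minus base model, rounded to the three decimals shown on the slide."""
    return {group: round(fed - base, 3)
            for group, (base, fed) in RESULTS[dataset].items()}


def harmed_clients(dataset: str) -> List[str]:
    """Client groups for which reconstruction reduces performance."""
    return [group for group, delta in deltas(dataset).items() if delta < 0]


for name in RESULTS:
    print(name, deltas(name), "harmed:", harmed_clients(name))
```

Only the MSP-IMPROV AT group comes out negative, which matches the boundary case described above.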

▶ What this shows

FedC-MAMs restore substantial performance for clients with the greatest modality disadvantage, without destabilising the global model. Gains are largest where local disadvantage is greatest. Text-dominant configurations show a ceiling effect.

▶ Why it matters

Federated robustness does not require retraining the global backbone. Local, modular reconstruction is sufficient to address client-level modality incongruence. That validates modularity as a practical design principle for heterogeneous, resource-constrained deployments.

RQ4.2

RQ4 — Federated and Decentralised Reconstruction RQ4.2 Results

RQ4.2 — Communication and compute costs

FedC-MAMs use ~50% of MMIN's energy and ~20% of RedCore's at 1000 rounds. Clients transmit only the modules for their active modalities; cost scales with the client's modality set, not global model size.

CMU-MOSEI ~50% vs MMIN

FedC-MAMs achieved a 50% reduction in cumulative communication volume and energy cost compared to mid-capacity monolithic models like MMIN.

MSP-IMPROV <17% vs RedCore

The efficiency gains were even more pronounced due to larger modality encoders, with FedC-MAMs consuming less than 17% of the energy required by high-capacity transformer baselines.

MOSEI cumulative communication cost over rounds

Dataset	MMIN	RedCore	C-MAMs
MOSEI	8.19	19.70	4.03
MSP-IMPROV	16.78	41.14	6.63

kWh at 0.03 kWh GB⁻¹ · 1000 rounds · 20 clients · 100 simulations.

MSP-IMPROV cumulative communication cost over rounds

What this shows

FedC-MAMs transmit only the parameters relevant to a client's active modalities, reducing cumulative communication and energy costs by ~50% against MMIN and over 75% against RedCore across 1,000 training rounds.

Why it matters

Bandwidth and energy constraints are the binding limitation for federated edge deployments.

Modularity resolves this. Robustness and communication efficiency are not a trade-off; they are achieved together.

RQ4.2 — Communication and Compute Costs

The answer is that FedC-MAMs are significantly cheaper than the alternatives. Clients transmit only the modules relevant to their active modalities — not the full model. Cost scales with the client's modality set, not with global model size.
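A back-of-the-envelope sketch of that scaling argument follows, assuming each client uploads one module per active modality every round. The module sizes are hypothetical placeholders, the model counts uploads only, and it uses decimal megabytes; only the 0.03 kWh per GB conversion factor comes from the slide.

```python
from typing import Dict, Sequence, Tuple

KWH_PER_GB = 0.03  # conversion factor quoted on the slide

# Hypothetical module sizes in megabytes; NOT taken from the thesis.
MODULE_MB: Dict[str, float] = {"A": 4.0, "V": 6.0, "T": 2.0}


def round_upload_mb(active_modalities: Sequence[str]) -> float:
    """Upload volume for one client in one round: only its active modules."""
    return sum(MODULE_MB[m] for m in active_modalities)


def cumulative_energy_kwh(clients: Sequence[Tuple[str, ...]], rounds: int) -> float:
    """Total upload energy across all clients and rounds."""
    total_mb = rounds * sum(round_upload_mb(c) for c in clients)
    return total_mb / 1000.0 * KWH_PER_GB  # MB -> GB (decimal), then GB -> kWh


# Two hypothetical clients: audio-only, and audio plus text.
example_clients = [("A",), ("A", "T")]
print(cumulative_energy_kwh(example_clients, rounds=1000))
```

Under this simplified model, adding parameters to a module a client never uses leaves that client's cost unchanged, which is the sense in which cost follows the modality set rather than global model size.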

▶ CMU-MOSEI — ~50% vs MMIN

In MOSEI, FedC-MAMs use 4.03 kWh over 1,000 training rounds across 20 clients. MMIN uses 8.19 kWh — roughly double. RedCore uses 19.70 kWh — nearly five times as much. FedC-MAMs achieve a 50% reduction against MMIN and over 75% against RedCore.

▶ MSP-IMPROV — <17% vs RedCore

The efficiency gap is even wider for MSP-IMPROV, because the modality encoders are larger. FedC-MAMs use 6.63 kWh; RedCore uses 41.14 kWh. FedC-MAMs consume less than 17% of the energy required by the transformer-based baseline.
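The percentages quoted for both datasets follow from the energy table by simple division. This short sketch reproduces them; the dictionary layout and function name are illustrative, while the kWh figures are copied from the table.

```python
from typing import Dict

# Cumulative energy in kWh, copied from the slide table.
ENERGY_KWH: Dict[str, Dict[str, float]] = {
    "MOSEI": {"MMIN": 8.19, "RedCore": 19.70, "C-MAMs": 4.03},
    "MSP-IMPROV": {"MMIN": 16.78, "RedCore": 41.14, "C-MAMs": 6.63},
}


def share_of_baseline(dataset: str, baseline: str) -> float:
    """C-MAM energy as a fraction of the named baseline's energy."""
    row = ENERGY_KWH[dataset]
    return row["C-MAMs"] / row[baseline]


for dataset in ENERGY_KWH:
    for baseline in ("MMIN", "RedCore"):
        share = share_of_baseline(dataset, baseline)
        print(f"{dataset} vs {baseline}: {share:.1%} of baseline, {1 - share:.1%} reduction")
```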

▶ What this shows

FedC-MAMs transmit only client-relevant parameters, cutting cumulative communication volume and energy cost by roughly 50% against MMIN and over 75% against RedCore across 1,000 rounds.

▶ Why it matters

Bandwidth and energy are the binding constraints for federated edge deployments. Modularity resolves both. Robustness and communication efficiency are not a trade-off here — they are achieved together.

RQ4 synthesis

RQ4 — Federated and Decentralised Reconstruction Synthesis

Resolved Sub-Questions

RQ4.1

Do modular C-MAMs improve global and client-level predictive performance when clients observe disjoint subsets of modalities?

Yes, where the baseline degrades: audio-only, video-only, and audio-video clients gain substantially; text-dominant clients see minimal or negligible gains; negative impact is limited to configurations where unimodal dominance is already near-ceiling (e.g. MSP-IMPROV AT).

RQ4.2

What are the communication and compute costs of deploying modular C-MAMs in federated multimodal systems?

FedC-MAMs transmit only client-relevant modules, achieving substantially lower cumulative communication cost than MMIN or RedCore; local compute overhead is bounded and independent of global model size.

Answer to RQ4

Modular reconstruction adapts effectively to federated settings, with FedC-MAMs restoring performance for modality-limited clients via selective parameter aggregation without destabilising global optimisation.

Corrective capacity is contingent on meaningful baseline degradation, but the framework offers a scalable path to robust multimodal learning in heterogeneous, resource-constrained environments.

RQ4 Synthesis

▶ RQ4.1

FedC-MAMs improve client-level robustness where modality degradation is meaningful. Audio-only, video-only, and audio-video clients gain substantially. Text-dominant clients see minimal or negligible gains, with one negative case where reconstruction conflicts with an already near-ceiling baseline.

▶ RQ4.2

FedC-MAMs transmit only client-relevant modules, achieving substantially lower cumulative communication and energy cost than MMIN or RedCore. Local compute overhead is bounded and independent of global model size.

▶ Answer to RQ4

Modular reconstruction adapts effectively to federated settings. FedC-MAMs restore performance for modality-limited clients through selective parameter aggregation, without destabilising global optimisation. Corrective capacity is contingent on meaningful baseline degradation — reconstruction is not unconditionally beneficial, as the MSP-IMPROV AT configuration demonstrated, where adding reconstruction incurred a net delta of −0.128. But for the clients that need it, the gains are large, the cost is low, and the global model remains stable. The framework offers a scalable path to robust multimodal learning in heterogeneous, resource-constrained environments.

Thesis Synthesis

Thesis — Final Synthesis Conclusion

Thesis claim

Missing-modality robustness is a behavioural, architectural, and systems-level design challenge that can be mitigated post-training through modular reconstruction without retraining or architectural modification.

Research Questions — Resolved

RQ1

How and to what extent does model performance change when an entire modality is unavailable, and how reliably can this be anticipated?

Degradation is substantial, asymmetric, and partially predictable; attribution signals do not reliably reflect functional dependence, and modality reliance emerges early in training.

RQ2

To what extent can modular post-training reconstruction mitigate missing-modality performance degradation?

C-MAMs recover a substantial fraction of lost performance with shallow networks on limited data; recovery correlates with inter-modality structure, not reconstruction geometry.

RQ3

How do reconstructed embeddings affect model behaviour, and how do methods compare in fidelity and calibration?

Reconstruction improves behaviour but geometric fidelity and calibration diverge systematically; lightweight decoders can achieve stronger behavioural stability than high-capacity alternatives.

RQ4

Can modular reconstruction be adapted for federated systems with heterogeneous modality availability?

FedC-MAMs improve client-level robustness where degradation is meaningful and reduce communication cost relative to monolithic baselines; they are scalable and behaviourally effective.

Thesis conclusion

Practical robustness to missing modalities can be achieved through simple, modular, post-training reconstruction that respects deployed architectures and scales naturally across centralised and decentralised environments.

Final Synthesis

The thesis claim is that missing-modality robustness is a behavioural, architectural, and systems-level design challenge that can be mitigated post-training through modular reconstruction, without retraining or architectural modification. Each research question addressed one layer of that claim.

▶ RQ1

Degradation is substantial, asymmetric, and partially predictable. Attribution signals do not reliably reflect functional dependence. Modality reliance emerges early in training and stabilises — which means it can be observed before deployment, not just after failure.

▶ RQ2

C-MAMs recover a substantial fraction of lost performance with shallow networks trained on limited data. Recovery correlates with inter-modality structure in the base model, not with reconstruction geometry. The ceiling is a property of how the model was originally trained.

▶ RQ3

Reconstruction architecture shapes decision behaviour beyond accuracy. Lightweight modular decoders outperform complex end-to-end alternatives in calibration and minority-class generalisation. Behavioural fidelity is not a function of capacity — it is a function of encoder-decoder alignment.

▶ RQ4

FedC-MAMs improve client-level robustness in heterogeneous federated settings, with substantially lower communication and energy cost than monolithic alternatives. The modularity that makes C-MAMs practical in centralised settings scales naturally to decentralised environments.

▶ Thesis Conclusion

Practical robustness to missing modalities can be achieved through simple, modular, post-training reconstruction that respects deployed architectures and scales across both centralised and decentralised environments. The key insight across all four research questions is the same: the constraints are in the base model, not in the recovery mechanism. Understanding where those constraints are — and why — is what this thesis provides.

Looking Back and Looking Forward

Thesis — Retrospective Limitations & Future Work

Looking Back

There is always something more that could have been done

Controlled conditions only

Analysis targets complete modality absence at inference. Partial degradation, intermittency, and temporal failure are distinct problems outside the intended scope.

Embedding-level reconstruction only

Reconstruction operates on learned representations, not raw signals. Claims do not extend to audio synthesis, image generation, or other generative modality recovery.

No temporal modelling

Prediction is made from a single multimodal instance. Sequential, streaming, and event-driven settings introduce qualitatively different failure modes.

Empirical regularities, not formal guarantees

Behavioural fidelity is established empirically under controlled conditions. Formal error bounds and decision-theoretic guarantees remain open problems.

Federated proof-of-concept

FedC-MAM experiments use fixed client sets, IID partitions, and FedAvg. Non-IID drift, client churn, and richer aggregation strategies are not addressed.

Looking Forward

There is always something more to do

Trustworthiness of reconstructed representations

Under what conditions should inferred embeddings support decision-making? Reconstruction must be integrated with reliability and failure-mode assessment, not treated as a standalone robustness tool.

Privacy & security under reconstruction

Effective reconstruction may recover sensitive attributes never explicitly shared. Robustness and privacy guarantees must be studied jointly in decentralised multimodal systems.

Behavioural evaluation as standard practice

Accuracy and geometric similarity are insufficient. Evaluation protocols must specify calibration, confidence, and class-conditional criteria before reconstruction methods are deployed.

Representations designed for reconstructability

Robustness is constrained by the structure of learned embeddings, not decoder capacity. Cross-modal substitutability must become an explicit representation learning objective, not an implicit by-product of fusion.

Federated robustness as a coupled systems problem

Reconstruction, aggregation dynamics, privacy leakage, and client heterogeneity interact. Longitudinal evaluation and integration with privacy-preserving techniques are necessary next steps.

Multimodal learning is promised as the better approach: more context, more robustness. Making that promise real is hard work. This thesis, I hope, brings us one step closer to that ideal.

Looking Back and Looking Forward

▶ Looking Back heading

This slide is about scope — what the thesis does and does not claim. Being precise here matters.

▶ Controlled conditions only

Everything in this thesis targets complete modality absence at inference time. Partial degradation, intermittency, and temporal failure — a sensor that cuts in and out — are distinct problems. The framework handles a well-defined failure mode; it does not generalise automatically to less structured ones.

▶ Embedding-level reconstruction only

C-MAMs operate on learned representations, not raw signals. This is not audio synthesis or image generation — the claim is narrower and more defensible. Embedding-level recovery has different failure modes, different theoretical grounding, and different practical constraints than generative modality recovery.

▶ No temporal modelling

Every prediction is made from a single multimodal instance. The framework does not account for sequential, streaming, or event-driven settings where the history of what was available matters for what you infer.

▶ Empirical regularities, not formal guarantees

The behavioural fidelity results are established empirically under controlled conditions. Formal error bounds and decision-theoretic guarantees — the kind of thing you would need for certification — remain open. The work builds the empirical case; the formal theory is future work.

▶ Federated proof-of-concept

The FedC-MAM experiments use fixed client sets, IID data partitions, and FedAvg. Non-IID client drift, client churn, and richer aggregation strategies are not addressed. The federated results are a proof of concept that the approach transfers, not a full evaluation of federated robustness.

▶ Looking Forward heading

On the other side — where this goes next.

▶ Trustworthiness of reconstructed representations

The harder question is not whether reconstruction works, but when you should trust it. Inferred embeddings need reliability assessment integrated with them, not applied after. Reconstruction as a standalone robustness tool is not enough.

▶ Privacy and security under reconstruction

Effective reconstruction may recover sensitive attributes that were never explicitly shared. If a C-MAM can infer a missing modality from what is available, it might also infer things the user did not intend to share. Robustness and privacy must be studied together, especially in federated settings.

▶ Behavioural evaluation as standard practice

Accuracy and geometric similarity are not sufficient evaluation criteria. This thesis makes that case empirically — but the field has not yet adopted calibration, confidence, and class-conditional criteria as standard. That needs to change before reconstruction methods are deployed in practice.

▶ Representations designed for reconstructability

The reconstruction ceiling is set by the base model's latent structure. If reconstructability is a property we want, it needs to be an explicit objective during representation learning — not an accidental by-product of how modalities were fused. Cross-modal substitutability should be designed in, not hoped for.

▶ Federated robustness as a coupled systems problem

Reconstruction, aggregation dynamics, privacy leakage, and client heterogeneity interact in ways that are not yet well understood. The federated setting is not just a deployment context — it is a qualitatively different research problem.

▶ Closing statement

The goal is multimodal learning whose robustness is not assumed but empirically justified and socially defensible. That is what this thesis moves toward.

Thank You

Publications

Publications Appendix

PhD Related

Directly related to the work presented in this thesis

ECAI — MRC Workshop (2023)

Geraghty, Hines & Golpayegani. "Understanding the Relevancy of Modality Information in Multimodal Machine Learning". In: Modelling and Representing Context (MRC), ECAI.

ACM TIST — Journal (2025)

Geraghty, Hines & Golpayegani. "Learning to Associate: Multimodal Inference with Fully Missing Modalities". In: ACM Trans. Intell. Syst. Technol. 16.5. DOI: 10.1145/3746456

Interpreting the Behaviour of Reconstructed Modalities

Under Review

Journal paper — details to be confirmed

Behavioural Failures in Multimodal Models Under Missing Modalities

Under Review

Journal paper — details to be confirmed

Non-PhD Related

Not directly related to the work presented in this thesis

ACM MMSys (2022)

Geraghty et al. "AQP: an open modular Python platform for objective speech and audio quality metrics". In: Proc. 13th ACM Multimedia Systems Conference, pp. 191–196. DOI: 10.1145/3524273.3532885

IEEE Access — Journal (2022)

Golpayegani et al. "Intelligent Shared Mobility Systems: A Survey on Whole System Design Requirements, Challenges and Future Direction". In: IEEE Access 10, pp. 35302–35320. DOI: 10.1109/ACCESS.2022.3162848

Springer Book Chapter (2026)

Geraghty et al. "Traffic Flow Breakdown Prediction for the M50 Motorway in Ireland". In: Transport Transitions: Advancing Sustainable and Inclusive Mobility. Springer Nature Switzerland, pp. 514–520.

ACM MMSys (2026)

Geraghty, Golpayegani & Hines. "Audio Made Simple: A Modern Framework for Audio Processing". In: Proc. ACM Multimedia Systems Conference 2026, pp. 436–442. DOI: 10.1145/3793853.3799811