Learning to Associate:

Handling Missing Modalities in Centralised and Decentralised Environments

Jack Geraghty

University College Dublin

Supervisor: Associate Professor Fatemeh Golpayegani · Co-supervisor: Dr Andrew Hines

RSP: Dr Rob Brennan · Dr Gavin McArdle

11 May 2026

Problem

Missing Modalities Break Multimodal Systems · Problem

Multimodal models assume joint availability of all modalities. In practice, this assumption is frequently violated.

Modalities may be absent

Sensor failure, privacy constraints, and bandwidth limitations.

Behaviour changes

Performance degradation is not uniform. Confidence and calibration may shift.

Retraining is often impractical

Systems may already be deployed or data unavailable.

When modalities are absent, behaviour changes, and retraining is impractical, how can these effects be understood and handled?

Problem

Multimodal systems are built on the assumption that all modalities will be present at inference time. That assumption is wrong more often than the literature acknowledges.

▶ Modalities may be absent

The reasons are simple: a sensor fails, a user opts out for privacy, a bandwidth constraint drops a stream. None of these is an edge case; they are normal conditions for any deployed system.

▶ Behaviour changes, animation starts

On the right is a multimodal sentiment analysis model that takes in audio, video, and text. The animation steps through different missing-modality configurations: video removed, audio and video removed, text removed, audio and text removed. It is the same model and the same task, but the output changes depending on which combination of modalities is actually available at inference time. That variability is the core of the problem.

▶ Retraining is often impractical

And you can't always just retrain. The model may already be deployed, the original training data may no longer be available, and retraining a large multimodal model is computationally expensive even when those things aren't a problem. You would also need to build some technique for handling missing modalities into that retraining, and, as I will show, that is neither simple nor always the best option.

▶ Closing question

So the question is: when modalities are absent, behaviour changes, and retraining is impractical, how can those effects be understood, and how can they be handled?

Evaluation

Evaluating Multimodal Models · Evaluation

Full-Modality Performance

Baseline: all modalities present at inference

The ceiling against which robustness is measured

Modality Contribution

Average marginal utility across modality subsets

Estimated via MM-SHAP

Modality Dependence

Effect of removing a modality

Estimated via ablation

Evaluation

Before getting into the research questions, it's worth being precise about how multimodal models are evaluated, because the thesis works across three different evaluation perspectives and the distinction matters.

▶ Full-Modality Performance

The baseline: the model operating under ideal conditions with all modalities present. Everything else is measured relative to this.

▶ Modality Contribution

Modality contribution is estimated via MM-SHAP, which is derived from Shapley value theory. It quantifies the average marginal utility of a modality across all possible subsets of the available modalities, not just when all inputs are present.
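To make "average marginal utility across subsets" concrete, here is a minimal sketch of an exact Shapley computation at the modality level. It is a simplification for illustration, not the MM-SHAP implementation: the function name `shapley_contributions`, the `toy_utility` table, and its accuracy numbers are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_contributions(modalities, utility):
    """Exact Shapley value per modality.

    `utility` maps a frozenset of present modalities to a score
    (for example accuracy with only those modalities available).
    """
    n = len(modalities)
    values = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for k in range(n):  # size of the subset that m joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal utility of adding m to this subset.
                total += weight * (utility[s | {m}] - utility[s])
        values[m] = total
    return values


# Hypothetical accuracies for every subset of two modalities (illustrative only).
toy_utility = {
    frozenset(): 0.50,
    frozenset({"audio"}): 0.55,
    frozenset({"image"}): 0.90,
    frozenset({"audio", "image"}): 0.95,
}
contrib = shapley_contributions(["audio", "image"], toy_utility)
```

The contributions sum to the gap between the full-modality score and the empty-input score, which is the efficiency property that makes Shapley values attractive as an attribution signal.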

▶ Modality Dependence

Modality dependence is measured by removing a modality and observing the effect. It tells you how much the model relies on that modality functionally. These two look like they should agree. The thesis shows they often don't, and that gap is where the work begins.
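A rough sketch of the dependence measurement, assuming dependence is taken as the drop in a metric when one modality is removed from the full input. The helper name `modality_dependence` and every number below are hypothetical, chosen only to show the shape of the calculation.

```python
def modality_dependence(full_score, ablated_scores):
    """Drop in the metric when each modality is removed from the full input."""
    return {m: full_score - score for m, score in ablated_scores.items()}


# Hypothetical accuracies: all modalities present, then each one removed in turn.
full_accuracy = 0.95
accuracy_without = {"audio": 0.93, "image": 0.40}

dependence = modality_dependence(full_accuracy, accuracy_without)
most_relied_on = max(dependence, key=dependence.get)
```

Because this score only looks at the full-input condition and single removals, it can rank modalities differently from a subset-averaged attribution score.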

RQ1: Modality Absence and Behavioural Degradation

RQ1: Modality Absence and Behavioural Degradation · RQ1

Research question

How and to what extent does model performance change when an entire modality is unavailable at inference time, and how reliably can this change be anticipated?

Under Review with Information Fusion

Multimodal models are not evaluated properly.

Research answer

Degradation is substantial, asymmetric, and partially predictable; attribution signals do not reliably reflect functional dependence, and modality reliance emerges early in training.

RQ1 asks how model performance changes when a modality is entirely absent at inference time, and whether that change can be anticipated. Three sub-questions each target a different part of that problem.

▶ RQ1.1: Attribution vs Behaviour

How well do MM-SHAP contribution scores align with the actual behavioural shifts when a modality is removed? The approach is systematic ablation across six datasets and models, comparing attribution signals against observed behavioural changes.

▶ RQ1.2: Zero vs Gaussian Masking

Does how absence is represented matter? The approach compares zero and Gaussian masking strategies across all datasets to assess whether the choice of absence representation affects measured robustness.
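The two absence representations can be sketched as follows. This is an illustration under assumptions: the notes do not state the Gaussian parameters, so a standard normal is used here, and the helper names (`zero_mask`, `gaussian_mask`, `mask_modality`) are hypothetical.

```python
import random


def zero_mask(features):
    """Represent an absent modality as an all-zero vector of the same length."""
    return [0.0 for _ in features]


def gaussian_mask(features, mean=0.0, std=1.0, seed=None):
    """Represent an absent modality as Gaussian noise (parameters are an assumption)."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in features]


def mask_modality(sample, missing, strategy="zero", seed=None):
    """Return a copy of `sample` with one modality replaced by its absence representation."""
    masked = dict(sample)  # shallow copy so the original sample is left untouched
    if strategy == "zero":
        masked[missing] = zero_mask(sample[missing])
    elif strategy == "gaussian":
        masked[missing] = gaussian_mask(sample[missing], seed=seed)
    else:
        raise ValueError(f"unknown masking strategy: {strategy}")
    return masked


# Hypothetical feature vectors for a two-modality sample.
sample = {"audio": [0.2, -0.1, 0.4], "text": [1.0, 2.0]}
zeroed = mask_modality(sample, "audio", strategy="zero")
noisy = mask_modality(sample, "audio", strategy="gaussian", seed=0)
```

Evaluating the same model on `zeroed` and `noisy` inputs is what allows the two strategies to be compared on equal terms.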

▶ RQ1.3: Training Dynamics

Can modality dependence be detected before removal? The approach tracks dependence trajectories across training epochs and compares them against validation behaviour.

RQ1: Results

RQ1: Modality Absence and Behavioural Degradation · RQ1 Results

RQ1: Key findings

Attribution under full input does not reliably predict removal impact; masking strategy directly shapes what robustness is measured; and modality dependence is detectable during training but not from validation metrics alone.

RQ1.1 — Contribution vs Dependence

RQ1.2 — Masking

RQ1.3 — Training-Time Indicators

Dataset	Modality	Cohen's d	p-value	Better
AVMNIST	Audio	−3.34	0.014	Zero
↳	Image	−3.33	0.014	Zero
MM-IMDb	Text	−8.26	0.002	Zero
↳	Image	−0.87	0.134	n.s.
MOSEI	Audio	+3.54	<0.05	Noise
↳	Video	+1.85	<0.05	Noise
↳	Text	+4.22	<0.05	Noise

Negative d = zero better; positive d = noise better. All significant effects are large (|d|≥1.85).

AVMNIST accuracy modality dependence per epoch

MOSEI F1-weighted modality dependence per epoch

What this shows

Three independent probes. Each reveals what the others miss.
No single metric captures the full picture.
Run absence tests. Attribution proxies are not enough.

Why it matters

Masking strategy is a design decision. Validate it empirically.
Training trajectories give early warning before deployment failures.
Attribution-based robustness estimates are unreliable.

RQ1: Condensed Results

Three sub-questions, three failure modes, each requiring a different probe to detect.

▶ RQ1.1: Misaligned

MM-SHAP and functional dependence often disagree. On AVMNIST, balanced attribution masks extreme image reliance: remove the image and accuracy collapses; remove the audio and the impact is negligible. Kinetics-Sounds is the rare exception where attribution and dependence align. CMU-MOSEI and MM-IMDb both show attribution inflating the apparent importance of functionally redundant modalities.

▶ RQ1.2: No Universal Default

The Cohen's d values show the magnitude and direction of effect. Large negative values (AVMNIST, MM-IMDb) mean zero masking is strongly better. Large positive values (MOSEI) mean noise is strongly better. The reversal is complete and cannot be predicted from model architecture alone. Masking strategy is not an implementation detail; it is a design decision that determines which robustness lower bound you are measuring.
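For readers who want the sign convention spelled out, here is a minimal sketch of Cohen's d with a pooled standard deviation. It assumes d is computed as noise-masked score minus zero-masked score, which matches "negative d = zero better"; the thesis may use a different variant (for example a paired one), and the per-run scores below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(noise_scores, zero_scores):
    """Cohen's d (noise minus zero) using the pooled sample standard deviation."""
    n1, n2 = len(noise_scores), len(zero_scores)
    pooled_var = (
        (n1 - 1) * variance(noise_scores) + (n2 - 1) * variance(zero_scores)
    ) / (n1 + n2 - 2)
    return (mean(noise_scores) - mean(zero_scores)) / sqrt(pooled_var)


# Hypothetical per-run accuracies under each masking strategy.
zero_runs = [0.80, 0.82, 0.81, 0.79, 0.83]
noise_runs = [0.70, 0.72, 0.71, 0.69, 0.73]

d = cohens_d(noise_runs, zero_runs)
better = "Zero" if d < 0 else "Noise"
```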

▶ RQ1.3: Partially Predictable

The dependence trajectories for AVMNIST show image dependence establishing itself early, within the first few epochs, and remaining consistently higher than audio throughout training. This is visible before any ablation is applied. But the validation metrics for audio and image can both look healthy, hiding what the dependence score is showing. MOSEI is the clearest case of misleading validation: audio and video can show moderate early performance while their dependence scores remain low throughout.

▶ What this shows / Why it matters

Attribution, masking, and training dynamics are three independent lenses; each reveals what the others miss. Together they explain why robustness failures are non-random: they are traceable to stable early-training decisions that attribution scores can't surface, and that masking choice can amplify or conceal.

Transition

Fragility is established.

Can we fix it? Yes, we can!

RQ2: Post Hoc Modular Reconstruction

RQ2: Post Hoc Modular Reconstruction · RQ2

Research question

To what extent can a modular, lightweight, and post-training feature reconstruction approach mitigate performance degradation caused by fully missing modalities during multimodal model inference?

Published in ACM Transactions on Intelligent Systems and Technology (2025)

Modular post-hoc reconstruction recovers from missing modalities.

Research answer

C-MAMs recover a substantial fraction of lost performance with shallow networks on limited data; recovery correlates with inter-modality structure, not reconstruction geometry.

RQ2 asks to what extent a modular, lightweight, post-training reconstruction approach can mitigate missing-modality degradation. Published in ACM TIST 2025. Five sub-questions each target a different practical constraint.

▶ RQ2.1: Minimum Training Requirements

How much data is actually needed? C-MAMs are trained independently per modality mapping; training data is varied systematically from 10% upward.

▶ RQ2.2: Loss Function Complexity

Does the choice of loss function matter? Twelve objectives are compared, including MSE, cosine, MMD, and hybrids.

▶ RQ2.3: Encoder Configuration

Should encoders be frozen or fine-tuned? Frozen, random-init, and fine-tuned variants are compared.

▶ RQ2.4: Geometry vs Utility

Does better embedding geometry mean better performance? Cosine similarity and effect sizes are correlated against CRR.

▶ RQ2.5: Contrastive Information / Alignment

Do encoder spaces set a hard ceiling? Cross-modal alignment is measured to understand where reconstruction fails regardless of decoder.
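The core idea behind these sub-questions, a small network trained post hoc to map the embedding of an available modality onto the embedding of a missing one, can be sketched with a single linear layer and an MSE objective. This is a toy under stated assumptions, not the thesis architecture: the embeddings are synthetic stand-ins for frozen-encoder outputs, and `train_linear_map`, `reconstruct`, and `true_map` are hypothetical names.

```python
import random


def reconstruct(W, b, x):
    """Predict the missing-modality embedding from the available one."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


def train_linear_map(inputs, targets, epochs=200, lr=0.05, seed=0):
    """Fit W and b by per-sample gradient descent on squared error."""
    rng = random.Random(seed)
    d_in, d_out = len(inputs[0]), len(targets[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    b = [0.0] * d_out
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            pred = reconstruct(W, b, x)
            for j in range(d_out):
                err = pred[j] - y[j]  # gradient of 0.5 * err**2 w.r.t. pred[j]
                for i in range(d_in):
                    W[j][i] -= lr * err * x[i]
                b[j] -= lr * err
    return W, b


def true_map(x):
    """Synthetic relationship between the two embedding spaces (illustrative only)."""
    return [x[0] + x[1] + 0.5, x[0] - x[1]]


# Synthetic "frozen" embeddings: the base model is never updated, only the mapper.
rng = random.Random(1)
train_x = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
train_y = [true_map(x) for x in train_x]

W, b = train_linear_map(train_x, train_y)
recon = reconstruct(W, b, [0.3, -0.2])
```

The mapper is trained independently of the base model, which is what makes the approach modular: one such mapping per missing-modality configuration, each with its own small parameter set.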

RQ2: Results

RQ2: Post Hoc Modular Reconstruction · RQ2 Results

RQ2: Key findings

Reconstruction quality is bounded by encoder representational structure, not loss complexity or geometric precision: C-MAMs are data-efficient, loss-insensitive, frozen-encoder-friendly, and constrained where encoder alignment fails.

RQ2.1 — Training Efficiency

RQ2.2 — Impact of Loss Function

RQ2.3 — To Train or Not to Train

RQ2.4 — Geometry vs Performance

RQ2.5 — Encoder Alignment

AVMNIST audio-to-image performance vs training data fraction

AVMNIST validation loss vs training data fraction

MOSEI: Mean Δ vs Frozen Baseline (Has0 Acc & Has0 F1 Weighted)

Condition	Random Init Δ	Fine-tuned Δ
A	−0.027	+0.040
V	+0.011	+0.045
T	−0.014	−0.002
AV	+0.011	+0.006
AT	−0.007	−0.005
VT	+0.005	+0.002

Δ = difference vs frozen encoder. Negative = worse than frozen. Random init harms A and T; fine-tuning helps A and V but is neutral/negative for T.

AVMNIST Audio to Image statistical analysis

Reconstruction Error: MAE / MSE (Table 5.12)

Model & C-MAM	MAE	MSE
KS: Audio → Video	3.82	32.79
KS: Video → Audio	0.69	1.32
MM-IMDb: Image → Text	0.30	0.16
MM-IMDb: Text → Image	0.29	0.14
UTT-Fusion: AT → Video	0.07	0.009
UTT-Fusion: Audio → Text	0.23	0.19

KS audio→video peaks at MSE=32.79. This is a consequence of near-orthogonal encoder spaces, not decoder choice.

Contrastive Information Analysis (Table 5.13)

Dataset	MI Red.	Dominance	Cosine	KL Div.
AVMNIST	Weak	Image	0.006	0.85
Kinetics-Sounds	Strong	Video	−0.04	2.95
MOSEI	Moderate	Text	0.28–0.38	0.34

High KL + near-orthogonal cosine = strong contrastive behaviour. In KS, video and audio encode largely distinct, conflicting information. The ceiling is set by the encoders, not the C-MAM.

What this shows

Data-efficient, loss-insensitive, frozen-encoder-friendly.
Geometric fidelity does not predict utility.
Ceiling is the encoders.

Why it matters

Embedding similarity is the wrong metric.
Check encoder alignment before training a C-MAM.
Loss complexity and fine-tuning are not worth the cost.

RQ2: Condensed Results

Five sub-questions, one unified finding: the C-MAM framework is bounded by encoder alignment, not decoder complexity.

▶ RQ2.1: Data Efficient

10% of training data yields performance within 2–5% of the full-data baseline. The base model's latent space already encodes the structure; C-MAM only learns the mapping.

▶ RQ2.2: Loss Insensitive

All 12 objectives converge to similar recovery. The encoder geometry is the dominant constraint; plain MSE is sufficient.

▶ RQ2.3: Frozen = Best

Frozen encoders match or outperform fine-tuning. Random init actively degrades audio and text conditions. The task-relevant structure is already in the representations.

▶ RQ2.4: Decoupled

Mean cosine 0.051 (near-orthogonal, MM-IMDb) still yields recovery. Geometric fidelity is not a reliable predictor of functional utility, and embedding similarity can be misleading.

▶ RQ2.5: Alignment Barrier

Kinetics-Sounds has the strongest contrastive profile: KL divergence=2.95, cosine≈−0.04, low PMI (0.006). Video dominates prediction; audio encodes largely distinct and conflicting information (strong MI reduction, conditional entropy similar with or without audio). This produces the highest reconstruction error (MSE=32.79) and lowest recovery (CRR 6%). The contrastive structure is an encoder property: it emerges before C-MAM training, and no decoder configuration can overcome it.

Transition

Recovery is possible.

But is everything behaving correctly?

RQ2 answered the question of whether C-MAMs can recover missing modalities. They can, substantially, with minimal data, without touching the base model. But performance recovery measures the output. It does not tell us whether the model is doing the right thing for the right reasons.

A model that recovers accuracy might still be overconfident. It might handle majority classes well and fail silently on the minority ones. RQ3 moves below the surface. It asks whether the behaviour restored by reconstruction is actually faithful, calibrated, stable across classes, and honest in its confidence estimates. It is also a good point to introduce other reconstruction methods and compare them against C-MAMs on these deeper questions of behaviour. The shift in dataset focus to MSP-IMPROV and IEMOCAP is intentional: these datasets have the class imbalance and bimodal structure needed to test calibration and minority-class reliability, properties that AVMNIST and MM-IMDb are not well suited to probe.

RQ3: Behavioural Fidelity of Reconstruction

RQ3: Behavioural Fidelity of Reconstruction · RQ3

Research question

How do reconstructed modality embeddings affect the decision behaviour of multimodal models relative to missing-modality baselines, and how do different reconstruction methods compare in information recovery, calibration behaviour, and class-conditional predictive structure?

Under review (venue to be confirmed)

Under proper scrutiny, simple reconstruction beats complex SOTA.

Research answer

Reconstruction improves behaviour but geometric fidelity and calibration diverge systematically; lightweight decoders can achieve stronger behavioural stability than high-capacity alternatives.

RQ3 asks how reconstructed embeddings affect model decision behaviour, and how different reconstruction methods compare in information recovery, calibration, and class-conditional structure. Where RQ2 asked whether C-MAMs can recover performance, RQ3 asks whether the behaviour they restore is correct, calibrated, and reliable.

▶ RQ3.1: H1: Information Recoverability

The approach compares reconstructed embeddings against the conditional variance bound to measure how much of the theoretically recoverable cross-modal signal each method actually captures.

▶ RQ3.2: H2.1 / H2.2: Complexity

The approach evaluates whether C-MAMs match MMIN's minority-class recall (H2.1) and whether RedCore underperforms MMIN on that metric (H2.2), across CMU-MOSEI, MSP-IMPROV, and IEMOCAP.

▶ RQ3.3: Calibration Biases

The approach uses reliability diagrams and ECE to assess whether reconstruction methods preserve calibrated confidence or introduce systematic overconfidence under modality absence.
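As a reference point for the calibration analysis, here is a minimal sketch of expected calibration error with equal-width confidence bins. The bin count of 10 is a common default and an assumption here, and both toy prediction sets are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Sample-weighted mean gap between confidence and accuracy per confidence bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical overconfident model: 90% confidence, 50% correct.
ece_overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)

# Hypothetical calibrated model: 80% confidence, 80% correct.
ece_calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
```

A reliability diagram plots the same per-bin accuracy against per-bin confidence, so a model tracking the diagonal has an ECE near zero.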

RQ3: Results

RQ3: Behavioural Fidelity of Reconstruction · RQ3 Results

RQ3: Key findings

C-MAMs more frequently reach the information-theoretic recovery limit, provide superior minority-class generalisation over higher-capacity models, and maintain better confidence calibration under modality absence.

RQ3.1 — Conditional Variance

RQ3.2 — Complexity vs Generalisation

RQ3.3 — Calibration

Mapping	Model	RMSE	√Var(Y|X)	Δ	H1
V→A ★	C-MAM	0.139	0.116	+0.024	Yes
	MMIN	0.201	0.122	+0.079	No
	RedCore	0.824	0.782	+0.042	Yes
TV→A ★	C-MAM	0.137	0.116	+0.021	Yes
	MMIN	0.158	0.117	+0.040	Yes
	RedCore	0.887	0.753	+0.134	No

IEMOCAP (★ = H1 satisfied). C-MAM: 5/6 saturated. MMIN: 2/6. RedCore: 1/6. RMSE proximity to √Var(Y|X) is the criterion. Not absolute RMSE.
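The logic of comparing RMSE against √Var(Y|X) rather than against zero can be illustrated on a scalar toy problem. This is a sketch under assumptions: the thesis works with multi-dimensional embeddings and its estimator for Var(Y|X) is not described in these notes, so the binned estimator `conditional_std_binned` and the synthetic data are purely illustrative.

```python
import random
from math import sqrt
from statistics import pvariance


def conditional_std_binned(xs, ys, n_bins=20):
    """Estimate sqrt(Var(Y|X)) as the sample-weighted within-bin variance of y.

    Assumes x lies in [0, 1); bins with fewer than two samples are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        bins[min(int(x * n_bins), n_bins - 1)].append(y)
    usable = [b for b in bins if len(b) >= 2]
    total = sum(len(b) for b in usable)
    within = sum(len(b) * pvariance(b) for b in usable) / total
    return sqrt(within)


def rmse(preds, ys):
    return sqrt(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))


# Synthetic data: y depends on x plus noise that no predictor can remove.
rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

floor = conditional_std_binned(xs, ys)
good_rmse = rmse([2.0 * x for x in xs], ys)  # predictor equal to E[Y|X]
poor_rmse = rmse([1.0 for _ in xs], ys)      # constant predictor that ignores x

gap_good = good_rmse - floor
gap_poor = poor_rmse - floor
```

The predictor that uses x sits close to the floor even though its RMSE is not zero, while the constant predictor leaves a large gap. That gap, not the absolute RMSE, is what signals unrecovered cross-modal information.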

Mapping	C-MAM Neutral Recall	MMIN Neutral Recall	RedCore Neutral Recall	H2.1	H2.2
A→V	0.477	0.330	0.091	Yes	Yes
AT→V	0.610	0.588	0.441	Yes	Yes
V→A	0.116	0.047	0.009	Yes	Yes
AV→T	0.429	0.408	0.242	Yes	Yes
T→A	0.422	0.432	0.378	Yes	Yes
TV→A	0.467	0.541	0.438	No	Yes

MSP-IMPROV. H2.1: C-MAM ≥95% parity in 5/6 mappings. H2.2: RedCore underperforms MMIN on neutral recall in all 6/6 mappings.
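The H2.1 and H2.2 columns can be re-derived from the neutral-recall values in the table above. The sketch below assumes "≥95% parity" means C-MAM recall is at least 0.95 times MMIN recall, and that "underperforms" means strictly lower recall; both readings are assumptions, though they reproduce the reported 5/6 and 6/6 counts.

```python
# Neutral recall on MSP-IMPROV, copied from the table above.
neutral_recall = {
    "A->V": {"C-MAM": 0.477, "MMIN": 0.330, "RedCore": 0.091},
    "AT->V": {"C-MAM": 0.610, "MMIN": 0.588, "RedCore": 0.441},
    "V->A": {"C-MAM": 0.116, "MMIN": 0.047, "RedCore": 0.009},
    "AV->T": {"C-MAM": 0.429, "MMIN": 0.408, "RedCore": 0.242},
    "T->A": {"C-MAM": 0.422, "MMIN": 0.432, "RedCore": 0.378},
    "TV->A": {"C-MAM": 0.467, "MMIN": 0.541, "RedCore": 0.438},
}


def h21_parity(row, threshold=0.95):
    """Assumed H2.1 test: C-MAM reaches at least `threshold` of MMIN's recall."""
    return row["C-MAM"] >= threshold * row["MMIN"]


def h22_underperforms(row):
    """Assumed H2.2 test: RedCore recall is strictly below MMIN recall."""
    return row["RedCore"] < row["MMIN"]


h21_passes = [name for name, row in neutral_recall.items() if h21_parity(row)]
h21_fails = [name for name, row in neutral_recall.items() if not h21_parity(row)]
h22_passes = [name for name, row in neutral_recall.items() if h22_underperforms(row)]
```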

What this shows

C-MAMs lead on recovery, minority recall, and calibration.
Capacity does not predict bound saturation.
Capacity can harm confidence.

Why it matters

Miscalibrated confidence is a liability.
C-MAMs keep confidence honest.
Calibration belongs in reconstruction evaluation.

RQ3: Condensed Results

Three sub-questions: information recovery, complexity vs generalisation, and calibration fidelity. C-MAMs lead on all three.

▶ RQ3.1: C-MAM Best

IEMOCAP is the strongest H1 result: C-MAM satisfies the conditional variance bound in 5 of 6 mappings. MMIN satisfies 2, RedCore 1. On MSP-IMPROV, C-MAM satisfies 2; MMIN and RedCore fail all 6. Decoder complexity is not the constraint.

▶ RQ3.2: Complexity Hurts

MSP-IMPROV: C-MAM reaches ≥95% parity with MMIN in 5/6 mappings; RedCore underperforms MMIN in all 6. IEMOCAP: C-MAM wins only 1/6, but RedCore underperforms in all 6. MOSEI: C-MAM wins 4/6; RedCore collapses to near-zero recall on text-absent conditions. H2.2 is fully confirmed: capacity undermines minority-class generalisation.

▶ RQ3.3: Calibration

On MOSEI A→T, C-MAM predictions track the reliability diagonal. MMIN and RedCore deviate toward overconfidence: high predicted confidence even when predictions are wrong. This pattern holds across MOSEI and IEMOCAP. MSP-IMPROV is the mixed case where RedCore's overparameterisation incidentally smooths the signal.

Transition

Centralised robustness holds.

Does it survive federated deployment?

RQ4: Federated and Decentralised Reconstruction

RQ4: Federated and Decentralised Reconstruction · RQ4

Research question

Can modular reconstruction methods be adapted effectively for robust multimodal learning within incongruent federated systems with heterogeneous modality availability and local data access constraints?

Federated multimodal models can be trained under modality incongruence.

Research answer

FedC-MAMs improve client-level robustness where degradation is meaningful and reduce communication cost relative to monolithic baselines; they are scalable and behaviourally effective.

RQ4: Results

RQ4: Federated and Decentralised Reconstruction · RQ4 Results

RQ4: Key findings

FedC-MAMs restore substantial performance for modality-limited clients and consume ~50% of MMIN's energy and ~17% of RedCore's; modularity achieves both robustness and communication efficiency together.

RQ4.1 — Performance Under Incongruency

RQ4.2 — Communication Efficiency

MOSEI: Has0 Accuracy

Client	Base	FedC-MAMs	Δ
A only	0.251	0.693	+0.442
V only	0.343	0.661	+0.318
AV	0.403	0.668	+0.265
T only	0.667	0.730	+0.063
AT	0.740	0.763	+0.023
TV	0.717	0.751	+0.034

MSP-IMPROV: F1 Weighted

Client	Base	FedC-MAMs	Δ
A only	0.362	0.396	+0.034
V only	0.290	0.427	+0.137
AV	0.500	0.502	+0.002
T only	0.294	0.465	+0.171
AT	0.570	0.442	−0.128
TV	0.419	0.593	+0.174

MOSEI cumulative communication cost over rounds

What this shows

Modular reconstruction works in federated settings.
Gains scale with degradation severity.
Not unconditional: reconstruction can hurt when degradation is already low.

Why it matters

Robustness and efficiency together. Not a trade-off.
Selective aggregation cuts cumulative energy by 50–83%.
Practical federated robustness without end-to-end overhead.

RQ4: Condensed Results

Two sub-questions: federated performance recovery and communication cost. Both answered positively with an important caveat on RQ4.1.

▶ RQ4.1: Performance Recovery

MOSEI: audio-only clients go from 0.251 to 0.693, a gain of +0.442. Video-only +0.318. AV +0.265. All statistically significant. Text-dominant clients gain marginally because the baseline was already high. MSP-IMPROV: most groups recover. The AT configuration is the boundary case: reconstruction is detrimental (−0.128) where two already-informative modalities are present and baseline degradation is low. Reconstruction is not unconditionally beneficial.

▶ RQ4.2: Communication Costs

MOSEI: FedC-MAMs 4.03 kWh vs MMIN 8.19 vs RedCore 19.70. MSP-IMPROV: 6.63 vs 16.78 vs 41.14. Clients transmit only modules for their active modalities. Cost scales with the client's modality set, not global model size. Robustness and efficiency are achieved together.
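The "clients transmit only modules for their active modalities" idea can be sketched as per-module averaging on the server. This is a simplified illustration, not the FedC-MAMs implementation: it uses an unweighted mean where FedAvg would normally weight by client sample count, and the module names, parameter vectors, and helper functions are hypothetical.

```python
def aggregate_modules(client_updates):
    """Average each module over only the clients that transmitted it."""
    sums, counts = {}, {}
    for update in client_updates:
        for name, weights in update.items():
            if name not in sums:
                sums[name] = [0.0] * len(weights)
                counts[name] = 0
            sums[name] = [s + w for s, w in zip(sums[name], weights)]
            counts[name] += 1
    return {name: [s / counts[name] for s in total] for name, total in sums.items()}


def parameters_sent(update):
    """Number of scalar parameters a client uploads in one round."""
    return sum(len(weights) for weights in update.values())


# Hypothetical clients with different modality sets and tiny parameter vectors.
audio_only_client = {"audio": [1.0, 2.0]}
audio_video_client = {"audio": [3.0, 4.0], "video": [10.0]}
text_only_client = {"text": [0.5, 0.5, 0.5]}

global_modules = aggregate_modules(
    [audio_only_client, audio_video_client, text_only_client]
)
upload_sizes = [
    parameters_sent(c)
    for c in (audio_only_client, audio_video_client, text_only_client)
]
```

Each client's upload size depends only on the modules it holds, which is the mechanism behind cost scaling with the client's modality set instead of the global model size.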

Thesis Contributions

Thesis: Contributions · Conclusion

Thesis claim

Missing-modality robustness is a behavioural, architectural, and systems-level design challenge that can be mitigated post-training through modular reconstruction without retraining or architectural modification.

Thesis conclusion

Practical robustness to missing modalities can be achieved through simple, modular, post-training reconstruction that respects deployed architectures and scales naturally across centralised and decentralised environments.

Research Questions: Resolved

RQ1

How and to what extent does model performance change when an entire modality is unavailable, and how reliably can this be anticipated?

Degradation is substantial, asymmetric, and partially predictable; attribution signals do not reliably reflect functional dependence, and modality reliance emerges early in training.

RQ2

To what extent can modular post-training reconstruction mitigate missing-modality performance degradation?

C-MAMs recover a substantial fraction of lost performance with shallow networks on limited data; recovery correlates with inter-modality structure, not reconstruction geometry.

RQ3

How do reconstructed embeddings affect model behaviour, and how do methods compare in fidelity and calibration?

Reconstruction improves behaviour but geometric fidelity and calibration diverge systematically; lightweight decoders can achieve stronger behavioural stability than high-capacity alternatives.

RQ4

Can modular reconstruction be adapted for federated systems with heterogeneous modality availability?

FedC-MAMs improve client-level robustness where degradation is meaningful and reduce communication cost relative to monolithic baselines; they are scalable and behaviourally effective.

Thesis Contributions

The thesis claim is that missing-modality robustness is a behavioural, architectural, and systems-level design challenge that can be mitigated post-training through modular reconstruction, without retraining or architectural modification. Each research question addressed one layer of that claim.

▶ Thesis Conclusion

Practical robustness to missing modalities can be achieved through simple, modular, post-training reconstruction that respects deployed architectures and scales across both centralised and decentralised environments. The key insight across all four research questions is the same: the constraints are in the base model, not in the recovery mechanism. Understanding where those constraints are, and why, is what this thesis provides.

▶ RQ1

Degradation is substantial, asymmetric, and partially predictable. Attribution signals do not reliably reflect functional dependence. Modality reliance emerges early in training and stabilises, which means it can be observed before deployment, not just after failure.

▶ RQ2

C-MAMs recover a substantial fraction of lost performance with shallow networks trained on limited data. Recovery correlates with inter-modality structure in the base model, not with reconstruction geometry. The ceiling is a property of how the model was originally trained.

▶ RQ3

Reconstruction architecture shapes decision behaviour beyond accuracy. Lightweight modular decoders outperform complex end-to-end alternatives in calibration and minority-class generalisation. Behavioural fidelity is not a function of capacity; it is a function of encoder-decoder alignment.

▶ RQ4

FedC-MAMs improve client-level robustness in heterogeneous federated settings, with substantially lower communication and energy cost than monolithic alternatives. The modularity that makes C-MAMs practical in centralised settings scales naturally to decentralised environments.

Publications

Contributions

Thesis Related

Directly related to the work presented in this thesis

ACM TIST: Journal (2025)

Geraghty, Hines & Golpayegani. "Learning to Associate: Multimodal Inference with Fully Missing Modalities". In: ACM Trans. Intell. Syst. Technol. 16.5. DOI: 10.1145/3746456

ECAI: MRC Workshop (2023)

Geraghty, Hines & Golpayegani. "Understanding the Relevancy of Modality Information in Multimodal Machine Learning". In: Modelling and Representing Context (MRC), ECAI.

Behavioural Failures in Multimodal Models Under Missing Modalities

Under Review

Journal paper. Under review with Information Fusion

Interpreting the Behaviour of Reconstructed Modalities

Under Review

Journal paper. Details to be confirmed

Reproducibility

All model, training, and evaluation code is public on GitHub.
Datasets are public and used as-released; same versions, same splits.
Only AVMNIST and Kinetics-Sounds need preprocessing: audio to spectrograms.

Non-Thesis Related

Not directly related to the work presented in this thesis

ACM MMSys (2026)

Geraghty, Golpayegani & Hines. "Audio Made Simple: A Modern Framework for Audio Processing". In: Proc. ACM Multimedia Systems Conference 2026, pp. 436–442. DOI: 10.1145/3793853.3799811

Springer Book Chapter (2026)

Geraghty et al. "Traffic Flow Breakdown Prediction for the M50 Motorway in Ireland". In: Transport Transitions: Advancing Sustainable and Inclusive Mobility. Springer Nature Switzerland, pp. 514–520.

IEEE Access: Journal (2022)

Golpayegani et al. "Intelligent Shared Mobility Systems: A Survey on Whole System Design Requirements, Challenges and Future Direction". In: IEEE Access 10, pp. 35302–35320. DOI: 10.1109/ACCESS.2022.3162848

ACM MMSys (2022)

Geraghty et al. "AQP: an open modular Python platform for objective speech and audio quality metrics". In: Proc. 13th ACM Multimedia Systems Conference, pp. 191–196. DOI: 10.1145/3524273.3532885

Looking Back and Looking Forward

Thesis: Conclusion · Limitations & Future Work

Looking Back

There is always something more that could have been done

Controlled conditions only

Complete modality absence at inference. Partial degradation and intermittency are out of scope.

Embedding-level reconstruction only

Operates on learned representations. Does not extend to raw signal recovery or generation.

No temporal modelling

Single-instance prediction only. Streaming and event-driven failure modes are not addressed.

Empirical regularities, not formal guarantees

Fidelity established empirically. Error bounds and decision-theoretic guarantees remain open.

Federated proof-of-concept

Fixed client sets, IID partitions, FedAvg. Non-IID drift and client churn not addressed.

Looking Forward

There is always something more to do

Reconstructable representation design

Robustness ceiling is in the encoder, not the decoder. Cross-modal substitutability must be an explicit training objective.

Multimodal federated learning under incongruence

Non-IID drift, varying incongruency, client churn. Privacy and trustworthiness co-designed with reconstruction, not added post-hoc.

Trustworthiness

Reliability bounds on when reconstructed embeddings can be trusted.

Privacy & security

Reconstruction may recover sensitive attributes never explicitly shared.

Behavioural evaluation

Calibration, class-conditional fidelity as standard criteria. Not geometric similarity.

Multimodal learning is promised as the better approach: more context, more robustness. Making that promise real is hard work. This thesis, I hope, brings us one step closer to that ideal.

Looking Back and Looking Forward

▶ Left column, Looking Back

This slide is about scope: what the thesis does and does not claim. Being precise here matters.

Controlled conditions only: Everything in this thesis targets complete modality absence at inference time. Partial degradation, intermittency, and temporal failure are distinct problems.

Embedding-level reconstruction only: C-MAMs operate on learned representations, not raw signals. Embedding-level recovery has different failure modes than generative modality recovery.

No temporal modelling: Every prediction is made from a single multimodal instance.

Empirical regularities: Formal error bounds and decision-theoretic guarantees remain open.

Federated proof-of-concept: Fixed client sets, IID partitions, FedAvg. The federated results show that the approach transfers; they are not a full evaluation of federated robustness.

▶ Right column, Looking Forward

Trustworthiness: Reconstruction must be integrated with reliability assessment, not treated as a standalone tool.

Privacy: Effective reconstruction may recover sensitive attributes never explicitly shared.

Behavioural evaluation: Accuracy and geometric similarity are not sufficient. Calibration, confidence, and class-conditional criteria must become standard.

Reconstructability by design: Cross-modal substitutability must be an explicit representation learning objective.

Federated robustness: Reconstruction, aggregation, privacy leakage, and client heterogeneity interact in ways not yet well understood.

▶ Closing statement

The goal is multimodal learning whose robustness is not assumed but empirically justified and socially defensible. That is what this thesis moves toward.

Thank You

Thank you

How Did We Get Here?

A Personal Note · Origin & Reflections

The Brief

Emergency Response · Smart Cities · ITS

IoT sensor fusion, audio and video processing, real-time decision support. Applied and systems-level.

→

Where It Ended Up

Missing Modalities · C-MAMs · Behavioural Fidelity · FedC-MAMs

The ITS context anchors the federated chapter. The focus narrowed to the foundational problem: complete modality availability is assumed, rarely questioned, and often wrong.

How the questions formed

RQ1: What does missing look like?

Papers treated missing-modality results as a footnote. The gap was worth examining properly.

RQ2: Simple and post-hoc

Existing methods modified training. Small networks on frozen representations offered a cleaner separation and matching performance.

RQ3: Accuracy is not enough

Calibration and class-conditional behaviour told a different story than accuracy alone.

RQ4: The system as a whole

Federated settings with heterogeneous sensors were always the real context. Nobody had treated it as a unified problem.

What I learned along the way

Writing

Improving the writing improved the thinking. They turned out to be the same process.

Engineering

Code was never the problem. Framing and explaining the work was.

Questioning

Every significant result started with something that looked a bit off and was worth following.

Communication

Knowing when not to speak matters as much as knowing what to say.

Failure

Rejections came. Learning not to take them personally takes time.

How Did We Get Here?

Informal slide. Show the human arc of the research: where it started, how each question developed, and what the process taught.

▶ The Brief → Where It Ended Up

The original project was grounded in ITS: smart city infrastructure, cross-platform sensor fusion, emergency response decision support. That context never left the thesis; it motivates the whole work and returns as the central focus in the federated chapter. But the direction narrowed: the systems-level application revealed a foundational ML assumption that most multimodal work simply takes for granted.

▶ RQ1

Reading the literature on multimodal performance, there was a consistent gap: missing modality results were either absent or treated as a footnote. Separately, looking at modality information balance led toward the same problem from a different angle. The question of what actually happens when a modality disappears had not been answered carefully.

▶ RQ2

Every reconstruction method modified the base model during training. The insight, partly drawn from thinking about synaesthesia and cross-sensory association, was that stable learned representations could be associated post-hoc by a small dedicated network. That separates the reconstruction problem from the base model entirely. The secondary question: did the field's reliance on high-capacity architectures actually help? It did not.

▶ RQ3

Having a method requires testing it rigorously against alternatives and probing what it actually does, not just what accuracy it achieves. The behavioural analysis revealed that models claiming strong reconstruction did not always behave consistently with that claim. Calibration and class-conditional behaviour told a different story.

▶ RQ4

The federated setting was always where the ITS motivation pointed. The field had studied federated learning and multimodal learning separately; missing modalities in federated systems with heterogeneous sensor availability had not been treated as a unified problem. This thesis treats them as one.

▶ What I learned

Writing: The lightbulb moment in the final year. Writing and thinking are the same process, and learning to write well is inseparable from learning to think clearly about the work.

Engineering: Technical ability is necessary but not sufficient. The PhD added the layer above: how to frame and communicate what you built. That framing discipline changes what gets built and how.

Questioning: The most important habit. Every significant result in the thesis came from noticing something that did not look quite right and following it. That applies to the literature, to established tools, and to your own assumptions.

Communication: Research is collaborative. Knowing when to speak and when to let someone else speak is a skill that takes time to develop. It is not about having the most to say.

Failure: Rejection is standard. The skill is in distinguishing feedback that is worth incorporating from feedback that is not, and in not conflating criticism of the work with criticism of the person. That distinction takes years to learn.