The paper introduces a multi-modal heterogeneous graph approach to Fact-based Visual Question Answering, using a modality-aware graph convolutional network to select and aggregate relevant evidence across modalities; the method achieves state-of-the-art performance while keeping the reasoning process interpretable.
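To make the idea concrete, below is a minimal sketch (not the paper's implementation) of what a modality-aware graph convolution layer could look like: each node carries a modality tag, incoming messages are transformed with modality-specific weights, and per-edge attention scores gate how much evidence each neighbor contributes before the node state is updated. All names, shapes, and the choice of PyTorch are illustrative assumptions.

```python
# Sketch of a modality-aware graph convolution layer (assumed design, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAwareGraphConv(nn.Module):
    """One message-passing layer over a heterogeneous multi-modal graph.

    Assumed inputs (hypothetical names):
      x        - (N, D) node features
      modality - (N,) integer modality id per node (e.g. 0=visual, 1=semantic, 2=fact)
      edges    - (2, E) source/target node indices
    """

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # One linear transform per source modality.
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))
        # Attention over concatenated (target, transformed source) features.
        self.attn = nn.Linear(2 * dim, 1)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x, modality, edges):
        src, dst = edges
        # Transform each message with the weight of its source node's modality.
        msgs = torch.stack([t(x[src]) for t in self.transforms], dim=0)  # (M, E, D)
        msgs = msgs[modality[src], torch.arange(src.size(0))]            # (E, D)

        # Edge-level attention: relevance of this piece of evidence to the target node.
        scores = self.attn(torch.cat([x[dst], msgs], dim=-1)).squeeze(-1)  # (E,)
        weights = torch.exp(scores - scores.max())                          # stable exp

        # Softmax-normalise per target node, then aggregate weighted messages.
        num_nodes, dim = x.size()
        denom = torch.zeros(num_nodes, device=x.device).index_add_(0, dst, weights) + 1e-9
        agg = torch.zeros(num_nodes, dim, device=x.device)
        agg.index_add_(0, dst, msgs * (weights / denom[dst]).unsqueeze(-1))

        # Update node states with the aggregated cross-modal evidence.
        return F.relu(self.update(torch.cat([x, agg], dim=-1)))


if __name__ == "__main__":
    # Toy graph: 4 nodes across 3 modalities, 4 directed edges.
    x = torch.randn(4, 16)
    modality = torch.tensor([0, 0, 1, 2])
    edges = torch.tensor([[0, 1, 2, 3],
                          [2, 2, 3, 0]])
    layer = ModalityAwareGraphConv(dim=16)
    print(layer(x, modality, edges).shape)  # torch.Size([4, 16])
```

The edge-level attention weights double as an interpretability hook: inspecting which visual, semantic, or fact nodes receive the highest weights shows which pieces of evidence drove the answer.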