Abstract
Large-scale, high-throughput technologies and genome-wide studies have been pivotal in identifying disease-gene candidates from patient cohorts. These studies often produce large lists of candidate genes, so there is a pressing need for computational tools that integrate heterogeneous data and prioritize disease-gene candidates for further experimental investigation. To address this need, we propose a computational pipeline for the prioritization of disease-gene candidates. Our pipeline integrates heterogeneous data, including gene expression, protein-protein interaction networks, ontology-based similarity, and betweenness measures. Furthermore, we incorporate tissue-specific gene expression data into the evaluation of our approach. We applied the pipeline to Alzheimer's Disease (AD), generating a list of 31 prioritized genes. The approach correctly identified two key AD susceptibility genes, INPP5D and PSEN1. Biological process enrichment analysis revealed that the prioritized genes are enriched in processes modulated in AD pathogenesis, including regulation of neurogenesis and generation of neurons. KEGG pathway analysis identified significant hub involvement in the Neurotrophin signaling and Huntington Disease pathways. Furthermore, our evaluation demonstrated relatively high predictive performance (AUC: 0.73) when classifying AD and normal gene expression profiles from individuals using leave-one-out cross-validation. This work provides a foundation for future investigation of heterogeneous data integration for disease-gene prioritization.