Statistical Analysis of Patient-Derived Sequences Discovers Biologically Significant Insights in Highly Mutable Viruses
2pm
Room 5564 (Lifts 27-28), 5/F Academic Building, HKUST

Examination Committee

Prof Fangzhen LIN, CSE/HKUST (Chairperson)
Prof Matthew MCKAY, ECE/HKUST (Thesis Supervisor)
Prof I-ming HSING, CBME/HKUST (Thesis Co-Supervisor)
Prof Leo Lit Man POON, Division of Public Health Laboratory Sciences, The University of Hong Kong (External Examiner)
Prof Bertram E SHI, ECE/HKUST
Prof Weichuan YU, ECE/HKUST
Prof Xuhui HUANG, CHEM/HKUST

Abstract

Advances in fast DNA sequencing technologies over the last decade have opened up unprecedented opportunities to explore diverse questions in biomedical research. This thesis applies tools from statistics and statistical signal processing to sequences of viral proteins in order to uncover novel immunological and biochemical (structural or functional) insights. Central to our approach is robust correlation matrix estimation for high-dimensional data, drawing upon concepts from random matrix theory (RMT).

In the first main part of this work, an RMT-based “noise cleaning” correlation matrix estimator is employed to reveal sites with immunological significance in a protein of the hepatitis C virus (HCV). There is no working vaccine against HCV, mainly because of its extreme variability, which enables it to evade immune surveillance. Despite this variability, our statistical approach reveals a novel group of “multi-dimensionally conserved sites” that may be highly susceptible to immune pressure, as the virus appears to resist simultaneous mutations at these positions. This approach demonstrates for the first time the existence of such a vulnerable part of the HCV genome. Our results are corroborated by clinical evidence available in the literature. We propose two novel vaccine designs which aim to preferentially drive mutations in the identified vulnerable region, while also leveraging relevant population-level human immune statistics to provide broad coverage.
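As an illustration of what RMT-based noise cleaning of a correlation matrix can look like, the following is a minimal sketch of one common variant: eigenvalue clipping at the Marchenko-Pastur upper edge. It is not necessarily the estimator used in the thesis. The function name `clean_correlation`, the numeric encoding of sequences as rows of `X`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def clean_correlation(X):
    """Sketch of eigenvalue clipping (hypothetical helper, not the thesis code).

    X is an (n_sequences, n_sites) numeric matrix with no constant columns.
    Returns the cleaned correlation matrix and the number of retained eigenmodes.
    """
    n, p = X.shape
    # Standardize each site so that C below is a sample correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    C = (Z.T @ Z) / n
    eigvals, eigvecs = np.linalg.eigh(C)
    # Marchenko-Pastur upper edge for uncorrelated data with aspect ratio p/n.
    lam_max = (1.0 + np.sqrt(p / n)) ** 2
    keep = eigvals > lam_max
    # Rebuild the matrix from the eigenmodes that rise above the noise bulk.
    C_clean = (eigvecs[:, keep] * eigvals[keep]) @ eigvecs[:, keep].T
    # Restore the unit diagonal expected of a correlation matrix.
    np.fill_diagonal(C_clean, 1.0)
    return C_clean, int(keep.sum())

# Synthetic stand-in data (assumption): 500 "sequences", 50 "sites",
# with the first 10 sites sharing a common factor.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))
X[:, :10] += 2.0 * rng.standard_normal(500)[:, None]

C_clean, n_modes = clean_correlation(X)
print("retained eigenmodes:", n_modes)
print("mean cleaned correlation in planted group:", C_clean[:10, :10].mean())
```

In this toy setting, the planted group of co-varying sites survives the cleaning step while most of the sampling noise among the remaining sites is suppressed.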

In the second main part of this work, we focus on extracting biochemical insights from viral sequence data. The aim is to infer groups of sites which are involved in certain structural or functional characteristics. It is shown that the RMT-based noise cleaning method successfully predicts protein sites with biochemical significance, while the existing state-of-the-art methods (which have shown success for non-viral proteins) generally fail. The prediction accuracy of the method is demonstrated for proteins of HCV and the human immunodeficiency virus (HIV), another highly mutable virus. In both cases, while demonstrating a strong ability to distinguish biologically important sites from seemingly less important ones, the proposed method generally fails to identify “distinct” groups of sites which associate with distinct function or structure. To tackle this problem, a robust method is proposed which, in addition to using RMT concepts, exploits the embedded sparsity in the problem using a suitably tailored sparse principal component analysis technique. For multiple proteins of HCV and HIV, this approach identifies multiple distinct groups of sites, each associated with a specific structural or functional property. This is the first time that statistical analysis based on sequence data alone has been employed to successfully reveal the “modular” structure of viral proteins.
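To make the idea of sparsity-driven grouping concrete, here is a minimal sketch that extracts sparse principal components from a correlation matrix using a truncated power iteration with Hotelling deflation. This is a generic sparse PCA technique chosen for illustration, not the tailored method developed in the thesis. The function names, the fixed group size `k`, and the toy correlation matrix are assumptions.

```python
import numpy as np

def sparse_pc(C, k, n_iter=100):
    """One sparse principal component with at most k nonzero entries.

    Truncated power iteration, started from the ordinary leading eigenvector.
    Hypothetical helper written for this sketch.
    """
    v = np.linalg.eigh(C)[1][:, -1]
    for _ in range(n_iter):
        w = C @ v
        # Zero out everything except the k largest-magnitude entries.
        w[np.argsort(np.abs(w))[:-k]] = 0.0
        norm = np.linalg.norm(w)
        if norm == 0.0:
            break
        v = w / norm
    return v

def sparse_groups(C, k, n_groups):
    """Return n_groups site groups (index arrays), one per sparse component."""
    C = C.copy()
    groups = []
    for _ in range(n_groups):
        v = sparse_pc(C, k)
        groups.append(np.flatnonzero(v))
        # Hotelling deflation: remove the variance explained by v.
        C -= (v @ C @ v) * np.outer(v, v)
    return groups

# Toy correlation matrix (assumption): 20 sites with two independent
# co-varying groups, sites 0-4 (correlation 0.8) and sites 10-14 (0.5).
C = np.eye(20)
C[:5, :5] = 0.8
C[10:15, 10:15] = 0.5
np.fill_diagonal(C, 1.0)

groups = sparse_groups(C, k=5, n_groups=2)
for i, g in enumerate(groups, start=1):
    print(f"group {i}: sites {g.tolist()}")
```

Because each component is forced to be sparse, every group involves only a few sites, which is what allows separate groups to be read off as separate "modules" in this toy example.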

In the third main part, a simulation model is presented that provides a cohesive statistical ground-truth (GT) understanding of the results obtained using the developed methods. Specifically, these GT tests show the enhanced robustness and decoupling power of the sparsity-based method as compared to the method presented in the first part.

Speaker/Performer:
Mr Ahmed Abdul QUADEER
Language
English