Project Title: A Recursive Formulation for a Rank Sum Statistic Used to Detect Genomic Copy Number Variation
Student: Nam Tran Hoang
Mentor: Dr. Craig Jackson

Copy number variation (CNV) results from amplifications and deletions of genomic DNA and is often associated with cancer and other diseases, in which it can serve as a mechanism for overcoming restrictions on cell growth. Detecting and characterizing CNV is a major goal of cancer research. A variety of methods have been proposed to detect CNV, many of which use commercially available DNA microarrays (e.g., Affymetrix chips). Recently, a rank-based method has been developed that is applicable across a variety of DNA microarrays and is flexible enough to detect multiple molecular phenomena: somatic lesions, germline deletions, and germline gains [1].

This rank-based method involves a rank comparison of a DNA sample across multiple chromosomal markers and against multiple control samples. The CNV of the test sample is then assessed by a statistical comparison of the sample's “rank sum” against a null distribution, which is derived under the hypothesis that the sample shows no CNV. As such, the accuracy of this method depends, to a large degree, on an accurate representation of the null distribution. However, the exact null distribution is difficult to compute and, so far, has only been approximated [1].

This study rigorously proves several recursive formulations for the exact null distribution of the rank sum statistic. The approximate null distribution currently employed is also shown to overemphasize values near the mean, which can lead to overestimation of any CNV present in the test sample and, possibly, to false positives. Hence, the use of these recursive formulations improves the ability of rank-based methods to detect CNV.
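
To make the idea of a recursively computed exact null distribution concrete, the following is a minimal sketch and not the formulation proved in this study. It assumes, purely for illustration, that under the null hypothesis the rank at each marker is independent and uniform on 1..num_ranks, so that the distribution of the sum over m markers follows from the distribution over m - 1 markers. The function names rank_sum_null and upper_tail are hypothetical.

```python
from fractions import Fraction


def rank_sum_null(num_markers, num_ranks):
    """Exact distribution of a sum of independent uniform ranks.

    Illustrative assumption (not taken from the study): each marker
    contributes a rank that is uniform on 1..num_ranks, independently
    of the other markers.
    Returns a dict mapping each possible sum to its exact probability.
    """
    # Base case: the sum over zero markers is 0 with probability 1.
    dist = {0: Fraction(1)}
    p = Fraction(1, num_ranks)
    # Recursive step: P_m(s) = (1 / num_ranks) * sum over r of P_{m-1}(s - r).
    for _ in range(num_markers):
        nxt = {}
        for s, prob in dist.items():
            for r in range(1, num_ranks + 1):
                nxt[s + r] = nxt.get(s + r, Fraction(0)) + prob * p
        dist = nxt
    return dist


def upper_tail(dist, observed):
    """Exact probability that the rank sum is at least `observed`."""
    return sum(prob for s, prob in dist.items() if s >= observed)


# Small example: 2 markers, ranks 1..3.
example = rank_sum_null(2, 3)
```

Exact rational arithmetic is used here so that tail probabilities carry no rounding error. For realistic numbers of markers a floating-point or array-based convolution would likely be preferable, and the statistic treated in the study may require a different recursion.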


[1] LaFramboise et al., A flexible rank-based framework for detecting copy number aberrations from array data, Bioinformatics 25:6 (2009), 722-728.