**Prostate Cancer: Essential Diagnostic and Therapeutic Considerations**

Paul Bradley and Philippe E. Spiess *Department of Urologic Oncology, Moffitt Cancer Center, Tampa, FL USA* 

## **1. Introduction**

The potential impact imparted by prostate cancer on our society is immense. In this chapter, we provide a general overview of current concepts, diagnostic advances, and novel therapeutic approaches to the management of prostate cancer. It is widely reported that prostate cancer is the most common malignancy diagnosed in U.S. males. The current lifetime risk of developing prostate cancer is 17%, with an estimated lifetime cancer-specific mortality risk of 3%. The diagnosis of prostate cancer rapidly increased in the mid-1980s with the advent of the serological biomarker prostate-specific antigen (PSA), which has played a pivotal role in the screening and early detection of prostate cancer worldwide.

From what was almost uniformly a poor-prognosis malignancy associated with highly morbid therapies, prostate cancer has emerged as a potentially curable disease with state-of-the-art diagnostic and therapeutic approaches. Advancements in the last 30 years have redefined the surgical management of localized prostate cancer, with minimally invasive approaches now for the most part serving as the primary surgical treatment. Additionally, radiotherapy techniques have been refined, increasing efficacy while limiting toxicity to adjacent organs. New systemic and vaccine therapies have most recently emerged as effective approaches to advanced disease.

## **2. Prostate cancer screening**

Prostate cancer screening has evolved since the initial introduction of serum PSA into our screening armamentarium. The discovery of PSA in the 1980s, along with an appreciation of its prognostic significance, has greatly impacted patient education and surveillance recommendations. Over the last two decades, a stage migration has occurred in favor of small-volume, localized disease, which we believe is in large part a direct consequence of the utilization of serum PSA screening. Between 1986 and 1999, there was a dramatic reduction in the incidence of locally advanced, high-volume disease for similarly proposed reasons. The Prostate, Lung, Colorectal, and Ovarian (PLCO) cancer trial of the National Cancer Institute (NCI) was designed to evaluate the effectiveness of prostate cancer screening and accrued patients between 1993 and 2001. The study demonstrated a 22% increase in the detection of prostate cancer at 7 years and a 17% increase at 10 years in the patient cohort undergoing annual digital rectal examination (DRE) and PSA testing. Despite these findings, the number of deaths between the two study arms was not statistically significant, suggesting that screening did not impart a survival benefit with short- to intermediate-term follow-up. The European Randomized Study of Screening for Prostate Cancer (ERSPC) was a similarly designed prospective randomized trial that enrolled 162,387 men. The authors concluded that to prevent a single death from prostate cancer, 1,410 men needed to be screened and 48 needed to be treated; nevertheless, this was a positive study in terms of demonstrating the survival benefit imparted by prostate cancer screening.
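The number-needed-to-screen figure quoted above is simply the reciprocal of the absolute risk reduction. A minimal sketch of the arithmetic, assuming ERSPC's published absolute reduction of roughly 0.71 prostate cancer deaths per 1,000 men screened (a figure taken from the trial report, not stated in this chapter):

```python
# Number needed to screen (NNS) is the reciprocal of the absolute risk
# reduction (ARR).  The 0.71-per-1,000 figure is an assumption drawn from
# the published ERSPC report.
arr = 0.71 / 1000        # absolute risk reduction per man screened
nns = 1 / arr            # men screened to prevent one prostate cancer death
print(round(nns))        # ~1408, consistent with the ~1,410 quoted above
```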

Screening using PSA has evolved in recent years through the study of various PSA isoforms and the exploration of new serological and urinary biomarkers. The evaluation of unbound PSA, or free PSA, has been proposed in an attempt to improve cancer screening efforts while reducing the number of patients undergoing transrectal prostatic biopsies, with their inherent risks, discomfort, and associated healthcare costs. Other novel markers have also been considered, including Prostate Cancer Antigen 3 (PCA3), which was first cited as a potential biomarker for prostate cancer in 1999. It differs from PSA in that it is acquired from the urine following a prostatic massage. Among prostate cancer patients, PCA3 has been found to be upregulated up to 66-fold in 95% of patients. Although a large validation study has yet to be completed, PCA3 appears to show promise as an adjunctive screening tool to PSA. Additionally, there is an isoform of PSA called proPSA that is mostly found within the peripheral zone of the prostate and within the circulatory system. Multiple variants of this isoform exist, with [-2]proPSA showing the most compelling results pertaining to screening for prostate cancer. [-2]proPSA is the most prevalent form found within prostate cancer cells. Elevations of this variant have also been found to directly correlate with the underlying Gleason score of the tumor. Despite these promising data, the clinical role of these biomarkers remains to be defined until more robust clinical studies validate their superiority versus total serum PSA.
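Free PSA is typically interpreted as a percentage of total PSA, with lower percentages suggesting a higher probability of cancer. A minimal sketch of the calculation; the ~25% biopsy threshold in the comment is a commonly cited literature value, not one given in this chapter:

```python
def percent_free_psa(free_psa, total_psa):
    """Percent free PSA: the fraction of circulating PSA that is unbound.

    Both inputs in ng/mL.  Lower percentages are associated with a higher
    probability of prostate cancer.
    """
    return 100.0 * free_psa / total_psa

# Example: total PSA 6.0 ng/mL with free PSA 0.6 ng/mL -> 10% free PSA.
# A commonly cited literature threshold (an assumption, not from this
# chapter) flags values below ~25% when total PSA is in the 4-10 ng/mL range.
print(percent_free_psa(0.6, 6.0))  # 10.0
```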

Evolving data pertaining to screening biomarkers and two recent large prospective clinical trials have fueled the controversy regarding screening guidelines and recommendations put forth by various medical governing bodies and organizations. According to the PLCO trial, harm associated directly with screening remains relatively low. DRE can lead to bleeding or significant discomfort/pain in 0.3 per 10,000 screened men. The blood procurement (i.e. phlebotomy) required in PSA screening imparts an associated risk of dizziness, bruising, or hematoma in 26.2 per 10,000 screened men along with 3 episodes of fainting per 10,000. Most importantly, complications from the resulting transrectal prostatic biopsy occurred in 68 per 10,000 cases which included infection, bleeding, and voiding difficulties. In addition, the overtreatment of clinically insignificant tumors which would likely never be biologically aggressive may for the most part only expose patients to the inherent risks and potential morbidities of localized prostate cancer treatments including incontinence and/or erectile dysfunction.

Four organizations currently provide some degree of guidance on the topic of screening. The American Urologic Association (AUA) and the American Cancer Society (ACS) both recommend giving men the option of screening starting at the age of 50, using the combination of PSA and DRE, and starting earlier in higher-risk men (e.g. patients with first- and/or second-degree relatives with prostate cancer, and certain racial/ethnic groups such as African Americans). The National Comprehensive Cancer Network (NCCN) has some of the most definitive screening guidelines, which include recommending a single total serum PSA test at the age of 40, followed by another PSA at the age of 45, with annual screening thereafter starting at the age of 50. Prostatic biopsies are recommended based on the NCCN guidelines when the total serum PSA is greater than 2.5 ng/mL. On the contrary, the United States Preventative Services Task Force (USPSTF) provides no screening guidelines, stating that present research on the subject is inconclusive; in fact, its position statement recommends against screening men over the age of 75.

## **3. Chemoprevention**

*Prostate Cancer – From Bench to Bedside*


Not surprisingly, along with the rise in the incidence of prostate cancer, interest in prostate cancer prevention has come to the forefront. The Selenium and Vitamin E Cancer Prevention Trial (SELECT) is the largest chemoprevention trial in prostate cancer, enrolling 35,534 participants randomized to receive one, both, or neither of the supplements. Several smaller trials had previously shown a possible reduction in the incidence of prostate cancer among patients taking these proposed chemoprevention agents; nevertheless, it was felt that a larger study was required to more definitively address this clinical question. Unfortunately, the final analysis of this study did not demonstrate a decrease in the risk of developing prostate cancer in either of the chemoprevention arms. In fact, a small increase in the diagnosis of prostate cancer was noted in the group taking vitamin E alone, in addition to an increase in the incidence of diabetes among those taking selenium. At this time, the men in this trial have been instructed to discontinue taking these supplements and will be followed for the next three years to evaluate the long-term sequelae of this chemoprevention trial.

The concept of chemoprevention of prostate cancer has gained popularity in the last decade with the results of two large prospective studies (PCPT and REDUCE trials). Both of these trials assessed the impact of selective androgen blockade through 5-α reductase inhibition in an attempt to prevent prostate carcinogenesis. The two trials differed in their design, with the PCPT trial recruiting healthy men with no increased risk of developing prostate cancer whereas the REDUCE trial included men with a higher relative risk of developing prostate cancer. Both studies were positive in that they demonstrated a reduction in the overall incidence of prostate cancer among men receiving an oral 5-α reductase inhibitor.

The PCPT trial evaluated the specific effects of finasteride, with men receiving this agent being 24.8% less likely to develop prostate cancer. The trial, however, revealed the alarming finding of an increased incidence of higher-grade tumors (Gleason 7-10) in the treatment arm. One proposed explanation is a change in the appearance of the prostatic cytoarchitecture and cellular morphology as a result of the hormonal effects of finasteride. Another theory is that finasteride selectively inhibits lower-grade tumors and shrinks the overall size of the prostatic gland, resulting in an overall increase in the number of higher-grade cancers detected within a now smaller gland.

The REDUCE trial differed from the PCPT trial in that the inclusion criteria were modified to target men with risk factors for developing prostate cancer, and a different 5-α reductase inhibitor (dutasteride) was used in the treatment arm. This trial included 8,200 men between 50 and 75 years of age. Men younger than 60 had a baseline total serum PSA between 2.5 and 10 ng/mL, whereas older men had a total serum PSA between 3.0 and 10 ng/mL. Additionally, the participants had a prostate biopsy within 6 months of enrollment demonstrating no evidence of cancer, atypical small acinar proliferation (ASAP), or high-grade prostatic intraepithelial neoplasia. Men were randomized to the placebo or dutasteride treatment arm. The study's primary endpoint was the risk of developing prostate cancer, with the dutasteride treatment arm demonstrating a 22.8% relative risk reduction and no increased incidence of higher Gleason grade tumors.

## **4. Patient risk stratification**

With the increased incidence of prostate cancer detected in recent years, a new therapeutic dilemma has arisen: clinicians have contemplated whether a subset of tumors are clinically insignificant and could be surveilled rather than undergo definitive local therapy. In this regard, risk stratification models have been proposed as a means to tailor treatment recommendations based on the biological aggressiveness and risk of progression of the tumor. The Partin tables, first proposed in 1993, provide a cancer-specific predictive model based on total serum PSA, biopsy Gleason score, and clinical stage. Patients and clinicians have benefited from the use of the Partin tables to counsel patients and tailor treatment recommendations. The D'Amico risk stratification is a simplified version of the Partin tables, dividing patients into one of three prognostic categories based on the same three pre-treatment diagnostic parameters (PSA, biopsy Gleason score, and clinical stage). Using the D'Amico risk groups, patients can be stratified into low-, intermediate-, and high-risk groups, serving as an important clinical tool for guiding therapeutic options and suitability for clinical trials.
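The D'Amico stratification described above can be sketched as a simple rule: a patient is assigned the highest risk group triggered by any one of the three parameters. The specific cutoffs below (PSA in ng/mL, biopsy Gleason score, clinical T stage) are the commonly published criteria and are assumptions not spelled out in this chapter:

```python
def damico_risk(psa, gleason, stage):
    """Classify a patient into a D'Amico risk group.

    Sketch using commonly published cutoffs (an assumption, not from this
    chapter): the patient falls into the highest group triggered by any
    single parameter.
    psa: total serum PSA in ng/mL; gleason: biopsy Gleason score (2-10);
    stage: clinical T stage, e.g. "T1c", "T2a", "T2b", "T2c".
    """
    high_stages = {"T2c", "T3", "T3a", "T3b", "T4"}
    if psa > 20 or gleason >= 8 or stage in high_stages:
        return "high"
    if psa > 10 or gleason == 7 or stage == "T2b":
        return "intermediate"
    return "low"  # PSA <= 10, Gleason <= 6, and stage T1c or T2a

print(damico_risk(4.5, 6, "T1c"))   # low
print(damico_risk(12.0, 6, "T2a"))  # intermediate
print(damico_risk(8.0, 8, "T2a"))   # high
```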

## **5. Treatment recommendations**

The treatment modalities suitable for localized prostate cancer continue to evolve and expand with novel technological advances. Dr. Walsh's pioneering work on the anatomical and surgical approach to radical prostatectomy resulted in a dramatic decline in the reported morbidity of this procedure. Thereafter, Schuessler et al. published the first description of a laparoscopic radical prostatectomy in 1997, with the first case performed in 1991. This technique was initially abandoned, however, due to its steep learning curve and long operating times, without a clear benefit imparted to patients. With advances in laparoscopic skills and instrumentation, this minimally invasive approach was resurrected before being integrated with modern robotic-assisted techniques. Today, nearly 85-90% of all radical prostatectomies are performed using a robotic-assisted technique. The questions currently being raised with this technology relate to the expense of the procedure and whether it in fact imparts a clear benefit versus open surgery in terms of cancer control, continence, and potency preservation.

Radiotherapy for localized prostate cancer has evolved by leaps and bounds. External beam radiation therapy (XRT), brachytherapy, and high-dose-rate (HDR) implant therapy can all be utilized as first-line treatment options for localized prostate cancer. Historically, XRT began as a four-field box technique fractionating doses over several weeks. This was later replaced by three-dimensional conformal radiation therapy and further refined using intensity-modulated radiation therapy to limit the dose/toxicity delivered to surrounding tissues while maximizing the therapeutically effective dose delivered to the prostate.

Interstitial brachytherapy can be performed using three different radioisotopes (iodine-125, palladium-103, and cesium-131). Each of these isotopes has a different half-life, although all three have reported similar cancer-specific outcomes. Interstitial brachytherapy as monotherapy has become a viable treatment alternative to radical prostatectomy or XRT among patients in the low to intermediate D'Amico risk groups. Using this technique, radioactive seeds are implanted within the prostate through the perineum, under transrectal ultrasound guidance and general anesthesia. Contraindications to interstitial brachytherapy include a prostate size greater than 60 grams, a previous transurethral resection of the prostate, or significant baseline voiding complaints. The prostate size limitation imparted by brachytherapy results from two facts: (1) large prostates cannot be completely treated due to interference by the pubic bone during implantation, and (2) patients with large prostates are at increased risk of post-implantation urinary retention. Despite these limitations, brachytherapy remains a popular treatment modality among low-risk D'Amico patients, with oncological outcomes comparable to other treatment modalities.

In addition to surgery and radiotherapy, novel ablative therapies including cryotherapy and high-intensity focused ultrasound (HIFU) have emerged as primary local treatments for prostate cancer. Cryotherapy can be used as a primary treatment for localized disease but remains most frequently used as a salvage local therapy. HIFU has gained increasing popularity in recent years as a potential new local treatment for prostate cancer. HIFU was first proposed as a treatment option for prostate cancer in 1999 and entails the use of ultrasonic waves administered through a transrectal ultrasound probe. This approach provides an alternative to more traditional modalities such as surgery or radiotherapy. Although not FDA approved in the United States, this technology has been used in many European countries, including France and Germany, for several years. Long-term data evaluating the oncological outcomes of HIFU and contrasting it with other currently available treatment modalities are lacking.

HIFU has also been evaluated as a form of focal therapy for prostate cancer. Unfortunately, data evaluating focal therapy in general have been disappointing. Up to 21% of patients treated by focal therapy may be undertreated according to a study by Katz et al. evaluating post-prostatectomy specimens. All patients in this study who met the original focal therapy criteria would have had a significant secondary tumor missed, with 58.3% of these patients having a final pathological Gleason score of ≥ 7 and 25% exhibiting a final pathological stage of pT3. Based on this and other compelling data, focal therapy appears to be an ineffective therapy in its current applications until better imaging modalities are developed for detecting clinically significant prostate cancer foci within an individual patient's prostate.

The use of neoadjuvant and adjuvant androgen deprivation therapy (ADT) has also been thoroughly investigated. The CaPSURE trial found a 2.6-fold increase in cardiovascular events in men treated with both ADT and RRP compared to RRP alone. Moreover, although preoperative ADT is conceptually appealing, it has never been shown to improve the cancer-specific outcomes of patients undergoing RRP (even among patients falling within the high-risk D'Amico group).

Active surveillance has gained increasing popularity as a treatment alternative for a subset of patients with prostate cancer over the past decade. This has likely resulted from the previously mentioned stage migration of prostate cancer, with a significant proportion of patients presenting with clinically "insignificant" prostate cancer. Debate continues over its exact therapeutic role and over which patients are ideally suited for active surveillance; nevertheless, its viability as a treatment alternative for a significant subset of prostate cancer patients is undeniable.

The management of locally advanced prostate cancer can constitute a therapeutic dilemma for clinicians. Treatment alternatives for these patients include: (1) XRT and adjuvant ADT for 2 to 3 years and (2) RRP +/- neoadjuvant/adjuvant chemohormonal therapy, preferably as part of a clinical trial. Recent data from the Memorial Sloan Kettering Cancer Center on the surgical management of locally advanced and high-risk prostate cancer have been encouraging. In this study, 4,708 post-prostatectomy men were retrospectively evaluated and categorized as high-risk based on eight different definitions. Depending on the definition used, between 3% and 38% of these patients were considered high-risk. Interestingly, 22-63% of these men had pathologically organ-confined cancer and a 5-year relapse-free probability of 49-80%. This suggests that men with traditionally high-risk disease may still be candidates for potentially curative surgical resection.

There have also been major advances in the management of hormone-refractory and metastatic prostate cancer. In this regard, docetaxel (Taxotere) remains the first-line agent for the management of hormone-refractory, metastatic prostate cancer. In addition, exciting new data pertaining to prostate cancer include vaccine therapy using sipuleucel-T, abiraterone, and cabazitaxel, all of which may potentially redefine our therapeutic approach to the management of advanced disease.

#### **6. Conclusions**

Significant advances have been made in the field of prostate cancer research and clinical/surgical care. Nevertheless, many unanswered questions remain. In addition, the true survival benefit imparted by prostate cancer screening remains a hotly debated issue within the scientific literature. Emerging prostate cancer tumor markers such as PCA3 and [-2]proPSA have not yet been shown to be superior to the current biomarker benchmark, PSA. Advancements in laparoscopic technique and technology will hopefully continue to reduce the morbidity associated with radical prostatectomy. Lastly, new therapeutic agents have been discovered which have redefined the management of advanced prostate cancer. We hope the present chapter has highlighted some of the key clinical concepts and treatment principles pertaining to this highly prevalent malignancy.

The aim of the present textbook is to provide an in-depth understanding of the intricacies encompassing the diagnosis and management of prostate cancer and to encapsulate some of the current exciting areas of active clinical/translational research within our scientific community.

#### **7. References**

Andriole GL et al. (2009). Mortality Results from a Randomized Prostate-Cancer Screening Trial. *N Engl J Med*, Vol. 360, No. 13, (Mar 2009), pp. 1310-1319

Andriole GL, Bostwick DG, Brawley OW, Gomella LG, Marberger M, Montorsi F, Pettaway CA, Tammela TL, Teloken C, Tindall DJ, Somerville MC, Wilson TH, Fowler IL, Rittmaster RS; REDUCE Study Group. (2010). The Effect of Dutasteride on the Risk of Prostate Cancer. *N Engl J Med*, Vol. 362, No. 13, (Apr 2010), pp. 1192-202

AUA Clinical Guidelines. (2009). Prostate-Specific Antigen Best Practices Statement

Chan TY, Mikolajczyk SD, Lecksell K et al. (2003). Immunohistochemical staining of prostate cancer with monoclonal antibodies to the precursor of prostate-specific antigen. *Urology*, Vol. 62, No. 1, (Jul 2003), pp. 177-81

Chapelon JY, Ribault M, Vernier F, Souchon R, Gelet A. (1999). Treatment of localised prostate cancer with transrectal high intensity focused ultrasound. *Eur J Ultrasound*, Vol. 9, No. 1, (Mar 1999), pp. 31-8

Civantos F, Soloway MS, Pinto JE. (1996). Histopathological effects of androgen deprivation in prostatic cancer. *Semin Urol Oncol*, Vol. 14, No. 2, (May 1996), pp. 22-31

D'Amico AV, Whittington R, Malkowicz SB et al. (1998). Biochemical outcome after radical prostatectomy, external beam radiation therapy, or interstitial radiation therapy for clinically localized prostate cancer. *JAMA*, Vol. 280, No. 11, (Sept 1998), pp. 969-74

Hessels D, Klein Gunnewiek JM, van Oort I et al. (2003). DD3(PCA3)-based molecular urine analysis for the diagnosis of prostate cancer. *Eur Urol*, Vol. 44, No. 1, (July 2003), pp. 8-15

Jemal A, Tiwari RC, Murray T, et al. (2004). American Cancer Society: Cancer statistics 2004. *CA Cancer J Clin*, Vol. 54, No. 1, (Jan-Feb 2004), pp. 8-29

Katz B, Srougi M, Dall'oglio M, Nesrallah AJ, Sant'anna AC, Pontes J Jr, Reis ST, Sañudo A, Camara-Lopes LH, Leite KR. (2011). Are we able to correctly identify prostate cancer patients who could be adequately treated by focal therapy? *Urol Oncol*, (Mar 2011), Ahead of publication

Klein EA, Thompson IM, Lippman SM, Goodman PJ, Albanes D, Taylor PR, Coltman C. (2001). SELECT: the next prostate cancer prevention trial. Selenium and Vitamin E Cancer Prevention Trial. *J Urol*, Vol. 166, No. 4, (Oct 2001), pp. 1311-5

Mikolajczyk SD, Millar LS, Wang TJ et al. (2000). A precursor form of prostate-specific antigen is more highly elevated in prostate cancer compared with benign transition zone prostate tissue. *Cancer Res*, Vol. 60, No. 3, (Feb 2000), pp. 756-9

Partin AW, Yoo J, Carter HB, Pearson JD, Chan DW, Epstein JI, Walsh PC. (1993). The use of prostate specific antigen, clinical stage and Gleason score to predict pathological stage in men with localized prostate cancer. *J Urol*, Vol. 150, No. 1, (Jul 1993), pp. 110-4

Schuessler WW, Schulam PG, Clayman RV, Kavoussi LR. (1997). Laparoscopic radical prostatectomy: initial short-term experience. *Urology*, Vol. 50, No. 6, (Dec 1997), pp. 854-7

Smith RA, Cokkinides V, Eyre HJ. (2006). American Cancer Society guidelines for the early detection of cancer, 2006. *CA Cancer J Clin*, Vol. 56, No. 1, (Jan-Feb 2006), pp. 11-25

Sokoll LJ, Wang Y, Feng Z et al. (2008). [-2]proenzyme prostate specific antigen for prostate cancer detection: a national cancer institute early detection research network validation study. *J Urol*, Vol. 180, No. 2, (Aug 2008), pp. 539-43

Thompson IM, Goodman PJ, Tangen CM, et al. (2003). The influence of finasteride on the development of prostate cancer. *N Engl J Med*, Vol. 349, No. 3, (July 2003), pp. 215-224

Tsai HK, D'Amico AV, Sadestsky N et al. (2007). Androgen deprivation therapy for localized prostate cancer and the risk of cardiovascular mortality. *J Natl Cancer Inst*, Vol. 99, No. 20, (Oct 2007), pp. 1516-24

U.S. Preventive Services Task Force Grade Definitions After May 2007. (May 2008). Available from: http://www.uspreventiveservicestaskforce.org/uspstf/gradespost.htm

Walsh PC, Donker PJ. (1982). Impotence following radical prostatectomy: insight into etiology and prevention. *J Urol*, Vol. 128, No. 3, (Sept 1982), pp. 492-7

Yossepowitch O, Eggener SE, Bianco FJ Jr, Carver BS, Serio A, Scardino PT, Eastham JA. (2007). Radical Prostatectomy for Clinically Localized, High Risk Prostate Cancer: Critical Analysis of Risk Assessment Methods. *J Urol*, Vol. 178, No. 2, (Aug 2007), pp. 493-9

**Part 2**

**Methods in Cancer Biology**


## **Effective Methodologies for Statistical Inference on Microarray Studies**

Makoto Aoshima and Kazuyoshi Yata *Institute of Mathematics, University of Tsukuba, Ibaraki Japan*

#### **1. Introduction**

A common feature of high-dimensional data such as genetic microarrays is that the data dimension is extremely high while the sample size is relatively small. This type of data is called high-dimension, low-sample-size (HDLSS) data. Such HDLSS data present substantial challenges and force us to reconsider existing methods of multivariate statistical analysis. Unfortunately, most conventional methods are known to break down in HDLSS situations, and alternative methods are often highly sensitive to the curse of dimensionality.

In this chapter, we present modern statistical methodologies that are very effective for drawing statistical inference from HDLSS data. We focus on a series of effective HDLSS methodologies developed by Aoshima and Yata (2011) and Yata and Aoshima (2009, 2010a,b, 2011a,b). We demonstrate how these methodologies perform well and bring new insight into research on prostate cancer.

In Section 2, we first consider Principal Component Analysis (PCA) for microarray data, to visualize a data structure having tens of thousands of dimensions by projecting onto a low-dimensional PC space. We note that classical PCA cannot sufficiently visualize a latent structure of microarray data because of the curse of dimensionality. We overcome the difficulty with the help of the *cross-data-matrix (CDM) methodology* that was developed by Yata and Aoshima (2010a,b).

Next, in Section 3, we consider effective clustering for microarray data. We apply the CDM methodology to estimating the principal component (PC) scores. We show that a clustering method based on the CDM-based first PC score effectively classifies individuals into two groups. We demonstrate accurate clustering using prostate cancer data given by Singh et al. (2002).

Further, in Section 4, we consider effective classification for microarray data. We pay special attention to the quadratic-type classification methodology developed by Aoshima and Yata (2011). We give a sample size determination for the classification so that the misclassification rates are controlled by a prespecified upper bound. We examine how well the classification methodology performs using several microarray data sets.

Finally, in Section 5, we consider a variable selection procedure to select a set of significant variables from microarray data. In most gene expression studies, it is important to select relevant genes for classification so that researchers can identify the smallest possible set of genes that can still achieve good predictive performance. We implement the two-stage variable selection procedure, developed by Aoshima and Yata (2011), that provides screening of variables in the first stage. We select a significant set of associated variables from among a set of candidate variables in the second stage. We show that the selection procedure assures high accuracy by eliminating redundant variables. We identify predictive genes to classify patients according to disease outcomes on prostate cancer.

#### **2. PCA for high-dimension, low-sample-size data**

Suppose we have a $p \times n$ data matrix $X = [x_1, ..., x_n]$ with $p > n$, where $x_k = (x_{1k}, ..., x_{pk})^T$, $k = 1, ..., n$, are independent and identically distributed as a $p$-dimensional distribution having mean $\mu$ and positive-definite covariance matrix $\Sigma$. The eigen-decomposition of $\Sigma$ is given by $\Sigma = H \Lambda H^T$, where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_1 \geq \cdots \geq \lambda_p (> 0)$ and $H = [h_1, ..., h_p]$ is a matrix of corresponding eigenvectors. Then, $Z = \Lambda^{-1/2} H^T (X - [\mu, ..., \mu])$ is considered as a $p \times n$ sphered data matrix from a distribution with zero mean and the identity covariance matrix. Here, we write $Z = [z_1, ..., z_p]^T$ and $z_j = (z_{j1}, ..., z_{jn})^T$, $j = 1, ..., p$. We assume that the fourth moments of each variable in $Z$ are uniformly bounded and $\|z_j\| \neq 0$ for $j = 1, ..., p$, where $\|\cdot\|$ denotes the Euclidean norm. We note that the multivariate distribution assumed here does not have to be a normal distribution, $N_p(\mu, \Sigma)$, and the random variables in $Z$ do not have to be regulated by a $\rho$-mixing condition. We consider a general setting as follows:
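As a concrete illustration of the sphering transform, the following NumPy sketch (our own, not from the chapter; the dimensions and seed are arbitrary) builds $Z = \Lambda^{-1/2} H^T (X - [\mu, ..., \mu])$ from a known $\mu$ and $\Sigma$ and checks that the rows of $Z$ are uncorrelated with unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 100_000  # small p, large n so the sample covariance is accurate

# A known positive-definite covariance matrix and its eigen-decomposition
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)
lam, H = np.linalg.eigh(Sigma)          # Sigma = H diag(lam) H^T
mu = rng.standard_normal(p)

# Draw X with mean mu and covariance Sigma, then sphere it
X = np.linalg.cholesky(Sigma) @ rng.standard_normal((p, n)) + mu[:, None]
Z = np.diag(lam ** -0.5) @ H.T @ (X - mu[:, None])

# Rows of Z now have zero mean and (approximately) identity covariance
print(np.round(np.cov(Z), 1))
```

In practice $\mu$ and $\Sigma$ are unknown, so $Z$ is a modeling device rather than a computable quantity; the sketch only verifies the algebra.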

$$
\lambda_j = a_j p^{\alpha_j} \ \ (j = 1, \ldots, m) \quad \text{and} \quad \lambda_j = c_j \ \ (j = m+1, \ldots, p). \tag{1}
$$

Here, $a_j (> 0)$, $c_j (> 0)$ and $\alpha_j$ ($\alpha_1 \geq \cdots \geq \alpha_m > 0$) are unknown constants preserving the ordering $\lambda_1 \geq \cdots \geq \lambda_p$, and $m$ is an unknown positive integer. The model (1) is an extension of a multi-component model or spiked covariance model given by Johnstone and Lu (2009). This is a quite general model for high-dimensional data. For example, a mixture model given by (6) in Section 3 is one of the examples that have the model (1) as in (7). One would also find the model (1) in highly-correlated, high-dimensional data analyses such as graphical models, high-dimensional regression models, and so on.

Let $X_o = X - [\bar{x}, ..., \bar{x}]$, where $\bar{x} = \sum_{i=1}^{n} x_i / n$. The sample covariance matrix is given by $S = (n-1)^{-1} X_o X_o^T$ and its dual matrix is defined by $S_D = (n-1)^{-1} X_o^T X_o$. Note that $S_D$ and $S$ share non-zero eigenvalues. Let $\hat{\lambda}_1 \geq \cdots \geq \hat{\lambda}_{n-1} (\geq 0)$ be the eigenvalues of $S_D$. Let us write the eigen-decomposition of $S_D$ by $S_D = \sum_{j=1}^{n-1} \hat{\lambda}_j \hat{u}_j \hat{u}_j^T$, where the $\hat{u}_j$'s are the corresponding eigenvectors of the $\hat{\lambda}_j$'s such that $\|\hat{u}_j\| = 1$ and $\hat{u}_i^T \hat{u}_j = 0$ ($i \neq j$).
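The duality between $S$ and $S_D$ is easy to verify numerically; here is a minimal NumPy sketch of it (our own, with arbitrary sizes). Working with the $n \times n$ dual matrix instead of the $p \times p$ covariance matrix is what makes the eigenvalue computation cheap in HDLSS settings:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 1000, 20  # HDLSS: p far exceeds n

X = rng.standard_normal((p, n))            # p x n data matrix
Xo = X - X.mean(axis=1, keepdims=True)     # subtract the sample mean x-bar

S = Xo @ Xo.T / (n - 1)    # p x p sample covariance matrix
SD = Xo.T @ Xo / (n - 1)   # n x n dual matrix

# S and S_D share their non-zero eigenvalues (rank is n - 1 after centering)
eig_S = np.sort(np.linalg.eigvalsh(S))[::-1][: n - 1]
eig_SD = np.sort(np.linalg.eigvalsh(SD))[::-1][: n - 1]
print(np.allclose(eig_S, eig_SD))  # True
```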

#### **2.1 Naive PCA in HDLSS situations**

Yata and Aoshima (2009) gave sufficient conditions to claim the consistency property for the sample eigenvalues: For *j* = 1, ..., *m*, it holds that

$$\frac{\hat{\lambda}_j}{\lambda_j} \xrightarrow{p} 1\tag{2}$$

under the conditions:

**(YA-i)** $p \to \infty$ and $n \to \infty$ for $j$ such that $\alpha_j > 1$;
**(YA-ii)** $p \to \infty$ and $p^{2-2\alpha_j}/n \to 0$ for $j$ such that $\alpha_j \in (0, 1]$.

Here, $\xrightarrow{p}$ denotes convergence in probability. If $z_{jk}$, $j = 1, ..., p$ ($k = 1, ..., n$), are independent, the above conditions are improved by the necessary and sufficient conditions as follows:

$$\begin{aligned} \textbf{(YA-i')}\ & p \to \infty \text{ and } n \to \infty \text{ for } j \text{ such that } \alpha_j > 1;\\ \textbf{(YA-ii')}\ & p \to \infty \text{ and } p^{1-\alpha_j}/n \to 0 \text{ for } j \text{ such that } \alpha_j \in (0,1]. \end{aligned}$$

For the details, including the limiting distribution of $\hat{\lambda}_j$, see Yata and Aoshima (2009). If the population distribution is $N_p(\mu, \Sigma)$, one may consider that $z_{jk}$, $j = 1, ..., p$ ($k = 1, ..., n$), are independent. When $\alpha_j > 1$, the sample size $n$ is free from $p$ in (YA-i) or (YA-i'). However, when $\alpha_j \in (0, 1]$, one finds difficulty with naive PCA in view of (YA-ii) or (YA-ii') in HDLSS data situations. Consider a simple case in which $p = 10000$, $\lambda_1 = p^{1/2}$ and $\lambda_2 = \cdots = \lambda_p = 1$. Then, we observe from (YA-ii) that we would need $n \gg p^{2-2\alpha_1} = p = 10000$. It is somewhat inconvenient for the experimenter to handle PCA in HDLSS data situations.
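To see the failure concretely, the following simulation sketch (our own; the sample size $n = 50$ is a hypothetical choice, far below the $n \gg 10000$ that (YA-ii) demands) generates data from the example with $p = 10000$ and $\lambda_1 = p^{1/2}$ and shows that the naive estimate $\hat{\lambda}_1$ badly overshoots $\lambda_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 10000, 50           # hypothetical HDLSS sample size; alpha_1 = 1/2
lam1 = np.sqrt(p)          # lambda_1 = p^{1/2} = 100

# X ~ (0, Sigma) with Sigma = diag(lambda_1, 1, ..., 1)
X = rng.standard_normal((p, n))
X[0] *= np.sqrt(lam1)

Xo = X - X.mean(axis=1, keepdims=True)
SD = Xo.T @ Xo / (n - 1)                  # n x n dual matrix
lam1_hat = np.linalg.eigvalsh(SD)[-1]     # largest sample eigenvalue

# The p - 1 noise coordinates inflate lam1_hat by roughly p/n, so the
# ratio is far from 1 even though lambda_1 dominates the spectrum.
print(lam1_hat / lam1)
```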

#### **2.2 Beyond naive PCA**


Yata and Aoshima (2010a,b) created an effective methodology called the *cross-data-matrix (CDM) methodology* to handle HDLSS data situations: Let *n*(1) = [*n*/2] + 1 and *n*(2) = *n* − *n*(1), where [*x*] denotes the largest integer less than *x*. Suppose that we have a *p* × *n* data matrix,

$$\mathbf{X} = [\mathbf{x}\_1, \dots, \mathbf{x}\_n] = [\mathbf{x}\_{11}, \dots, \mathbf{x}\_{1n\_{(1)}}, \mathbf{x}\_{21}, \dots, \mathbf{x}\_{2n\_{(2)}}].\tag{3}$$

We define $p \times n\_{(i)}$ data matrices, $X\_1$ and $X\_2$, by $X\_i = [x\_{i1}, \dots, x\_{in\_{(i)}}]$, *i* = 1, 2. Note that $X\_1$ and $X\_2$ are independent. Let $X\_{oi} = X\_i - [\bar{x}\_i, \dots, \bar{x}\_i]$, *i* = 1, 2, where $\bar{x}\_i = \sum\_{j=1}^{n\_{(i)}} x\_{ij} / n\_{(i)}$. We define a cross data matrix by $S\_{D(1)} = ((n\_{(1)} - 1)(n\_{(2)} - 1))^{-1/2} X\_{o1}^T X\_{o2}$ or $S\_{D(2)} = ((n\_{(1)} - 1)(n\_{(2)} - 1))^{-1/2} X\_{o2}^T X\_{o1}\ (= S\_{D(1)}^T)$. Note that $\mathrm{rank}(S\_{D(1)}) \le n\_{(2)} - 1$. When we consider the singular value decomposition of $S\_{D(1)}$, it follows that $S\_{D(1)} = \sum\_{j=1}^{n\_{(2)}-1} \tilde{\lambda}\_j \tilde{u}\_{j(1)} \tilde{u}\_{j(2)}^T$, where $\tilde{\lambda}\_1 \ge \dots \ge \tilde{\lambda}\_{n\_{(2)}-1}\ (\ge 0)$ denote the singular values of $S\_{D(1)}$, and $\tilde{u}\_{j(1)}$ (or $\tilde{u}\_{j(2)}$) denotes a unit left- (or right-) singular vector corresponding to $\tilde{\lambda}\_j$ ($j = 1, \dots, n\_{(2)} - 1$).

#### **[Cross-data-matrix (CDM) methodology]**

**(Step 1)** Define a cross data matrix by $S\_{D(1)} = ((n\_{(1)} - 1)(n\_{(2)} - 1))^{-1/2} X\_{o1}^T X\_{o2}$.

**(Step 2)** Calculate the singular values, $\tilde{\lambda}\_j$'s, of $S\_{D(1)}$ for the estimation of the $\lambda\_j$'s.

Note that $S\_{D(1)} S\_{D(1)}^T = \sum\_{j=1}^{n\_{(2)}-1} \tilde{\lambda}\_j^2 \tilde{u}\_{j(1)} \tilde{u}\_{j(1)}^T$. Thus one can calculate the singular values, $\tilde{\lambda}\_j$'s, as the positive square roots of the eigenvalues of $S\_{D(1)} S\_{D(1)}^T$. The CDM methodology assures the following properties. For the details, see Yata and Aoshima (2010a,b).
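Steps 1 and 2 can be sketched in NumPy as follows. This is a minimal sketch under the spiked model (1); the function name and the deterministic half-split of the columns are our own choices, not from the chapter:

```python
import numpy as np

def cdm_eigenvalues(X):
    """Cross-data-matrix (CDM) estimates of the leading eigenvalues.

    X is a p x n data matrix whose columns are observations.  The
    columns are split into halves of sizes n_(1) and n_(2), each half
    is centred by its own mean, and the singular values of the cross
    data matrix S_D(1) are returned in descending order.
    """
    p, n = X.shape
    n1 = (n + 1) // 2          # n_(1) = [n/2] + 1 in the notation above
    n2 = n - n1
    X1, X2 = X[:, :n1], X[:, n1:]
    Xo1 = X1 - X1.mean(axis=1, keepdims=True)
    Xo2 = X2 - X2.mean(axis=1, keepdims=True)
    # Step 1: the n_(1) x n_(2) cross data matrix S_D(1)
    SD1 = (Xo1.T @ Xo2) / np.sqrt((n1 - 1) * (n2 - 1))
    # Step 2: its singular values estimate the spiked eigenvalues
    return np.linalg.svd(SD1, compute_uv=False)
```

Only an $n\_{(1)} \times n\_{(2)}$ matrix is decomposed, so the decomposition cost does not grow with the dimension *p*.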

**Theorem 2.1.** *For j* = 1, ..., *m, it holds that*

$$\frac{\tilde{\lambda}\_j}{\lambda\_j} \stackrel{p}{\to} 1\tag{4}$$

*under the conditions:*

**(i)** *p* → ∞ *and n* → ∞ *for j such that α<sup>j</sup>* > 1/2*;*


**(ii)** *<sup>p</sup>* <sup>→</sup> <sup>∞</sup> *and p*2−2*αj*/*<sup>n</sup>* <sup>→</sup> <sup>0</sup> *for j such that <sup>α</sup><sup>j</sup>* <sup>∈</sup> (0, 1/2]*.*

**Corollary 2.1**. *Assume further in Theorem 2.1 that* $z\_{jk}$, *j* = 1, ..., *p* (*k* = 1, ..., *n*) *are independent. Then, for j* = 1, ..., *m, we have (4) under the conditions:*

**(i)** *p* → ∞ *and n* → ∞ *for j such that* $\alpha\_j > 1/2$*;*

**(ii)** *p* → ∞ *and there exists a positive constant* $\varepsilon\_j$ *satisfying* $p^{1-2\alpha\_j}/n < p^{-\varepsilon\_j}$ *for j such that* $\alpha\_j \in (0, 1/2]$*.*


**Theorem 2.2**. *Let* $\mathrm{Var}(z\_{jk}^2) = M\_j\ (< \infty)$ *for j* = 1, ..., *m* (*k* = 1, ..., *n*)*. Assume that* $\lambda\_j$ (*j* ≤ *m*) *has multiplicity one. Then, under the conditions (i)-(ii) in Theorem 2.1, it holds for j* = 1, ..., *m, that*

$$
\sqrt{\frac{n}{M\_j}} \left(\frac{\tilde{\lambda}\_j}{\lambda\_j} - 1\right) \Rightarrow N(0, 1), \tag{5}
$$

*where "*⇒*" denotes the convergence in distribution and N*(0, 1) *denotes a random variable distributed as the standard normal distribution.*

**Corollary 2.2**. *Assume further in Theorem 2.2 that* $z\_{jk}$, *j* = 1, ..., *p* (*k* = 1, ..., *n*) *are independent. Then, for j* = 1, ..., *m, we have (5) under the conditions:*

**(i)** *p* → ∞ *and n* → ∞ *for j such that* $\alpha\_j > 1/2$*;*

**(ii)** *p* → ∞ *and* $p^{2-4\alpha\_j}/n \to 0$ *for j such that* $\alpha\_j \in (0, 1/2]$*.*


**Remark 2.1.** When the population distribution is *Np*(*μ*, **Σ**), one has that *Mj* = 2 for *j* = 1, ..., *p*.

**Remark 2.2.** The condition (ii) given by Theorem 2.1 (or Theorem 2.2) is a sufficient condition for the case of *α<sup>j</sup>* ∈ (0, 1/2]. If more information is available about the population distribution, the condition (ii) can be relaxed to give consistency under a broader set of (*p*, *n*). For example, when the population distribution is *Np*(*μ*, **Σ**), the asymptotic properties are claimed under a broader set of (*p*, *n*) given by (ii) of Corollary 2.1 (or Corollary 2.2).

**Remark 2.3.** In view of Theorem 2.1 compared to (2), the CDM methodology successfully relaxes the condition for the case that *α<sup>j</sup>* > 1/2. The conditions given by Theorem 2.1 are not continuous in *α<sup>j</sup>* at *α<sup>j</sup>* = 1/2. On the other hand, the conditions given by Corollaries 2.1 and 2.2 are continuous in *αj*.

When we apply the CDM methodology, we simply divide *X* into $x\_1, \dots, x\_{n\_{(1)}}$ and $x\_{n\_{(1)}+1}, \dots, x\_n$ in (3). In general, there exist ${}\_{n}C\_{n\_{(1)}}$ ways to divide *X* into $X\_1$ and $X\_2$. The CDM methodology can be generalized as follows:

#### **[Generalized cross-data-matrix (GCDM) methodology]**

**(Step 1)** Set iteration number *T*. Set *t* = 1.

**(Step 2)** Randomly split $x\_1, \dots, x\_n$ into $X\_1 = [x\_{1(1)}, \dots, x\_{1(n\_{(1)})}]$ and $X\_2 = [x\_{2(1)}, \dots, x\_{2(n\_{(2)})}]$.

**(Step 3)** Define a cross data matrix by $S\_{D(1)t} = ((n\_{(1)} - 1)(n\_{(2)} - 1))^{-1/2} X\_{o1}^T X\_{o2}$, where $X\_{oi} = X\_i - [\bar{x}\_i, \dots, \bar{x}\_i]$, *i* = 1, 2, and $\bar{x}\_i = \sum\_{j=1}^{n\_{(i)}} x\_{i(j)} / n\_{(i)}$.

**(Step 4)** Calculate the singular values, $\tilde{\lambda}\_{1t} \ge \dots \ge \tilde{\lambda}\_{n\_{(2)}-1\,t}\ (\ge 0)$, of $S\_{D(1)t}$.

**(Step 5)** If *t* < *T*, put *t* = *t* + 1 and go to Step 2; otherwise go to Step 6.

**(Step 6)** Estimate $\lambda\_j$ by $\tilde{\lambda}\_{j(T)} = \sum\_{t=1}^{T} \tilde{\lambda}\_{jt} / T$ for each *j*.
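The averaging over random splits can be sketched in the same way (again a sketch; the default *T* and the seeding interface are our own choices):

```python
import numpy as np

def gcdm_eigenvalues(X, T=100, rng=None):
    """GCDM estimates: CDM singular values averaged over T random
    splits of the columns of X (Steps 1-6 above)."""
    rng = np.random.default_rng(rng)
    p, n = X.shape
    n1 = (n + 1) // 2
    n2 = n - n1
    acc = np.zeros(n2)
    for _ in range(T):
        perm = rng.permutation(n)                        # Step 2: random split
        X1, X2 = X[:, perm[:n1]], X[:, perm[n1:]]
        Xo1 = X1 - X1.mean(axis=1, keepdims=True)
        Xo2 = X2 - X2.mean(axis=1, keepdims=True)
        SD1t = (Xo1.T @ Xo2) / np.sqrt((n1 - 1) * (n2 - 1))  # Step 3
        acc += np.linalg.svd(SD1t, compute_uv=False)          # Step 4
    return acc / T                                        # Step 6: average
```

Averaging over splits reduces the variance of the single-split CDM estimate without changing its limit.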

Fig. 1. The behaviors of A: *λ*ˆ *<sup>j</sup>*/*λ<sup>j</sup>* and B: *λ*˜ *<sup>j</sup>*/*λ<sup>j</sup>* for the first eigenvalue (upper panel) and second eigenvalue (lower panel) when the samples, of size *n* = 20(20)120, were taken from *Np*(**0**, **Σ**) with *p* = 1600.

#### **2.3 Performances**


We observed that naive PCA requires the sample size *n* depending on *p* for *α<sup>i</sup>* ∈ (1/2, 1] in (2). On the other hand, the CDM methodology allows the experimenter to choose *n* free from *p* for the case that *α<sup>i</sup>* > 1/2 as in Theorem 2.1 or Corollary 2.1. The CDM methodology might make it possible to give feasible estimation of eigenvalues for HDLSS data with extremely small *n* compared to *p*.

We first considered a normal distribution case. Independent pseudorandom normal observations were generated from $N\_p(\mathbf{0}, \Sigma)$ with *p* = 1600. We considered $\lambda\_1 = p^{2/3}$, $\lambda\_2 = p^{1/3}$ and $\lambda\_3 = \dots = \lambda\_p = 1$ in (1). We used the sample of size *n* = 20(20)120 to define the data matrix *X* : *p* × *n* for the calculation of $S\_D$ in naive PCA, whereas we divided the sample into $X\_1 : p \times n\_{(1)}$ and $X\_2 : p \times n\_{(2)}$ for the calculation of $S\_{D(1)}$ in the CDM methodology. The findings were obtained by averaging the outcomes from 1000 (= *R*, say) replications. Under a fixed scenario, suppose that the *r*-th replication ends with estimates $\hat{\lambda}\_{jr}$ and $\tilde{\lambda}\_{jr}$ of $\lambda\_j$ ($r = 1, \dots, R$), given by naive PCA and the CDM methodology, respectively. Let us simply write $\hat{\lambda}\_j = R^{-1} \sum\_{r=1}^{R} \hat{\lambda}\_{jr}$ and $\tilde{\lambda}\_j = R^{-1} \sum\_{r=1}^{R} \tilde{\lambda}\_{jr}$. We considered two quantities, A: $\hat{\lambda}\_j / \lambda\_j$ and B: $\tilde{\lambda}\_j / \lambda\_j$. Figure 1 shows the behaviors of both A and B for the first two eigenvalues. By observing the behavior of A, naive PCA seems not to give a feasible estimation within


the range of *n*. The sample size *n* was not large enough to use the eigenvalues of $S\_D$ for such a high-dimensional space. On the other hand, in view of the behavior of B, the CDM methodology gave a surprisingly reasonable estimation for such HDLSS data sets. The CDM methodology seems to perform excellently, as expected theoretically.

Next, we considered a non-normal distribution case. Independent pseudorandom observations were generated from a *p*-variate *t*-distribution, $t\_p(\mathbf{0}, \Sigma, \nu)$, with mean zero, covariance matrix **Σ** and degrees of freedom *ν* = 15. We considered the case that $\lambda\_1 = p^{2/3}$, $\lambda\_2 = p^{1/3}$ and $\lambda\_3 = \dots = \lambda\_p = 1$ in (1), as before. We fixed the sample size as *n* = 60 and set the dimension as *p* = 400(400)2000. Similarly to Figure 1, the findings were obtained by averaging the outcomes from 1000 replications. Figure 2 shows the behaviors of the two quantities, A: $\hat{\lambda}\_j / \lambda\_j$ and B: $\tilde{\lambda}\_j / \lambda\_j$, for the first two eigenvalues. Again, the CDM methodology seems to perform excellently, as expected theoretically. One can observe the consistency of $\tilde{\lambda}\_j$ for all *p* = 400(400)2000. We conducted simulation studies for other settings as well and verified the superiority of the CDM methodology over naive PCA in various HDLSS data situations.

Fig. 2. The behaviors of A: $\hat{\lambda}\_j / \lambda\_j$ and B: $\tilde{\lambda}\_j / \lambda\_j$ for the first eigenvalue (upper panel) and second eigenvalue (lower panel) when the samples, of size *n* = 60, were taken from $t\_p(\mathbf{0}, \Sigma, \nu)$ with *ν* = 15 and *p* = 400(400)2000.
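A scaled-down re-run of the normal-distribution experiment can be sketched as follows. This is a sketch only: *p*, *n* and *R* are reduced from the chapter's values for speed, and we take a diagonal **Σ** carrying the spiked eigenvalues, which is our simplification:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, R = 400, 40, 100
lam = np.ones(p)
lam[0], lam[1] = p ** (2 / 3), p ** (1 / 3)   # spiked eigenvalues as in (1)
sd = np.sqrt(lam)[:, None]
n1 = (n + 1) // 2
n2 = n - n1

A = np.zeros(2)    # naive PCA: mean of lambda_hat_j / lambda_j
B = np.zeros(2)    # CDM:       mean of lambda_tilde_j / lambda_j
for _ in range(R):
    X = sd * rng.standard_normal((p, n))      # N_p(0, Sigma), Sigma diagonal
    Xo = X - X.mean(axis=1, keepdims=True)
    SD = Xo.T @ Xo / (n - 1)                  # dual matrix S_D
    A += np.linalg.eigvalsh(SD)[::-1][:2] / lam[:2]
    X1, X2 = X[:, :n1], X[:, n1:]
    Xo1 = X1 - X1.mean(axis=1, keepdims=True)
    Xo2 = X2 - X2.mean(axis=1, keepdims=True)
    SD1 = Xo1.T @ Xo2 / np.sqrt((n1 - 1) * (n2 - 1))
    B += np.linalg.svd(SD1, compute_uv=False)[:2] / lam[:2]
A /= R
B /= R
# A overshoots 1 (the noise eigenvalues bias naive PCA upward),
# while B stays close to 1 for the first eigenvalue
```

Even at this small scale, quantity A for the first eigenvalue sits visibly above 1 while quantity B does not, mirroring the behavior reported in Figures 1 and 2.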


#### **3. Clustering for high-dimension, low-sample-size data**

Suppose we have a mixture model to classify a data set into two groups. We assume that the observation is sampled with mixing proportions *wj*'s from two populations, Π<sup>1</sup> and Π2, and the label of the population is missing. We consider a mixture model whose p.d.f. (or p.f.) is given by

$$f(\mathbf{x}) = w\_1 \pi\_1(\mathbf{x}; \boldsymbol{\mu}\_1, \boldsymbol{\Sigma}\_1) + w\_2 \pi\_2(\mathbf{x}; \boldsymbol{\mu}\_2, \boldsymbol{\Sigma}\_2), \tag{6}$$

where the $w\_j$'s are positive constants such that $w\_1 + w\_2 = 1$ and the $\pi\_i(x; \mu\_i, \Sigma\_i)$'s are *p*-dimensional p.d.f.s (or p.f.s) of $\Pi\_i$ having mean vector $\mu\_i$ and covariance matrix $\Sigma\_i$. Let *μ* and **Σ** be the mean vector and the covariance matrix of the mixture model. Then, we have that $\mu = w\_1\mu\_1 + w\_2\mu\_2$ and $\Sigma = w\_1 w\_2 (\mu\_1 - \mu\_2)(\mu\_1 - \mu\_2)^T + w\_1\Sigma\_1 + w\_2\Sigma\_2$. We suppose that $x\_k$, *k* = 1, ..., *n*, are independently taken from (6) and define a *p* × *n* data matrix $X = [x\_1, \dots, x\_n]$. Let $\Delta = \|\mu\_1 - \mu\_2\|^2$, and let $\lambda\_{11}$ and $\lambda\_{21}$ be the largest eigenvalues of $\Sigma\_1$ and $\Sigma\_2$, respectively. We assume that $\Delta = cp^{\beta}$, where *c* and *β* are positive constants, and that $\lambda\_{11}/\Delta \to 0$ and $\lambda\_{21}/\Delta \to 0$ as *p* → ∞. Then, as for the largest eigenvalue, $\lambda\_1$, of **Σ** and the corresponding eigenvector, $h\_1$, we have that

$$\frac{\lambda\_1}{w\_1 w\_2 \Delta} \to 1 \quad \text{and} \quad \mathrm{Angle}(h\_1, (\mu\_1 - \mu\_2)/\Delta^{1/2}) \to 0. \tag{7}$$

We note from (7) that the mixture model given by (6) holds the model (1) about **Σ**. Let *s*1*<sup>k</sup>* denote the first principal component (PC) score of *x<sup>k</sup>* (*k* = 1, ..., *n*). Then, from Yata and Aoshima (2010b), it holds as *p* → ∞ that

$$\frac{s\_{1k}}{\sqrt{\lambda\_1}} \xrightarrow{p} \begin{cases} \sqrt{w\_2/w\_1} & \text{when } \mathbf{x}\_k \in \Pi\_1, \\ -\sqrt{w\_1/w\_2} & \text{when } \mathbf{x}\_k \in \Pi\_2. \end{cases}$$

Thus one would be able to classify the data set {*x*1, ..., *xn*} into two groups if *s*1*<sup>k</sup>* is effectively estimated in HDLSS data situations. In this section hereafter, we borrow symbols from Section 2.

#### **3.1 Effective estimation for PC scores**

In general, the *j*-th PC score of $x\_k$ is given by $h\_j^T (x\_k - \mu) = z\_{jk} \sqrt{\lambda\_j}\ (= s\_{jk}$, say$)$. Yata and Aoshima (2009) considered a sample eigenvector $\hat{h}\_j = ((n - 1)\hat{\lambda}\_j)^{-1/2} X\_o \hat{u}\_j$ and a naive estimator of the *j*-th PC score of $x\_k$ given by $\hat{h}\_j^T (x\_k - \bar{x}) = \hat{u}\_{jk} \sqrt{(n - 1)\hat{\lambda}\_j}\ (= \hat{s}\_{jk}$, say$)$, where $\hat{u}\_j^T = (\hat{u}\_{j1}, \dots, \hat{u}\_{jn})$. Note that $\hat{h}\_j$ can be calculated by using a unit-norm eigenvector, $\hat{u}\_j$, of $S\_D$, whose size is much smaller than that of *S*, especially for a HDLSS data matrix. Now, we apply the CDM methodology to the PC score in order to improve the naive estimator. Recall that $\tilde{u}\_{j(1)}$ (or $\tilde{u}\_{j(2)}$) is a unit left- (or right-) singular vector corresponding to the singular value $\tilde{\lambda}\_j$ ($j = 1, \dots, n\_{(2)} - 1$) of $S\_{D(1)} = ((n\_{(1)} - 1)(n\_{(2)} - 1))^{-1/2} X\_{o1}^T X\_{o2}$.

#### **[CDM methodology for PC scores]**

**(Step 1)** Calculate the singular vectors $\tilde{u}\_{j(i)}$'s, *i* = 1, 2, of $S\_{D(1)}$.

**(Step 2)** Adjust the sign of $\tilde{u}\_{j(2)}$ by $\tilde{u}\_{j(2)} = \mathrm{Sign}(\tilde{u}\_{j(1)}^T X\_{o1}^T X\_{o2} \tilde{u}\_{j(2)}) \tilde{u}\_{j(2)}$. After the modification, let $\tilde{u}\_{j(i)}^T = (\tilde{u}\_{j1(i)}, \dots, \tilde{u}\_{jn\_{(i)}(i)})$, *i* = 1, 2.

**(Step 3)** Calculate $\tilde{s}\_{jk(i)} = \tilde{u}\_{jk(i)} \sqrt{(n\_{(i)} - 1)\tilde{\lambda}\_j}$, *k* = 1, ..., *n*(*i*); *i* = 1, 2. Estimate the *j*-th PC score of $x\_k$ by $\tilde{s}\_{jk} = \tilde{s}\_{jk(1)}$, *k* = 1, ..., *n*(1), and $\tilde{s}\_{jk+n\_{(1)}} = \tilde{s}\_{jk(2)}$, *k* = 1, ..., *n*(2).

One can calculate the singular vectors $\tilde{u}\_{j(i)}$'s as the eigenvectors of $S\_{D(i)} S\_{D(i)}^T$. Let $\mathrm{MSE}(\tilde{s}\_j) = n^{-1} \sum\_{k=1}^{n} (\tilde{s}\_{jk} - s\_{jk})^2$ denote the sample mean-square error of the *j*-th PC score. Note that $\mathrm{Var}(s\_{jk}) = \lambda\_j$. Then, Yata and Aoshima (2010b) gave the following properties on the CDM-based PC scores.
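Steps 1-3 can be sketched as follows. This is a minimal sketch; the function name is ours, and the sign adjustment of Step 2 is written out even though it is a no-op when both singular vectors come from a single SVD call:

```python
import numpy as np

def cdm_pc_scores(X, j=0):
    """CDM estimates of the (j+1)-th PC scores of the columns of X
    (Steps 1-3 above), returned in the original column order."""
    p, n = X.shape
    n1 = (n + 1) // 2
    n2 = n - n1
    X1, X2 = X[:, :n1], X[:, n1:]
    Xo1 = X1 - X1.mean(axis=1, keepdims=True)
    Xo2 = X2 - X2.mean(axis=1, keepdims=True)
    SD1 = (Xo1.T @ Xo2) / np.sqrt((n1 - 1) * (n2 - 1))
    U, s, Vt = np.linalg.svd(SD1)            # Step 1: singular vectors
    u1, u2 = U[:, j], Vt[j]
    # Step 2: sign alignment -- a no-op for a pair from one SVD call,
    # but needed when u_{j(1)} and u_{j(2)} are computed separately
    u2 = np.sign(u1 @ SD1 @ u2) * u2
    # Step 3: scale the singular vectors into PC-score estimates
    s1 = u1 * np.sqrt((n1 - 1) * s[j])
    s2 = u2 * np.sqrt((n2 - 1) * s[j])
    return np.concatenate([s1, s2])
```

In the two-group mixture above, the sign of the first PC-score estimates then separates the two populations, up to a global sign flip.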

**Theorem 3.1.** *Assume that λ<sup>j</sup>* (*j* ≤ *m*) *has multiplicity one. Then, it holds that*

$$\frac{MSE(\tilde{s}\_j)}{\lambda\_j} \stackrel{p}{\to} 0 \tag{8}$$

*under the conditions (i)-(ii) in Theorem 2.1. If zjk*, *j* = 1, ..., *p* (*k* = 1, ..., *n*) *are independent, we have (8) under the conditions (i)-(ii) in Corollary 2.1.*

**Theorem 3.2.** *Assume that λ<sup>j</sup>* (*j* ≤ *m*) *has multiplicity one. Then, for any k* (= 1, ..., *n*)*, it holds that*

$$
\lambda\_j^{-1/2} \tilde{s}\_{jk} \xrightarrow{p} z\_{jk} \tag{9}
$$

*under the conditions (i)-(ii) of Theorem 2.1. If* $z\_{jk}$, *j* = 1, ..., *p* (*k* = 1, ..., *n*) *are independent, we have (9) under the conditions (i)-(ii) of Corollary 2.2.*

The CDM-based PC score can be generalized as follows:

#### **[GCDM methodology for PC scores]**

**(Step 1)** Set iteration number *T*. Set *t* = 1.

**(Step 2)** Randomly split $x\_1, \dots, x\_n$ into $X\_1 = [x\_{1(1)}, \dots, x\_{1(n\_{(1)})}]$ and $X\_2 = [x\_{2(1)}, \dots, x\_{2(n\_{(2)})}]$.

**(Step 3)** Define a cross data matrix by $S\_{D(1)t} = ((n\_{(1)} - 1)(n\_{(2)} - 1))^{-1/2} X\_{o1}^T X\_{o2}$, where $X\_{oi} = X\_i - [\bar{x}\_i, \dots, \bar{x}\_i]$, *i* = 1, 2, and $\bar{x}\_i = \sum\_{j=1}^{n\_{(i)}} x\_{i(j)} / n\_{(i)}$. Calculate the singular values, $\tilde{\lambda}\_{1t} \ge \dots \ge \tilde{\lambda}\_{n\_{(2)}-1\,t}\ (\ge 0)$, and the corresponding singular vectors, $\tilde{u}\_{j(i)t}$'s, *i* = 1, 2, of $S\_{D(1)t}$. If *t* = 1, go to Step 5; otherwise go to Step 4.

**(Step 4)** Adjust the sign of $\tilde{u}\_{j(1)t}$ by $\tilde{u}\_{j(1)t} = \mathrm{Sign}(\tilde{u}\_{j(1)t}^T \tilde{u}\_{j(1)1}) \tilde{u}\_{j(1)t}$.

**(Step 5)** Adjust the sign of $\tilde{u}\_{j(2)t}$ by $\tilde{u}\_{j(2)t} = \mathrm{Sign}(\tilde{u}\_{j(1)t}^T X\_{o1}^T X\_{o2} \tilde{u}\_{j(2)t}) \tilde{u}\_{j(2)t}$. After the modification, let $\tilde{u}\_{j(i)t}^T = (\tilde{u}\_{j(1i)t}, \dots, \tilde{u}\_{j(n\_{(i)}i)t})$, *i* = 1, 2.

**(Step 6)** Calculate $\tilde{s}\_{jk(i)t} = \tilde{u}\_{jk(i)t} \sqrt{(n\_{(i)} - 1)\tilde{\lambda}\_{jt}}$, *k* = 1, ..., *n*(*i*); *i* = 1, 2, and adjust the ordering of the scores to that of the original observations $x\_1, \dots, x\_n$ to obtain $\tilde{s}\_{jkt}$, *k* = 1, ..., *n*.

**(Step 7)** If *t* < *T*, put *t* = *t* + 1 and go to Step 2; otherwise go to Step 8.

**(Step 8)** Estimate the *j*-th PC score of $x\_k$ by $\tilde{s}\_{jk(T)} = \sum\_{t=1}^{T} \tilde{s}\_{jkt} / T$ for each *j* and *k*.

Fig. 3. Scatterplots of PC scores by PC1 and PC2 (upper panel) or PC1 and PC3 (lower panel) by using the GCDM methodology. There are 9 samples from Normal Prostate (blue points) and 9 samples from Prostate Tumors (red points).

#### **3.2 Demonstration**

We analyzed gene expression data about prostate cancer given by Singh et al. (2002); refer to Pochet et al. (2004) for details of the data set. The data set consisted of 12600 (= *p*) genes and 34 microarrays, in which there were 9 samples from Normal Prostate and 25 samples from Prostate Tumors. As for Prostate Tumors, we chose the first 9 samples and set 18 (= *n*) microarrays in which there were 9 samples from Normal Prostate and 9 samples from Prostate Tumors. We assumed the mixture model given by (6) for the data set and defined the data matrix by *X* : 12600 × 18. We set (*n*(1), *n*(2)) = (9, 9) and *T* = 1000, and focused on the first three PC scores. We randomly divided *X* into *X*1 : 12600 × 9 and *X*2 : 12600 × 9, and calculated $\tilde{s}\_{jkt}$, *k* = 1, ..., 18, for *j* = 1, 2, 3. According to the GCDM methodology, we repeated this operation *T* = 1000 times and obtained $\tilde{s}\_{jk(T)}$, *k* = 1, ..., 18; *j* = 1, 2, 3, as an estimate of the *j*-th PC score of $x\_k$. We also obtained $(\tilde{\lambda}\_{1(T)}, \tilde{\lambda}\_{2(T)}, \tilde{\lambda}\_{3(T)}) = (2.77 \times 10^8, 1.62 \times 10^8, 6.34 \times 10^7)$.

Figure 3 gives the scatterplots of the first three PC scores on the (PC1, PC2) plane or the (PC1, PC3) plane. As observed in Figure 3, Normal Prostate (blue points) and Prostate Tumors (red points) seem to be separated clearly. The separation is especially clear along the first PC score (PC1) axis: all the first PC scores of the samples from Normal Prostate are negative, whereas those from Prostate Tumors are positive. This observation is theoretically supported by the arguments in Section 3.1.

8 Will-be-set-by-IN-TECH

*<sup>k</sup>*=1(*s*˜*jk* <sup>−</sup> *sjk*)<sup>2</sup> denote the sample mean-square error of the *<sup>j</sup>*-th PC score. Note

*p*

that Var(*sjk*) = *λj*. Then, Yata and Aoshima (2010b) gave the following properties on the

*MSE*(*s*˜*j*) *λj*

*under the conditions (i)-(ii) in Theorem 2.1. If zjk*, *j* = 1, ..., *p* (*k* = 1, ..., *n*) *are independent, we have*

**Theorem 3.2.** *Assume that λ<sup>j</sup>* (*j* ≤ *m*) *has multiplicity one. Then, for any k* (= 1, ..., *n*)*, it holds*

*under the conditions (i)-(ii) of Theorem 2.1. If zjk*, *j* = 1, ..., *p* (*k* = 1, ..., *n*) *are independent, we have*

**(Step 2)** Randomly split *x*1, ..., *x<sup>n</sup>* into *X*<sup>1</sup> = [*x*1(1), ..., *x*1(*n*(1))] and *X*<sup>2</sup> = [*x*2(1), ..., *x*2(*n*(2))].

*<sup>λ</sup>*˜ <sup>1</sup>*<sup>t</sup>* ≥ ··· ≥ *<sup>λ</sup>*˜ *<sup>n</sup>*(2)−<sup>1</sup> *<sup>t</sup>*(<sup>≥</sup> <sup>0</sup>), and the corresponding singular vectors, ˜*uj*(*i*)*t*'s, *<sup>i</sup>* <sup>=</sup> 1, 2, of

*j*(1)*t*

*u*˜ *<sup>j</sup>*(1)1)*u*˜ *<sup>j</sup>*(1)*t*.

*j*(1)*t X<sup>T</sup>*

(*n*(*i*) <sup>−</sup> <sup>1</sup>)*λ*˜ *jt*, *<sup>k</sup>* <sup>=</sup> 1, ..., *<sup>n</sup>*(*i*); *<sup>i</sup>* <sup>=</sup> 1, 2, and adjust the

*<sup>t</sup>*=<sup>1</sup> *s*˜*jkt*/*T* for each *j* and *k*.

**(Step 3)** Define a cross data matrix by *<sup>S</sup>D*(1)*<sup>t</sup>* = ((*n*(1) <sup>−</sup> <sup>1</sup>)(*n*(2) <sup>−</sup> <sup>1</sup>))−1/2*X<sup>T</sup>*

*<sup>j</sup>*(*i*)*<sup>t</sup>* = (*u*˜*j*(1*i*)*t*, ..., *u*˜*j*(*n*(*i*)*i*)*t*), *i* = 1, 2.

**(Step 7)** If *t* < *T*, put *t* = *t* + 1 and go to Step 2; otherwise go to Step 8.

*p*

*λ*−1/2 *<sup>j</sup> s*˜*jk* *j*(1) *X<sup>T</sup>*

(*n*(*i*) <sup>−</sup> <sup>1</sup>)*λ*˜ *<sup>j</sup>*, *<sup>k</sup>* <sup>=</sup> 1, ..., *<sup>n</sup>*(*i*); *<sup>i</sup>* <sup>=</sup> 1, 2. Estimate the *<sup>j</sup>*-th PC

*<sup>o</sup>*1*Xo*2*u*˜*j*(2))*u*˜ *<sup>j</sup>*(2). After the

*D*(*i*)

→ 0 (8)

→ *zjk* (9)

*<sup>j</sup>*=<sup>1</sup> *xi*(*j*)/*n*(*i*). Calculate the singular values,

*<sup>o</sup>*1*Xo*2*u*˜*j*(2)*t*)*u*˜*j*(2)*t*. After the

. Let MSE(*s*˜*j*)

*<sup>o</sup>*1*Xo*2, where

**(Step 2)** Adjust the sign of ˜*uj*(2) by ˜*uj*(2) = Sign(*u*˜ *<sup>T</sup>*

*<sup>j</sup>*(*i*) = (*u*˜*j*1(*i*), ..., *u*˜*jn*(*i*)(*i*)), *i* = 1, 2.

score of *x<sup>k</sup>* by *s*˜*jk* = *s*˜*jk*(1), *k* = 1, ..., *n*(1) and *s*˜*jk*+*n*(1) = *s*˜*jk*(2), *k* = 1, ..., *n*(2).

One can calculate the singular vector ˜*uj*(*i*)'s by the eigenvectors of *<sup>S</sup>D*(*i*)*S<sup>T</sup>*

**Theorem 3.1.** *Assume that λ<sup>j</sup>* (*j* ≤ *m*) *has multiplicity one. Then, it holds that*

modification, let ˜*u<sup>T</sup>*

CDM-based PC scores.

= *n*−<sup>1</sup> ∑*<sup>n</sup>*

*that*

**(Step 3)** Calculate *s*˜*jk*(*i*) = *u*˜*jk*(*i*)

*(8) under the conditions (i)-(ii) in Corollary 2.1.*

*(9) under the conditions (i)-(ii) of Corollary 2.2.*

**[GCDM methodology for PC scores] (Step 1)** Set iteration number *T*. Set *t* = 1.

modification, let ˜*u<sup>T</sup>*

**(Step 6)** Calculate *s*˜*j*(*ki*)*<sup>t</sup>* = *u*˜*j*(*ki*)*<sup>t</sup>*

The CDM-based PC score can be generalized as follows:

*<sup>X</sup>oi* <sup>=</sup> *<sup>X</sup><sup>i</sup>* <sup>−</sup> [*xi*, ..., *<sup>x</sup>i*], *<sup>i</sup>* <sup>=</sup> 1, 2, and *<sup>x</sup><sup>i</sup>* <sup>=</sup> <sup>∑</sup>*n*(*i*)

**(Step 4)** Adjust the sign of ˜*uj*(1)*<sup>t</sup>* by ˜*uj*(1)*<sup>t</sup>* = Sign(*u*˜ *<sup>T</sup>*

subscript *k* of *s*˜*j*(*ki*)*<sup>t</sup>* as *s*˜*jkt* corresponding to *xk*.

**(Step 8)** Estimate the *<sup>j</sup>*-th PC score of *<sup>x</sup><sup>k</sup>* by *<sup>s</sup>*˜*jk*(*T*) = <sup>∑</sup>*<sup>T</sup>*

*SD*(1)*t*. If *t* = 1, go to Step 5; otherwise go to Step 4.

**(Step 5)** Adjust the sign of ˜*uj*(2)*<sup>t</sup>* by ˜*uj*(2)*<sup>t</sup>* = Sign(*u*˜ *<sup>T</sup>*

We analyzed gene expression data about prostate cancer given by Singh et al. (2002). Refer to Pochet et al. (2004) for details of the data set. The data set consisted of 12600 (= *p*) genes and 34 microarrays in which there were 9 samples from Normal Prostate and 25 samples from Prostate Tumors. As for Prostate Tumors, we chose the first 9 samples and set 18 (= *n*) microarrays in which there were 9 samples from Normal Prostate and 9 samples from Prostate Tumors. We assumed the mixture model given by (6) for the data set. We defined the data matrix by *X* : 12600 × 18. We set (*n*(1), *n*(2))=(9, 9) and *T* = 1000. We focused on the first three PC scores. We randomly divided *X* into *X*<sup>1</sup> : 12600 × 9 and *X*<sup>2</sup> : 12600 × 9, and calculated *s*˜*jkt*, *k* = 1, ..., 18, for *j* = 1, 2, 3. According to the GCDM methodology, we repeated this operation *T* = 1000 times and obtained *s*˜*jk*(*T*), *k* = 1, ..., 18; *j* = 1, 2, 3, as an estimate of the *<sup>j</sup>*-th PC score of *<sup>x</sup>k*. We also obtained (*λ*˜ <sup>1</sup>(*T*), *<sup>λ</sup>*˜ <sup>2</sup>(*T*), *<sup>λ</sup>*˜ <sup>3</sup>(*T*))=(2.77 <sup>×</sup> 108, 1.62 <sup>×</sup> 108, 6.34 <sup>×</sup> 107). Figure 3 gives the scatterplots of the first three PC scores on the (PC1, PC2) plane or the (PC1, PC3) plane. As observed in Figure 3, Normal Prostate (blue point) and Prostate Tumors (red point) seem to be separated clearly. It is obvious especially for the first PC score (PC1) line. All the first PC scores of the samples from Normal Prostate are negative, whereas those from Prostate Tumors are positive. This observation is theoretically supported by the arguments in Section 3.1.
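The CDM and GCDM computations above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names `cdm_pc_scores` and `gcdm_pc_scores` are our own, the sign alignment across splits (Steps 4-5) is simplified to aligning each split's full score vectors with those of the first split, and a small synthetic data set stands in for the microarray data.

```python
import numpy as np

def cdm_pc_scores(X, idx1, idx2, n_scores=3):
    # One cross-data-matrix pass: center each half of the p x n data
    # matrix, form S_D(1) = Xo1^T Xo2 / sqrt((n(1)-1)(n(2)-1)), and
    # read the PC scores off its singular vectors (Steps 2-3).
    X1, X2 = X[:, idx1], X[:, idx2]
    Xo1 = X1 - X1.mean(axis=1, keepdims=True)
    Xo2 = X2 - X2.mean(axis=1, keepdims=True)
    n1, n2 = len(idx1), len(idx2)
    SD1 = Xo1.T @ Xo2 / np.sqrt((n1 - 1) * (n2 - 1))
    U, lam, Vt = np.linalg.svd(SD1)         # left/right singular vectors
    scores = np.empty((n_scores, X.shape[1]))
    scores[:, idx1] = (U[:, :n_scores] * np.sqrt((n1 - 1) * lam[:n_scores])).T
    scores[:, idx2] = (Vt[:n_scores].T * np.sqrt((n2 - 1) * lam[:n_scores])).T
    return scores, lam[:n_scores]

def gcdm_pc_scores(X, T=100, n_scores=3, seed=0):
    # GCDM: average the CDM scores over T random splits (Steps 1-8).
    # Sign consistency across splits is simplified here: each split's
    # score vectors are aligned with those of the first split.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    total = np.zeros((n_scores, n))
    lam_sum = np.zeros(n_scores)
    ref = None
    for _ in range(T):
        perm = rng.permutation(n)
        s, lam = cdm_pc_scores(X, perm[: n // 2], perm[n // 2:], n_scores)
        if ref is None:
            ref = s
        signs = np.sign(np.sum(s * ref, axis=1))
        total += signs[:, None] * s
        lam_sum += lam
    return total / T, lam_sum / T
```

On data shaped like the Singh et al. set (a 12600 × 18 matrix), `gcdm_pc_scores(X, T=1000)` would return the averaged scores and singular values analogous to $\tilde{s}_{jk(T)}$ and $\tilde{\lambda}_{j(T)}$.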


Effective Methodologies for Statistical Inference on Microarray Studies 23


#### **4. Classification for high-dimension, low-sample-size data**

Suppose we have independent, $p$-dimensional populations, $\Pi_i$, $i = 1, 2$, having *unknown* mean vector $\mu_i = (\mu_{i1}, ..., \mu_{ip})^T$ and *unknown* positive-definite covariance matrix $\mathbf{\Sigma}_i$ for each $i$. *We do not assume that* $\mathbf{\Sigma}_1 = \mathbf{\Sigma}_2$. The eigen-decomposition of $\mathbf{\Sigma}_i$ ($i = 1, 2$) is given by $\mathbf{\Sigma}_i = H_i \mathbf{\Lambda}_i H_i^T$, where $\mathbf{\Lambda}_i$ is a diagonal matrix of eigenvalues $\lambda_{i1} \ge \cdots \ge \lambda_{ip} > 0$ and $H_i = [h_{i1}, ..., h_{ip}]$ is an orthogonal matrix of the corresponding eigenvectors. Having recorded i.i.d. samples $x_{i1}, ..., x_{in_i}$ from each $\Pi_i$, we have a $p \times n_i$ ($p > n_i$) data matrix $X_i = [x_{i1}, ..., x_{in_i}]$, where $x_{ij} = (x_{i1j}, ..., x_{ipj})^T$, $j = 1, ..., n_i$. We assume $n_i \ge 4$, $i = 1, 2$. Then, $Z_i = \mathbf{\Lambda}_i^{-1/2} H_i^T (X_i - [\mu_i, ..., \mu_i])$ is considered as a $p \times n_i$ sphered data matrix from a distribution with zero mean and the identity covariance matrix. Here, we write $Z_i = [z_{i1}, ..., z_{in_i}]$ and $z_{ij} = (z_{i1j}, ..., z_{ipj})^T$, $j = 1, ..., n_i$. Note that $E(z_{ijl}^2) = 1$ and $E(z_{ijl} z_{ij'l}) = 0$ for $i = 1, 2$; $j\,(\ne j') = 1, ..., p$; $l = 1, ..., n_i$. We assume that $\lambda_{ip} > 0$ ($i = 1, 2$) as $p \to \infty$ and that the fourth moments of each variable in $Z_i$ are uniformly bounded. In this section, we make the following assumption for the $\Pi_i$'s:

**(A-i)** *zijl*, *j* = 1, ..., *p*, are independent for *i* = 1, 2.

One of the population distributions satisfying (A-i) is $N_p(\mu_i, \mathbf{\Sigma}_i)$. We also assume the following condition for the $\mathbf{\Sigma}_i$'s as necessary:

**(A-ii)** $$\frac{\text{tr}(\mathbf{\Sigma}_i^t)}{p} < \infty \ (t = 1, 2) \quad \text{and} \quad \frac{\text{tr}(\mathbf{\Sigma}_i^4)}{p^2} \to 0 \ \text{as } p \to \infty, \ \text{for } i = 1, 2.$$

**Remark 4.1.** If all $\lambda_{ij}$'s are bounded, (A-ii) trivially holds. For a spiked model such as $\lambda_{ij} = a_{ij}\,p^{\alpha_{ij}}$ ($j = 1, ..., m_i$) and $\lambda_{ij} = c_{ij}$ ($j = m_i + 1, ..., p$) with positive constants $a_{ij}$'s, $c_{ij}$'s and $\alpha_{ij}$'s, (A-ii) holds under the condition that $\alpha_{ij} < 1/2$, $j = 1, ..., m_i\,(< \infty)$; $i = 1, 2$. As an interesting example, (A-ii) holds for $\mathbf{\Sigma}_{i'} = c_{i'}\big(\rho_{i'}^{|i-j|^{q_{i'}}}\big)$, $i' = 1, 2$, where the $c_{i'}$'s, $q_{i'}$'s and $\rho_{i'}$'s $(< 1)$ are positive constants.
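A quick numerical check of Remark 4.1 (our own illustration, with a single spiked spectrum rather than two populations): for spike exponent $a = 0.4 < 1/2$, the ratio $\mathrm{tr}(\mathbf{\Sigma}^2)/p$ stays bounded while $\mathrm{tr}(\mathbf{\Sigma}^4)/p^2$ decays like $p^{-0.4}$, as (A-ii) requires.

```python
import numpy as np

def aii_ratios(p, a=0.4, m=3):
    # Spiked spectrum: lam_j = p**a for j <= m, lam_j = 1 otherwise.
    # Return tr(Sigma^2)/p and tr(Sigma^4)/p^2 for this spectrum.
    lam = np.ones(p)
    lam[:m] = float(p) ** a
    return (lam ** 2).sum() / p, (lam ** 4).sum() / p ** 2

r2_small, r4_small = aii_ratios(10 ** 3)
r2_big, r4_big = aii_ratios(10 ** 6)
```

With $a = 0.4$ the first ratio tends to 1 and the second behaves like $3p^{-0.4} \to 0$; with $a \ge 1/2$ the second ratio would no longer vanish, violating (A-ii).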

#### **4.1 Discriminant rule for HDLSS data**

Let $x_0$ be an observation vector on an individual belonging to $\Pi_1$ or to $\Pi_2$. Having recorded $x_{i1}, ..., x_{in_i}$ from each $\Pi_i$, we estimate $\mu_i$ and $\mathbf{\Sigma}_i$ by $\overline{x}_{in_i} = \sum_{j=1}^{n_i} x_{ij}/n_i$ and $S_{in_i} = \sum_{j=1}^{n_i}(x_{ij} - \overline{x}_{in_i})(x_{ij} - \overline{x}_{in_i})^T/(n_i - 1)$. Aoshima and Yata (2011) considered a discriminant rule that classifies $x_0$ into $\Pi_1$ if

$$\frac{p||\mathbf{x}\_{0} - \overline{\mathbf{x}}\_{1n\_{1}}||^{2}}{\text{tr}(\mathbf{S}\_{1n\_{1}})} - \frac{p||\mathbf{x}\_{0} - \overline{\mathbf{x}}\_{2n\_{2}}||^{2}}{\text{tr}(\mathbf{S}\_{2n\_{2}})} - p\log\left\{\frac{\text{tr}(\mathbf{S}\_{2n\_{2}})}{\text{tr}(\mathbf{S}\_{1n\_{1}})}\right\} - \frac{p}{n\_{1}} + \frac{p}{n\_{2}} + \gamma < 0 \tag{10}$$

and into $\Pi_2$ otherwise. Here, $-p/n_1 + p/n_2$ is a bias correction and $\gamma$ is a tuning parameter. We denote the error rate of misclassifying an individual from $\Pi_1$ (into $\Pi_2$) or from $\Pi_2$ (into $\Pi_1$) by $e(2|1)$ or $e(1|2)$, respectively. Let $\Delta = ||\mu_1 - \mu_2||^2$ and $\Delta_{\mathbf{\Sigma}_i} = (\text{tr}(\mathbf{\Sigma}_1) - \text{tr}(\mathbf{\Sigma}_2))^2/\text{tr}(\mathbf{\Sigma}_i)$, $i = 1, 2$. Let us write $\Delta_i = \Delta + \Delta_{\mathbf{\Sigma}_i}/2$, $i = 1, 2$, and $\Delta_\star = \min_{i=1,2} \Delta_i$. Aoshima and Yata (2011) gave the following property.

**Theorem 4.1.** *Assume (A-i)-(A-ii). Under the condition that* $\max_{i=1,2}\{\text{tr}(\mathbf{\Sigma}_i^2)\}/(\Delta_\star^2 \min_{i=1,2}\{n_i\}) \to 0$ *as* $p \to \infty$*, for the discriminant rule given by (10) with* $\gamma = 0$*, it holds as* $p \to \infty$ *that*

$$e(2|1) \to 0 \quad \text{and} \quad e(1|2) \to 0. \tag{11}$$
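As a concrete illustration, rule (10) needs only the sample means and the traces $\text{tr}(S_{in_i})$, so it can be applied even when $p$ is far larger than the $n_i$ without ever forming a $p \times p$ matrix. The following is a minimal sketch of our own (the name `classify_hdlss` is hypothetical), using $\gamma = 0$ as in Theorem 4.1:

```python
import numpy as np

def classify_hdlss(x0, X1, X2, gamma=0.0):
    # Discriminant rule (10): classify x0 into Pi_1 iff
    #   p||x0 - xbar_1||^2/tr(S_1) - p||x0 - xbar_2||^2/tr(S_2)
    #     - p log(tr(S_2)/tr(S_1)) - p/n_1 + p/n_2 + gamma < 0.
    # tr(S_{i n_i}) is just the sum of coordinate-wise sample
    # variances, so the p x p covariance matrix is never formed.
    p = X1.shape[0]
    w = gamma
    for sign, Xi in ((+1, X1), (-1, X2)):
        ni = Xi.shape[1]
        xbar = Xi.mean(axis=1)
        trS = np.var(Xi, axis=1, ddof=1).sum()
        # the +p*log(trS) terms combine to -p*log(tr(S_2)/tr(S_1))
        w += sign * (p * np.sum((x0 - xbar) ** 2) / trS
                     + p * np.log(trS) - p / ni)
    return 1 if w < 0 else 2
```

The columns of `X1` and `X2` are the training observations from $\Pi_1$ and $\Pi_2$; the rule scales each squared distance by the corresponding trace so that unequal covariances, as allowed in this section, do not bias the comparison.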


**Remark 4.2.** Assume (A-i)-(A-ii). Let us consider the case that $\text{tr}(\mathbf{\Sigma}_1)/\text{tr}(\mathbf{\Sigma}_2) \ne 1$ as $p \to \infty$. Then, it follows that $\min_{i=1,2} \Delta_{\mathbf{\Sigma}_i}/p > 0$ as $p \to \infty$. Since it holds that $\max_{i=1,2}\{\text{tr}(\mathbf{\Sigma}_i^2)\}/(\Delta_\star^2 \min_{i=1,2}\{n_i\}) \to 0$ as $p \to \infty$, we can claim (11) in this case.

**Remark 4.3.** Let $n_{i(1)} = [n_i/2] + 1$ and $n_{i(2)} = n_i - n_{i(1)}$ for each $\Pi_i$ ($i = 1, 2$). We omit the subscript $i$ for a while. For each $\Pi$, split $x_1, ..., x_n$ into $X_1 = [x_{11}, ..., x_{1n(1)}]$ and $X_2 = [x_{21}, ..., x_{2n(2)}]$. Let $X_{o1} = X_1 - [\overline{x}_1, ..., \overline{x}_1]$ and $X_{o2} = X_2 - [\overline{x}_2, ..., \overline{x}_2]$, where $\overline{x}_1 = \sum_{j=1}^{n(1)} x_{1j}/n(1)$ and $\overline{x}_2 = \sum_{j=1}^{n(2)} x_{2j}/n(2)$. Define $S_{n(1)} = (n(1) - 1)^{-1} X_{o1} X_{o1}^T$ and $S_{n(2)} = (n(2) - 1)^{-1} X_{o2} X_{o2}^T$. Note that $\text{tr}(S_{n(1)} S_{n(2)}) = \text{tr}(S_{D(1)} S_{D(1)}^T) = \sum_{j=1}^{n(2)-1} \tilde{\lambda}_j^2$. Then, we have that $E(\text{tr}(S_{n(1)} S_{n(2)})) = \text{tr}(\mathbf{\Sigma}^2)$. As for $\text{tr}(\mathbf{\Sigma}_i^2)$, Yata (2010) considered an unbiased estimator, $\text{tr}(S_{in_i(1)} S_{in_i(2)})$, as an application of the CDM methodology given by Yata and Aoshima (2010a,b).

**Remark 4.4.** We note that $\Delta_\star$ is estimated by

$$||\overline{\mathbf{x}}_{1n_1} - \overline{\mathbf{x}}_{2n_2}||^2 - \sum_{i=1}^2 \text{tr}(\mathbf{S}_{in_i})/n_i + \frac{|\text{tr}(\mathbf{S}_{1n_1}) - \text{tr}(\mathbf{S}_{2n_2})|^2}{2\max_{i=1,2} \text{tr}(\mathbf{S}_{in_i})} \ \ (= \widehat{\Delta}_\star, \text{ say}).$$
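The cross-data-matrix estimator of $\text{tr}(\mathbf{\Sigma}^2)$ from Remark 4.3 and the estimator $\widehat{\Delta}_\star$ above translate directly into code. The sketch below uses our own function names (`tr_sigma2_hat`, `delta_star_hat`) and assumes the columns of each data matrix are i.i.d. observations:

```python
import numpy as np

def tr_sigma2_hat(X):
    # Remark 4.3: split the n columns with n(1) = [n/2] + 1, center
    # each half, and use tr(S_n(1) S_n(2)) = tr(S_D(1) S_D(1)^T),
    # i.e. the sum of squared singular values of the cross data
    # matrix -- an unbiased estimator of tr(Sigma^2).
    n = X.shape[1]
    n1 = n // 2 + 1
    X1, X2 = X[:, :n1], X[:, n1:]
    Xo1 = X1 - X1.mean(axis=1, keepdims=True)
    Xo2 = X2 - X2.mean(axis=1, keepdims=True)
    SD1 = Xo1.T @ Xo2 / np.sqrt((n1 - 1) * (n - n1 - 1))
    return np.sum(SD1 ** 2)

def delta_star_hat(X1, X2):
    # Remark 4.4: ||xbar_1 - xbar_2||^2 - sum_i tr(S_i)/n_i
    #             + |tr(S_1) - tr(S_2)|^2 / (2 max_i tr(S_i)).
    d = np.sum((X1.mean(axis=1) - X2.mean(axis=1)) ** 2)
    tr = [np.var(Xi, axis=1, ddof=1).sum() for Xi in (X1, X2)]
    return (d - tr[0] / X1.shape[1] - tr[1] / X2.shape[1]
            + abs(tr[0] - tr[1]) ** 2 / (2 * max(tr)))
```

Computing $\text{tr}(S_{n(1)} S_{n(2)})$ through the cross data matrix costs only an $n(1) \times n(2)$ product, which is what makes the estimator practical for microarray-sized $p$.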

We analyzed gene expression data given by Armstrong et al. (2001), whose data set consisted of 12582 ($= p$) genes. We had two populations of leukemia subtypes, i.e., $\Pi_1$: acute lymphoblastic leukemia (ALL, 24 samples) and $\Pi_2$: acute myeloid leukemia (AML, 28 samples). We set $n_1 = n_2 = 10$. Then, we constructed the discriminant rule given by (10) with $\gamma = 0$. From Remarks 4.3 and 4.4, we calculated $\max_{i=1,2}\{\text{tr}(S_{in_i(1)} S_{in_i(2)})\} = 3.16 \times 10^{19}$ and $\widehat{\Delta}_\star = 2.67 \times 10^{10}$, so that $\max_{i=1,2}\{\text{tr}(S_{in_i(1)} S_{in_i(2)})\}/(\widehat{\Delta}_\star^2 \min_{i=1,2}\{n_i\}) = 0.0044$. Thus, one may conclude that $\max_{i=1,2}\{\text{tr}(\mathbf{\Sigma}_i^2)\}/(\Delta_\star^2 \min_{i=1,2}\{n_i\})$ is sufficiently small. Hence, from Theorem 4.1, the discriminant rule given by (10) with $\gamma = 0$ was expected to satisfy (11). In Table 1, we investigated the performance of the discriminant rule by using test data sets consisting of the $24 - n_1 = 14$ remaining samples from $\Pi_1$ and the $28 - n_2 = 18$ remaining samples from $\Pi_2$. We observed that the discriminant rule achieved $e(1|2) = 0$ and $e(2|1) = 0$, as expected by the theory.


| Correct discrimination rate | Value |
|---|---|
| $1 - e(2\|1)$ | 14/14 (= 1.0) |
| $1 - e(1\|2)$ | 18/18 (= 1.0) |

Table 1. The correct discrimination rates for test data sets consisting of 14 samples from $\Pi_1$ and 18 samples from $\Pi_2$.

#### **4.2 Sample size determination for classification**

One would be interested in designing the discriminant rule given by (10) so that both $e(2|1) \le \alpha$ and $e(1|2) \le \beta$ hold when $\Delta_\star \ge \Delta_L$, where $\alpha, \beta \in (0, 1/2)$ and $\Delta_L\,(> 0)$ are prespecified constants. We assume $\Delta_L = o(p^{1/2})$. Aoshima and Yata (2011) showed the following property.

**Theorem 4.2.** *Assume that* $\text{tr}(\mathbf{\Sigma}_1)/\text{tr}(\mathbf{\Sigma}_2) \to 1$ *as* $p \to \infty$*. Let*

$$\omega(\mathbf{x}_0) = \frac{p||\mathbf{x}_0 - \overline{\mathbf{x}}_{1n_1}||^2}{\text{tr}(\mathbf{S}_{1n_1})} - \frac{p||\mathbf{x}_0 - \overline{\mathbf{x}}_{2n_2}||^2}{\text{tr}(\mathbf{S}_{2n_2})} - p\log\left\{\frac{\text{tr}(\mathbf{S}_{2n_2})}{\text{tr}(\mathbf{S}_{1n_1})}\right\} - \frac{p}{n_1} + \frac{p}{n_2}.$$

Since **Σ***i*'s are unknown, it is necessary to estimate *Ci*'s in (12) with some pilot samples. We

Effective Methodologies for Statistical Inference on Microarray Studies 25

**(Step 1)** Choose a pilot sample size, *m*(≥ 4), such as *m*/*Ci* ∈ (0, 1), *i* = 1, 2, as *p* → ∞. Take pilot samples of size *m* from each Π*<sup>i</sup>* and define *X<sup>i</sup>* = [*xi*1, ..., *xim*], *i* = 1, 2. Let *m*(1) = [*m*/2] + 1 and *m*(2) = *m* − *m*(1). For each Π*i*, divide *X<sup>i</sup>* into *X<sup>i</sup>* = [*Xi*1, *Xi*2] with

, ..., *xim*(1)

, ..., *xim*(2)

tr(*Sim*(1)*Sim*(2))1/4

**(Step 2)** Take additional samples *xij*, *j* = *m* + 1, ..., *Ni*, of size *Ni* − *m* from each Π*i*. By combining the initial samples and the additional samples, calculate *<sup>x</sup>iNi* <sup>=</sup> <sup>∑</sup>*Ni*

tr(*S*2*N*<sup>2</sup> ) <sup>−</sup> *<sup>p</sup>* log tr(*S*2*N*<sup>2</sup> )

and into <sup>Π</sup><sup>2</sup> otherwise, where *<sup>γ</sup>*<sup>ˆ</sup> = (tr(*S*1*N*<sup>1</sup> <sup>+</sup> *<sup>S</sup>*2*N*<sup>2</sup> )/(2*p*))−1Δ*L*(*z<sup>β</sup>* <sup>−</sup> *<sup>z</sup>α*)/(*z<sup>α</sup>* <sup>+</sup> *<sup>z</sup>β*).

**Theorem 4.4.** *Assume (A-i)-(A-ii). Then, under the regularity conditions, for the discriminant rule*

lim sup *e*(2|1) ≤ *α* and lim sup *e*(1|2) ≤ *β*

**Remark 4.6.** One may take different pilot-sample-sizes, *mi*(≥ 4), such as *mi*/*Ci* ∈ (0, 1) as

*p*

])(*Xi*<sup>1</sup> − [*xim*(1)

and (13)

])(*Xi*<sup>2</sup> − [*xim*(2)

2 ∑ *j*=1

tr(*S*1*N*<sup>1</sup> )

*<sup>m</sup>*(2) <sup>−</sup> <sup>1</sup> ,

*m*(1) − 1

, ..., *xim*(1)

, ..., *xim*(2)

*<sup>j</sup>*=*m*(1)+<sup>1</sup> *xij*/*m*(2). Define the total sample size

tr(*Sjm*(1)*Sjm*(2))1/4

)*T*/(*Ni* <sup>−</sup> <sup>1</sup>), *<sup>i</sup>* <sup>=</sup> 1, 2. Then, we classify *<sup>x</sup>*<sup>0</sup> into <sup>Π</sup><sup>1</sup> if

<sup>+</sup> *<sup>p</sup> N*2

*p*

→ 0, *i* = 1, 2, under Δ*<sup>L</sup>* → ∞ as *p* → ∞.

→ 1 for *i* = 1, 2, which

 <sup>−</sup> *<sup>p</sup> N*1 ])*T*

])*<sup>T</sup>*

+ 1 

, (14)

*<sup>j</sup>*=<sup>1</sup> *xij*/*Ni*

+ *γ*ˆ < 0 (15)

proceed the following two steps:

where *<sup>x</sup>im*(1) <sup>=</sup> <sup>∑</sup>*m*(1)

*Ni* <sup>=</sup> max

*m*,

*<sup>j</sup>*=1(*xij* − *xiNi*

*given by (15) with (14), it holds as p* → ∞ *that*

are in the HDLSS situation in the sense that *Ni*/*p*

*<sup>p</sup>*||*x*<sup>0</sup> <sup>−</sup> *<sup>x</sup>*1*N*<sup>1</sup> ||<sup>2</sup>

for each Π*<sup>i</sup>* by

and *<sup>S</sup>iNi* <sup>=</sup> <sup>∑</sup>*Ni*

*when* Δ� ≥ Δ*L*.

**[Two-stage procedure for classification]**

*Xi*<sup>1</sup> : *p* × *m*(1) and *Xi*<sup>2</sup> : *p* × *m*(2), and calculate

*<sup>S</sup>im*(1) <sup>=</sup> (*Xi*<sup>1</sup> <sup>−</sup> [*xim*(1)

*<sup>S</sup>im*(2) <sup>=</sup> (*Xi*<sup>2</sup> <sup>−</sup> [*xim*(2)

 (*z<sup>α</sup>* + *<sup>z</sup>β*)2*σ*<sup>ˆ</sup> Δ2 *L*

where *<sup>σ</sup>*<sup>ˆ</sup> <sup>=</sup> max{tr(*S*1*m*(1)*S*1*m*(2))1/2, tr(*S*2*m*(1)*S*2*m*(2))1/2}.

tr(*S*1*N*<sup>1</sup> ) <sup>−</sup> *<sup>p</sup>*||*x*<sup>0</sup> <sup>−</sup> *<sup>x</sup>*2*N*<sup>2</sup> ||<sup>2</sup>

Aoshima and Yata (2011) gave the following theorem.

)(*xij* − *xiNi*

*p* → ∞ for *i* = 1, 2. Then, the assertion in Theorem 4.4 is still claimed. **Remark 4.7.** Assume (A-i)-(A-ii). Then, it holds as *p* → ∞ that *Ni*/*Ci*

*<sup>j</sup>*=<sup>1</sup> *<sup>x</sup>ij*/*m*(1) and *<sup>x</sup>im*(2) <sup>=</sup> <sup>∑</sup>*<sup>m</sup>*

*Then, under the regularity conditions, it holds as p* → ∞ *and n*1, *n*<sup>2</sup> → ∞ *that*

$$\frac{\omega(\mathbf{x}\_0) + \Delta\_2(\text{tr}(\mathbf{E}\_2)/p)^{-1}}{2\sqrt{(\text{tr}(\mathbf{E}\_1)/p)^{-2}\text{tr}(\mathbf{E}\_1^2)/n\_1 + (\text{tr}(\mathbf{E}\_2)/p)^{-2}\text{tr}(\mathbf{E}\_1\mathbf{E}\_2)/n\_2}} \Rightarrow N(0,1) \quad \text{when } \mathbf{x}\_0 \in \Pi\_1;$$

$$\frac{\omega(\mathbf{x}\_0) - \Delta\_1(\text{tr}(\mathbf{E}\_1)/p)^{-1}}{2\sqrt{(\text{tr}(\mathbf{E}\_2)/p)^{-2}\text{tr}(\mathbf{E}\_2^2)/n\_2 + (\text{tr}(\mathbf{E}\_1)/p)^{-2}\text{tr}(\mathbf{E}\_1\mathbf{E}\_2)/n\_1}} \Rightarrow N(0,1) \quad \text{when } \mathbf{x}\_0 \in \Pi\_2.$$

Let *<sup>σ</sup>* <sup>=</sup> max{tr(**Σ**<sup>2</sup> 1)1/2, tr(**Σ**<sup>2</sup> <sup>2</sup>)1/2}. We find the sample size for each <sup>Π</sup>*<sup>i</sup>* (*<sup>i</sup>* <sup>=</sup> 1, 2) as

$$m\_i \geq \frac{(z\_{\it\prime} + z\_{\beta})^2 \sigma}{\Delta\_L^2} \text{tr}(\mathbf{E}\_i^2)^{1/4} \sum\_{j=1}^2 \text{tr}(\mathbf{E}\_j^2)^{1/4} \quad (= \mathbf{C}\_{i\prime} \text{ say}), \tag{12}$$

where *z<sup>α</sup>* is the upper *α* point of *N*(0, 1). Note that *Ci* = *O*(*p*/Δ<sup>2</sup> *<sup>L</sup>*) for *i* = 1, 2, under (A-ii). Thus under Δ*<sup>L</sup>* → ∞ as *p* → ∞, it holds that *Ci*/*p* → 0 as *p* → ∞ . Then, Aoshima and Yata (2011) gave the following theorem.

**Theorem 4.3.** *Assume (A-i)-(A-ii). Let <sup>γ</sup>* = (*tr*(*S*1*n*<sup>1</sup> <sup>+</sup> *<sup>S</sup>*2*n*<sup>2</sup> )/(2*p*))−1Δ*L*(*z<sup>β</sup>* <sup>−</sup> *<sup>z</sup>α*)/(*z<sup>α</sup>* <sup>+</sup> *<sup>z</sup>β*) *in (10). Then, under the regularity conditions, for the discriminant rule given by (10) with (12), it holds as p* → ∞ *that*

$$
\limsup e(2|1) \le \alpha \quad \text{and} \quad \limsup e(1|2) \le \beta
$$

*when* Δ ≥ Δ*L.*

**Remark 4.5.** One can design Δ*<sup>L</sup>* by using the two sample test given by Aoshima and Yata (2011). Under the regularity conditions, it holds that

$$\frac{|\left|\overline{\mathfrak{X}}\_{1n\_1} - \overline{\mathfrak{X}}\_{2n\_2}\right\||^2 - \sum\_{i=1}^2 \text{tr}(\mathbf{S}\_{i\bar{n}\_i})/n\_i - \Delta}{\sqrt{\widehat{Var}(||\overline{\mathfrak{X}}\_{1n\_1} - \overline{\mathfrak{X}}\_{2n\_2}||^2)}} \Rightarrow N(0,1)$$

as *p* → ∞ and *ni* → ∞, *i* = 1, 2, where

$$\widehat{Var}(||\mathfrak{X}\_{1\mathbb{N}\_1} - \mathfrak{X}\_{2\mathbb{N}\_2}||^2) = 2\frac{\text{tr}(\mathbf{S}\_{1\mathbb{N}\_1(1)}\mathbf{S}\_{1\mathbb{N}\_1(2)})}{n\_1(n\_1 - 1)} + 2\frac{\text{tr}(\mathbf{S}\_{2\mathbb{N}\_2(1)}\mathbf{S}\_{2\mathbb{N}\_2(2)})}{n\_2(n\_2 - 1)} + 4\frac{\text{tr}(\mathbf{S}\_{1\mathbb{N}\_1}\mathbf{S}\_{2\mathbb{N}\_2})}{n\_1n\_2}.$$

Note that *<sup>E</sup>*(||*x*1*n*<sup>1</sup> <sup>−</sup> *<sup>x</sup>*2*n*<sup>2</sup> ||<sup>2</sup> <sup>−</sup> <sup>∑</sup><sup>2</sup> *<sup>i</sup>*=<sup>1</sup> tr(*Sini* )/*ni*) = Δ. Thus it follows that

$$P\left(\frac{||\overline{\mathbf{x}}\_{1n\_1} - \overline{\mathbf{x}}\_{2n\_2}||^2 - \sum\_{i=1}^2 \text{tr}(\mathbf{S}\_{\text{in}\_i})/n\_i}{\sqrt{\widehat{Var}(||\overline{\mathbf{x}}\_{1n\_1} - \overline{\mathbf{x}}\_{2n\_2}||^2)}} - z\_{n'} \le \frac{\Delta}{\sqrt{\widehat{Var}(||\overline{\mathbf{x}}\_{1n\_1} - \overline{\mathbf{x}}\_{2n\_2}||^2)}}\right) \to 1 - \alpha'$$

with *α*� ∈ (0, 1/2). From the fact that Δ ≥ Δ, we design a lower bound of Δ by

$$\Delta\_L = ||\overline{\mathbf{x}\_{1n\_1}} - \overline{\mathbf{x}\_{2n\_2}}||^2 - \sum\_{i=1}^2 \text{tr}(\mathbf{S}\_{i n\_i})/n\_i - z\_{a'}\sqrt{\widehat{Var}(||\overline{\mathbf{x}\_{1n\_1}} - \overline{\mathbf{x}\_{2n\_2}}||^2)}$$

for sufficiently small *α*� . Since **Σ***i*'s are unknown, it is necessary to estimate *Ci*'s in (12) with some pilot samples. We proceed the following two steps:

#### **[Two-stage procedure for classification]**


**(Step 1)** Choose a pilot sample size, $m\,(\ge 4)$, such that $m/C\_i \in (0, 1)$, $i = 1, 2$, as $p \to \infty$. Take pilot samples of size $m$ from each $\Pi\_i$ and define $\mathbf{X}\_i = [\mathbf{x}\_{i1}, ..., \mathbf{x}\_{im}]$, $i = 1, 2$. Let $m\_{(1)} = [m/2] + 1$ and $m\_{(2)} = m - m\_{(1)}$. For each $\Pi\_i$, divide $\mathbf{X}\_i$ into $\mathbf{X}\_i = [\mathbf{X}\_{i1}, \mathbf{X}\_{i2}]$ with $\mathbf{X}\_{i1} : p \times m\_{(1)}$ and $\mathbf{X}\_{i2} : p \times m\_{(2)}$, and calculate

$$\mathbf{S}\_{im(1)} = \frac{(\mathbf{X}\_{i1} - [\overline{\mathbf{x}}\_{im\_{(1)}}, \dots, \overline{\mathbf{x}}\_{im\_{(1)}}])(\mathbf{X}\_{i1} - [\overline{\mathbf{x}}\_{im\_{(1)}}, \dots, \overline{\mathbf{x}}\_{im\_{(1)}}])^T}{m\_{(1)} - 1}$$
 and

$$\mathbf{S}\_{im(2)} = \frac{(\mathbf{X}\_{i2} - [\overline{\mathbf{x}}\_{im\_{(2)}}, \dots, \overline{\mathbf{x}}\_{im\_{(2)}}])(\mathbf{X}\_{i2} - [\overline{\mathbf{x}}\_{im\_{(2)}}, \dots, \overline{\mathbf{x}}\_{im\_{(2)}}])^T}{m\_{(2)} - 1},$$

where $\overline{\mathbf{x}}\_{im\_{(1)}} = \sum\_{j=1}^{m\_{(1)}} \mathbf{x}\_{ij}/m\_{(1)}$ and $\overline{\mathbf{x}}\_{im\_{(2)}} = \sum\_{j=m\_{(1)}+1}^{m} \mathbf{x}\_{ij}/m\_{(2)}$. Define the total sample size for each $\Pi\_i$ by

$$N\_i = \max\left\{ m, \left[ \frac{(z\_\alpha + z\_\beta)^2 \hat{\sigma}}{\Delta\_L^2}\, \text{tr}(\mathbf{S}\_{im(1)}\mathbf{S}\_{im(2)})^{1/4} \sum\_{j=1}^2 \text{tr}(\mathbf{S}\_{jm(1)}\mathbf{S}\_{jm(2)})^{1/4} \right] + 1 \right\},\tag{14}$$

where *<sup>σ</sup>*<sup>ˆ</sup> <sup>=</sup> max{tr(*S*1*m*(1)*S*1*m*(2))1/2, tr(*S*2*m*(1)*S*2*m*(2))1/2}.

**(Step 2)** Take additional samples $\mathbf{x}\_{ij}$, $j = m+1, ..., N\_i$, of size $N\_i - m$ from each $\Pi\_i$. By combining the initial samples and the additional samples, calculate $\overline{\mathbf{x}}\_{iN\_i} = \sum\_{j=1}^{N\_i} \mathbf{x}\_{ij}/N\_i$ and $\mathbf{S}\_{iN\_i} = \sum\_{j=1}^{N\_i}(\mathbf{x}\_{ij} - \overline{\mathbf{x}}\_{iN\_i})(\mathbf{x}\_{ij} - \overline{\mathbf{x}}\_{iN\_i})^T/(N\_i - 1)$, $i = 1, 2$. Then, we classify $\mathbf{x}\_0$ into $\Pi\_1$ if

$$\frac{p||\mathbf{x}\_{0} - \overline{\mathbf{x}}\_{1N\_{1}}||^{2}}{\text{tr}(\mathbf{S}\_{1N\_{1}})} - \frac{p||\mathbf{x}\_{0} - \overline{\mathbf{x}}\_{2N\_{2}}||^{2}}{\text{tr}(\mathbf{S}\_{2N\_{2}})} - p\log\left\{\frac{\text{tr}(\mathbf{S}\_{2N\_{2}})}{\text{tr}(\mathbf{S}\_{1N\_{1}})}\right\} - \frac{p}{N\_{1}} + \frac{p}{N\_{2}} + \hat{\gamma} < 0 \tag{15}$$

and into $\Pi\_2$ otherwise, where $\hat{\gamma} = (\text{tr}(\mathbf{S}\_{1N\_1} + \mathbf{S}\_{2N\_2})/(2p))^{-1}\Delta\_L(z\_\beta - z\_\alpha)/(z\_\alpha + z\_\beta)$.
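Given the combined samples of sizes $N\_1, N\_2$, the rule (15) itself is inexpensive to evaluate. A minimal Python sketch (the function name is ours):

```python
import numpy as np

def classify_rule15(x0, X1, X2, gamma_hat):
    """Sketch of rule (15): return 1 (classify into Pi_1) or 2 (Pi_2).

    X1, X2 : p x N_i arrays of the combined pilot + additional samples.
    """
    p = x0.shape[0]
    N1, N2 = X1.shape[1], X2.shape[1]
    xb1, xb2 = X1.mean(axis=1), X2.mean(axis=1)
    t1 = float(np.trace(np.cov(X1)))      # tr(S_{1N_1})
    t2 = float(np.trace(np.cov(X2)))      # tr(S_{2N_2})
    # left-hand side of (15)
    w = (p * np.sum((x0 - xb1) ** 2) / t1
         - p * np.sum((x0 - xb2) ** 2) / t2
         - p * np.log(t2 / t1) - p / N1 + p / N2 + gamma_hat)
    return 1 if w < 0 else 2
```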

Aoshima and Yata (2011) gave the following theorem.

**Theorem 4.4.** *Assume (A-i)-(A-ii). Then, under the regularity conditions, for the discriminant rule given by (15) with (14), it holds as p* → ∞ *that*

$$
\limsup e(2|1) \le \alpha \quad \text{and} \quad \limsup e(1|2) \le \beta
$$

*when* $\tilde{\Delta} \ge \Delta\_L$.

**Remark 4.6.** One may take different pilot sample sizes, $m\_i\,(\ge 4)$, such that $m\_i/C\_i \in (0, 1)$ as $p \to \infty$ for $i = 1, 2$. Then, the assertion in Theorem 4.4 still holds.

**Remark 4.7.** Assume (A-i)-(A-ii). Then, it holds as $p \to \infty$ that $N\_i/C\_i \xrightarrow{p} 1$ for $i = 1, 2$. The procedure remains in the HDLSS situation in the sense that $N\_i/p \xrightarrow{p} 0$, $i = 1, 2$, under $\Delta\_L \to \infty$ as $p \to \infty$.

Table 2. The correct discrimination rates, by (15), DLDR and DQDR, for test data sets consisting of 85 samples from $\Pi\_1$ and 27 samples from $\Pi\_2$.

| | (15) | DLDR | DQDR |
|---|---|---|---|
| 1 − *e*(2\|1) | 75/85 (= 0.882) | 63/85 (= 0.741) | 76/85 (= 0.894) |
| 1 − *e*(1\|2) | 27/27 (= 1.0) | 24/27 (= 0.889) | 24/27 (= 0.889) |

Effective Methodologies for Statistical Inference on Microarray Studies 27


#### **4.3 Demonstration**

We analyzed gene expression data given by Chiaretti et al. (2004), in which the data set consisted of 12625 (= *p*) genes and 128 samples. Note that the expression measures were obtained by using the three-step robust multichip average (RMA) preprocessing method. Refer to Pollard et al. (2005) as well for the details. The data set had two tumor cellular subtypes, $\Pi\_1$: B-cell (95 samples) and $\Pi\_2$: T-cell (33 samples). We set $\alpha = 0.1$, $\beta = 0.02$ and $m = 6$. Our goal was to construct a discriminant rule ensuring that both $1 - e(2|1) \ge 0.9$ and $1 - e(1|2) \ge 0.98$ when $\tilde{\Delta} \ge \Delta\_L$, where $\Delta\_L$ is designed later.

First, we took the first 6 samples from each $\Pi\_i$ as a pilot sample. According to Remark 4.5, we calculated $||\overline{\mathbf{x}}\_{1m} - \overline{\mathbf{x}}\_{2m}||^2 - \sum\_{i=1}^2 \text{tr}(\mathbf{S}\_{im})/m = 1890$ and $\widehat{Var}(||\overline{\mathbf{x}}\_{1m} - \overline{\mathbf{x}}\_{2m}||^2) = 87860$. By setting $\alpha' = 0.01$ so that $z\_{\alpha'} = 2.33$, we designed a lower bound of $\tilde{\Delta}$ by

$$\Delta\_L = ||\overline{\mathbf{x}}\_{1m} - \overline{\mathbf{x}}\_{2m}||^2 - \sum\_{i=1}^2 \text{tr}(\mathbf{S}\_{im})/m - z\_{\alpha'}\sqrt{\widehat{Var}(||\overline{\mathbf{x}}\_{1m} - \overline{\mathbf{x}}\_{2m}||^2)} = 1200.$$
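The computation of $\Delta\_L$ from a data matrix per group can be sketched as below; the split-in-half construction of the half-sample covariances follows Step 1, and the helper names are ours.

```python
import numpy as np
from statistics import NormalDist

def delta_lower_bound(X1, X2, alpha_prime=0.01):
    """Sketch: the lower bound Delta_L designed from the samples.

    X1, X2 : p x n_i arrays (columns are observations).
    """
    n1, n2 = X1.shape[1], X2.shape[1]
    d = X1.mean(axis=1) - X2.mean(axis=1)
    # ||xbar_1 - xbar_2||^2 - sum_i tr(S_i)/n_i
    stat = float(d @ d) - np.trace(np.cov(X1)) / n1 - np.trace(np.cov(X2)) / n2

    def split_tr(X):
        # tr(S_{i(1)} S_{i(2)}) from the two half-sample covariances
        h = X.shape[1] // 2 + 1
        return float(np.sum(np.cov(X[:, :h]) * np.cov(X[:, h:]).T))

    var_hat = (2 * split_tr(X1) / (n1 * (n1 - 1))
               + 2 * split_tr(X2) / (n2 * (n2 - 1))
               + 4 * float(np.sum(np.cov(X1) * np.cov(X2).T)) / (n1 * n2))
    z = NormalDist().inv_cdf(1 - alpha_prime)   # z_{alpha'}
    return stat - z * max(var_hat, 0.0) ** 0.5
```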

According to (14), the total sample size for each Π*<sup>i</sup>* was given by

$$N\_1 = \max\left\{6, \left[\frac{(z\_\alpha + z\_\beta)^2\hat{\sigma}}{\Delta\_L^2}\, \text{tr}(\mathbf{S}\_{1m(1)}\mathbf{S}\_{1m(2)})^{1/4} \sum\_{j=1}^2 \text{tr}(\mathbf{S}\_{jm(1)}\mathbf{S}\_{jm(2)})^{1/4}\right] + 1\right\} = 10,$$

$$N\_2 = \max\left\{6, \left[\frac{(z\_\alpha + z\_\beta)^2\hat{\sigma}}{\Delta\_L^2}\, \text{tr}(\mathbf{S}\_{2m(1)}\mathbf{S}\_{2m(2)})^{1/4} \sum\_{j=1}^2 \text{tr}(\mathbf{S}\_{jm(1)}\mathbf{S}\_{jm(2)})^{1/4}\right] + 1\right\} = 6.$$

So, we took the next 4 (= $N\_1 - m$) samples from $\Pi\_1$. On the other hand, since $N\_2 = m$, we did not take additional samples from $\Pi\_2$. We had $\hat{\gamma} = (\text{tr}(\mathbf{S}\_{1N\_1} + \mathbf{S}\_{2N\_2})/(2p))^{-1}\Delta\_L(z\_\beta - z\_\alpha)/(z\_\alpha + z\_\beta) = 58.1$. Then, we constructed the discriminant rule given by (15) ensuring that both $1 - e(2|1) \ge 0.9$ and $1 - e(1|2) \ge 0.98$ when $\tilde{\Delta} \ge 1200$.

We compared the constructed discriminant rule with two other discriminant rules, DLDR and DQDR, that were given by Dudoit et al. (2002) as follows: Diagonal linear discriminant rule (DLDR) classifies *x*<sup>0</sup> into Π<sup>1</sup> if

$$(\mathfrak{x}\_0 - (\overline{\mathfrak{x}}\_{1N\_1} + \overline{\mathfrak{x}}\_{2N\_2})/2)^T \mathbf{S}\_{diag}^{-1} (\overline{\mathfrak{x}}\_{2N\_2} - \overline{\mathfrak{x}}\_{1N\_1}) < 0$$

and into $\Pi\_2$ otherwise, where $\mathbf{S}\_{diag} = \text{diag}(s\_{1N}, ..., s\_{pN})$ having $s\_{jN} = \sum\_{i=1}^2 \sum\_{l=1}^{N\_i} (x\_{ijl} - \overline{x}\_{ijN\_i})^2/(N\_1 + N\_2 - 2)$ and $\overline{x}\_{ijN\_i} = \sum\_{l=1}^{N\_i} x\_{ijl}/N\_i$. On the other hand, the diagonal quadratic discriminant rule (DQDR) classifies $\mathbf{x}\_0$ into $\Pi\_1$ if

$$(\mathbf{x}\_0 - \overline{\mathbf{x}}\_{1N\_1})^T\mathbf{S}\_{diag(1)}^{-1}(\mathbf{x}\_0 - \overline{\mathbf{x}}\_{1N\_1}) - (\mathbf{x}\_0 - \overline{\mathbf{x}}\_{2N\_2})^T\mathbf{S}\_{diag(2)}^{-1}(\mathbf{x}\_0 - \overline{\mathbf{x}}\_{2N\_2}) - \log\left\{\frac{\det(\mathbf{S}\_{diag(2)})}{\det(\mathbf{S}\_{diag(1)})}\right\} < 0$$

and into $\Pi\_2$ otherwise, where $\mathbf{S}\_{diag(i)} = \text{diag}(s\_{(i)1N\_i}, ..., s\_{(i)pN\_i})$ having $s\_{(i)jN\_i} = \sum\_{l=1}^{N\_i}(x\_{ijl} - \overline{x}\_{ijN\_i})^2/(N\_i - 1)$. We constructed the three discriminant rules by using the common data sets of sizes $(N\_1, N\_2) = (10, 6)$. In Table 2, we investigated their performances by using the remaining samples of sizes $(95 - N\_1, 33 - N\_2) = (85, 27)$ as test data sets. We observed that the discriminant rule given by (15) showed an adequate performance.
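For reference, the two diagonal rules of Dudoit et al. (2002) are straightforward to reproduce; a Python sketch under our own naming:

```python
import numpy as np

def dldr(x0, X1, X2):
    """Diagonal linear discriminant rule: 1 for Pi_1, 2 for Pi_2."""
    N1, N2 = X1.shape[1], X2.shape[1]
    xb1, xb2 = X1.mean(axis=1), X2.mean(axis=1)
    # pooled per-variable variances s_jN with denominator N_1 + N_2 - 2
    s = (((X1 - xb1[:, None]) ** 2).sum(axis=1)
         + ((X2 - xb2[:, None]) ** 2).sum(axis=1)) / (N1 + N2 - 2)
    w = (x0 - (xb1 + xb2) / 2) @ ((xb2 - xb1) / s)
    return 1 if w < 0 else 2

def dqdr(x0, X1, X2):
    """Diagonal quadratic discriminant rule: 1 for Pi_1, 2 for Pi_2."""
    q = []
    for X in (X1, X2):
        xb = X.mean(axis=1)
        s = ((X - xb[:, None]) ** 2).sum(axis=1) / (X.shape[1] - 1)
        # Mahalanobis-type distance under diag(S_i) plus its log-determinant
        q.append(float((((x0 - xb) ** 2) / s).sum() + np.log(s).sum()))
    return 1 if q[0] - q[1] < 0 else 2
```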




#### **5. Variable selection for high-dimension, low-sample-size data**

Suppose we have two independent, *p*-dimensional populations, $\Pi\_i$, $i = 1, 2$, having *unknown* mean vector $\mathbf{\mu}\_i = (\mu\_{i1}, ..., \mu\_{ip})^T$ and *unknown* positive-definite covariance matrix $\mathbf{\Sigma}\_i$ for each $i$. *We do not assume* $\mathbf{\Sigma}\_1 = \mathbf{\Sigma}\_2$. We consider an effective methodology to select a significant set of associated variables from among high-dimensional data sets. That is, we consider testing the following hypotheses simultaneously:

$$H\_{\mathbf{0}\mathbf{j}} : \mu\_{\mathbf{1}\mathbf{j}} = \mu\_{\mathbf{2}\mathbf{j}} \quad \text{vs.} \quad H\_{\mathbf{1}\mathbf{j}} : \mu\_{\mathbf{1}\mathbf{j}} \neq \mu\_{\mathbf{2}\mathbf{j}} \quad \text{for } \mathbf{j} = \mathbf{1}, \ldots, p. \tag{16}$$

Our interest is to select the set of significant variables $D = \{j : \mu\_{1j} \neq \mu\_{2j}\}$. Assume that $|D| = S$ for some $S \ge 1$, where $|D|$ denotes the number of elements in the set $D$. A variable selection procedure $\widehat{D}$ maps the data into subsets of $\{1, ..., p\}$. We are interested in designing $\widehat{D}$ ensuring that both the asymptotic *family-wise error rate* (FWER) is 0, i.e.,

$$P(|\mathbf{D}^c \cap \widehat{\mathbf{D}}| \neq 0) \to 0,\tag{17}$$

and the asymptotic *average power* (AP) is 1, i.e.,

$$\frac{|D \cap \widehat{D}|}{S} \xrightarrow{p} 1 \quad \text{when} \; \min\_{j \in D} |\mu\_{1j} - \mu\_{2j}|^2 > \delta,\tag{18}$$

where *δ* (> 0) is a prespecified constant. When *S* is bounded (< ∞), one can modify (18) by

$$P(\mathbf{D} \subseteq \hat{\mathbf{D}}) \to 1 \quad \text{when } \min\_{j \in \mathbf{D}} |\mu\_{1j} - \mu\_{2j}|^2 > \delta.$$

We note that the assertion (18) does not consider the case when $\min\_{j \in D} |\mu\_{1j} - \mu\_{2j}|^2 = \delta$.
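The two targets (17) and (18) reduce to simple set computations once the true set $D$ and a selection $\widehat{D}$ are in hand; a small sketch with our own helper names:

```python
def false_discovery_event(D, D_hat):
    """Indicator of the FWER event in (17): |D^c intersect D-hat| != 0."""
    return int(len(set(D_hat) - set(D)) != 0)

def average_power(D, D_hat):
    """The ratio |D intersect D-hat| / S appearing in (18), S = |D|."""
    return len(set(D) & set(D_hat)) / len(D)
```

Averaging `false_discovery_event` over simulated replications estimates the FWER; averaging `average_power` estimates the AP.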

#### **5.1 Variable selection procedure for HDLSS data**

Let $\sigma\_i = \max\_{1 \le j \le p} \sigma\_{(i)j}$ ($i = 1, 2$), where $\sigma\_{(i)j}$, $j = 1, ..., p$, are the diagonal elements of $\mathbf{\Sigma}\_i$. We assume that $\sigma\_{(i)j} < \infty$ for $i = 1, 2$; $j = 1, ..., p$, and $E\_\theta\{\exp(t|x\_{ijl} - \mu\_{ij}|/\sigma\_{(i)j}^{1/2})\} < \infty$, $i = 1, 2$; $j = 1, ..., p$, for some $t > 0$. Then, for testing the hypotheses (16), we take samples,


$\mathbf{x}\_{i1}, ..., \mathbf{x}\_{in\_i}$, of size

$$n\_i \ge \frac{(\log p)^{1+\zeta}}{\delta} \tag{19}$$

from each $\Pi\_i$ ($i = 1, 2$), where $\zeta \in (0, 1]$ is a chosen constant. Let $\mathbf{x}\_{il} = (x\_{i1l}, ..., x\_{ipl})^T$, $l = 1, ..., n\_i$. Calculate $T\_{j(\mathbf{n})} = \overline{x}\_{1jn\_1} - \overline{x}\_{2jn\_2}$ for $j = 1, ..., p$, where $\overline{x}\_{ijn\_i} = \sum\_{l=1}^{n\_i} x\_{ijl}/n\_i$ for each $\Pi\_i$. Then, test the hypothesis for $j = 1, ..., p$, by

$$\text{rejecting } H\_{0j} \Longleftrightarrow |T\_{j(\mathbf{n})}| > \sqrt{\delta}. \tag{20}$$

Let $\widehat{D} = \{j \mid \text{rejecting } H\_{0j}\}$. Then, from Theorem 5.1 given in Aoshima and Yata (2011), we can claim the following theorem.

**Theorem 5.1.** *The test given by (20) with (19) has as p* → ∞ *that*

$$\begin{aligned} &P(|D^c \cap \widehat{D}| \neq 0) \to 0;\\ &\frac{|D \cap \widehat{D}|}{S} \xrightarrow{p} 1 \quad \text{when} \,\,\, \min\_{j \in D} |\mu\_{1j} - \mu\_{2j}|^2 > \delta. \end{aligned} \tag{21}$$

One would be interested in a two-stage variable selection procedure so as to provide screening of variables in the first stage. We consider selecting a significant set of associated variables from among a set of candidate variables in the second stage. Aoshima and Yata (2011) proposed the following effective methodology:

#### **[Two-stage variable selection procedure]**

**(Step 1)** Choose a pilot sample size $m$ such that $m = O(\log p)$ and $m \to \infty$ as $p \to \infty$. Take pilot samples $\mathbf{x}\_{il}$, $l = 1, ..., m$, of size $m$ from each $\Pi\_i$ ($i = 1, 2$). Calculate $T\_{j(m)} = \overline{x}\_{1jm} - \overline{x}\_{2jm}$ for $j = 1, ..., p$, where $\overline{x}\_{ijm} = \sum\_{l=1}^m x\_{ijl}/m$ for each $\Pi\_i$. Then, provide screening of variables by

$$\tilde{\mathbf{D}} = \{ \mathbf{j} \mid |T\_{\mathbf{j}(m)}| > \sqrt{\delta} \}\tag{22}$$

for a set of candidate variables. Let $\tilde{S} = |\tilde{D}|$. Define the additional sample size for each $\Pi\_i$ by

$$N = \left[\frac{\max\{(\log \tilde{S})^{1+\xi}, (\log p)^{\varepsilon}\}}{\delta}\right] + 1,\tag{23}$$

where *ξ* ∈ (0, 1] and *ε* ∈ (0, 1] are chosen constants.

**(Step 2)** Regarding $j \in \tilde{D}$, take new samples $x\_{ijl}$, $l = m+1, ..., m+N$, of size $N$ from each $\Pi\_i$. Calculate $T\_{j(N)} = \overline{x}\_{1j(N)} - \overline{x}\_{2j(N)}$, where $\overline{x}\_{ij(N)} = \sum\_{l=m+1}^{m+N} x\_{ijl}/N$, $j \in \tilde{D}$, for each $\Pi\_i$. Then, test the hypothesis by

$$\text{rejecting } H\_{0j} \Longleftrightarrow |T\_{j(N)}| > \sqrt{\delta} \tag{24}$$

for $j \in \tilde{D}$, and define

$$\hat{D} = \{ j \in \tilde{D} \mid \text{rejecting } H\_{0j} \}. \tag{25}$$

Select the variables given by $\widehat{D}$.
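Steps 1-2 above can be sketched end to end. In the sketch below, `draw_more(i, idx, N)` is a caller-supplied sampler (an assumption of this sketch, not part of the original procedure) returning $N$ new observations of the variables `idx` from $\Pi\_i$, and natural logarithms are assumed in (23).

```python
import math
import numpy as np

def two_stage_select(pilot1, pilot2, draw_more, delta, xi=1.0, eps=1.0):
    """Sketch of the two-stage variable selection procedure (22)-(25)."""
    p, m = pilot1.shape
    # Step 1: screening (22); |T_j(m)| > sqrt(delta) <=> T_j(m)^2 > delta
    T = pilot1.mean(axis=1) - pilot2.mean(axis=1)
    cand = np.flatnonzero(T ** 2 > delta)              # D-tilde
    if cand.size == 0:
        return cand
    # additional sample size (23)
    N = int(max(math.log(cand.size) ** (1 + xi),
                math.log(p) ** eps) / delta) + 1
    # Step 2: re-test (24) on the new samples only
    T2 = (draw_more(1, cand, N).mean(axis=1)
          - draw_more(2, cand, N).mean(axis=1))
    return cand[T2 ** 2 > delta]                       # D-hat (25)
```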

From Theorem 5.2 in Aoshima and Yata (2011), we can claim the following theorem.

**Theorem 5.2.** *The two-stage variable selection procedure, (22) and (25), given by (24) with (23) has (21) as p* → ∞*.*

#### **5.2 Demonstration**


We analyzed the prostate cancer gene expression data given by Singh et al. (2002). The data were pre-processed as described by Jeffery et al. (2006). The data set consisted of 12600 (= *p*) genes and two groups, $\Pi\_1$: Normal Prostate (50 samples) and $\Pi\_2$: Prostate Tumors (52 samples).

#### **5.2.1 Variable selection procedure**

We set $\delta = 1.5$. Our goal was to find variables $j$ such that $|\mu\_{1j} - \mu\_{2j}|^2 > 1.5$. We chose the pilot sample size for each $\Pi\_i$ as $m = 18$ ($= O(\log p)$). Then, we took the first 18 samples from each $\Pi\_i$ as pilot samples, which are given in Table 3.


| *j* | $\Pi\_1$: Normal Prostate (*l* = 1, ..., 18) | $\Pi\_2$: Prostate Tumors (*l* = 1, ..., 18) |
|---|---|---|
| 1 | 6.776 ··· 7.017 | 6.888 ··· 6.905 |
| ⋮ | ⋮ | ⋮ |
| 12600 | 3.050 ··· 3.612 | 3.097 ··· 3.549 |

Table 3. Pilot samples, $x\_{ijl}$ ($p = 12600$, $m = 18$)

We considered screening variables by $\tilde{D} = \{j \mid |\overline{x}\_{1jm} - \overline{x}\_{2jm}|^2 > 1.5\}$. Then, we obtained a set of candidate variables as $\tilde{D} = \{192, 198, 200, ..., 12153, 12156, 12432\}$ with $\tilde{S} = |\tilde{D}| = 160$. We set $(\xi, \varepsilon) = (1.0, 1.0)$. According to (23), the additional sample size for each $\Pi\_i$ was given by

$$N = \left[\frac{\max\{(\log \tilde{S})^{1+\xi}, (\log p)^{\varepsilon}\}}{\delta}\right] + 1 = 18.$$
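As a quick check of this arithmetic (assuming natural logarithms, which is what reproduces the value in the text):

```python
import math

p, S_tilde, delta, xi, eps = 12600, 160, 1.5, 1.0, 1.0
# eq. (23): [max{(log S~)^(1+xi), (log p)^eps} / delta] + 1
N = int(max(math.log(S_tilde) ** (1 + xi), math.log(p) ** eps) / delta) + 1
print(N)  # -> 18
```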

Regarding $j \in \tilde{D}$, we took additional samples $x\_{ijl}$, $l = m+1, ..., m+N$, of size $N = 18$ from each $\Pi\_i$, which are given in Table 4.


| *j* | $\Pi\_1$: Normal Prostate (*l* = 19, ..., 36) | $\Pi\_2$: Prostate Tumors (*l* = 19, ..., 36) |
|---|---|---|
| 192 | 9.859 ··· 8.973 | 9.338 ··· 10.212 |
| 198 | 8.622 ··· 7.077 | 6.120 ··· 7.724 |
| ⋮ | ⋮ | ⋮ |
| 12432 | 9.884 ··· 9.091 | 8.00 ··· 9.388 |

Table 4. Additional samples, $x\_{ijl}$, $j \in \tilde{D}$ ($\tilde{S} = 160$, $N = 18$)

We selected significant variables by $\widehat{D} = \{j \in \tilde{D} \mid \text{rejecting } H\_{0j}\} = \{j \in \tilde{D} \mid |\overline{x}\_{1j(N)} - \overline{x}\_{2j(N)}|^2 > 1.5\}$ and finally obtained

$$
\hat{D} = \{556, 7412, 8662, 11552\} \tag{26}
$$

the classification after variable selection if Δ is not large enough to claim the condition of Theorem 4.1 or to claim the assertion in Theorem 4.4 within the available sample size.

Effective Methodologies for Statistical Inference on Microarray Studies 31

Research of the first author was partially supported by Grant-in-Aid for Scientific Research (B) and Challenging Exploratory Research, Japan Society for the Promotion of Science (JSPS), under Contract Numbers 22300094 and 23650142. Research of the second author was partially supported by Grant-in-Aid for Young Scientists (B), JSPS, under Contract Number 23740066.


with *S* = |*D̂*| = 4. For *j* ∈ *D̂*, we calculated *x̄*<sub>*ijm*+*N*</sub> = ∑<sub>*l*=1</sub><sup>*m*+*N*</sup> *x*<sub>*ijl*</sub>/(*m* + *N*) for each Π<sub>*i*</sub> and obtained estimates of *μ*<sub>1*j*</sub> − *μ*<sub>2*j*</sub> for *j* ∈ *D̂* as

$$\{\bar{x}_{1jm+N} - \bar{x}_{2jm+N} \mid j \in \widehat{D}\} = \{-1.511, -1.472, -1.79, -2.148\}.$$

The required sample size in the two-stage variable selection procedure was *m* + *N* = 36 for each Π<sub>*i*</sub>. On the other hand, the required sample size in the single-stage variable selection procedure given by (20) with (19) was *n*<sub>*i*</sub> ≥ (log *p*)<sup>1+*ζ*</sup>/*δ* = 59.43 with *ζ* = 1.0. The two-stage variable selection procedure therefore allows the experimenter to reduce the cost of sampling in the second stage.
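As a concrete illustration of the second-stage selection rule and the sample-size comparison above, here is a minimal Python sketch on synthetic data. The effect size, noise level, and simulated gene profiles are hypothetical stand-ins for the real microarray data; only the dimension *p* = 12600, the sample sizes *m* + *N* = 36, the threshold *δ* = 1.5, *ζ* = 1.0, and the four selected gene indices come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 12600          # number of variables (genes) in the prostate data set
m, N = 18, 18      # first-stage and additional samples (m + N = 36)
delta = 1.5        # threshold on the squared mean difference
zeta = 1.0         # tuning constant from (19)

# Hypothetical synthetic data for the two classes Pi_1, Pi_2: the four
# genes reported in (26) get a mean shift, all others are pure noise.
informative = [556, 7412, 8662, 11552]
mu1, mu2 = np.zeros(p), np.zeros(p)
mu2[informative] = 2.0
x1 = rng.normal(mu1, 0.3, size=(m + N, p))
x2 = rng.normal(mu2, 0.3, size=(m + N, p))

# Selection rule: keep gene j when the squared difference of the class
# sample means exceeds the threshold delta.
diff = x1.mean(axis=0) - x2.mean(axis=0)
D_hat = np.flatnonzero(diff**2 > delta)

# Required sample size of the single-stage procedure: (log p)^(1+zeta)/delta.
n_single = np.log(p) ** (1 + zeta) / delta
print(sorted(D_hat.tolist()), round(n_single, 2))  # n_single ≈ 59.43
```

With 36 samples per class the noise genes sit many standard errors below the threshold, so only the shifted genes survive; the single-stage bound of roughly 59 samples per class shows the saving the two-stage design buys.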

#### **5.2.2 Classification after variable selection**

In Section 4, we considered a two-stage discriminant procedure in HDLSS data situations. In some cases the experimenter will encounter the situation that the required sample size, *N*<sub>*i*</sub>, is much larger than the available sample size; this happens if Δ̃ = ||*μ*<sub>1</sub> − *μ*<sub>2</sub>||<sup>2</sup> + (tr(**Σ**<sub>1</sub>) − tr(**Σ**<sub>2</sub>))<sup>2</sup>/max<sub>*i*=1,2</sub>{2tr(**Σ**<sub>*i*</sub>)} is much smaller than the tr(**Σ**<sub>*i*</sub><sup>2</sup>)'s. In that case, we recommend that the experimenter consider the classification based only on the selected variables. We took the set of significant variables *D̂* = {556, 7412, 8662, 11552} given in (26) and set *n*<sub>1</sub> = *n*<sub>2</sub> = *m* + *N* = 36, where *m* and *N* were given in Section 5.2.1. Let us write the selected 4-variable data as *x*<sub>*il*</sub> = (*x*<sub>*i*556*l*</sub>, *x*<sub>*i*7412*l*</sub>, *x*<sub>*i*8662*l*</sub>, *x*<sub>*i*11552*l*</sub>)<sup>*T*</sup>, *i* = 1, 2, for the *l*-th sample. Then, we considered a typical quadratic discriminant rule that classifies *x*<sub>0</sub> into Π<sub>1</sub> if

$$(\mathbf{x}_0 - \bar{\mathbf{x}}_{1n_1})^T \mathbf{S}_{1n_1}^{-1} (\mathbf{x}_0 - \bar{\mathbf{x}}_{1n_1}) - \log\left\{ \frac{\det(\mathbf{S}_{2n_2})}{\det(\mathbf{S}_{1n_1})} \right\} < (\mathbf{x}_0 - \bar{\mathbf{x}}_{2n_2})^T \mathbf{S}_{2n_2}^{-1} (\mathbf{x}_0 - \bar{\mathbf{x}}_{2n_2}), \tag{27}$$

and into Π<sub>2</sub> otherwise, where *x*<sub>0</sub> is an observation vector of the 4 selected variables on an individual belonging to Π<sub>1</sub> or Π<sub>2</sub>, *x̄*<sub>*ini*</sub> = ∑<sub>*l*=1</sub><sup>*ni*</sup> *x*<sub>*il*</sub>/*n*<sub>*i*</sub>, and *S*<sub>*ini*</sub> = ∑<sub>*l*=1</sub><sup>*ni*</sup> (*x*<sub>*il*</sub> − *x̄*<sub>*ini*</sub>)(*x*<sub>*il*</sub> − *x̄*<sub>*ini*</sub>)<sup>*T*</sup>/(*n*<sub>*i*</sub> − 1), *i* = 1, 2.
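Rule (27) is quadratic discriminant analysis with plug-in sample means and covariances. The following minimal sketch implements it with NumPy; the Gaussian training data are hypothetical stand-ins for the four selected genes, not the actual prostate samples.

```python
import numpy as np

def qda_classify(x0, xbar1, S1, xbar2, S2):
    # Rule (27): classify x0 into Pi_1 when
    #   (x0 - xbar1)^T S1^{-1} (x0 - xbar1) - log{det(S2)/det(S1)}
    #     < (x0 - xbar2)^T S2^{-1} (x0 - xbar2),
    # and into Pi_2 otherwise.
    d1, d2 = x0 - xbar1, x0 - xbar2
    q1 = d1 @ np.linalg.solve(S1, d1)
    q2 = d2 @ np.linalg.solve(S2, d2)
    log_det_ratio = np.log(np.linalg.det(S2) / np.linalg.det(S1))
    return 1 if q1 - log_det_ratio < q2 else 2

# Hypothetical 4-variable training data, n_1 = n_2 = 36 as in the text.
rng = np.random.default_rng(1)
n1 = n2 = 36
X1 = rng.normal(0.0, 1.0, size=(n1, 4))
X2 = rng.normal(2.0, 1.0, size=(n2, 4))

# Sample means and unbiased sample covariances (divisor n_i - 1).
xbar1, S1 = X1.mean(axis=0), np.cov(X1, rowvar=False)
xbar2, S2 = X2.mean(axis=0), np.cov(X2, rowvar=False)

print(qda_classify(np.zeros(4), xbar1, S1, xbar2, S2))      # near mu_1 -> 1
print(qda_classify(np.full(4, 2.0), xbar1, S1, xbar2, S2))  # near mu_2 -> 2
```

Using `np.linalg.solve` instead of explicitly inverting each *S*<sub>*ini*</sub> is numerically preferable; with only 4 selected variables and 36 samples per class the covariance estimates are well conditioned, which is exactly why classification after variable selection is feasible here.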

We compared the discriminant rule given by (27) after variable selection with those given by (10) with *γ* = 0, DLDR and DQDR. Note that the three competitors were constructed by using the original (12600-variable) data without variable selection. In Table 5, we investigated those performances by using test data sets consisting of 50 − *n*<sub>1</sub> = 14 remaining samples from Π<sub>1</sub> and 52 − *n*<sub>2</sub> = 16 remaining samples from Π<sub>2</sub>. We observed that the discriminant rule given by (10) with *γ* = 0 showed a bad performance for *x*<sub>0</sub> classified into Π<sub>1</sub>: Normal Prostate. We inspected the condition of Theorem 4.1 for the data sets and found that max<sub>*i*=1,2</sub>{tr(*S*<sub>*ini*(1)</sub>*S*<sub>*ini*(2)</sub>)}/(Δ̃<sup>2</sup> min<sub>*i*=1,2</sub>{*n*<sub>*i*</sub>}) = 0.15 according to Remark 4.4, so that max<sub>*i*=1,2</sub>{tr(**Σ**<sub>*i*</sub><sup>2</sup>)}/(Δ<sup>2</sup> min<sub>*i*=1,2</sub>{*n*<sub>*i*</sub>}) seems not to be sufficiently small. This may be a reason why Theorem 4.1 is not applicable to the present data sets. On the other hand, we observed that the discriminant rule given by (27) after variable selection showed a good performance when compared with the competitors. We recommend that the experimenter consider the classification after variable selection if Δ is not large enough to claim the condition of Theorem 4.1 or the assertion in Theorem 4.4 within the available sample size.

| | (27) after variable selection | (10) with *γ* = 0 | DLDR | DQDR |
|---|---|---|---|---|
| 1-*e*(2\|1) | 10/14 (= 0.714) | 4/14 (= 0.286) | 4/14 (= 0.286) | 4/14 (= 0.286) |
| 1-*e*(1\|2) | 15/16 (= 0.938) | 15/16 (= 0.938) | 15/16 (= 0.938) | 15/16 (= 0.938) |

Table 5. The correct discrimination rates by (27) after variable selection, (10) with *γ* = 0, DLDR and DQDR for test data sets consisting of 14 samples from Π<sub>1</sub> and 16 samples from Π<sub>2</sub>.
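The quantities 1 − *e*(2|1) and 1 − *e*(1|2) in Table 5 are simply per-class correct classification rates on the held-out samples. A small sketch makes the bookkeeping explicit; the prediction vector below just re-encodes the "(27) after variable selection" column of Table 5, it is not the real output of rule (27).

```python
import numpy as np

def correct_rates(predictions, labels):
    # 1 - e(2|1): fraction of Pi_1 test samples classified into Pi_1;
    # 1 - e(1|2): fraction of Pi_2 test samples classified into Pi_2.
    predictions, labels = np.asarray(predictions), np.asarray(labels)
    rate1 = np.mean(predictions[labels == 1] == 1)
    rate2 = np.mean(predictions[labels == 2] == 2)
    return rate1, rate2

# Re-encoding of Table 5, rule (27): 10 of 14 Pi_1 test samples and
# 15 of 16 Pi_2 test samples were classified correctly.
preds  = [1] * 10 + [2] * 4 + [2] * 15 + [1] * 1
labels = [1] * 14 + [2] * 16
r1, r2 = correct_rates(preds, labels)
print(round(r1, 3), round(r2, 3))  # 0.714 0.938
```

Reporting the two rates separately, rather than a single accuracy, is what exposes the failure of the competitors on the Π<sub>1</sub> (Normal Prostate) class in Table 5.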

#### **6. Acknowledgments**

Research of the first author was partially supported by Grant-in-Aid for Scientific Research (B) and Challenging Exploratory Research, Japan Society for the Promotion of Science (JSPS), under Contract Numbers 22300094 and 23650142. Research of the second author was partially supported by Grant-in-Aid for Young Scientists (B), JSPS, under Contract Number 23740066.




**3**

**Prostate Cancer, the Long Search for Etiologic and Therapeutic Factors: Dietary Supplementation Avoiding Invasive Treatment**

Thomas Tallberg<sup>1</sup> and Faik Atroshi<sup>2</sup>
*<sup>1</sup>The Institute for Bio-Immunotherapy, Helsinki*
*<sup>2</sup>Pharmacology & Toxicology, ELTDK, University of Helsinki*
*Finland*

## **1. Introduction**

The lack of a comprehensive aetiology for prostate cancer (CaP), together with the need for an effective and inexpensive biological treatment modality devoid of side effects, has resulted in a multitude of therapeutic trials. None of these has been very satisfying, and they have varied from focal to invasive therapies for CaP. Progress has been delayed and hampered by the lack of any thorough effort to elucidate the cause of the disease; such efforts would have speeded up the introduction of more rational therapy modalities. The differing incidence of CaP across populations aroused our interest in proceeding in a more physiologic way, empirically testing different functional food factors, since they are non-toxic, although such an approach is time-consuming. It was ethical to test the effect of these natural alimentary components and to follow the patients' reactions and laboratory responses.

Huggins and Hodges<sup>1</sup> had already, in 1941, decisively proven that CaP was a hormone-dependent disease, although castration alone was not curative. In 1945, Huggins and Scott performed bilateral adrenalectomies<sup>2</sup> after the glands were found to produce DHEA, which could be transformed into testosterone and was regarded as having caused recurrent disease. All patients died within a short time postoperatively. However, Huggins did not recognize that this demonstrated the central position of the adrenals in regulating prostate cancer.

In this chapter we review some cases of treated prostate cancer patients. For these we especially follow the shifts in FSH, LH, PRL, DHEA, DHEAS, SHBG, and PSA. The involvement of the adrenal glands became evident with an orchiectomized prostate cancer patient with an excessively high FSH of 120 IU/L and immeasurable PSA. MRI (magnetic resonance imaging) showed no pituitary adenoma that could explain this high level, but both his adrenal glands were evenly hypertrophic. Upon laparotomy, the enlargement was found to be due to bilateral zona reticularis (ZR) cell proliferation, while the adrenals' other cell structures were normal. These bluish cells with strong green fluorescence had, via the hypothalamus, stimulated the pituitary to greatly increase FSH and LH; see Case A below, with normal laboratory ranges displayed (the row at the top in all cases).

The importance of adrenal ZR cells is unveiled by: A) their marked proliferation after orchiectomy; B) the lack of ZR cells in castrated male pigs [eunuchs]; C) the markedly decreased number of ZR cells in patients succumbing to CaP; D) the hormonal effect of extracts made from ZR cells of castrated boars; E) the rapid lethal outcome of CaP patients after adrenalectomy.

