Cancer Patent Abstract
A method of providing a prognosis of breast cancer is conducted
by analyzing the expression of a group of genes. Gene expression
profiles in a variety of media such as microarrays are included,
as are kits that contain them.
Cancer Patent Claims
I claim:
1. A method of predicting breast cancer recurrence comprising identifying
differential modulation of each gene in a combination of genes selected
for their ability to predict breast cancer recurrence wherein the
genes comprise Seq. ID No. 1-56.
Cancer Patent Description
BACKGROUND
This invention relates to prognostics for breast cancer based on
the gene expression profiles of biological samples.
In breast cancers, prognosis is determined primarily by the presence
or absence of metastases in draining axillary lymph nodes. However,
in approximately one third of women with breast cancer who have
negative lymph nodes, the disease recurs and about one third of
patients with positive lymph nodes are free of disease ten years
after local or regional therapy. Furthermore, an increasing proportion
of breast cancers are being diagnosed at an early stage because
of increased awareness and wider use of screening modalities. Universal
application of systemic therapy to these patients often leads
to over-treatment. According to the St. Gallen and NIH consensus,
70-80% of the Stage I and II patients would not have developed distant
metastases without adjuvant treatment and may potentially suffer
from the side effects. These data highlight the need for more sensitive
and specific prognostic assays that could significantly reduce the
number of patients that receive unnecessary treatment.
Tumor size and lymphatic or vascular invasion have been found to
be of significant prognostic value in several studies. Quantitative
pathological features, i.e. nuclear morphology, DNA content and
proliferative activity may further demarcate tumors that have a
high chance of micrometastases. Known molecular genetic changes
that affect patient outcome include Her2/NEU over-expression, DNA
amplifications, p53 mutations, ER/PR status, uPA and PAI expression.
Because the metastatic cascade is a complex process that includes
multiple steps, single factors that contribute to tumor process
have limitations for prognostic assessment. The gene expression
profiles of this invention will provide increased prognostic power.
SUMMARY OF THE INVENTION
The invention is a method of assessing the likelihood of a recurrence
or metastasis of breast cancer in a patient diagnosed with or treated
for breast cancer. The method involves the analysis of a gene expression
profile.
In one aspect of the invention, the gene expression profile includes
56 genes. In yet other aspects of the invention, the profiles comprise
those of at least 45 genes, 26 genes, 13 genes, and 6 genes respectively.
Articles used in practicing the methods are also an aspect of the
invention. Such articles include gene expression profiles or representations
of them that are fixed in machine-readable media such as computer
readable media.
Articles used to identify gene expression profiles can also include
substrates or surfaces (such as microarrays) to capture and/or indicate
the presence, absence, or degree of gene expression.
In yet another aspect of the invention, kits include reagents for
conducting the gene expression analysis prognostic of breast cancer
recurrence or metastasis.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a standard Kaplan-Meier Plot constructed from the patient
data as a training set as described in the Examples. Two classes
of patients are indicated as predicted by the chip data of the 56-gene
panel. The vertical axis shows the probability of disease-free survival
among patients in each class.
FIG. 2 is a standard Kaplan-Meier Plot constructed from the patient
data as a testing set as described in the Examples. Two classes
of patients are indicated as predicted by the chip data of the 56-gene
panel. The vertical axis shows the probability of disease-free survival
among patients in each class.
FIG. 3 is a standard Kaplan-Meier Plot constructed from the patient
data of 54 patients (training and testing data combined) using a
56-gene expression profile. Two classes of patients are indicated
as predicted by the chip data of the 56-gene panel. The vertical
axis shows the probability of disease-free survival among patients
in each class.
DETAILED DESCRIPTION
The mere presence or absence of particular nucleic acid sequences
in a tissue sample has only rarely been found to have diagnostic
or prognostic value. Information about the expression of various
proteins, peptides or mRNA, on the other hand, is increasingly viewed
as important. The mere presence of nucleic acid sequences having
the potential to express proteins, peptides, or mRNA (such sequences
referred to as "genes") within the genome by itself is
not determinative of whether a protein, peptide, or mRNA is expressed
in a given cell. Whether or not a given gene capable of expressing
proteins, peptides, or mRNA does so and to what extent such expression
occurs, if at all, is determined by a variety of complex factors.
Irrespective of difficulties in understanding and assessing these
factors, assaying gene expression can provide useful information
about the occurrence of important events such as tumorigenesis,
metastasis, apoptosis, and other clinically relevant phenomena.
Relative indications of the degree to which genes are active or
inactive can be found in gene expression profiles. The gene expression
profiles of this invention are used to provide a prognosis and treat
patients for breast cancer.
Sample preparation requires the collection of patient samples.
Patient samples used in the inventive method are those that are
suspected of containing diseased cells such as epithelial cells
taken from a breast or lymph node sample or from surgical margins.
One useful technique for obtaining suspect samples is Laser Capture
Microdissection (LCM). LCM technology provides a way to select the
cells to be studied, minimizing variability caused by cell type
heterogeneity. Consequently, moderate or small changes in gene expression
between normal and cancerous cells can be readily detected. In a
preferred method, the samples comprise circulating epithelial cells
extracted from peripheral blood. These can be obtained according
to a number of methods but the most preferred method is the magnetic
separation technique described in U.S. Pat. No. 6,136,182 assigned
to Immunivest Corp which is incorporated herein by reference. Once
the sample containing the cells of interest has been obtained, RNA
is extracted and amplified and a gene expression profile is obtained,
preferably via micro-array, for genes in the appropriate portfolios.
Preferred methods for establishing gene expression profiles include
determining the amount of RNA that is produced by a gene that can
code for a protein or peptide. This is accomplished by reverse transcriptase
PCR (RT-PCR), competitive RT-PCR, real time RT-PCR, differential
display RT-PCR, Northern Blot analysis and other related tests.
While it is possible to conduct these techniques using individual
PCR reactions, it is best to amplify complementary DNA (cDNA) or
complementary RNA (cRNA) produced from mRNA and analyze it via microarray.
A number of different array configurations and methods for their
production are known to those of skill in the art and are described
in U.S. patents such as: U.S. Pat. Nos. 5,445,934; 5,532,128; 5,556,752;
5,242,974; 5,384,261; 5,405,783; 5,412,087; 5,424,186; 5,429,807;
5,436,327; 5,472,672; 5,527,681; 5,529,756; 5,545,531; 5,554,501;
5,561,071; 5,571,639; 5,593,839; 5,599,695; 5,624,711; 5,658,734;
and 5,700,637; the disclosures of which are incorporated herein
by reference.
Microarray technology allows for the measurement of the steady-state
mRNA level of thousands of genes simultaneously thereby presenting
a powerful tool for identifying effects such as the onset, arrest,
or modulation of uncontrolled cell proliferation. Two microarray
technologies are currently in wide use. The first are cDNA arrays
and the second are oligonucleotide arrays. Although differences
exist in the construction of these chips, essentially all downstream
data analysis and output are the same. The products of these analyses
are typically measurements of the intensity of the signal received
from a labeled probe used to detect a cDNA sequence from the sample
that hybridizes to a nucleic acid sequence at a known location on
the microarray. Typically, the intensity of the signal is proportional
to the quantity of cDNA, and thus mRNA, expressed in the sample
cells. A large number of such techniques are available and useful.
Preferred methods for determining gene expression can be found in
U.S. Pat. No. 6,271,002 to Linsley, et al.; U.S. Pat. No. 6,218,122
to Friend, et al.; U.S. Pat. No. 6,218,114 to Peck, et al.; and
U.S. Pat. No. 6,004,755 to Wang, et al., the disclosure of each
of which is incorporated herein by reference.
Analysis of the expression levels is conducted by comparing such
intensities. This is best done by generating a ratio matrix of the
expression intensities of genes in a test sample versus those in
a control sample. For instance, the gene expression intensities
from a diseased tissue can be compared with the expression intensities
generated from normal tissue of the same type (e.g., diseased breast
tissue sample vs. normal breast tissue sample). A ratio of these
expression intensities indicates the fold-change in gene expression
between the test and control samples.
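As a minimal illustration of the ratio comparison described above, the following Python sketch computes a per-gene ratio of test to control intensity together with its base-2 logarithm. The gene names and intensity values are invented for illustration and are not data from this specification; the function name `ratio_matrix` and the choice of a base-2 logarithm are likewise assumptions.

```python
import math

def ratio_matrix(test, control):
    """Return {gene: (ratio, log2_ratio)} for genes measured in both samples.

    `test` and `control` map gene identifiers to positive intensity values.
    """
    result = {}
    for gene, test_intensity in test.items():
        control_intensity = control.get(gene)
        # A ratio is undefined for missing or non-positive signals, so skip them.
        if control_intensity is None or control_intensity <= 0 or test_intensity <= 0:
            continue
        ratio = test_intensity / control_intensity
        result[gene] = (ratio, math.log2(ratio))
    return result

# Hypothetical intensities in arbitrary units (not patent data).
diseased = {"GENE_A": 800.0, "GENE_B": 150.0, "GENE_C": 400.0}
normal = {"GENE_A": 200.0, "GENE_B": 600.0, "GENE_C": 400.0}
ratios = ratio_matrix(diseased, normal)
```

In this sketch a ratio above one (positive logarithm) corresponds to higher expression in the test sample, and a ratio below one (negative logarithm) to lower expression.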
Gene expression profiles can also be displayed in a number of ways.
The most common method is to arrange raw fluorescence intensities
or a ratio matrix into a graphical dendrogram where columns indicate
test samples and rows indicate genes. The data is arranged so genes
that have similar expression profiles are proximal to each other.
The expression ratio for each gene is visualized as a color. For
example, a ratio less than one (indicating down-regulation) may
appear in the blue portion of the spectrum while a ratio greater
than one (indicating up-regulation) may appear as a color in the
red portion of the spectrum. Commercially available computer software
programs are available to display such data including "GENESPRINT"
from Silicon Genetics, Inc. and "DISCOVERY" and "INFER"
software from Partek, Inc.
Modulated genes used in the methods of the invention are described
in the Examples. The genes that are differentially expressed are
either up regulated or down regulated in patients with a relapse
of breast cancer relative to those without a relapse. Up regulation
and down regulation are relative terms meaning that a detectable
difference (beyond the contribution of noise in the system used
to measure it) is found in the amount of expression of the genes
relative to some baseline. In this case, the baseline is the measured
gene expression of a non-relapsing patient. The genes of interest
in the diseased cells (from the relapsing patients) are then either
up regulated or down regulated relative to the baseline level using
the same measurement method. Diseased, in this context, refers to
an alteration of the state of a body that interrupts or disturbs,
or has the potential to disturb, proper performance of bodily functions
as occurs with the uncontrolled proliferation of cells. Someone
is diagnosed with a disease when some aspect of that person's genotype
or phenotype is consistent with the presence of the disease. However,
the act of conducting a diagnosis or prognosis includes the determination
of disease/status issues such as determining the likelihood of relapse
or metastasis and therapy monitoring. In therapy monitoring, clinical
judgments are made regarding the effect of a given course of therapy
by comparing the expression of genes over time to determine whether
the gene expression profiles have changed or are changing to patterns
more consistent with normal tissue.
Preferably, levels of up and down regulation are distinguished
based on fold changes of the intensity measurements of hybridized
microarray probes. A 2.0 fold difference is preferred for making
such distinctions or a p-value less than 0.05. That is, before a
gene is said to be differentially expressed in diseased/relapsing
versus normal/non-relapsing cells, the diseased cell is found to
yield at least 2 times more, or 2 times less intensity than the
normal cells. The greater the fold difference, the more preferred
is use of the gene as a diagnostic or prognostic tool. Genes selected
for the gene expression profiles of the instant invention have expression
levels that result in the generation of a signal that is distinguishable
from those of the normal or non-modulated genes by an amount that
exceeds background using clinical laboratory instrumentation.
Statistical values can be used to confidently distinguish modulated
from non-modulated genes and noise. Statistical tests find the genes
most significantly different between diverse groups of samples.
The Student's t-test is an example of a robust statistical test
that can be used to find significant differences between two groups.
The lower the p-value, the more compelling the evidence that the
gene is showing a difference between the different groups. Nevertheless,
since microarrays measure more than one gene at a time, tens of
thousands of statistical tests may be performed at one time. Because
of this, one is likely to see small p-values just by chance, and
adjustments for this using a Sidak correction as well as a randomization/permutation
experiment can be made. A p-value less than 0.05 by the t-test is
evidence that the gene is significantly different. More compelling
evidence is a p-value less than 0.05 after the Sidak correction
is factored in. For a large number of samples in each group, a p-value
less than 0.05 after the randomization/permutation test is the most
compelling evidence of a significant difference.
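The two adjustments named above can be sketched in Python as follows. This is an illustrative sketch, not the procedure used in the Examples: the Sidak formula shown assumes independent tests, the permutation test uses the difference in group means as its statistic, and the add-one correction, function names, and sample values are all assumptions made for the example.

```python
import random
import statistics

def sidak_adjust(p_value, n_tests):
    """Sidak-adjusted p-value for one test among n_tests independent tests."""
    return 1.0 - (1.0 - p_value) ** n_tests

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign samples to the two groups.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical expression values for one gene in two patient groups.
relapse = [11.0, 12.0, 13.0, 14.0, 15.0]
no_relapse = [1.0, 2.0, 3.0, 4.0, 5.0]
p_perm = permutation_p_value(relapse, no_relapse)

# A raw p-value of 1e-6 adjusted for 20,000 simultaneous tests.
p_sidak = sidak_adjust(1e-6, 20000)
```

A gene would then be retained under the criteria above when the adjusted or permutation p-value stays below 0.05.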
Another parameter that can be used to select genes that generate
a signal that is greater than that of the non-modulated gene or
noise is the use of a measurement of absolute signal difference.
Preferably, the signal generated by the modulated gene expression
is at least 20% different than those of the normal or non-modulated
gene (on an absolute basis). It is even more preferred that such
genes produce expression patterns that are at least 30% different
than those of normal or non-modulated genes.
Genes can be grouped so that information obtained about the set
of genes in the group provides a sound basis for making a clinically
relevant judgment such as a diagnosis, prognosis, or treatment choice.
These sets of genes make up the portfolios of the invention. In
this case, the judgments supported by the portfolios involve breast
cancer. As with most diagnostic markers, it is often desirable to
use the fewest number of markers sufficient to make a correct medical
judgment. This prevents a delay in treatment pending further analysis
as well as inappropriate use of time and resources.
Preferably, portfolios are established such that the combination
of genes in the portfolio exhibits improved sensitivity and specificity
relative to individual genes or randomly selected combinations of
genes. In the context of the instant invention, the sensitivity
of the portfolio can be reflected in the fold differences exhibited
by a gene's expression in the diseased state relative to the normal
state. Specificity can be reflected in statistical measurements
of the correlation of the signaling of gene expression with the
condition of interest. For example, standard deviation can be
used as such a measurement. In considering a group of genes for
inclusion in a portfolio, a small standard deviation in expression
measurements correlates with greater specificity. Other measurements
of variation such as correlation coefficients can also be used in
this capacity.
A preferred method of establishing gene expression portfolios is
through the use of optimization algorithms such as the mean variance
algorithm widely used in establishing stock portfolios. This method
is described in detail in the patent application entitled "Selection
of Markers" by Tim Jatkoe, et al., filed on Mar. 21, 2003
(application Ser. No. 10/394,087, incorporated herein by reference).
Essentially, the method calls for the establishment of a set of
inputs (stocks in financial applications, expression as measured
by intensity here) that will optimize the return (e.g., signal that
is generated) one receives for using it while minimizing the variability
of the return. Many commercial software programs are available to
conduct such operations. "Wagner Associates Mean-Variance Optimization
Application", referred to as "Wagner Software" throughout
this specification, is preferred. This software uses functions from
the "Wagner Associates Mean-Variance Optimization Library"
to determine an efficient frontier and optimal portfolios.
Use of this type of software requires that microarray data (i.e.
intensity measurements) be transformed so that it can be treated
as an input in the way stock return and risk measurements are used
when the software is used for its intended financial analysis purposes.
The process of portfolio selection and characterization of an unknown
is summarized as follows:
1. Choose baseline class.
2. Calculate mean and standard deviation of each gene for baseline
class samples.
3. Calculate (X*Standard Deviation + Mean) for each gene. This is the
baseline reading from which all other samples will be compared. X is
a stringency variable with higher values of X being more stringent
than lower.
4. Calculate ratio between each Experimental sample versus baseline
reading calculated in step 3.
5. Transform ratios such that ratios less than 1 are negative (e.g.,
using Log base 10). (Down regulated genes now correctly have negative
values necessary for MV optimization).
6. These transformed ratios are used as inputs in place of the asset
returns that are normally used in the software application.
7. The software will plot the efficient frontier and return an optimized
portfolio at any point along the efficient frontier.
8. Choose a desired return or variance on the efficient frontier.
9. Calculate the Portfolio's Value for each sample by summing the
multiples of each gene's intensity value by the weight generated by
the portfolio selection algorithm.
10. Calculate a boundary value by adding the mean Portfolio Value for
Baseline groups to the multiple of Y and the Standard Deviation of
the Baseline's Portfolio Values. Values greater than this boundary
value shall be classified as the Experimental Class.
11. Optionally one can reiterate this process until best prediction
accuracy is obtained.
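The arithmetic in steps 2 to 5, 9, and 10 above can be sketched in Python as shown below. The mean-variance optimization itself (steps 6 to 8) is not implemented; the gene weights are simply assumed to have been produced by it. All function names, gene names, intensities, weights, and the values chosen for X and Y are hypothetical, and the use of the sample standard deviation is an assumption because the text does not say which estimator is intended.

```python
import math
import statistics

def baseline_readings(baseline_samples, x):
    """Steps 2-3: per-gene (X * standard deviation + mean) over baseline samples."""
    readings = {}
    for gene in baseline_samples[0]:
        values = [sample[gene] for sample in baseline_samples]
        readings[gene] = x * statistics.stdev(values) + statistics.mean(values)
    return readings

def transformed_ratios(sample, readings):
    """Steps 4-5: log10 of (sample intensity / baseline reading) per gene."""
    return {gene: math.log10(sample[gene] / readings[gene]) for gene in readings}

def portfolio_value(sample, weights):
    """Step 9: sum of each gene's intensity multiplied by its portfolio weight."""
    return sum(weights[gene] * sample[gene] for gene in weights)

def boundary_value(baseline_samples, weights, y):
    """Step 10: mean baseline portfolio value plus Y standard deviations."""
    values = [portfolio_value(sample, weights) for sample in baseline_samples]
    return statistics.mean(values) + y * statistics.stdev(values)

# Hypothetical baseline-class intensities for a two-gene portfolio.
baseline = [
    {"G1": 100.0, "G2": 200.0},
    {"G1": 120.0, "G2": 180.0},
    {"G1": 110.0, "G2": 190.0},
]
unknown = {"G1": 1200.0, "G2": 20.0}

readings = baseline_readings(baseline, x=1.0)
ratios = transformed_ratios(unknown, readings)

# Weights are assumed to come from the mean-variance optimizer (steps 6-8).
weights = {"G1": 0.7, "G2": 0.3}
boundary = boundary_value(baseline, weights, y=2.0)
is_experimental = portfolio_value(unknown, weights) > boundary
```

With these invented numbers the up-regulated gene receives a positive transformed ratio, the down-regulated gene a negative one, and the unknown sample falls above the boundary value.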
Alternatively, genes can first be pre-selected by identifying those
genes whose expression shows some minimal level of differentiation.
The pre-selection in this alternative method is preferably based
on a threshold of the form
$$k \le \frac{\mu_1 - \mu_n}{\sigma_1 + \sigma_n}$$
where $\mu_1$ is the mean of the subset known to possess the disease
or condition, $\mu_n$ is the mean of the subset of normal samples, and
$\sigma_1 + \sigma_n$ represents the combined standard deviations. The
numeric threshold on the left-hand side, written here as $k$, is not
legible in the source. A signal to noise cutoff can also be used by
pre-selecting the data according to a relationship of similar form
[equation not recoverable from source; the surviving symbols indicate
a lower bound on a ratio with one mean term in the numerator and
$\sigma_1 + \sigma_n$ in the denominator]. This ensures that genes
that are pre-selected based on their differential modulation are
differentiated in a clinically significant way, that is, above the
noise level of instrumentation appropriate to the task of measuring
the diagnostic parameters. For each marker pre-selected according
to these criteria, a matrix is established in which columns represent
samples, rows represent markers, and each element is a normalized
intensity measurement for the expression of that marker according
to a relationship [equation not recoverable from source; the surviving
symbols show two $\mu$ terms] where $I$ is the intensity measurement.
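The pre-selection criterion can be sketched in Python under the assumption, suggested by the definitions above, that it compares the difference between the disease-subset mean and the normal-subset mean with the sum of the two standard deviations. The threshold is left as a caller-supplied parameter because its value is not fixed by this sketch; the function names and data are hypothetical, and the sample standard deviation is an assumed estimator. Because the difference is signed, this form only retains genes that are higher in the disease subset.

```python
import statistics

def separation_score(disease_values, normal_values):
    """(mean_disease - mean_normal) / (sd_disease + sd_normal) for one gene.

    Returns None when both groups have zero spread, since the ratio is undefined.
    """
    spread = statistics.stdev(disease_values) + statistics.stdev(normal_values)
    if spread == 0:
        return None
    difference = statistics.mean(disease_values) - statistics.mean(normal_values)
    return difference / spread

def preselect(disease, normal, threshold):
    """Keep genes whose separation score meets the caller-supplied threshold."""
    selected = []
    for gene in disease:
        score = separation_score(disease[gene], normal[gene])
        if score is not None and score >= threshold:
            selected.append(gene)
    return selected

# Hypothetical per-gene intensities for disease and normal subsets.
disease = {"G1": [10.0, 12.0, 14.0], "G2": [5.0, 6.0, 7.0]}
normal = {"G1": [2.0, 3.0, 4.0], "G2": [5.0, 7.0, 6.0]}

# The threshold of 1.0 is an illustrative choice only.
selected = preselect(disease, normal, threshold=1.0)
```

Raising the threshold makes the pre-selection more stringent and shrinks the candidate set passed on to portfolio optimization.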
It is also possible to set additional boundary conditions to define
the optimal portfolios. For example, portfolio size can be limited
to a fixed range or number of markers. This can be done either by
making data pre-selection criteria more stringent
[example inequalities not recoverable from source: two lower-bound
relations of the signal to noise form] or by using programming
features such as restricting portfolio size. One could, for example,
set the boundary condition that the efficient frontier is to be
selected from among only the most optimal 10 genes. One could also
use all of the genes pre-selected for determining the efficient
frontier and then limit the number of genes selected (e.g., no more
than 10).
The process of selecting a portfolio can also include the application
of heuristic rules. Preferably, such rules are formulated based
on biology and an understanding of the technology used to produce
clinical results. More preferably, they are applied to output from
the optimization method. For example, the mean variance method of
portfolio selection can be applied to microarray data for a number
of genes differentially expressed in subjects with breast cancer.
Output from the method would be an optimized set of genes that could
include some genes that are expressed in peripheral blood as well
as in diseased tissue. If samples used in the testing method are
obtained from peripheral blood and certain genes differentially
expressed in instances of breast cancer could also be differentially
expressed in peripheral blood, then a heuristic rule can be applied
in which a portfolio is selected from the efficient frontier excluding
those that are differentially expressed in peripheral blood. Of
course, the rule can be applied prior to the formation of the efficient
frontier by, for example, applying the rule during data pre-selection.
Other heuristic rules can be applied that are not necessarily related
to the biology in question. For example, one can apply the rule
that only a given percentage of the portfolio can be represented
by a particular gene or genes. Commercially available software such
as the Wagner Software readily accommodates these types of heuristics.
This can be useful, for example, when factors other than accuracy
and precision (e.g., anticipated licensing fees) have an impact
on the desirability of including one or more genes.
One method of the invention involves comparing gene expression
profiles for various genes (or portfolios) to ascribe prognoses.
The gene expression profiles of each of the genes comprising the
portfolio are fixed in a medium such as a computer readable medium.
This can take a number of forms. For example, a table can be established
into which the range of signals (e.g., intensity measurements) indicative
of disease/relapse is input. Actual patient data can then be compared
to the values in the table to determine whether the patient samples
are normal or diseased. In a more sophisticated embodiment, patterns
of the expression signals (e.g., fluorescent intensity) are recorded
digitally or graphically. The gene expression patterns from the
gene portfolios used in conjunction with patient samples are then
compared to the expression patterns. Pattern comparison software
can then be used to determine whether the patient samples have a
pattern indicative of recurrence of the disease. Of course, these
comparisons can also be used to determine whether the patient is
not likely to experience disease recurrence. The expression profiles
of the samples are then compared to the portfolio of a control cell.
If the sample expression patterns are consistent with the expression
pattern for recurrence of a breast cancer then (in the absence of
countervailing medical considerations) the patient is treated as
one would treat a relapse patient. If the sample expression patterns
are consistent with the expression pattern from the normal/control
cell then the patient is diagnosed negative for breast cancer.
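One simple way to compare a patient profile with stored reference patterns is by correlation, sketched below in Python. This is an illustrative assumption rather than the comparison software contemplated above: the stored templates, their labels, the sample values, and the function names are all hypothetical, and Pearson correlation is only one of several pattern recognition techniques that could be used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def closest_pattern(sample, templates):
    """Return the label of the stored template most correlated with the sample."""
    return max(templates, key=lambda label: pearson(sample, templates[label]))

# Hypothetical stored patterns (e.g., log ratios over a four-gene portfolio).
templates = {
    "recurrence": [2.0, -1.0, 1.5, -0.5],
    "no_recurrence": [-1.0, 1.0, -1.5, 0.5],
}
patient = [1.8, -0.7, 1.2, -0.4]
label = closest_pattern(patient, templates)
```

In practice the templates would be derived from samples with known outcome, and any resulting label would be weighed alongside other medical considerations as described above.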
Numerous well known methods of pattern recognition are available.
The following references provide some examples:
Weighted Voting: Golub, T R., Slonim, D K., Tamayo, P., Huard,
C., Gaasenbeek, M., Mesirov, J P., Coller, H., Loh, M L., Downing,
J R., Caligiuri, M A., Bloomfield, C D., Lander, E S. Molecular
classification of cancer: class discovery and class prediction by
gene expression monitoring. Science 286:531-537, 1999.
Support Vector Machines: Su, A I., Welsh, J B., Sapinoso, L M.,
Kern, S G., Dimitrov, P., Lapp, H., Schultz, P G., Powell, S M.,
Moskaluk, C A., Frierson, H F. Jr., Hampton, G M. Molecular classification
of human carcinomas by use of gene expression signatures. Cancer
Research 61:7388-7393, 2001.
Ramaswamy, S., Tamayo, P., Rifkin, R.,
Mukherjee, S., Yeang, C H., Angelo, M., Ladd, C., Reich, M., Latulippe,
E., Mesirov, J P., Poggio, T., Gerald, W., Loda, M., Lander, E S.,
Golub, T R. Multiclass cancer diagnosis using tumor gene expression
signatures. Proceedings of the National Academy of Sciences of the
USA 98:15149-15154, 2001.
K-nearest Neighbors: Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee,
S., Yeang, C H., Angelo, M., Ladd, C., Reich, M., Latulippe, E.,
Mesirov, J P., Poggio, T., Gerald, W., Loda, M., Lander, E S., Golub,
T R. Multiclass cancer diagnosis using tumor gene expression signatures.
Proceedings of the National Academy of Sciences of the USA 98:15149-15154,
2001.
Correlation Coefficients: van 't Veer L J, Dai H, van de Vijver
M J, He Y D, Hart A A, Mao M, Peterse H L, van der Kooy K, Marton
M J, Witteveen A T, Schreiber G J, Kerkhoven R M, Roberts C, Linsley
P S, Bernards R, Friend S H. Gene expression profiling predicts clinical
outcome of breast cancer. Nature. 2002 Jan. 31;415(6871):530-6.
The gene expression profiles of this invention can also be used
in conjunction with other non-genetic diagnostic methods useful
in cancer diagnosis, prognosis, or treatment monitoring. For example,
in some circumstances it is beneficial to combine the diagnostic
power of the gene expression based methods described above with
data from conventional markers such as serum protein markers. A
range of such markers exists including such analytes as Estrogen
Receptor (ER) with ER+ results indicating a greater likelihood of
recurrence or metastasis. Other markers such as the protein (or
peptides) produced by the estrogen regulated gene sequence pLIV1
can be used in this capacity as described in U.S. Pat. No. 5,693,465
(incorporated by reference in this specification). In one such method,
blood is periodically taken from a treated patient and then subjected
to an enzyme immunoassay for one or more serum markers. When the
concentration of the marker(s) suggests the return of tumors or
failure of therapy, a sample source amenable to gene expression
analysis is taken. Where a suspicious mass exists, a fine needle
aspirate is taken and gene expression profiles of cells taken from
the mass are then analyzed as described above. Alternatively, tissue
samples may be taken from areas adjacent to the tissue from which
a tumor was previously removed. This approach can be particularly
useful when other testing produces ambiguous results.
Articles of this invention include representations of the gene
expression profiles useful for treating, diagnosing, prognosticating,
and otherwise assessing diseases. These profile representations
are reduced to a medium that can be automatically read by a machine
such as computer readable media (magnetic, optical, and the like).
The articles can also include instructions for assessing the gene
expression profiles in such media. For example, the articles may
comprise a CD ROM having computer instructions for comparing gene
expression profiles of the portfolios of genes described above.
The articles may also have gene expression profiles digitally recorded
therein so that they may be compared with gene expression data from
patient samples. Alternatively, the profiles can be recorded in
a different representational format. A graphical recordation is one
such format. Clustering algorithms such as those incorporated in
"DISCOVERY" and "INFER" software from Partek,
Inc. mentioned above can best assist in the visualization of such
data.
Different types of articles of manufacture according to the invention
are media or formatted assays used to reveal gene expression profiles.
These can comprise, for example, microarrays in which sequence complements
or probes are affixed to a matrix to which the sequences indicative
of the genes of interest combine creating a readable determinant
of their presence. Alternatively, articles according to the invention
can be fashioned into reagent kits for conducting hybridization,
amplification, and signal generation indicative of the level of
expression of the genes of interest for detecting breast cancer.
Kits made according to the invention include formatted assays for
determining the gene expression profiles. These can include all
or some of the materials needed to conduct the assays such as reagents
and instructions.
The invention is further illustrated by the following non-limiting
examples.
EXAMPLES
Genes analyzed according to this invention are typically related
to full-length nucleic acid sequences that code for the production
of a protein or peptide. One skilled in the art will recognize that
identification of full-length sequences is not necessary from an
analytical point of view. That is, portions of the sequences or
ESTs can be selected according to well-known principles for which
probes can be designed to assess gene expression for the corresponding
gene.
Example 1
Sample Handling and LCM
Fresh frozen tissue samples were collected from patients who had
surgery for breast tumors. The samples that were used were from
149 Stage I and II patients (staged according to standard clinical
diagnostics and pathology). Clinical outcome of the patients was
known. Seventy four of the patients have remained disease-free for
more than seven years while seventy five patients had distant metastases
within four years. One hundred and three patients were lymph node
negative while forty six were lymph node positive.
The tissues were snap frozen in liquid nitrogen within 20-30 minutes
of harvesting, and stored at -80°C thereafter. For laser
capture, the samples were cut (6 µm), and one section was mounted
on a glass slide, and the second on film (P.A.L.M.), which had been
fixed onto a glass slide (Micro Slides Colorfrost, VWR Scientific,
Media, PA). The section mounted on a glass slide was then fixed
in cold acetone, and stained with Mayer's Haematoxylin (Sigma, St.
Louis, Mo.). A pathologist analyzed the samples for diagnosis and
grade. The clinical stage was estimated from the accompanying surgical
pathology and clinical reports to verify the staging of the tumor.
The section mounted on film was then fixed for five minutes in
100% ethanol, counter stained for 1 minute in eosin/100% ethanol
(100 µg of Eosin in 100 ml of dehydrated ethanol), quickly soaked
once in 100% ethanol to remove the free stain, and air dried for
10 minutes.
Before use in LCM, the membrane (LPC-MEMBRANE PEN FOIL 1.35 µm
No 8100, P.A.L.M. GmbH Mikrolaser Technologie, Bernried, Germany)
and slides were pretreated to abolish RNases, and to enhance the
attachment of the tissue sample onto the film. Briefly, the slides
were washed in DEP H2O, and the film was washed in RNase AWAY
(Molecular Bioproducts, Inc., San Diego, Calif.) and rinsed in DEP
H2O. After attaching the film onto the glass slides, the slides
were baked at +120°C for 8 hours, treated with TI-SAD (Diagnostic
Products Corporation, Los Angeles, Calif., 1:50 in DEP H2O,
filtered through cotton wool), and incubated at +37°C for
30 minutes. Immediately before use, a 10 µl aliquot of RNase
inhibitor solution (Rnasin Inhibitor 2500 U=33 U/µl N211A, Promega
GmbH, Mannheim, Germany, 0.5 µl in 400 µl of freezing solution,
containing 0.15 mol NaCl, 10 mmol Tris pH 8.0, 0.25 mmol dithiothreitol)
was spread onto the film, where the tissue sample was to be mounted.
The tissue sections mounted on film were used for LCM. Approximately
2000 epithelial cells/sample were captured using the PALM Robot-Microbeam
technology (P.A.L.M. Mikrolaser Technologie, Carl Zeiss, Inc., Thornwood,
N.Y.), coupled to a Zeiss Axiovert 135 microscope (Carl Zeiss Jena
GmbH, Jena, Germany). The surrounding stroma in the normal mucosa,
and the occasional intervening stromal components in cancer samples,
were included. The captured cells were put in tubes in 100% ethanol
and preserved at -80° C.
Example 2
RNA Extraction and Amplification
A Zymo-Spin Column (Zymo Research, Orange, Calif. 92867) was used
to extract total RNA from the LCM captured samples. About 2 ng of
total RNA was resuspended in 10 µl of water and 2 rounds of
T7 RNA polymerase based amplification were performed to yield about
50 µg of amplified RNA.
Example 3
cDNA Microarray Hybridization and Quantitation
A set of cDNA microarrays consisting of approximately 23,000 human
cDNA clones was used to test the samples by use of the human U133A
chip commercially available from Affymetrix, Inc. Total
RNA obtained and prepared as outlined above was applied to the chips
and analyzed by Agilent BioAnalyzer according to the manufacturer's
protocol. All 149 samples passed the quality control standards and
the data were used for marker selection.
Marker selection was performed by analyzing the 103 lymph node
negative patients. Genes that allow the discrimination of distant
metastases and survivors were identified using a Cox proportional
hazard model. Chip intensity data were analyzed using MAS Version
5.0 software commercially available from Affymetrix, Inc. ("MAS
5.0"). An unsupervised analysis was conducted first, followed
by a supervised analysis.
The chip intensity data obtained as described were the input for
the unsupervised clustering software commercially available as PARTEK
version 5.1 software. This unsupervised clustering algorithm identified
a group of 22 patients with significantly low expression of many
genes including estrogen receptor. ER/PR are known prognostic factors
for poor outcome in breast cancer, so this group of 22 patients was
excluded from subsequent analysis to identify additional factors
(gene markers) with independent value as prognostic indicators.
The remaining 81 patients were further filtered to remove potential
effects of the well-characterized prognostic indicators of age and
tumor size. Twenty-seven patients older than 55 years or having
tumors larger than 5 cm were thus also excluded, leaving 54 patients
for analysis.
A Cox proportional hazard model was used for gene selection. In
each cycle of the total 31 cycles, each of the 31 patients in the
training set was held out in turn, and the remaining 26 patients were used in
the univariate Cox model regression to assess the strength of association
of gene expression with the patient survival time. The strength
of such association was evaluated by the corresponding
standardized parameter estimate and P value returned from the Cox
model regression. A P value of 0.01 was used as the threshold to select
the top genes from each cycle of the leave-one-out gene selection. The
top genes selected from each cycle were then compared in order to
select those genes that showed up at least 28 times in the total
of 31 leave-one-out gene selection cycles. A total of 56 genes were
selected. Gene expression for those genes having Seq. ID No. 1 to 26
and Seq. ID No. 56 was up-regulated at least two fold, and gene
expression for those genes having Seq. ID No. 27 to 55 was
down-regulated at least two fold.
TABLE 1 Breast Cancer Prognostic Gene Markers.
                    Modulation
                    (Standardized
Gene                Coefficient)    P value   Sequence I.D. No.
202984_s_at          3.8521         0.0001    1
208777_s_at          3.4922         0.0005    2
222133_s_at          3.1841         0.0015    3
218185_s_at          3.1379         0.0017    4
219571_s_at          3.1131         0.0019    5
201138_s_at          3.1075         0.0019    6
209155_s_at          3.1018         0.0019    7
212468_at            3.0663         0.0022    8
217593_at            3.0584         0.0022    9
212973_at            3.0325         0.0024    10
202971_s_at          2.9994         0.0027    11
204444_at            2.9926         0.0028    12
205169_at            2.9911         0.0028    13
219751_at            2.9707         0.0030    14
217988_at            2.9649         0.0030    15
212942_s_at          2.9460         0.0032    16
208993_s_at          2.9423         0.0033    17
219105_x_at          2.9324         0.0034    18
220085_at            2.9001         0.0037    19
206640_x_at          2.8799         0.0040    20
205062_x_at          2.8663         0.0042    21
209385_s_at          2.8115         0.0049    22
AFFX-M27830_5_at     2.7868         0.0053    56
215170_s_at          2.7814         0.0054    23
207663_x_at          2.7634         0.0057    24
212229_s_at          2.7422         0.0061    25
215206_at            2.7317         0.0063    26
206241_at           -2.7281         0.0064    27
219813_at           -2.7406         0.0061    28
210969_at           -2.7522         0.0059    29
207865_s_at         -2.7691         0.0056    30
202520_s_at         -2.7702         0.0056    31
216516_at           -2.7708         0.0056    32
211646_at           -2.7853         0.0053    33
219463_at           -2.7860         0.0053    34
204532_x_at         -2.7921         0.0052    35
210365_at           -2.7931         0.0052    36
222098_s_at         -2.8121         0.0049    37
212800_at           -2.8267         0.0047    38
205582_s_at         -2.8350         0.0046    39
219096_at           -2.8393         0.0045    40
216944_s_at         -2.8667         0.0041    41
208923_at           -2.8766         0.0040    42
209309_at           -2.9149         0.0036    43
207981_s_at         -2.9294         0.0034    44
210160_at           -2.9448         0.0032    45
206862_at           -2.9676         0.0030    46
213110_s_at         -2.9857         0.0028    47
201906_s_at         -3.0124         0.0026    48
201057_s_at         -3.0133         0.0026    49
220798_x_at         -3.0270         0.0025    50
218650_at           -3.0513         0.0023    51
220986_s_at         -3.2095         0.0013    52
214451_at           -3.4431         0.0006    53
203844_at           -3.4965         0.0005    54
202966_at           -3.5864         0.0003    55
Construction of a multiple-gene predictor: The prediction index
is defined as the sum of the products of the 56 genes' expression
values (log10 based) and their corresponding Cox model parameter
estimates. The parameter estimate from the Cox model measures the
change in the patient's hazard as the gene expression value increases.
Therefore, patients with high scores using the index have poor survival
outcomes. This prediction index was applied to the training set
to obtain an estimate of the prediction accuracy (FIG. 1).
Cross-validation and evaluation of predictor: Performance of the
predictor should be determined on an independent data set because
most classification
methods work well on the examples that were used in their establishment.
The 23-patient testing set was used to assess prediction accuracy.
The cutoff for the classification was determined using the ROC curve
at 90% sensitivity. With the selected cutoff, the numbers of correct
predictions for relapse and survival patients in the test set are
summarized in Table 2. The Kaplan-Meier curve was constructed
on the predicted relapsers and survivors (FIG. 2).
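By way of illustration only, the prediction index and the choice of a cutoff at 90% sensitivity might be sketched in Python as follows. The three coefficients are taken from Table 1 as a subset of the 56 genes; the function names, the integer-percent parameter, and the rule of taking the highest observed score that still flags the required share of relapse patients are assumptions made for this sketch, not details given in the specification.

```python
import math

# Illustrative subset of the Cox coefficients listed in Table 1
# (three of the 56 genes; a full predictor would use all 56).
COEFFICIENTS = {
    "202984_s_at": 3.8521,
    "203844_at": -3.4965,
    "202966_at": -3.5864,
}


def prediction_index(expression, coefficients=COEFFICIENTS):
    """Sum over genes of log10(expression) times the Cox parameter estimate."""
    return sum(coef * math.log10(expression[gene])
               for gene, coef in coefficients.items())


def cutoff_for_sensitivity(scores, relapsed, sensitivity_pct=90):
    """Return the highest observed score that still flags at least
    sensitivity_pct percent of the known relapse patients.

    A patient is flagged as a predicted relapse when score >= cutoff.
    This rule is an assumption of the sketch.
    """
    relapse_scores = sorted(
        (s for s, r in zip(scores, relapsed) if r), reverse=True)
    if not relapse_scores:
        raise ValueError("at least one relapse patient is required")
    # Integer ceiling of sensitivity_pct percent of the relapse count.
    needed = max(1, (sensitivity_pct * len(relapse_scores) + 99) // 100)
    return relapse_scores[needed - 1]


def predict_relapse(score, cutoff):
    """High index values correspond to poor predicted outcome."""
    return score >= cutoff
```

With eleven relapse patients, as in the testing set, this rule requires at least ten of them to score at or above the cutoff.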
Overall prediction: Gene expression profiling of 54 Stage I and
II breast cancer patients led to identification of 56 genes that
have differential expression in these patients. Thirty-six of the
patients have remained disease-free for more than 7 years while
27 patients had distant metastases within 4 years. Using the 56-gene
predictor, 22 of the 27 relapse patients and 27 of 36 disease-free
patients were identified correctly. This result represents a sensitivity
of 82% and a specificity of 75%. The positive predictive value is
71% and the negative predictive value is 84% (Table 3). The Kaplan-Meier
curve was constructed on the predicted distant metastases and survivors
(FIG. 3).
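The 56 genes referred to above were chosen by the leave-one-out procedure described earlier in this example. The bookkeeping of that procedure might be sketched as follows. The callback `pvalue_fn` is hypothetical and stands in for a univariate Cox regression, which is not implemented here. The sketch also assumes exactly one patient is held out per cycle, which may differ from the per-cycle patient count reported in the text.

```python
from collections import Counter


def stable_genes(patients, genes, pvalue_fn, alpha=0.01, min_cycles=28):
    """Keep genes that pass the P value threshold in at least min_cycles
    of the leave-one-out cycles.

    pvalue_fn(gene, training_patients) is a hypothetical callback that
    returns the P value of a univariate Cox regression of survival time
    on that gene's expression, fitted on training_patients only.
    """
    counts = Counter()
    for held_out in range(len(patients)):
        # Assumption: one patient is left out per cycle.
        training = patients[:held_out] + patients[held_out + 1:]
        for gene in genes:
            if pvalue_fn(gene, training) < alpha:
                counts[gene] += 1
    return [gene for gene in genes if counts[gene] >= min_cycles]
```

With 31 training patients this runs 31 cycles, and the default `min_cycles=28` mirrors the requirement that a gene show up at least 28 times.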
An independent study was previously published (Van 't Veer et al.,
Nature 415, 530-535; Van de Vijver et al., NEJM 347, 1999-2009) in which
a 70-gene predictor was constructed to predict patient outcomes
in Stage I and II lymph node negative breast cancer. Only one gene
overlaps between the 70-gene predictor of the Van 't Veer et al. study and
the 56-gene predictor of this specification.
TABLE 2 Prediction accuracy based on testing set using 56-gene predictor.
Study Sample    Number of Samples    Predictions Correct
Relapse         11                   10
Survivor        12                   6
Sensitivity     91%
Specificity     50%
TABLE 3 Prediction accuracy based on all patients using 56-gene predictor.
Study Sample    Number of Samples    Predictions Correct
Relapse         25                   23
Survivor        29                   23
Sensitivity     92%
Specificity     79%
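As a worked check, the sensitivity and specificity reported in Tables 2 and 3 follow directly from the counts of correctly predicted relapse and survivor samples. The short Python sketch below recomputes them; the function name and dictionary keys are illustrative choices, and the sketch assumes every sample not predicted correctly was assigned to the opposite class.

```python
def diagnostic_metrics(relapse_total, relapse_correct,
                       survivor_total, survivor_correct):
    """Derive standard accuracy measures from per-class correct counts."""
    tp = relapse_correct                      # relapses predicted as relapse
    fn = relapse_total - relapse_correct      # relapses missed
    tn = survivor_correct                     # survivors predicted as survivor
    fp = survivor_total - survivor_correct    # survivors flagged as relapse
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),   # positive predictive value
        "npv": tn / (tn + fn),   # negative predictive value
    }


# Counts as listed in Table 2 (testing set) and Table 3 (all patients).
testing_set = diagnostic_metrics(11, 10, 12, 6)
all_patients = diagnostic_metrics(25, 23, 29, 23)
```

Rounded to whole percentages, `testing_set` gives 91% sensitivity and 50% specificity, and `all_patients` gives 92% and 79%, matching the tables.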
Example 4
Further Portfolios
The 56 gene portfolio was subjected to different treatments to
fashion further portfolios that provide clinically significant benefits
with fewer numbers of gene expression signatures for analysis.
a. In a first treatment, correlation coefficients among the 56
genes were calculated by Spearman rank correlation and Pearson's
correlation. Using 0.7 as the correlation cutoff, a portfolio of
45 modulated genes was established. The genes are shown in Table 4.
b. In a second treatment, the 56 genes were tested with t-tests
using either the training or testing dataset. The genes that displayed
significant p values (<0.05) in both training and testing data
were selected as a portfolio. A portfolio of 26 modulated genes
was thus established. The genes are shown in Table 5.
c. The 26 gene portfolio of (b) and Table 5 was then evaluated based on the
known biological functions of the genes in the portfolio. Those
having a biological relationship to a metastatic pathway were selected.
A portfolio of 13 modulated genes was thus established. The genes
are shown in Table 6.
d. A two gene pair exhibiting the best classification
performance was selected from the 56 gene portfolio. Serially,
one additional gene was added to the portfolio and tested to determine
whether the addition of that signature improved the overall classification
accuracy of the two gene combination in both the training set and
the testing set.
This procedure was repeated until no further improvement was achieved.
A portfolio of 6 modulated genes was established. The genes are
shown in Table 7.
The sensitivity and specificity for each of the
portfolios shown in Tables 4-7 were determined based on predicted
versus known outcomes for the samples described above. These values
are shown in Tables 8-11.
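For illustration, the correlation-based reduction of treatment (a) might be sketched as a greedy filter using Pearson's correlation and the 0.7 cutoff. The specification does not state which member of a correlated pair was retained, so the rule used here (walk the genes in a caller-supplied ranking and drop any gene that correlates above the cutoff with a gene already kept) is an assumption, as are the function names. Spearman rank correlation could be substituted by ranking each profile first.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def prune_correlated(profiles, ranking, cutoff=0.7):
    """Greedily keep genes whose absolute correlation with every gene
    already kept does not exceed the cutoff.

    profiles maps gene id -> expression values across the same samples;
    ranking lists gene ids in order of priority (an assumed input, for
    example ordered by absolute Cox coefficient).
    """
    kept = []
    for gene in ranking:
        if all(abs(pearson(profiles[gene], profiles[k])) <= cutoff
               for k in kept):
            kept.append(gene)
    return kept
```

The size of the resulting portfolio depends on the ranking supplied, so this sketch is not expected to reproduce the 45 gene set of Table 4 exactly.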
TABLE 4 45 Gene Set
                    Modulation
                    (Standardized
Gene                Coefficient)    Sequence I.D. No.
220986_s_at         -3.2095         52
220798_x_at         -3.0270         50
220085_at            2.9001         19
219751_at            2.9707         14
219105_x_at          2.9324         18
218650_at           -3.0513         51
214451_at           -3.4431         53
212973_at            3.0325         10
208993_s_at          2.9423         17
205582_s_at         -2.8350         39
205169_at            2.9911         13
203844_at           -3.4965         54
202984_s_at          3.8521         1
202966_at           -3.5864         55
201057_s_at         -3.0133         49
222133_s_at          3.1841         3
219096_at           -2.8393         40
218185_s_at          3.1379         4
212942_s_at          2.9460         16
210160_at           -2.9448         45
209155_s_at          3.1018         7
204444_at            2.9926         12
202971_s_at          2.9994         11
201138_s_at          3.1075         6
222098_s_at         -2.8121         37
219813_at           -2.7406         28
216944_s_at         -2.8667         41
215206_at            2.7317         26
212800_at           -2.8267         38
212229_s_at          2.7422         25
211646_at           -2.7853         33
210365_at           -2.7931         36
209385_s_at          2.8115         22
209309_at           -2.9149         43
208923_at           -2.8766         42
207663_x_at          2.7634         24
205062_x_at          2.8663         21
202520_s_at         -2.7702         31
AFFX-M27830_5_at     2.7868         56
216516_at           -2.7708         32
215170_s_at          2.7814         23
210969_at           -2.7522         29
207981_s_at         -2.9294         44
206241_at           -2.7281         27
204532_x_at         -2.7921         35
TABLE 5 26 Gene Set
                    Modulation
                    (Standardized
Gene                Coefficient)    Sequence I.D. No.
205169_at            2.9911         13
203844_at           -3.4965         54
205062_x_at          2.8663         21
202971_s_at          2.9994         11
201906_s_at         -3.0124         48
212942_s_at          2.9460         16
206862_at           -2.9676         46
202966_at           -3.5864         55
201057_s_at         -3.0133         49
219105_x_at          2.9324         18
217593_at            3.0584         9
202520_s_at         -2.7702         31
210365_at           -2.7931         36
215206_at            2.7317         26
212229_s_at          2.7422         25
211646_at           -2.7853         33
219813_at           -2.7406         28
216944_s_at         -2.8667         41
219096_at           -2.8393         40
218185_s_at          3.1379         4
213110_s_at         -2.9857         47
212468_at            3.0663         8
208993_s_at          2.9423         17
208777_s_at          3.4922         2
220085_at            2.9001         19
219751_at            2.9707         14
TABLE 6 13 Gene Set
                    Modulation
                    (Standardized
Gene                Coefficient)    Sequence I.D. No.
202971_s_at          2.9994         11
201906_s_at         -3.0124         48
206862_at           -2.9676         46
202966_at           -3.5864         55
219105_x_at          2.9324         18
210365_at           -2.7931         36
212229_s_at          2.7422         25
219813_at           -2.7406         28
219096_at           -2.8393         40
218185_s_at          3.1379         4
213110_s_at         -2.9857         47
208777_s_at          3.4922         2
220085_at            2.9001         19
TABLE 7 6 Gene Set
                    Modulation
                    (Standardized
Gene                Coefficient)    Sequence I.D. No.
205169_at            2.9911         13
202966_at           -3.5864         55
206862_at           -2.9676         46
219105_x_at          2.9324         18
205062_x_at          2.8663         21
201138_s_at          3.1075         6
TABLE 8 45-gene Prognostic Portfolio
Study Sample    Number of Samples    Predictions Correct
Relapse         11                   10
Survivor        12                   6
Sensitivity     91%
Specificity     50%
TABLE 9 26-gene Prognostic Portfolio
Study Sample    Number of Samples    Predictions Correct
Relapse         11                   10
Survivor        12                   10
Sensitivity     91%
Specificity     83%
TABLE 10 13-gene Prognostic Portfolio
Study Sample    Number of Samples    Predictions Correct
Relapse         11                   10
Survivor        12                   8
Sensitivity     91%
Specificity     67%
TABLE 11 6-gene Prognostic Portfolio
Study Sample    Number of Samples    Predictions Correct
Relapse         11                   10
Survivor        12                   8
Sensitivity     91%
Specificity     67%
|