Table 1 A data mining primer: basic steps used for analysing microarray data

From: Assessing the human immune system through blood transcriptomics

Here we provide basic analysis steps and important considerations for microarray data analysis:

   - Per-chip normalization: This step controls for array-wide variations in intensity across the multiple samples that form a given dataset. Arrays, as with all fluorescence-based assays, are subject to signal variation for a variety of reasons, including the efficiency of the labeling and hybridization reactions and possibly other, less well-defined variables, such as reagent quality and sample handling. To control for this, samples are normalized by first subtracting background and then employing a normalization algorithm that rescales the overall intensity of each array to a fixed level shared by all samples.

   - Data filtering: Typically more than half of the oligonucleotide probes present on a microarray do not detect a signal for any of the samples in a given analysis. Thus, a detection filter is applied to exclude these transcripts from the original dataset. This step avoids the introduction of unnecessary noise in downstream analyses.

   - Unsupervised analysis: The aim of this analysis is to group samples on the basis of their molecular profiles without a priori knowledge of their phenotypic classification. The first step, which functions as a second detection filter, consists of selecting transcripts that are expressed in the dataset and display some degree of variability, which will facilitate sample clustering. For instance, this filter could select transcripts with expression levels that deviate by at least two-fold from the median intensity calculated across all samples. Importantly, this additional filter is applied independently of any knowledge of sample grouping or phenotype, which makes this type of analysis 'unsupervised'. Next, pattern discovery algorithms are often applied to identify 'molecular phenotypes' or trends in the data.

   - Clustering: Clustering is commonly used for the discovery of expression patterns in large datasets. Hierarchical clustering is an iterative agglomerative clustering method that can be used to produce gene trees and condition trees. Condition tree clustering groups samples based on the similarity of their expression profiles across a specified gene list. Other commonly employed clustering algorithms include k-means clustering and self-organizing maps.

   - Class comparison: Such analyses identify genes that are differentially expressed among study groups ('classes') and/or time points. The methods for analysis are chosen based on the study design. For studies with independent observations, t-tests or Mann-Whitney U tests (two groups), or ANOVA or Kruskal-Wallis tests (three or more groups), are used. Linear mixed model analyses are chosen for longitudinal studies.

   - Multiple testing correction: Multiple testing correction (MTC) methods provide a means to mitigate the level of noise in sets of transcripts identified by class comparison, that is, to reduce the number of false positives. While it reduces noise, MTC also raises the false negative rate because it dampens the true signal. The methods available are characterized by varying degrees of stringency, and therefore they produce gene lists with different levels of robustness.

• Bonferroni correction is the most stringent method used to control the familywise error rate (probability of making one or more type I errors) and can drastically reduce false positive rates. Conversely, it increases the probability of having false negatives.

• Benjamini and Hochberg false discovery rate [125] is a less stringent MTC method and provides a good balance between discovering statistically significant genes and limiting false positives. When this procedure is applied with a value of 0.01, about 1% of the transcripts called statistically significant are expected to have been identified by chance alone (false positives).

   - Class prediction: Class prediction analyses assess the ability of gene expression data to correctly classify a study subject or sample. K-nearest neighbors is a commonly used technique for this task. Other available class prediction procedures include, but are not limited to, discriminant analysis, general linear model selection, logistic regression, distance scoring, partial least squares, partition trees, and radial basis machine.

   - Sample size: The number of samples necessary for the identification of a robust signature is variable. Indeed, sample size requirements will depend on the amplitude of the difference between, and the variability within, study groups.

A number of approaches have been devised for the calculation of sample size for microarray experiments, but to date little consensus exists [126–129]. Hence, best practice in the field is to use independent sets of samples to validate candidate signatures. The robustness of an identified signature will thus rely on a statistically significant association between the predicted and true phenotypic class in both the first and the second test sets.
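
The normalization, filtering and multiple testing correction steps listed above can be illustrated with a minimal Python sketch. It is not the procedure of any particular analysis package. All function names are hypothetical, the choice of median scaling as the normalization algorithm is an assumption (the table does not name one), background is taken as a single constant per dataset, and the two-fold variability filter is read as "at least one sample deviates two-fold from the probe's median across samples".

```python
from statistics import median


def normalize_per_chip(samples, target=100.0, background=0.0):
    """Subtract background, then rescale each array so its median equals `target`.

    `samples` is a list of arrays, one list of probe intensities per sample.
    Assumes every array has a positive median after background subtraction.
    """
    normalized = []
    for intensities in samples:
        corrected = [max(x - background, 0.0) for x in intensities]
        scale = target / median(corrected)
        normalized.append([x * scale for x in corrected])
    return normalized


def detection_filter(samples, threshold):
    """Return indices of probes detected (>= threshold) in at least one sample."""
    n_probes = len(samples[0])
    return [p for p in range(n_probes)
            if any(sample[p] >= threshold for sample in samples)]


def variability_filter(samples, probes, fold=2.0):
    """Keep probes for which some sample deviates >= `fold` from the probe median.

    No sample grouping is used, which keeps the selection unsupervised.
    """
    kept = []
    for p in probes:
        values = [sample[p] for sample in samples]
        m = median(values)
        if m > 0 and any(v >= fold * m or v <= m / fold for v in values):
            kept.append(p)
    return kept


def bonferroni(p_values, alpha=0.05):
    """Flag tests with p <= alpha / m (controls the familywise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Flag tests passing the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    significant = [False] * m
    for i in order[:cutoff]:
        significant[i] = True
    return significant
```

As a usage illustration with made-up p-values, `[0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]` tested at 0.05 yields one significant result under `bonferroni` and two under `benjamini_hochberg`, which reflects the difference in stringency described in the table.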