Microarrays are a powerful tool for genome analysis, giving a genome-wide view in a single experiment. Data analysis is a vital part of a microarray study because it influences the final result. Each microarray study comprises multiple microarray experiments and yields tens of thousands of data points. Because the volume of data is growing exponentially, analysis becomes a challenging task. In general, the greater the volume of data, the greater the chance of erroneous results. Handling such large volumes of data requires high-end computational infrastructure and programs that can handle multiple data formats. Programs for microarray data analysis are already available on various platforms, but because of rapid development, the diversity of microarray technologies, and the variety of data formats, there is always a need for comprehensive and complete microarray data analysis.
Data analysis is the critical part of the whole study, since any error introduced at this stage will lead to biologically meaningless results. In data analysis, the information from the raw data file is processed further to yield meaningful biological results.
This part includes data normalization, flagging of the data, averaging the ratios across replicates, clustering of similarly expressed genes, and so on. Each replicate has to undergo normalization before further analysis. Normalization removes the non-biological variation between samples. After normalization, the ratio is calculated for each gene in the replicate, and differentially regulated genes are determined from this ratio. Various statistical analyses are also performed for confidence analysis. Each replicate is also examined for experimental artifacts and bias by computing parameters related to intensity, background, flags, spot details, and so on.
It is important to note the need to conduct microarray experiments in replicates. As with any other quantitative measurement, repeated experiments make it possible to carry out confidence analysis and to identify differentially expressed genes at a given level of confidence. More replicates provide more confidence in determining differentially expressed genes. In practice, three to five replicates is ideal.
Normalization is required to standardize the data and focus on biologically relevant changes. Many sources of systematic variation in microarray experiments affect the measured gene expression levels, such as dye bias, heat and light sensitivity, efficiency of dye incorporation, differences in the hybridization conditions of the labeled cDNA, scanning conditions, and unequal quantities of starting RNA. Normalization is an important step that adjusts the data set for technical variation and removes differences in relative abundance between gene expression profiles. This is the only point at which the analysis of one-color and two-color data differs. The normalization method depends on the data. The basic idea behind all normalization methods is that the expected mean intensity ratio between the two channels is one. If the observed mean intensity ratio deviates from one, the data are mathematically processed so that the final observed mean intensity ratio becomes one. Once the mean intensity ratio has been adjusted to one, the distribution of gene expression is centered, so that genuine differentials can be identified.
Before biological variation is analyzed, QC steps must be performed to determine whether the data are fit for statistical testing. Statistical tests are very sensitive to the nature of the input data.
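The centering idea described above can be sketched in a few lines of Python. This is a minimal illustration, not a specific published method: it assumes the adjustment is made on the log2 scale, so the geometric mean of the ratios becomes one (mean log ratio of zero). The function name and the intensity values are made up for the example.

```python
import math

def global_normalize(red, green):
    """Return log2(red/green) ratios shifted so that their mean is zero.

    Assumption: centering on the log2 scale, which makes the geometric
    mean of the red/green ratios equal to one.
    """
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = sum(log_ratios) / len(log_ratios)
    return [m - shift for m in log_ratios]

# Hypothetical two-channel intensities for four spots.
red = [400.0, 800.0, 1600.0, 3200.0]
green = [100.0, 400.0, 400.0, 1600.0]

normalized = global_normalize(red, green)
print(normalized)
```

Before centering, the mean log2 ratio in this toy data is 1.5, which would suggest a systematic bias toward the red channel; after centering, the values are spread symmetrically around zero.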
Filtering of flagged spots
Filtering out spots with unreliable intensity is an important quality control process. For example, there is a certain scanner limit below which intensity values can no longer be trusted. Typically, the lowest intensity value of reliable data is about 100–200 for Affymetrix data and 100–1000 for cDNA microarray data. These cut-offs are likely to change as scanners become more precise. Values below the cut-off are usually removed (filtered) from the data, because they are likely to be artifacts.
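A cut-off filter of this kind might look like the sketch below. The cut-off of 200 is taken from the upper end of the Affymetrix range quoted above and should be treated as an assumption to be tuned per scanner; the gene names and intensities are hypothetical.

```python
def filter_low_intensity(spots, cutoff=200.0):
    """Keep only spots whose intensity is at or above the cut-off."""
    return {gene: value for gene, value in spots.items() if value >= cutoff}

# Hypothetical spot intensities.
spots = {"geneA": 1520.0, "geneB": 85.0, "geneC": 240.0, "geneD": 199.0}

reliable = filter_low_intensity(spots, cutoff=200.0)
print(reliable)
```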
Filtering of noisy replicates
Filtering out noisy replicates is one of the crucial parts of quality control. Experimental replicates should behave in a similar pattern. Replicates with noise should be eliminated before analysis. Noisy replicates can be identified and removed using the ANOVA statistical method.
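The text above names ANOVA for this check. The sketch below is not ANOVA; it swaps in a simpler stand-in, the Pearson correlation between replicate profiles, only to illustrate the idea that replicates should behave alike. A replicate is flagged when it does not correlate well with any other replicate. The threshold of 0.8, the function names, and the data are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def flag_noisy_replicates(replicates, min_r=0.8):
    """Return names of replicates whose best correlation with any
    other replicate falls below min_r (a hypothetical threshold)."""
    names = list(replicates)
    noisy = []
    for name in names:
        best = max(
            pearson(replicates[name], replicates[other])
            for other in names
            if other != name
        )
        if best < min_r:
            noisy.append(name)
    return noisy

# Hypothetical log ratios for four genes in three replicates.
replicates = {
    "rep1": [1.0, 2.0, 3.0, 4.0],
    "rep2": [1.1, 2.1, 2.9, 4.2],
    "rep3": [4.0, 1.0, 3.5, 0.5],
}

noisy = flag_noisy_replicates(replicates)
print(noisy)
```

Using the best pairwise correlation rather than the average keeps a single bad replicate from dragging down the scores of the good ones when only three replicates are available.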
Filtering of non-significant genes
Filtering of non-significant genes reduces the number of genes so that analysis can be done on a selected set. Non-significant genes were removed by specifying a relative fold change with respect to the normal control: the cut-off values were 2 for overexpressed genes and −2 for underexpressed genes. As a result of the filtering, only a few genes were retained. The remaining genes are then subjected to statistical analysis.
Statistical analysis plays a vital role in identifying genes whose change in expression is statistically significant.
Clustering is a data mining technique used to group genes that have similar expression patterns. Hierarchical clustering and k-means clustering are widely used techniques in microarray analysis.
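The fold-change filter can be sketched as follows. The sketch assumes the common signed convention in which a ratio below one is reported as the negative reciprocal, so a fourfold decrease becomes −4; the text does not spell out which convention was used. Gene names and intensities are hypothetical.

```python
def signed_fold_change(treated, control):
    """Signed fold change: ratio if >= 1, otherwise -1/ratio.

    Assumption: this signed convention is what the cut-offs of 2 and -2
    refer to.
    """
    ratio = treated / control
    return ratio if ratio >= 1.0 else -1.0 / ratio

def filter_by_fold_change(expression, threshold=2.0):
    """Keep genes whose signed fold change is >= threshold or <= -threshold."""
    kept = {}
    for gene, (treated, control) in expression.items():
        fc = signed_fold_change(treated, control)
        if abs(fc) >= threshold:
            kept[gene] = fc
    return kept

# Hypothetical (treated, control) intensities.
expression = {
    "geneA": (800.0, 200.0),
    "geneB": (300.0, 250.0),
    "geneC": (100.0, 400.0),
    "geneD": (500.0, 250.0),
}

kept = filter_by_fold_change(expression, threshold=2.0)
print(kept)
```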
Hierarchical clustering is a statistical method for finding relatively homogeneous clusters. It consists of two separate phases. Initially, a distance matrix containing all the pairwise distances between the genes is calculated. Pearson's or Spearman's correlation is often used as the basis for the dissimilarity estimate (for example, one minus the correlation), but other measures, such as Manhattan distance or Euclidean distance, can also be applied. If the genes on a single chip need to be clustered, Euclidean distance is the correct choice, since at least two chips are needed to calculate any correlation measure. After the initial distance matrix has been calculated, the hierarchical clustering algorithm either iteratively joins the two closest clusters, starting from single clusters (agglomerative, bottom-up approach), or iteratively partitions clusters, starting from the complete set (divisive, top-down approach). After each step, a new distance matrix between the newly formed clusters and the other clusters is calculated. If there are N cases, the agglomerative approach completes after N − 1 merges. Linkage methods used in hierarchical cluster analysis include:
• Single linkage (minimum method, nearest neighbor)
• Complete linkage (maximum method, furthest neighbor)
• Average linkage (UPGMA)
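The agglomerative procedure and the three linkage rules listed above can be illustrated with a small pure-Python sketch. It is a teaching example under stated assumptions, not an efficient implementation: it uses Euclidean distance and recomputes cluster distances from the original profiles at every step instead of updating a stored distance matrix. The gene names and profiles are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, profiles, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    dists = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "single":      # nearest neighbor
        return min(dists)
    if linkage == "complete":    # furthest neighbor
        return max(dists)
    if linkage == "average":     # UPGMA
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(profiles, linkage="average"):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], profiles, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical expression profiles measured on two chips.
profiles = {
    "geneA": [0.0, 0.0],
    "geneB": [0.0, 1.0],
    "geneC": [5.0, 0.0],
    "geneD": [5.0, 2.0],
}

merges = agglomerate(profiles, linkage="single")
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

With four genes the history contains three merges. Switching `linkage` to "complete" or "average" keeps the same merge order for this toy data but changes the distance at which the last two clusters join.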
K-means clustering is an algorithm that classifies or groups genes by expression pattern into K groups, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid. The purpose of K-means clustering is thus to classify the data on the basis of similar expression (www.biostat.ucsf.edu).
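A minimal sketch of the K-means iteration is given below. To keep the example deterministic it seeds the centroids with the first K points, which is an assumption made here for illustration; practical implementations usually use random or smarter initialization. The data points are hypothetical two-chip expression profiles.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100):
    """Assign each point to the nearest centroid, then recompute the
    centroids, until the assignment stops changing."""
    # Assumption: the first k points seed the centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical expression profiles forming two obvious groups.
points = [
    [1.0, 1.0], [9.0, 9.0], [1.0, 2.0],
    [2.0, 1.0], [8.0, 9.0], [9.0, 8.0],
]

labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Each pass can only lower or leave unchanged the within-cluster sum of squares, which is why the loop stops once the assignment no longer changes.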
Gene ontology study
An ontology study gives biologically meaningful information, such as cellular location, molecular function, and biological process, about the genes that are differentially regulated in a disease or drug treatment condition with respect to the normal control.
Pathway analysis gives specific information about the pathways affected in the disease condition with reference to the normal control. It also makes it possible to identify the gene network and how the genes in it are regulated.
T. Hema Thanka Christlet, S.S.J. Shiek Fareeth Ahmed, A. Ahameethunisa, Janani Kannan. Dept of Biotechnology, SRM University