Wednesday 16 February 2011

Advanced data analysis using Visualization

A common problem affecting many scientists, especially those working in the area of molecular biology, is the vast amount of data that is created by their experiments. With such a large volume of data to consider, it is often impossible to derive any real biological meaning from their findings with the naked eye alone, which means that sophisticated data algorithms need to be developed in order for researchers to interpret their data effectively.

Until now, computer software designed for this purpose has focused on being able to handle increasingly vast amounts of data. As a result, the role of the scientist/researcher has partly been set aside, and a lot of data analysis is now performed by specialist bioinformaticians and biostatisticians. In most cases, however, this model has several drawbacks, since it is typically the scientist who knows the most about the specific area being studied.

Even though the exploration and analysis of large data sets can be challenging, the active use of Visualization techniques can provide a powerful way of identifying important structures and patterns very quickly. Visualization provides the user with instant feedback, and with results that present themselves as they are being generated.

Qlucore recommends a five-step method to ensure repeatable and significant results when using Visualization. By applying this five-step method, it is possible to investigate large and complex data sets without being a statistics expert. The method is described below in more detail, but some basics need to be in place at the start.

First of all, the high-dimensional data needs to be reduced to lower dimensions so that it can be plotted in 3D. Qlucore recommends the use of Principal Component Analysis (PCA) for this purpose. Tools for colouring the data to highlight information are also required, along with filters and tools for selecting and deselecting parts of the data set.
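As a rough illustration of this reduction, the following Python sketch projects a data matrix onto its first three principal components using NumPy. The function name pca_project and the synthetic data (40 samples, 500 variables, two groups) are assumptions made for this example and are not taken from any Qlucore product.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project samples (rows) onto the leading principal components.

    Returns the projected coordinates and the fraction of the total
    variance captured by each retained component.
    """
    centered = data - data.mean(axis=0)            # centre every variable
    # SVD of the centred matrix: the rows of vt are the principal axes
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T        # coordinates for a 3D plot
    variance = singular_values ** 2
    explained = variance[:n_components] / variance.sum()
    return coords, explained

# Hypothetical data: 40 samples x 500 variables, where the first 20 samples
# differ from the rest in 25 of the variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(40, 500))
data[:20, :25] += 3.0

coords, explained = pca_project(data)
print(coords.shape)   # (40, 3)
print(explained)      # variance fraction per component, largest first
```

The three columns of coords can be passed to any 3D scatter plot, with the points coloured by a sample annotation.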

At this stage, researchers can begin the five-step Visualization process by detecting and removing the strongest signal present in the active dataset. Once this signal is identified, it can be removed in order to see whether there are any other obscured (but still detectable) signals present. Removing a strong signal will usually reduce the number of active samples, the number of active variables, or both.

Step two of the Visualization process is to assess the signal-to-noise ratio in the data by using PCA and randomization. The strength of a visually detected signal or pattern is measured by examining the amount of variance captured in the 3D PCA-plot. This captured variance is compared with what the researcher would expect to capture if the real variables were all replaced by random variables, and will therefore give a clear indication of how reliable the identified pattern is.
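One possible way to make this comparison concrete is sketched below in Python. It assumes that "random variables" can be approximated by shuffling each variable independently across the samples, which is only one of several reasonable randomization schemes. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def captured_variance(data, n_components=3):
    """Fraction of total variance captured by the leading principal components."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return float(variance[:n_components].sum() / variance.sum())

def randomized_baseline(data, n_rounds=20, n_components=3, seed=0):
    """Average captured variance after shuffling every variable independently.

    Shuffling keeps the distribution of each variable but destroys any
    structure shared between variables (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_rounds):
        shuffled = np.column_stack([rng.permutation(column) for column in data.T])
        fractions.append(captured_variance(shuffled, n_components))
    return float(np.mean(fractions))

# Hypothetical data: two groups of 20 samples that differ in 25 of 500 variables
rng = np.random.default_rng(1)
data = rng.normal(size=(40, 500))
data[:20, :25] += 3.0

observed = captured_variance(data)
baseline = randomized_baseline(data)
print(f"observed {observed:.3f}, randomized {baseline:.3f}")
```

An observed value clearly above the randomized baseline suggests that the pattern seen in the 3D plot is unlikely to be produced by noise alone.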

Step three is to remove any 'noise' by variance filtering. If researchers can see a significant signal-to-noise ratio in their active dataset, they should try to remove some of the active variables that are most likely contributing to the noise.
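A minimal Python sketch of variance filtering follows. It assumes that the variables with the lowest variance across samples are the ones most likely to contribute noise, and keeps only a chosen fraction of the most variable ones. The name variance_filter, the parameter keep_fraction and the test data are illustrative assumptions.

```python
import numpy as np

def variance_filter(data, keep_fraction=0.1):
    """Keep only the most variable columns (variables) of the data matrix.

    Returns the filtered matrix and the indices of the retained variables.
    """
    variances = data.var(axis=0, ddof=1)                    # per-variable variance
    n_keep = max(1, int(round(keep_fraction * data.shape[1])))
    keep = np.sort(np.argsort(variances)[-n_keep:])         # top n_keep, original order
    return data[:, keep], keep

# Hypothetical data: only the first 25 of 500 variables carry a group difference
rng = np.random.default_rng(2)
data = rng.normal(size=(40, 500))
data[:20, :25] += 4.0

filtered, keep = variance_filter(data, keep_fraction=0.05)
print(filtered.shape)   # (40, 25)
```

Repeating the PCA on the filtered matrix then shows whether the pattern becomes clearer once the low-variance variables are no longer active.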

Step four offers the option of performing statistical tests. These can be applied at any or all of the other stages of the five-step process: during the initial analysis, when a step is repeated, or at the end of a step. They can also be omitted altogether.
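The article does not name a particular test, so the Python sketch below simply assumes a two-group comparison and uses a permutation test on the difference in group means for each variable. The function name, the number of permutations and the data are hypothetical choices for illustration only.

```python
import numpy as np

def permutation_p_values(data, labels, n_permutations=500, seed=0):
    """Two-group permutation test on the difference in means, per variable."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)

    def mean_difference(mask):
        # Absolute difference between the two group means for every variable
        return np.abs(data[mask].mean(axis=0) - data[~mask].mean(axis=0))

    observed = mean_difference(labels)
    exceed = np.zeros(data.shape[1])
    for _ in range(n_permutations):
        # Reassign the group labels at random and count how often the
        # permuted difference is at least as large as the observed one
        exceed += mean_difference(rng.permutation(labels)) >= observed
    # Add-one correction so that no p-value is reported as exactly zero
    return (exceed + 1) / (n_permutations + 1)

# Hypothetical data: 30 samples x 50 variables, the first 5 variables differ by group
rng = np.random.default_rng(3)
labels = np.array([True] * 15 + [False] * 15)
data = rng.normal(size=(30, 50))
data[labels, :5] += 3.0

p_values = permutation_p_values(data, labels)
print(np.flatnonzero(p_values < 0.01))   # indices of variables flagged at the 1% level
```

Because many variables are tested at once, a multiple-testing correction would normally be applied to such p-values before drawing conclusions.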

The final step uses graphs to refine the search for subgroups or clusters. Connecting samples in networks or graphs, for example, makes it possible to move into higher dimensions (i.e. more than three), since the graph created in a sample plot is based on the distances in the space of all active variables, and can therefore provide more insight into the structure of the data.
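One common way to build such a graph, assumed here purely for illustration, is to connect every sample to its k nearest neighbours, with distances measured over all active variables rather than over the three plotted components. The function knn_graph and the two-cluster data are hypothetical.

```python
import numpy as np

def knn_graph(data, k=3):
    """Return sorted undirected edges (i, j) linking each sample to its k nearest neighbours."""
    # Squared Euclidean distances between all pairs of samples, using every variable
    sq_norms = (data ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (data @ data.T)
    np.fill_diagonal(sq_dist, np.inf)               # a sample is not its own neighbour
    neighbours = np.argsort(sq_dist, axis=1)[:, :k]
    edges = {
        tuple(sorted((i, int(j))))
        for i, row in enumerate(neighbours)
        for j in row
    }
    return sorted(edges)

# Hypothetical data: two well-separated clusters of 10 samples in 50 dimensions
rng = np.random.default_rng(4)
data = rng.normal(size=(20, 50))
data[10:] += 10.0

edges = knn_graph(data, k=3)
print(len(edges), "edges")
```

Drawing these edges on top of the 3D sample plot shows which samples are close in the full variable space even when the projection places them apart.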

These five steps are then repeated until there are no more structures to be found.

Used in this way, Visualization is a powerful tool for researchers, since the human brain is very good at detecting structures and patterns. If data can be visualized clearly, scientists can identify interesting or significant results easily and by themselves, without having to rely on specialist bioinformaticians and biostatisticians.

Qlucore started as a collaborative research project at Lund University, Sweden, supported by researchers at the Departments of Mathematics and Clinical Genetics, in order to address the vast amount of high-dimensional data generated with microarray gene expression analysis. As a result, it was recognised that an interactive scientific software tool was needed to conceptualise the ideas evolving from the research collaboration.

The basic concept behind the software is to provide a tool that can take full advantage of the most powerful pattern recogniser that exists - the human brain. The result is a core software engine that visualises the data in 3D and will aid the user in identifying hidden structures and patterns. Over the last two years the major efforts have been to optimise the early ideas and to develop a core software engine that is extremely fast, allowing the user to explore and analyse high-dimensional data sets interactively and in real time on a normal PC.

Qlucore was founded in early 2007 and the first product released was the Qlucore Gene Expression Explorer 1.0. The latest version of this software, Version 1.1, represents a major step forward with its advanced statistics support. All user action is at most two mouse clicks away. The company's early customers are mainly from the life science and biotech industries, but solutions for other industries are currently under development.

One of the key methods used by Qlucore Gene Expression Explorer to visualise data is dynamic principal component analysis (PCA), an innovative way of combining PCA with immediate user interaction. Because the display responds instantly, the user has full freedom to explore all possible versions of the presented view, which gives a comprehensive picture of a large data set as it is being visualised and analysed.

PCA works by projecting high-dimensional data down to lower dimensions. The specific projections of the high-dimensional data are chosen in order to maintain as much variance as possible in the projected data set. With Qlucore Gene Expression Explorer, data is projected and plotted on the two-dimensional computer screen and then rotated manually or automatically and examined by the naked eye.
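The idea of rotating a 3D projection and viewing it on a flat screen can be sketched in Python as follows. This is a generic illustration that assumes a rotation about the vertical axis followed by an orthographic projection. It does not describe how Qlucore Gene Expression Explorer renders its plots, and the function name is hypothetical.

```python
import numpy as np

def rotate_and_project(coords_3d, angle_deg):
    """Rotate 3D points about the vertical (y) axis, then drop depth to get screen x, y."""
    theta = np.radians(angle_deg)
    rotation = np.array([
        [np.cos(theta), 0.0, np.sin(theta)],
        [0.0, 1.0, 0.0],
        [-np.sin(theta), 0.0, np.cos(theta)],
    ])
    rotated = coords_3d @ rotation.T
    return rotated[:, :2]          # orthographic projection: keep x and y, discard z

# Three hypothetical points, one on each principal axis
points = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

for angle in (0, 45, 90):
    print(angle, np.round(rotate_and_project(points, angle), 3).tolist())
```

Stepping the angle in small increments and redrawing the scatter plot each time gives the impression of a rotating point cloud, which lets structure hidden along the depth axis come into view.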

Qlucore