Data mining predicts chemical-gene-cancer associations.

Patel, CJ and AJ Butte. 2010. Predicting environmental chemical factors associated with disease-related gene expression data. BMC Medical Genomics http://dx.doi.org/10.1186/1755-8794-3-17.

Synopsis by Thea Edwards

Combining data from two large databases is one of the next steps in understanding how the environment interacts with genes to influence disease, according to two Stanford scientists who are trying to untangle these interrelated effects. The pair analyzed information collected through newer analytical methods – such as gene arrays – to better understand and predict environment-gene-disease patterns.

This so-called data-mining approach is a useful and cost-effective way to identify interactions among hundreds of chemicals and thousands of genetic measurements associated with a disease. The associations can then be targeted for more efficient and specific experimental tests or epidemiological studies.

Many diseases – including cancer – result from interactions between a person’s genes and the environment. Environmental factors – such as contaminants, temperature, food and others – can alter the way some genes function: if and when they turn on or off, and the kind and quantity of proteins they make. These changes to gene function can influence a cell’s chemical signals and lead to disease.

But laboratory and human studies designed to understand these connections are time-consuming and expensive. An alternative is to tap into the vast amount of genetic information that has been collected through faster and cheaper laboratory methods – such as gene arrays – and stored in large computer databases.

In this study, the researchers relied on two public databases. One tracks which chemicals influence which genes – known as a chemical-gene signature. It also has information about the next step: which genes can influence disease. The other database has information on what proteins or products the genes make that may be associated with disease – called gene expression patterns. The researchers integrated information from the two databases, relating the chemical-gene signatures for 1,338 chemicals to the changes in gene expression that are associated with certain diseases.
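The matching step can be pictured with a small sketch. It assumes each chemical-gene signature and each disease's set of altered genes can be treated as a plain set of gene names, and it scores their overlap with a hypergeometric test, one common choice for this kind of comparison. The synopsis does not say which statistic the authors used, and the chemical labels, gene labels and function names below are hypothetical placeholders, not data from the study.

```python
from math import comb


def overlap_p_value(signature, disease_genes, universe_size):
    """Chance of an overlap at least this large if genes were drawn at random.

    Hypergeometric upper tail: `signature` and `disease_genes` are sets of
    gene names taken from a universe of `universe_size` measured genes.
    """
    observed = len(signature & disease_genes)
    n_sig = len(signature)
    n_dis = len(disease_genes)
    total = comb(universe_size, n_sig)
    largest_possible = min(n_sig, n_dis)
    tail = sum(
        comb(n_dis, i) * comb(universe_size - n_dis, n_sig - i)
        for i in range(observed, largest_possible + 1)
    )
    return tail / total


def rank_chemicals(signatures, disease_genes, universe_size):
    """Return (chemical, p-value) pairs, strongest association first."""
    scored = [
        (chemical, overlap_p_value(genes, disease_genes, universe_size))
        for chemical, genes in signatures.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy data: 20 measured genes, 5 of them altered in the disease.
disease_genes = {"G1", "G2", "G3", "G4", "G5"}
signatures = {
    "chemical_A": {"G1", "G2", "G3", "G10"},     # shares 3 genes with the disease
    "chemical_B": {"G11", "G12", "G13", "G14"},  # shares none
    "chemical_C": {"G1", "G15", "G16", "G17"},   # shares 1
}

ranking = rank_chemicals(signatures, disease_genes, universe_size=20)
for chemical, p_value in ranking:
    print(f"{chemical}: p = {p_value:.3f}")
```

A small p-value here flags only that a chemical's signature overlaps the disease's gene set more than chance would suggest. It says nothing about whether the chemical raises or lowers risk.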

The researchers specifically report on the environmental chemicals related to prostate, lung and breast cancers. They chose these three common cancers because much is already known about which chemicals and genes interact to influence them. By identifying these known interactions, they verified that the computer methods they developed work to predict environmental factors associated with disease.

They found that breast and prostate cancers were associated with estrogenic chemicals, including estradiol (the main form of estrogen in humans), genistein (a plant phytoestrogen found in soy) and bisphenol A (a synthetic estrogen used to make polycarbonate plastics).

Lung cancer was associated with exposure to sodium arsenite (an arsenic-containing mutagen), vanadium pentoxide (used to manufacture polyester, PVC plastics and newer vanadium-based batteries) and dimethylnitrosamine (found in tobacco smoke and also formed as a carcinogenic byproduct of wastewater chlorination).

These findings are consistent with other experimental and epidemiological studies. The results indicate that data mining is a valid and cost-effective way to direct future experimental or epidemiological research that will investigate the specifics of how environmental factors affect disease.

The authors note that their approach shows only association – that a chemical and a disease are related – and does not predict the direction of the association. Therefore, they cannot tell from their findings whether the chemicals cause or prevent the disease.

Originally published in Environmental Health News, Sep 09, 2010