A new method for the detection of gene-environment interactions in cancer studies

The grant is funded by la Ligue Nationale Contre le Cancer

Principal curve based on the three principal components.

In recent years, the detection of heterogeneity in epidemiology has attracted increasing interest. One of the main reasons is that accounting for heterogeneity in data analysis increases statistical power. In addition, detecting gene-environment interactions is of major interest in epidemiology because it makes it possible to identify high-risk subgroups in the population. Although several methods have already been proposed for this problem, detecting gene-environment effects remains difficult, in particular because the causal effect is generally not directly observed (as with some treatments, for example) and only proxy variables (such as BMI) are accessible.

The aim of this project is to develop a new statistical method for detecting gene-environment effects associated with the occurrence of cancer. The proposed approach will detect groups of individuals, characterized by their environmental factors, that have different cancer risks.

The method works as follows: each patient is positioned in a proximity space using multiple covariates (personal, clinical, environmental, etc.). The approach then exploits the fact that two neighboring patients in this proximity space are more likely to be exposed to common latent (and therefore not necessarily observed) factors. A Principal Component Analysis (PCA) is then applied to the proximity space and a smooth curve, called a "principal curve", is fitted to the result. Each individual can then be projected onto this curve, which yields an ordering of the individuals. Individuals who are close on this curve are thus expected to share similar exposure profiles. An example of a principal curve constructed on the three main components of a dataset is shown in the figure above.
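The pipeline described above (covariates, then PCA, then a principal curve, then an ordering of individuals) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the project's implementation: the helper names `pca_scores` and `fit_principal_curve` are hypothetical, the covariates are simulated from a single latent exposure `t`, and the curve is fitted with a deliberately crude scheme that alternates a cubic polynomial smoothing step with a nearest-point projection step.

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Standardize the columns of X and return scores on the leading components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes of the standardized data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

def fit_principal_curve(Y, n_iter=10, degree=3, grid_size=200):
    """Crude principal curve: alternate polynomial smoothing and projection.

    Returns the curve as a polyline (grid_size x d) and, for each row of Y,
    its position along the curve, rescaled to [0, 1].
    """
    # Start from the first principal component score as the ordering parameter.
    lam = Y[:, 0].astype(float)
    for _ in range(n_iter):
        # Smoothing step: regress every coordinate on the current parameter.
        grid = np.linspace(lam.min(), lam.max(), grid_size)
        coefs = [np.polyfit(lam, Y[:, j], degree) for j in range(Y.shape[1])]
        curve = np.column_stack([np.polyval(c, grid) for c in coefs])
        # Projection step: attach each individual to its nearest curve vertex.
        d2 = ((Y[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Re-parameterize by cumulative arc length along the polyline.
        seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
        arc = np.concatenate([[0.0], np.cumsum(seg)])
        lam = arc[nearest] / arc[-1]
    return curve, lam

# Simulated example: five noisy covariates driven by one latent exposure t.
rng = np.random.default_rng(0)
n = 300
t = rng.uniform(0.0, 1.0, size=n)  # latent factor, unobserved in practice
X = np.column_stack([t, t**2, np.sqrt(t), np.exp(t), -t])
X = X + rng.normal(scale=0.1, size=X.shape)

scores = pca_scores(X)                    # positions in the 3-component space
curve, lam = fit_principal_curve(scores)  # position of each individual on the curve
order = np.argsort(lam)                   # ordering of the individuals
```

In this toy setting, individuals that are adjacent in `order` tend to have similar values of the latent `t`, which is the property the method relies on. A real analysis would use a proper principal curve algorithm with a local smoother; the polynomial fit here only keeps the sketch short.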

The grant (140k euros) is funded by the French National League Against Cancer (LNCC).

The objectives:


  • Developing the method by combining principal curves and the breakpoint methods from
  • Developing new test statistics for heterogeneity that make it possible to test whether gene-environment interactions exist in the data.
  • Investigating other approaches not based on principal curves, such as minimum spanning trees.
  • Applying the method to EPIC and UK Biobank datasets.
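
To make the idea of a breakpoint along the ordering concrete, here is a minimal sketch of a single change-point scan on a sequence of case/control indicators sorted along the curve. It is a generic likelihood-ratio scan written for illustration, not the breakpoint method the project builds on; the function names `bernoulli_loglik` and `best_single_breakpoint`, the `min_size` parameter, and the toy sequence are all assumptions.

```python
import math

def bernoulli_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k cases among n individuals."""
    if k == 0 or k == n:
        return 0.0  # also covers the empty segment (n == 0)
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def best_single_breakpoint(y, min_size=10):
    """Scan every admissible split of an ordered 0/1 sequence.

    Returns (pos, stat): y[:pos] and y[pos:] are the two segments and stat is
    twice the log-likelihood gain over the model with a single common risk.
    """
    n = len(y)
    # Cumulative case counts, so each split is evaluated in constant time.
    csum = [0]
    for v in y:
        csum.append(csum[-1] + v)
    total = csum[n]
    ll0 = bernoulli_loglik(total, n)
    best_pos, best_stat = None, float("-inf")
    for pos in range(min_size, n - min_size + 1):
        left = csum[pos]
        ll1 = bernoulli_loglik(left, pos) + bernoulli_loglik(total - left, n - pos)
        stat = 2.0 * (ll1 - ll0)
        if stat > best_stat:
            best_pos, best_stat = pos, stat
    return best_pos, best_stat

# Toy ordered sequence: low risk for the first 100 individuals, high risk after.
y = [1 if i % 10 == 9 else 0 for i in range(100)]
y += [1 if i % 2 == 0 else 0 for i in range(100)]
pos, stat = best_single_breakpoint(y)
```

Because `stat` is a maximum over many candidate splits, it does not follow the usual chi-square reference distribution; a test built on it would need its own calibration, for instance by permuting the outcomes.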

The data:

  • EPIC is a multi-centre European cohort of more than 500,000 individuals, recruited in the 1990s. This dataset contains 7,491 genotyped women (3,831 cases of breast cancer and 3,623 controls) with a range of clinical and environmental information on patients: for example, socio-economic status, height, weight, BMI, smoking status, alcohol consumption, eating habits (obtained from a questionnaire), menopausal status, and the use of hormonal treatment (for contraception or menopause).
  • UK Biobank is a prospective cohort of 488,377 British individuals, all genotyped. Recruitment took place from 2006 to 2010, for individuals aged 40 to 69 years. As of May 2018, 79,000 cases of cancer had been diagnosed. The main ones are melanoma, breast cancer, uterine cancer, prostate cancer and colon cancer. These data also contain a range of environmental and clinical information about these individuals: lifestyle, biological measurements, biomarkers in the blood and urine, images of the brain and heart, and repeated measures of physical activity.

The team:


  • Olivier Bouaziz, associate professor (maître de conférences), Université de Paris, MAP5 laboratory. Principal investigator of the project.
  • Grégory Nuel, senior CNRS researcher of the Institute of Mathematics (INSMI), Laboratory of Probability, Statistics and Modeling (LPSM), Sorbonne Université.
  • Vivian Viallon, associate professor (maître de conférences), Université Claude Bernard, Lyon, currently on leave at the International Agency for Research on Cancer (IARC).

Call for a post-doc

We are looking for a two-year post-doc to work on this project. The position can start at any time from January 2020 until September 2020 and will end in September 2022 at the latest. The applicant should have strong computational skills and, typically, be familiar with the EM algorithm, constrained Hidden Markov Models and time-to-event analysis. The method will preferably be implemented in R. It will be applied to the EPIC and UK Biobank cohorts described above. More information is available here.