The ability to simulate realistic data is an integral part of applied statistics. Simulated data are frequently used to develop and compare analysis strategies, because the “truth” is both known and controllable in simulated data, whereas it may be neither in real data. Simulated data must resemble the real data closely enough for the purpose the analyst has in mind. Moreover, developing a simulation model involves extensive exploratory data analysis, which often reveals interesting patterns and structure in the data, some of which may be novel and of scientific interest.
Current methods for simulating bisulfite sequencing data are designed for, and adequate for, comparing read-mapping software, but not for many other important purposes, such as developing methods for identifying differential methylation. This is because current simulation software ignores the complex dependence structure of DNA methylation, such as the strong spatial correlations of CpG methylation and the multiple haplotypes present in a sample of cells.
We have performed extensive exploratory data analyses on 20 whole genome bisulfite sequencing datasets in an attempt to elucidate this complex structure, including the spatial correlations of CpG methylation (at the level of both individual reads and aggregated beta values), the effect of genomic context on CpG methylation, and the degree of epipolymorphism. We are using this knowledge to develop a way of simulating realistic whole genome bisulfite sequencing data that is suitable for developing, testing and comparing methods for identifying differential methylation.
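
To make the quantities named above concrete, the following is a minimal illustrative sketch and not the simulation method under development. It assumes a toy model in which each read's CpG states follow a first-order Markov chain, so that neighbouring CpGs are correlated, and it takes epipolymorphism to be one minus the sum of squared methylation-pattern frequencies, a common definition that is assumed here. All function names and parameters (`simulate_read`, `p_meth`, `p_stay`) are hypothetical.

```python
import random
from collections import Counter


def simulate_read(n_cpgs, p_meth, p_stay, rng):
    """Simulate one read's CpG states (1 = methylated, 0 = unmethylated).

    Toy assumption: with probability p_stay a CpG copies its neighbour's
    state; otherwise its state is redrawn with methylation probability
    p_meth. This keeps the marginal methylation level at p_meth while
    inducing a correlation of p_stay between adjacent CpGs.
    """
    state = 1 if rng.random() < p_meth else 0
    read = [state]
    for _ in range(n_cpgs - 1):
        if rng.random() >= p_stay:
            state = 1 if rng.random() < p_meth else 0
        read.append(state)
    return read


def beta_values(reads):
    """Aggregate reads into per-CpG beta values (fraction methylated)."""
    n_reads = len(reads)
    return [sum(column) / n_reads for column in zip(*reads)]


def epipolymorphism(reads):
    """One minus the sum of squared frequencies of the observed patterns."""
    n_reads = len(reads)
    counts = Counter(tuple(read) for read in reads)
    return 1.0 - sum((c / n_reads) ** 2 for c in counts.values())


# Example: 1000 reads covering 4 CpGs, with strong neighbour dependence.
rng = random.Random(1)
reads = [simulate_read(4, p_meth=0.7, p_stay=0.8, rng=rng) for _ in range(1000)]
print("beta values:", beta_values(reads))
print("epipolymorphism:", epipolymorphism(reads))
```

Setting `p_stay` to 0 gives independent CpGs, which is the kind of simplification that makes simulated data unsuitable for evaluating differential methylation methods. Comparing the two settings shows how dependence changes read-level patterns even when the beta values are similar.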