In order to determine a set of peaks supported by an ensemble of peak calling techniques it is necessary to compare the different tools based on the co-localization of their predicted peaks. Positional similarity between peaks is traditionally based on a minimum of 1bp overlap (Hogart, Lichtenberg et al. , Kruczyk, Umer et al.). In the context of the SigSeeker peak integration, we allow for user-defined or "variable" overlap thresholds larger than 1bp.
To support positional integration of peaks, the set of all overlapping peaks between a given set of tools is determined. By filtering this set of overlapping peak fragments it is possible to generate an optimized peak set supported by a minimum number of tools in the ensemble with enforced overlap between the peaks corresponding to a minimum width of the retained peaks. Each optimized peak is assigned an intensity score equaling the mean of the normalized scores from the original overlapping peaks. While it is necessary to consider peak similarity from a positional perspective, it can be observed that peak prediction tools often generate co-localized peaks that differ greatly in their respective original intensities (Figure 1). Using the peak scores of overlapping peaks, a regression model is established that can be used to identify and remove peaks, which are co-located but have significantly different intensity scores.
Fig. 1. Intensities of co-localized peaks predicted by SWEMBL and SICER for NRSF binding in K562 cells. The strong difference in the intensity profiles of the co-localized peaks is highlighted by a low Pearson correlation coefficient. Outlier removal based on linear regression may have a negative impact on the correlation between the two variable (A), while an orthogonal regression (here the bivariate Deming regression) leads to an improvement in the Pearson correlation coefficient (B).
For each of the n overlapping peaks, predicted by p different approaches to be located at the same position within the genome, the set of intensity scores from each of these approaches is compiled. A regression model can be established by choosing the scores of one tool as the dependent variable Y and the scores of the remaining tools as independent variables X. An unknown parameter β is then established to minimize the distance between the dependent and independent variables, resulting in Y≈f(X,β). Linear regression models operate under the assumption that expected values can be mapped to observed values under the assumption of a certain error for the expected values. While this is appropriate for the generation of clinical or experimental models where Y contains observed values, each peak calling result can contain errors and a comprehensive analysis would require the results for each tool to be analyzed as the dependent variable. Using perpendicular least squares instead of the traditionally vertical least squares for the regression's minimization problem introduces the handling of errors for dependent and independent variables at the same time (Figure 2). In terms of the regression model this results in the representation of available data (yi,xi) via "true" values (yi*,xi*) and the independent errors εi and ηi:
This concept is used for the Deming regression in a bivariate space, which then finds the "best fit" for y*=β0+β1x*. This best fit is determined via minimizing the weighted sum of squared residuals, using the variances of the two independent errors σε and ση:
The parameters of the model enable a characterization of correlation between the different tools used to build it and allow highlighting of peaks that are outliers from the model. The identification of these outliers provides the foundation to prune the set of predicted peaks by removing additional false positive predictions. This quantitative filtering is of particular importance for peaks that have a less than complete overlap as these may represent different epigenetic marks.
Fig. 2. Example for the error concepts used for linear and Deming regression. The use of perpendicular deviation measures between observed and expected values allows for error in both X and Y.
Using the regression model it is possible to identify peaks that are co-localized (positional filter) but do not correlate well between the different approaches in terms of their associated intensity (quantitative filter) and could thus represent false positive predictions, most likely in form of peaks that are only narrowly co-located (glancing). By examining the residuals associated with the model, it is possible to group the set of peaks into those relatively close to the regression curve and those clearly remove from it. We choose the upper and lower quantiles (5% and 95%) as outliers and remove them from the peak lists (Figure 3). The application of this approach ensures that not only peaks with strong scores across all tools are retained but also those that are detected by multiple approaches with lower but still relevant intensities.
Fig. 3. Visualization of the orthogonal regression and quantile based filtering for peak predictions of MACS, SWEMBLE and SICER for NRSF binding in K562 cells (Rye et al. 2011).
Hogart, A., J. Lichtenberg, S. S. Ajay, S. M. Anderson, E. H. Margulies and D. M. Bodine (2012). "Genome-Wide DNA Methylation Profiles in Hematopoietic Stem and Progenitor Cells Reveal Over-Representation of ETS Transcription Factor Binding Sites." Genome Research.
Kruczyk, M., H. M. Umer, S. Enroth and J. Komorowski (2013). "Peak Finder Metaserver - a novel application for finding peaks in ChIP-seq data." Bmc Bioinformatics 14(1): 280.
Rye M. B., Saetrom P. and Drablos F. (2011). "A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs." Nucleic Acids Research 39(4): e25-e25.