Cluster validation to depict population genetic structure
Abstract
Identifying the number of underlying groups in a population is one of the oldest questions in statistics, and geneticists pose it when studying the structure formed by similarities among individuals of one or more populations. Numerous indices have been proposed to obtain the optimal number of groups that make up the population genetic structure (PGS). However, there is no consensus on which of them perform best. To determine the optimal number of groups constituting the PGS, a simulation study was conducted on nine PGS scenarios, combining three numbers of subpopulations (k = 2, 5, and 10) with three levels of genetic differentiation and recreating various maize genomes. Four internal validation indices were evaluated: Caliński-Harabasz (CH), Connectivity, Dunn, and Silhouette. The Dunn and Silhouette indices performed best in identifying the true number of underlying groups, while Connectivity performed worst. This study offers a robust alternative for revealing the existing PGS, thereby facilitating population studies and breeding strategies in maize programs. Moreover, the present findings may have implications for other crop species.
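To illustrate how an internal validation index can point to a number of groups, the following is a minimal sketch and not the code used in the study. It assumes toy two-dimensional points in place of marker data, Euclidean distance, and a simple average-linkage clustering; the function names (`average_linkage`, `dunn`, `silhouette`, `best_k`) are hypothetical and introduced only for this example. The Dunn index is computed as the smallest between-group distance divided by the largest within-group diameter, and the Silhouette as the mean of (b - a) / max(a, b) over individuals.

```python
import math
from itertools import combinations

def average_linkage(points, k):
    """Agglomerative clustering: merge the two closest clusters
    (by average pairwise distance) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = sum(math.dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

def dunn(points, clusters):
    """Smallest between-cluster distance over largest cluster diameter."""
    min_sep = min(math.dist(points[i], points[j])
                  for a, b in combinations(clusters, 2) for i in a for j in b)
    max_diam = max((math.dist(points[i], points[j])
                    for c in clusters for i, j in combinations(c, 2)),
                   default=0.0)
    return min_sep / max_diam if max_diam > 0 else math.inf

def silhouette(points, clusters):
    """Mean silhouette width; singletons are scored 0 by convention."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            a = sum(math.dist(points[i], points[j])
                    for j in c if j != i) / (len(c) - 1)
            b = min(sum(math.dist(points[i], points[j]) for j in o) / len(o)
                    for o in clusters if o is not c)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def best_k(points, index, k_values):
    """Return the k that maximizes the index, plus all scores."""
    scores = {k: index(points, average_linkage(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores

# Toy data (an assumption for illustration): three tight groups of four points.
centers = [(0, 0), (10, 0), (0, 10)]
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
points = [(cx + ox, cy + oy) for cx, cy in centers for ox, oy in offsets]

for name, index in [("Dunn", dunn), ("Silhouette", silhouette)]:
    k, _ = best_k(points, index, range(2, 7))
    print(name, "selects k =", k)
```

On this well-separated toy data both indices are maximized at the three groups that were built in. With real genotype data the distance measure, the clustering algorithm, and the range of candidate k values would all need to be chosen for the problem at hand.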
Article Details
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
References
Balzarini, M., Teich, I., Bruno, C. and Peña, A. (2011). Making genetic biodiversity measurable: A review of statistical multivariate methods to study variability at gene level. Revista de la Facultad de Ciencias Agrarias, 43(1), 261-275. http://www.scielo.org.ar/pdf/refca/v43n1/v43n1a20.pdf
Brock, G., Pihur, V., Datta, S. and Datta, S. (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25(4), 1-22. https://doi.org/10.18637/jss.v025.i04
Bruno, C., Balzarini, M. and Di Rienzo, J. (2003). Comparación de medidas de distancias entre perfiles RAPD. Journal of Basic & Applied Genetics, 15(1), 29-32. https://www.researchgate.net/publication/283569265
Caliński, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods, 3(1), 1-27. https://doi.org/10.1080/03610927408827101
Charrad, M., Ghazzali, N., Boiteau, V. and Niknafs, A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36. https://doi.org/10.18637/jss.v061.i06
Córdoba, M., Paccioretti, P. A., Giannini Kurina, F., Bruno, C. I. and Balzarini, M. G. (2019). Guía para el análisis de datos espaciales en agricultura. Serie Estadística Aplicada. http://hdl.handle.net/11336/128391
Dudoit, S. and Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7), 1-21. https://doi.org/10.1186/gb-2002-3-7-research0036
Dunn, J. C. (1974). Well-separated clusters and optimal fuzzy partitions. Journal of cybernetics, 4(1), 95-104. https://doi.org/10.1080/01969727408546059
Eick, C. F., Vaezian, B., Jiang, D. and Wang, J. (2006). Discovery of Interesting Regions in Spatial Data Sets Using Supervised Clustering. In J. Fürnkranz, T. Scheffer and M. Spiliopoulou (Eds.), Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science (pp. 127-138). Springer, Berlin, Heidelberg. https://doi.org/10.1007/11871637_16
Esfandyari, H. and Sørensen, A. C. (2019). xbreed: An R package for genomic simulation of purebred and crossbred populations. https://cran.microsoft.com/snapshot/2020-04-05/web/packages/xbreed/vignettes/xbreedvignette.pdf
Evanno, G., Regnaut, S. and Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: A simulation study. Molecular Ecology, 14(8), 2611-2620. https://doi.org/10.1111/j.1365-294X.2005.02553.x
Excoffier, L., Smouse, P. E. and Quattro, J. M. (1992). Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics, 131(2), 479-491. https://doi.org/10.3354/meps198283
Frichot, E. and François, O. (2015). LEA: An R package for landscape and ecological association studies. Methods in Ecology and Evolution, 6(8), 925-929. https://doi.org/10.1111/2041-210X.12382
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857-871. https://doi.org/10.2307/2528823
Halkidi, M. and Iordanis, K. (2011). Online clustering of distributed streaming data using belief propagation techniques. In 2011 IEEE 12th International Conference on Mobile Data Management, 216-225. Lulea, Sweden. https://doi.org/10.1109/MDM.2011.63
Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000). Quality scheme assessment in the clustering process. In D. A. Zighed, J. Komorowski and J. Żytkow (Eds.), Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science, 1910 (pp. 265-276). https://doi.org/10.1007/3-540-45372-5_26
Handl, J. and Knowles, J. (2005). Exploiting the Trade-off — The Benefits of Multiple Objectives in Data Clustering. In C. A. Coello Coello, A. H. Aguirre and E. Zitzler (Eds.), Evolutionary Multi-Criterion Optimization. EMO 2005. Lecture Notes in Computer Science, 3410 (pp. 547-560). Springer. https://doi.org/10.1007/978-3-540-31880-4_38
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc.
Jombart, T., Devillard, S. and Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 11(1), 1-15. https://doi.org/10.1186/1471-2156-11-94
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc.
Latch, E. K., Dharmarajan, G., Glaubitz, J. C. and Rhodes, O. E. (2006). Relative performance of Bayesian clustering software for inferring population substructure and individual assignment at low levels of population differentiation. Conservation Genetics, 7(2), 295-302. https://doi.org/10.1007/s10592-005-9098-1
Lawson, D. J., van Dorp, L. and Falush, D. (2018). A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nature Communications, 9(1), 1-11. https://doi.org/10.1038/s41467-018-05257-7
Lee, E. A. and Tracy, W. F. (2009). Modern maize breeding. Handbook of Maize, 141-160. https://doi.org/10.1007/978-0-387-77863-1_7
Odong, T. L., van Heerwaarden, J., Jansen, J., van Hintum, T. J. L. and Van Eeuwijk, F. A. (2011). Determination of genetic structure of germplasm collections: Are traditional hierarchical clustering methods appropriate for molecular marker data? Theoretical and Applied Genetics, 123(2), 195-205. https://doi.org/10.1007/s00122-011-1576-x
Peña-Malavera, A., Bruno, C., Fernandez, E. and Balzarini, M. (2014). Comparison of algorithms to infer genetic population structure from unlinked molecular markers. Statistical Applications in Genetics and Molecular Biology, 13(4), 391-402. https://doi.org/10.1515/sagmb-2013-0006
Pritchard, J., Stephens, M. and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945-959. https://doi.org/10.1093/genetics/155.2.945
Rezaee, M. R., Lelieveldt, B. P. and Reiber, J. H. C. (1998). A new cluster validity index for the fuzzy c-mean. Pattern Recognition Letters, 19(3-4), 237-246. https://doi.org/10.1016/S0167-8655(97)00168-2
Salvador, S. and Chan, P. (2004). Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. 16th IEEE International Conference on Tools with Artificial Intelligence. IEEE, 576-584. Copenhagen, Denmark. https://doi.org/10.1109/ICTAI.2004.50
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423. https://doi.org/10.1111/1467-9868.00293
Videla, M. E. (2021). Evaluación de algoritmos de agrupamientos para inferir estructura genética poblacional en datos genómicos. Published master's thesis. Universidad Nacional de Córdoba, Córdoba, Argentina. http://hdl.handle.net/11086/20184