skip to main content
10.1145/3307339.3342144acmconferencesArticle/Chapter ViewAbstractPublication PagesbcbConference Proceedingsconference-collections
research-article
Open access

Practical Universal k-mer Sets for Minimizer Schemes

Published: 04 September 2019 Publication History

Abstract

Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same. One way to find k-mer orderings for minimizer schemes is through the use of universal k-mer sets, which are subsets of k-mers that are guaranteed to cover all windows. The smaller this set the fewer false positives (where two poorly aligned sequences are labeled as possible matches) are identified. Current methods for creating universal k-mer sets are limited in the length of the k-mer that can be considered, and cannot compute sets in the range of lengths currently used in practice. We take some of the first steps in creating universal k-mer sets that can be used to construct minimizer orders for large values of k that are practical. We do this using iterative extension of the k-mers in a set, and guided contraction of the set itself. We also show that this process will be guaranteed to never increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives over using the current sets on small k-mers.

References

[1]
Rayan Chikhi, Antoine Limasset, and Paul Medvedev. 2016. Compacting de Bruijn Graphs from Sequencing Data Quickly and in Low Memory. Bioinformatics, Vol. 32, 12 (June 2016), i201--i208.
[2]
Nicolaas Govert de Bruijn. 1946. A combinatorial problem. Koninklijke Nederlandse Akademie V. Wetenschappen, Vol. 49, 7 (1946), 758--764.
[3]
Rene De La Briandais. 1959. File Searching Using Variable Length Keys. In Papers Presented at the the March 3--5, 1959, Western Joint Computer Conference (IRE-AIEE-ACM '59 (Western) ). ACM, New York, NY, USA, 295--298.
[4]
Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. 2015. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, Vol. 31, 10 (May 2015), 1569--1576.
[5]
Marius Erbert, Steffen Rechner, and Matthias Müller-Hannemann. 2017. Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology, Vol. 12 (March 2017), 9.
[6]
Szymon Grabowski and Marcin Raniszewski. 2015. Sampling the suffix array with minimizers. In String Processing and Information Retrieval, Costas Iliopoulos, Simon Puglisi, and Emine Yilmaz (Eds.). Number 9309 in Lecture Notes in Computer Science. Springer International Publishing, Cham, 287--298.
[7]
Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy. 2017. A fast approximate algorithm for mapping long reads to large reference databases. In Research in Computational Molecular Biology (Lecture Notes in Computer Science ), S. Cenk Sahinalp (Ed.). Springer International Publishing, Cham, 66--81.
[8]
Jolanta Kawulok and Sebastian Deorowicz. 2015. CoMeta: classification of metagenomes using k-mers. PLOS ONE, Vol. 10, 4 (April 2015), e0121453.
[9]
Heng Li. 2016. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, Vol. 32, 14 (2016), 2103--2110.
[10]
Yang Li and XifengYan. 2015. MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting. (May 2015). http://arxiv.org/abs/1505.06550 arXiv: 1505.06550.
[11]
Guillaume Marc cais, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford. 2017. Improving the performance of minimizers and winnowing schemes. Bioinformatics, Vol. 33, 14 (July 2017), i110--i117.
[12]
Guillaume Marc cais, Brad Solomon, Rob Patro, and Carl Kingsford. 2019. Sketching and Sublinear Data Structures in Genomics. Annual Review of Biomedical Data Science, Vol. 2, 1 (2019), in press.
[13]
Guillaume Marccais, Dan DeBlasio, and Carl Kingsford. 2018. Asymptotically optimal minimizers schemes. Bioinformatics, Vol. 34, 13 (June 2018), i13--i22.
[14]
Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. 2016. Mash: fast genome and metagenome distance estimation using MinHash . Genome Biology, Vol. 17, 1 (2016), 132.
[15]
Yaron Orenstein, David Pellow, Guillaume Marc cais, Ron Shamir, and Carl Kingsford. 2016. Compact universal k-mer hitting sets. In Algorithms in Bioinformatics (Lecture Notes in Computer Science ). Springer, Cham, 257--268.
[16]
Yaron Orenstein, David Pellow, Guillaume Marccais, Ron Shamir, and Carl Kingsford. 2017. Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLOS Computational Biology, Vol. 13, 10 (Oct. 2017), 1--15.
[17]
Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. 2004 a. Reducing storage requirements for biological sequence comparison. Bioinformatics, Vol. 20, 18 (Dec. 2004), 3363--3369.
[18]
Michael Roberts, Brian R. Hunt, James A. Yorke, Randall A. Bolanos, and Arthur L. Delcher. 2004 b. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, Vol. 11, 4 (Aug. 2004), 734--752.
[19]
Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. 2003. Winnowing: Local Algorithms for Document Fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03). ACM, New York, NY, USA, 76--85.
[20]
Valerie A. Schneider, Tina Graves-Lindsay, Kerstin Howe, Nathan Bouk, Hsiu-Chuan Chen, Paul A. Kitts, Terence D. Murphy, Kim D. Pruitt, Franc coise Thibaud-Nissen, Derek Albracht, Robert S. Fulton, Milinn Kremitzki, Vincent Magrini, Chris Markovic, Sean McGrath, Karyn Meltz Steinberg, Kate Auger, William Chow, Joanna Collins, Glenn Harden, Timothy Hubbard, Sarah Pelan, Jared T. Simpson, Glen Threadgold, James Torrance, Jonathan M. Wood, Laura Clarke, Sergey Koren, Matthew Boitano, Paul Peluso, Heng Li, Chen-Shan Chin, Adam M. Phillippy, Richard Durbin, Richard K. Wilson, Paul Flicek, Evan E. Eichler, and Deanna M. Church. 2017. Evaluation of GRCh38 and de Novo Haploid Genome Assemblies Demonstrates the Enduring Quality of the Reference Assembly. Genome Research, Vol. 27, 5 (Jan. 2017), 849--864.
[21]
Derrick E. Wood and Steven L. Salzberg. 2014. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, Vol. 15, 3 (March 2014), R46.
[22]
Chengxi Ye, Zhanshan S. Ma, Charles H. Cannon, Mihai Pop, and Douglas W. Yu. 2012. Exploiting sparseness in de novo genome assembly. BMC Bioinformatics, Vol. 13, Suppl 6 (April 2012), S1.

Cited By

View all
  • (2024)Sketching Methods with Small Window Guarantee Using Minimum Decycling SetsJournal of Computational Biology10.1089/cmb.2024.054431:7(597-615)Online publication date: 1-Jul-2024
  • (2024)Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching SchemeJournal of Computational Biology10.1089/cmb.2023.021231:1(2-20)Online publication date: 1-Jan-2024
  • (2023)Locality-sensitive bucketing functions for the edit distanceAlgorithms for Molecular Biology10.1186/s13015-023-00234-218:1Online publication date: 24-Jul-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
September 2019
716 pages
ISBN:9781450366663
DOI:10.1145/3307339
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2019

Permissions

Request permissions for this article.

Check for updates

Badges

  • Best Student Paper

Author Tags

  1. genomics
  2. k-mer
  3. minimizer schemes
  4. universal sets

Qualifiers

  • Research-article

Funding Sources

Conference

BCB '19
Sponsor:

Acceptance Rates

BCB '19 Paper Acceptance Rate 42 of 157 submissions, 27%;
Overall Acceptance Rate 254 of 885 submissions, 29%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)125
  • Downloads (Last 6 weeks)23
Reflects downloads up to 10 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Sketching Methods with Small Window Guarantee Using Minimum Decycling SetsJournal of Computational Biology10.1089/cmb.2024.054431:7(597-615)Online publication date: 1-Jul-2024
  • (2024)Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching SchemeJournal of Computational Biology10.1089/cmb.2023.021231:1(2-20)Online publication date: 1-Jan-2024
  • (2023)Locality-sensitive bucketing functions for the edit distanceAlgorithms for Molecular Biology10.1186/s13015-023-00234-218:1Online publication date: 24-Jul-2023
  • (2023)Bidirectional String Anchors for Improved Text Indexing and Top-$K$ Similarity SearchIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.323178035:11(11093-11111)Online publication date: 1-Nov-2023
  • (2023)Creating and Using Minimizer Sketches in Computational GenomicsJournal of Computational Biology10.1089/cmb.2023.009430:12(1251-1276)Online publication date: 1-Dec-2023
  • (2022)Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizerJournal of Computational Biology10.1089/cmb.2022.0275Online publication date: 12-Sep-2022
  • (2022)DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer SchemesResearch in Computational Molecular Biology10.1007/978-3-031-04749-7_4(52-69)Online publication date: 29-Apr-2022
  • (2020)Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path LengthJournal of Computational Biology10.1089/cmb.2020.0432Online publication date: 15-Dec-2020
  • (2020)A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting SetsResearch in Computational Molecular Biology10.1007/978-3-030-45257-5_3(37-53)Online publication date: 21-Apr-2020
  • (2020)Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path LengthResearch in Computational Molecular Biology10.1007/978-3-030-45257-5_13(202-217)Online publication date: 21-Apr-2020

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media