A specialized data structure for counting short DNA sequences for use in bioinformatics.

andrewdalpino/DNAHash

DNA Hash

A Python library for counting short DNA sequences for use in bioinformatics. DNA Hash stores k-mer counts keyed by their up2bit encoding, a two-way hash that works with variable-length sequences. As a result, DNA Hash uses considerably less memory than a lookup table that stores sequences in plaintext. In addition, DNA Hash's novel autoscaling Bloom filter eliminates the need to explicitly store counts for sequences that have been seen only once.

  • Ultra-low memory footprint
  • Embarrassingly parallelizable
  • Open-source and free to use commercially

Note: The maximum sequence length is platform-dependent: 31 bases on a 64-bit machine and 15 bases on a 32-bit machine.
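To make the length limit concrete, here is a minimal sketch of how a two-way, variable-length 2-bit encoding could work. It is an assumption for illustration, not DNA Hash's actual up2bit implementation: the names `encode` and `decode`, the base-to-bits mapping, and the leading sentinel bit are all hypothetical. Under this scheme a sequence of k bases needs 2k + 1 bits, so 31 bases fit in a signed 64-bit integer and 15 bases fit in a signed 32-bit integer, which is consistent with the limits stated above.

```python
# Hypothetical sketch of a reversible 2-bit encoding.
# This is NOT taken from the DNA Hash source; names and layout are assumptions.

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(sequence: str) -> int:
    """Pack a DNA sequence into an integer using 2 bits per base."""
    # Start from a sentinel 1 bit so that leading 'A's (00) are not lost
    # and sequences of different lengths never share a code.
    code = 1
    for base in sequence:
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode(code: int) -> str:
    """Recover the original sequence from its integer code."""
    bases = []
    # Peel off 2 bits at a time until only the sentinel bit remains.
    while code > 1:
        bases.append(BITS_TO_BASE[code & 3])
        code >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    code = encode("GATTACA")
    print(code, decode(code))
    # A 31-base sequence occupies 2 * 31 + 1 = 63 bits.
    print(encode("T" * 31).bit_length())
```

The sentinel bit is what makes the encoding work for variable-length sequences in this sketch: without it, "A" and "AA" would both encode to zero.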

Note: Due to the probabilistic nature of the Bloom filter, DNA Hash may overcount sequences at a bounded rate.
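The sketch below illustrates why a Bloom filter can stand in for explicit singleton counts, and where the overcounting comes from. It is a simplified illustration under stated assumptions, not DNA Hash's implementation: the class names `BloomFilter` and `SingletonFilteredCounter` are hypothetical, the filter has a fixed size rather than autoscaling, and keys are plain strings rather than encoded integers. A sequence is recorded only in the filter on first sight and gets an explicit count starting at 2 on second sight. If the filter reports a false positive for a sequence seen for the first time, that sequence is counted as 2 instead of 1.

```python
# Hypothetical sketch: a fixed-size Bloom filter in front of a count table.
# NOT the DNA Hash implementation; the real filter is described as autoscaling.
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive several bit positions from one digest via double hashing.
        digest = hashlib.blake2b(key.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(key)
        )


class SingletonFilteredCounter:
    def __init__(self):
        self.seen_once = BloomFilter()
        self.counts = {}  # explicit counts, only for sequences seen 2+ times

    def increment(self, sequence: str) -> None:
        if sequence in self.counts:
            self.counts[sequence] += 1
        elif sequence in self.seen_once:
            # Second sighting, or a false positive (the source of overcounts).
            self.counts[sequence] = 2
        else:
            self.seen_once.add(sequence)

    def count(self, sequence: str) -> int:
        if sequence in self.counts:
            return self.counts[sequence]
        return 1 if sequence in self.seen_once else 0


if __name__ == "__main__":
    counter = SingletonFilteredCounter()
    for kmer in ["ACG", "ACG", "ACG", "TTT"]:
        counter.increment(kmer)
    print(counter.count("ACG"), counter.count("TTT"), len(counter.counts))
```

Because a Bloom filter never produces false negatives, this scheme can only err upward, which matches the one-sided error described in the note above.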
