bam2tensor
bam2tensor is a Python package for converting .bam files to compact matrix representations of methylation data (saved as .npz files). It evaluates every CpG site and stores the methylation states so they can be loaded into downstream deep learning pipelines.
Features
Parses .bam files using pysam
Extracts methylation data from all CpG sites
Supports any genome (hg38, T2T-CHM13, mm10, etc.)
Stores data in sparse format (COO matrix) for efficient loading
Exports methylation data to .npz NumPy arrays
Easily parallelizable
Requirements
Python 3.9+
pysam, numpy, scipy, tqdm
Installation
You can install bam2tensor via pip from PyPI:
pip install bam2tensor
Usage
Please see the Reference Guide for full details.
Data Structure
One .npz file is generated for each separate .bam file, and it can be loaded using scipy.sparse.load_npz(). Each .npz file contains a single sparse SciPy COO matrix.
In the COO matrix, each row represents a read and each column represents a CpG site. The value at each row/column position is the methylation state (0 = unmethylated, 1 = methylated, -1 = no data). Note that -1 can represent indels or point mutations.
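As a rough illustration of the layout described above, the sketch below builds a small toy matrix, round-trips it through an .npz file, and computes the fraction of methylated calls per CpG site. The toy data, the file name, and the variable names are hypothetical. The sketch also assumes that unmethylated calls (0) are stored as explicit entries rather than left implicit, which is not confirmed here; check this against real output before relying on it.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Toy stand-in for a bam2tensor output file: 3 reads (rows) x 4 CpG sites (columns).
# Encoding from the text above: 0 = unmethylated, 1 = methylated, -1 = no data.
# Assumption: every observed call, including 0, is an explicitly stored entry.
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([0, 1, 1, 2, 2, 3])
vals = np.array([1, 0, 1, -1, 1, 1], dtype=np.int8)
toy = sp.coo_matrix((vals, (rows, cols)), shape=(3, 4))

# Round-trip through an .npz file, as one would with a real output file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npz")  # hypothetical file name
    sp.save_npz(path, toy)
    matrix = sp.load_npz(path).tocoo()

# Count methylated and unmethylated calls per CpG site (column).
n_sites = matrix.shape[1]
methylated = np.bincount(matrix.col[matrix.data == 1], minlength=n_sites)
unmethylated = np.bincount(matrix.col[matrix.data == 0], minlength=n_sites)
covered = methylated + unmethylated

# Fraction of methylated calls per site; NaN where no read has a usable call.
fraction = np.full(n_sites, np.nan)
np.divide(methylated, covered, out=fraction, where=covered > 0)
```

Entries equal to -1 are excluded from both counts, so a site covered only by reads with no usable call keeps a NaN fraction instead of being reported as unmethylated.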
Todo
Consider storing a mapping from read ID to row ID?
Store and import the embedding mapping in a more stable format (.npz or other, instead of .json)?
Store metadata / object reference in .npz file?
Explore using Xarray or Sparse?
Contributing
Contributions are welcome! Please see the Contributor Guide.
License
Distributed under the terms of the MIT license, bam2tensor is free and open source.
Issues
If you encounter any problems, please file an issue along with a detailed description.
Credits
This project is developed and maintained by Nick Semenkovich (@semenko), as part of the Medical College of Wisconsin’s Data Science Institute.
This project was generated from Statistics Norway’s SSB PyPI Template.