Reference

bam2tensor package

bam2tensor.embedding module

Genome methylation embedding for CpG site indexing and coordinate conversion.

This module provides the GenomeMethylationEmbedding class, which manages the mapping between genomic coordinates and matrix column indices for methylation data. It parses reference genome FASTA files to identify all CpG sites and provides efficient bidirectional lookup between:

  • Genomic coordinates (chromosome, position)

  • Matrix embedding indices (column numbers in the sparse output matrix)

The CpG site index is cached to disk as a gzipped JSON file, allowing fast subsequent runs on the same reference genome.

Example

Create an embedding for the human genome:

from bam2tensor.embedding import GenomeMethylationEmbedding

embedding = GenomeMethylationEmbedding(
    genome_name="hg38",
    expected_chromosomes=["chr" + str(i) for i in range(1, 23)] + ["chrX", "chrY"],
    fasta_source="/path/to/GRCh38.fa",
    verbose=True,
)

# Get total CpG count
print(f"Total CpG sites: {embedding.total_cpg_sites:,}")

# Convert between coordinate systems
chrom, pos = embedding.embedding_to_genomic_position(12345)
idx = embedding.genomic_position_to_embedding("chr1", 10525)
CACHE_FILE_SUFFIX

The suffix appended to genome_name for cache files.

class GenomeMethylationEmbedding(genome_name, expected_chromosomes, fasta_source, window_size=150, skip_cache=False, verbose=False)

Bases: object

Manages CpG site positions and coordinate conversions for a reference genome.

This class parses a reference genome FASTA file to identify all CpG sites (positions where cytosine is followed by guanine), creating a mapping between genomic coordinates and linear embedding indices. This mapping is essential for interpreting the sparse matrices produced by bam2tensor.

The embedding assigns each CpG site a unique integer index, ordered by: 1. Chromosome order (as specified in expected_chromosomes) 2. Genomic position within each chromosome

CpG site data is cached to disk after the first parse, significantly speeding up subsequent runs on the same genome.

Parameters:
  • genome_name (str)

  • expected_chromosomes (list[str])

  • fasta_source (str)

  • window_size (int)

  • skip_cache (bool)

  • verbose (bool)

genome_name

Identifier for this genome (used for cache file naming).

fasta_source

Path to the reference FASTA file.

expected_chromosomes

List of chromosome names to include.

verbose

Whether to print progress information.

cpg_sites_dict

Dictionary mapping chromosome names to lists of CpG positions (1-based coordinates).

total_cpg_sites

Total number of CpG sites across all chromosomes.

cache_file

Path to the cache file for this genome.

Example

>>> # xdoctest: +SKIP
>>> embedding = GenomeMethylationEmbedding(
...     genome_name="hg38",
...     expected_chromosomes=["chr1", "chr2"],
...     fasta_source="reference.fa",
... )
>>> embedding.total_cpg_sites
2847432
>>> embedding.embedding_to_genomic_position(0)
('chr1', 10525)
>>> embedding.genomic_position_to_embedding("chr1", 10525)
0
embedding_to_genomic_position(embedding)

Convert a matrix column index to genomic coordinates.

Given an embedding index (column number in the sparse matrix), returns the corresponding chromosome name and 1-based genomic position.

This is the inverse of genomic_position_to_embedding().

Parameters:

embedding (int | int64) – The matrix column index (0-based). Must be in the range [0, total_cpg_sites).

Returns:

  • chromosome is the chromosome name (e.g., “chr1”)

  • position is the 1-based genomic coordinate of the CpG site

Return type:

A tuple of (chromosome, position) where

Raises:

AssertionError – If embedding is out of range.

Example

>>> # xdoctest: +SKIP
>>> embedding = GenomeMethylationEmbedding(...)
>>> chrom, pos = embedding.embedding_to_genomic_position(12345)
>>> print(f"Column 12345 is {chrom}:{pos}")
Column 12345 is chr1:987654
genomic_position_to_embedding(chrom, pos)

Convert genomic coordinates to a matrix column index.

Given a chromosome name and 1-based genomic position, returns the corresponding matrix column index (embedding position).

This is the inverse of embedding_to_genomic_position().

Parameters:
  • chrom (str) – The chromosome name (e.g., “chr1”). Must be one of the chromosomes in expected_chromosomes.

  • pos (int) – The 1-based genomic position of the CpG site. Must be a valid CpG position in the specified chromosome.

Return type:

int

Returns:

The matrix column index (0-based) corresponding to this CpG site.

Example

>>> # xdoctest: +SKIP
>>> embedding = GenomeMethylationEmbedding(...)
>>> col_idx = embedding.genomic_position_to_embedding("chr1", 10525)
>>> print(f"chr1:10525 is at column {col_idx}")
chr1:10525 is at column 0

Note

This method assumes the position is a known CpG site. Calling with a position that is not a CpG site will raise a KeyError.

load_embedding_cache()

Load CpG site index from a previously saved cache file.

Attempts to load the cache file “{genome_name}.cache.json.gz” and restore all CpG site data. If successful, this avoids the slow FASTA parsing step.

Return type:

bool

Returns:

True if the cache was successfully loaded.

Raises:

FileNotFoundError – If the cache file does not exist.

Note

After loading, the caller should verify that expected_chromosomes and window_size match the current configuration, as this method overwrites those attributes with cached values.

parse_fasta_for_cpg_sites()

Parse the reference FASTA to find all CpG sites.

Scans each chromosome in the reference genome FASTA file to identify all CpG dinucleotide positions (where C is immediately followed by G). Results are stored in self.cpg_sites_dict.

This operation is slow for large genomes (~10 minutes for GRCh38/hg38) but only needs to be done once per genome. Results should be cached using save_embedding_cache() for faster subsequent runs.

The method populates self.cpg_sites_dict with the structure:

{
    "chr1": [10525, 10542, 10563, ...],  # 1-based positions
    "chr2": [10182, 10193, 10218, ...],
    ...
}

Positions are 1-based to match standard genomic coordinate conventions (e.g., samtools faidx uses 1-based coordinates).

Only chromosomes listed in self.expected_chromosomes are processed; other sequences in the FASTA are skipped with a message (if verbose).

Raises:

FileNotFoundError – If the FASTA file cannot be read.

Return type:

None

Note

For the human genome (GRCh38), expect approximately 28 million CpG sites across all chromosomes.

Return type:

None

save_embedding_cache()

Save the CpG site index to a compressed cache file.

Serializes the CpG site positions and metadata to a gzipped JSON file. The cache file is named “{genome_name}.cache.json.gz” and contains: - genome_name: The genome identifier - fasta_source: Path to the original FASTA file - expected_chromosomes: List of included chromosomes - window_size: The window_size parameter (for compatibility checking) - cpg_sites_dict: Dictionary of chromosome -> list of CpG positions

The cache can be loaded by future GenomeMethylationEmbedding instances to avoid re-parsing the FASTA file.

Raises:

AssertionError – If cpg_sites_dict is empty (nothing to cache).

Return type:

None

Note

Cache files use compression level 3 for a balance between file size and write speed.

Return type:

None

bam2tensor.functions module

Core methylation extraction functions for bam2tensor.

This module contains the main function for extracting methylation data from BAM files: extract_methylation_data_from_bam(). This function reads aligned bisulfite-sequencing (BS-seq) and enzymatic methylation sequencing (EM-seq) reads and generates a sparse matrix representation of methylation states at each CpG site.

The extraction process:
  1. Iterates through each chromosome in the genome embedding

  2. For each aligned read that passes quality filters: a. Identifies which CpG sites fall within the read’s alignment b. Determines the bisulfite strand (forward or reverse) c. Extracts methylation state at each covered CpG

  3. Builds a sparse COO matrix with the results

Methylation State Encoding:
  • 1: Methylated (cytosine remained as C after bisulfite treatment)

  • 0: Unmethylated (cytosine converted to T by bisulfite treatment)

  • -1: No data (indel, SNV, or other non-C/T base at CpG position)

Supported Aligners:

The module supports BAM files from aligners that provide strand information: - Bismark: Uses the XM tag (Z=methylated CpG, z=unmethylated CpG) - Biscuit: Uses the YD tag (f=forward, r=reverse) - bwameth: Uses the YD tag (f=forward, r=reverse), identical to Biscuit - gem3/Blueprint: Uses the XB tag (C=forward, G=reverse)

Example

>>> # xdoctest: +SKIP
>>> from bam2tensor.embedding import GenomeMethylationEmbedding
>>> from bam2tensor.functions import extract_methylation_data_from_bam
>>>
>>> # Create genome embedding
>>> embedding = GenomeMethylationEmbedding(
...     genome_name="hg38",
...     expected_chromosomes=["chr1", "chr2"],
...     fasta_source="reference.fa",
... )
>>>
>>> # Extract methylation data
>>> matrix = extract_methylation_data_from_bam(
...     input_bam="sample.bam",
...     genome_methylation_embedding=embedding,
...     quality_limit=20,
... )
>>>
>>> print(f"Extracted {matrix.shape[0]} reads, {matrix.nnz} data points")
check_chromosome_overlap(bam_references, embedding_chromosomes)

Check that BAM and embedding chromosome names overlap.

Detects the common mistake where a BAM file uses one chromosome naming convention (e.g., 1, 2, 3 from Ensembl) while the reference embedding uses another (e.g., chr1, chr2, chr3 from UCSC). If no overlap is found but adding or stripping a chr prefix would create overlap, raises a clear error with guidance.

Parameters:
  • bam_references (tuple[str, ...] | list[str]) – Tuple or list of chromosome/contig names from the BAM file header (e.g., from pysam.AlignmentFile.references).

  • embedding_chromosomes (list[str]) – List of chromosome names from the genome embedding (i.e., the expected chromosomes).

Raises:

ValueError – If there is no overlap between BAM and embedding chromosome names, with a diagnostic message indicating whether a chr prefix mismatch is the likely cause.

Return type:

None

detect_aligner(input_bam, sample_size=1000)

Detect the aligner used to produce a BAM file by checking read tags.

Samples up to sample_size reads from the BAM file and checks for aligner-specific tags: XM (Bismark), YD (Biscuit/bwameth), or XB (gem3/Blueprint).

Parameters:
  • input_bam (str) – Path to an indexed BAM file.

  • sample_size (int) – Maximum number of reads to examine. Only primary, non-duplicate, non-QC-failed reads are considered.

Return type:

str

Returns:

A human-readable string identifying the detected aligner, such as "Bismark (XM tag)" or "Unknown".

extract_methylation_data_from_bam(input_bam, genome_methylation_embedding, quality_limit=20, verbose=False, debug=False)

Extract read-level CpG methylation data from a BAM file.

Parses a bisulfite-sequencing or EM-seq BAM file and extracts methylation states at each CpG site covered by aligned reads. Returns a sparse matrix where each row represents a unique read and each column represents a CpG site.

The function applies several filters to reads:
  • Mapping quality must be >= quality_limit (default 20)

  • Duplicate reads are excluded

  • QC-failed reads are excluded

  • Secondary and supplementary alignments are excluded

  • For Biscuit/bwameth/gem3: only parent-strand reads are processed

  • For Bismark: all reads are processed (XM tag has pre-resolved calls)

Two extraction paths are supported, detected automatically per-read:

Bismark path (XM tag present):
  1. Uses binary search to find CpG sites within the read

  2. Reads the XM methylation call string directly

  3. Records: Z=methylated(1), z=unmethylated(0), other CpG=-1

Biscuit/bwameth/gem3 path (YD or XB tag):
  1. Checks the strand tag to filter daughter-strand reads

  2. Uses binary search to find CpG sites within the read

  3. Extracts the base at each CpG position from the read sequence

  4. Records: C=methylated(1), T=unmethylated(0), other=-1

Parameters:
  • input_bam (str) – Path to the input BAM file. The file must be indexed (have a corresponding .bam.bai file).

  • genome_methylation_embedding (GenomeMethylationEmbedding) – A GenomeMethylationEmbedding object that defines the CpG site positions and provides coordinate conversion methods.

  • quality_limit (int) – Minimum mapping quality (MAPQ) threshold for reads. Reads with MAPQ below this value are skipped. Default is 20, which excludes reads mapping to multiple locations equally well.

  • verbose (bool) – If True, display a progress bar and print the total read count. Useful for monitoring progress on large files.

  • debug (bool) – If True, enable extensive validation and debug output. Prints per-read information and validates that each read is only processed once. Significantly slower.

Returns:

  • n_reads is the number of reads that passed filters and covered at least one CpG site

  • n_cpg_sites is genome_methylation_embedding.total_cpg_sites

  • Values are: 1 (methylated), 0 (unmethylated), -1 (no data)

The matrix uses COO format for efficient construction. Convert to CSR (tocsr()) for row slicing or CSC (tocsc()) for column slicing.

Return type:

A scipy.sparse.coo_matrix with shape (n_reads, n_cpg_sites) where

Raises:

FileNotFoundError – If the BAM file index (.bam.bai) is missing. The BAM file must be indexed with samtools index.

Example

>>> # xdoctest: +SKIP
>>> from bam2tensor.embedding import GenomeMethylationEmbedding
>>> from bam2tensor.functions import extract_methylation_data_from_bam
>>> import scipy.sparse
>>>
>>> # Create embedding (or load from cache)
>>> embedding = GenomeMethylationEmbedding(
...     genome_name="hg38",
...     expected_chromosomes=["chr1", "chr22"],
...     fasta_source="GRCh38.fa",
... )
>>>
>>> # Extract methylation data
>>> matrix = extract_methylation_data_from_bam(
...     input_bam="sample.bam",
...     genome_methylation_embedding=embedding,
...     quality_limit=30,  # Stricter quality filter
...     verbose=True,
... )
>>>
>>> # Analyze results
>>> print(f"Reads with CpG data: {matrix.shape[0]:,}")
>>> print(f"Total CpG sites: {matrix.shape[1]:,}")
>>> print(f"Data points: {matrix.nnz:,}")
>>>
>>> # Save to file
>>> scipy.sparse.save_npz("sample.methylation.npz", matrix)

Note

The function processes chromosomes in the order they appear in genome_methylation_embedding.cpg_sites_dict. Reads are numbered sequentially as they are encountered across all chromosomes.

See also

GenomeMethylationEmbedding: For creating the genome embedding. scipy.sparse.coo_matrix: Documentation on the output format.

bam2tensor.reference module

Reference genome download and caching utilities.

Provides functionality to download and cache reference genome FASTA files from well-known sources (UCSC, NCBI) for common genomes like hg38, hg19, mm10, and T2T-CHM13. This avoids the need for users to manually locate and download reference FASTAs.

Example

>>> # xdoctest: +SKIP
>>> from bam2tensor.reference import download_reference, list_available_genomes
>>>
>>> # See what genomes are available
>>> genomes = list_available_genomes()
>>> for name, info in genomes.items():
...     print(f"{name}: {info['description']}")
>>>
>>> # Download hg38
>>> fasta_path = download_reference("hg38", verbose=True)
>>> print(f"Reference saved to: {fasta_path}")
download_reference(genome_name, verbose=False)

Download and cache a reference genome FASTA file.

Downloads a gzipped FASTA file from a well-known source (UCSC or NCBI), decompresses it, and stores it in the bam2tensor cache directory. If the genome has already been downloaded, returns the path to the cached file without re-downloading.

Parameters:
  • genome_name (str) – The genome identifier to download. Must be one of the keys in KNOWN_GENOMES (e.g., “hg38”, “hg19”, “mm10”, “T2T-CHM13”).

  • verbose (bool) – If True, display download progress and status messages.

Return type:

Path

Returns:

Path to the decompressed reference FASTA file in the cache directory.

Raises:
  • ValueError – If genome_name is not a recognized genome in KNOWN_GENOMES.

  • OSError – If the download fails due to network issues or insufficient disk space.

Example

>>> # xdoctest: +SKIP
>>> fasta_path = download_reference("hg38", verbose=True)
Downloading Human GRCh38/hg38 (UCSC)...
>>> print(fasta_path)
/home/user/.cache/bam2tensor/hg38/hg38.fa
get_cache_dir()

Get the bam2tensor cache directory for downloaded references.

Returns the platform-appropriate cache directory, creating it if it does not already exist. Uses ~/.cache/bam2tensor by default, or respects the XDG_CACHE_HOME environment variable on Linux/macOS.

Return type:

Path

Returns:

Path to the bam2tensor cache directory. The directory is created if it does not exist.

Example

>>> # xdoctest: +SKIP
>>> cache_dir = get_cache_dir()
>>> print(cache_dir)
/home/user/.cache/bam2tensor
get_cached_reference(genome_name)

Check if a reference genome FASTA is already cached locally.

Looks for a previously downloaded and decompressed FASTA file in the bam2tensor cache directory.

Parameters:

genome_name (str) – The genome identifier (e.g., “hg38”, “mm10”). Must be a key in KNOWN_GENOMES.

Return type:

Path | None

Returns:

Path to the cached FASTA file if it exists, or None if the genome has not been downloaded yet.

Example

>>> # xdoctest: +SKIP
>>> path = get_cached_reference("hg38")
>>> if path:
...     print(f"Found cached reference: {path}")
... else:
...     print("Not cached yet")
list_available_genomes()

List all known genome references available for download.

Returns the dictionary of genomes that can be downloaded using download_reference(), along with their descriptions and URLs.

Return type:

dict[str, dict[str, str]]

Returns:

A dictionary mapping genome names to their metadata. Each value is a dict with keys ‘url’, ‘description’, and ‘expected_chromosomes’.

Example

>>> genomes = list_available_genomes()
>>> "hg38" in genomes
True
>>> "description" in genomes["hg38"]
True