Command-line Usage

🔹 1. Complete Pipeline

Required:

  • --mode Complete_Pipeline → run full analysis from raw FASTQ/FASTA files.

  • -f, --input <dir> → input directory containing .fastq.gz or .fasta.gz files.

  • -o, --output <dir> → output directory.

Optional: add --nanopore to enable nanopore mode.

strmie --mode Complete_Pipeline \
       -f /path/to/input_dir \
       -o /path/to/output_dir \
       [other options]
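
For batch runs it can be convenient to assemble this command programmatically. The sketch below builds the same argument list in Python using only flags documented in this section (--mode, -f, -o, --nanopore). The helper name build_strmie_command and the example paths are illustrative assumptions, not part of STRmie-HD.

```python
import shlex


def build_strmie_command(input_dir, output_dir, nanopore=False, extra=()):
    """Assemble a Complete_Pipeline command line as an argv list.

    Hypothetical helper: only the flags themselves come from the
    documentation; the function name and signature are made up here.
    """
    argv = [
        "strmie",
        "--mode", "Complete_Pipeline",
        "-f", str(input_dir),
        "-o", str(output_dir),
    ]
    if nanopore:
        argv.append("--nanopore")
    # Any further documented options can be passed through unchanged.
    argv.extend(str(a) for a in extra)
    return argv


cmd = build_strmie_command("/path/to/input_dir", "/path/to/output_dir", nanopore=True)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting problems when paths contain spaces.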

🔹 2. Index Calculation Only

Required:

  • --mode Index_Calculation → recalculate indices starting from a user-supplied Excel table with allele definitions.

  • -f, --input <dir> → input directory containing .fastq.gz or .fasta.gz files.

  • -o, --output <dir> → output directory.

  • -p, --path <file.xlsx> → Excel file with predefined alleles (Sample, CAG_Allele_1, CAG_Allele_2) for recalculation.

strmie --mode Index_Calculation \
       -f /path/to/input_dir \
       -o /path/to/output_dir \
       -p /path/to/CAG_data_for_recalculating_indices.xlsx

All command-line parameters

STRmie-HD provides two main operational modes: Complete_Pipeline and Index_Calculation.

Main modes

  • --mode Complete_Pipeline → run full analysis from raw FASTQ/FASTA files.

  • --mode Index_Calculation → recalculate indices starting from a user-supplied Excel table with allele definitions.

General options (required)

  • -f, --input <dir> → input directory containing .fastq.gz or .fasta.gz files.

  • -o, --output <dir> → output directory.

Nanopore arguments (used only with --nanopore)

  • --np-max-roi <int> → max Region Of Interest (ROI) length (default: 300)

  • --np-max-edits <int> → max edits allowed for both flanks (default: 2)

  • --np-max-edits-left <int> → override edits for upstream flank (default: 2)

  • --np-max-edits-right <int> → override edits for downstream flank (default: 3)

  • --np-seed-len <int> → seed prefilter length (0 disables; suggested 5–7) (default: 0)

  • --np-bestmatch / --np-no-bestmatch → enable/disable regex.BESTMATCH (default: enabled)

  • --np-min-read-len <int> → minimum read length (default: 50)

  • --np-min-cag-pct <float> → discard reads if the fraction of in-frame CAG triplets is below this threshold
    (default: 0.70; set 0 to disable)

  • --np-cag-pct-scope {roi,cag_region} → region used for the CAG fraction (default: cag_region):
      - roi = entire ROI
      - cag_region = ROI prefix before the LOI/DOI motif block

  • --np-allow-caa → count CAA as acceptable alongside CAG in the fraction calculation
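
To make the --np-min-cag-pct and --np-allow-caa options concrete, here is a minimal sketch of a CAG-fraction filter. It assumes "in-frame" means triplets read from the first base of the chosen region (roi or cag_region) and that a trailing partial triplet is ignored; STRmie-HD's actual computation may differ. The function names are illustrative, not the tool's API.

```python
def cag_fraction(region, allow_caa=False):
    """Fraction of complete in-frame triplets that are CAG (or CAA, if allowed)."""
    accepted = {"CAG", "CAA"} if allow_caa else {"CAG"}
    # Split into complete triplets starting at the first base (assumed frame).
    usable = len(region) - len(region) % 3
    triplets = [region[i:i + 3] for i in range(0, usable, 3)]
    if not triplets:
        return 0.0
    return sum(t in accepted for t in triplets) / len(triplets)


def passes_cag_filter(region, min_cag_pct=0.70, allow_caa=False):
    """Keep a read unless its CAG fraction is below the threshold (0 disables)."""
    if min_cag_pct == 0:
        return True
    return cag_fraction(region, allow_caa) >= min_cag_pct


clean = "CAG" * 7 + "CAA" + "CAG" * 2   # 10 triplets, one CAA interruption
noisy = "CAGCTGCAGTAGCAGCAT"            # 6 triplets, only 3 of them CAG

print(cag_fraction(clean), cag_fraction(clean, allow_caa=True))
print(passes_cag_filter(clean), passes_cag_filter(noisy))
```

With the default threshold of 0.70, the first read is kept and the second is discarded; setting the threshold to 0 keeps both.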

Peak detection

  • --cwt → enable wavelet-based peak detection (scipy.signal.find_peaks_cwt) as an alternative to histogram-based detection.

  • -bc, --cutpoint_based → call peaks by splitting the histogram at the biological cutpoint (default: 27).

  • -a <list> → list of widths (default [5,6,7,8,9,10]) used by find_peaks_cwt to match expected peak shapes. The widths set the scale of features treated as peaks: values that are too small miss broad peaks, and values that are too large merge or ignore narrow peaks, so this option directly affects sensitivity and specificity in peak calling.

  • -i <int> → interval (default 6) around candidate peaks used for local refinement. It defines how many points on each side are considered when adjusting the peak position: too small an interval may miss the true summit, too large an interval may introduce noise, so the value balances precision and robustness in peak localization.

  • -m <int> → minimum CAG repeats to consider (default 7).
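
The cutpoint-based strategy and the refinement interval can be sketched in plain Python as follows. This is an illustration under stated assumptions, not STRmie-HD's implementation: the histogram is taken to be a mapping from CAG repeat count to read count, the cutpoint itself is assigned to the lower side, and both function names are hypothetical.

```python
def call_peaks_by_cutpoint(histogram, cutpoint=27, min_repeats=7):
    """Return (lower_peak, upper_peak); a side with no reads yields None."""
    # Drop bins below the minimum repeat count (cf. -m).
    kept = {n: c for n, c in histogram.items() if n >= min_repeats and c > 0}
    # Split at the cutpoint (cf. -c); inclusive lower side is an assumption.
    lower = {n: c for n, c in kept.items() if n <= cutpoint}
    upper = {n: c for n, c in kept.items() if n > cutpoint}

    def tallest(bins):
        return max(bins, key=bins.get) if bins else None

    return tallest(lower), tallest(upper)


def refine_peak(histogram, candidate, interval=6):
    """Move a candidate peak to the tallest bin within +/- interval (cf. -i)."""
    window = range(candidate - interval, candidate + interval + 1)
    return max(window, key=lambda n: histogram.get(n, 0))


# Toy data: bin 5 is tall but below the minimum repeat count, so it is ignored.
histogram = {5: 900, 16: 30, 17: 400, 18: 60, 41: 20, 42: 150, 43: 90, 44: 35}

print(call_peaks_by_cutpoint(histogram))
print(refine_peak(histogram, candidate=40))
```

On this toy histogram the two calls report alleles at 17 and 42 repeats, and a rough candidate at 40 is refined to 42.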

Indices and thresholds

  • -c <int> → cutpoint (default 27) that separates the “healthy” from the “phenotypic” allele range; it is also used for the Allele Ratio.

  • -ti <float> → relative peak height threshold for Instability Index (Advanced).

  • -te <float> → relative peak height threshold for Expansion Index (Advanced).
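
As an illustration of how a relative peak height threshold acts on an index, the sketch below uses one common formulation of an instability-style index: bins lower than the threshold times the main peak height are discarded, the rest are normalized, and each is weighted by its distance from the main allele. The formula, the function name, and the 0.05 threshold are assumptions chosen for the example; the exact definitions used by STRmie-HD are not given in this section.

```python
def thresholded_index(histogram, main_allele, rel_threshold, expansions_only=False):
    """Weighted mean repeat-length change relative to the main allele.

    Assumed formulation, not necessarily STRmie-HD's:
      1. keep bins with height >= rel_threshold * main peak height
      2. optionally keep only bins at or above the main allele
      3. sum (normalized height) * (repeats - main_allele)
    """
    main_height = histogram[main_allele]
    kept = {n: c for n, c in histogram.items() if c >= rel_threshold * main_height}
    if expansions_only:
        kept = {n: c for n, c in kept.items() if n >= main_allele}
    total = sum(kept.values())
    return sum((c / total) * (n - main_allele) for n, c in kept.items())


histogram = {40: 10, 41: 50, 42: 100, 43: 40, 44: 10, 45: 2}

# Bin 45 (2 reads) falls below 5% of the main peak (100 reads) and is dropped.
instability = thresholded_index(histogram, main_allele=42, rel_threshold=0.05)
expansion = thresholded_index(histogram, main_allele=42, rel_threshold=0.05,
                              expansions_only=True)
print(round(instability, 4), round(expansion, 4))
```

Raising the threshold removes low, noisy bins at the cost of ignoring rare but genuine expansions, which is why the two thresholds are exposed as separate options.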

Graphical outputs

  • --cag_graph → save histograms of CAG distributions per sample.

  • --ccg_graph → save histograms of CCG distributions and warning cases.

Index calculation mode (parameter required only with --mode Index_Calculation)

  • -p, --path <file.xlsx> → Excel file with predefined alleles (Sample, CAG_Allele_1, CAG_Allele_2) for recalculation.