# Command-line Usage

## 🔹 1. Complete Pipeline

**Required:**
- `--mode Complete_Pipeline` → run full analysis from raw FASTQ/FASTA files.  
- `-f, --input <dir>` → input directory containing `.fastq.gz` or `.fasta.gz` files.  
- `-o, --output <dir>` → output directory.  

Optional: add `--nanopore` to enable nanopore mode.

```bash
strmie --mode Complete_Pipeline \
       -f /path/to/input_dir \
       -o /path/to/output_dir \
       [other options]
```

## 🔹 2. Index Calculation Only

**Required:**
- `--mode Index_Calculation` → recalculate indices starting from a user-supplied Excel table with allele definitions. 
- `-f, --input <dir>` → input directory containing `.fastq.gz` or `.fasta.gz` files.  
- `-o, --output <dir>` → output directory.
- `-p, --path <file.xlsx>` → Excel file with predefined alleles (`Sample, CAG_Allele_1, CAG_Allele_2`) for recalculation.

```bash
strmie --mode Index_Calculation \
       -f /path/to/input_dir \
       -o /path/to/output_dir \
       -p /path/to/CAG_data_for_recalculating_indices.xlsx
```


---

## Command-line parameters
STRmie-HD provides two main operational modes: `Complete_Pipeline` and `Index_Calculation`.

**Main modes**
- `--mode Complete_Pipeline` → run full analysis from raw FASTQ/FASTA files.  
- `--mode Index_Calculation` → recalculate indices starting from a user-supplied Excel table with allele definitions.  

**General options (required)**
- `-f, --input <dir>` → input directory containing `.fastq.gz` or `.fasta.gz` files.  
- `-o, --output <dir>` → output directory.  

**Nanopore arguments (used only with `--nanopore`)**
- `--np-max-roi <int>` → max Region Of Interest (ROI) length (default: 300)
- `--np-max-edits <int>` → max edits allowed for both flanks (default: 2)
- `--np-max-edits-left <int>` → override edits for upstream flank (default: 2)
- `--np-max-edits-right <int>` → override edits for downstream flank (default: 3)
- `--np-seed-len <int>` → seed prefilter length (0 disables; suggested 5–7) (default: 0)
- `--np-bestmatch` / `--np-no-bestmatch` → enable/disable `regex.BESTMATCH` (default: enabled)
- `--np-min-read-len <int>` → minimum read length (default: 50)
- `--np-min-cag-pct <float>` → discard reads whose fraction of in-frame CAG triplets is below this threshold (default: 0.70; set `0` to disable)
- `--np-cag-pct-scope {roi,cag_region}` → region over which the CAG fraction is computed (default: `cag_region`):
  - `roi` = entire ROI
  - `cag_region` = ROI prefix before the LOI/DOI motif block
- `--np-allow-caa` → count `CAA` as acceptable alongside `CAG` in the fraction calculation
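
To make the CAG-fraction filter concrete, here is a minimal Python sketch of how such a check could work. It is an illustration under stated assumptions, not STRmie-HD's actual implementation: the function names `cag_fraction` and `passes_cag_filter` are hypothetical, and the sketch assumes the sequence passed in is already the region selected by `--np-cag-pct-scope` and starts in frame.

```python
def cag_fraction(seq, allow_caa=False):
    """Fraction of in-frame triplets in `seq` that are CAG (optionally CAA too)."""
    seq = seq.upper()
    accepted = {"CAG", "CAA"} if allow_caa else {"CAG"}
    # Assumption: read in frame 0 and ignore any trailing partial triplet.
    usable = len(seq) - len(seq) % 3
    triplets = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not triplets:
        return 0.0
    return sum(t in accepted for t in triplets) / len(triplets)


def passes_cag_filter(seq, min_cag_pct=0.70, allow_caa=False):
    """Hypothetical read filter mirroring `--np-min-cag-pct`."""
    # A threshold of 0 disables the filter.
    if min_cag_pct == 0:
        return True
    return cag_fraction(seq, allow_caa) >= min_cag_pct


region = "CAG" * 8 + "CAA" + "CAG"             # 10 triplets, one of them CAA
print(cag_fraction(region))                     # 0.9
print(cag_fraction(region, allow_caa=True))     # 1.0
print(passes_cag_filter("CAGCAGCCGCCG"))        # False (fraction is 0.5)
```

With `allow_caa=True`, the interrupting `CAA` no longer lowers the fraction, which is the effect described for `--np-allow-caa`.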

**Peak detection**
- `--cwt` → enable wavelet-based peak detection (`scipy.signal.find_peaks_cwt`) as an alternative to histogram-based detection.  
- `-bc, --cutpoint_based` → call peaks by splitting the histogram at the biological cutpoint (default: 27).  
- `-a <list>` → list of widths (default `[5,6,7,8,9,10]`) passed to `find_peaks_cwt` to match expected peak shapes. The widths set the scale of features treated as peaks: widths that are too small miss broad peaks, while widths that are too large merge or ignore narrow peaks, so this option directly affects the sensitivity and specificity of peak calling.
- `-i <int>` → interval (default `6`) around candidate peaks used for local refinement. It defines how many points on each side are considered when adjusting the peak position: too small an interval may miss the true summit, while too large an interval may introduce noise, so the value balances precision and robustness in peak localization.
- `-m <int>` → minimum CAG repeats to consider (default `7`).  
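
The cutpoint-based splitting and the local refinement interval described above can be sketched in plain Python. This is a simplified illustration rather than the tool's own code: `refine_peak` and `cutpoint_peaks` are hypothetical names, the histogram is assumed to be a `{repeat_count: read_count}` mapping, and assigning the cutpoint itself to the upper side is an assumption.

```python
def refine_peak(counts, idx, interval=6):
    """Return the index of the tallest bin within `interval` bins of `idx`."""
    lo = max(0, idx - interval)
    hi = min(len(counts), idx + interval + 1)
    return max(range(lo, hi), key=lambda i: counts[i])


def cutpoint_peaks(hist, cutpoint=27, min_repeats=7):
    """Split a {repeat_count: read_count} histogram at `cutpoint` and return
    the best-supported repeat count on each side (None for an empty side)."""
    def tallest(side):
        return max(side, key=side.get) if side else None

    # Assumption: the cutpoint itself is assigned to the upper side.
    below = {k: v for k, v in hist.items() if min_repeats <= k < cutpoint}
    above = {k: v for k, v in hist.items() if k >= cutpoint}
    return tallest(below), tallest(above)


hist = {5: 400, 17: 120, 18: 300, 19: 90, 42: 60, 43: 150, 44: 80}
print(cutpoint_peaks(hist))                                # (18, 43)
print(refine_peak([0, 1, 5, 2, 9, 3, 0], 2, interval=1))   # 2
print(refine_peak([0, 1, 5, 2, 9, 3, 0], 2, interval=2))   # 4
```

In the last two calls, widening the interval from 1 to 2 moves the refined peak from index 2 to the taller bin at index 4, which shows why the interval trades off precision against sensitivity to neighbouring signal. The bin at 5 repeats is ignored because it falls below the minimum repeat count.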

**Indices and thresholds**
- `-c <int>` → cutpoint (default `27`) that separates the “healthy” and “phenotypic” allele ranges; also used for the Allele Ratio.  
- `-ti <float>` → relative peak height threshold for **Instability Index** (Advanced).  
- `-te <float>` → relative peak height threshold for **Expansion Index** (Advanced).  
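
As a rough illustration of what a relative peak height threshold means, the sketch below keeps only the histogram bins whose read count is at least a given fraction of the tallest bin. It does not reproduce how STRmie-HD computes the Instability or Expansion Index; the function name `bins_above_relative_threshold`, the example histogram, and the threshold value are all assumptions made for this example.

```python
def bins_above_relative_threshold(hist, threshold):
    """Keep bins whose read count is at least `threshold` times the tallest bin.

    `hist` is assumed to map repeat count -> read count.
    """
    if not hist:
        return {}
    tallest = max(hist.values())
    return {k: v for k, v in hist.items() if v >= threshold * tallest}


hist = {40: 10, 41: 50, 42: 200, 43: 80, 44: 15}
# With a relative threshold of 0.2, the cutoff is 0.2 * 200 = 40 reads.
print(bins_above_relative_threshold(hist, 0.2))   # {41: 50, 42: 200, 43: 80}
```

A higher threshold restricts the calculation to bins close to the main peak, while a lower one lets low-abundance bins contribute.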

**Graphical outputs**
- `--cag_graph` → save histograms of CAG distributions per sample.  
- `--ccg_graph` → save histograms of CCG distributions and warning cases.  

**Index calculation mode (required only with `--mode Index_Calculation`)**
- `-p, --path <file.xlsx>` → Excel file with predefined alleles (`Sample, CAG_Allele_1, CAG_Allele_2`) for recalculation.