Installation

Prerequisites

Before you begin, ensure you have:

At least 10 GB of free disk space (for miniconda, EXCAVATE-HT, and genome files)
A stable internet connection for downloading large genome files
Administrator/sudo access for initial setup

Set Up

Installing miniconda on MacOS/Unix

Go to the miniconda installation website: https://www.anaconda.com/download
Click “Skip registration”
Navigate to miniconda installers and download the installer for your device
Open the downloaded installation file and follow the install instructions. We recommend installing to your own user directory rather than for all users of the computer.
Open a new terminal window and verify the installation was successful by checking that (base) appears at the beginning of your command prompt. This indicates the base conda environment is active.

Installing miniconda on Windows

Some packages needed to run EXCAVATE-HT (bcftools and bedtools) cannot be easily installed via conda on Windows. Therefore, you must first install WSL (Windows Subsystem for Linux) to use a Linux environment.

Step 1: Install WSL

Open PowerShell as Administrator (right-click on PowerShell > Run as Administrator)
Run:
```
wsl --install
```
Restart your computer when prompted
After restart, open the Ubuntu app from the Start Menu (Ubuntu is automatically installed with WSL)
Follow the prompts to create a username and password for your Ubuntu environment

Step 2: Install miniconda in Ubuntu

In the Ubuntu terminal, download and install miniconda:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Follow the installation prompts (press Enter to review the license, type “yes” to accept, and accept the default installation location)
Close and reopen the Ubuntu terminal
Verify installation by checking that (base) appears at the beginning of your command prompt

Important note for Windows users: To access your Windows filesystem from Ubuntu, use the /mnt/ prefix. For example:

Windows path: C:\Users\YourUsername\Downloads
Ubuntu path: /mnt/c/Users/YourUsername/Downloads

To find your Windows username, open Command Prompt on Windows and run echo %USERNAME%.

Create a working directory

Create a directory to store all files for your EXCAVATE-HT analysis. For example:

On Mac/Linux:

mkdir -p ~/Downloads/excavate-ht
cd ~/Downloads/excavate-ht

On Windows (in Ubuntu terminal):

mkdir -p /mnt/c/Users/YourUsername/Downloads/excavate-ht
cd /mnt/c/Users/YourUsername/Downloads/excavate-ht

Replace YourUsername with your actual Windows username.

Clone the EXCAVATE-HT repository

Clone the repository to get the source code and environment configuration file:

git clone https://github.com/akshitasax/EXCAVATE-HT.git
cd EXCAVATE-HT

Create and activate the excavate conda environment

The repository includes an environment.yml file that specifies all required dependencies. Use it to create a dedicated conda environment:

Create the environment (this may take several minutes):
```
conda env create -f environment.yml
```
Activate the environment:
```
conda activate excavate-ht
```
Your terminal prompt should now show (excavate-ht) instead of (base), indicating the environment is active.

The environment includes:

Python dependencies:

Python >= 3.9, < 3.13
numpy
pandas
biopython
regex
pyfaidx

External tools:

bedtools
bcftools
bowtie1 >=1.2.3 for off-targets analysis

Install EXCAVATE-HT

With the excavate environment activated, install EXCAVATE-HT in editable mode:

pip install -e .

Verify the installation was successful:

excavate-ht --help

If you see the help message, installation was successful!

Prepare input data files

Set up input directory

Create a subdirectory in your working folder to store input files:

cd ~/Downloads/excavate-ht  # or your working directory
mkdir -p input_data
cd input_data

Download required files

You will need the following files for your analysis:

1. Chromosome-specific FASTA file

To download an individual chromosome:

Go to https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/
Click on your chromosome of interest
Click either the RefSeq or GenBank blue link
Click “FASTA” under the page title
Click “Send to:” at the top right > Complete record > File > ensure format is FASTA > Create file
Rename the downloaded file to something descriptive (e.g., chr1_sequence.fasta)

2. Cell line VCF files

Download both the compressed VCF file and its index:

cell-line.vcf.gz
cell-line.vcf.gz.tbi

3. Population VCF files

Download population variant data (e.g., from the 1000 Genomes Project: https://hgdownload.soe.ucsc.edu/gbdb/hg38/1000Genomes/)

You’ll need:

population.vcf.gz
population.vcf.gz.tbi

Note: Genome files can be large (several GB). Download times will vary based on your internet connection.

Expected directory structure

After setup, your directory structure should look like this:

~/Downloads/excavate-ht/
├── EXCAVATE-HT/
│   ├── environment.yml
│   ├── (other repository files)
├── input_data/
│   ├── chr1_sequence.fasta
│   ├── cell-line.vcf.gz
│   ├── cell-line.vcf.gz.tbi
│   ├── population.vcf.gz
│   └── population.vcf.gz.tbi
└── output_results/

Using EXCAVATE-HT

EXCAVATE-HT can be run in two modes: generate and pair.

To see detailed help for each mode:

excavate-ht generate --help

excavate-ht pair --help

For parameter descriptions and usage examples, see the full documentation.

Off-target Analysis Setup

EXCAVATE-HT performs off-target analysis using Bowtie to identify:

Exact genome-wide matches
Optional 1-bp mismatches on a target chromosome

You have three options for providing genome reference data:

Use prebuilt hg38 Bowtie indexes (recommended)
Provide your own existing Bowtie indexes
Provide a genome FASTA file and let EXCAVATE-HT build indexes automatically

Option A: Automatically download hg38 indexes (Recommended)

EXCAVATE-HT can download official prebuilt GRCh38 (“hg38 no-alt”) Bowtie indexes for you.

Simply add the following flags to your command:

--off-targets --download-hg38

Example:

excavate-ht generate \
  --input variants.vcf \
  --off-targets \
  --download-hg38

What this does:

Downloads GRCh38_noalt_as Bowtie indexes (~3–4 GB)
Extracts them into: <outdir>/bowtie_index/
Automatically uses them for alignment
Caches them under <outdir>/bowtie_index for future runs using the same output directory.

Note

No genome FASTA is required for exact genome matching in this mode. This is the easiest way to get started.

Reusing downloaded genome indexes across projects

By default, EXCAVATE-HT downloads and caches genome indexes under:

<outdir>/bowtie_index/

These indexes are reused automatically for subsequent runs as long as the same output directory is used.

If you would like to reuse the same genome indexes across multiple projects or output directories, you may move the downloaded index files to a central location (for example, a shared data or genomes folder), and then point EXCAVATE-HT to that location using --genome-index-prefix.

For example:

mv <outdir>/bowtie_index/GRCh38_noalt_as* /data/genomes/hg38/

Then in future runs:

--genome-index-prefix /data/genomes/hg38/GRCh38_noalt_as

This avoids repeated downloads and allows a single set of indexes to be reused across analyses.

EXCAVATE-HT will automatically detect existing .bt2 or .ebwt index files at the specified prefix and will skip rebuilding or downloading when they are present.

Option B: Use your own Bowtie index

If you already have Bowtie indexes (either .ebwt or .bt2 format), specify the path to the index prefix:

--genome-index-prefix /full/path/to/index_prefix

Example:

excavate-ht generate \
  --input variants.vcf \
  --off-targets \
  --genome-index-prefix /data/genomes/hg38/GRCh38_noalt_as

This prefix should correspond to index files such as:

GRCh38_noalt_as.1.bt2
GRCh38_noalt_as.2.bt2
GRCh38_noalt_as.3.bt2
GRCh38_noalt_as.4.bt2
(or .ebwt equivalents)

EXCAVATE-HT will automatically detect whether .bt2 or .ebwt indexes are present.

Option C: Build indexes from a genome FASTA

If no indexes are provided, EXCAVATE-HT will build Bowtie indexes automatically from your genome FASTA file:

--genome-fa /path/to/genome.fa

Indexes are created in <outdir>/bowtie_index/ and reused on subsequent runs.

Example:

excavate-ht generate \
  --input variants.vcf \
  --genome-fa genome.fa \
  --off-targets

Example Off-target Commands

Using automatic hg38 download:

excavate-ht generate \
  --input variants.vcf \
  --genome-fa genome.fa \
  --off-targets \
  --download-hg38

Using existing indexes:

excavate-ht generate \
  --input variants.vcf \
  --off-targets \
  --genome-index-prefix /data/hg38/GRCh38_noalt_as

Building indexes from FASTA:

excavate-ht generate \
  --input variants.vcf \
  --genome-fa /path/to/hg38.fa \
  --off-targets

Notes and Requirements

Requirements:

Bowtie ≥ 1.2.3 is required (automatically checked at runtime)
Both Bowtie1 (.ebwt) and Bowtie2-format (.bt2) indexes are supported

Behavior:

By default, Bowtie searches both strands; EXCAVATE-HT writes forward queries only to avoid double-counting
Indexes are cached under <outdir>/bowtie_index/ for fast repeat runs
For HPC environments, use --bowtie-threads to control parallelism

Storage considerations:

Downloaded hg38 indexes require ~3–4 GB of disk space
Building indexes from FASTA may take 30–60 minutes depending on genome size and system resources

Troubleshooting

Issue: (base) doesn’t appear after installing miniconda

Close and reopen your terminal
If still not showing, run: conda init and restart your terminal

Issue: conda: command not found

The conda installation directory may not be in your PATH. Run the installer again and ensure you answer “yes” when asked to initialize conda.

Issue: Environment creation fails

Ensure you have a stable internet connection
Try: conda clean --all then retry conda env create -f environment.yml

Issue: Permission denied errors on Windows/WSL

Make sure you’re working in a directory you have write access to (like /mnt/c/Users/YourUsername/)

Issue: Cannot find environment.yml

Ensure you’re in the EXCAVATE-HT directory: cd EXCAVATE-HT
Verify the file exists: ls environment.yml

For additional help, please open an issue on the GitHub repository.