Installation
Prerequisites
Before you begin, ensure you have:
At least 10 GB of free disk space (for miniconda, EXCAVATE-HT, and genome files)
A stable internet connection for downloading large genome files
Administrator/sudo access for initial setup
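To check the disk space prerequisite before downloading anything, a small shell sketch like the one below may help. The helper name has_free_gb is hypothetical (it is not part of EXCAVATE-HT), and the check assumes the standard POSIX df output, where the fourth column of the second line is the available space.

```shell
# Return success if DIR has at least MIN_GB gigabytes free.
# has_free_gb is a hypothetical helper name, not an EXCAVATE-HT command.
has_free_gb() {
  dir="$1"
  min_gb="$2"
  # df -P prints one line per filesystem; -k reports sizes in 1024-byte blocks.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  need_kb=$((min_gb * 1024 * 1024))
  [ "$avail_kb" -ge "$need_kb" ]
}

if has_free_gb "$HOME" 10; then
  echo "OK: at least 10 GB free under $HOME"
else
  echo "Warning: less than 10 GB free under $HOME"
fi
```

If you plan to keep your working directory on another drive, pass that location instead of $HOME.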
Set Up
Installing miniconda on macOS/Unix
Go to the miniconda installation website: https://www.anaconda.com/download
Click “Skip registration”
Navigate to miniconda installers and download the installer for your device
Open the downloaded installation file and follow the install instructions. We recommend installing to your own user directory rather than for all users of the computer.
Open a new terminal window and verify that the installation succeeded by checking that (base) appears at the beginning of your command prompt. This indicates that the base conda environment is active.
Installing miniconda on Windows
Some packages needed to run EXCAVATE-HT (bcftools and bedtools) cannot be easily installed via conda on Windows. Therefore, you must first install WSL (Windows Subsystem for Linux) to use a Linux environment.
Step 1: Install WSL
Open PowerShell as Administrator (right-click on PowerShell > Run as Administrator)
Run:
wsl --install
Restart your computer when prompted
After restart, open the Ubuntu app from the Start Menu (Ubuntu is automatically installed with WSL)
Follow the prompts to create a username and password for your Ubuntu environment
Step 2: Install miniconda in Ubuntu
In the Ubuntu terminal, download and install miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Follow the installation prompts (press Enter to review the license, type “yes” to accept, and accept the default installation location)
Close and reopen the Ubuntu terminal
Verify the installation by checking that (base) appears at the beginning of your command prompt
Important note for Windows users: To access your Windows filesystem from Ubuntu, use the /mnt/ prefix. For example:
Windows path:
C:\Users\YourUsername\Downloads
Ubuntu path:
/mnt/c/Users/YourUsername/Downloads
To find your Windows username, open Command Prompt on Windows and run echo %USERNAME%.
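The path mapping described above can be sketched as a small shell function. The name win_to_wsl is hypothetical and the function only handles simple drive-letter paths; it is meant as an illustration of the /mnt/ rule, not a complete converter.

```shell
# Convert a Windows path (C:\Users\Name\Downloads) to its WSL form
# (/mnt/c/Users/Name/Downloads). win_to_wsl is a hypothetical helper name.
win_to_wsl() {
  win_path="$1"
  # Drive letter: everything before the first colon, lowercased.
  drive=$(printf '%s' "${win_path%%:*}" | tr '[:upper:]' '[:lower:]')
  # Remainder: everything after the first colon, with backslashes turned into slashes.
  rest=$(printf '%s' "${win_path#*:}" | tr '\\' '/')
  printf '/mnt/%s%s\n' "$drive" "$rest"
}

win_to_wsl 'C:\Users\YourUsername\Downloads'
# prints /mnt/c/Users/YourUsername/Downloads
```

Inside WSL itself, the built-in wslpath utility performs the same conversion and is likely the more robust choice.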
Create a working directory
Create a directory to store all files for your EXCAVATE-HT analysis. For example:
On Mac/Linux:
mkdir -p ~/Downloads/excavate-ht
cd ~/Downloads/excavate-ht
On Windows (in Ubuntu terminal):
mkdir -p /mnt/c/Users/YourUsername/Downloads/excavate-ht
cd /mnt/c/Users/YourUsername/Downloads/excavate-ht
Replace YourUsername with your actual Windows username.
Clone the EXCAVATE-HT repository
Clone the repository to get the source code and environment configuration file:
git clone https://github.com/akshitasax/EXCAVATE-HT.git
cd EXCAVATE-HT
Create and activate the excavate-ht conda environment
The repository includes an environment.yml file that specifies all required dependencies. Use it to create a dedicated conda environment:
Create the environment (this may take several minutes):
conda env create -f environment.yml
Activate the environment:
conda activate excavate-ht
Your terminal prompt should now show (excavate-ht) instead of (base), indicating the environment is active.
The environment includes:
Python dependencies:
Python >= 3.9, < 3.13
numpy
pandas
biopython
regex
pyfaidx
External tools:
bedtools
bcftools
bowtie1 >=1.2.3 for off-targets analysis
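Once the environment is active, you may want to confirm that the tools listed above are actually on your PATH. The sketch below uses a hypothetical helper, check_tools, and assumes the executables are named python, bedtools, bcftools, and bowtie.

```shell
# Report whether each named command is on PATH; return non-zero if any is missing.
# check_tools is a hypothetical helper name.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Executable names are assumed to match the tools listed above.
check_tools python bedtools bcftools bowtie \
  || echo "Some tools are missing; try: conda activate excavate-ht"
```

If anything is reported as missing while (excavate-ht) is shown in your prompt, recreating the environment from environment.yml is a reasonable next step.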
Install EXCAVATE-HT
With the excavate-ht environment activated, install EXCAVATE-HT in editable mode:
pip install -e .
Verify the installation was successful:
excavate-ht --help
If you see the help message, installation was successful!
Prepare input data files
Set up input directory
Create a subdirectory in your working folder to store input files:
cd ~/Downloads/excavate-ht # or your working directory
mkdir -p input_data
cd input_data
Download required files
You will need the following files for your analysis:
1. Chromosome-specific FASTA file
To download an individual chromosome:
Go to https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/
Click on your chromosome of interest
Click either the RefSeq or GenBank blue link
Click “FASTA” under the page title
Click “Send to:” at the top right > Complete record > File > ensure format is FASTA > Create file
Rename the downloaded file to something descriptive (e.g., chr1_sequence.fasta)
2. Cell line VCF files
Download both the compressed VCF file and its index:
cell-line.vcf.gz
cell-line.vcf.gz.tbi
3. Population VCF files
Download population variant data (e.g., from the 1000 Genomes Project: https://hgdownload.soe.ucsc.edu/gbdb/hg38/1000Genomes/)
You’ll need:
population.vcf.gz
population.vcf.gz.tbi
Note: Genome files can be large (several GB). Download times will vary based on your internet connection.
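Because each VCF needs its .tbi index next to it, a quick check after downloading can catch a missing file early. The following is a minimal sketch; has_vcf_index is a hypothetical helper, and the file names are the placeholders used above.

```shell
# Return success only if both the compressed VCF and its .tbi index exist.
# has_vcf_index is a hypothetical helper name.
has_vcf_index() {
  vcf="$1"
  if [ ! -f "$vcf" ]; then
    echo "missing VCF:   $vcf"
    return 1
  fi
  if [ ! -f "$vcf.tbi" ]; then
    echo "missing index: $vcf.tbi"
    return 1
  fi
  echo "ok: $vcf"
}

# Placeholder file names from the list above; run this inside input_data.
for f in cell-line.vcf.gz population.vcf.gz; do
  has_vcf_index "$f" || true
done
```

If only the index is missing and the VCF is bgzip-compressed, bcftools index -t <file>.vcf.gz should be able to regenerate it.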
Expected directory structure
After setup, your directory structure should look like this:
~/Downloads/excavate-ht/
├── EXCAVATE-HT/
│ ├── environment.yml
│ └── (other repository files)
├── input_data/
│ ├── chr1_sequence.fasta
│ ├── cell-line.vcf.gz
│ ├── cell-line.vcf.gz.tbi
│ ├── population.vcf.gz
│ └── population.vcf.gz.tbi
└── output_results/
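The tree above includes an output_results folder that none of the earlier steps create. One way to set up both data folders is sketched below; make_layout is a hypothetical helper, and whether EXCAVATE-HT would create its output directory on its own is not something this guide states, so creating it in advance is simply the cautious option.

```shell
# Create the input and output folders shown in the tree above.
# make_layout is a hypothetical helper name; pass your own working directory.
make_layout() {
  workdir="$1"
  mkdir -p "$workdir/input_data" "$workdir/output_results"
  ls "$workdir"
}

# Example: make_layout ~/Downloads/excavate-ht
```

On Windows (in the Ubuntu terminal), pass the /mnt/c/... form of your working directory instead.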
Using EXCAVATE-HT
EXCAVATE-HT can be run in two modes: generate and pair.
To see detailed help for each mode:
excavate-ht generate --help
excavate-ht pair --help
For parameter descriptions and usage examples, see the full documentation.
Off-target Analysis Setup
EXCAVATE-HT performs off-target analysis using Bowtie to identify:
Exact genome-wide matches
Optional 1-bp mismatches on a target chromosome
You have three options for providing genome reference data:
Use prebuilt hg38 Bowtie indexes (recommended)
Provide your own existing Bowtie indexes
Provide a genome FASTA file and let EXCAVATE-HT build indexes automatically
Option A: Automatically download hg38 indexes (Recommended)
EXCAVATE-HT can download official prebuilt GRCh38 (“hg38 no-alt”) Bowtie indexes for you.
Simply add the following flags to your command:
--off-targets --download-hg38
Example:
excavate-ht generate \
--input variants.vcf \
--off-targets \
--download-hg38
What this does:
Downloads GRCh38_noalt_as Bowtie indexes (~3–4 GB)
Extracts them into:
<outdir>/bowtie_index/
Automatically uses them for alignment
Caches them under <outdir>/bowtie_index for future runs using the same output directory.
Note: No genome FASTA is required for exact genome matching in this mode. This is the easiest way to get started.
Reusing downloaded genome indexes across projects
By default, EXCAVATE-HT downloads and caches genome indexes under:
<outdir>/bowtie_index/
These indexes are reused automatically for subsequent runs as long as the same output directory is used.
If you would like to reuse the same genome indexes across multiple projects or output directories, you may move the downloaded index files to a central location (for example, a shared data or genomes folder), and then point EXCAVATE-HT to that location using --genome-index-prefix.
For example:
mv <outdir>/bowtie_index/GRCh38_noalt_as* /data/genomes/hg38/
Then in future runs:
--genome-index-prefix /data/genomes/hg38/GRCh38_noalt_as
This avoids repeated downloads and allows a single set of indexes to be reused across analyses.
EXCAVATE-HT will automatically detect existing .bt2 or .ebwt index files at the specified prefix and will skip rebuilding or downloading when they are present.
Option B: Use your own Bowtie index
If you already have Bowtie indexes (either .ebwt or .bt2 format), specify the path to the index prefix:
--genome-index-prefix /full/path/to/index_prefix
Example:
excavate-ht generate \
--input variants.vcf \
--off-targets \
--genome-index-prefix /data/genomes/hg38/GRCh38_noalt_as
This prefix should correspond to index files such as:
GRCh38_noalt_as.1.bt2
GRCh38_noalt_as.2.bt2
GRCh38_noalt_as.3.bt2
GRCh38_noalt_as.4.bt2
(or .ebwt equivalents)
EXCAVATE-HT will automatically detect whether .bt2 or .ebwt indexes are present.
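Before passing a prefix to --genome-index-prefix, it can be useful to confirm that index files really exist there. The sketch below is a rough stand-in, not EXCAVATE-HT's own detection logic: index_format is a hypothetical helper that only looks for the first index file (<prefix>.1.bt2 or <prefix>.1.ebwt).

```shell
# Print which index format exists at a Bowtie index prefix: bt2, ebwt, or none.
# index_format is a hypothetical helper; it checks only the first index file
# and is not EXCAVATE-HT's own detection code.
index_format() {
  prefix="$1"
  if [ -f "$prefix.1.bt2" ]; then
    echo "bt2"
  elif [ -f "$prefix.1.ebwt" ]; then
    echo "ebwt"
  else
    echo "none"
  fi
}

# Example prefix from above; prints "none" if no index files are found there.
index_format /data/genomes/hg38/GRCh38_noalt_as
```

A result of "none" usually means the prefix includes a file extension or points at the directory rather than at the shared file-name stem.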
Option C: Build indexes from a genome FASTA
If no indexes are provided, EXCAVATE-HT will build Bowtie indexes automatically from your genome FASTA file:
--genome-fa /path/to/genome.fa
Indexes are created in <outdir>/bowtie_index/ and reused on subsequent runs.
Example:
excavate-ht generate \
--input variants.vcf \
--genome-fa genome.fa \
--off-targets
Example Off-target Commands
Using automatic hg38 download:
excavate-ht generate \
--input variants.vcf \
--genome-fa genome.fa \
--off-targets \
--download-hg38
Using existing indexes:
excavate-ht generate \
--input variants.vcf \
--off-targets \
--genome-index-prefix /data/hg38/GRCh38_noalt_as
Building indexes from FASTA:
excavate-ht generate \
--input variants.vcf \
--genome-fa /path/to/hg38.fa \
--off-targets
Notes and Requirements
Requirements:
Bowtie ≥ 1.2.3 is required (automatically checked at runtime)
Both Bowtie1 (.ebwt) and Bowtie2-format (.bt2) indexes are supported
Behavior:
By default, Bowtie searches both strands; EXCAVATE-HT writes forward queries only to avoid double-counting
Indexes are cached under <outdir>/bowtie_index/ for fast repeat runs
For HPC environments, use --bowtie-threads to control parallelism
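To pick a starting value for --bowtie-threads, you could query the number of available CPU cores. This is only a sketch: default_threads is a hypothetical helper, and it assumes the flag accepts an integer thread count, which the help output (excavate-ht generate --help) should confirm.

```shell
# Print the number of online CPU cores, falling back to 1 if it cannot be read.
# default_threads is a hypothetical helper name.
default_threads() {
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
  # Guard against empty or non-numeric output.
  case "$n" in
    ''|*[!0-9]*) n=1 ;;
  esac
  echo "$n"
}

echo "Suggested value: --bowtie-threads $(default_threads)"
```

On a shared cluster, prefer the core count your job scheduler actually allocated over the machine-wide total.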
Storage considerations:
Downloaded hg38 indexes require ~3–4 GB of disk space
Building indexes from FASTA may take 30–60 minutes depending on genome size and system resources
Troubleshooting
Issue: (base) doesn’t appear after installing miniconda
Close and reopen your terminal
If still not showing, run:
conda init
Then restart your terminal
Issue: conda: command not found
The conda installation directory may not be in your PATH. Run the installer again and ensure you answer “yes” when asked to initialize conda.
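Before re-running the installer, it may be worth checking whether conda is installed but simply not initialized. The sketch below uses a hypothetical helper, find_conda, and assumes the installer's default per-user location of ~/miniconda3; adjust the path if you installed elsewhere.

```shell
# Look for conda on PATH, then in the default per-user Miniconda location.
# find_conda is a hypothetical helper; ~/miniconda3 is the assumed install prefix.
find_conda() {
  if command -v conda >/dev/null 2>&1; then
    echo "conda is on PATH: $(command -v conda)"
  elif [ -x "$HOME/miniconda3/bin/conda" ]; then
    echo "conda is installed but not initialized; try:"
    echo "  $HOME/miniconda3/bin/conda init"
  else
    echo "conda not found; re-run the Miniconda installer"
    return 1
  fi
}

find_conda || true
```

After running conda init, close and reopen your terminal so the change takes effect.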
Issue: Environment creation fails
Ensure you have a stable internet connection
Try:
conda clean --all
Then retry:
conda env create -f environment.yml
Issue: Permission denied errors on Windows/WSL
Make sure you’re working in a directory you have write access to (like /mnt/c/Users/YourUsername/)
Issue: Cannot find environment.yml
Ensure you’re in the EXCAVATE-HT directory:
cd EXCAVATE-HT
Verify the file exists:
ls environment.yml
For additional help, please open an issue on the GitHub repository.