
Multi-Image VLA

We use VILA to train a VLA model with multi-image input.

Note

Models for multi-image or video input are still in their early stages, and the platform is actively experimenting with them. Currently, VILA can handle a maximum of six images, so we feed the observations through a sliding window of six images.
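
As an illustration, here is a minimal sketch (not part of the VILA codebase) of how such a window can be maintained, buffering only the six most recent frames:

from collections import deque

# Hypothetical observation buffer: keeps only the six most recent frames,
# matching VILA's current six-image input limit.
MAX_IMAGES = 6
window = deque(maxlen=MAX_IMAGES)

def on_new_observation(frame):
    """Append the latest frame; the oldest one is dropped automatically."""
    window.append(frame)
    return list(window)  # the (up to) six images passed to the model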

Training

Clone the repository:

git clone https://github.com/Efficient-Large-Model/VILA
cd VILA

Install VILA:

conda create -n vila python=3.10 -y
conda activate vila

pip install --upgrade pip  # enable PEP 660 support
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"

pip install git+https://github.com/huggingface/transformers@v4.36.2
cp -rv ./llava/train/transformers_replace/* ~/anaconda3/envs/vila/lib/python3.10/site-packages/transformers/models/
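
To verify the installation, the following import check can be run inside the vila environment. This is an optional sketch: llava is the package installed by pip install -e . in the VILA repo, and the expected version strings are assumptions based on the wheels installed above.

# Optional sanity check (run inside the vila env).
import flash_attn
import llava          # package installed by `pip install -e .` in the VILA repo
import transformers

print("flash-attn:", flash_attn.__version__)      # expect 2.4.2
print("transformers:", transformers.__version__)  # expect 4.36.2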

Next, generate the required training data.

Data Generation

On the remote server, run:

xvfb-run python scripts/create_traj_come.py

or

xvfb-run python scripts/create_traj_where.py

The generated dataset will be saved to .legent/dataset on the remote server.
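
To quickly confirm that data was written, a small listing script like the following can be used. This is a sketch that assumes .legent/dataset resolves under the home directory; adjust the path if your setup differs.

from pathlib import Path

# Count the files produced by the generation scripts. The output location
# (.legent/dataset) comes from the docs; placing it under the home
# directory is an assumption.
dataset_dir = Path.home() / ".legent" / "dataset"
n_files = sum(1 for p in dataset_dir.rglob("*") if p.is_file())
print(f"{n_files} files under {dataset_dir}")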
