# Florence-2 Captioning Pipeline

A high-throughput, asynchronous image-captioning pipeline built on **Florence-2 Base PromptGen**.

## Goals

- Download images from S3/HTTP concurrently
- Preprocess images (resize and normalize)
- Run batched caption generation on the GPU
- Persist captions to a database asynchronously
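
The four stages above can be wired together with `asyncio` queues. The following is only a minimal sketch of that shape, not the code in `src/`: every function name here (`download`, `preprocess`, `caption_batch`, `persist`, `run_pipeline`) is hypothetical, the model call is replaced by a stub, and the "database" is a plain dict.

```python
import asyncio

SENTINEL = None  # placed on a queue to signal that no more items will arrive


async def download(url: str) -> bytes:
    """Hypothetical stand-in for an S3/HTTP fetch."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return url.encode()


def preprocess(raw: bytes) -> bytes:
    """Hypothetical stand-in for resize/normalize."""
    return raw


def caption_batch(batch: list[bytes]) -> list[str]:
    """Hypothetical stand-in for one batched model call on the GPU."""
    return [f"caption for {item.decode()}" for item in batch]


async def persist(url: str, caption: str, db: dict[str, str]) -> None:
    """Hypothetical stand-in for an async database write."""
    await asyncio.sleep(0)
    db[url] = caption


async def downloader(urls: list[str], out_q: asyncio.Queue, concurrency: int) -> None:
    sem = asyncio.Semaphore(concurrency)  # cap the number of in-flight downloads

    async def fetch(url: str) -> None:
        async with sem:
            raw = await download(url)
        await out_q.put((url, preprocess(raw)))

    await asyncio.gather(*(fetch(u) for u in urls))
    await out_q.put(SENTINEL)


async def captioner(in_q: asyncio.Queue, out_q: asyncio.Queue, batch_size: int) -> None:
    batch: list[tuple[str, bytes]] = []
    done = False
    while not done:
        item = await in_q.get()
        if item is SENTINEL:
            done = True
        else:
            batch.append(item)
        # Flush when the batch is full, or when the stream ends with leftovers.
        if batch and (done or len(batch) >= batch_size):
            urls, images = zip(*batch)
            # Run the blocking model call in a worker thread so the loop stays responsive.
            captions = await asyncio.to_thread(caption_batch, list(images))
            for url, caption in zip(urls, captions):
                await out_q.put((url, caption))
            batch = []
    await out_q.put(SENTINEL)


async def writer(in_q: asyncio.Queue, db: dict[str, str]) -> None:
    while (item := await in_q.get()) is not SENTINEL:
        url, caption = item
        await persist(url, caption, db)


async def run_pipeline(urls: list[str], batch_size: int = 4, concurrency: int = 8) -> dict[str, str]:
    db: dict[str, str] = {}
    images: asyncio.Queue = asyncio.Queue(maxsize=2 * batch_size)  # bounded for backpressure
    captions: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        downloader(urls, images, concurrency),
        captioner(images, captions, batch_size),
        writer(captions, db),
    )
    return db


if __name__ == "__main__":
    result = asyncio.run(run_pipeline([f"img{i}.jpg" for i in range(10)]))
    print(len(result), "captions written")
```

The bounded `images` queue is one way to keep downloads from running far ahead of the GPU stage. The actual design may differ, see `implementationPlanV2.md`.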

## Project structure

- `src/`: implementation code
- `tests/`: unit/integration tests
- `todo.md`: task list
- `implementationPlanV2.md`: architecture + design notes

## Quickstart

1. Install dependencies:

```bash
pip install -r requirements.txt
```

2. Configure environment variables (see `src/config.py` for expected vars).

3. Run the pipeline (example):

```bash
python -m src.pipeline --dry-run
```
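
For step 2, the authoritative list of variables is in `src/config.py`. As an illustration only, a loader for environment-based settings might look like the sketch below. The variable names (`CAPTION_DATABASE_URL`, `CAPTION_BATCH_SIZE`, `CAPTION_DOWNLOAD_CONCURRENCY`), the defaults, and the `PipelineConfig` class are assumptions made for this example, not the names the project actually uses.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings container; field names are illustrative."""
    database_url: str
    batch_size: int
    download_concurrency: int


def load_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Build a config from environment variables (names are assumptions)."""
    env = os.environ if env is None else env

    # Fail early with a clear message if a required variable is missing.
    required = ("CAPTION_DATABASE_URL",)
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")

    return PipelineConfig(
        database_url=env["CAPTION_DATABASE_URL"],
        # Optional values fall back to illustrative defaults.
        batch_size=int(env.get("CAPTION_BATCH_SIZE", "8")),
        download_concurrency=int(env.get("CAPTION_DOWNLOAD_CONCURRENCY", "32")),
    )
```

Passing a plain dict instead of reading `os.environ` directly keeps a loader like this easy to exercise in `tests/`.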

## Notes

This repo is intended as a foundation for building a fast, asynchronous dataset-captioning tool.