# ACH Server Media Import
This repository contains a script that imports media files from an S3-compatible bucket into a PostgreSQL database. It supports both local execution (Python virtual environment) and Docker deployment via `docker-compose`.

---
## Overview
### Asset hierarchy
- **Conservatory Copy (Master)**: High-quality source (e.g., `.mov`, `.wav`). This is the primary record in the database.
- **Streaming Copy (Derivative)**: Transcoded versions (`.mp4`, `.mp3`) linked to the master.
- **Sidecar Metadata (`.json`)**: Contains technical metadata (`mediainfo` / `ffprobe`) used for validation and to determine the correct MIME type.
- **Sidecar QC (`.pdf`, `.md5`)**: Quality control and checksum files.
> **Important:** all files belonging to the same asset must share the same 12-character inventory code (e.g., `VO-UMT-14387`).
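
As an illustration of the shared-code rule, the following sketch groups object keys into assets by inventory code. It assumes that every filename starts with the code; the function name `group_by_inventory_code` and the sample keys are hypothetical and are not taken from the importer.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Inventory code pattern from the Validation Policy section, without the
# end anchor so it can match the beginning of a filename.
# ASSUMPTION: every filename starts with its inventory code.
CODE_AT_START = re.compile(r"^[VA][OC]-[A-Z0-9]{3}-\d{5}")


def group_by_inventory_code(keys):
    """Group object keys by the 12-character code that starts each filename."""
    groups = defaultdict(list)
    for key in keys:
        name = PurePosixPath(key).name
        match = CODE_AT_START.match(name)
        if match:
            groups[match.group(0)].append(key)
    return dict(groups)


# Hypothetical keys, for illustration only.
keys = [
    "BRD/VO-BRD-00001.mp4",
    "BRD/VO-BRD-00001.json",
    "BRD/VO-BRD-00001.md5",
    "DVD/VO-DVD-00012.mp3",
]
assets = group_by_inventory_code(keys)
```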
---
## Process Phases
The importer runs in three clearly separated phases (each phase is logged in detail):
### Phase 1: S3 discovery + initial validation
- List objects in the configured S3 bucket.
- Keep only allowed extensions: `.mp4`, `.mp3`, `.json`, `.pdf`, `.md5`.
- Exclude configured folders (e.g., `TEST-FOLDER-DEV/`, `DOCUMENTAZIONE_FOTOGRAFICA/`, `UMT/`).
- Validate the inventory code format and ensure the folder prefix matches the type encoded in the inventory code.
- Files failing validation are rejected **before** any database interaction.
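
The extension and folder filters could look roughly like the sketch below. It works on a plain list of keys rather than a live bucket, and it assumes that excluded folders are top-level prefixes and that extensions are compared case-insensitively. The name `keep_key` is hypothetical, and the inventory code check is left out here.

```python
from pathlib import PurePosixPath

# Values taken from the lists above.
ALLOWED_EXTENSIONS = {".mp4", ".mp3", ".json", ".pdf", ".md5"}
EXCLUDED_FOLDERS = ("TEST-FOLDER-DEV/", "DOCUMENTAZIONE_FOTOGRAFICA/", "UMT/")


def keep_key(key):
    """Return True if the key survives the folder and extension filters."""
    # ASSUMPTION: excluded folders sit at the top level of the bucket.
    if key.startswith(EXCLUDED_FOLDERS):
        return False
    # ASSUMPTION: extension matching ignores case.
    return PurePosixPath(key).suffix.lower() in ALLOWED_EXTENSIONS


# Hypothetical listing, for illustration only.
listing = [
    "BRD/VO-BRD-00001.mp4",
    "BRD/VO-BRD-00001.mov",
    "UMT/VO-UMT-14387.mp4",
    "DVD/VO-DVD-00012.JSON",
]
candidates = [key for key in listing if keep_key(key)]
```

With this listing, the `.mov` file and everything under `UMT/` are dropped before any further validation.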
### Phase 2: Database cross-reference + filtering
- Load existing filenames from the database.
- Skip files already represented in the DB, including sidecar records.
- Build the final list of S3 objects to parse.
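
A minimal sketch of the filtering step, assuming the database stores bare filenames (no folder prefix). The function name `keys_to_import` is hypothetical, and the database query that produces `existing_filenames` is not shown.

```python
from pathlib import PurePosixPath


def keys_to_import(candidate_keys, existing_filenames):
    """Drop every key whose filename is already recorded in the database."""
    # A set gives constant-time membership checks for large tables.
    existing = set(existing_filenames)
    return [
        key for key in candidate_keys
        if PurePosixPath(key).name not in existing
    ]


# Hypothetical data, for illustration only.
candidates = ["BRD/VO-BRD-00001.mp4", "BRD/VO-BRD-00001.json", "DVD/VO-DVD-00012.mp3"]
already_in_db = ["VO-BRD-00001.mp4", "VO-BRD-00001.json"]
to_parse = keys_to_import(candidates, already_in_db)
```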
### Phase 3: Parse & insert
- Read and validate sidecars (`.json`, `.md5`, `.pdf`) alongside the media file.
- Use metadata (from `mediainfo` / `ffprobe`) to derive the **master MIME type** and enforce container rules.
- Insert new records into the database (unless `ACH_DRY_RUN=true`).
---
## Validation Policy
The import pipeline enforces strict validation to prevent bad data from entering the database.
### Inventory Code & Folder Prefix
- Expected inventory code format: `^[VA][OC]-[A-Z0-9]{3}-\d{5}$`.
- The folder prefix (e.g., `BRD/`, `DVD/`, `FILE/`) must match the code type.
- If the prefix does not match the inventory code, the file is rejected in Phase 1.
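
The two checks can be sketched as follows. The regular expression is the one given above. Everything else is an assumption made for illustration: that the "type" is the middle segment of the code, that the filename starts with the code, and the `FOLDER_TO_TYPE` table itself, which is hypothetical and does not reproduce the importer's real mapping.

```python
import re

# Pattern copied from the rule above.
INVENTORY_CODE = re.compile(r"^[VA][OC]-[A-Z0-9]{3}-\d{5}$")

# HYPOTHETICAL mapping from top-level folder to the type segment of the code.
# The importer's real mapping is not documented here.
FOLDER_TO_TYPE = {"BRD": "BRD", "DVD": "DVD"}


def validate_key(key):
    """Return (is_valid, reason) for a single object key."""
    folder = key.split("/", 1)[0] if "/" in key else ""
    filename = key.rsplit("/", 1)[-1]
    # ASSUMPTION: the first 12 characters of the filename are the code.
    code = filename[:12]
    if not INVENTORY_CODE.fullmatch(code):
        return False, "invalid inventory code"
    # ASSUMPTION: the middle segment of the code encodes the type.
    type_segment = code.split("-")[1]
    if FOLDER_TO_TYPE.get(folder) != type_segment:
        return False, "folder prefix does not match inventory code"
    return True, "ok"
```

For example, `validate_key("DVD/VO-BRD-00001.mp4")` is rejected under this table because the folder and the type segment disagree.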
### Safe Run (`ACH_SAFE_RUN`)
- When `ACH_SAFE_RUN=true`, **any warning during Phase 3 causes an immediate abort**.
- This prevents partial inserts when the importer detects inconsistent or already-present data.
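
One way to picture the abort-on-warning behaviour is the sketch below. The exception class, the helper names, and the `already_present` flag are all hypothetical; the real importer may signal warnings differently.

```python
class SafeRunAbort(RuntimeError):
    """Raised when a warning occurs while safe-run mode is enabled."""


def report_warning(message, safe_run, warnings):
    """Abort immediately in safe-run mode, otherwise record the warning."""
    if safe_run:
        raise SafeRunAbort(message)
    warnings.append(message)


def process(items, safe_run):
    """Process items in order; a warning skips the item or aborts the run."""
    warnings = []
    processed = []
    for item in items:
        # HYPOTHETICAL warning condition, for illustration only.
        if item.get("already_present"):
            report_warning(f"{item['name']}: already present", safe_run, warnings)
            continue
        processed.append(item["name"])
    return processed, warnings
```

In this sketch, a run with `safe_run=False` skips the offending item and continues, while `safe_run=True` stops at the first warning. Rolling back anything already written would still be the caller's job, for example by wrapping the run in a single database transaction.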
### MIME Type Determination
- The MIME type for master files is derived from the JSON sidecar metadata (`mediainfo` / `ffprobe`), not from the streaming derivative extension.
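
As a rough illustration, the sketch below reads the container format from a sidecar and maps it to a MIME type. It assumes the sidecar follows the layout of `mediainfo --Output=JSON` (a `media.track` list whose `General` track carries a `Format` field); the actual sidecar layout used by this project may differ. The `FORMAT_TO_MIME` table and the function name are hypothetical.

```python
import json

# HYPOTHETICAL mapping from the General track's "Format" value to a MIME type.
FORMAT_TO_MIME = {
    "Wave": "audio/x-wav",
    "Matroska": "video/x-matroska",
}


def master_mime_type(sidecar_text):
    """Return the MIME type for the master, or None if it cannot be derived."""
    data = json.loads(sidecar_text)
    # ASSUMPTION: mediainfo-style JSON with a "General" track.
    for track in data.get("media", {}).get("track", []):
        if track.get("@type") == "General":
            return FORMAT_TO_MIME.get(track.get("Format"))
    return None


# Hypothetical sidecar content, for illustration only.
sample = json.dumps(
    {"media": {"track": [{"@type": "General", "Format": "Wave"},
                         {"@type": "Audio", "Format": "PCM"}]}}
)
mime = master_mime_type(sample)
```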
---
## Quick Start (Local)
### Prerequisites
- Python 3.8+
- Virtual environment support (`venv`)
### Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### Run
```bash
python main.py
```
---
## Docker (docker-compose)
The project includes a `docker-compose.yml` with an `app` service (container name `ACH_server_media_importer`). It reads environment variables from `.env` and mounts a `logs` volume.
### Build & run
```bash
docker compose up -d --build
```
### Logs
```bash
docker compose logs -f app
```
### Run inside the container (from the host)
If you want to execute the importer manually inside the running container (for debugging or one-off runs), you can use either of the following:
```bash
# Using docker compose (recommended)
docker compose exec app python /app/main.py
# Or using docker exec with the container name
docker exec -it ACH_server_media_importer python /app/main.py
```
### Stop
```bash
docker compose stop
```
### Rebuild (clean)
```bash
docker compose down --volumes --rmi local
docker compose up -d --build
```
---
## Configuration
Configuration is driven by `.env` and `config.py`. Key variables include:
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `BUCKET_NAME`
- `DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, `DB_PORT`
- `ACH_DRY_RUN` (`true` / `false`)
- `ACH_SAFE_RUN` (`true` / `false`)
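
The boolean flags could be read along these lines. This is a sketch, not the content of `config.py`: the helper name `env_flag` and the choice to treat anything other than `true` as false are assumptions.

```python
import os


def env_flag(name, default=False, environ=os.environ):
    """Read a true/false environment variable as a Python bool."""
    raw = environ.get(name)
    if raw is None:
        return default
    # ASSUMPTION: only the literal "true" (any case) enables the flag.
    return raw.strip().lower() == "true"


# Example with an explicit mapping instead of the real environment.
example_env = {"ACH_DRY_RUN": "true", "ACH_SAFE_RUN": "false"}
dry_run = env_flag("ACH_DRY_RUN", environ=example_env)
safe_run = env_flag("ACH_SAFE_RUN", environ=example_env)
```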
---
## Troubleshooting
- If Docker does not pick up changes, ensure `docker compose up -d --build` is run from the repo root.
- Inspect runtime errors via `docker compose logs -f app`.