ACH Server Media Import
This repository contains a script that imports media files from an S3-compatible bucket into a PostgreSQL database. It supports both local execution (Python virtual environment) and Docker deployment via docker-compose.
Overview
Asset hierarchy
- Conservatory Copy (Master): High-quality source (e.g., .mov, .wav). This is the primary record in the database.
- Streaming Copy (Derivative): Transcoded versions (.mp4, .mp3) linked to the master.
- Sidecar Metadata (.json): Contains technical metadata (mediainfo/ffprobe) used for validation and to determine the correct MIME type.
- Sidecar QC (.pdf, .md5): Quality control and checksum files.
Important: all files belonging to the same asset must share the same 12-character inventory code (e.g., VO-UMT-14387).
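As a rough illustration of the shared-code rule, the sketch below groups object keys by the first 12 characters of their file name. It assumes every file name starts with the inventory code; the function name and sample keys are hypothetical and are not taken from the repository.

```python
import posixpath
from collections import defaultdict

CODE_LENGTH = 12  # inventory codes are 12 characters long, e.g. VO-UMT-14387


def group_by_inventory_code(keys):
    """Group object keys by the first 12 characters of their file name.

    ASSUMPTION: each file name begins with its inventory code.
    """
    groups = defaultdict(list)
    for key in keys:
        filename = posixpath.basename(key)
        groups[filename[:CODE_LENGTH]].append(key)
    return dict(groups)


# Hypothetical keys, for illustration only.
keys = [
    "BRD/VO-BRD-00001.mp4",
    "BRD/VO-BRD-00001.json",
    "BRD/VO-BRD-00001.md5",
    "DVD/VO-DVD-00042.mp4",
]
assets = group_by_inventory_code(keys)
```

Each entry in `assets` then holds the streaming copy and sidecars that belong to one asset.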
Process Phases
The importer runs in three clearly separated phases (each phase is logged in detail):
Phase 1 – S3 discovery + initial validation
- List objects in the configured S3 bucket.
- Keep only allowed extensions: .mp4, .mp3, .json, .pdf, .md5.
- Exclude configured folders (e.g., TEST-FOLDER-DEV/, DOCUMENTAZIONE_FOTOGRAFICA/, UMT/).
- Validate the inventory code format and ensure the folder prefix matches the type encoded in the inventory code.
- Files failing validation are rejected before any database interaction.
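The extension and folder filters in Phase 1 could look roughly like the following. This is a minimal sketch that works on a plain list of keys rather than a live bucket listing. The helper name `keep_key`, the case-insensitive extension match, and the treatment of excluded folders as top-level prefixes are assumptions.

```python
ALLOWED_EXTENSIONS = (".mp4", ".mp3", ".json", ".pdf", ".md5")
EXCLUDED_PREFIXES = ("TEST-FOLDER-DEV/", "DOCUMENTAZIONE_FOTOGRAFICA/", "UMT/")


def keep_key(key):
    """Return True if the key is outside excluded folders and has an allowed extension."""
    # ASSUMPTION: excluded folders are matched as top-level key prefixes.
    if key.startswith(EXCLUDED_PREFIXES):
        return False
    # ASSUMPTION: extensions are compared case-insensitively.
    return key.lower().endswith(ALLOWED_EXTENSIONS)


# Hypothetical keys, for illustration only.
keys = [
    "BRD/VO-BRD-00001.mp4",
    "BRD/VO-BRD-00001.mov",   # extension not in the allowed list
    "UMT/VO-UMT-14387.mp4",   # inside an excluded folder
    "DVD/VO-DVD-00042.json",
]
kept = [k for k in keys if keep_key(k)]
```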
Phase 2 – Database cross-reference + filtering
- Load existing filenames from the database.
- Skip files already represented in the DB, including sidecar records.
- Build the final list of S3 objects to parse.
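The cross-reference step amounts to a set-membership check. The sketch below assumes the database stores bare file names (without the folder prefix), which the README does not confirm; `select_new_keys` is a hypothetical helper.

```python
import posixpath


def select_new_keys(s3_keys, existing_filenames):
    """Return the keys whose file name is not already recorded in the database."""
    known = set(existing_filenames)  # set lookup keeps each check constant-time
    return [k for k in s3_keys if posixpath.basename(k) not in known]


# Hypothetical data, for illustration only.
s3_keys = [
    "BRD/VO-BRD-00001.mp4",
    "BRD/VO-BRD-00001.json",
    "DVD/VO-DVD-00042.mp4",
]
existing = ["VO-BRD-00001.mp4", "VO-BRD-00001.json"]
to_parse = select_new_keys(s3_keys, existing)
```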
Phase 3 – Parse & insert
- Read and validate sidecars (.json, .md5, .pdf) alongside the media file.
- Use metadata (from mediainfo/ffprobe) to derive the master MIME type and enforce container rules.
- Insert new records into the database (unless ACH_DRY_RUN=true).
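One way to picture the dry-run switch is a guard around the insert call. The sketch below is illustrative only: `insert_records` and the `insert_fn` callback are hypothetical stand-ins, not the functions in db_utils.py.

```python
def insert_records(records, insert_fn, dry_run):
    """Call insert_fn for each record unless dry_run is set; return the number written."""
    written = 0
    for record in records:
        if dry_run:
            # In dry-run mode, only report what would have been inserted.
            print(f"[dry-run] would insert {record['filename']}")
            continue
        insert_fn(record)
        written += 1
    return written


# Hypothetical usage: a list stands in for the database.
stored = []
records = [{"filename": "VO-BRD-00001.mp4"}]
skipped_count = insert_records(records, stored.append, dry_run=True)
written_count = insert_records(records, stored.append, dry_run=False)
```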
Validation Policy
The import pipeline enforces strict validation to prevent bad data from entering the database.
Inventory Code & Folder Prefix
- Expected inventory code format (regular expression): ^[VA][OC]-[A-Z0-9]{3}-\d{5}$
- The folder prefix (e.g., BRD/, DVD/, FILE/) must match the code type.
- If the prefix does not match the inventory code, the file is rejected in Phase 1.
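The two checks above can be sketched as follows. The regular expression is the one given in this README. The mapping from code type to folder prefix is a hypothetical placeholder, as is the assumption that the type is the middle segment of the code; the real rules live in the repository's validation code.

```python
import posixpath
import re

INVENTORY_CODE_RE = re.compile(r"^[VA][OC]-[A-Z0-9]{3}-\d{5}$")

# ASSUMPTION: hypothetical mapping from the middle code segment to a folder prefix.
PREFIX_BY_TYPE = {"BRD": "BRD/", "DVD": "DVD/"}


def validate_key(key, prefix_by_type=PREFIX_BY_TYPE):
    """Return None if the key passes both checks, else a short rejection reason."""
    code = posixpath.basename(key)[:12]
    if not INVENTORY_CODE_RE.match(code):
        return f"invalid inventory code: {code!r}"
    expected_prefix = prefix_by_type.get(code.split("-")[1])
    if expected_prefix is None or not key.startswith(expected_prefix):
        return f"folder prefix does not match code {code}"
    return None


# Hypothetical keys, for illustration only.
ok = validate_key("BRD/VO-BRD-00001.mp4")
wrong_folder = validate_key("DVD/VO-BRD-00001.mp4")
bad_code = validate_key("BRD/XX-BRD-00001.mp4")
```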
Safe Run (ACH_SAFE_RUN)
- When ACH_SAFE_RUN=true, any warning during Phase 3 causes an immediate abort.
- This prevents partial inserts when the importer detects inconsistent or already-present data.
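The abort-on-warning behaviour can be illustrated with a small loop. This is a sketch under assumptions: `SafeRunAbort`, `process_all`, and the convention that a processing step returns a warning string (or None) are hypothetical and do not describe the actual error_handler.py.

```python
class SafeRunAbort(RuntimeError):
    """Raised when a warning occurs while safe-run mode is active."""


def process_all(items, process_one, safe_run):
    """Run process_one on each item; it returns a warning string or None."""
    warnings = []
    for item in items:
        warning = process_one(item)
        if warning is None:
            continue
        if safe_run:
            # Stop at the first warning instead of continuing with later items.
            raise SafeRunAbort(f"{item}: {warning}")
        warnings.append((item, warning))
    return warnings


def check(name):
    # Hypothetical check: pretend "b" is already present in the database.
    return "already present" if name == "b" else None


collected = process_all(["a", "b", "c"], check, safe_run=False)
```

Whether records inserted before the abort are rolled back depends on how the importer manages database transactions, which this README does not specify.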
MIME Type Determination
- The MIME type for master files is derived from the JSON sidecar metadata (mediainfo/ffprobe), not from the streaming derivative extension.
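To make the idea concrete, the sketch below reads the General track of a mediainfo-style JSON sidecar and looks up a MIME type. The format-to-MIME table is illustrative and incomplete, the function name is hypothetical, and the importer's real rules (including ffprobe handling) may differ.

```python
import json

# ASSUMPTION: illustrative mapping only; keys are (Format, Format_Profile).
MIME_BY_FORMAT = {
    ("MPEG-4", "QuickTime"): "video/quicktime",
    ("MPEG-4", None): "video/mp4",
    ("Wave", None): "audio/wav",
}


def master_mime_from_sidecar(sidecar_text):
    """Return a MIME type from a mediainfo-style JSON sidecar, or None if unknown."""
    data = json.loads(sidecar_text)
    tracks = data.get("media", {}).get("track", [])
    general = next((t for t in tracks if t.get("@type") == "General"), None)
    if general is None:
        return None
    fmt = general.get("Format")
    profile = general.get("Format_Profile")
    # Prefer an exact (format, profile) match, then fall back to the format alone.
    return MIME_BY_FORMAT.get((fmt, profile)) or MIME_BY_FORMAT.get((fmt, None))


# Hypothetical sidecar content, for illustration only.
sidecar = json.dumps(
    {"media": {"track": [
        {"@type": "General", "Format": "MPEG-4", "Format_Profile": "QuickTime"},
    ]}}
)
mime = master_mime_from_sidecar(sidecar)
```

Note that the streaming copy's .mp4 extension plays no part in the lookup; only the sidecar content is consulted.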
Quick Start (Local)
Prerequisites
- Python 3.8+
- Virtual environment support (venv)
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Run
python main.py
Docker (docker-compose)
The project includes a docker-compose.yml with an app service (container name ACH_server_media_importer). It reads environment variables from .env and mounts a logs volume.
Build & run
docker compose up -d --build
Logs
docker compose logs -f app
Stop
docker compose stop
Rebuild (clean)
docker compose down --volumes --rmi local
docker compose up -d --build
Configuration
Configuration is driven by .env and config.py. Key variables include:
- AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, BUCKET_NAME
- DB_HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT
- ACH_DRY_RUN (true/false)
- ACH_SAFE_RUN (true/false)
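As an illustration of how such variables might be read, the sketch below parses a few of them from a mapping (defaulting to the process environment). It is not the contents of config.py: the function names, the returned dictionary keys, and the fallback port 5432 (the PostgreSQL default) are assumptions.

```python
import os


def env_flag(name, env=None, default=False):
    """Interpret an environment variable as a boolean ('true' / 'false')."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def load_settings(env=None):
    """Collect a subset of the settings into a plain dictionary."""
    env = os.environ if env is None else env
    return {
        "bucket": env.get("BUCKET_NAME"),
        "db_host": env.get("DB_HOST"),
        "db_port": int(env.get("DB_PORT", "5432")),  # ASSUMPTION: default port
        "dry_run": env_flag("ACH_DRY_RUN", env),
        "safe_run": env_flag("ACH_SAFE_RUN", env),
    }


# Hypothetical values, for illustration only.
settings = load_settings(
    {"BUCKET_NAME": "example-bucket", "DB_HOST": "localhost", "ACH_DRY_RUN": "True"}
)
```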
Troubleshooting
- If Docker does not pick up changes, ensure docker compose up -d --build is run from the repo root.
- Inspect runtime errors via docker compose logs -f app.