feat: Enhance media import functionality with centralized MIME type management and improved validation
- Added EXTENSION_MIME_MAP to config.py for centralized MIME type mapping.
- Updated countfiles.py to use the centralized MIME types for file counting.
- Refactored db_utils.py to import EXTENSION_MIME_MAP and added a get_mime_from_mediainfo function for better MIME type determination.
- Enhanced file_utils.py with prefix matching for S3 files and JSON references.
- Improved logging in logging_config.py for better error tracking.
- Added security checks in main.py for development environment imports.
- Updated validation_utils.py with stricter extension validation and improved error handling.
- Introduced developer instructions in .github/copilot-instructions.md for project setup and coding standards.
This commit is contained in:
parent
6030cc3f84
commit
ccd700a4a8
@ -0,0 +1,65 @@
# ACH Server Media Import - Agent Instructions
Guidelines and standards for the ACH Media Import project.

## Project Overview

This project is a Python-based utility that imports media files from an S3-compatible bucket into a PostgreSQL database, enforcing specific naming conventions and metadata validation.

## Technical Stack

- **Language**: Python 3.8+
- **Database**: PostgreSQL (via `psycopg2`)
- **Cloud Storage**: AWS S3/S3-compatible storage (via `boto3`)
- **Containerization**: Docker & Docker Compose
- **Environment**: Managed via `.env` and `config.py`

## Architecture & Modular Design

The project uses a utility-based modular architecture orchestrated by `main.py`.

- [main.py](main.py): Entry point and workflow orchestrator.
- [s3_utils.py](s3_utils.py): S3 client operations and bucket listing.
- [db_utils.py](db_utils.py): Database connectivity and SQL execution.
- [validation_utils.py](validation_utils.py): Pattern matching and business logic validation.
- [logging_config.py](logging_config.py): Centralized logging configuration.
- [error_handler.py](error_handler.py): Error handling and notifications.
- [email_utils.py](email_utils.py): SMTP integration for alerts.

## Domain Logic: Inventory Codes

The core validation revolves around "Inventory Codes", which MUST follow a strict 12-character format:

- `^[VA][OC]-[A-Z0-9]{3}-\d{5}$`
- Examples: `VO-UMT-14387`, `AC-MCC-12345`.
- Files not matching this pattern in S3 are logged but skipped.
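A minimal sketch of this check. The pattern is the one above; the helper name is illustrative, and it truncates to the first 12 characters of the base name, mirroring what `validate_inventory_code` in `validation_utils.py` is described as doing:

```python
import re

# Inventory code pattern from the bullet above.
INVENTORY_CODE_RE = re.compile(r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$')

def is_valid_inventory_code(filename: str) -> bool:
    """Illustrative check: the code is the first 12 characters of the base name."""
    return bool(INVENTORY_CODE_RE.match(filename[:12]))
```

For example, `is_valid_inventory_code('VO-UMT-14387_H264.mp4')` is true, while a name whose first two characters fall outside `[VA][OC]` is rejected.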
## Development Workflows

### Environment Setup

- **Windows**: Use `. .venv\Scripts\Activate.ps1`
- **Linux/macOS**: Use `source .venv/bin/activate`
- **Dependency installation**: `pip install -r requirements.txt`

### Local Execution

- **Run script**: `python main.py`
- **Verify Configuration**: Ensure `.env` is populated with `DB_`, `AWS_`, and `SMTP_` variables.

### Docker Operations

- **Build/Up**: `docker compose up -d --build`
- **Logs**: `docker compose logs -f app`
- **Stop**: `docker compose stop`

## Coding Standards & Conventions

### Logging

- Use the custom logger from `logging_config.py`.
- **Log Levels**: Use `logging.INFO`, `logging.WARNING`, and the custom `CUSTOM_ERROR_LEVEL` (35) via `error_handler.py`.
- Logs are rotated and stored in the `logs/` directory.
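Registering a level between `WARNING` (30) and `ERROR` (40) can be sketched as follows; the value 35 comes from the bullet above, while the registration details are an assumption about what `logging_config.py` / `error_handler.py` actually do:

```python
import logging

# CUSTOM_ERROR_LEVEL sits between WARNING (30) and ERROR (40), per the doc above.
CUSTOM_ERROR_LEVEL = 35
logging.addLevelName(CUSTOM_ERROR_LEVEL, "CUSTOM_ERROR")

logger = logging.getLogger("ach_media_import")
logger.setLevel(logging.INFO)
# Emitting at the custom level works like any other level.
logger.log(CUSTOM_ERROR_LEVEL, "example message at the custom level")
```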
### Error Handling

- Wrap critical operations that should trigger notifications in try-except blocks that call `error_handler.notify_error()`.
- Avoid silent failures; ensure errors are logged to the appropriate log file.
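A sketch of this convention, with a hypothetical `notify_error` stand-in (the real interface lives in `error_handler.py` and may differ):

```python
import logging

def notify_error(message: str) -> None:
    """Hypothetical stand-in for error_handler.notify_error()."""
    logging.error(message)

def import_one_file(key: str) -> None:
    try:
        # Critical operation would go here; simulated failure for illustration.
        raise IOError(f"simulated S3 failure for {key}")
    except Exception as exc:
        notify_error(f"import failed for {key}: {exc}")  # no silent failures
        raise  # re-raise so the caller can decide whether to abort
```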
### Configuration

- Access settings exclusively via the `config.py` module's dictionaries: `db_config`, `aws_config`, `ach_config`.
- Never hardcode credentials or endpoints.
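Illustrative only — the exact keys of `db_config` are an assumption based on the `DB_*` variables mentioned elsewhere in this document:

```python
import os

# Sketch of the config.py pattern: settings live in dictionaries, sourced
# from the environment, never hardcoded at call sites.
db_config = {
    "host": os.getenv("DB_HOST", "localhost"),
    "database": os.getenv("DB_NAME", "ach"),
    "user": os.getenv("DB_USER", "ach_user"),
    "port": int(os.getenv("DB_PORT", "5432")),
}

def describe_target() -> str:
    # Consumers read from the dict instead of embedding credentials/endpoints.
    return f"{db_config['user']}@{db_config['host']}:{db_config['port']}/{db_config['database']}"
```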
## Related Files

- [query-sql.md](query-sql.md): Reference for database schema and SQL logic.
- [requirements.txt](requirements.txt): Project dependencies.
- [docker-compose.yml](docker-compose.yml): Deployment configuration.

178 README.md
@ -1,138 +1,126 @@
# Project Setup
# ACH Server Media Import

## Setting up a Virtual Environment

This repository contains a script that imports media files from an S3-compatible bucket into a PostgreSQL database. It supports both local execution (Python virtual environment) and Docker deployment via `docker-compose`.

1. **Create a virtual environment:**

---

### For Linux/macOS:
```bash
python3 -m venv .venv
```

## Overview

### For Windows:

## ACH-server-import-media

### Asset hierarchy

- **Conservatory Copy (Master)**: High-quality source (e.g., `.mov`, `.wav`). This is the primary record in the database.
- **Streaming Copy (Derivative)**: Transcoded versions (`.mp4`, `.mp3`) linked to the master.
- **Sidecar Metadata (`.json`)**: Contains technical metadata (`mediainfo` / `ffprobe`) used for validation and to determine the correct MIME type.
- **Sidecar QC (`.pdf`, `.md5`)**: Quality control and checksum files.

This repository contains a script to import media files from an S3-compatible bucket into a database. It supports both local execution (virtual environment) and Docker-based deployment via `docker-compose`.

> **Important:** all files belonging to the same asset must share the same 12-character inventory code (e.g., `VO-UMT-14387`).

Contents

- `main.py` - main import script
- `docker-compose.yml` - docker-compose service for running the importer in a container
- `requirements.txt` - Python dependencies
- `config.py`, `.env` - configuration and environment variables

---

Prerequisites

- Docker & Docker Compose (or Docker Desktop)

## Process Phases

The importer runs in three clearly separated phases (each phase is logged in detail):

### Phase 1 – S3 discovery + initial validation

- List objects in the configured S3 bucket.
- Keep only allowed extensions: `.mp4`, `.mp3`, `.json`, `.pdf`, `.md5`.
- Exclude configured folders (e.g., `TEST-FOLDER-DEV/`, `DOCUMENTAZIONE_FOTOGRAFICA/`, `UMT/`).
- Validate the inventory code format and ensure the folder prefix matches the type encoded in the inventory code.
- Files failing validation are rejected **before** any database interaction.
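The extension/folder filter described above can be sketched as follows; the values come from the bullets, while the helper name is illustrative (the real logic lives in `main.py`):

```python
import os

# Values taken from the Phase 1 description above.
ALLOWED_EXTENSIONS = {'.mp4', '.mp3', '.json', '.pdf', '.md5'}
EXCLUDED_FOLDERS = {'TEST-FOLDER-DEV/', 'DOCUMENTAZIONE_FOTOGRAFICA/', 'UMT/'}

def passes_phase1_filter(key: str) -> bool:
    """Sketch: drop keys in excluded folders or with disallowed extensions."""
    if any(key.startswith(folder) for folder in EXCLUDED_FOLDERS):
        return False
    return os.path.splitext(key)[1].lower() in ALLOWED_EXTENSIONS
```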
### Phase 2 – Database cross-reference + filtering

- Load existing filenames from the database.
- Skip files already represented in the DB, including sidecar records.
- Build the final list of S3 objects to parse.

### Phase 3 – Parse & insert

- Read and validate sidecars (`.json`, `.md5`, `.pdf`) alongside the media file.
- Use metadata (from `mediainfo` / `ffprobe`) to derive the **master mime type** and enforce container rules.
- Insert new records into the database (unless `ACH_DRY_RUN=true`).

---

## Validation Policy

The import pipeline enforces strict validation to prevent bad data from entering the database.

### Inventory Code & Folder Prefix

- Expected inventory code format: `^[VA][OC]-[A-Z0-9]{3}-\d{5}$`.
- The folder prefix (e.g., `BRD/`, `DVD/`, `FILE/`) must match the code type.
- If the prefix does not match the inventory code, the file is rejected in Phase 1.

### Safe Run (`ACH_SAFE_RUN`)

- When `ACH_SAFE_RUN=true`, **any warning during Phase 3 causes an immediate abort**.
- This prevents partial inserts when the importer detects inconsistent or already-present data.
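A sketch of the abort-on-warning behaviour. The `os.getenv(...).lower()` parsing mirrors the style used in `main.py`, but the default value and function names here are assumptions for illustration:

```python
import os

def safe_run_enabled() -> bool:
    # Environment flags arrive as strings; compare case-insensitively.
    # The 'true' default shown here is an assumption.
    return os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true'

def phase3_insert(records, warnings):
    """Sketch: abort Phase 3 immediately if any warning occurred in safe-run mode."""
    if safe_run_enabled() and warnings:
        raise RuntimeError(f"safe run abort: {len(warnings)} warning(s) during Phase 3")
    return len(records)  # number of records that would be inserted
```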
### MIME Type Determination

- The MIME type for master files is derived from the JSON sidecar metadata (`mediainfo` / `ffprobe`), not from the streaming derivative extension.

---

## Quick Start (Local)

### Prerequisites

- Python 3.8+
- Git (optional)
- Virtual environment support (`venv`)

Quick local setup (virtual environment)

### Setup

Linux / macOS

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Windows (PowerShell)

```powershell
python -m venv .venv
. .venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

### Run

Running locally

1. Ensure your configuration is available (see `config.py` or provide a `.env` file with the environment variables used by the project).
2. Run the script (from the project root):

Linux / macOS

```bash
python main.py
```

Windows (PowerShell)

```powershell
& .venv\Scripts\python.exe main.py
```
---

Docker Compose

## Docker (docker-compose)

This project includes a `docker-compose.yml` with a service named `app` (container name `ACH_server_media_importer`). The compose file reads environment variables from `.env` and mounts a `logs` named volume.

The project includes a `docker-compose.yml` with an `app` service (container name `ACH_server_media_importer`). It reads environment variables from `.env` and mounts a `logs` volume.

Build and run (detached)

### Build & run

```bash
# Docker Compose v2 syntax (recommended); run from the repository root
docker compose up -d --build

# OR if your environment uses the v1 binary
# docker-compose up -d --build
```

Show logs

### Logs

```bash
# Follow logs for the 'app' service
docker compose logs -f app

# Or use the container name
docker logs -f ACH_server_media_importer
```

Stop / start / down

### Stop

```bash
# Stop containers
docker compose stop

# Start again
docker compose start

# Take down containers and network
docker compose down
```

Rebuild when already running

### Rebuild (clean)

There are two safe, common ways to rebuild a service when the containers are already running:

1) Rebuild in-place and recreate changed containers (recommended for most changes):

```bash
# Rebuild images and recreate services in the background
docker compose up -d --build
```

This tells Compose to rebuild the image(s) and recreate containers for services whose image or configuration changed.

2) Full clean rebuild (use when you need to remove volumes or ensure a clean state):

```bash
# Stop and remove containers, networks, and optionally volumes & images, then rebuild
docker compose down --volumes --rmi local
docker compose up -d --build
```

Notes

- `docker compose up -d --build` will recreate containers for services that need updating; it does not destroy named volumes unless you pass `--volumes` to `down`.
- If you need to execute a shell inside the running container:

```bash
# run a shell inside the 'app' service
docker compose exec app /bin/sh
# or (if bash is available)
docker compose exec app /bin/bash
```

---

## Configuration

Environment and configuration

- Provide sensitive values via a `.env` file (the `docker-compose.yml` already references `.env`).
- Typical variables: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `BUCKET_NAME`, `DB_HOST`, `DB_NAME`, `DB_USER`, `SMTP_SERVER`, etc.

Troubleshooting

- If Compose fails to pick up code changes, ensure your local Dockerfile `COPY` commands include the source files and that `docker compose up -d --build` is run from the repository root.
- Use `docker compose logs -f app` to inspect runtime errors.

Configuration is driven by `.env` and `config.py`. Key variables include:

- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `BUCKET_NAME`
- `DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, `DB_PORT`
- `ACH_DRY_RUN` (`true` / `false`)
- `ACH_SAFE_RUN` (`true` / `false`)
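These boolean flags arrive from the environment as strings. A sketch of the parsing — the helper name is illustrative, while the `'true'` default for `ACH_DRY_RUN` matches the `os.getenv` pattern in `main.py`:

```python
import os

def env_flag(name: str, default: str = 'false') -> bool:
    """Parse a true/false environment variable such as ACH_DRY_RUN."""
    return os.getenv(name, default).lower() == 'true'

# ACH_DRY_RUN defaults to 'true' in main.py, so an accidental run does not write to the DB.
dry_run = env_flag('ACH_DRY_RUN', default='true')
```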
---

## Troubleshooting

- If Docker does not pick up changes, ensure `docker compose up -d --build` is run from the repo root.
- Inspect runtime errors via `docker compose logs -f app`.

16 config.py
@ -66,5 +66,21 @@ def load_config():
    return aws_config, db_config, ach_config, bucket_name, ach_variables

EXTENSION_MIME_MAP = {
    '.avi': 'video/x-msvideo',
    '.mov': 'video/quicktime',
    '.wav': 'audio/wav',
    '.mp4': 'video/mp4',
    '.m4v': 'video/mp4',
    '.mp3': 'audio/mp3',
    '.mxf': 'application/mxf',
    '.mpg': 'video/mpeg',
    '.aif': 'audio/aiff',
    '.wmv': 'video/x-ms-asf',
    '.m4a': 'audio/mp4',
}

MIME_TYPES = sorted(list(set(EXTENSION_MIME_MAP.values())))


# Consider using a class for a more structured approach (optional)
@ -130,33 +130,13 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):

    # connect to database
    conn = psycopg2.connect(**db_config)
    cur = conn.cursor()
    # function count_files that are wav and mov in db
    # Map file extensions (include leading dot) to mime types
    EXTENSION_MIME_MAP = {
        '.avi': 'video/x-msvideo',
        '.mov': 'video/mov',
        '.wav': 'audio/wav',
        '.mp4': 'video/mp4',
        '.m4v': 'video/mp4',
        '.mp3': 'audio/mp3',
        '.mxf': 'application/mxf',
        '.mpg': 'video/mpeg',
    }

    # populate mime_type list with all relevant MediaInfo/MIME values
    mime_type = [
        'video/x-msvideo',  # .avi
        'video/mov',        # .mov
        'audio/wav',        # .wav
        'video/mp4',        # .mp4, .m4v
        'audio/mp3',        # .mp3
        'application/mxf',  # .mxf
        'video/mpeg',       # .mpg
    ]
    # Use centralized mime types from config
    from config import EXTENSION_MIME_MAP, MIME_TYPES

    logging.info(f"Mime types for counting files: {mime_type}")
    logging.info(f"Mime types for counting files: {MIME_TYPES}")

    all_files_on_db = count_files(cur, mime_type, '*', False)
    all_files_on_db = count_files(cur, MIME_TYPES, '*', False)
    mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
    mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
    mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)
99 db_utils.py
@ -34,18 +34,7 @@ from email_utils import handle_error

import json
import os
import config

# Map file extensions (include leading dot) to mime types
EXTENSION_MIME_MAP = {
    '.avi': 'video/x-msvideo',
    '.mov': 'video/mov',
    '.wav': 'audio/wav',
    '.mp4': 'video/mp4',
    '.m4v': 'video/mp4',
    '.mp3': 'audio/mp3',
    '.mxf': 'application/mxf',
    '.mpg': 'video/mpeg',
}
from config import EXTENSION_MIME_MAP

def get_mime_for_extension(extension: str) -> str:
    """Return the mime type for an extension. Accepts with or without leading dot.

@ -58,6 +47,76 @@ def get_mime_for_extension(extension: str) -> str:
        extension = f'.{extension}'
    return EXTENSION_MIME_MAP.get(extension.lower(), 'application/octet-stream')


def get_mime_from_mediainfo(ach_variables: dict) -> str:
    """Determine a MIME type from the JSON sidecar mediainfo.

    This is used to capture the *master* format (conservatory copy) even when
    the stream copy on S3 is a different container (e.g., _H264.mp4).

    If mediainfo is missing or cannot be mapped, fall back to extension-based mapping.
    """
    # Prefer the master (conservative copy) extension when it is explicitly available.
    # In some cases MediaInfo reports "MPEG-4" for .mov containers, so the extension
    # is a more reliable hint for the correct mime type.
    conservative_ext = ach_variables.get('conservative_copy_extension')
    if conservative_ext and conservative_ext.lower() == '.mov':
        return 'video/quicktime'

    # Also check the actual conservative copy object key (from JSON @ref). This is the
    # name that will be stored in the DB as the master file, so it should drive the MIME.
    conservative_copy_key = ach_variables.get('objectKeys', {}).get('conservative_copy', '')
    if conservative_copy_key and os.path.splitext(conservative_copy_key)[1].lower() == '.mov':
        return 'video/quicktime'

    # Try to find the General track format from mediainfo
    try:
        mediainfo = ach_variables.get('custom_data_in', {}).get('mediainfo', {})
        tracks = mediainfo.get('media', {}).get('track', [])
        for track in tracks:
            if track.get('@type', '') == 'General':
                format_value = track.get('Format', '')
                if format_value:
                    # Map common MediaInfo format values to MIME types
                    mapping = {
                        'AVI': 'video/x-msvideo',
                        'MOV': 'video/quicktime',
                        'QuickTime': 'video/quicktime',
                        'MPEG-4': 'video/mp4',
                        'MP4': 'video/mp4',
                        'MXF': 'application/mxf',
                        'MPEG': 'video/mpeg',
                        'MPEG-PS': 'video/mpeg',
                        'MPEG-TS': 'video/MP2T',
                        'MPEG Audio': 'audio/mpeg',
                        'MPEG Audio/Layer 3': 'audio/mpeg',
                        'AAC': 'audio/aac',
                        'PCM': 'audio/wav',
                        'WAV': 'audio/wav',
                        'AIFF': 'audio/aiff',
                        'FLAC': 'audio/flac',
                    }
                    # Do a case-insensitive match
                    for k, v in mapping.items():
                        if format_value.lower() == k.lower():
                            return v
                    # Try a fuzzy match based on known substrings
                    if 'avi' in format_value.lower():
                        return 'video/x-msvideo'
                    if 'mp4' in format_value.lower():
                        return 'video/mp4'
                    if 'mpeg' in format_value.lower():
                        return 'video/mpeg'
                    if 'wav' in format_value.lower() or 'pcm' in format_value.lower():
                        return 'audio/wav'
                    if 'mp3' in format_value.lower():
                        return 'audio/mpeg'
        # Fall back to extension-based mapping when metadata doesn't yield a mime
        extension = ach_variables.get('extension')
        return get_mime_for_extension(extension)
    except Exception:
        return get_mime_for_extension(ach_variables.get('extension'))

def get_distinct_filenames_from_db():
    """Retrieve distinct digital file names from the Postgres DB.

@ -94,15 +153,19 @@ def check_inventory_in_db(s3_client, cur, base_name):
    # Define the pattern for the inventory code
    media_tipology_A = ['MCC', 'OA4', 'DAT']
    # FOR FILE add to media_tipology_A for readability
    media_tipology_A += ['M4A', 'AIF']  # add for "FILE" folders 04112025

    # TODO add other tipologies: AVI, M4V, MOV, MP4, MXF, MPG (done 04112025)
    media_tipology_V = [
        'OV1', 'OV2', 'UMT', 'VHS', 'HI8', 'VD8', 'BTC', 'DBT', 'IMX', 'DVD',
        'CDR', 'MDV', 'DVC', 'HDC', 'BRD', 'CDV',
        'AVI', 'M4V', 'MOV', 'MP4', 'MXF', 'MPG'  # add for "file" folders 04112025
        'CDR', 'MDV', 'DVC', 'HDC', 'BRD', 'CDV'
    ]
    # FOR FILE add to media_tipology_V for readability
    media_tipology_V += ['AVI', 'M4V', 'MOV', 'MP4', 'MXF', 'MPG', 'WMV']  # add for "FILE" folders 04112025

    # list of known mime types (derived from EXTENSION_MIME_MAP)
    mime_type = list({v for v in EXTENSION_MIME_MAP.values()})
    # Use centralized mime types from config
    from config import MIME_TYPES

    try:
        logging.info(f"SUPPORT TYPOLOGY : {base_name[3:6]}")
@ -270,8 +333,8 @@ def add_file_record_and_relationship(s3_client, cur, base_name, ach_variables):

    file_availability_dict = 7  # Place Holder
    # add a new file record for the "copia conservativa"
    ach_variables['custom_data_in']['media_usage'] = 'master'  # can be "copia conservativa"
    # determine master mime type from the file extension
    master_mime_type = get_mime_for_extension(ach_variables.get('extension'))
    # determine master mime type using the JSON sidecar metadata (preferred)
    master_mime_type = get_mime_from_mediainfo(ach_variables)

    new_file_id = add_file_record(
        cur,
@ -286,6 +286,30 @@ def extract_and_validate_file_info(file_contents, file, ach_variables):

    else:
        logging.info(f"ach_file_fullpath '{basename_fullpath}' matches JSON ffprobe file name '{basename_ffprobe}'.")

    # Check folder prefixes (e.g. FILE/ vs DBT/) match between S3 file and JSON refs
    def _extract_prefix(path: str) -> str:
        if not path:
            return ''
        # Normalize separators to '/' so we can reliably split on a single character
        normalized = path.replace('\\', '/').lstrip('/')
        return normalized.split('/', 1)[0] if '/' in normalized else normalized

    prefix_fullpath = _extract_prefix(ach_variables['file_fullpath'])
    prefix_mediainfo = _extract_prefix(json_ref_mediainfo_path)
    prefix_ffprobe = _extract_prefix(json_ref_ffprobe_path)

    if prefix_fullpath != prefix_mediainfo or prefix_fullpath != prefix_ffprobe:
        logging.warning(
            "Prefix mismatch for S3 file '%s': S3 prefix='%s' (fullpath='%s') vs JSON prefixes (mediainfo='%s' [%s], ffprobe='%s' [%s]).",
            ach_variables.get('file_fullpath'),
            prefix_fullpath,
            ach_variables.get('file_fullpath'),
            prefix_mediainfo,
            json_ref_mediainfo_path,
            prefix_ffprobe,
            json_ref_ffprobe_path,
        )

    if basename_fullpath != basename_mediainfo and basename_fullpath != basename_ffprobe:
        logging.error(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
        raise ValueError(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
@ -33,8 +33,8 @@ def _create_timed_handler(path: str, level=None, when='midnight', interval=1, ba

    """
    _ensure_dir_for_file(path)
    handler = TimedRotatingFileHandler(path, when=when, interval=interval, backupCount=backupCount, encoding='utf-8')
    # Use a readable suffix for rotated files (handler will append this after the filename)
    handler.suffix = "%Y%m%d_%H%M%S"
    # Default behavior is to append to the existing log file.
    # Rotation happens when 'when' occurs.
    if fmt:
        handler.setFormatter(fmt)
    if level is not None:
@ -58,6 +58,8 @@ def setup_logging():

    error_log_path = os.getenv('ERROR_LOG_FILE_PATH', "./logs/ACH_media_import_errors.log")
    warning_log_path = os.getenv('WARNING_LOG_FILE_PATH', "./logs/ACH_media_import_warnings.log")
    if os.getenv('WARING_LOG_FILE_PATH'):  # Fix typo in .env if present
        warning_log_path = os.getenv('WARING_LOG_FILE_PATH')
    info_log_path = os.getenv('INFO_LOG_FILE_PATH', "./logs/ACH_media_import_info.log")

    # Create three handlers: info (all), warning (warning+), error (error+)
141 main.py
@ -12,7 +12,7 @@ from error_handler import handle_general_error, handle_file_not_found_error, han

from file_utils import is_file_empty
from db_utils import count_files, get_distinct_filenames_from_db
from dotenv import load_dotenv
from validation_utils import validate_inventory_code, analyze_pattern_match, validate_icode_extension
from validation_utils import validate_inventory_code, analyze_pattern_match, validate_icode_extension, list_s3_not_in_db, validate_mp4_file, validate_mp3_file
import config
import psycopg2
@ -33,6 +33,25 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):

    logging.info(f"bucket_name: {bucket_name}")

    # SECURITY CHECK: If DRY_RUN is false and ENV is development, ask for confirmation
    dry_run_env = os.getenv('ACH_DRY_RUN', 'true').lower()
    ach_env = os.getenv('ACH_ENV', 'development').lower()

    if dry_run_env == 'false' and ach_env == 'development':
        print("\n" + "!" * 60)
        print("!!! SECURITY CHECK: RUNNING IMPORT ON DEVELOPMENT ENVIRONMENT !!!")
        print(f"DB_HOST: {db_config.get('host')}")
        print(f"DB_NAME: {db_config.get('database')}")
        print(f"DB_USER: {db_config.get('user')}")
        print(f"DB_PORT: {db_config.get('port')}")
        print("!" * 60 + "\n")

        user_input = input(f"Please type the DB_NAME '{db_config.get('database')}' to proceed: ")
        if user_input != db_config.get('database'):
            print("Action aborted by user. Database name did not match.")
            logging.error("Process aborted: User failed to confirm DB_NAME for development import.")
            return

    # Ensure timing variables are always defined so later error-email logic
    # won't fail if an exception is raised before end_time/elapsed_time is set.
    start_time = time.time()
@ -44,6 +63,20 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):

    try:
        logging.info("Starting the main process...")

        # ---------------------------------------------------------------------
        # PHASE 1: S3 OBJECT DISCOVERY + INITIAL VALIDATION
        #
        # 1) List objects in the configured S3 bucket.
        # 2) Filter objects by allowed extensions and excluded folders.
        # 3) Validate the inventory code format (e.g. VA-C01-12345) and ensure the
        #    folder prefix matches the code type (e.g. "BRD" folder for BRD code).
        # 4) Reject files that violate naming conventions before any DB interaction.
        #
        # This phase is intentionally descriptive so the workflow can be understood
        # from logs even if the function names are not immediately clear.
        # ---------------------------------------------------------------------
        logging.info("PHASE 1: S3 object discovery + initial validation")

        # Helper to make spaces visible in filenames for logging (replace ' ' with open-box char)
        def _visible_spaces(name: str) -> str:
            try:
@ -60,7 +93,9 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):

        # Define valid extensions and excluded folders
        valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
        # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'FILE/'}
        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/'}
        # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/', 'FILE/', 'DVD/', 'UMT/'}
        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/', 'UMT/'}
        # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/',}
        # included_folders = {'FILE/'}  # uncomment this to NOT use excluded folders
        # included_folders = {'TEST-FOLDER-DEV/'}  # uncomment this to NOT use excluded folders
@ -98,9 +133,33 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):

            # s3_file_names contains the object keys (strings), not dicts.
            base_name = os.path.basename(s3file)
            logging.info(f"S3 Base name: {base_name}")

            # extract folder prefix and media type from inventory code
            folder_prefix = os.path.dirname(s3file).rstrip('/')
            media_type_in_code = base_name[3:6] if len(base_name) >= 6 else None

            # Generic sanity check: prefix (folder name) should equal the media type in the code
            is_valid_prefix = (folder_prefix == media_type_in_code)

            # Special folder allowance rules
            folder_allowances = {
                'DVD': ['DVD', 'BRD'],
                'FILE': ['M4V', 'AVI', 'MOV', 'MP4', 'MXF', 'AIF', 'WMV', 'M4A', 'MPG'],
            }

            if folder_prefix in folder_allowances:
                if media_type_in_code in folder_allowances[folder_prefix]:
                    is_valid_prefix = True

            if folder_prefix and media_type_in_code and not is_valid_prefix:
                logging.warning(f"Prefix mismatch for {s3file}: Folder '{folder_prefix}' does not match code type '{media_type_in_code}'")
                # we only warn here and still proceed with standard validation

            if validate_inventory_code(base_name):  # truncated to first 12 chars in the function
                logging.info(f"File {base_name} matches pattern.")
                # if valid, check the extension too
                # only check inventory code extension for media files (.mp4, .mp3)
                # sidecars (.json, .pdf, .md5) only need their base validated
                if s3file.lower().endswith(('.mp4', '.mp3')):
                    if not validate_icode_extension(s3file):
                        logging.warning(f"File {s3file} has invalid extension for its inventory code.")
                        continue  # skip adding this file to validated contents
@ -112,6 +171,15 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):

                folder_name = os.path.dirname(s3file)
                logging.warning(f"File {s3file} in folder {folder_name} does not match pattern.")

        # ---------------------------------------------------------------------
        # PHASE 2: DATABASE CROSS-REFERENCE + FILTERING
        #
        # 1) Fetch existing filenames from the database.
        # 2) Skip files already represented in the DB (including sidecar records).
        # 3) Produce the final list of S3 object keys that should be parsed/inserted.
        # ---------------------------------------------------------------------
        logging.info("PHASE 2: Database cross-reference + filtering")

        # filter_s3_files_not_in_db
        # --- Get all DB filenames in one call ---
        db_file_names = get_distinct_filenames_from_db()
@ -138,6 +206,14 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):

        total_files = len(filtered_file_names)
        logging.info(f"Total number of the valid (mp3, mp4, md5, json, pdf) files after DB filter: {total_files}")

        # Log the files that need to be updated (those not yet in DB)
        if total_files > 0:
            logging.info("List of files to be updated in the database:")
            for f in filtered_file_names:
                logging.info(f"  - {f}")
        else:
            logging.info("No new files found to update in the database.")

        # Count files with .mp4 and .mp3 extensions
        mp4_count = sum(1 for file in s3_file_names if file.endswith('.mp4'))
        mp3_count = sum(1 for file in s3_file_names if file.endswith('.mp3'))
@ -167,6 +243,10 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):

        for file in s3_file_names:
            if file.endswith('.mp4'):
                validate_mp4_file(file)  # validation_utils.py - also checks for the _H264 suffix
            elif file.endswith('.mp3'):
                validate_mp3_file(file)  # validation_utils.py
        # Count by CODE media type (e.g. OA4, MCC) and log the counts for each type

        # If ACH_SAFE_RUN is 'false' we enforce strict mp4/pdf parity and abort
        # when mismatched. Default is 'true' which skips this abort to allow
@@ -177,7 +257,7 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
        logging.error("Number of .mp4 files is not equal to number of .pdf files")
        # MOD 20251103
        # add a check to find the missing pdf or mp4 files and report them
        # use filtered_file_names to find missing files
        # store tuples (source_file, expected_counterpart) for clearer logging
        missing_pdfs = []  # list of (mp4_file, expected_pdf)
        missing_mp4s = []  # list of (pdf_file, expected_mp4)
@@ -187,7 +267,7 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
            base_name = os.path.splitext(file)[0]
            # if the mp4 is an H264 variant (e.g. name_H264.mp4) remove the suffix
            if base_name.endswith('_H264'):
                # must check if it has an extra number for DBT, DVD and [FILE]
                base_name = base_name[:-5]
            expected_pdf = base_name + '.pdf'
            if expected_pdf not in filtered_file_names:
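The mp4-to-pdf pairing logic in the hunk above can be sketched as a standalone helper. This is a minimal sketch, not the project's actual code: `find_missing_pdfs` is a hypothetical name, and the real loop builds `(source_file, expected_counterpart)` tuples inline while also handling the DBT/DVD numbering noted in the comment.

```python
import os

def find_missing_pdfs(file_names):
    """Pair each .mp4 with its expected .pdf sidecar; return unmatched pairs."""
    names = set(file_names)  # O(1) membership tests
    missing = []
    for f in file_names:
        if not f.endswith('.mp4'):
            continue
        base = os.path.splitext(f)[0]
        if base.endswith('_H264'):
            base = base[:-5]  # strip the H264 variant suffix before pairing
        expected_pdf = base + '.pdf'
        if expected_pdf not in names:
            missing.append((f, expected_pdf))
    return missing

print(find_missing_pdfs(['A_H264.mp4', 'A.pdf', 'B.mp4']))
# → [('B.mp4', 'B.pdf')]
```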
@@ -221,13 +301,9 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
    if mp3_count + mp4_count != json_count:
        logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .json files")
        # add a check of mp3 + mp4 vs json and md5 files, like above for mp4 and pdf
        logging.error("Abort Import Process due to missing files")
        # search which files don't match TODO
        raise ValueError("Inconsistent file counts mp3+mp4 vs json")
@@ -237,11 +313,20 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
        # search which files don't match TODO
        raise ValueError("Inconsistent file counts mp3+mp4 vs md5")
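The two parity checks above (mp3+mp4 vs json, and mp3+mp4 vs md5) follow the same shape, so they can be sketched as one helper. This is an illustrative sketch only; `check_sidecar_parity` and its dict argument are hypothetical, while the real code compares the individual `*_count` variables directly.

```python
def check_sidecar_parity(counts):
    """Return a list of parity errors between media counts and sidecar counts."""
    errors = []
    media_total = counts.get('mp3', 0) + counts.get('mp4', 0)
    for sidecar in ('json', 'md5'):
        # every media file is expected to ship with one .json and one .md5
        if media_total != counts.get(sidecar, 0):
            errors.append(f"mp3+mp4 ({media_total}) != {sidecar} ({counts.get(sidecar, 0)})")
    return errors

print(check_sidecar_parity({'mp3': 2, 'mp4': 3, 'json': 5, 'md5': 4}))
# → ['mp3+mp4 (5) != md5 (4)']
```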
    # ---------------------------------------------------------------------
    # PHASE 3: PARSE & INSERT INTO DATABASE
    #
    # 1) Process each remaining S3 object and validate its associated metadata.
    # 2) Insert new records into the database (unless running in DRY_RUN).
    # 3) Report counts of successful uploads, warnings, and errors.
    # ---------------------------------------------------------------------
    logging.info("PHASE 3: Parse S3 objects and insert new records into the database")

    # Try to parse S3 files
    try:
        # If DRY RUN is set to True, the files will not be uploaded to the database
        if os.getenv('ACH_DRY_RUN', 'true') == 'false':
            uploaded_files_count, warning_files_count, error_files_count = parse_s3_files(s3_client, filtered_file_names, ach_variables, excluded_folders)
        else:
            logging.warning("DRY RUN is set to TRUE - No files will be added to the database")
            # set the tuples to zero
@@ -258,33 +343,13 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
    # connect to database
    conn = psycopg2.connect(**db_config)
    cur = conn.cursor()
    # count_files: count DB records for the given mime types (e.g. wav, mov)
    # Use centralized mime types from config (replaces the former local
    # EXTENSION_MIME_MAP dict and mime_type list)
    from config import EXTENSION_MIME_MAP, MIME_TYPES

    logging.info(f"Mime types for counting files: {MIME_TYPES}")

    all_files_on_db = count_files(cur, MIME_TYPES, '*', False)
    mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
    mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
    mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)
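The hunk above replaces a locally defined extension-to-MIME dict with imports from `config`. A sketch of what `config.py` might export, using the extension/MIME values shown in the removed lines of this diff; deriving `MIME_TYPES` from the map with `dict.fromkeys` is an assumption, not necessarily how the project defines it.

```python
# Map file extensions (include leading dot) to mime types,
# values taken from the removed local dict in this diff.
EXTENSION_MIME_MAP = {
    '.avi': 'video/x-msvideo',
    '.mov': 'video/mov',
    '.wav': 'audio/wav',
    '.mp4': 'video/mp4',
    '.m4v': 'video/mp4',
    '.mp3': 'audio/mp3',
    '.mxf': 'application/mxf',
    '.mpg': 'video/mpeg',
}

# Flat list of mime values, de-duplicated while preserving order
# (.mp4 and .m4v share 'video/mp4').
MIME_TYPES = list(dict.fromkeys(EXTENSION_MIME_MAP.values()))

print(MIME_TYPES)
```

Centralizing the map means `countfiles.py`, `db_utils.py`, and `main.py` can no longer drift apart when a new container format is added.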
@@ -300,18 +365,18 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
    logging.info(f"Number of .mp4 files in the database: {mp4_files_on_db} and S3: {mp4_count}")

    # Compare the DB and S3 name lists and report the files missing from the DB
    missing_mp4s = [f for f in s3_file_names if f.endswith('.mp4') and f not in db_file_names]
    # if missing_mp4s is empty do not emit a warning
    if missing_mp4s:
        logging.warning(f"Missing {len(missing_mp4s)} .mp4 files in DB compared to S3: {missing_mp4s}")

    logging.info(f"Number of .wav files in the database: {wav_files_on_db}")
    logging.info(f"Number of .mp3 files in the database: {mp3_files_on_db} and S3: {mp3_count}")

    missing_mp3s = [f for f in s3_file_names if f.endswith('.mp3') and f not in db_file_names]
    # if missing_mp3s is empty do not emit a warning
    if missing_mp3s:
        logging.warning(f"Missing {len(missing_mp3s)} .mp3 files in DB compared to S3: {missing_mp3s}")

    logging.info(f"Number of .avi files in the database: {avi_files_on_db}")
    logging.info(f"Number of .m4v files in the database: {m4v_files_on_db}")
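The two comprehensions above share one pattern: filter S3 keys by extension, then drop those already in the DB. A sketch under the assumption that order of the S3 listing should be preserved (`missing_in_db` is a hypothetical helper name). Converting `db_names` to a `set` first is a deliberate choice: `f not in db_file_names` against a list is a linear scan per file, which gets slow on large buckets.

```python
def missing_in_db(s3_names, db_names, ext):
    """S3 object keys with the given extension that have no DB record."""
    db = set(db_names)  # set lookup keeps the scan O(n) instead of O(n*m)
    return [f for f in s3_names if f.endswith(ext) and f not in db]

print(missing_in_db(['a.mp4', 'b.mp4', 'c.mp3'], ['a.mp4'], '.mp4'))
# → ['b.mp4']
```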
19  s3_utils.py
@@ -31,10 +31,16 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
    # log ach_variables
    logging.info(f"ach_variables: {ach_variables}")

    # ---------------------------------------------------------------------
    # PHASE 3: PARSE & INSERT INTO DATABASE (DETAILS)
    #
    # 3.1) Filter out excluded prefixes and keep only files we care about.
    # 3.2) Validate each media file alongside its related sidecars (.json, .md5, .pdf).
    # 3.3) Cross-check the inventory code in the database and insert new records.
    # ---------------------------------------------------------------------
    logging.info("PHASE 3: Parse & insert - starting detailed file processing")

    try:
        logging.info(f"Starting to parse S3 files from bucket {bucket_name}...")
        # Ensure db_config is not None
        if db_config is None:
            raise ValueError("Database configuration is not loaded")
@@ -67,6 +73,9 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
            if result:
                # logging.warning(f"File {file} already exists in the database.")
                warning_files_count += 1
                if os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true':
                    logging.error("ACH_SAFE_RUN=true: aborting Phase 3 due to warnings (file already exists in DB): %s", file)
                    raise ValueError("ACH_SAFE_RUN=true: aborting due to warnings in Phase 3")
                continue

            ach_variables['file_fullpath'] = file  # is the S3 object key
@@ -76,12 +85,13 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
            ach_variables['objectKeys']['media'] = file
            ach_variables['objectKeys']['pdf'] = f"{os.path.splitext(file)[0]}.pdf"
            ach_variables['objectKeys']['pdf'] = ach_variables['objectKeys']['pdf'].replace('_H264', '')
            from config import EXTENSION_MIME_MAP
            if file.endswith('.mp4'):
                ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.mov"  # removing _H264 is done later
            elif file.endswith('.mp3'):
                ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.wav"
            else:
                logging.error(f"Unsupported file type: {file}")
                error_files_count += 1
                continue
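The object-key derivation above (pdf sidecar without the `_H264` suffix, plus a `.mov` or `.wav` conservative copy) can be sketched as a pure function. A hypothetical helper for illustration only; the real code mutates `ach_variables['objectKeys']` in place inside the loop.

```python
import os

def build_object_keys(file):
    """Derive sidecar object keys for a media file (hypothetical helper)."""
    base, ext = os.path.splitext(file)
    keys = {
        'media': file,
        # the pdf sidecar is named after the master, without the H264 suffix
        'pdf': base.replace('_H264', '') + '.pdf',
    }
    if ext == '.mp4':
        keys['conservative_copy'] = base + '.mov'  # _H264 removed later
    elif ext == '.mp3':
        keys['conservative_copy'] = base + '.wav'
    else:
        raise ValueError(f"Unsupported file type: {file}")
    return keys

print(build_object_keys('VO-DVD-12345_H264.mp4'))
```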
@@ -116,6 +126,9 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
            else:
                logging.warning("Could not retrieve file size for %s.", file)
                warning_files_count += 1
                if os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true':
                    logging.error("ACH_SAFE_RUN=true: aborting Phase 3 due to warnings (missing file size): %s", file)
                    raise ValueError("ACH_SAFE_RUN=true: aborting due to warnings in Phase 3")
                continue  # Skip to the next file in the loop

            logging.info("Start Validating files for %s...", base_name)
57  utils.py
@@ -24,8 +24,10 @@ def check_video_info(media_info):
        # If the parent directory is 'FILE' accept multiple container types
        if parent_dir.lower() == 'file':
            # video allowed extensions (case-insensitive)
            video_allowed_extensions = ['.mov', '.avi', '.m4v', '.mp4', '.mxf', '.mpg', '.mpeg', '.wmv']
            if not any(file_name.lower().endswith(ext) for ext in video_allowed_extensions):
                return False, "The file is not a .mov, .avi, .m4v, .mp4, .mxf, .mpg, .mpeg or .wmv file."

        # Map file extensions to lists of acceptable general formats (video)
        general_formats = {
@@ -58,18 +60,23 @@ def check_video_info(media_info):
        # Outside FILE/ directory require .mov specifically
        if not file_name.lower().endswith('.mov'):
            return False, "The file is not a .mov file."

        # Strict master MOV rule: track[1] must be ProRes
        tracks = media_info.get('media', {}).get('track', [])
        if len(tracks) <= 1:
            return False, "No track 1 found."

        track_1 = tracks[1]  # track[1] should represent the video stream
        logging.info(f"Track 1: {track_1}")
        if track_1.get('@type', '') != 'Video':
            return False, "Track 1 is not a video track."
        if track_1.get('Format', '') != 'ProRes':
            return False, "Track 1 format is not ProRes."
        if track_1.get('Format_Profile', '') != '4444':
            return False, "Track 1 format profile is not 4444."

        return True, "The file is a .mov master with ProRes track 1."
    except Exception as e:
        return False, f"Error processing the content: {e}"
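The strict master-MOV rule above depends on the shape of MediaInfo's JSON output, which is worth seeing in isolation. A self-contained sketch of the same checks against a stubbed MediaInfo structure; the nesting (`media` → `track` list, `@type`, `Format`, `Format_Profile` keys) mirrors what the diff's code reads, but the stub values are illustrative only.

```python
def check_prores_master(media_info):
    """Sketch of the strict rule: track[1] must be a ProRes 4444 video track."""
    tracks = media_info.get('media', {}).get('track', [])
    if len(tracks) <= 1:
        return False, "No track 1 found."
    t1 = tracks[1]  # track[0] is the General track in MediaInfo JSON
    if t1.get('@type', '') != 'Video':
        return False, "Track 1 is not a video track."
    if t1.get('Format', '') != 'ProRes':
        return False, "Track 1 format is not ProRes."
    if t1.get('Format_Profile', '') != '4444':
        return False, "Track 1 format profile is not 4444."
    return True, "The file is a .mov master with ProRes track 1."

good = {'media': {'track': [{'@type': 'General'},
                            {'@type': 'Video', 'Format': 'ProRes', 'Format_Profile': '4444'}]}}
print(check_prores_master(good)[0])
# → True
```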
@@ -77,10 +84,32 @@ def check_video_info(media_info):

# result, message = check_audio_info(json_content)
def check_audio_info(media_info):
    try:
        # Determine source filename (from JSON) and its parent folder
        file_name = media_info.get('media', {}).get('@ref', '')
        logging.info(f"File name in JSON: {file_name}")
        parent_dir = os.path.basename(os.path.dirname(file_name))

        # If the file lives under FILE/, allow MP3/WAV/M4A/AIF as valid audio containers
        if parent_dir.lower() == 'file':
            audio_allowed_extensions = ['.wav', '.mp3', '.m4a', '.aif']
            if not any(file_name.lower().endswith(ext) for ext in audio_allowed_extensions):
                return False, f"The file is not one of the allowed audio containers: {', '.join(audio_allowed_extensions)}."

            # For WAV, do the strict Wave/PCM validation
            if file_name.lower().endswith('.wav'):
                tracks = media_info.get('media', {}).get('track', [])
                if len(tracks) > 1:
                    track_1 = tracks[1]
                    if track_1.get('@type', '') == 'Audio' and track_1.get('Format', '') == 'PCM' and track_1.get('SamplingRate', '') == '96000' and track_1.get('BitDepth', '') == '24':
                        return True, "The file is a .wav file with Wave format in track 1."
                    else:
                        return False, f"Track 1 format is not Wave. Format: {track_1.get('Format', '')}, SamplingRate: {track_1.get('SamplingRate', '')}, BitDepth: {track_1.get('BitDepth', '')}"
                return False, "No track 1 found."

            # For MP3/M4A we accept the file without strict Wave validation
            return True, "The file is an accepted audio container under FILE/ (mp3/m4a/wav)."

        # Outside FILE/ directory require .wav specifically
        if not file_name.lower().endswith('.wav'):
            return False, "The file is not a .wav file."

        # Check if track 1's format is Wave
@@ -110,17 +110,39 @@ def validate_icode_extension(file_inventory_code):
        'BTC': r'_\d{4}',
        'OA4': r'_\d{2}',
        'DVD': r'_\d{2}',
        'BRD': r'_\d{2}',
        'MCC': r'_[AB]',
        'DBT': r'_\d{4}',
        'M4V': r'_\d{2}',
        'AVI': r'_\d{2}',
        'MOV': r'_\d{2}',
        'MP4': r'_\d{2}',
        'MXF': r'_\d{2}',
        'MPG': r'_\d{2}'
    }

    if not isinstance(file_inventory_code, str) or file_inventory_code == '':
        logging.warning("Empty or non-string inventory code provided to validate_icode_extension")
        return False

    # security: ignore any folder prefix or path components that might be
    # passed in from external sources.
    file_inventory_code = os.path.basename(file_inventory_code)

    # remove common file extensions and only remove _H264 if it's an .mp4 file
    if file_inventory_code.lower().endswith('.mp4'):
        file_inventory_code = file_inventory_code.replace("_H264", "")

    file_inventory_code = os.path.splitext(file_inventory_code)[0]

    # Enforce maximum length (12 base + up to 5 extension chars)
    if len(file_inventory_code) > 17:
        logging.warning("Inventory code '%s' exceeds maximum allowed length (17).", file_inventory_code)
        # Only raise the error if DRY RUN is false; otherwise, just log it as a warning
        if os.getenv('ACH_DRY_RUN', 'true').lower() == 'false':
            raise ValueError("Inventory code with extension exceeds maximum length of 17 characters.")
        else:
            return False

    # Validate base first (first 12 chars). If base invalid -> reject.
    base = file_inventory_code[:12]
@@ -132,6 +154,22 @@ def validate_icode_extension(file_inventory_code):
    support_type = base[3:6]
    extension = file_inventory_code[12:]

    if extension == '':
        logging.info("Extension for '%s' is empty (valid).", file_inventory_code)
        return True

    expected_ext_pattern = file_type_to_regex.get(support_type)
    if expected_ext_pattern is None:
        logging.warning("Unsupported type '%s' for extension validation in '%s'.", support_type, file_inventory_code)
        return False

    if not re.fullmatch(expected_ext_pattern, extension):
        logging.warning("Extension '%s' does not match expected pattern '%s' for type '%s'.", extension, expected_ext_pattern, support_type)
        return False

    logging.info("Inventory code with extension '%s' is valid.", file_inventory_code)
    return True
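The validation above splits a code like `VO-DVD-12345_12` into a 12-character base and a suffix, then matches the suffix against a per-type regex keyed on characters 3-6 of the base. A trimmed-down, self-contained sketch (the dict here keeps only three of the types from the diff, and `icode_extension_ok` is a hypothetical name that skips the logging, path-stripping and length checks of the real function):

```python
import re

# Trimmed subset of the per-type extension rules from the diff.
FILE_TYPE_TO_REGEX = {
    'BTC': r'_\d{4}',
    'DVD': r'_\d{2}',
    'MCC': r'_[AB]',
}

def icode_extension_ok(code):
    """Validate the suffix after the 12-char base, e.g. 'VO-DVD-12345_12'."""
    base, extension = code[:12], code[12:]
    if extension == '':
        return True  # a bare inventory code without suffix is valid
    pattern = FILE_TYPE_TO_REGEX.get(base[3:6])  # support type, e.g. 'DVD'
    return pattern is not None and re.fullmatch(pattern, extension) is not None

print(icode_extension_ok('VO-DVD-12345_12'))   # two-digit suffix for DVD
# → True
print(icode_extension_ok('VO-BTC-12345_12'))   # BTC requires four digits
# → False
```

Using `re.fullmatch` rather than `re.match` matters here: it rejects trailing junk such as `_12x`, which `re.match` would accept.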

def analyze_extension_pattern(file_inventory_code):
    """Analyze extension and base-length issues for a full inventory code.
@@ -338,11 +376,13 @@ if __name__ == "__main__":

    # validate_icode_extension (valid and invalid)
    {"name": "validate_icode_extension no ext", "fn": lambda: validate_icode_extension("VO-DVD-12345"), "expect": True},
    {"name": "validate_icode_extension BARE prefix", "fn": lambda: validate_icode_extension("BTC/VO-DVD-12345"), "expect": True},
    {"name": "validate_icode_extension BTC valid", "fn": lambda: validate_icode_extension("VO-BTC-12345_1234"), "expect": True},
    {"name": "validate_icode_extension DVD valid", "fn": lambda: validate_icode_extension("VO-DVD-12345_12"), "expect": True},
    {"name": "validate_icode_extension MCC valid", "fn": lambda: validate_icode_extension("VO-MCC-12345_A"), "expect": True},
    {"name": "validate_icode_extension unsupported type", "fn": lambda: validate_icode_extension("VO-XYZ-12345_12"), "expect": False},
    {"name": "validate_icode_extension too long extension (raises)", "fn": lambda: validate_icode_extension("VO-DVD-12345_12345"), "expect_exception": ValueError},
    {"name": "validate_icode_extension prefix too long (raises)", "fn": lambda: validate_icode_extension("XYZ/VO-DVD-12345_12345"), "expect_exception": ValueError},
    {"name": "validate_icode_extension BTC invalid pattern", "fn": lambda: validate_icode_extension("VO-BTC-12345_12"), "expect": False},

    # mp4/mp3 validators - return lists