feat: Enhance media import functionality with centralized MIME type management and improved validation

- Added EXTENSION_MIME_MAP to config.py for centralized MIME type mapping.
- Updated countfiles.py to utilize centralized MIME types for file counting.
- Refactored db_utils.py to import EXTENSION_MIME_MAP and added get_mime_from_mediainfo function for better MIME type determination.
- Enhanced file_utils.py with prefix matching for S3 files and JSON references.
- Improved logging in logging_config.py for better error tracking.
- Added security checks in main.py for development environment imports.
- Updated validation_utils.py to include stricter extension validation and improved error handling.
- Introduced new instructions for developers in .github/copilot-instructions.md for project setup and coding standards.
Author: MSVstudios
Date: 2026-03-15 15:04:50 +01:00
Commit: ccd700a4a8 (parent 6030cc3f84)
11 changed files with 500 additions and 215 deletions

.github/copilot-instructions.md (new file, 65 lines)

@@ -0,0 +1,65 @@
# ACH Server Media Import - Agent Instructions
Guidelines and standards for the ACH Media Import project.
## Project Overview
This project is a Python-based utility that imports media files from an S3-compatible bucket into a PostgreSQL database, enforcing specific naming conventions and metadata validation.
## Technical Stack
- **Language**: Python 3.8+
- **Database**: PostgreSQL (via `psycopg2`)
- **Cloud Storage**: AWS S3/S3-compatible storage (via `boto3`)
- **Containerization**: Docker & Docker Compose
- **Environment**: Managed via `.env` and `config.py`
## Architecture & Modular Design
The project uses a utility-based modular architecture orchestrated by `main.py`.
- [main.py](main.py): Entry point and workflow orchestrator.
- [s3_utils.py](s3_utils.py): S3 client operations and bucket listing.
- [db_utils.py](db_utils.py): Database connectivity and SQL execution.
- [validation_utils.py](validation_utils.py): Pattern matching and business logic validation.
- [logging_config.py](logging_config.py): Centralized logging configuration.
- [error_handler.py](error_handler.py): Error handling and notifications.
- [email_utils.py](email_utils.py): SMTP integration for alerts.
## Domain Logic: Inventory Codes
The core validation revolves around "Inventory Codes" which MUST follow a strict 12-character format:
- `^[VA][OC]-[A-Z0-9]{3}-\d{5}$`
- Examples: `VO-UMT-14387`, `AC-MCC-67890`.
- Files not matching this pattern in S3 are logged but skipped.
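As a standalone illustration (not code from the repository; the helper name is ours), the pattern above can be checked against the first 12 characters of a base name with Python's `re` module:

```python
import re

# Strict 12-character inventory code, e.g. VO-UMT-14387
INVENTORY_CODE_RE = re.compile(r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$')

def is_valid_inventory_code(name: str) -> bool:
    """Check the first 12 characters of a file base name against the pattern."""
    return bool(INVENTORY_CODE_RE.match(name[:12]))
```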
## Development Workflows
### Environment Setup
- **Windows**: Use `. .venv\Scripts\Activate.ps1`
- **Linux/macOS**: Use `source .venv/bin/activate`
- **Dependency installation**: `pip install -r requirements.txt`
### Local Execution
- **Run script**: `python main.py`
- **Verify Configuration**: Ensure `.env` is populated with `DB_`, `AWS_`, and `SMTP_` variables.
### Docker Operations
- **Build/Up**: `docker compose up -d --build`
- **Logs**: `docker compose logs -f app`
- **Stop**: `docker compose stop`
## Coding Standards & Conventions
### Logging
- Use the custom logger from `logging_config.py`.
- **Log Levels**: Use `logging.INFO`, `logging.WARNING`, and the custom `CUSTOM_ERROR_LEVEL` (35) via `error_handler.py`.
- Logs are rotated and stored in the `logs/` directory.
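A custom numeric level like this is normally registered through the standard `logging` API. A minimal sketch follows; the level name `CUSTOM_ERROR` and the helper are our assumptions, only the value 35 comes from the text above:

```python
import logging

CUSTOM_ERROR_LEVEL = 35  # sits between WARNING (30) and ERROR (40)
logging.addLevelName(CUSTOM_ERROR_LEVEL, "CUSTOM_ERROR")

def custom_error(logger: logging.Logger, msg: str, *args) -> None:
    """Emit a record at the project's custom level if the logger allows it."""
    if logger.isEnabledFor(CUSTOM_ERROR_LEVEL):
        logger.log(CUSTOM_ERROR_LEVEL, msg, *args)
```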
### Error Handling
- Wrap critical operations that should trigger notifications in try-except blocks that call `error_handler.notify_error()`.
- Avoid silent failures; ensure errors are logged to the appropriate file sink.
### Configuration
- Access settings exclusively via the `config.py` module's dictionaries: `db_config`, `aws_config`, `ach_config`.
- Never hardcode credentials or endpoints.
## Related Files
- [query-sql.md](query-sql.md): Reference for database schema and SQL logic.
- [requirements.txt](requirements.txt): Project dependencies.
- [docker-compose.yml](docker-compose.yml): Deployment configuration.

README.md (178 lines changed)

# ACH Server Media Import

This repository contains a script that imports media files from an S3-compatible bucket into a PostgreSQL database. It supports both local execution (Python virtual environment) and Docker deployment via `docker-compose`.

---

## Overview

### Asset hierarchy
- **Conservatory Copy (Master)**: High-quality source (e.g., `.mov`, `.wav`). This is the primary record in the database.
- **Streaming Copy (Derivative)**: Transcoded versions (`.mp4`, `.mp3`) linked to the master.
- **Sidecar Metadata (`.json`)**: Contains technical metadata (`mediainfo` / `ffprobe`) used for validation and to determine the correct MIME type.
- **Sidecar QC (`.pdf`, `.md5`)**: Quality control and checksum files.

> **Important:** all files belonging to the same asset must share the same 12-character inventory code (e.g., `VO-UMT-14387`).

---

## Process Phases

The importer runs in three clearly separated phases (each phase is logged in detail):

### Phase 1: S3 discovery + initial validation
- List objects in the configured S3 bucket.
- Keep only allowed extensions: `.mp4`, `.mp3`, `.json`, `.pdf`, `.md5`.
- Exclude configured folders (e.g., `TEST-FOLDER-DEV/`, `DOCUMENTAZIONE_FOTOGRAFICA/`, `UMT/`).
- Validate the inventory code format and ensure the folder prefix matches the type encoded in the inventory code.
- Files failing validation are rejected **before** any database interaction.

### Phase 2: Database cross-reference + filtering
- Load existing filenames from the database.
- Skip files already represented in the DB, including sidecar records.
- Build the final list of S3 objects to parse.

### Phase 3: Parse & insert
- Read and validate sidecars (`.json`, `.md5`, `.pdf`) alongside the media file.
- Use metadata (from `mediainfo` / `ffprobe`) to derive the **master MIME type** and enforce container rules.
- Insert new records into the database (unless `ACH_DRY_RUN=true`).

---

## Validation Policy

The import pipeline enforces strict validation to prevent bad data from entering the database.

### Inventory Code & Folder Prefix
- Expected inventory code format: `^[VA][OC]-[A-Z0-9]{3}-\d{5}$`.
- The folder prefix (e.g., `BRD/`, `DVD/`, `FILE/`) must match the code type.
- If the prefix does not match the inventory code, the file is rejected in Phase 1.

### Safe Run (`ACH_SAFE_RUN`)
- When `ACH_SAFE_RUN=true`, **any warning during Phase 3 causes an immediate abort**.
- This prevents partial inserts when the importer detects inconsistent or already-present data.

### MIME Type Determination
- The MIME type for master files is derived from the JSON sidecar metadata (`mediainfo` / `ffprobe`), not from the streaming derivative extension.

---

## Quick Start (Local)

### Prerequisites
- Python 3.8+
- Virtual environment support (`venv`)

### Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Run
```bash
python main.py
```

---

## Docker (docker-compose)

The project includes a `docker-compose.yml` with an `app` service (container name `ACH_server_media_importer`). It reads environment variables from `.env` and mounts a `logs` volume.

### Build & run
```bash
docker compose up -d --build
```

### Logs
```bash
docker compose logs -f app
```

### Stop
```bash
docker compose stop
```

### Rebuild (clean)
```bash
docker compose down --volumes --rmi local
docker compose up -d --build
```

---

## Configuration

Configuration is driven by `.env` and `config.py`. Key variables include:
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `BUCKET_NAME`
- `DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, `DB_PORT`
- `ACH_DRY_RUN` (`true` / `false`)
- `ACH_SAFE_RUN` (`true` / `false`)

---

## Troubleshooting
- If Docker does not pick up changes, ensure `docker compose up -d --build` is run from the repo root.
- Inspect runtime errors via `docker compose logs -f app`.
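For illustration only, boolean flags like `ACH_DRY_RUN` and `ACH_SAFE_RUN` are typically read as shown below. This is a sketch, not the repository's exact code; the `env_flag` helper is our name:

```python
import os

def env_flag(name: str, default: str = "false") -> bool:
    """Interpret an environment variable as a boolean flag ('true'/'false')."""
    return os.getenv(name, default).strip().lower() == "true"

# ACH_DRY_RUN defaults to true (no inserts); ACH_SAFE_RUN aborts Phase 3 on warnings.
dry_run = env_flag("ACH_DRY_RUN", "true")
safe_run = env_flag("ACH_SAFE_RUN", "true")
```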

config.py

@@ -66,5 +66,21 @@ def load_config():
     return aws_config, db_config, ach_config, bucket_name, ach_variables
 
+EXTENSION_MIME_MAP = {
+    '.avi': 'video/x-msvideo',
+    '.mov': 'video/quicktime',
+    '.wav': 'audio/wav',
+    '.mp4': 'video/mp4',
+    '.m4v': 'video/mp4',
+    '.mp3': 'audio/mp3',
+    '.mxf': 'application/mxf',
+    '.mpg': 'video/mpeg',
+    '.aif': 'audio/aiff',
+    '.wmv': 'video/x-ms-asf',
+    '.m4a': 'audio/mp4',
+}
+
+MIME_TYPES = sorted(list(set(EXTENSION_MIME_MAP.values())))
 # Consider using a class for a more structured approach (optional)
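The lookup this centralized map enables (mirroring `get_mime_for_extension` in `db_utils.py`) can be exercised with this self-contained sketch, which reproduces the committed dict inline:

```python
# Reproduces the committed mapping so the example runs standalone
EXTENSION_MIME_MAP = {
    '.avi': 'video/x-msvideo', '.mov': 'video/quicktime', '.wav': 'audio/wav',
    '.mp4': 'video/mp4', '.m4v': 'video/mp4', '.mp3': 'audio/mp3',
    '.mxf': 'application/mxf', '.mpg': 'video/mpeg', '.aif': 'audio/aiff',
    '.wmv': 'video/x-ms-asf', '.m4a': 'audio/mp4',
}

def get_mime_for_extension(extension: str) -> str:
    """Accepts the extension with or without the leading dot; unknown -> octet-stream."""
    if extension and not extension.startswith('.'):
        extension = f'.{extension}'
    return EXTENSION_MIME_MAP.get((extension or '').lower(), 'application/octet-stream')
```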

countfiles.py

@@ -130,33 +130,13 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
     # connect to database
     conn = psycopg2.connect(**db_config)
     cur = conn.cursor()
-    # function count_files that are wav and mov in db
-    # Map file extensions (include leading dot) to mime types
-    EXTENSION_MIME_MAP = {
-        '.avi': 'video/x-msvideo',
-        '.mov': 'video/mov',
-        '.wav': 'audio/wav',
-        '.mp4': 'video/mp4',
-        '.m4v': 'video/mp4',
-        '.mp3': 'audio/mp3',
-        '.mxf': 'application/mxf',
-        '.mpg': 'video/mpeg',
-    }
-    # populate mime_type list with all relevant MediaInfo/MIME values
-    mime_type = [
-        'video/x-msvideo',   # .avi
-        'video/mov',         # .mov
-        'audio/wav',         # .wav
-        'video/mp4',         # .mp4, .m4v
-        'audio/mp3',         # .mp3
-        'application/mxf',   # .mxf
-        'video/mpeg',        # .mpg
-    ]
-    logging.info(f"Mime types for counting files: {mime_type}")
-    all_files_on_db = count_files(cur, mime_type, '*', False)
+    # Use centralized mime types from config
+    from config import EXTENSION_MIME_MAP, MIME_TYPES
+    logging.info(f"Mime types for counting files: {MIME_TYPES}")
+    all_files_on_db = count_files(cur, MIME_TYPES, '*', False)
     mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
     mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
     mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)

db_utils.py

@@ -34,18 +34,7 @@ from email_utils import handle_error
 import json
 import os
 import config
-
-# Map file extensions (include leading dot) to mime types
-EXTENSION_MIME_MAP = {
-    '.avi': 'video/x-msvideo',
-    '.mov': 'video/mov',
-    '.wav': 'audio/wav',
-    '.mp4': 'video/mp4',
-    '.m4v': 'video/mp4',
-    '.mp3': 'audio/mp3',
-    '.mxf': 'application/mxf',
-    '.mpg': 'video/mpeg',
-}
+from config import EXTENSION_MIME_MAP
 
 def get_mime_for_extension(extension: str) -> str:
     """Return the mime type for an extension. Accepts with or without leading dot.
@@ -58,6 +47,76 @@ def get_mime_for_extension(extension: str) -> str:
         extension = f'.{extension}'
     return EXTENSION_MIME_MAP.get(extension.lower(), 'application/octet-stream')
+def get_mime_from_mediainfo(ach_variables: dict) -> str:
+    """Determine a MIME type from the JSON sidecar mediainfo.
+
+    This is used to capture the *master* format (conservatory copy) even when
+    the stream copy on S3 is a different container (e.g., _H264.mp4).
+    If mediainfo is missing or cannot be mapped, fall back to extension-based mapping.
+    """
+    # Prefer the master (conservative copy) extension when it is explicitly available.
+    # In some cases MediaInfo reports "MPEG-4" for .mov containers, so the extension
+    # is a more reliable hint for the correct mime type.
+    conservative_ext = ach_variables.get('conservative_copy_extension')
+    if conservative_ext and conservative_ext.lower() == '.mov':
+        return 'video/quicktime'
+    # Also check the actual conservative copy object key (from JSON @ref). This is the
+    # name that will be stored in the DB as the master file, so it should drive the MIME.
+    conservative_copy_key = ach_variables.get('objectKeys', {}).get('conservative_copy', '')
+    if conservative_copy_key and os.path.splitext(conservative_copy_key)[1].lower() == '.mov':
+        return 'video/quicktime'
+    # Try to find the General track format from mediainfo
+    try:
+        mediainfo = ach_variables.get('custom_data_in', {}).get('mediainfo', {})
+        tracks = mediainfo.get('media', {}).get('track', [])
+        for track in tracks:
+            if track.get('@type', '') == 'General':
+                format_value = track.get('Format', '')
+                if format_value:
+                    # Map common MediaInfo format values to MIME types
+                    mapping = {
+                        'AVI': 'video/x-msvideo',
+                        'MOV': 'video/quicktime',
+                        'QuickTime': 'video/quicktime',
+                        'MPEG-4': 'video/mp4',
+                        'MP4': 'video/mp4',
+                        'MXF': 'application/mxf',
+                        'MPEG': 'video/mpeg',
+                        'MPEG-PS': 'video/mpeg',
+                        'MPEG-TS': 'video/MP2T',
+                        'MPEG Audio': 'audio/mpeg',
+                        'MPEG Audio/Layer 3': 'audio/mpeg',
+                        'AAC': 'audio/aac',
+                        'PCM': 'audio/wav',
+                        'WAV': 'audio/wav',
+                        'AIFF': 'audio/aiff',
+                        'FLAC': 'audio/flac',
+                    }
+                    # Do a case-insensitive match
+                    for k, v in mapping.items():
+                        if format_value.lower() == k.lower():
+                            return v
+                    # Try a fuzzy match based on known substrings
+                    if 'avi' in format_value.lower():
+                        return 'video/x-msvideo'
+                    if 'mp4' in format_value.lower():
+                        return 'video/mp4'
+                    if 'mpeg' in format_value.lower():
+                        return 'video/mpeg'
+                    if 'wav' in format_value.lower() or 'pcm' in format_value.lower():
+                        return 'audio/wav'
+                    if 'mp3' in format_value.lower():
+                        return 'audio/mpeg'
+        # Fall back to extension-based mapping when metadata doesn't yield a mime
+        extension = ach_variables.get('extension')
+        return get_mime_for_extension(extension)
+    except Exception:
+        return get_mime_for_extension(ach_variables.get('extension'))
 
 def get_distinct_filenames_from_db():
     """Retrieve distinct digital file names from the Postgres DB.
@@ -94,15 +153,19 @@ def check_inventory_in_db(s3_client, cur, base_name):
     # Define the pattern for the inventory code
     media_tipology_A = ['MCC', 'OA4', 'DAT']
+    # FOR FILE: extend media_tipology_A (kept separate for readability)
+    media_tipology_A += ['M4A', 'AIF']  # add for "FILE" folders 04112025
     # TODO add other tipologies: AVI, M4V, MOV, MP4, MXF, MPG (done 04112025)
     media_tipology_V = [
         'OV1', 'OV2', 'UMT', 'VHS', 'HI8', 'VD8', 'BTC', 'DBT', 'IMX', 'DVD',
-        'CDR', 'MDV', 'DVC', 'HDC', 'BRD', 'CDV',
-        'AVI', 'M4V', 'MOV', 'MP4', 'MXF', 'MPG'  # add for "file" folders 04112025
+        'CDR', 'MDV', 'DVC', 'HDC', 'BRD', 'CDV'
     ]
+    # FOR FILE: extend media_tipology_V (kept separate for readability)
+    media_tipology_V += ['AVI', 'M4V', 'MOV', 'MP4', 'MXF', 'MPG', 'WMV']  # add for "FILE" folders 04112025
-    # list of known mime types (derived from EXTENSION_MIME_MAP)
-    mime_type = list({v for v in EXTENSION_MIME_MAP.values()})
+    # Use centralized mime types from config
+    from config import MIME_TYPES
 
     try:
         logging.info(f"SUPPORT TYPOLOGY : {base_name[3:6]}")
@@ -270,8 +333,8 @@ def add_file_record_and_relationship(s3_client, cur, base_name, ach_variables):
     file_availability_dict = 7  # Place Holder
     # add a new file record for the "copia conservativa"
     ach_variables['custom_data_in']['media_usage'] = 'master'  # can be "copia conservativa"
-    # determine master mime type from the file extension
-    master_mime_type = get_mime_for_extension(ach_variables.get('extension'))
+    # determine master mime type using the JSON sidecar metadata (preferred)
+    master_mime_type = get_mime_from_mediainfo(ach_variables)
     new_file_id = add_file_record(
         cur,
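A trimmed, standalone sketch of the General-track lookup used above (the committed function additionally checks the conservative-copy extension first and does fuzzy substring matching; the helper name here is ours):

```python
# Subset of the committed MediaInfo Format -> MIME mapping, lowercased keys
MEDIAINFO_FORMAT_TO_MIME = {
    'avi': 'video/x-msvideo', 'mov': 'video/quicktime', 'quicktime': 'video/quicktime',
    'mpeg-4': 'video/mp4', 'mp4': 'video/mp4', 'mxf': 'application/mxf',
    'mpeg': 'video/mpeg', 'pcm': 'audio/wav', 'wav': 'audio/wav', 'aiff': 'audio/aiff',
}

def mime_from_general_track(mediainfo: dict, fallback: str = 'application/octet-stream') -> str:
    """Read the 'General' track Format from a mediainfo dict and map it to a MIME type."""
    for track in mediainfo.get('media', {}).get('track', []):
        if track.get('@type') == 'General':
            fmt = track.get('Format', '').lower()
            if fmt in MEDIAINFO_FORMAT_TO_MIME:
                return MEDIAINFO_FORMAT_TO_MIME[fmt]
    return fallback
```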

file_utils.py

@@ -286,6 +286,30 @@ def extract_and_validate_file_info(file_contents, file, ach_variables):
     else:
         logging.info(f"ach_file_fullpath '{basename_fullpath}' matches JSON ffprobe file name '{basename_ffprobe}'.")
 
+    # Check folder prefixes (e.g. FILE/ vs DBT/) match between S3 file and JSON refs
+    def _extract_prefix(path: str) -> str:
+        if not path:
+            return ''
+        # Normalize separators to '/' so we can reliably split on a single character
+        normalized = path.replace('\\', '/').lstrip('/')
+        return normalized.split('/', 1)[0] if '/' in normalized else normalized
+
+    prefix_fullpath = _extract_prefix(ach_variables['file_fullpath'])
+    prefix_mediainfo = _extract_prefix(json_ref_mediainfo_path)
+    prefix_ffprobe = _extract_prefix(json_ref_ffprobe_path)
+    if prefix_fullpath != prefix_mediainfo or prefix_fullpath != prefix_ffprobe:
+        logging.warning(
+            "Prefix mismatch for S3 file '%s': S3 prefix='%s' (fullpath='%s') vs JSON prefixes (mediainfo='%s' [%s], ffprobe='%s' [%s]).",
+            ach_variables.get('file_fullpath'),
+            prefix_fullpath,
+            ach_variables.get('file_fullpath'),
+            prefix_mediainfo,
+            json_ref_mediainfo_path,
+            prefix_ffprobe,
+            json_ref_ffprobe_path,
+        )
+
     if basename_fullpath != basename_mediainfo and basename_fullpath != basename_ffprobe:
         logging.error(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
         raise ValueError(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
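The prefix helper above behaves like this self-contained copy (renamed without the leading underscore so it can run standalone):

```python
def extract_prefix(path: str) -> str:
    """Return the top-level folder of an object key, or '' for empty paths."""
    if not path:
        return ''
    # Normalize separators so Windows-style keys are handled too
    normalized = path.replace('\\', '/').lstrip('/')
    return normalized.split('/', 1)[0] if '/' in normalized else normalized
```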

logging_config.py

@@ -33,8 +33,8 @@ def _create_timed_handler(path: str, level=None, when='midnight', interval=1, ba
     """
     _ensure_dir_for_file(path)
     handler = TimedRotatingFileHandler(path, when=when, interval=interval, backupCount=backupCount, encoding='utf-8')
-    # Use a readable suffix for rotated files (handler will append this after the filename)
-    handler.suffix = "%Y%m%d_%H%M%S"
+    # Default behavior is to append to the existing log file.
+    # Rotation happens when 'when' occurs.
     if fmt:
         handler.setFormatter(fmt)
     if level is not None:
@@ -58,6 +58,8 @@ def setup_logging():
     error_log_path = os.getenv('ERROR_LOG_FILE_PATH', "./logs/ACH_media_import_errors.log")
     warning_log_path = os.getenv('WARNING_LOG_FILE_PATH', "./logs/ACH_media_import_warnings.log")
+    if os.getenv('WARING_LOG_FILE_PATH'):  # Fix typo in .env if present
+        warning_log_path = os.getenv('WARING_LOG_FILE_PATH')
     info_log_path = os.getenv('INFO_LOG_FILE_PATH', "./logs/ACH_media_import_info.log")
 
     # Create three handlers: info (all), warning (warning+), error (error+)
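A minimal standalone sketch of a handler configured as the hunk above describes (midnight rotation, append between rotations); the backup count and format string are illustrative assumptions:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

def create_timed_handler(path: str, level: int = logging.INFO) -> TimedRotatingFileHandler:
    """Rotate at midnight, keep 7 backups, append to the current file between rotations."""
    handler = TimedRotatingFileHandler(path, when='midnight', interval=1,
                                       backupCount=7, encoding='utf-8')
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    return handler
```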

main.py (141 lines changed)

@@ -12,7 +12,7 @@ from error_handler import handle_general_error, handle_file_not_found_error, han
 from file_utils import is_file_empty
 from db_utils import count_files, get_distinct_filenames_from_db
 from dotenv import load_dotenv
-from validation_utils import validate_inventory_code, analyze_pattern_match, validate_icode_extension
+from validation_utils import validate_inventory_code, analyze_pattern_match, validate_icode_extension, list_s3_not_in_db, validate_mp4_file, validate_mp3_file
 import config
 import psycopg2
@@ -33,6 +33,25 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
     logging.info(f"bucket_name: {bucket_name}")
 
+    # SECURITY CHECK: If DRY_RUN is false and ENV is development, ask for confirmation
+    dry_run_env = os.getenv('ACH_DRY_RUN', 'true').lower()
+    ach_env = os.getenv('ACH_ENV', 'development').lower()
+    if dry_run_env == 'false' and ach_env == 'development':
+        print("\n" + "!" * 60)
+        print("!!! SECURITY CHECK: RUNNING IMPORT ON DEVELOPMENT ENVIRONMENT !!!")
+        print(f"DB_HOST: {db_config.get('host')}")
+        print(f"DB_NAME: {db_config.get('database')}")
+        print(f"DB_USER: {db_config.get('user')}")
+        print(f"DB_PORT: {db_config.get('port')}")
+        print("!" * 60 + "\n")
+        user_input = input(f"Please type the DB_NAME '{db_config.get('database')}' to proceed: ")
+        if user_input != db_config.get('database'):
+            print("Action aborted by user. Database name did not match.")
+            logging.error("Process aborted: User failed to confirm DB_NAME for development import.")
+            return
+
     # Ensure timing variables are always defined so later error-email logic
     # won't fail if an exception is raised before end_time/elapsed_time is set.
     start_time = time.time()
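The confirmation gate reduces to two pure checks, sketched here for illustration (the function names are ours, not the repository's):

```python
def needs_confirmation(dry_run: str, env: str) -> bool:
    """A real (non-dry) run against a development environment must be confirmed."""
    return dry_run.strip().lower() == 'false' and env.strip().lower() == 'development'

def confirm_db_name(expected: str, typed: str) -> bool:
    """The operator must retype the exact DB name to proceed."""
    return typed == expected
```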
@@ -44,6 +63,20 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
     try:
         logging.info("Starting the main process...")
 
+        # ---------------------------------------------------------------------
+        # PHASE 1: S3 OBJECT DISCOVERY + INITIAL VALIDATION
+        #
+        # 1) List objects in the configured S3 bucket.
+        # 2) Filter objects by allowed extensions and excluded folders.
+        # 3) Validate the inventory code format (e.g. VA-C01-12345) and ensure the
+        #    folder prefix matches the code type (e.g. "BRD" folder for BRD code).
+        # 4) Reject files that violate naming conventions before any DB interaction.
+        #
+        # This phase is intentionally descriptive so the workflow can be understood
+        # from logs even if the function names are not immediately clear.
+        # ---------------------------------------------------------------------
+        logging.info("PHASE 1: S3 object discovery + initial validation")
+
         # Helper to make spaces visible in filenames for logging (replace ' ' with open-box char)
         def _visible_spaces(name: str) -> str:
             try:
@@ -60,7 +93,9 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
         # Define valid extensions and excluded folders
         valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
         # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'FILE/'}
-        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/'}
+        # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/', 'FILE/', 'DVD/', 'UMT/'}
+        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/', 'UMT/'}
+        # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/',}
         # included_folders = {'FILE/'} # uncomment this to NOT use excluded folders
         # included_folders = {'TEST-FOLDER-DEV/'} # uncomment this to NOT use excluded folders
@@ -98,9 +133,33 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
             # s3_file_names contains the object keys (strings), not dicts.
             base_name = os.path.basename(s3file)
             logging.info(f"S3 Base name: {base_name}")
+
+            # extract folder prefix and media type from inventory code
+            folder_prefix = os.path.dirname(s3file).rstrip('/')
+            media_type_in_code = base_name[3:6] if len(base_name) >= 6 else None
+
+            # Generic sanity check: prefix (folder name) should equal the media type in the code
+            is_valid_prefix = (folder_prefix == media_type_in_code)
+
+            # Special folder allowance rules
+            folder_allowances = {
+                'DVD': ['DVD', 'BRD'],
+                'FILE': ['M4V', 'AVI', 'MOV', 'MP4', 'MXF', 'AIF', 'WMV', 'M4A', 'MPG'],
+            }
+            if folder_prefix in folder_allowances:
+                if media_type_in_code in folder_allowances[folder_prefix]:
+                    is_valid_prefix = True
+
+            if folder_prefix and media_type_in_code and not is_valid_prefix:
+                logging.warning(f"Prefix mismatch for {s3file}: Folder '{folder_prefix}' does not match code type '{media_type_in_code}'")
+                # we only warn here but still proceed with standard validation
+
             if validate_inventory_code(base_name):  # truncated to first 12 char in the function
                 logging.info(f"File {base_name} matches pattern.")
-                # if valid check extension too
-                if not validate_icode_extension(s3file):
-                    logging.warning(f"File {s3file} has invalid extension for its inventory code.")
-                    continue  # skip adding this file to validated contents
+                # only check inventory code extension for media files (.mp4, .mp3)
+                # sidecars (.json, .pdf, .md5) only need their base validated
+                if s3file.lower().endswith(('.mp4', '.mp3')):
+                    if not validate_icode_extension(s3file):
+                        logging.warning(f"File {s3file} has invalid extension for its inventory code.")
+                        continue  # skip adding this file to validated contents
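The allowance rules can be isolated into a small testable predicate; this is a sketch (the committed code inlines the same logic in the loop, and the helper name is ours):

```python
# Folders whose contents may carry a different media type in the inventory code
FOLDER_ALLOWANCES = {
    'DVD': ['DVD', 'BRD'],
    'FILE': ['M4V', 'AVI', 'MOV', 'MP4', 'MXF', 'AIF', 'WMV', 'M4A', 'MPG'],
}

def prefix_matches_code(folder_prefix: str, media_type_in_code: str) -> bool:
    """True when the S3 folder matches the media type at positions 3-6 of the code."""
    if folder_prefix == media_type_in_code:
        return True
    return media_type_in_code in FOLDER_ALLOWANCES.get(folder_prefix, [])
```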
@@ -112,6 +171,15 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
                 folder_name = os.path.dirname(s3file)
                 logging.warning(f"File {s3file} in folder {folder_name} does not match pattern.")
 
+        # ---------------------------------------------------------------------
+        # PHASE 2: DATABASE CROSS-REFERENCE + FILTERING
+        #
+        # 1) Fetch existing filenames from the database.
+        # 2) Skip files already represented in the DB (including sidecar records).
+        # 3) Produce the final list of S3 object keys that should be parsed/inserted.
+        # ---------------------------------------------------------------------
+        logging.info("PHASE 2: Database cross-reference + filtering")
+
         # filter_s3_files_not_in_db
         # --- Get all DB filenames in one call ---
         db_file_names = get_distinct_filenames_from_db()
@@ -138,6 +206,14 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
         total_files = len(filtered_file_names)
         logging.info(f"Total number of the valid (mp3,mp4,md5,json,pdf) files after DB filter: {total_files}")
 
+        # Log the files that need to be updated (those not yet in DB)
+        if total_files > 0:
+            logging.info("List of files to be updated in the database:")
+            for f in filtered_file_names:
+                logging.info(f"  - {f}")
+        else:
+            logging.info("No new files found to update in the database.")
+
         # Count files with .mp4 and .mp3 extensions
         mp4_count = sum(1 for file in s3_file_names if file.endswith('.mp4'))
         mp3_count = sum(1 for file in s3_file_names if file.endswith('.mp3'))
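Phase 2's DB filter is essentially a set difference on base names; a minimal sketch under that assumption (the helper name is ours):

```python
import os

def filter_not_in_db(s3_keys, db_file_names):
    """Keep only S3 objects whose base name is not already recorded in the database."""
    db_set = set(db_file_names)  # O(1) membership tests
    return [key for key in s3_keys if os.path.basename(key) not in db_set]
```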
@ -167,6 +243,10 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
for file in s3_file_names:
if file.endswith('.mp4'):
validate_mp4_file(file) # validation_utils.py - check also _H264 at the end
elif file.endswith('.mp3'):
validate_mp3_file(file) # validation_utils.py
# Count by CODE media type (e.g. OA4, MCC) and log the counts for each type
# If ACH_SAFE_RUN is 'false' we enforce strict mp4/pdf parity and abort
# when mismatched. Default is 'true' which skips this abort to allow
@@ -177,7 +257,7 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
logging.error("Number of .mp4 files is not equal to number of .pdf files")
# MOD 20251103
# add a check to find the missing pdf or mp4 files and report them
# use filtered_file_names to find missing files
# store tuples (source_file, expected_counterpart) for clearer logging
missing_pdfs = [] # list of (mp4_file, expected_pdf)
missing_mp4s = [] # list of (pdf_file, expected_mp4)
@@ -187,7 +267,7 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
base_name = os.path.splitext(file)[0]
# if the mp4 is an H264 variant (e.g. name_H264.mp4) remove the suffix
if base_name.endswith('_H264'):
# must check if has extra number for DBT and DVD and [FILE]
base_name = base_name[:-5]
expected_pdf = base_name + '.pdf'
if expected_pdf not in filtered_file_names:
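The mp4/pdf pairing rule above — strip a trailing `_H264` from the mp4 base name, then expect a sibling `.pdf` — can be sketched as a self-contained helper. The helper name and sample filenames are invented for illustration:

```python
# Sketch of the mp4<->pdf parity check on a flat list of filenames.
# An "_H264" suffix on the mp4 base name is stripped before pairing.
import os

def find_missing_pdfs(file_names):
    missing = []  # list of (mp4_file, expected_pdf)
    names = set(file_names)
    for f in file_names:
        if not f.endswith('.mp4'):
            continue
        base = os.path.splitext(f)[0]
        if base.endswith('_H264'):
            base = base[:-5]  # drop "_H264"
        expected_pdf = base + '.pdf'
        if expected_pdf not in names:
            missing.append((f, expected_pdf))
    return missing

files = ["VO-DVD-11111_H264.mp4", "VO-DVD-11111.pdf", "VO-DVD-22222.mp4"]
print(find_missing_pdfs(files))  # → [('VO-DVD-22222.mp4', 'VO-DVD-22222.pdf')]
```

Recording `(source_file, expected_counterpart)` tuples, as the diff's comment suggests, makes the later log lines self-explanatory.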
@@ -221,13 +301,9 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
if mp3_count + mp4_count != json_count:
logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .json files")
# add check of mp3 + mp4 vs json and md5 file like above for mp4 and pdf
logging.error("Abort Import Process due to missing files")
# search which file doesn't match TODO
raise ValueError("Inconsistent file counts mp3+mp4 vs json")
@@ -237,11 +313,20 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
# search which file doesn't match TODO
raise ValueError("Inconsistent file counts mp3+mp4 vs md5")
# ---------------------------------------------------------------------
# PHASE 3: PARSE & INSERT INTO DATABASE
#
# 1) Process each remaining S3 object and validate its associated metadata.
# 2) Insert new records into the database (unless running in DRY_RUN).
# 3) Report counts of successful uploads, warnings, and errors.
# ---------------------------------------------------------------------
logging.info("PHASE 3: Parse S3 objects and insert new records into the database")
# Try to parse S3 files
try:
# If DRY RUN is set to True, the files will not be uploaded to the database
if os.getenv('ACH_DRY_RUN', 'true') == 'false':
uploaded_files_count, warning_files_count, error_files_count = parse_s3_files(s3_client, filtered_file_names, ach_variables, excluded_folders)
else:
logging.warning("DRY RUN is set to TRUE - No files will be added to the database")
# set the tuples to zero
@@ -258,33 +343,13 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
# connect to database
conn = psycopg2.connect(**db_config)
cur = conn.cursor()
# Use centralized mime types from config
from config import EXTENSION_MIME_MAP, MIME_TYPES
logging.info(f"Mime types for counting files: {MIME_TYPES}")
all_files_on_db = count_files(cur, MIME_TYPES, '*', False)
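This hunk replaces an inline extension-to-MIME dict with the centralized map in `config.py`. The exact contents of `config.py` are not shown in this diff; a plausible shape, consistent with the MIME values used in the surrounding `count_files` calls, would be:

```python
# Hypothetical config.py fragment (assumed, not shown in this diff):
# one source of truth for extension -> MIME mapping.
EXTENSION_MIME_MAP = {
    '.avi': 'video/x-msvideo',
    '.mov': 'video/mov',
    '.wav': 'audio/wav',
    '.mp4': 'video/mp4',
    '.m4v': 'video/mp4',   # shares a MIME value with .mp4
    '.mp3': 'audio/mp3',
    '.mxf': 'application/mxf',
    '.mpg': 'video/mpeg',
}

# Derive the flat MIME list from the map so the two can never drift apart.
MIME_TYPES = sorted(set(EXTENSION_MIME_MAP.values()))
print(MIME_TYPES)
```

Deriving `MIME_TYPES` from the map (rather than maintaining a second literal list) is the main payoff of centralizing: adding one extension updates both structures.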
mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)
@@ -300,18 +365,18 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
logging.info(f"Number of .mp4 files in the database: {mp4_files_on_db} and S3: {mp4_count}")
# compare the mp4 names with the S3 names, report files missing from either list and print the list
missing_mp4s = [f for f in s3_file_names if f.endswith('.mp4') and f not in db_file_names]
# if missing_mp4s is empty do not emit a warning
if missing_mp4s:
logging.warning(f"Missing {len(missing_mp4s)} .mp4 files in DB compared to S3: {missing_mp4s}")
logging.info(f"Number of .wav files in the database: {wav_files_on_db}")
logging.info(f"Number of .mp3 files in the database: {mp3_files_on_db} and S3: {mp3_count}")
missing_mp3s = [f for f in s3_file_names if f.endswith('.mp3') and f not in db_file_names]
# if missing_mp3s is empty do not emit a warning
if missing_mp3s:
logging.warning(f"Missing {len(missing_mp3s)} .mp3 files in DB compared to S3: {missing_mp3s}")
logging.info(f"Number of .avi files in the database: {avi_files_on_db}")
logging.info(f"Number of .m4v files in the database: {m4v_files_on_db}")

View File

@@ -31,10 +31,16 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
# log ach_variables
logging.info(f"ach_variables: {ach_variables}")
# ---------------------------------------------------------------------
# PHASE 3: PARSE & INSERT INTO DATABASE (DETAILS)
#
# 3.1) Filter out excluded prefixes and keep only files we care about.
# 3.2) Validate each media file alongside its related sidecars (.json, .md5, .pdf).
# 3.3) Cross-check the inventory code in the database and insert new records.
# ---------------------------------------------------------------------
logging.info("PHASE 3: Parse & insert - starting detailed file processing")
try:
logging.info(f"Starting to parse S3 files from bucket {bucket_name}...")
# Ensure db_config is not None
if db_config is None:
raise ValueError("Database configuration is not loaded")
@@ -67,6 +73,9 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
if result:
# logging.warning(f"File {file} already exists in the database.")
warning_files_count += 1
if os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true':
logging.error("ACH_SAFE_RUN=true: aborting Phase 3 due to warnings (file already exists in DB): %s", file)
raise ValueError("ACH_SAFE_RUN=true: aborting due to warnings in Phase 3")
continue
ach_variables['file_fullpath'] = file # is the Object key
@@ -76,12 +85,13 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
ach_variables['objectKeys']['media'] = file
ach_variables['objectKeys']['pdf'] = f"{os.path.splitext(file)[0]}.pdf"
ach_variables['objectKeys']['pdf'] = ach_variables['objectKeys']['pdf'].replace('_H264', '')
from config import EXTENSION_MIME_MAP
if file.endswith('.mp4'):
ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.mov" # remove _H264 is done later
elif file.endswith('.mp3'):
ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.wav"
else:
logging.error(f"Unsupported file type: {file}")
error_files_count += 1
continue
@@ -116,6 +126,9 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
else:
logging.warning("Could not retrieve file size for %s.", file)
warning_files_count += 1
if os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true':
logging.error("ACH_SAFE_RUN=true: aborting Phase 3 due to warnings (missing file size): %s", file)
raise ValueError("ACH_SAFE_RUN=true: aborting due to warnings in Phase 3")
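The `ACH_SAFE_RUN` and `ACH_DRY_RUN` checks both read a string-valued environment flag where unset defaults to `'true'`. A minimal sketch of that convention, with a hypothetical helper name (the project reads `os.getenv` inline instead):

```python
# Sketch of the env-flag convention: unset -> the given default,
# and only the case-insensitive literal 'true' counts as enabled.
import os

def flag_enabled(name, default='true'):
    return os.getenv(name, default).lower() == 'true'

os.environ['ACH_SAFE_RUN'] = 'TRUE'
print(flag_enabled('ACH_SAFE_RUN'))   # → True
os.environ['ACH_SAFE_RUN'] = 'false'
print(flag_enabled('ACH_SAFE_RUN'))   # → False
del os.environ['ACH_SAFE_RUN']
print(flag_enabled('ACH_SAFE_RUN'))   # → True (safe-run defaults on)
```

Note the diff itself compares `ACH_DRY_RUN` against `'false'` without lowercasing but lowercases `ACH_SAFE_RUN`; funneling both through one helper would remove that inconsistency.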
continue # Skip to the next file in the loop
logging.info("Start Validating files for %s...", base_name)

View File

@@ -24,8 +24,10 @@ def check_video_info(media_info):
# If the parent directory is 'FILE' accept multiple container types
if parent_dir.lower() == 'file':
# Accept .mov, .avi, .m4v, .mp4, .mxf, .mpg, .mpeg, .wmv (case-insensitive)
video_allowed_extensions = ['.mov', '.avi', '.m4v', '.mp4', '.mxf', '.mpg', '.mpeg', '.wmv']
if not any(file_name.lower().endswith(ext) for ext in video_allowed_extensions):
return False, "The file is not a .mov, .avi, .m4v, .mp4, .mxf, .mpg, .mpeg or .wmv file."
# Map file extensions to lists of acceptable general formats (video)
general_formats = {
@@ -58,18 +60,23 @@ def check_video_info(media_info):
# Outside FILE/ directory require .mov specifically
if not file_name.lower().endswith('.mov'):
return False, "The file is not a .mov file."
# Strict master MOV rule: track[1] must be ProRes
tracks = media_info.get('media', {}).get('track', [])
if len(tracks) <= 1:
return False, "No track 1 found."
track_1 = tracks[1] # track[1] should represent the video stream
logging.info(f"Track 1: {track_1}")
if track_1.get('@type', '') != 'Video':
return False, "Track 1 is not a video track."
if track_1.get('Format', '') != 'ProRes':
return False, "Track 1 format is not ProRes."
if track_1.get('Format_Profile', '') != '4444':
return False, "Track 1 format profile is not 4444."
return True, "The file is a .mov master with ProRes track 1."
return True, "The file passed the video format checks."
except Exception as e:
return False, f"Error processing the content: {e}"
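The strict master-MOV rule above can be exercised on a MediaInfo-style dict. The sketch below isolates just that rule; the helper name and sample data are invented, while the field names (`@type`, `Format`, `Format_Profile`) follow the code above:

```python
# Minimal sketch of the strict ProRes-4444 check on track[1] of a
# MediaInfo JSON structure, mirroring the validation order above.
def is_prores_master(media_info):
    tracks = media_info.get('media', {}).get('track', [])
    if len(tracks) <= 1:
        return False, "No track 1 found."
    t1 = tracks[1]  # track[1] should represent the video stream
    if t1.get('@type', '') != 'Video':
        return False, "Track 1 is not a video track."
    if t1.get('Format', '') != 'ProRes':
        return False, "Track 1 format is not ProRes."
    if t1.get('Format_Profile', '') != '4444':
        return False, "Track 1 format profile is not 4444."
    return True, "OK"

sample = {'media': {'@ref': 'VO-MCC-00001.mov',
                    'track': [{'@type': 'General'},
                              {'@type': 'Video', 'Format': 'ProRes',
                               'Format_Profile': '4444'}]}}
print(is_prores_master(sample))  # → (True, 'OK')
```

Splitting one compound condition into ordered single checks, as this hunk does, means a rejected file gets a message naming the exact field that failed.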
@@ -77,10 +84,32 @@ def check_video_info(media_info):
# result, message = check_audio_info(json_content)
def check_audio_info(media_info):
try:
# Determine source filename (from JSON) and its parent folder
file_name = media_info.get('media', {}).get('@ref', '')
parent_dir = os.path.basename(os.path.dirname(file_name))
logging.info(f"File name in JSON: {file_name}")
# If the file lives under FILE/, allow MP3/WAV/M4A/AIF as valid audio containers
if parent_dir.lower() == 'file':
audio_allowed_extensions = ['.wav', '.mp3', '.m4a', '.aif']
if not any(file_name.lower().endswith(ext) for ext in audio_allowed_extensions):
return False, f"The file is not one of the allowed audio containers: {', '.join(audio_allowed_extensions)}."
# For WAV, do the strict Wave/PCM validation
if file_name.lower().endswith('.wav'):
tracks = media_info.get('media', {}).get('track', [])
if len(tracks) > 1:
track_1 = tracks[1]
if track_1.get('@type', '') == 'Audio' and track_1.get('Format', '') == 'PCM' and track_1.get('SamplingRate', '') == '96000' and track_1.get('BitDepth', '') == '24':
return True, "The file is a .wav file with Wave format in track 1."
else:
return False, f"Track 1 format is not Wave. Format: {track_1.get('Format', '')}, SamplingRate: {track_1.get('SamplingRate', '')}, BitDepth: {track_1.get('BitDepth', '')}"
return False, "No track 1 found."
# For MP3/M4A we accept it without strict Wave validation
return True, "The file is an accepted audio container under FILE/ (mp3/m4a/aif)."
# Outside FILE/ directory require .wav specifically
if not file_name.lower().endswith('.wav'):
return False, "The file is not a .wav file."
# Check if track 1's format is Wave

View File

@@ -110,17 +110,39 @@ def validate_icode_extension(file_inventory_code):
'BTC': r'_\d{4}',
'OA4': r'_\d{2}',
'DVD': r'_\d{2}',
'BRD': r'_\d{2}',
'MCC': r'_[AB]',
'DBT': r'_\d{4}',
'M4V': r'_\d{2}',
'AVI': r'_\d{2}',
'MOV': r'_\d{2}',
'MP4': r'_\d{2}',
'MXF': r'_\d{2}',
'MPG': r'_\d{2}'
}
if not isinstance(file_inventory_code, str) or file_inventory_code == '':
logging.warning("Empty or non-string inventory code provided to validate_icode_extension")
return False
# security: ignore any folder prefix or path components that might be
# passed in from external sources.
file_inventory_code = os.path.basename(file_inventory_code)
# remove common file extensions and only remove _H264 if it's an .mp4 file
if file_inventory_code.lower().endswith('.mp4'):
file_inventory_code = file_inventory_code.replace("_H264", "")
file_inventory_code = os.path.splitext(file_inventory_code)[0]
# Enforce maximum length (12 base + up to 5 extension chars)
if len(file_inventory_code) > 17:
logging.warning("Inventory code '%s' exceeds maximum allowed length (17).", file_inventory_code)
# Only raise the error if DRY RUN is false; otherwise, just log it as a warning
if os.getenv('ACH_DRY_RUN', 'true').lower() == 'false':
raise ValueError("Inventory code with extension exceeds maximum length of 17 characters.")
else:
return False
# Validate base first (first 12 chars). If base invalid -> reject.
base = file_inventory_code[:12]
@@ -132,6 +154,22 @@ def validate_icode_extension(file_inventory_code):
support_type = base[3:6]
extension = file_inventory_code[12:]
if extension == '':
logging.info("Extension for '%s' is empty (valid).", file_inventory_code)
return True
expected_ext_pattern = file_type_to_regex.get(support_type)
if expected_ext_pattern is None:
logging.warning("Unsupported type '%s' for extension validation in '%s'.", support_type, file_inventory_code)
return False
if not re.fullmatch(expected_ext_pattern, extension):
logging.warning("Extension '%s' does not match expected pattern '%s' for type '%s'.", extension, expected_ext_pattern, support_type)
return False
logging.info("Inventory code with extension '%s' is valid.", file_inventory_code)
return True
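The type-specific extension check above can be sketched in isolation. The helper name is invented and the regex table is a subset of the one in this hunk; codes follow the `VO-XXX-NNNNN` shape used by the tests below:

```python
# Sketch of the extension rule: chars 3..5 of the 12-char base select
# a pattern that the trailing extension must fully match.
import re

file_type_to_regex = {
    'BTC': r'_\d{4}',
    'DVD': r'_\d{2}',
    'MCC': r'_[AB]',
}

def extension_ok(code):
    base, ext = code[:12], code[12:]
    if ext == '':
        return True  # a bare 12-char code carries no extension to check
    pattern = file_type_to_regex.get(base[3:6])
    # re.fullmatch rejects partial matches such as '_12' against r'_\d{4}'
    return bool(pattern and re.fullmatch(pattern, ext))

print(extension_ok("VO-BTC-12345_1234"))  # → True
print(extension_ok("VO-MCC-12345_A"))     # → True
print(extension_ok("VO-BTC-12345_12"))    # → False
```

Using `re.fullmatch` rather than `re.match` is what makes `VO-BTC-12345_12` fail: `_12` is a prefix of a valid BTC extension but not a complete one.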
def analyze_extension_pattern(file_inventory_code):
"""Analyze extension and base-length issues for a full inventory code.
@@ -338,11 +376,13 @@ if __name__ == "__main__":
# validate_icode_extension (valid and invalid)
{"name": "validate_icode_extension no ext", "fn": lambda: validate_icode_extension("VO-DVD-12345"), "expect": True},
{"name": "validate_icode_extension BARE prefix", "fn": lambda: validate_icode_extension("BTC/VO-DVD-12345"), "expect": True},
{"name": "validate_icode_extension BTC valid", "fn": lambda: validate_icode_extension("VO-BTC-12345_1234"), "expect": True},
{"name": "validate_icode_extension DVD valid", "fn": lambda: validate_icode_extension("VO-DVD-12345_12"), "expect": True},
{"name": "validate_icode_extension MCC valid", "fn": lambda: validate_icode_extension("VO-MCC-12345_A"), "expect": True},
{"name": "validate_icode_extension unsupported type", "fn": lambda: validate_icode_extension("VO-XYZ-12345_12"), "expect": False},
{"name": "validate_icode_extension too long extension (raises)", "fn": lambda: validate_icode_extension("VO-DVD-12345_12345"), "expect_exception": ValueError},
{"name": "validate_icode_extension prefix too long (raises)", "fn": lambda: validate_icode_extension("XYZ/VO-DVD-12345_12345"), "expect_exception": ValueError},
{"name": "validate_icode_extension BTC invalid pattern", "fn": lambda: validate_icode_extension("VO-BTC-12345_12"), "expect": False},
# mp4/mp3 validators - return lists