diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 95468b4..9877b96 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -59,6 +59,34 @@ The core validation revolves around "Inventory Codes" which MUST follow a strict
 - Access settings exclusively via the `config.py` module's dictionaries: `db_config`, `aws_config`, `ach_config`.
 - Never hardcode credentials or endpoints.
 
+## Copilot / Agent Behavior
+
+This repository is used with an AI assistant. When interacting with the assistant, follow these principles:
+
+- **Do not modify code unless explicitly requested.** The assistant should not change files unless given a clear instruction to do so.
+- **Ask before acting.** If a change is needed, the assistant should describe the required modification and confirm before applying it.
+- **Prefer explanation over edits.** When debugging or answering questions, provide guidance and analysis rather than directly editing source files.
+- **Keep changes minimal.** If a code change is approved, apply the smallest possible edit that resolves the issue.
+
+## Code Style & Maintainability
+
+When generating or modifying code, prioritize **maintainability and clarity over optimization**.
+
+This is **development-stage code**, so it must remain easy to read, understand, and modify by humans.
+
+Guidelines:
+
+- Prefer **clear, explicit implementations** rather than clever or overly compact solutions.
+- Avoid **micro-optimizations** or complex patterns that reduce readability.
+- Do **not introduce obscure algorithms or creative tricks** that make the code difficult to understand.
+- Write code that a developer unfamiliar with the project can quickly follow.
+- Use **meaningful variable and function names**.
+- Add **thoughtful comments** explaining non-obvious logic, assumptions, and decisions.
+- Favor **simple and conventional approaches** instead of experimental or highly abstract ones.
+- Maintain a **consistent structure and formatting**.
+
+The goal is **clean, maintainable, well-documented code**, not maximum performance or cleverness.
+
 ## Related Files
 - [query-sql.md](query-sql.md): Reference for database schema and SQL logic.
 - [requirements.txt](requirements.txt): Project dependencies.
diff --git a/TODO-mime.md b/TODO-mime.md
new file mode 100644
index 0000000..836d9ad
--- /dev/null
+++ b/TODO-mime.md
@@ -0,0 +1,186 @@
+Read [](file:///d%3A/02-Programming/20_ACH_20/ACH-server-import-media02/validation_utils.py#1-1), lines 1 to 260
+
+Searched for regex `allowed_extensions|allowed audio|allowed|endswith\('` (`**/*.py`), 20 results
+
+Here is the report on the **validation processes** that use hard-coded values (that is, values written directly in the source rather than read from config.py or from a central table). This includes extension lists, patterns, and rules applied statically in the code.
+
+---
+
+## 1) Hard-coded extensions / MIME types in utils.py
+
+### 1.1 `check_video_info()` (video)
+- **Fixed list of video extensions** (only when the file is under `FILE/`):
+  - `['.mov', '.avi', '.m4v', '.mp4', '.mxf', '.mpg', '.mpeg', '.wmv']`
+- This list is used to reject files that do not end with one of these extensions.
+
+### 1.2 `check_audio_info()` (audio)
+- **Fixed list of audio extensions** (when the file is under `FILE/`):
+  - `['.wav', '.mp3', '.m4a', '.aif', '.aiff']`
+- This is where the error you reported was raised: if the string used (`@ref` from the JSON) does not end with one of these extensions, validation fails.
+- This list is not derived from `config.EXTENSION_MIME_MAP`; it is hard-coded.
+
+---
+
+## 2) Hard-coded patterns/validation in validation_utils.py
+
+### 2.1 Inventory code (core validation)
+- Fixed pattern: `r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$'` (fine: it is a business rule, so hard-coding it is acceptable)
+
+### 2.2 Rules for the extension after the first 12 characters
+- `file_type_to_regex` is a hard-coded dictionary:
+  - `BTC`, `OA4`, `DVD`, `BRD`, `MCC`, `DBT`, `M4V`, `AVI`, `MOV`, `MP4`, `MXF`, `MPG`
+- Here too: hard-coded, but deliberately so (format guidelines).
+
+---
+
+## 3) Hard-coded extensions in main.py (filters and counters)
+
+### 3.1 Initial filters (Phase 1)
+- `valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}` → hard-coded
+- `excluded_folders` → hard-coded set (currently `{'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/', 'UMT/'}`)
+
+### 3.2 Counters (also in main.py)
+- Counts files with `.mp4`, `.mp3`, `.md5`, `.pdf`, `.json`, `.mov`, `.avi`, `.m4v`
+
+---
+
+## 4) Hard-coded extensions in `validation_utils.validate_mp4_file()`
+- Simply filters on `.mp4` with `f.lower().endswith('.mp4')`
+
+---
+
+## 5) MIME logic in db_utils.py
+- Although it uses `EXTENSION_MIME_MAP`, there is an “extra” mapping based on MediaInfo values (also hard-coded, inside `get_mime_from_mediainfo`)
+- Examples: `'MPEG-4' -> video/mp4`, `'QuickTime' -> video/quicktime`, etc.
+
+---
+
+## Conclusion (what to do/check)
+
+### A) If you want to centralize the rules (recommended)
+- The extension lists used in utils.py (audio/video) could be taken from **`config.EXTENSION_MIME_MAP`** (or from a similar structure, to avoid duplicates and inconsistencies).
+- The same applies to the extensions used in the main.py filters (valid_extensions).
+
+### B) If you want to know exactly what is “wrong”
+- Those lists are the points where there may be discrepancies with the desired MIME mapping.
+- In particular, audio validation relies exclusively on `@ref` and on the `audio_allowed_extensions` list. If `@ref` does not end with an allowed extension, the file is rejected, regardless of what `FileExtension` says in the JSON.
+
+---
+
+If you like, I can give you a more “operational” report:
+
+- the exact list of all lines containing hard-coded extension lists
+- suggestions on how to “wire” these lists to `EXTENSION_MIME_MAP` so that there is a single source of truth (without touching other behavior)
+
+---
+
+
+
+
+Read [](file:///d%3A/02-Programming/20_ACH_20/ACH-server-import-media02/db_utils.py#1-1), lines 1 to 200
+
+## ✅ “MIME logic” in db_utils.py (full detail)
+
+In db_utils.py there are **two key functions** that determine the MIME type to associate with a file:
+
+1. **`get_mime_for_extension(ext)`**: maps an extension (e.g. `.mp4`) directly to a MIME type, using the central `EXTENSION_MIME_MAP` table in config.py.
+2. **`get_mime_from_mediainfo(ach_variables)`**: tries to work out the MIME type from the content of the JSON sidecar (`mediainfo`), with a long fallback chain.
+
+---
+
+## 🔹 1) `get_mime_for_extension(ext)`: “classic” mapping
+
+This function is used as a fallback when the “mediainfo” logic finds nothing usable.
+
+- Takes an extension **with or without the leading dot**
+- Normalizes it (adds the dot if it is missing)
+- Looks it up in `EXTENSION_MIME_MAP`
+- If it is not found, uses **`application/octet-stream`**
+
+📍 **Where it is used:**
+- When a MIME type cannot be inferred from the JSON metadata (`get_mime_from_mediainfo`), or when only the file's declared extension should be used (e.g. in the `.mp4` streaming record).
+
+---
+
+## 🔹 2) `get_mime_from_mediainfo(ach_variables)`: “smart” logic
+
+This is the more complex function, and it tries to be **“correct” with respect to the master**, because:
+
+- The master copy (conservatory copy) may be `.mov` (video/quicktime)
+- but the stream in the bucket may be `.mp4` (video/mp4), and the JSON may be based on the `.mp4`
+- we still want to store the MIME type of the **master**, not of the stream
+
+### Main steps
+
+#### ✅ 2.1 `.mov` master priority
+If the code already knows that the master is `.mov`, it **immediately returns** `video/quicktime`:
+
+- it checks `ach_variables['conservative_copy_extension']`
+- then it checks `ach_variables['objectKeys']['conservative_copy']` (the JSON `@ref` path)
+
+> This prevents `video/mp4` from being stored for a `.mov` when MediaInfo reports “MPEG-4”.
+
+#### ✅ 2.2 If it is not `.mov`, try the `mediainfo` JSON
+- Looks for the track with `@type == "General"` (usually the first track)
+- Reads the `Format` field (e.g. `MPEG-4`, `AVI`, `AIFF`, …)
+- Compares the `Format` value against a **hard-coded map**:
+
+```py
+mapping = {
+    'AVI': 'video/x-msvideo',
+    'MOV': 'video/quicktime',
+    'QuickTime': 'video/quicktime',
+    'MPEG-4': 'video/mp4',
+    'MP4': 'video/mp4',
+    'MXF': 'application/mxf',
+    'MPEG': 'video/mpeg',
+    'MPEG-PS': 'video/mpeg',
+    'MPEG-TS': 'video/MP2T',
+    'MPEG Audio': 'audio/mpeg',
+    'MPEG Audio/Layer 3': 'audio/mpeg',
+    'AAC': 'audio/aac',
+    'PCM': 'audio/wav',
+    'WAV': 'audio/wav',
+    'AIFF': 'audio/aiff',
+    'FLAC': 'audio/flac',
+}
+```
+
+If `Format` matches exactly (case-insensitive), it returns the corresponding MIME type.
+
+#### ✅ 2.3 “Fuzzy” fallback on the `Format` field
+If `Format` is not one of the values in the map, it tries to identify the MIME type from keywords in the value:
+
+- if it contains `avi` → `video/x-msvideo`
+- if it contains `mp4` → `video/mp4`
+- if it contains `mpeg` → `video/mpeg`
+- if it contains `wav` or `pcm` → `audio/wav`
+- if it contains `mp3` → `audio/mpeg`
+
+#### ✅ 2.4 Last fallback: the extension map
+If everything else fails, it **falls back to**:
+
+- `get_mime_for_extension(ach_variables['extension'])`
+
+So here the central mapping (`EXTENSION_MIME_MAP`) is the last resort.
+
+---
+
+## 🧩 In which cases does it get it “wrong”?
+
+Potential weak points / places where the MIME type can come out wrong:
+
+1. **`mediainfo.Format` uses unexpected values** (e.g. “MPEG-4” for `.mov` → but we already have the `.mov` fallback)
+2. **`ach_variables['extension']` is wrong** (if it does not come from the right file)
+3. **`EXTENSION_MIME_MAP` does not contain an extension** used in production (but you have already fixed that)
+
+---
+
+## 📍 Where the logic is used
+`get_mime_from_mediainfo()` is called in `add_file_record_and_relationship()` when the “master” is added to the DB (so it affects the master's `file_type`), while the stream uses `get_mime_for_extension()` (“media” record).
+
+---
+
+If you like, I can point out:
+- Exactly which line of `add_file_record_and_relationship` calls `get_mime_from_mediainfo`
+- The precise value used in `ach_variables` at each stage (to understand exactly what reaches `get_mime_from_mediainfo`)
\ No newline at end of file
diff --git a/config.py b/config.py
index ddb0dd7..2d1acf0 100644
--- a/config.py
+++ b/config.py
@@ -67,17 +67,20 @@ def load_config():
     return aws_config, db_config, ach_config, bucket_name, ach_variables
 
 EXTENSION_MIME_MAP = {
-    '.avi': 'video/x-msvideo',
-    '.mov': 'video/quicktime',
-    '.wav': 'audio/wav',
-    '.mp4': 'video/mp4',
-    '.m4v': 'video/mp4',
-    '.mp3': 'audio/mp3',
-    '.mxf': 'application/mxf',
-    '.mpg': 'video/mpeg',
     '.aif': 'audio/aiff',
-    '.wmv': 'video/x-ms-asf',
+    '.aiff': 'audio/aiff',
+    '.avi': 'video/x-msvideo',
+    '.jpg': 'image/jpeg',
     '.m4a': 'audio/mp4',
+    '.m4v': 'video/mp4',
+    '.mov': 'video/quicktime',
+    '.mp3': 'audio/mpeg',
+    '.mp4': 'video/mp4',
+    '.mpg': 'video/mpeg',
+    '.mxf': 'application/mxf',
+    '.pdf': 'application/pdf',
+    '.wav': 'audio/wav',
+    '.wmv': 'video/x-ms-asf',
 }
 
 MIME_TYPES = sorted(list(set(EXTENSION_MIME_MAP.values())))
diff --git a/db_utils.py b/db_utils.py
index 5f13640..f70e88e 100644
--- a/db_utils.py
+++ b/db_utils.py
@@ -54,30 +54,50 @@ def get_mime_from_mediainfo(ach_variables: dict) -> str:
     This is used to capture the *master* format (conservatory copy) even when the
     stream copy on S3 is a different container (e.g., _H264.mp4).
 
+    Rules:
+    - If the file is outside `FILE/`, the master must be ProRes (MOV/QuickTime).
+      If it is not ProRes, this is a fatal error.
+    - If the file is under `FILE/`, any format is acceptable; MIME is derived
+      from the JSON metadata.
+
     If mediainfo is missing or cannot be mapped, fall back to extension-based mapping.
     """
-    # Prefer the master (conservative copy) extension when it is explicitly available.
-    # In some cases MediaInfo reports "MPEG-4" for .mov containers, so the extension
-    # is a more reliable hint for the correct mime type.
-    conservative_ext = ach_variables.get('conservative_copy_extension')
-    if conservative_ext and conservative_ext.lower() == '.mov':
+
+    # Determine whether this is a streaming file (FILE/) or a master file.
+    file_fullpath = ach_variables.get('file_fullpath', '') or ach_variables.get('objectKeys', {}).get('media', '')
+    file_fullpath_norm = file_fullpath.replace('\\', '/').lstrip('/')
+    prefix = file_fullpath_norm.split('/', 1)[0].upper() if file_fullpath_norm else ''
+    is_file_folder = (prefix == 'FILE')
+
+    # Extract MediaInfo tracks
+    mediainfo = ach_variables.get('custom_data_in', {}).get('mediainfo', {})
+    tracks = mediainfo.get('media', {}).get('track', [])
+
+    # ---- Master (outside FILE/) must be ProRes ----
+    if not is_file_folder:
+        # Find the video track
+        video_track = None
+        for t in tracks:
+            if t.get('@type', '').lower() == 'video':
+                video_track = t
+                break
+
+        if not video_track:
+            raise ValueError("Master file missing a Video track in mediainfo; cannot validate ProRes")
+
+        format_value = (video_track.get('Format') or '').strip()
+        if 'prores' not in format_value.lower():
+            raise ValueError(f"Master file is not ProRes (Format='{format_value}').")
+
+        # Strict rule: master is always QuickTime/ProRes
         return 'video/quicktime'
 
-    # Also check the actual conservative copy object key (from JSON @ref). This is the
-    # name that will be stored in the DB as the master file, so it should drive the MIME.
-    conservative_copy_key = ach_variables.get('objectKeys', {}).get('conservative_copy', '')
-    if conservative_copy_key and os.path.splitext(conservative_copy_key)[1].lower() == '.mov':
-        return 'video/quicktime'
-
-    # Try to find the General track format from mediainfo
+    # ---- FILE/ folder: do not enforce ProRes; derive MIME from JSON (or extension fallback) ----
     try:
-        mediainfo = ach_variables.get('custom_data_in', {}).get('mediainfo', {})
-        tracks = mediainfo.get('media', {}).get('track', [])
         for track in tracks:
             if track.get('@type', '') == 'General':
                 format_value = track.get('Format', '')
                 if format_value:
-                    # Map common MediaInfo format values to MIME types
                     mapping = {
                         'AVI': 'video/x-msvideo',
                         'MOV': 'video/quicktime',
@@ -96,11 +116,9 @@ def get_mime_from_mediainfo(ach_variables: dict) -> str:
                         'AIFF': 'audio/aiff',
                         'FLAC': 'audio/flac',
                     }
-                    # Do a case-insensitive match
                     for k, v in mapping.items():
                         if format_value.lower() == k.lower():
                             return v
-                    # Try a fuzzy match based on known substrings
                     if 'avi' in format_value.lower():
                         return 'video/x-msvideo'
                     if 'mp4' in format_value.lower():
diff --git a/file_utils.py b/file_utils.py
index 72b3ea8..f7dcdfc 100644
--- a/file_utils.py
+++ b/file_utils.py
@@ -146,7 +146,7 @@ def check_related_files(s3, file_name_with_path, file, bucket_name):
     if check_file_exists_in_s3(s3, pdf_file,bucket_name):
         pdf_file_size = get_file_size(s3, bucket_name, pdf_file)
         ach_pdf_disk_size = pdf_file_size
-        logging.info(f"PDF disk size: {ach_pdf_disk_size}")
+        # logging.info(f"PDF disk size: {ach_pdf_disk_size}")
     else:
         logging.error(f"PDF file {pdf_file} not found.")
         raise FileNotFoundError(f"PDF file {pdf_file} not found.")
diff --git a/main.py b/main.py
index 7bb25bb..44645fe 100644
--- a/main.py
+++ b/main.py
@@ -91,6 +91,8 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
     list_s3_files = list_s3_bucket(s3_client, bucket_name)
 
     # Define valid extensions and excluded folders
+    # NOTE: This list is used only for the initial S3 filtering step (Phase 1).
+    # It determines which object keys are considered for further processing.
     valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
     # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'FILE/'}
     # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/', 'FILE/', 'DVD/', 'UMT/'}
@@ -141,7 +143,12 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
     # Generic sanity check: prefix (folder name) should equal the media type in the code
     is_valid_prefix = (folder_prefix == media_type_in_code)
 
-    # Special folder allowance rules
+    # Some folders are allowed to contain multiple media types.
+    # This is a relaxation of the strict prefix==code rule for known cases
+    # where the folder is effectively a container of multiple media formats.
+    # E.g.:
+    # - DVD folder may contain both DVD and BRD files
+    # - FILE folder is used for retrievals and may contain many container types
     folder_allowances = {
         'DVD': ['DVD', 'BRD'],
         'FILE': ['M4V', 'AVI', 'MOV', 'MP4', 'MXF', 'AIF', 'WMV', 'M4A', 'MPG'],
diff --git a/s3_utils.py b/s3_utils.py
index 88ae08f..c47eb50 100644
--- a/s3_utils.py
+++ b/s3_utils.py
@@ -63,46 +63,58 @@ def parse_s3_files( s3, s3_files, ach_variables, excluded_folders=[]):
         error_files_count = 0
         warning_files_count = 0
         uploaded_files_count = 0
+        total_files = len(filtered_files)
         #for file in s3_files:
-        for file in filtered_files:
-            if file.endswith(('.mp4', '.mp3')): # Check for both .mp4 and .mp3
-                logging.info("Processing file: %s in the bucket: %s", file, bucket_name)
-                # check if file exists in db
-                result = check_objkey_in_file_db( cur, file)
-                # Check the result and proceed accordingly
-                if result:
-                    # logging.warning(f"File {file} already exists in the database.")
-                    warning_files_count += 1
-                    if os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true':
-                        logging.error("ACH_SAFE_RUN=true: aborting Phase 3 due to warnings (file already exists in DB): %s", file)
-                        raise ValueError("ACH_SAFE_RUN=true: aborting due to warnings in Phase 3")
-                    continue
+        for idx, file in enumerate(filtered_files, start=1):
+            # Display progress to console only (not written to log files)
+            print(f"--------------\n--- file {idx} of {total_files} ---\n--------------", flush=True)
+
+            # Use a savepoint per file to allow rollback on individual failures
+            # without aborting the full batch.
+            cur.execute("SAVEPOINT file_save")
+            try:
+                if file.endswith(('.mp4', '.mp3')): # Check for both .mp4 and .mp3
+                    logging.info("Processing file: %s in the bucket: %s", file, bucket_name)
+                    # check if file exists in db
+                    result = check_objkey_in_file_db(cur, file)
+                    # Check the result and proceed accordingly
+                    if result:
+                        # logging.warning(f"File {file} already exists in the database.")
+                        warning_files_count += 1
+                        if os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true':
+                            logging.error("ACH_SAFE_RUN=true: aborting Phase 3 due to warnings (file already exists in DB): %s", file)
+                            raise ValueError("ACH_SAFE_RUN=true: aborting due to warnings in Phase 3")
+                        # Rollback to savepoint to undo any partial changes for this file
+                        cur.execute("ROLLBACK TO SAVEPOINT file_save")
+                        continue
 
-                ach_variables['file_fullpath'] = file # is the Object key
-                ach_variables['inventory_code'] = os.path.splitext(os.path.basename(file))[0][:12]
-                logging.info(f"ach_variables['inventory_code'] {ach_variables['inventory_code']}: {file}")
-                # Extract the file extension
-                ach_variables['objectKeys']['media'] = file
-                ach_variables['objectKeys']['pdf'] = f"{os.path.splitext(file)[0]}.pdf"
-                ach_variables['objectKeys']['pdf'] = ach_variables['objectKeys']['pdf'].replace('_H264', '')
-                from config import EXTENSION_MIME_MAP
-                if file.endswith('.mp4'):
-                    ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.mov" # remove _H264 is done later
-                elif file.endswith('.mp3'):
-                    ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.wav"
-                else:
-                    logging.error(f"Unsupported file type: {file}")
-                    error_files_count +=1
-                    continue
-
-                # Extract the file extension
-                file_extension = os.path.splitext(file)[1]
-                ach_variables['extension'] = file_extension # Store the file extension in ach_variables
-                logging.info(f"the file File extension: {file_extension}")
-
-                # Extract the file name with directory part
-                file_name_with_path = os.path.splitext(file)[0] # Remove the extension but keep path
-                logging.info(f"File name with path: {file_name_with_path}")
+                    ach_variables['file_fullpath'] = file # is the Object key
+                    ach_variables['inventory_code'] = os.path.splitext(os.path.basename(file))[0][:12]
+                    logging.info(f"ach_variables['inventory_code'] {ach_variables['inventory_code']}: {file}")
+                    # Extract the file extension
+                    ach_variables['objectKeys']['media'] = file
+                    ach_variables['objectKeys']['pdf'] = f"{os.path.splitext(file)[0]}.pdf"
+                    ach_variables['objectKeys']['pdf'] = ach_variables['objectKeys']['pdf'].replace('_H264', '')
+                    from config import EXTENSION_MIME_MAP
+                    if file.endswith('.mp4'):
+                        ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.mov" # remove _H264 is done later
+                    elif file.endswith('.mp3'):
+                        ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.wav"
+                    else:
+                        logging.error(f"Unsupported file type: {file}")
+                        error_files_count +=1
+                        # Rollback to savepoint for this file
+                        cur.execute("ROLLBACK TO SAVEPOINT file_save")
+                        continue
+
+                    # Extract the file extension
+                    file_extension = os.path.splitext(file)[1]
+                    ach_variables['extension'] = file_extension # Store the file extension in ach_variables
+                    logging.info(f"the file File extension: {file_extension}")
+
+                    # Extract the file name with directory part
+                    file_name_with_path = os.path.splitext(file)[0] # Remove the extension but keep path
+                    logging.info(f"File name with path: {file_name_with_path}")
 
                 # Extract the base name from the file name
                 base_name = os.path.basename(file_name_with_path) # Extract the base name with path removed
@@ -227,6 +239,22 @@ def parse_s3_files( s3, s3_files, ach_variables, excluded_folders=[]):
                 # Commit to the database (conn, cur) only if everything is okay; otherwise, perform a rollback.
                 conn.commit()
                 uploaded_files_count +=1
+            except Exception as e:
+                # Roll back the changes done for this file only and continue processing others
+                logging.error(f"Error processing {file}: {e}. Rolling back this file's changes.")
+                try:
+                    cur.execute("ROLLBACK TO SAVEPOINT file_save")
+                except Exception as rollback_err:
+                    logging.error(f"Failed to rollback savepoint for {file}: {rollback_err}")
+                error_files_count += 1
+                continue
+            finally:
+                # Release the savepoint so it doesn't linger in the session
+                try:
+                    cur.execute("RELEASE SAVEPOINT file_save")
+                except Exception:
+                    # Ignore release errors; rollback already cleaned up state if needed
+                    pass
         cur.close()
         conn.close()
     except ValueError as e:
diff --git a/tests/test_db_utils_mime.py b/tests/test_db_utils_mime.py
new file mode 100644
index 0000000..021e84a
--- /dev/null
+++ b/tests/test_db_utils_mime.py
@@ -0,0 +1,64 @@
+import unittest
+
+from db_utils import get_mime_from_mediainfo
+
+
+class TestGetMimeFromMediainfo(unittest.TestCase):
+
+    def test_master_prores_returns_quicktime(self):
+        ach_variables = {
+            'file_fullpath': 'DBT/VO-DBT-00001.mov',
+            'custom_data_in': {
+                'mediainfo': {
+                    'media': {
+                        'track': [
+                            {'@type': 'General', 'Format': 'MPEG-4'},
+                            {'@type': 'Video', 'Format': 'ProRes', 'Format_Profile': '4444'},
+                        ]
+                    }
+                }
+            }
+        }
+
+        mime = get_mime_from_mediainfo(ach_variables)
+        self.assertEqual(mime, 'video/quicktime')
+
+    def test_master_not_prores_raises(self):
+        ach_variables = {
+            'file_fullpath': 'DBT/VO-DBT-00002.mov',
+            'custom_data_in': {
+                'mediainfo': {
+                    'media': {
+                        'track': [
+                            {'@type': 'General', 'Format': 'MPEG-4'},
+                            {'@type': 'Video', 'Format': 'AVC', 'Format_Profile': 'High'},
+                        ]
+                    }
+                }
+            }
+        }
+
+        with self.assertRaises(ValueError):
+            get_mime_from_mediainfo(ach_variables)
+
+    def test_file_folder_allows_non_prores(self):
+        ach_variables = {
+            'file_fullpath': 'FILE/VO-DBT-00003.mp4',
+            'custom_data_in': {
+                'mediainfo': {
+                    'media': {
+                        'track': [
+                            {'@type': 'General', 'Format': 'MPEG-4'},
+                            {'@type': 'Video', 'Format': 'AVC', 'Format_Profile': 'High'},
+                        ]
+                    }
+                }
+            }
+        }
+
+        mime = get_mime_from_mediainfo(ach_variables)
+        self.assertEqual(mime, 'video/mp4')
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/utils.py b/utils.py
index 72c782a..fbebdf8 100644
--- a/utils.py
+++ b/utils.py
@@ -21,10 +21,13 @@ def check_video_info(media_info):
     parent_dir = os.path.basename(os.path.dirname(file_name))
     logging.info(f"Parent directory: {parent_dir}")
 
-    # If the parent directory is 'FILE' accept multiple container types
+    # If the parent directory is 'FILE' we are in the "streaming" / "retrieval" path.
+    # In this context we only accept a predefined set of container extensions.
+    # This is not the MASTER copy validation (which requires ProRes and is handled elsewhere).
     if parent_dir.lower() == 'file':
-        # Accept .mov, .avi, .m4v, .mp4, .mxf, .mpg (case-insensitive)
-        # video alowed extension
+        # Allowed video container extensions for FILE/ paths.
+        # These are used purely as a whitelist to reject unknown/unsupported containers
+        # before we attempt to parse the mediainfo JSON.
         video_allowed_extensions = ['.mov', '.avi', '.m4v', '.mp4', '.mxf', '.mpg', '.mpeg', '.wmv']
         if not any(file_name.lower().endswith(ext) for ext in video_allowed_extensions):
             return False, "The file is not a .mov, .avi, .m4v, .mp4, .mxf, .mpg, .mpeg or .wmv file."
@@ -88,9 +91,11 @@ def check_audio_info(media_info):
     file_name = media_info.get('media', {}).get('@ref', '')
     parent_dir = os.path.basename(os.path.dirname(file_name))
 
-    # If the file lives under FILE/, allow MP3/WAV/M4A/AIF as valid audio containers
+    # If the file lives under FILE/, we treat it as a streaming/retrieval file.
+    # In this path we whitelist only a specific set of audio containers.
+    # (Master validation is handled elsewhere and requires ProRes.)
     if parent_dir.lower() == 'file':
-        audio_allowed_extensions = ['.wav', '.mp3', '.m4a', '.aif']
+        audio_allowed_extensions = ['.wav', '.mp3', '.m4a', '.aif', '.aiff']
         if not any(file_name.lower().endswith(ext) for ext in audio_allowed_extensions):
             return False, f"The file is not one of the allowed audio containers: {', '.join(audio_allowed_extensions)}."
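
The TODO-mime.md report in this change suggests taking the `FILE/` extension whitelists from `config.EXTENSION_MIME_MAP` instead of hard-coding them in utils.py. A minimal sketch of that idea follows, assuming the MIME type prefix (`audio/`, `video/`) is an acceptable way to group extensions. `extensions_for` is a hypothetical helper, not a function in this repository, and the map is copied inline from the new config.py only so that the snippet runs on its own.

```python
# Copied from the new EXTENSION_MIME_MAP in config.py (in the project this
# would be imported from config rather than duplicated).
EXTENSION_MIME_MAP = {
    '.aif': 'audio/aiff',
    '.aiff': 'audio/aiff',
    '.avi': 'video/x-msvideo',
    '.jpg': 'image/jpeg',
    '.m4a': 'audio/mp4',
    '.m4v': 'video/mp4',
    '.mov': 'video/quicktime',
    '.mp3': 'audio/mpeg',
    '.mp4': 'video/mp4',
    '.mpg': 'video/mpeg',
    '.mxf': 'application/mxf',
    '.pdf': 'application/pdf',
    '.wav': 'audio/wav',
    '.wmv': 'video/x-ms-asf',
}


def extensions_for(mime_prefix, mime_map=EXTENSION_MIME_MAP):
    """Hypothetical helper: sorted extensions whose MIME type starts with mime_prefix."""
    return sorted(ext for ext, mime in mime_map.items() if mime.startswith(mime_prefix))


# Candidate replacements for the hard-coded lists in utils.py.
audio_allowed_extensions = extensions_for('audio/')
video_allowed_extensions = extensions_for('video/')

print(audio_allowed_extensions)
print(video_allowed_extensions)
```

With the map as committed, the derived video list would omit `.mxf` (mapped to `application/mxf`) and `.mpeg` (absent from the map), both of which `check_video_info()` currently accepts, so the two sources would need reconciling before the hard-coded lists are swapped out.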