feat: Enhance media import functionality with centralized MIME type management and improved validation

- Added EXTENSION_MIME_MAP to config.py for centralized MIME type mapping.
- Updated countfiles.py to utilize centralized MIME types for file counting.
- Refactored db_utils.py to import EXTENSION_MIME_MAP and added get_mime_from_mediainfo function for better MIME type determination.
- Enhanced file_utils.py with prefix matching for S3 files and JSON references.
- Improved logging in logging_config.py for better error tracking.
- Added security checks in main.py for development environment imports.
- Updated validation_utils.py to include stricter extension validation and improved error handling.
- Introduced new instructions for developers in .github/copilot-instructions.md for project setup and coding standards.
Author: MSVstudios
Date: 2026-03-15 15:04:50 +01:00
Commit: ccd700a4a8 (parent 6030cc3f84)
11 changed files with 500 additions and 215 deletions

.github/copilot-instructions.md (new file, 65 lines)

@@ -0,0 +1,65 @@
# ACH Server Media Import - Agent Instructions
Guidelines and standards for the ACH Media Import project.
## Project Overview
This project is a Python-based utility that imports media files from an S3-compatible bucket into a PostgreSQL database, enforcing specific naming conventions and metadata validation.
## Technical Stack
- **Language**: Python 3.8+
- **Database**: PostgreSQL (via `psycopg2`)
- **Cloud Storage**: AWS S3/S3-compatible storage (via `boto3`)
- **Containerization**: Docker & Docker Compose
- **Environment**: Managed via `.env` and `config.py`
## Architecture & Modular Design
The project uses a utility-based modular architecture orchestrated by `main.py`.
- [main.py](main.py): Entry point and workflow orchestrator.
- [s3_utils.py](s3_utils.py): S3 client operations and bucket listing.
- [db_utils.py](db_utils.py): Database connectivity and SQL execution.
- [validation_utils.py](validation_utils.py): Pattern matching and business logic validation.
- [logging_config.py](logging_config.py): Centralized logging configuration.
- [error_handler.py](error_handler.py): Error handling and notifications.
- [email_utils.py](email_utils.py): SMTP integration for alerts.
## Domain Logic: Inventory Codes
The core validation revolves around "Inventory Codes" which MUST follow a strict 12-character format:
- `^[VA][OC]-[A-Z0-9]{3}-\d{5}$`
- Examples: `VO-UMT-14387`, `AC-MCC-67890`.
- Files not matching this pattern in S3 are logged but skipped.
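As a standalone illustration (not code from the repository; the helper name is ours), the pattern above can be checked against the first 12 characters of a base name with Python's `re` module:

```python
import re

# Strict 12-character inventory code, e.g. VO-UMT-14387
INVENTORY_CODE_RE = re.compile(r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$')

def is_valid_inventory_code(name: str) -> bool:
    """Check the first 12 characters of a file base name against the pattern."""
    return bool(INVENTORY_CODE_RE.match(name[:12]))
```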
## Development Workflows
### Environment Setup
- **Windows**: Use `. .venv\Scripts\Activate.ps1`
- **Linux/macOS**: Use `source .venv/bin/activate`
- **Dependency installation**: `pip install -r requirements.txt`
### Local Execution
- **Run script**: `python main.py`
- **Verify Configuration**: Ensure `.env` is populated with `DB_`, `AWS_`, and `SMTP_` variables.
### Docker Operations
- **Build/Up**: `docker compose up -d --build`
- **Logs**: `docker compose logs -f app`
- **Stop**: `docker compose stop`
## Coding Standards & Conventions
### Logging
- Use the custom logger from `logging_config.py`.
- **Log Levels**: Use `logging.INFO`, `logging.WARNING`, and the custom `CUSTOM_ERROR_LEVEL` (35) via `error_handler.py`.
- Logs are rotated and stored in the `logs/` directory.
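A custom numeric level like this is normally registered through the standard `logging` API. A minimal sketch follows; the level name `CUSTOM_ERROR` and the helper are our assumptions, only the value 35 comes from the text above:

```python
import logging

CUSTOM_ERROR_LEVEL = 35  # sits between WARNING (30) and ERROR (40)
logging.addLevelName(CUSTOM_ERROR_LEVEL, "CUSTOM_ERROR")

def custom_error(logger: logging.Logger, msg: str, *args) -> None:
    """Emit a record at the project's custom level if the logger allows it."""
    if logger.isEnabledFor(CUSTOM_ERROR_LEVEL):
        logger.log(CUSTOM_ERROR_LEVEL, msg, *args)
```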
### Error Handling
- Wrap critical operations that should trigger notifications in try-except blocks that call `error_handler.notify_error()`.
- Avoid silent failures; ensure errors are logged to the appropriate file sink.
### Configuration
- Access settings exclusively via the `config.py` module's dictionaries: `db_config`, `aws_config`, `ach_config`.
- Never hardcode credentials or endpoints.
## Related Files
- [query-sql.md](query-sql.md): Reference for database schema and SQL logic.
- [requirements.txt](requirements.txt): Project dependencies.
- [docker-compose.yml](docker-compose.yml): Deployment configuration.

README.md (178 lines changed)

# ACH Server Media Import

This repository contains a script that imports media files from an S3-compatible bucket into a PostgreSQL database. It supports both local execution (Python virtual environment) and Docker deployment via `docker-compose`.

---

## Overview

### Asset hierarchy
- **Conservatory Copy (Master)**: High-quality source (e.g., `.mov`, `.wav`). This is the primary record in the database.
- **Streaming Copy (Derivative)**: Transcoded versions (`.mp4`, `.mp3`) linked to the master.
- **Sidecar Metadata (`.json`)**: Contains technical metadata (`mediainfo` / `ffprobe`) used for validation and to determine the correct MIME type.
- **Sidecar QC (`.pdf`, `.md5`)**: Quality control and checksum files.

> **Important:** all files belonging to the same asset must share the same 12-character inventory code (e.g., `VO-UMT-14387`).

---

## Process Phases

The importer runs in three clearly separated phases (each phase is logged in detail):

### Phase 1: S3 discovery + initial validation
- List objects in the configured S3 bucket.
- Keep only allowed extensions: `.mp4`, `.mp3`, `.json`, `.pdf`, `.md5`.
- Exclude configured folders (e.g., `TEST-FOLDER-DEV/`, `DOCUMENTAZIONE_FOTOGRAFICA/`, `UMT/`).
- Validate the inventory code format and ensure the folder prefix matches the type encoded in the inventory code.
- Files failing validation are rejected **before** any database interaction.

### Phase 2: Database cross-reference + filtering
- Load existing filenames from the database.
- Skip files already represented in the DB, including sidecar records.
- Build the final list of S3 objects to parse.

### Phase 3: Parse & insert
- Read and validate sidecars (`.json`, `.md5`, `.pdf`) alongside the media file.
- Use metadata (from `mediainfo` / `ffprobe`) to derive the **master MIME type** and enforce container rules.
- Insert new records into the database (unless `ACH_DRY_RUN=true`).

---

## Validation Policy

The import pipeline enforces strict validation to prevent bad data from entering the database.

### Inventory Code & Folder Prefix
- Expected inventory code format: `^[VA][OC]-[A-Z0-9]{3}-\d{5}$`.
- The folder prefix (e.g., `BRD/`, `DVD/`, `FILE/`) must match the code type.
- If the prefix does not match the inventory code, the file is rejected in Phase 1.

### Safe Run (`ACH_SAFE_RUN`)
- When `ACH_SAFE_RUN=true`, **any warning during Phase 3 causes an immediate abort**.
- This prevents partial inserts when the importer detects inconsistent or already-present data.

### MIME Type Determination
- The MIME type for master files is derived from the JSON sidecar metadata (`mediainfo` / `ffprobe`), not from the streaming derivative extension.

---

## Quick Start (Local)

### Prerequisites
- Python 3.8+
- Virtual environment support (`venv`)

### Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Run
```bash
python main.py
```

---

## Docker (docker-compose)

The project includes a `docker-compose.yml` with an `app` service (container name `ACH_server_media_importer`). It reads environment variables from `.env` and mounts a `logs` volume.

### Build & run
```bash
docker compose up -d --build
```

### Logs
```bash
docker compose logs -f app
```

### Stop
```bash
docker compose stop
```

### Rebuild (clean)
```bash
docker compose down --volumes --rmi local
docker compose up -d --build
```

---

## Configuration

Configuration is driven by `.env` and `config.py`. Key variables include:
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `BUCKET_NAME`
- `DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, `DB_PORT`
- `ACH_DRY_RUN` (`true` / `false`)
- `ACH_SAFE_RUN` (`true` / `false`)

---

## Troubleshooting
- If Docker does not pick up changes, ensure `docker compose up -d --build` is run from the repo root.
- Inspect runtime errors via `docker compose logs -f app`.
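For illustration only, boolean flags like `ACH_DRY_RUN` and `ACH_SAFE_RUN` are typically read as shown below. This is a sketch, not the repository's exact code; the `env_flag` helper is our name:

```python
import os

def env_flag(name: str, default: str = "false") -> bool:
    """Interpret an environment variable as a boolean flag ('true'/'false')."""
    return os.getenv(name, default).strip().lower() == "true"

# ACH_DRY_RUN defaults to true (no inserts); ACH_SAFE_RUN aborts Phase 3 on warnings.
dry_run = env_flag("ACH_DRY_RUN", "true")
safe_run = env_flag("ACH_SAFE_RUN", "true")
```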

config.py

@@ -66,5 +66,21 @@ def load_config():
     return aws_config, db_config, ach_config, bucket_name, ach_variables
 
+EXTENSION_MIME_MAP = {
+    '.avi': 'video/x-msvideo',
+    '.mov': 'video/quicktime',
+    '.wav': 'audio/wav',
+    '.mp4': 'video/mp4',
+    '.m4v': 'video/mp4',
+    '.mp3': 'audio/mp3',
+    '.mxf': 'application/mxf',
+    '.mpg': 'video/mpeg',
+    '.aif': 'audio/aiff',
+    '.wmv': 'video/x-ms-asf',
+    '.m4a': 'audio/mp4',
+}
+
+MIME_TYPES = sorted(list(set(EXTENSION_MIME_MAP.values())))
 # Consider using a class for a more structured approach (optional)
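The lookup this centralized map enables (mirroring `get_mime_for_extension` in `db_utils.py`) can be exercised with this self-contained sketch, which reproduces the committed dict inline:

```python
# Reproduces the committed mapping so the example runs standalone
EXTENSION_MIME_MAP = {
    '.avi': 'video/x-msvideo', '.mov': 'video/quicktime', '.wav': 'audio/wav',
    '.mp4': 'video/mp4', '.m4v': 'video/mp4', '.mp3': 'audio/mp3',
    '.mxf': 'application/mxf', '.mpg': 'video/mpeg', '.aif': 'audio/aiff',
    '.wmv': 'video/x-ms-asf', '.m4a': 'audio/mp4',
}

def get_mime_for_extension(extension: str) -> str:
    """Accepts the extension with or without the leading dot; unknown -> octet-stream."""
    if extension and not extension.startswith('.'):
        extension = f'.{extension}'
    return EXTENSION_MIME_MAP.get((extension or '').lower(), 'application/octet-stream')
```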

countfiles.py

@@ -130,33 +130,13 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
     # connect to database
     conn = psycopg2.connect(**db_config)
     cur = conn.cursor()
-    # function count_files that are wav and mov in db
-    # Map file extensions (include leading dot) to mime types
-    EXTENSION_MIME_MAP = {
-        '.avi': 'video/x-msvideo',
-        '.mov': 'video/mov',
-        '.wav': 'audio/wav',
-        '.mp4': 'video/mp4',
-        '.m4v': 'video/mp4',
-        '.mp3': 'audio/mp3',
-        '.mxf': 'application/mxf',
-        '.mpg': 'video/mpeg',
-    }
-    # populate mime_type list with all relevant MediaInfo/MIME values
-    mime_type = [
-        'video/x-msvideo',   # .avi
-        'video/mov',         # .mov
-        'audio/wav',         # .wav
-        'video/mp4',         # .mp4, .m4v
-        'audio/mp3',         # .mp3
-        'application/mxf',   # .mxf
-        'video/mpeg',        # .mpg
-    ]
-    logging.info(f"Mime types for counting files: {mime_type}")
-    all_files_on_db = count_files(cur, mime_type, '*', False)
+    # Use centralized mime types from config
+    from config import EXTENSION_MIME_MAP, MIME_TYPES
+    logging.info(f"Mime types for counting files: {MIME_TYPES}")
+    all_files_on_db = count_files(cur, MIME_TYPES, '*', False)
     mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
     mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
     mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)

db_utils.py

@@ -34,18 +34,7 @@ from email_utils import handle_error
 import json
 import os
 import config
-
-# Map file extensions (include leading dot) to mime types
-EXTENSION_MIME_MAP = {
-    '.avi': 'video/x-msvideo',
-    '.mov': 'video/mov',
-    '.wav': 'audio/wav',
-    '.mp4': 'video/mp4',
-    '.m4v': 'video/mp4',
-    '.mp3': 'audio/mp3',
-    '.mxf': 'application/mxf',
-    '.mpg': 'video/mpeg',
-}
+from config import EXTENSION_MIME_MAP
 
 def get_mime_for_extension(extension: str) -> str:
     """Return the mime type for an extension. Accepts with or without leading dot.
@@ -58,6 +47,76 @@ def get_mime_for_extension(extension: str) -> str:
         extension = f'.{extension}'
     return EXTENSION_MIME_MAP.get(extension.lower(), 'application/octet-stream')
+def get_mime_from_mediainfo(ach_variables: dict) -> str:
+    """Determine a MIME type from the JSON sidecar mediainfo.
+
+    This is used to capture the *master* format (conservatory copy) even when
+    the stream copy on S3 is a different container (e.g., _H264.mp4).
+    If mediainfo is missing or cannot be mapped, fall back to extension-based mapping.
+    """
+    # Prefer the master (conservative copy) extension when it is explicitly available.
+    # In some cases MediaInfo reports "MPEG-4" for .mov containers, so the extension
+    # is a more reliable hint for the correct mime type.
+    conservative_ext = ach_variables.get('conservative_copy_extension')
+    if conservative_ext and conservative_ext.lower() == '.mov':
+        return 'video/quicktime'
+    # Also check the actual conservative copy object key (from JSON @ref). This is the
+    # name that will be stored in the DB as the master file, so it should drive the MIME.
+    conservative_copy_key = ach_variables.get('objectKeys', {}).get('conservative_copy', '')
+    if conservative_copy_key and os.path.splitext(conservative_copy_key)[1].lower() == '.mov':
+        return 'video/quicktime'
+    # Try to find the General track format from mediainfo
+    try:
+        mediainfo = ach_variables.get('custom_data_in', {}).get('mediainfo', {})
+        tracks = mediainfo.get('media', {}).get('track', [])
+        for track in tracks:
+            if track.get('@type', '') == 'General':
+                format_value = track.get('Format', '')
+                if format_value:
+                    # Map common MediaInfo format values to MIME types
+                    mapping = {
+                        'AVI': 'video/x-msvideo',
+                        'MOV': 'video/quicktime',
+                        'QuickTime': 'video/quicktime',
+                        'MPEG-4': 'video/mp4',
+                        'MP4': 'video/mp4',
+                        'MXF': 'application/mxf',
+                        'MPEG': 'video/mpeg',
+                        'MPEG-PS': 'video/mpeg',
+                        'MPEG-TS': 'video/MP2T',
+                        'MPEG Audio': 'audio/mpeg',
+                        'MPEG Audio/Layer 3': 'audio/mpeg',
+                        'AAC': 'audio/aac',
+                        'PCM': 'audio/wav',
+                        'WAV': 'audio/wav',
+                        'AIFF': 'audio/aiff',
+                        'FLAC': 'audio/flac',
+                    }
+                    # Do a case-insensitive match
+                    for k, v in mapping.items():
+                        if format_value.lower() == k.lower():
+                            return v
+                    # Try a fuzzy match based on known substrings
+                    if 'avi' in format_value.lower():
+                        return 'video/x-msvideo'
+                    if 'mp4' in format_value.lower():
+                        return 'video/mp4'
+                    if 'mpeg' in format_value.lower():
+                        return 'video/mpeg'
+                    if 'wav' in format_value.lower() or 'pcm' in format_value.lower():
+                        return 'audio/wav'
+                    if 'mp3' in format_value.lower():
+                        return 'audio/mpeg'
+        # Fall back to extension-based mapping when metadata doesn't yield a mime
+        extension = ach_variables.get('extension')
+        return get_mime_for_extension(extension)
+    except Exception:
+        return get_mime_for_extension(ach_variables.get('extension'))
 
 def get_distinct_filenames_from_db():
     """Retrieve distinct digital file names from the Postgres DB.
@@ -94,15 +153,19 @@ def check_inventory_in_db(s3_client, cur, base_name):
     # Define the pattern for the inventory code
     media_tipology_A = ['MCC', 'OA4', 'DAT']
+    # FOR FILE: extend media_tipology_A (kept separate for readability)
+    media_tipology_A += ['M4A', 'AIF']  # add for "FILE" folders 04112025
     # TODO add other tipologies: AVI, M4V, MOV, MP4, MXF, MPG (done 04112025)
     media_tipology_V = [
         'OV1', 'OV2', 'UMT', 'VHS', 'HI8', 'VD8', 'BTC', 'DBT', 'IMX', 'DVD',
-        'CDR', 'MDV', 'DVC', 'HDC', 'BRD', 'CDV',
-        'AVI', 'M4V', 'MOV', 'MP4', 'MXF', 'MPG'  # add for "file" folders 04112025
+        'CDR', 'MDV', 'DVC', 'HDC', 'BRD', 'CDV'
     ]
+    # FOR FILE: extend media_tipology_V (kept separate for readability)
+    media_tipology_V += ['AVI', 'M4V', 'MOV', 'MP4', 'MXF', 'MPG', 'WMV']  # add for "FILE" folders 04112025
-    # list of known mime types (derived from EXTENSION_MIME_MAP)
-    mime_type = list({v for v in EXTENSION_MIME_MAP.values()})
+    # Use centralized mime types from config
+    from config import MIME_TYPES
 
     try:
         logging.info(f"SUPPORT TYPOLOGY : {base_name[3:6]}")
@@ -270,8 +333,8 @@ def add_file_record_and_relationship(s3_client, cur, base_name, ach_variables):
     file_availability_dict = 7  # Place Holder
     # add a new file record for the "copia conservativa"
     ach_variables['custom_data_in']['media_usage'] = 'master'  # can be "copia conservativa"
-    # determine master mime type from the file extension
-    master_mime_type = get_mime_for_extension(ach_variables.get('extension'))
+    # determine master mime type using the JSON sidecar metadata (preferred)
+    master_mime_type = get_mime_from_mediainfo(ach_variables)
     new_file_id = add_file_record(
         cur,
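A trimmed, standalone sketch of the General-track lookup used above (the committed function additionally checks the conservative-copy extension first and does fuzzy substring matching; the helper name here is ours):

```python
# Subset of the committed MediaInfo Format -> MIME mapping, lowercased keys
MEDIAINFO_FORMAT_TO_MIME = {
    'avi': 'video/x-msvideo', 'mov': 'video/quicktime', 'quicktime': 'video/quicktime',
    'mpeg-4': 'video/mp4', 'mp4': 'video/mp4', 'mxf': 'application/mxf',
    'mpeg': 'video/mpeg', 'pcm': 'audio/wav', 'wav': 'audio/wav', 'aiff': 'audio/aiff',
}

def mime_from_general_track(mediainfo: dict, fallback: str = 'application/octet-stream') -> str:
    """Read the 'General' track Format from a mediainfo dict and map it to a MIME type."""
    for track in mediainfo.get('media', {}).get('track', []):
        if track.get('@type') == 'General':
            fmt = track.get('Format', '').lower()
            if fmt in MEDIAINFO_FORMAT_TO_MIME:
                return MEDIAINFO_FORMAT_TO_MIME[fmt]
    return fallback
```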

file_utils.py

@@ -286,6 +286,30 @@ def extract_and_validate_file_info(file_contents, file, ach_variables):
     else:
         logging.info(f"ach_file_fullpath '{basename_fullpath}' matches JSON ffprobe file name '{basename_ffprobe}'.")
 
+    # Check folder prefixes (e.g. FILE/ vs DBT/) match between S3 file and JSON refs
+    def _extract_prefix(path: str) -> str:
+        if not path:
+            return ''
+        # Normalize separators to '/' so we can reliably split on a single character
+        normalized = path.replace('\\', '/').lstrip('/')
+        return normalized.split('/', 1)[0] if '/' in normalized else normalized
+
+    prefix_fullpath = _extract_prefix(ach_variables['file_fullpath'])
+    prefix_mediainfo = _extract_prefix(json_ref_mediainfo_path)
+    prefix_ffprobe = _extract_prefix(json_ref_ffprobe_path)
+    if prefix_fullpath != prefix_mediainfo or prefix_fullpath != prefix_ffprobe:
+        logging.warning(
+            "Prefix mismatch for S3 file '%s': S3 prefix='%s' (fullpath='%s') vs JSON prefixes (mediainfo='%s' [%s], ffprobe='%s' [%s]).",
+            ach_variables.get('file_fullpath'),
+            prefix_fullpath,
+            ach_variables.get('file_fullpath'),
+            prefix_mediainfo,
+            json_ref_mediainfo_path,
+            prefix_ffprobe,
+            json_ref_ffprobe_path,
+        )
+
     if basename_fullpath != basename_mediainfo and basename_fullpath != basename_ffprobe:
         logging.error(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
         raise ValueError(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
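The prefix helper above behaves like this self-contained copy (renamed without the leading underscore so it can run standalone):

```python
def extract_prefix(path: str) -> str:
    """Return the top-level folder of an object key, or '' for empty paths."""
    if not path:
        return ''
    # Normalize separators so Windows-style keys are handled too
    normalized = path.replace('\\', '/').lstrip('/')
    return normalized.split('/', 1)[0] if '/' in normalized else normalized
```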

logging_config.py

@@ -33,8 +33,8 @@ def _create_timed_handler(path: str, level=None, when='midnight', interval=1, ba
     """
     _ensure_dir_for_file(path)
     handler = TimedRotatingFileHandler(path, when=when, interval=interval, backupCount=backupCount, encoding='utf-8')
-    # Use a readable suffix for rotated files (handler will append this after the filename)
-    handler.suffix = "%Y%m%d_%H%M%S"
+    # Default behavior is to append to the existing log file.
+    # Rotation happens when 'when' occurs.
     if fmt:
         handler.setFormatter(fmt)
     if level is not None:
@@ -58,6 +58,8 @@ def setup_logging():
     error_log_path = os.getenv('ERROR_LOG_FILE_PATH', "./logs/ACH_media_import_errors.log")
     warning_log_path = os.getenv('WARNING_LOG_FILE_PATH', "./logs/ACH_media_import_warnings.log")
+    if os.getenv('WARING_LOG_FILE_PATH'):  # Fix typo in .env if present
+        warning_log_path = os.getenv('WARING_LOG_FILE_PATH')
     info_log_path = os.getenv('INFO_LOG_FILE_PATH', "./logs/ACH_media_import_info.log")
 
     # Create three handlers: info (all), warning (warning+), error (error+)
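A minimal standalone sketch of a handler configured as the hunk above describes (midnight rotation, append between rotations); the backup count and format string are illustrative assumptions:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

def create_timed_handler(path: str, level: int = logging.INFO) -> TimedRotatingFileHandler:
    """Rotate at midnight, keep 7 backups, append to the current file between rotations."""
    handler = TimedRotatingFileHandler(path, when='midnight', interval=1,
                                       backupCount=7, encoding='utf-8')
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    return handler
```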

main.py (141 lines changed)

@@ -12,7 +12,7 @@ from error_handler import handle_general_error, handle_file_not_found_error, han
 from file_utils import is_file_empty
 from db_utils import count_files, get_distinct_filenames_from_db
 from dotenv import load_dotenv
-from validation_utils import validate_inventory_code, analyze_pattern_match, validate_icode_extension
+from validation_utils import validate_inventory_code, analyze_pattern_match, validate_icode_extension, list_s3_not_in_db, validate_mp4_file, validate_mp3_file
 import config
 import psycopg2
@@ -33,6 +33,25 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
     logging.info(f"bucket_name: {bucket_name}")
 
+    # SECURITY CHECK: If DRY_RUN is false and ENV is development, ask for confirmation
+    dry_run_env = os.getenv('ACH_DRY_RUN', 'true').lower()
+    ach_env = os.getenv('ACH_ENV', 'development').lower()
+    if dry_run_env == 'false' and ach_env == 'development':
+        print("\n" + "!" * 60)
+        print("!!! SECURITY CHECK: RUNNING IMPORT ON DEVELOPMENT ENVIRONMENT !!!")
+        print(f"DB_HOST: {db_config.get('host')}")
+        print(f"DB_NAME: {db_config.get('database')}")
+        print(f"DB_USER: {db_config.get('user')}")
+        print(f"DB_PORT: {db_config.get('port')}")
+        print("!" * 60 + "\n")
+        user_input = input(f"Please type the DB_NAME '{db_config.get('database')}' to proceed: ")
+        if user_input != db_config.get('database'):
+            print("Action aborted by user. Database name did not match.")
+            logging.error("Process aborted: User failed to confirm DB_NAME for development import.")
+            return
+
     # Ensure timing variables are always defined so later error-email logic
     # won't fail if an exception is raised before end_time/elapsed_time is set.
     start_time = time.time()
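The confirmation gate reduces to two pure checks, sketched here for illustration (the function names are ours, not the repository's):

```python
def needs_confirmation(dry_run: str, env: str) -> bool:
    """A real (non-dry) run against a development environment must be confirmed."""
    return dry_run.strip().lower() == 'false' and env.strip().lower() == 'development'

def confirm_db_name(expected: str, typed: str) -> bool:
    """The operator must retype the exact DB name to proceed."""
    return typed == expected
```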
@@ -44,6 +63,20 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
     try:
         logging.info("Starting the main process...")
 
+        # ---------------------------------------------------------------------
+        # PHASE 1: S3 OBJECT DISCOVERY + INITIAL VALIDATION
+        #
+        # 1) List objects in the configured S3 bucket.
+        # 2) Filter objects by allowed extensions and excluded folders.
+        # 3) Validate the inventory code format (e.g. VA-C01-12345) and ensure the
+        #    folder prefix matches the code type (e.g. "BRD" folder for BRD code).
+        # 4) Reject files that violate naming conventions before any DB interaction.
+        #
+        # This phase is intentionally descriptive so the workflow can be understood
+        # from logs even if the function names are not immediately clear.
+        # ---------------------------------------------------------------------
+        logging.info("PHASE 1: S3 object discovery + initial validation")
+
         # Helper to make spaces visible in filenames for logging (replace ' ' with open-box char)
         def _visible_spaces(name: str) -> str:
             try:
@@ -60,7 +93,9 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
         # Define valid extensions and excluded folders
         valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
         # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'FILE/'}
-        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/'}
+        # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/', 'FILE/', 'DVD/', 'UMT/'}
+        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/', 'UMT/'}
+        # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'TST/',}
         # included_folders = {'FILE/'} # uncomment this to NOT use excluded folders
         # included_folders = {'TEST-FOLDER-DEV/'} # uncomment this to NOT use excluded folders
@@ -98,9 +133,33 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
             # s3_file_names contains the object keys (strings), not dicts.
             base_name = os.path.basename(s3file)
             logging.info(f"S3 Base name: {base_name}")
+
+            # extract folder prefix and media type from inventory code
+            folder_prefix = os.path.dirname(s3file).rstrip('/')
+            media_type_in_code = base_name[3:6] if len(base_name) >= 6 else None
+
+            # Generic sanity check: prefix (folder name) should equal the media type in the code
+            is_valid_prefix = (folder_prefix == media_type_in_code)
+
+            # Special folder allowance rules
+            folder_allowances = {
+                'DVD': ['DVD', 'BRD'],
+                'FILE': ['M4V', 'AVI', 'MOV', 'MP4', 'MXF', 'AIF', 'WMV', 'M4A', 'MPG'],
+            }
+            if folder_prefix in folder_allowances:
+                if media_type_in_code in folder_allowances[folder_prefix]:
+                    is_valid_prefix = True
+
+            if folder_prefix and media_type_in_code and not is_valid_prefix:
+                logging.warning(f"Prefix mismatch for {s3file}: Folder '{folder_prefix}' does not match code type '{media_type_in_code}'")
+                # we only warn here but still proceed with standard validation
+
             if validate_inventory_code(base_name):  # truncated to first 12 char in the function
                 logging.info(f"File {base_name} matches pattern.")
-                # if valid check extension too
-                if not validate_icode_extension(s3file):
-                    logging.warning(f"File {s3file} has invalid extension for its inventory code.")
-                    continue  # skip adding this file to validated contents
+                # only check inventory code extension for media files (.mp4, .mp3)
+                # sidecars (.json, .pdf, .md5) only need their base validated
+                if s3file.lower().endswith(('.mp4', '.mp3')):
+                    if not validate_icode_extension(s3file):
+                        logging.warning(f"File {s3file} has invalid extension for its inventory code.")
+                        continue  # skip adding this file to validated contents
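The allowance rules can be isolated into a small testable predicate; this is a sketch (the committed code inlines the same logic in the loop, and the helper name is ours):

```python
# Folders whose contents may carry a different media type in the inventory code
FOLDER_ALLOWANCES = {
    'DVD': ['DVD', 'BRD'],
    'FILE': ['M4V', 'AVI', 'MOV', 'MP4', 'MXF', 'AIF', 'WMV', 'M4A', 'MPG'],
}

def prefix_matches_code(folder_prefix: str, media_type_in_code: str) -> bool:
    """True when the S3 folder matches the media type at positions 3-6 of the code."""
    if folder_prefix == media_type_in_code:
        return True
    return media_type_in_code in FOLDER_ALLOWANCES.get(folder_prefix, [])
```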
@@ -112,6 +171,15 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
                 folder_name = os.path.dirname(s3file)
                 logging.warning(f"File {s3file} in folder {folder_name} does not match pattern.")
 
+        # ---------------------------------------------------------------------
+        # PHASE 2: DATABASE CROSS-REFERENCE + FILTERING
+        #
+        # 1) Fetch existing filenames from the database.
+        # 2) Skip files already represented in the DB (including sidecar records).
+        # 3) Produce the final list of S3 object keys that should be parsed/inserted.
+        # ---------------------------------------------------------------------
+        logging.info("PHASE 2: Database cross-reference + filtering")
+
         # filter_s3_files_not_in_db
         # --- Get all DB filenames in one call ---
         db_file_names = get_distinct_filenames_from_db()
@@ -138,6 +206,14 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
         total_files = len(filtered_file_names)
         logging.info(f"Total number of the valid (mp3,mp4,md5,json,pdf) files after DB filter: {total_files}")
 
+        # Log the files that need to be updated (those not yet in DB)
+        if total_files > 0:
+            logging.info("List of files to be updated in the database:")
+            for f in filtered_file_names:
+                logging.info(f"  - {f}")
+        else:
+            logging.info("No new files found to update in the database.")
+
         # Count files with .mp4 and .mp3 extensions
         mp4_count = sum(1 for file in s3_file_names if file.endswith('.mp4'))
         mp3_count = sum(1 for file in s3_file_names if file.endswith('.mp3'))
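Phase 2's DB filter is essentially a set difference on base names; a minimal sketch under that assumption (the helper name is ours):

```python
import os

def filter_not_in_db(s3_keys, db_file_names):
    """Keep only S3 objects whose base name is not already recorded in the database."""
    db_set = set(db_file_names)  # O(1) membership tests
    return [key for key in s3_keys if os.path.basename(key) not in db_set]
```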
@ -167,6 +243,10 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
for file in s3_file_names:
if file.endswith('.mp4'):
validate_mp4_file(file) # validation_utils.py - check also _H264 at the end
elif file.endswith('.mp3'):
validate_mp3_file(file) # validation_utils.py
# Count by CODE media type (e.g. OA4, MCC) and log the counts for each type
# If ACH_SAFE_RUN is 'false' we enforce strict mp4/pdf parity and abort
# when mismatched. Default is 'true' which skips this abort to allow
@@ -177,7 +257,7 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
logging.error("Number of .mp4 files is not equal to number of .pdf files")
# MOD 20251103
# add a check to find the missing pdf or mp4 files and report them
# use filtered_file_names to find missing files
# store tuples (source_file, expected_counterpart) for clearer logging
missing_pdfs = [] # list of (mp4_file, expected_pdf)
missing_mp4s = [] # list of (pdf_file, expected_mp4)
@@ -187,7 +267,7 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
base_name = os.path.splitext(file)[0]
# if the mp4 is an H264 variant (e.g. name_H264.mp4) remove the suffix
if base_name.endswith('_H264'):
# must check if has extra number for DBT and DVD and [FILE]
base_name = base_name[:-5]
expected_pdf = base_name + '.pdf'
if expected_pdf not in filtered_file_names:
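The mp4/pdf pairing rule above — strip a trailing `_H264` from the mp4 base name, then expect a sibling `.pdf` — can be sketched as a self-contained helper. The helper name and sample filenames are invented for illustration:

```python
# Sketch of the mp4<->pdf parity check on a flat list of filenames.
# An "_H264" suffix on the mp4 base name is stripped before pairing.
import os

def find_missing_pdfs(file_names):
    missing = []  # list of (mp4_file, expected_pdf)
    names = set(file_names)
    for f in file_names:
        if not f.endswith('.mp4'):
            continue
        base = os.path.splitext(f)[0]
        if base.endswith('_H264'):
            base = base[:-5]  # drop "_H264"
        expected_pdf = base + '.pdf'
        if expected_pdf not in names:
            missing.append((f, expected_pdf))
    return missing

files = ["VO-DVD-11111_H264.mp4", "VO-DVD-11111.pdf", "VO-DVD-22222.mp4"]
print(find_missing_pdfs(files))  # → [('VO-DVD-22222.mp4', 'VO-DVD-22222.pdf')]
```

Recording `(source_file, expected_counterpart)` tuples, as the diff's comment suggests, makes the later log lines self-explanatory.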
@@ -221,13 +301,9 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
if mp3_count + mp4_count != json_count:
logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .json files")
# add check of mp3 + mp4 vs json and md5 file like above for mp4 and pdf
logging.error("Abort Import Process due to missing files")
# search which file doesn't match TODO
raise ValueError("Inconsistent file counts mp3+mp4 vs json")
@@ -237,11 +313,20 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
# search which file doesn't match TODO
raise ValueError("Inconsistent file counts mp3+mp4 vs md5")
# ---------------------------------------------------------------------
# PHASE 3: PARSE & INSERT INTO DATABASE
#
# 1) Process each remaining S3 object and validate its associated metadata.
# 2) Insert new records into the database (unless running in DRY_RUN).
# 3) Report counts of successful uploads, warnings, and errors.
# ---------------------------------------------------------------------
logging.info("PHASE 3: Parse S3 objects and insert new records into the database")
# Try to parse S3 files
try:
# If DRY RUN is set to True, the files will not be uploaded to the database
if os.getenv('ACH_DRY_RUN', 'true') == 'false':
uploaded_files_count, warning_files_count, error_files_count = parse_s3_files(s3_client, filtered_file_names, ach_variables, excluded_folders)
else:
logging.warning("DRY RUN is set to TRUE - No files will be added to the database")
# set the tuples to zero
@@ -258,33 +343,13 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
# connect to database
conn = psycopg2.connect(**db_config)
cur = conn.cursor()
# Use centralized mime types from config
from config import EXTENSION_MIME_MAP, MIME_TYPES
logging.info(f"Mime types for counting files: {MIME_TYPES}")
all_files_on_db = count_files(cur, MIME_TYPES, '*', False)
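This hunk replaces an inline extension-to-MIME dict with the centralized map in `config.py`. The exact contents of `config.py` are not shown in this diff; a plausible shape, consistent with the MIME values used in the surrounding `count_files` calls, would be:

```python
# Hypothetical config.py fragment (assumed, not shown in this diff):
# one source of truth for extension -> MIME mapping.
EXTENSION_MIME_MAP = {
    '.avi': 'video/x-msvideo',
    '.mov': 'video/mov',
    '.wav': 'audio/wav',
    '.mp4': 'video/mp4',
    '.m4v': 'video/mp4',   # shares a MIME value with .mp4
    '.mp3': 'audio/mp3',
    '.mxf': 'application/mxf',
    '.mpg': 'video/mpeg',
}

# Derive the flat MIME list from the map so the two can never drift apart.
MIME_TYPES = sorted(set(EXTENSION_MIME_MAP.values()))
print(MIME_TYPES)
```

Deriving `MIME_TYPES` from the map (rather than maintaining a second literal list) is the main payoff of centralizing: adding one extension updates both structures.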
mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)
@@ -300,18 +365,18 @@ def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
logging.info(f"Number of .mp4 files in the database: {mp4_files_on_db} and S3: {mp4_count}")
# compare the mp4 names with the S3 names, report files missing from either list and print the list
missing_mp4s = [f for f in s3_file_names if f.endswith('.mp4') and f not in db_file_names]
# if missing_mp4s is empty do not emit a warning
if missing_mp4s:
logging.warning(f"Missing {len(missing_mp4s)} .mp4 files in DB compared to S3: {missing_mp4s}")
logging.info(f"Number of .wav files in the database: {wav_files_on_db}")
logging.info(f"Number of .mp3 files in the database: {mp3_files_on_db} and S3: {mp3_count}")
missing_mp3s = [f for f in s3_file_names if f.endswith('.mp3') and f not in db_file_names]
# if missing_mp3s is empty do not emit a warning
if missing_mp3s:
logging.warning(f"Missing {len(missing_mp3s)} .mp3 files in DB compared to S3: {missing_mp3s}")
logging.info(f"Number of .avi files in the database: {avi_files_on_db}")
logging.info(f"Number of .m4v files in the database: {m4v_files_on_db}")

View File

@@ -31,10 +31,16 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
# log ach_variables
logging.info(f"ach_variables: {ach_variables}")
# ---------------------------------------------------------------------
# PHASE 3: PARSE & INSERT INTO DATABASE (DETAILS)
#
# 3.1) Filter out excluded prefixes and keep only files we care about.
# 3.2) Validate each media file alongside its related sidecars (.json, .md5, .pdf).
# 3.3) Cross-check the inventory code in the database and insert new records.
# ---------------------------------------------------------------------
logging.info("PHASE 3: Parse & insert - starting detailed file processing")
try:
logging.info(f"Starting to parse S3 files from bucket {bucket_name}...")
# Ensure db_config is not None
if db_config is None:
raise ValueError("Database configuration is not loaded")
@@ -67,6 +73,9 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
if result:
# logging.warning(f"File {file} already exists in the database.")
warning_files_count += 1
if os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true':
logging.error("ACH_SAFE_RUN=true: aborting Phase 3 due to warnings (file already exists in DB): %s", file)
raise ValueError("ACH_SAFE_RUN=true: aborting due to warnings in Phase 3")
continue
ach_variables['file_fullpath'] = file # is the Object key
@@ -76,12 +85,13 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
ach_variables['objectKeys']['media'] = file
ach_variables['objectKeys']['pdf'] = f"{os.path.splitext(file)[0]}.pdf"
ach_variables['objectKeys']['pdf'] = ach_variables['objectKeys']['pdf'].replace('_H264', '')
from config import EXTENSION_MIME_MAP
if file.endswith('.mp4'):
ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.mov" # remove _H264 is done later
elif file.endswith('.mp3'):
ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.wav"
else:
logging.error(f"Unsupported file type: {file}")
error_files_count += 1
continue
@@ -116,6 +126,9 @@ def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=[]):
else:
logging.warning("Could not retrieve file size for %s.", file)
warning_files_count += 1
if os.getenv('ACH_SAFE_RUN', 'true').lower() == 'true':
logging.error("ACH_SAFE_RUN=true: aborting Phase 3 due to warnings (missing file size): %s", file)
raise ValueError("ACH_SAFE_RUN=true: aborting due to warnings in Phase 3")
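The `ACH_SAFE_RUN` and `ACH_DRY_RUN` checks both read a string-valued environment flag where unset defaults to `'true'`. A minimal sketch of that convention, with a hypothetical helper name (the project reads `os.getenv` inline instead):

```python
# Sketch of the env-flag convention: unset -> the given default,
# and only the case-insensitive literal 'true' counts as enabled.
import os

def flag_enabled(name, default='true'):
    return os.getenv(name, default).lower() == 'true'

os.environ['ACH_SAFE_RUN'] = 'TRUE'
print(flag_enabled('ACH_SAFE_RUN'))   # → True
os.environ['ACH_SAFE_RUN'] = 'false'
print(flag_enabled('ACH_SAFE_RUN'))   # → False
del os.environ['ACH_SAFE_RUN']
print(flag_enabled('ACH_SAFE_RUN'))   # → True (safe-run defaults on)
```

Note the diff itself compares `ACH_DRY_RUN` against `'false'` without lowercasing but lowercases `ACH_SAFE_RUN`; funneling both through one helper would remove that inconsistency.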
continue # Skip to the next file in the loop
logging.info("Start Validating files for %s...", base_name)

View File

@@ -24,8 +24,10 @@ def check_video_info(media_info):
# If the parent directory is 'FILE' accept multiple container types
if parent_dir.lower() == 'file':
# Accept .mov, .avi, .m4v, .mp4, .mxf, .mpg, .mpeg, .wmv (case-insensitive)
video_allowed_extensions = ['.mov', '.avi', '.m4v', '.mp4', '.mxf', '.mpg', '.mpeg', '.wmv']
if not any(file_name.lower().endswith(ext) for ext in video_allowed_extensions):
return False, "The file is not a .mov, .avi, .m4v, .mp4, .mxf, .mpg, .mpeg or .wmv file."
# Map file extensions to lists of acceptable general formats (video)
general_formats = {
@@ -58,18 +60,23 @@ def check_video_info(media_info):
# Outside FILE/ directory require .mov specifically
if not file_name.lower().endswith('.mov'):
return False, "The file is not a .mov file."
# Strict master MOV rule: track[1] must be ProRes
tracks = media_info.get('media', {}).get('track', [])
if len(tracks) <= 1:
return False, "No track 1 found."
track_1 = tracks[1] # track[1] should represent the video stream
logging.info(f"Track 1: {track_1}")
if track_1.get('@type', '') != 'Video':
return False, "Track 1 is not a video track."
if track_1.get('Format', '') != 'ProRes':
return False, "Track 1 format is not ProRes."
if track_1.get('Format_Profile', '') != '4444':
return False, "Track 1 format profile is not 4444."
return True, "The file is a .mov master with ProRes track 1."
return True, "The file passed the video format checks."
except Exception as e:
return False, f"Error processing the content: {e}"
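The strict master-MOV rule above can be exercised on a MediaInfo-style dict. The sketch below isolates just that rule; the helper name and sample data are invented, while the field names (`@type`, `Format`, `Format_Profile`) follow the code above:

```python
# Minimal sketch of the strict ProRes-4444 check on track[1] of a
# MediaInfo JSON structure, mirroring the validation order above.
def is_prores_master(media_info):
    tracks = media_info.get('media', {}).get('track', [])
    if len(tracks) <= 1:
        return False, "No track 1 found."
    t1 = tracks[1]  # track[1] should represent the video stream
    if t1.get('@type', '') != 'Video':
        return False, "Track 1 is not a video track."
    if t1.get('Format', '') != 'ProRes':
        return False, "Track 1 format is not ProRes."
    if t1.get('Format_Profile', '') != '4444':
        return False, "Track 1 format profile is not 4444."
    return True, "OK"

sample = {'media': {'@ref': 'VO-MCC-00001.mov',
                    'track': [{'@type': 'General'},
                              {'@type': 'Video', 'Format': 'ProRes',
                               'Format_Profile': '4444'}]}}
print(is_prores_master(sample))  # → (True, 'OK')
```

Splitting one compound condition into ordered single checks, as this hunk does, means a rejected file gets a message naming the exact field that failed.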
@@ -77,10 +84,32 @@ def check_video_info(media_info):
# result, message = check_audio_info(json_content)
def check_audio_info(media_info):
try:
# Determine source filename (from JSON) and its parent folder
file_name = media_info.get('media', {}).get('@ref', '')
parent_dir = os.path.basename(os.path.dirname(file_name))
logging.info(f"File name in JSON: {file_name}")
# If the file lives under FILE/, allow MP3/WAV/M4A/AIF as valid audio containers
if parent_dir.lower() == 'file':
audio_allowed_extensions = ['.wav', '.mp3', '.m4a', '.aif']
if not any(file_name.lower().endswith(ext) for ext in audio_allowed_extensions):
return False, f"The file is not one of the allowed audio containers: {', '.join(audio_allowed_extensions)}."
# For WAV, do the strict Wave/PCM validation
if file_name.lower().endswith('.wav'):
tracks = media_info.get('media', {}).get('track', [])
if len(tracks) > 1:
track_1 = tracks[1]
if track_1.get('@type', '') == 'Audio' and track_1.get('Format', '') == 'PCM' and track_1.get('SamplingRate', '') == '96000' and track_1.get('BitDepth', '') == '24':
return True, "The file is a .wav file with Wave format in track 1."
else:
return False, f"Track 1 format is not Wave. Format: {track_1.get('Format', '')}, SamplingRate: {track_1.get('SamplingRate', '')}, BitDepth: {track_1.get('BitDepth', '')}"
return False, "No track 1 found."
# For MP3/M4A we accept it without strict Wave validation
return True, "The file is an accepted audio container under FILE/ (mp3/m4a/aif)."
# Outside FILE/ directory require .wav specifically
if not file_name.lower().endswith('.wav'):
return False, "The file is not a .wav file."
# Check if track 1's format is Wave

View File

@@ -110,17 +110,39 @@ def validate_icode_extension(file_inventory_code):
'BTC': r'_\d{4}',
'OA4': r'_\d{2}',
'DVD': r'_\d{2}',
'BRD': r'_\d{2}',
'MCC': r'_[AB]',
'DBT': r'_\d{4}',
'M4V': r'_\d{2}',
'AVI': r'_\d{2}',
'MOV': r'_\d{2}',
'MP4': r'_\d{2}',
'MXF': r'_\d{2}',
'MPG': r'_\d{2}'
}
if not isinstance(file_inventory_code, str) or file_inventory_code == '':
logging.warning("Empty or non-string inventory code provided to validate_icode_extension")
return False
# security: ignore any folder prefix or path components that might be
# passed in from external sources.
file_inventory_code = os.path.basename(file_inventory_code)
# remove common file extensions and only remove _H264 if it's an .mp4 file
if file_inventory_code.lower().endswith('.mp4'):
file_inventory_code = file_inventory_code.replace("_H264", "")
file_inventory_code = os.path.splitext(file_inventory_code)[0]
# Enforce maximum length (12 base + up to 5 extension chars)
if len(file_inventory_code) > 17:
logging.warning("Inventory code '%s' exceeds maximum allowed length (17).", file_inventory_code)
# Only raise the error if DRY RUN is false; otherwise, just log it as a warning
if os.getenv('ACH_DRY_RUN', 'true').lower() == 'false':
raise ValueError("Inventory code with extension exceeds maximum length of 17 characters.")
else:
return False
# Validate base first (first 12 chars). If base invalid -> reject.
base = file_inventory_code[:12]
@@ -132,6 +154,22 @@ def validate_icode_extension(file_inventory_code):
support_type = base[3:6]
extension = file_inventory_code[12:]
if extension == '':
logging.info("Extension for '%s' is empty (valid).", file_inventory_code)
return True
expected_ext_pattern = file_type_to_regex.get(support_type)
if expected_ext_pattern is None:
logging.warning("Unsupported type '%s' for extension validation in '%s'.", support_type, file_inventory_code)
return False
if not re.fullmatch(expected_ext_pattern, extension):
logging.warning("Extension '%s' does not match expected pattern '%s' for type '%s'.", extension, expected_ext_pattern, support_type)
return False
logging.info("Inventory code with extension '%s' is valid.", file_inventory_code)
return True
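The type-specific extension check above can be sketched in isolation. The helper name is invented and the regex table is a subset of the one in this hunk; codes follow the `VO-XXX-NNNNN` shape used by the tests below:

```python
# Sketch of the extension rule: chars 3..5 of the 12-char base select
# a pattern that the trailing extension must fully match.
import re

file_type_to_regex = {
    'BTC': r'_\d{4}',
    'DVD': r'_\d{2}',
    'MCC': r'_[AB]',
}

def extension_ok(code):
    base, ext = code[:12], code[12:]
    if ext == '':
        return True  # a bare 12-char code carries no extension to check
    pattern = file_type_to_regex.get(base[3:6])
    # re.fullmatch rejects partial matches such as '_12' against r'_\d{4}'
    return bool(pattern and re.fullmatch(pattern, ext))

print(extension_ok("VO-BTC-12345_1234"))  # → True
print(extension_ok("VO-MCC-12345_A"))     # → True
print(extension_ok("VO-BTC-12345_12"))    # → False
```

Using `re.fullmatch` rather than `re.match` is what makes `VO-BTC-12345_12` fail: `_12` is a prefix of a valid BTC extension but not a complete one.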
def analyze_extension_pattern(file_inventory_code):
"""Analyze extension and base-length issues for a full inventory code.
@@ -338,11 +376,13 @@ if __name__ == "__main__":
# validate_icode_extension (valid and invalid)
{"name": "validate_icode_extension no ext", "fn": lambda: validate_icode_extension("VO-DVD-12345"), "expect": True},
{"name": "validate_icode_extension BARE prefix", "fn": lambda: validate_icode_extension("BTC/VO-DVD-12345"), "expect": True},
{"name": "validate_icode_extension BTC valid", "fn": lambda: validate_icode_extension("VO-BTC-12345_1234"), "expect": True},
{"name": "validate_icode_extension DVD valid", "fn": lambda: validate_icode_extension("VO-DVD-12345_12"), "expect": True},
{"name": "validate_icode_extension MCC valid", "fn": lambda: validate_icode_extension("VO-MCC-12345_A"), "expect": True},
{"name": "validate_icode_extension unsupported type", "fn": lambda: validate_icode_extension("VO-XYZ-12345_12"), "expect": False},
{"name": "validate_icode_extension too long extension (raises)", "fn": lambda: validate_icode_extension("VO-DVD-12345_12345"), "expect_exception": ValueError},
{"name": "validate_icode_extension prefix too long (raises)", "fn": lambda: validate_icode_extension("XYZ/VO-DVD-12345_12345"), "expect_exception": ValueError},
{"name": "validate_icode_extension BTC invalid pattern", "fn": lambda: validate_icode_extension("VO-BTC-12345_12"), "expect": False},
# mp4/mp3 validators - return lists