Initial release of V02

Add SQL queries for file record analysis and S3 utility functions

- Introduced SQL queries to identify records with specific file types, including H264 variants, non-FILE audio and video files, and non-image digital files.
- Added aggregate queries to count unique base records per file type.
- Implemented S3 utility functions for file operations, including uploading, downloading, and checking file existence.
- Enhanced error handling and logging throughout the S3 file processing workflow.
- Updated requirements.txt with necessary dependencies for S3 and database interactions.
- Created utility functions for media validation, focusing on video and audio file checks.
This commit is contained in:
MSVstudios 2025-11-17 09:02:53 +01:00
commit 74afbad9a8
18 changed files with 2710 additions and 0 deletions

.gitignore

@ -0,0 +1,9 @@
.venv/
.env
.vscode/
__pycache__/
*.pyc
*.pyo
logs/
*.logs
*.log

Dockerfile

@ -0,0 +1,40 @@
# Use the official Python 3.11 image from the Docker Hub
FROM python:3.11-slim
# Set the working directory inside the container
WORKDIR /app
# Copy the requirements file into the container
COPY ./requirements.txt .
# Install the required Python packages
RUN pip install --no-cache-dir -r requirements.txt
# Copy the .env file into the /app directory
# COPY ./.env /app/.env
# Copy the rest of the application code into the container
COPY . .
RUN chmod +x /app/cron_launch.sh
# Install cron
RUN apt-get update && apt-get install -y cron
# Add the cron job
# every 10 minutes
# RUN echo "*/10 * * * * /usr/local/bin/python /app/main.py >> /var/log/cron.log 2>&1" > /etc/cron.d/your_cron_job
# 1 AM
RUN echo "0 1 * * * /bin/bash /app/cron_launch.sh >> /var/log/cron.log 2>&1" > /etc/cron.d/your_cron_job
# Give execution rights on the cron job
RUN chmod 0644 /etc/cron.d/your_cron_job
# Apply the cron job
RUN crontab /etc/cron.d/your_cron_job
# Create the log file to be able to run tail
RUN touch /var/log/cron.log
# Run the command on container startup
CMD cron && tail -f /var/log/cron.log

README.md

@ -0,0 +1,138 @@
# ACH-server-import-media
This repository contains a script to import media files from an S3-compatible bucket into a database. It supports both local execution (virtual environment) and Docker-based deployment via `docker-compose`.
## Contents
- `main.py` - main import script
- `docker-compose.yml` - Docker Compose service for running the importer in a container
- `requirements.txt` - Python dependencies
- `config.py`, `.env` - configuration and environment variables

## Prerequisites
- Docker & Docker Compose (or Docker Desktop)
- Python 3.8+
- Git (optional)

## Quick local setup (virtual environment)

### Linux / macOS
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### Windows (PowerShell)
```powershell
python -m venv .venv
. .venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
## Running locally
1. Ensure your configuration is available (see `config.py` or provide a `.env` file with the environment variables used by the project).
2. Run the script (from the project root):

### Linux / macOS
```bash
python main.py
```
### Windows (PowerShell)
```powershell
& .venv\Scripts\python.exe main.py
```
## Docker Compose
This project includes a `docker-compose.yml` with a service named `app` (container name `ACH_server_media_importer`). The compose file reads environment variables from `.env` and mounts a `logs` named volume.

### Build and run (detached)
```powershell
# Docker Compose v2 syntax (recommended)
# From the repository root
docker compose up -d --build
# OR if your environment uses the v1 binary
# docker-compose up -d --build
```
### Show logs
```powershell
# Follow logs for the 'app' service
docker compose logs -f app
# Or use the container name
docker logs -f ACH_server_media_importer
```
### Stop / start / down
```powershell
# Stop containers
docker compose stop
# Start again
docker compose start
# Take down containers and network
docker compose down
```
### Rebuild when already running
There are two safe, common ways to rebuild a service while its containers are already running:
1) Rebuild in-place and recreate changed containers (recommended for most changes):
```powershell
# Rebuild images and recreate services in the background
docker compose up -d --build
```
This tells Compose to rebuild the image(s) and recreate containers for services whose image or configuration changed.
2) Full clean rebuild (use when you need to remove volumes or ensure a clean state):
```powershell
# Stop and remove containers, networks, and optionally volumes & images, then rebuild
docker compose down --volumes --rmi local
docker compose up -d --build
```
## Notes
- `docker compose up -d --build` will recreate containers for services that need updating; it does not destroy named volumes unless you pass `--volumes` to `down`.
- If you need to execute a shell inside the running container:
```powershell
# run a shell inside the 'app' service
docker compose exec app /bin/sh
# or (if bash is available)
docker compose exec app /bin/bash
```
## Environment and configuration
- Provide sensitive values via a `.env` file (the `docker-compose.yml` already references `.env`).
- Typical variables: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `BUCKET_NAME`, `DB_HOST`, `DB_NAME`, `DB_USER`, `SMTP_SERVER`, etc.
## Troubleshooting
- If Compose fails to pick up code changes, ensure your local Dockerfile `COPY` commands include the source files and that `docker compose up -d --build` is run from the repository root.
- Use `docker compose logs -f app` to inspect runtime errors.

build.sh

@ -0,0 +1 @@
docker-compose down && docker-compose build && docker-compose up -d

config.py

@ -0,0 +1,70 @@
import os
from dotenv import load_dotenv
# import logging
import json


def load_config():
    """
    Loads configuration from environment variables.
    """
    # Load environment variables from .env file
    load_dotenv()
    # Define configuration dictionaries
    aws_config = {
        'aws_access_key_id': os.getenv('AWS_ACCESS_KEY_ID', ''),
        'aws_secret_access_key': os.getenv('AWS_SECRET_ACCESS_KEY', ''),
        'region_name': os.getenv('AWS_REGION', ''),
        'endpoint_url': os.getenv('AWS_ENDPOINT_URL', ''),
    }
    db_config = {
        'host': os.getenv('DB_HOST', ''),
        'database': os.getenv('DB_NAME', ''),
        'user': os.getenv('DB_USER', ''),
        'password': os.getenv('DB_PASSWORD', ''),
        'port': os.getenv('DB_PORT', ''),
    }
    ach_config = {
        'ach_editor_id': int(os.getenv('ACH_EDITOR_ID', '0')),
        'ach_approver_id': int(os.getenv('ACH_APPROVER_ID', '0')),
        'ach_notes': os.getenv('ACH_NOTES', ''),
        'ach_storage_location': json.loads(os.getenv('ACH_STORAGE_LOCATION', '{}')),
        'ach_file_type': json.loads(os.getenv('ACH_FILE_TYPE', '{}')),  # unused
    }
    # Per-run working variables - consider moving these defaults to a separate
    # configuration file (e.g., ach_variables.json)
    ach_variables = {
        'custom_data_in': {},
        'disk_size': 0,
        'media_disk_size': 0,
        'pdf_disk_size': 0,
        'extension': '.',
        'conservative_copy_extension': '.',
        'file_fullpath': '',
        'objectKeys': {
            'media': None,
            'pdf': None,
            'conservative_copy': None,
        },
        'inventory_code': '',
    }
    bucket_name = os.getenv('BUCKET_NAME', 'artchive-dev')
    # Log configuration loading status
    # logging.info(f'AWS config loaded: {aws_config}')
    # logging.info(f'DB config loaded: {db_config}')
    # logging.info(f'ACH config loaded: {ach_config}')
    # logging.info(f'ACH variables loaded: {ach_variables}')
    # logging.info(f'Bucket name loaded: {bucket_name}')
    return aws_config, db_config, ach_config, bucket_name, ach_variables

# Consider using a class for a more structured approach (optional)
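The JSON-valued variables above (`ACH_STORAGE_LOCATION`, `ACH_FILE_TYPE`) must hold valid JSON or `json.loads` raises. A minimal sketch of that parsing path, using `os.environ` directly instead of a `.env` file (the values below are illustrative, not real deployment settings):

```python
import json
import os

# Illustrative values; real deployments set these in .env
os.environ["ACH_STORAGE_LOCATION"] = '{"storage_type": "s3", "storage_location_id": 5}'
os.environ.setdefault("ACH_EDITOR_ID", "42")

# Same pattern as load_config(): getenv with a safe default, then parse
storage_location = json.loads(os.getenv("ACH_STORAGE_LOCATION", "{}"))
editor_id = int(os.getenv("ACH_EDITOR_ID", "0"))

print(storage_location["storage_type"])  # s3
print(editor_id)
```

Note the defaults (`'{}'`, `'0'`) keep the parse from failing when a variable is unset, which is why `load_config()` never crashes on a missing `.env`.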

countfiles.py

@ -0,0 +1,219 @@
# v20251103 - Main script to import media files from S3 to the database
import logging
import time
from datetime import datetime
import pytz
import os
from logging_config import setup_logging, CUSTOM_ERROR_LEVEL
from email_utils import handle_error, send_email_with_attachment
from s3_utils import create_s3_client, list_s3_bucket, parse_s3_files
from error_handler import handle_general_error, handle_file_not_found_error, handle_value_error
from file_utils import is_file_empty
from db_utils import count_files, get_distinct_filenames_from_db
from dotenv import load_dotenv
import config
import psycopg2

load_dotenv()


# MAIN PROCESS
def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
    logging.info(f"bucket_name: {bucket_name}")
    # Ensure timing variables are always defined so later error-email logic
    # won't fail if an exception is raised before end_time/elapsed_time is set.
    start_time = time.time()
    end_time = start_time
    elapsed_time = 0.0
    try:
        logging.info("Starting the main process...")
        # Create the S3 client
        s3_client = create_s3_client(aws_config)
        # List S3 bucket contents
        contents = list_s3_bucket(s3_client, bucket_name)
        # Define valid extensions and excluded folders
        valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'FILE/'}
        # Extract and filter file names
        s3_file_names = [
            content['Key'] for content in contents
            if any(content['Key'].endswith(ext) for ext in valid_extensions) and
            not any(content['Key'].startswith(folder) for folder in excluded_folders)
        ]
        s3_only_mp4_file_names = [
            content['Key'] for content in contents
            if content['Key'].endswith('.mp4') and
            not any(content['Key'].startswith(folder) for folder in excluded_folders)
        ]
        total_file_s3mp4 = len(s3_only_mp4_file_names)
        logging.info(f"Total number of distinct .mp4 files in the S3 bucket before import: {total_file_s3mp4}")
        # filter_s3_files_not_in_db
        # --- Get all DB filenames in one call ---
        db_file_names = get_distinct_filenames_from_db()
        # --- Keep only the keys not yet in the DB ---
        file_names = [f for f in s3_file_names if f not in db_file_names]
        # Log the totals
        total_file_db = len(db_file_names)
        logging.info(f"Total number of distinct files in the database before import: {total_file_db}")
        total_files_s3 = len(s3_file_names)
        logging.info(f"Total number of valid (mp3, mp4, md5, json, pdf) files in the S3 bucket before the DB filter: {total_files_s3}")
        total_files = len(file_names)
        logging.info(f"Total number of valid (mp3, mp4, md5, json, pdf) files after the DB filter: {total_files}")
        # Count files per extension
        mp4_count = sum(1 for file in s3_file_names if file.endswith('.mp4'))
        mp3_count = sum(1 for file in s3_file_names if file.endswith('.mp3'))
        md5_count = sum(1 for file in s3_file_names if file.endswith('.md5'))
        pdf_count = sum(1 for file in s3_file_names if file.endswith('.pdf'))
        json_count = sum(1 for file in s3_file_names if file.endswith('.json'))
        mov_count = sum(1 for file in s3_file_names if file.endswith('.mov'))
        avi_count = sum(1 for file in s3_file_names if file.endswith('.avi'))
        m4v_count = sum(1 for file in s3_file_names if file.endswith('.m4v'))
        # Log the counts
        logging.warning("Number of .mp4 files on S3 bucket (%s): %s", bucket_name, mp4_count)
        logging.warning("Number of .mp3 files on S3 bucket (%s): %s", bucket_name, mp3_count)
        logging.warning("Number of .md5 files on S3 bucket (%s): %s", bucket_name, md5_count)
        logging.warning("Number of .pdf files on S3 bucket (%s): %s", bucket_name, pdf_count)
        logging.warning("Number of .json files on S3 bucket (%s): %s", bucket_name, json_count)
        logging.warning("Number of .mov files on S3 bucket (%s): %s", bucket_name, mov_count)
        if mp4_count != pdf_count:
            logging.error("Number of .mp4 files is not equal to number of .pdf files")
            logging.error("Abort Import Process due to missing files")
            # return
        if mp3_count + mp4_count != json_count:
            logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .json files")
            logging.error("Abort Import Process due to missing files")
            # return
        if mp3_count + mp4_count != md5_count:
            logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .md5 files")
            logging.error("Abort Import Process due to missing files")
            # return
        # Try to parse S3 files
        try:
            # DRY RUN: no files are uploaded to the database
            logging.warning("DRY RUN is set to TRUE - No files will be added to the database")
            # Set the counters to zero
            uploaded_files_count, warning_files_count, error_files_count = (0, 0, 0)
            logging.warning("Total number of files (mp3+mp4) with warnings: %s. (Probably already existing in the DB)", warning_files_count)
            logging.warning("Total number of files with errors: %s", error_files_count)
            logging.warning("Total number of files uploaded: %s", uploaded_files_count)
            logging.warning("All files parsed")
        except Exception as e:
            logging.error(f"An error occurred while parsing S3 files: {e}")
            handle_general_error(e)
        # Check results: connect to the database
        conn = psycopg2.connect(**db_config)
        cur = conn.cursor()
        # All MIME types relevant for counting (see EXTENSION_MIME_MAP in db_utils.py)
        mime_type = [
            'video/x-msvideo',  # .avi
            'video/mov',        # .mov
            'audio/wav',        # .wav
            'video/mp4',        # .mp4, .m4v
            'audio/mp3',        # .mp3
            'application/mxf',  # .mxf
            'video/mpeg',       # .mpg
        ]
        logging.info(f"Mime types for counting files: {mime_type}")
        all_files_on_db = count_files(cur, mime_type, '*', False)
        mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
        mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
        mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)
        avi_files_on_db = count_files(cur, ['video/x-msvideo'], '.avi', False)
        m4v_files_on_db = count_files(cur, ['video/mp4'], '.m4v', False)
        mp4_files_on_db = count_files(cur, ['video/mp4'], '.mp4', False)
        wav_files_on_db = count_files(cur, ['audio/wav'], '.wav', False)
        mp3_files_on_db = count_files(cur, ['audio/mp3'], '.mp3', False)
        # mov + m4v + avi + mxf + mpg
        logging.warning(f"Number of all video files in the database: {all_files_on_db}")
        logging.warning(f"Number of .mov files in the database: {mov_files_on_db} and S3: {mov_count}")
        logging.warning(f"Number of .mp4 files in the database: {mp4_files_on_db} and S3: {mp4_count}")
        # Report the .mp4 keys present on S3 but missing from the database
        missing_mp4s = [f for f in file_names if f.endswith('.mp4') and f not in db_file_names]
        logging.warning(f"Missing .mp4 files in DB compared to S3: {missing_mp4s}")
        logging.warning(f"Number of .wav files in the database: {wav_files_on_db}")
        logging.warning(f"Number of .mp3 files in the database: {mp3_files_on_db} and S3: {mp3_count}")
        logging.warning(f"Number of .avi files in the database: {avi_files_on_db}")
        logging.warning(f"Number of .m4v files in the database: {m4v_files_on_db}")
        logging.warning(f"Number of .mxf files in the database: {mxf_files_on_db}")
        logging.warning(f"Number of .mpg files in the database: {mpg_files_on_db}")
        logging.warning(f"Total valid files on S3 still to import: {total_files}")
        # Time elapsed
        end_time = time.time()  # Record end time
        elapsed_time = end_time - start_time
        logging.warning(f"Processing completed. Time taken: {elapsed_time:.2f} seconds")
    except FileNotFoundError as e:
        # Specific handlers must come before the generic Exception handler,
        # otherwise they are unreachable.
        handle_file_not_found_error(e)
    except ValueError as e:
        handle_value_error(e)
    except Exception as e:
        handle_general_error(e)


if __name__ == "__main__":
    try:
        # Setup logging using standard TimedRotatingFileHandler handlers.
        # Rely on the handler's built-in rotation; don't call doRollover manually.
        logger, rotating_handler, error_handler, warning_handler = setup_logging()
        # Load configuration settings
        aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()
        logging.info("Config loaded, and logging setup")
        # Run the main process
        main_process(aws_config, db_config, ach_config, bucket_name, ach_variables)
    except Exception as e:
        logging.error(f"An error occurred: {e}")
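The eight per-extension `sum(1 for …)` generator expressions above each rescan the whole key list; they can be collapsed into a single pass with `collections.Counter`. A small sketch with hypothetical object keys (not real bucket contents):

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical S3 object keys, shaped like the listing filtered above
keys = [
    "VO-OV1-00001/VO-OV1-00001.mp4",
    "VO-OV1-00001/VO-OV1-00001.pdf",
    "AO-MCC-00002_A/AO-MCC-00002_A.mp3",
    "AO-MCC-00002_A/AO-MCC-00002_A.json",
]

# One pass over the keys instead of one generator expression per extension
counts = Counter(PurePosixPath(k).suffix for k in keys)

print(counts[".mp4"], counts[".mp3"], counts[".pdf"])  # 1 1 1
```

`Counter` also returns 0 for extensions that never occur, so the later comparisons (`mp4_count != pdf_count`, etc.) would work unchanged.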

cron_launch.sh

@ -0,0 +1,12 @@
#!/bin/bash
# Set the working directory
cd /app
# Source the environment variables
set -a
[ -f /app/.env ] && source /app/.env
set +a
# Run the Python script
/usr/local/bin/python /app/main.py >> /var/log/cron.log 2>&1
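The `set -a` / `source` / `set +a` idiom above exports every variable defined in `/app/.env`, so the Python process inherits them. A rough Python equivalent, as a deliberately simplified sketch (it ignores quoting and multi-line values, unlike the real `python-dotenv` the project uses):

```python
import os

def load_env_lines(lines):
    """Very simplified .env loader: one KEY=VALUE per line, no quoting rules."""
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ[key.strip()] = value.strip()

# Illustrative content; a real .env would hold the variables from the README
load_env_lines(["# comment", "BUCKET_NAME=artchive-dev", "DB_PORT=5432"])
print(os.environ["BUCKET_NAME"])  # artchive-dev
```

In the container this step matters because cron runs jobs with a near-empty environment; sourcing `.env` in `cron_launch.sh` restores the configuration before `main.py` starts.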

db_utils.py

@ -0,0 +1,618 @@
""" db utils.py
Description:
This module provides utility functions for interacting with a PostgreSQL database using psycopg2.
It includes functions for checking inventory codes, adding file records and relationships,
retrieving support IDs, and executing queries.
Functions:
check_inventory_in_db(s3_client, cur, base_name):
Checks if the inventory code exists in the database and validates its format.
check_objkey_in_file_db(cur, base_name):
Checks if the object key exists in the database.
add_file_record_and_relationship(s3_client, cur, base_name, ach_variables):
Adds a file record and its relationships to the database and uploads a log file to S3.
add_file_record(cur, editor_id, approver_id, disk_size, file_availability_dict, notes, base_name, extension, storage_location, file_type, custom_data_in=None):
Adds a file record to the database.
add_file_support_relationship(cur, editor_id, approver_id, file_id, support_id, status):
Adds a relationship between a file and a support record in the database.
add_file_relationship(cur, file_a_id, file_b_id, file_relation_dict, custom_data, status, editor_id, approver_id):
Adds a relationship between two files in the database.
retrieve_support_id(cur, inventory_code):
Retrieves the support ID for a given inventory code from the database.
retrieve_digital_file_names(s3_client, cur, base_name, digital_file_name_in):
Retrieves digital file names from the database that match the given base_name.
get_db_connection(db_config):
Establishes a connection to the PostgreSQL database using the provided configuration.
execute_query(conn, query, params=None):
Executes a SQL query on the database.
"""
import psycopg2
from psycopg2 import sql
import logging
from datetime import datetime
import re
from email_utils import handle_error
import json
import os
import config
# Map file extensions (include leading dot) to mime types
EXTENSION_MIME_MAP = {
'.avi': 'video/x-msvideo',
'.mov': 'video/mov',
'.wav': 'audio/wav',
'.mp4': 'video/mp4',
'.m4v': 'video/mp4',
'.mp3': 'audio/mp3',
'.mxf': 'application/mxf',
'.mpg': 'video/mpeg',
}
def get_mime_for_extension(extension: str) -> str:
    """Return the mime type for an extension. Accepts it with or without the leading dot.

    Falls back to 'application/octet-stream' when unknown.
    """
    if not extension:
        return 'application/octet-stream'
    if not extension.startswith('.'):
        extension = f'.{extension}'
    return EXTENSION_MIME_MAP.get(extension.lower(), 'application/octet-stream')
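For illustration, the lookup above behaves as follows. This is a trimmed, self-contained copy of the function with a subset of the map, so it can be run on its own:

```python
# Subset of the EXTENSION_MIME_MAP defined above
EXTENSION_MIME_MAP = {
    '.mp4': 'video/mp4',
    '.m4v': 'video/mp4',
    '.mp3': 'audio/mp3',
}

def get_mime_for_extension(extension: str) -> str:
    if not extension:
        return 'application/octet-stream'
    if not extension.startswith('.'):
        extension = f'.{extension}'
    return EXTENSION_MIME_MAP.get(extension.lower(), 'application/octet-stream')

print(get_mime_for_extension('mp4'))   # video/mp4 (leading dot added)
print(get_mime_for_extension('.M4V'))  # video/mp4 (case-insensitive)
print(get_mime_for_extension('.xyz'))  # application/octet-stream (fallback)
```

The fallback matters downstream: `add_file_record` stores whatever this returns as `file_type`, so an unmapped extension is recorded as `application/octet-stream` rather than failing.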
def get_distinct_filenames_from_db():
    """Retrieve distinct digital file names from the Postgres DB.

    This helper loads the DB configuration via `config.load_config()` and then
    opens a connection with `get_db_connection(db_config)`.
    """
    # load_config() returns (aws_config, db_config, ach_config, bucket_name, ach_variables)
    try:
        _, db_config, _, _, _ = config.load_config()
    except Exception:
        # If config.load_config isn't available or fails, re-raise with a clearer message
        raise RuntimeError("Unable to load DB configuration via config.load_config()")
    conn = get_db_connection(db_config)
    try:
        with conn.cursor() as cur:
            cur.execute(
                """SELECT DISTINCT digital_file_name
                   FROM file
                   WHERE digital_file_name ~ '\\.(mp3|mp4|md5|json|pdf)$';"""
            )
            rows = cur.fetchall()
            # Flatten the list of tuples into a simple set of names
            return {row[0] for row in rows if row[0] is not None}
    finally:
        conn.close()
# Function to check if the inventory code exists in the database
def check_inventory_in_db(s3_client, cur, base_name):
    logging.debug("Executing check_inventory_in_db")
    # Known typologies used in the inventory-code pattern
    media_tipology_A = ['MCC', 'OA4', 'DAT']
    # TODO add other typologies: AVI, M4V, MOV, MP4, MXF, MPG (done 04112025)
    media_tipology_V = [
        'OV1', 'OV2', 'UMT', 'VHS', 'HI8', 'VD8', 'BTC', 'DBT', 'IMX', 'DVD',
        'CDR', 'MDV', 'DVC', 'HDC', 'BRD', 'CDV',
        'AVI', 'M4V', 'MOV', 'MP4', 'MXF', 'MPG',  # added for "file" folders 04112025
    ]
    # List of known mime types (derived from EXTENSION_MIME_MAP)
    mime_type = list({v for v in EXTENSION_MIME_MAP.values()})
    try:
        logging.info(f"SUPPORT TYPOLOGY : {base_name[3:6]}")
        if base_name[3:6] == 'OA4':
            pattern = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}_\d{2}$'  # include the _\d{2} for OA4
            truncated_base_name = base_name[:15]
            logging.info(f"type is OA4: {truncated_base_name}")
        elif base_name[3:6] == 'MCC':
            pattern = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}_[AB]$'  # include the _[AB] for MCC
            truncated_base_name = base_name[:14]
            logging.info(f"type is MCC: {truncated_base_name}")
        else:
            # Check the base_name format with the default regex pattern
            pattern = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$'
            truncated_base_name = base_name[:12]
            logging.info(f"type is default: {truncated_base_name}")
        logging.info(f"Checking inventory code {truncated_base_name} with pattern {pattern}...")
        # Validate the string
        try:
            if not re.match(pattern, truncated_base_name):
                error_message = f"Invalid format for base_name {truncated_base_name}"
                logging.error(error_message)
                raise ValueError(error_message)
            else:
                # Extract the first character and the 3 central characters
                first_char = truncated_base_name[0]
                central_chars = truncated_base_name[3:6]
                # Check the corresponding typology list based on the first character
                if first_char == 'A':
                    if central_chars not in media_tipology_A:
                        logging.error(f"Invalid media typology for base_name {truncated_base_name}")
                        return False, None
                elif first_char == 'V':
                    if central_chars not in media_tipology_V:
                        logging.error(f"Invalid media typology for base_name {truncated_base_name}")
                        return False, None
                else:
                    # Invalid first character
                    logging.error(f"Invalid first character for base_name {truncated_base_name}")
                    return False, None
                logging.info(f"Valid format for base_name {truncated_base_name}")
        except ValueError as e:
            logging.error(f"Caught a ValueError: {e}")
            return False, None
        except Exception as e:
            logging.error(f"Caught an unexpected exception: {e}")
            return False, None
        # Check whether the truncated base_name matches an inventory code in the support table
        check_query = sql.SQL("""
            SELECT 1
            FROM support
            WHERE inventory_code LIKE %s
            LIMIT 1;
        """)
        cur.execute(check_query, (f"{truncated_base_name[:12]}%",))
        result = cur.fetchone()
        if result:
            logging.info(f"Inventory code {truncated_base_name[:12]} found in the database.")
            return True, truncated_base_name
        else:
            logging.info(f"Inventory code {truncated_base_name} not found in the database.")
            handle_error(f"Inventory code {truncated_base_name} not found in the database.")
            # raise ValueError(f"Inventory code {truncated_base_name} not found in the database.")
            return False, None
    except Exception as e:
        logging.error(f"Error checking inventory code {base_name}: {e}")
        raise
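The three regex patterns used above accept inventory codes shaped like the following examples (the codes here are illustrative, not real inventory entries):

```python
import re

# The three inventory-code patterns from check_inventory_in_db
PATTERN_OA4 = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}_\d{2}$'      # e.g. VO-OA4-00001_01
PATTERN_MCC = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}_[AB]$'       # e.g. AC-MCC-00001_A
PATTERN_DEFAULT = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$'        # e.g. VO-OV1-00001

assert re.match(PATTERN_OA4, 'VO-OA4-00001_01')      # OA4: two-digit suffix
assert re.match(PATTERN_MCC, 'AC-MCC-00001_A')       # MCC: side A or B
assert re.match(PATTERN_DEFAULT, 'VO-OV1-00001')     # default 12-char code
assert not re.match(PATTERN_DEFAULT, 'XX-OV1-00001') # first char must be V or A
print("all patterns behave as expected")
```

Note that `base_name[3:6]` reads characters 4-6 of the code (e.g. `OA4` in `VO-OA4-00001_01`), which is why the typology check slices `truncated_base_name[3:6]`.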
# Function to check if the object key exists in the database
def check_objkey_in_file_db(cur, base_name):
    """
    Checks if the base_name matches a digital_file_name in the file table.

    Args:
        cur (cursor): The database cursor.
        base_name (str): The base name to check in the database.

    Returns:
        bool: True if the base_name was found, False otherwise.
    """
    logging.debug("Executing check_objkey_in_file_db")
    try:
        # Check whether the base_name matches a digital_file_name in the file table
        check_query = sql.SQL("""
            SELECT 1
            FROM file
            WHERE digital_file_name LIKE %s
            LIMIT 1;
        """)
        cur.execute(check_query, (f"{base_name}%",))
        result = cur.fetchone()
        if result:
            logging.info(f"Object key {base_name} found in the database.")
            return True
        else:
            logging.info(f"Object key {base_name} not found in the database.")
            return False
    except Exception as e:
        logging.error(f"Error checking object key {base_name}: {e}")
        raise
# Function to add a file record and its relationships to the support record
def add_file_record_and_relationship(s3_client, cur, base_name, ach_variables):
    """
    Adds a file record and its relationships to the database and uploads a log file to S3.

    This function performs the following steps:
    1. Loads configuration from the .env file.
    2. Retrieves the support ID for the given base name.
    3. Adds a new file record for the conservative copy.
    4. Adds a relationship between the new file and the support ID.
    5. If the file extension is .mp4 or .mp3, adds a new MP4/MP3 file record and its relationships.
    6. If the file extension is .mp4 and a PDF exists, adds a PDF file record and its relationship to the master file.
    7. Uploads a log file to S3 if all operations are successful.

    Args:
        s3_client (boto3.client): The S3 client used to upload the log file.
        cur (psycopg2.cursor): The database cursor used to execute SQL queries.
        base_name (str): The base name of the file.
        ach_variables (dict): A dictionary containing the variables and configuration needed for the operation.

    Returns:
        bool: True if the operation is successful, False otherwise.

    Raises:
        Exception: If any error occurs during the operation, it is logged and re-raised.
    """
    # Load the configuration from the .env file
    aws_config, db_config, ach_config, bucket_name, _ = config.load_config()
    editor_id = ach_config['ach_editor_id']
    approver_id = ach_config['ach_approver_id']
    # Append the current date and time to the notes, formatted as: yyyy mm dd HH MM SS
    now_dt = datetime.now()
    date_part = now_dt.strftime('%Y %m %d')
    time_part = now_dt.strftime('%H %M %S')
    notes = f"{ach_config.get('ach_notes', '')} {date_part} {time_part}"
    ach_variables['file_copia_conservativa'] = ach_variables['custom_data_in'].get('mediainfo', {}).get("media", {}).get("@ref", "")
    logging.info(f"ach_variables['file_copia_conservativa']: {ach_variables['file_copia_conservativa']}")
    logging.debug("Executing add_file_record_and_relationship")
    try:
        # Retrieve the support ID for the given base name
        support_id = retrieve_support_id(cur, ach_variables['inventory_code'])
        if support_id:
            logging.info(f"Found support_id {support_id} for base_name {base_name}.")
            # The conservative-copy object key must not contain the _H264 suffix
            ach_variables['objectKeys']['conservative_copy'] = ach_variables['file_copia_conservativa'].replace('_H264', '')
            # Add a new file record for the "copia conservativa" (master)
            file_availability_dict = 7  # Placeholder
            ach_variables['custom_data_in']['media_usage'] = 'master'  # can be "copia conservativa"
            # Determine the master mime type from the file extension
            master_mime_type = get_mime_for_extension(ach_variables.get('extension'))
            new_file_id = add_file_record(
                cur,
                editor_id,
                approver_id,
                ach_variables['disk_size'],
                file_availability_dict,
                notes,
                ach_variables['objectKeys']['conservative_copy'],
                ach_variables['conservative_copy_extension'],
                ach_config['ach_storage_location'],
                master_mime_type,
                ach_variables['custom_data_in']
            )
            if new_file_id:
                logging.info(f"Added file record for {base_name} with file_id {new_file_id}.")
                # Add a relationship between the new file and the support ID
                status = '{"saved": true, "status": "approved"}'
                add_file_support_relationship(cur, editor_id, approver_id, new_file_id, support_id, status)
                # If the file extension is .mp4 or .mp3, add a new MP4/MP3 (streaming) file record
                mime_type = get_mime_for_extension(ach_variables.get('extension'))
                if ach_variables['extension'] in ('.mp4', '.mp3'):
                    file_availability_dict = 8  # Hot Storage
                    mp4_file_id = add_file_record(
                        cur,
                        editor_id,
                        approver_id,
                        ach_variables['media_disk_size'],
                        file_availability_dict,
                        notes,
                        ach_variables['objectKeys']['media'],
                        ach_variables['extension'],  # deprecated
                        {"storage_type": "s3", "storage_location_id": 5},
                        mime_type,
                        {"media_usage": "streaming"}
                    )
                    if mp4_file_id:
                        logging.info(f"Added MP4/MP3 file record for {base_name} with file_id {mp4_file_id}.")
                        # Add a relationship between the MP4 file and the support ID
                        status = '{"saved": true, "status": "approved"}'
                        add_file_support_relationship(cur, editor_id, approver_id, mp4_file_id, support_id, status)
                        # Add a relationship between the streaming file (mp4_file_id) and the master file (new_file_id)
                        file_relation_dict = 10  # Relationship dictionary: "è re-encoding di" master
                        add_file_relationship(cur, new_file_id, mp4_file_id, file_relation_dict, '{}', status, editor_id, approver_id)
                # The .mp4 should also have the QC report in PDF format:
                # add it as documentation of the master file
                if ach_variables['extension'] == '.mp4' and ach_variables['pdf_disk_size'] > 0:
                    file_availability_dict = 8  # Hot Storage
                    pdf_file_id = add_file_record(
                        cur,
                        editor_id,
                        approver_id,
                        ach_variables['pdf_disk_size'],
                        file_availability_dict,
                        notes,
                        ach_variables['objectKeys']['pdf'],
                        '.pdf',
                        {"storage_type": "s3", "storage_location_id": 5},
                        "application/pdf",
                        {"media_usage": "documentation"}
                    )
                    if pdf_file_id:
                        logging.info(f"Added PDF file record for {base_name} with file_id {pdf_file_id}.")
                        # Add a relationship between the PDF file and the master file
                        file_relation_dict = 11  # Relationship dictionary: "è documentazione di" master
                        status = '{"saved": true, "status": "approved"}'
                        add_file_relationship(cur, new_file_id, pdf_file_id, file_relation_dict, '{}', status, editor_id, approver_id)
            # If everything is successful, upload the log file to S3
            # log_file_path = ach_variables['file_fullpath'] + base_name + '.log'
            # log_data = "import successful"
            # log_to_s3(s3_client, bucket_name, log_file_path, log_data)
            return True
        else:
            logging.error(f"No support_id found for base_name {base_name}.")
            return False
    except Exception as e:
        logging.error(f'Error adding file record and relationship: {e}')
        raise
# Function to add a file record
def add_file_record(cur, editor_id, approver_id, disk_size, file_availability_dict, notes, base_name, extension, storage_location, file_type, custom_data_in=None):
try:
cur.execute("""
SELECT public.add_file02(
%s, -- editor_id_in
%s, -- approver_id_in
%s, -- disk_size_in
%s, -- file_availability_dict_in
%s, -- notes_in
%s, -- digital_file_name_in
%s, -- original_file_name_in
%s, -- status_in
%s, -- storage_location_in
%s, -- file_type_in
%s -- custom_data_in
)
""", (
editor_id,
approver_id,
disk_size,
file_availability_dict, # New parameter
notes,
f"{base_name}", # Digital_file_name_in
f"{base_name}", # Original file name
'{"saved": true, "status": "approved"}', # Status
json.dumps(storage_location), # Storage location
json.dumps({"type": file_type}), # File type
json.dumps(custom_data_in) # Custom data
))
# Fetch the result returned by the function
file_result = cur.fetchone()
return file_result[0] if file_result else None
except Exception as e:
logging.error(f'Error adding file record: {e}')
raise e
# Function to add a file support relationship
def add_file_support_relationship(cur, editor_id, approver_id, file_id, support_id, status):
try:
# Call the stored procedure using SQL CALL statement with named parameters
cur.execute("""
CALL public.add_rel_file_support(
editor_id_in := %s,
approver_id_in := %s,
file_id_in := %s,
support_id_in := %s,
status_in := %s
)
""", (
editor_id, # editor_id_in
approver_id, # approver_id_in
file_id, # file_id_in
support_id, # support_id_in
status # status_in
))
# Since the procedure does not return any results, just log a confirmation message
logging.info(f"Added file support relationship for file_id {file_id}.")
except Exception as e:
logging.error(f'Error adding file support relationship: {e}')
raise e
# Function to add file to file relationship
def add_file_relationship(cur, file_a_id, file_b_id, file_relation_dict, custom_data, status, editor_id, approver_id):
try:
# Call the stored procedure using SQL CALL statement with positional parameters
cur.execute("""
CALL public.add_rel_file(
editor_id_in := %s,
approver_id_in := %s,
file_a_id_in := %s,
file_b_id_in := %s,
file_relation_dict_in := %s,
status_in := %s,
custom_data_in := %s
)
""", (
editor_id, # editor_id_in
approver_id, # approver_id_in
file_a_id, # file_a_id_in
file_b_id, # file_b_id_in
file_relation_dict, # file_relation_dict_in
status, # status_in
custom_data # custom_data_in
))
# Since the procedure does not return a result, just log a confirmation message
logging.info(f"Added file relationship between file_id {file_a_id} and file_id {file_b_id}.")
except Exception as e:
logging.error(f'Error adding file relationship: {e}', exc_info=True)
raise e
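The stored-procedure wrappers above all share one shape: a `CALL` statement with named parameters, values supplied positionally through `%s` placeholders. A minimal sketch of that pattern run against a stub cursor instead of a live psycopg2 connection (`StubCursor` and `call_add_rel_file` are illustrative names, not part of this codebase):

```python
# Sketch: exercise the CALL-with-named-parameters pattern used by
# add_file_relationship, recording what would be sent to the database.
class StubCursor:
    def __init__(self):
        self.calls = []

    def execute(self, sql, params=None):
        # Store a whitespace-normalized copy of the statement and its params
        self.calls.append((" ".join(sql.split()), params))

def call_add_rel_file(cur, editor_id, approver_id, file_a_id, file_b_id,
                      file_relation_dict, status, custom_data='{}'):
    cur.execute("""
        CALL public.add_rel_file(
            editor_id_in := %s,
            approver_id_in := %s,
            file_a_id_in := %s,
            file_b_id_in := %s,
            file_relation_dict_in := %s,
            status_in := %s,
            custom_data_in := %s
        )
    """, (editor_id, approver_id, file_a_id, file_b_id,
          file_relation_dict, status, custom_data))

cur = StubCursor()
call_add_rel_file(cur, 1, 2, 100, 101, 10, '{"saved": true, "status": "approved"}')
sql_text, params = cur.calls[0]
```

Letting psycopg2 do the parameter binding (rather than string formatting) keeps the call safe from SQL injection and handles quoting of the JSON status values.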
# Function to retrieve the support ID for a given inventory code
def retrieve_support_id(cur, inventory_code):
try:
cur.execute("""
SELECT MAX(s.id) AS id
FROM support s
WHERE s.inventory_code LIKE %s::text
AND (s.support_type ->> 'type' LIKE 'video' OR s.support_type ->> 'type' LIKE 'audio')
GROUP BY s.h_base_record_id
""", (f"{inventory_code}%",))
support_result = cur.fetchone()
logging.info(f"support_result: {support_result[0] if support_result and support_result[0] else None}")
return support_result[0] if support_result and support_result[0] else None
except Exception as e:
logging.error(f'Error retrieving support_id: {e}')
raise e
# Function to retrieve digital_file_name from the database
def retrieve_digital_file_names(s3_client, cur, base_name, digital_file_name_in):
try:
logging.info(f"Retrieving digital file names for inventory code {base_name}... and digital_file_name {digital_file_name_in}")
# Define the query to retrieve digital_file_name
query = sql.SQL("""
WITH supportId AS (
SELECT s.id
FROM support s
WHERE s.inventory_code LIKE %s
), rel_fs AS (
SELECT rfs.file_id
FROM rel_file_support rfs
WHERE rfs.support_id IN (SELECT id FROM supportId)
)
SELECT f.id AS file_id, f.digital_file_name
FROM file f
WHERE f.id IN (SELECT file_id FROM rel_fs)
OR f.digital_file_name ILIKE %s
""")
# Execute the query
cur.execute(query, (f"{base_name[:12]}", f"{digital_file_name_in}.%"))
results = cur.fetchall()
# Process results (for example, printing)
for result in results:
logging.info(f"File ID: {result[0]}, Digital File Name: {result[1]}")
# Filter results to match the base_name (without extension)
matching_files = []
for row in results:
digital_file_name = row[1] # Second column: digital_file_name
# Extract base name without extension
# base_name_from_file = os.path.splitext(digital_file_name)[0]
base_name_from_file = os.path.splitext(os.path.basename(digital_file_name))[0]
logging.info(f"base_name_from_file: {base_name_from_file}")
# Compare with the provided base_name
if base_name_from_file == os.path.splitext(os.path.basename(digital_file_name_in))[0]:
matching_files.append(digital_file_name)
logging.info(f"Matching digital file names: {matching_files}")
if matching_files:
logging.info(f"Found matching digital file names: {matching_files}; skipping add.")
return False
else:
logging.info(f"No matching digital file names found for inventory code {base_name}; will try to add a new record.")
# The caller is expected to add the new record and its relationships
return True
except Exception as e:
logging.error(f'Error retrieving digital file names for inventory code {base_name}: {e}')
raise e
# Function to get a DB connection
def get_db_connection(db_config):
try:
conn = psycopg2.connect(**db_config)
return conn
except psycopg2.Error as e:
logging.error(f"Error connecting to the database: {e}")
raise e
# Function to execute a query
def execute_query(conn, query, params=None):
try:
with conn.cursor() as cur:
cur.execute(query, params)
conn.commit()
except psycopg2.Error as e:
logging.error(f"Error executing query: {e}")
raise e
# Function to count files in the database
def count_files(cur, mime_type, extension='*', file_dir=True):
"""
Function to count files in the database with various filters
"""
try:
# Base query components
query = (
"SELECT COUNT(DISTINCT h_base_record_id) "
"FROM file "
"WHERE file_type ->> 'type' IS NOT NULL"
)
args = ()
# Handle mime_type list: expand into ARRAY[...] with individual placeholders
if mime_type:
if isinstance(mime_type, (list, tuple)):
# Create placeholders for each mime type and append to args
placeholders = ','.join(['%s'] * len(mime_type))
query += f" AND file_type ->> 'type' = ANY(ARRAY[{placeholders}])"
args += tuple(mime_type)
else:
# single value
query += " AND file_type ->> 'type' = %s"
args += (mime_type,)
# Add extension condition if not wildcard
if extension != '*' and extension is not None:
query += " AND digital_file_name ILIKE %s"
args = args + (f'%{extension}',)
# If extension is '*', no additional condition is needed (matches any extension)
# Add file directory condition based on file_dir parameter
if file_dir:
# Only files in directory (original_file_name starts with 'FILE')
query += " AND original_file_name ILIKE %s"
args = args + ('FILE%',)
else:
# Exclude files in directory (original_file_name does not start with 'FILE')
query += " AND original_file_name NOT ILIKE %s"
args = args + ('FILE%',)
try:
logging.debug("Executing count_files SQL: %s -- args: %s", query, args)
cur.execute(query, args)
result = cur.fetchone()
# fetchone() returns a sequence or None; protect against unexpected empty sequences
if not result:
return 0
try:
return result[0]
except (IndexError, TypeError):
logging.exception("Unexpected result shape from count query: %s", result)
raise
except Exception as e:
logging.exception('Error executing count_files with query: %s', query)
raise
except Exception as e:
logging.error('Error counting files: %s', e)
raise e
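count_files assembles its WHERE clause incrementally, expanding a list of MIME types into one `%s` placeholder per element inside `ANY(ARRAY[...])`. A runnable sketch of just that query-building step, isolated from the database (the helper name is illustrative):

```python
# Sketch of the dynamic query construction in count_files: a MIME-type list
# becomes one %s per element; extension and directory filters append further
# parameterized conditions.
def build_count_query(mime_type, extension='*', file_dir=True):
    query = ("SELECT COUNT(DISTINCT h_base_record_id) FROM file "
             "WHERE file_type ->> 'type' IS NOT NULL")
    args = ()
    if mime_type:
        if isinstance(mime_type, (list, tuple)):
            placeholders = ','.join(['%s'] * len(mime_type))
            query += f" AND file_type ->> 'type' = ANY(ARRAY[{placeholders}])"
            args += tuple(mime_type)
        else:
            query += " AND file_type ->> 'type' = %s"
            args += (mime_type,)
    if extension not in ('*', None):
        query += " AND digital_file_name ILIKE %s"
        args += (f'%{extension}',)
    # file_dir toggles between keeping and excluding 'FILE...' originals
    query += " AND original_file_name " + ("ILIKE %s" if file_dir else "NOT ILIKE %s")
    args += ('FILE%',)
    return query, args

q, a = build_count_query(['video', 'audio'], extension='.mp4')
```

Only the placeholder count is interpolated into the SQL text; every value still travels through the parameter tuple, so the query stays injection-safe.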

26
docker-compose.yml Normal file
View File

@ -0,0 +1,26 @@
services:
app:
build: .
container_name: ACH_server_media_importer
volumes:
- logs:/app/logs # Add this line to map the logs volume
env_file:
- .env
environment:
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
- AWS_REGION=${AWS_REGION}
- AWS_ENDPOINT_URL=${AWS_ENDPOINT_URL}
- BUCKET_NAME=${BUCKET_NAME}
- DB_HOST=${DB_HOST}
- DB_NAME=${DB_NAME}
- DB_USER=${DB_USER}
- SMTP_SERVER=${SMTP_SERVER}
- SMTP_PORT=${SMTP_PORT}
- SMTP_USER=${SMTP_USER}
- SMTP_PASSWORD=${SMTP_PASSWORD}
- SENDER_EMAIL=${SENDER_EMAIL}
- EMAIL_RECIPIENTS=${EMAIL_RECIPIENTS}
restart: unless-stopped
volumes:
logs: # Define the named volume

106
email_utils.py Normal file
View File

@ -0,0 +1,106 @@
import os
import smtplib
import traceback
import logging
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email import encoders
from email.utils import formataddr
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
# Email configuration
SMTP_SERVER = os.getenv('SMTP_SERVER')
SMTP_PORT = os.getenv('SMTP_PORT')
SMTP_USER = os.getenv('SMTP_USER')
SMTP_PASSWORD = os.getenv('SMTP_PASSWORD')
SENDER_EMAIL = os.getenv('SENDER_EMAIL')
# Split env recipient lists safely and strip whitespace; default to empty list
def _split_env_list(varname):
raw = os.getenv(varname, '')
return [s.strip() for s in raw.split(',') if s.strip()]
EMAIL_RECIPIENTS = _split_env_list('EMAIL_RECIPIENTS')
ERROR_EMAIL_RECIPIENTS = _split_env_list('ERROR_EMAIL_RECIPIENTS')
SUCCESS_EMAIL_RECIPIENTS = _split_env_list('SUCCESS_EMAIL_RECIPIENTS')
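_split_env_list tolerates missing variables, stray commas, and surrounding whitespace. A quick self-contained sketch of the same splitting logic run against in-process environment values (variable names here are made up for the demo):

```python
import os

# Same logic as _split_env_list: comma-separated env var to a clean list,
# dropping empty entries and trimming whitespace.
def split_env_list(varname):
    raw = os.getenv(varname, '')
    return [s.strip() for s in raw.split(',') if s.strip()]

os.environ['DEMO_RECIPIENTS'] = ' a@example.com, b@example.com ,,'
recipients = split_env_list('DEMO_RECIPIENTS')
missing = split_env_list('DEMO_UNSET_RECIPIENTS')  # unset var yields []
```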
# Send email with attachment
def send_email_with_attachment(subject, body, attachment_path=None, email_recipients=None):
sender_name = "Art.c.hive Support for ARKIVO"
try:
# Create a multipart message
msg = MIMEMultipart()
msg['From'] = formataddr((sender_name, SENDER_EMAIL))
# If email_recipients is not provided, fall back to EMAIL_RECIPIENTS
if email_recipients:
msg['To'] = ', '.join(email_recipients)
else:
msg['To'] = ', '.join(EMAIL_RECIPIENTS)
msg['Subject'] = subject
# Attach the body with the msg instance
msg.attach(MIMEText(body, 'plain'))
# Attach the file if provided
if attachment_path:
if os.path.exists(attachment_path):
with open(attachment_path, "rb") as attachment:
part = MIMEBase('application', 'octet-stream')
part.set_payload(attachment.read())
encoders.encode_base64(part)
part.add_header('Content-Disposition', f'attachment; filename={os.path.basename(attachment_path)}')
msg.attach(part)
else:
logging.warning(f"Attachment path {attachment_path} does not exist. Skipping attachment.")
# Create SMTP session for sending the mail
recipients_list = email_recipients if email_recipients else EMAIL_RECIPIENTS
if isinstance(recipients_list, str):
recipients_list = [recipients_list]
with smtplib.SMTP(SMTP_SERVER, SMTP_PORT) as server:
server.starttls() # Enable security
server.login(SMTP_USER, SMTP_PASSWORD) # Login with mail_id and password
text = msg.as_string()
server.sendmail(SENDER_EMAIL, recipients_list, text)
logging.info("Email sent successfully.")
except Exception as e:
logging.error(f"Failed to send email: {e}")
# Create the email message
def send_error_email(subject, body, recipients):
try:
sender_email = os.getenv('SENDER_EMAIL')
smtp_server = os.getenv('SMTP_SERVER')
smtp_port = int(os.getenv('SMTP_PORT'))
smtp_user = os.getenv('SMTP_USER')
smtp_password = os.getenv('SMTP_PASSWORD')
msg = MIMEMultipart()
msg['From'] = sender_email
msg['To'] = ", ".join(recipients)
msg['Subject'] = subject
msg.attach(MIMEText(body, 'plain'))
with smtplib.SMTP(smtp_server, smtp_port) as server:
server.starttls()
server.login(smtp_user, smtp_password)
server.sendmail(sender_email, recipients, msg.as_string())
logging.info("Error email sent successfully")
except Exception as e:
logging.error(f"Failed to send error email: {e}")
# Handle error
def handle_error(e):
error_trace = traceback.format_exc()
subject = "Error Notification"
body = f"An error occurred:\n\n{error_trace}"
recipients = os.getenv('EMAIL_RECIPIENTS', '').split(',')
send_error_email(subject, body, recipients)
raise e

22
error_handler.py Normal file
View File

@ -0,0 +1,22 @@
# error_handler.py
import logging
def handle_general_error(e):
logging.error(f'An error occurred during the process: {e}')
# Add any additional error handling logic here
def handle_file_not_found_error(e):
logging.error(f"File not found error: {e}")
# Add any additional error handling logic here
def handle_value_error(e):
logging.error(f"Value error: {e}")
# Add any additional error handling logic here
def handle_error(error_message):
logging.error(f"Error: {error_message}")
class ClientError(Exception):
"""Custom exception class for client errors."""
pass

349
file_utils.py Normal file
View File

@ -0,0 +1,349 @@
import os
import logging
from logging.handlers import RotatingFileHandler
import json
from utils import check_video_info, check_audio_info
from error_handler import handle_error
from botocore.exceptions import ClientError
#from config import load_config, aws_config, bucket_name
import config
def retrieve_file_contents(s3, base_name):
file_contents = {}
# Retrieve the configuration values
# aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()
_, _, _, bucket_name, _ = config.load_config()
try:
# Define the file extensions as pairs
file_extensions = [['json', 'json'], ['md5', 'md5']]
for ext_pair in file_extensions:
file_name = f"{base_name}.{ext_pair[0]}"
try:
response = s3.get_object(Bucket=bucket_name, Key=file_name)
file_contents[ext_pair[1]] = response['Body'].read().decode('utf-8')
logging.info(f"Retrieved {ext_pair[1]} file content for base_name {base_name}.")
except ClientError as e:
# S3 returns a NoSuchKey error code when the key is missing.
code = e.response.get('Error', {}).get('Code', '')
if code in ('NoSuchKey', '404', 'NotFound'):
logging.warning(f"{file_name} not found in S3 (code={code}).")
# treat missing sidecars as non-fatal; continue
continue
else:
logging.error(f"Error retrieving {file_name}: {e}", exc_info=True)
# Re-raise other ClientError types
raise
except Exception as e:
logging.error(f'Error retrieving file contents for {base_name}: {e}', exc_info=True)
# Return empty JSON structure instead of raising to avoid tracebacks in callers
try:
return json.dumps({})
except Exception:
return '{}'
# Clean and format file_contents as proper JSON
try:
cleaned_contents = {}
# Clean the contents
for key, value in file_contents.items():
if isinstance(value, str):
# Remove trailing newlines or any other unwanted characters
cleaned_value = value.strip()
# Attempt to parse JSON
try:
cleaned_contents[key] = json.loads(cleaned_value)
except json.JSONDecodeError:
cleaned_contents[key] = cleaned_value
else:
cleaned_contents[key] = value
# Return the cleaned and formatted JSON
return json.dumps(cleaned_contents, indent=4)
except (TypeError, ValueError) as e:
logging.error(f'Error formatting file contents as JSON: {e}', exc_info=True)
raise e
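The cleaning pass above tries `json.loads` on each string value and falls back to the stripped string when the content is not JSON (the `.md5` sidecar, for example). A self-contained sketch of that step (helper name is illustrative):

```python
import json

# Sketch of the sidecar-cleaning step in retrieve_file_contents: JSON-looking
# strings are parsed into objects, anything else is kept as a stripped string.
def clean_contents(file_contents):
    cleaned = {}
    for key, value in file_contents.items():
        if isinstance(value, str):
            stripped = value.strip()
            try:
                cleaned[key] = json.loads(stripped)
            except json.JSONDecodeError:
                cleaned[key] = stripped
        else:
            cleaned[key] = value
    return cleaned

cleaned = clean_contents({'json': '{"a": 1}\n', 'md5': 'd41d8cd98f00b204\n'})
```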
def check_related_files(s3, file_name_with_path, file, bucket_name):
"""
Check for related files in S3 based on the given file type.
Parameters:
- s3: The S3 client object.
- file_name_with_path: The name of the file with its path.
- file: The file name.
- bucket_name: The name of the S3 bucket.
Returns:
None
Raises:
- FileNotFoundError: If a required file is not found in S3.
- ValueError: If a file has zero size.
- Exception: If an unexpected exception occurs.
"""
from s3_utils import check_file_exists_in_s3, get_file_size # avoid circular import
import config
# Load the configuration from the .env file
# aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()
_, _, _, bucket_name, _ = config.load_config()
ach_pdf_disk_size = 0
# Set required extensions based on the file type
if file.endswith('.mp4'):
required_extensions = ['json', 'md5', 'pdf']
elif file.endswith('.mp3'):
required_extensions = ['json', 'md5']
else:
required_extensions = []
logging.info(f"Required extensions: {required_extensions}")
for ext in required_extensions:
related_file = f"{file_name_with_path}.{ext}"
logging.info(f"Checking for related file: {related_file}")
try:
if not check_file_exists_in_s3(s3, related_file, bucket_name):
error_message = f"Required file {related_file} not found in S3."
logging.error(error_message)
raise FileNotFoundError(error_message)
else:
logging.info(f"Found related file: {related_file}")
except FileNotFoundError as e:
logging.error(f"Caught a FileNotFoundError: {e}")
except Exception as e:
logging.error(f"Caught an unexpected exception: {e}")
# Check the size of the related file
try:
if ext in ['json', 'md5', 'pdf']:
file_size = get_file_size(s3, bucket_name, related_file)
if file_size == 0:
error_message = f"File {related_file} has zero size."
logging.error(error_message)
raise ValueError(error_message)
else:
logging.info(f"File {related_file} size: {file_size}")
except ValueError as e:
logging.error(f"Caught a ValueError file Size is zero: {e}")
raise ValueError(f"File {related_file} has zero size.")
except Exception as e:
logging.error(f"Caught an unexpected exception: {e}")
# If the required file is a .pdf, get its size and update ach_pdf_disk_size
if ext == 'pdf':
pdf_file = f"{file_name_with_path}.pdf"
if check_file_exists_in_s3(s3, pdf_file, bucket_name):
pdf_file_size = get_file_size(s3, bucket_name, pdf_file)
ach_pdf_disk_size = pdf_file_size
logging.info(f"PDF disk size: {ach_pdf_disk_size}")
else:
logging.error(f"PDF file {pdf_file} not found.")
raise FileNotFoundError(f"PDF file {pdf_file} not found.")
return ach_pdf_disk_size
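The sidecar rule above is the heart of check_related_files: each media type must arrive with a fixed set of companion files in S3. The selection logic in isolation (helper name is illustrative):

```python
# Sketch of the sidecar-requirement rule from check_related_files: which
# companion extensions each media type must have alongside it in S3.
def required_sidecars(file_name):
    if file_name.endswith('.mp4'):
        return ['json', 'md5', 'pdf']   # video needs metadata, checksum, QC report
    if file_name.endswith('.mp3'):
        return ['json', 'md5']          # audio needs metadata and checksum only
    return []

mp4 = required_sidecars('VO-ABC-12345.mp4')
mp3 = required_sidecars('AO-XYZ-00001.mp3')
```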
def extract_and_validate_file_info(file_contents, file, ach_variables):
# Load the configuration from the .env file
#aws_config, db_config, ach_config, bucket_name, _ = config.load_config()
# Extract relevant information from nested JSON
ach_custom_data_in = file_contents
# Check if json contain mediainfo metadata or ffprobe metadata or both
logging.info(f"Extracted JSON contents: {ach_custom_data_in['json']}")
# Check for keys at the first level
if 'mediainfo' in ach_custom_data_in['json'] and 'ffprobe' in ach_custom_data_in['json']:
ach_variables['custom_data_in'] = {
"mediainfo": ach_custom_data_in['json'].get('mediainfo', {}),
"ffprobe": ach_custom_data_in['json'].get('ffprobe', {}),
"filename": ach_custom_data_in['json'].get('filename', ''),
"md5": ach_custom_data_in.get('md5', '')
}
logging.info("mediainfo and ffprobe metadata found in JSON file.")
# Check for keys at the second level if it is not already ordered
elif 'creatingLibrary' in ach_custom_data_in['json'] and ach_custom_data_in['json'].get('creatingLibrary', {}).get('name', '') == 'MediaInfoLib':
ach_variables['custom_data_in'] = {
"mediainfo": ach_custom_data_in['json'],
"md5": ach_custom_data_in.get('md5', '')
}
logging.info("mediainfo metadata found in JSON file.")
elif 'streams' in ach_custom_data_in['json']:
ach_variables['custom_data_in'] = {
"ffprobe": ach_custom_data_in['json'],
"md5": ach_custom_data_in.get('md5', '')
}
logging.info("ffprobe metadata found in JSON file.")
else:
ach_variables['custom_data_in'] = {
"md5": ach_custom_data_in.get('md5', '')
}
logging.error(f"No recognized data found in JSON file.{ach_custom_data_in} - {file_contents}")
# Raise an error
raise ValueError("No recognized data found in JSON file.")
logging.info(f"Extracted JSON contents: {ach_variables['custom_data_in']}")
# Extract FileExtension and FileSize if "@type" is "General"
ach_disk_size = None
ach_conservative_copy_extension = None # Avoid a NameError if no General track is found
tracks = ach_variables['custom_data_in'].get('mediainfo', {}).get('media', {}).get('track', [])
for track in tracks:
# Check if @type is "General"
if track.get('@type') == 'General':
# Retrieve the disk size from the General track
ach_disk_size = track.get('FileSize', None)
logging.info(f"Disk size from JSON media.track.General: {ach_disk_size}")
# Retrieve the file extension from the General track
ach_conservative_copy_extension = '.' + track.get('FileExtension', '')
logging.info(f"FileExtension JSON media.track.General: {ach_conservative_copy_extension}")
# Exit loop after finding the General track
break # Exit the loop after finding the General track
# Convert ach_disk_size to an integer if found
if ach_disk_size is not None:
ach_disk_size = int(ach_disk_size)
# MEDIAINFO
if "mediainfo" in ach_variables['custom_data_in'] and "media" in ach_variables['custom_data_in'].get("mediainfo"):
# Extract the media_ref field from the JSON file contents
media_ref = ach_variables['custom_data_in'].get('mediainfo', {}).get("media", {}).get("@ref", "")
# Normalize backslashes in the path
media_ref = media_ref.replace("\\", "/")
logging.info(f"Media ref mediainfo: {media_ref}")
# Split the path using '/' and get the last part (file name)
file_name = media_ref.split('/')[-2] + '/' + media_ref.split('/')[-1]
logging.info(f"Media file name (copia conservativa): {file_name}")
# Update the @ref field with the new file name
ach_variables['custom_data_in']["mediainfo"]["media"]["@ref"] = file_name
logging.info(f"Updated the truncated file_name at mediainfo.media.@ref {ach_variables['custom_data_in']['mediainfo']['media']['@ref']}")
else:
logging.warning(f"mediainfo.media.@ref not found in JSON file.")
# FFPROBE
if "ffprobe" in ach_variables['custom_data_in'] and "format" in ach_variables['custom_data_in'].get("ffprobe"):
# Extract the media_ref field from the JSON file contents
media_ref = ach_variables['custom_data_in'].get('ffprobe', {}).get("format", {}).get("filename", "")
# Normalize backslashes in the path
media_ref = media_ref.replace("\\", "/")
logging.info(f"Media ref ffprobe: {media_ref}")
# Split the path using '/' and get the last part (file name)
file_name = media_ref.split('/')[-2] + '/' + media_ref.split('/')[-1]
logging.info(f"Media file name (copia conservativa): {file_name}")
# Update the @ref field with the new file name
ach_variables['custom_data_in']["ffprobe"]["format"]["filename"] = file_name
logging.info(f"Updated the truncated file_name at ffprobe.format.filename {ach_variables['custom_data_in']['ffprobe']['format']['filename']}")
else:
logging.warning(f"ffprobe.format.filename not found in JSON file.")
logging.info(f"JSON contents: {ach_variables['custom_data_in']}")
# Check if file_contents is a string
if isinstance(ach_variables['custom_data_in'], str):
# Parse the JSON string into a dictionary
ach_custom_data_in = json.loads(ach_variables['custom_data_in'])
else:
# Assume file_contents is already a dictionary
ach_custom_data_in = ach_variables['custom_data_in']
# Check if basename is equal to name in the json file
json_ref_mediainfo_path = ach_custom_data_in.get('mediainfo', {}).get("media", {}).get("@ref", "")
json_ref_ffprobe_path = ach_custom_data_in.get('ffprobe', {}).get("format", {}).get("filename", "")
logging.info(f"JSON file names: mediainfo: '{json_ref_mediainfo_path}', ffprobe: '{json_ref_ffprobe_path}', ach_file_fullpath: '{ach_variables['file_fullpath']}'")
# Extract base names
basename_fullpath = os.path.splitext(os.path.basename(ach_variables['file_fullpath']))[0]
basename_fullpath = basename_fullpath.replace('_H264', '')
basename_mediainfo = os.path.splitext(os.path.basename(json_ref_mediainfo_path))[0]
basename_ffprobe = os.path.splitext(os.path.basename(json_ref_ffprobe_path))[0]
# Check if the basenames are equal
if basename_fullpath != basename_mediainfo:
logging.warning(f"ach_file_fullpath '{basename_fullpath}' does not match JSON mediainfo file name '{basename_mediainfo}'.")
else:
logging.info(f"ach_file_fullpath '{basename_fullpath}' matches JSON mediainfo file name '{basename_mediainfo}'.")
# Check if the basename matches the ffprobe path
if basename_fullpath != basename_ffprobe:
logging.warning(f"ach_file_fullpath '{basename_fullpath}' does not match JSON ffprobe file name '{basename_ffprobe}'.")
else:
logging.info(f"ach_file_fullpath '{basename_fullpath}' matches JSON ffprobe file name '{basename_ffprobe}'.")
if basename_fullpath != basename_mediainfo and basename_fullpath != basename_ffprobe:
logging.error(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
raise ValueError(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
# Check if the file is a video or audio file
try:
if file.endswith('.mp4'):
result, message = check_video_info(ach_custom_data_in.get('mediainfo', {}))
logging.info(f"Validation result for {file}: {message}")
elif file.endswith('.mp3'):
result, message = check_audio_info(ach_custom_data_in.get('mediainfo', {}))
logging.info(f"Validation result for {file}: {message}")
else:
# Handle cases where the file type is not supported
raise ValueError(f"Unsupported file type: {file}")
# Handle the error if validation fails
if not result:
error_message = f"Validation failed for {file}: {message}"
logging.error(error_message)
# handle_error(ValueError(error_message)) # Create and handle the exception
except ValueError as e:
# Handle specific ValueError exceptions
logging.error(f"Caught a ValueError: {e}")
#handle_error(e) # Pass the ValueError to handle_error
except Exception as e:
# Handle any other unexpected exceptions
logging.error(f"Caught an unexpected exception: {e}")
#handle_error(e) # Pass unexpected exceptions to handle_error
# Return the updated ach_custom_data_in dictionary
ach_custom_data_in.pop('filename', None) # Remove 'filename' key if it exists
# logging.info(f"ach_custom_data_in: {json.dumps(ach_custom_data_in, indent=4)}")
return ach_custom_data_in, ach_disk_size, ach_conservative_copy_extension
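The @ref normalization applied in both the mediainfo and ffprobe branches converts backslashes to forward slashes and then keeps only the last directory component plus the file name. A self-contained sketch of that transformation (helper name is illustrative):

```python
# Sketch of the media-reference truncation used above: normalize Windows
# backslashes, then keep only "<last folder>/<file name>".
def truncate_media_ref(media_ref):
    media_ref = media_ref.replace("\\", "/")
    parts = media_ref.split('/')
    return parts[-2] + '/' + parts[-1]

ref = truncate_media_ref(r"D:\archive\VO-ABC-12345\VO-ABC-12345.mp4")
```

Note the `parts[-2]` access assumes the reference contains at least one directory separator; a bare file name would raise an IndexError, matching the behavior of the inline code.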
def is_file_empty(file_path):
return os.path.exists(file_path) and os.path.getsize(file_path) == 0
# unused function
def read_file(file_path):
try:
with open(file_path, 'r') as file:
return file.read()
except FileNotFoundError as e:
logging.error(f"File not found: {e}")
raise e
except IOError as e:
logging.error(f"IO error: {e}")
raise e
def write_file(file_path, content):
try:
with open(file_path, 'w') as file:
file.write(content)
except IOError as e:
logging.error(f"IO error: {e}")
raise e

82
logging_config.py Normal file
View File

@ -0,0 +1,82 @@
import logging
from logging.handlers import TimedRotatingFileHandler
import os
from pathlib import Path
from dotenv import load_dotenv
load_dotenv()
# Custom log level (optional). Keep for backward compatibility.
CUSTOM_ERROR_LEVEL = 35
logging.addLevelName(CUSTOM_ERROR_LEVEL, "CUSTOM_ERROR")
def custom_error(self, message, *args, **kwargs):
"""Helper to log at the custom error level."""
if self.isEnabledFor(CUSTOM_ERROR_LEVEL):
self._log(CUSTOM_ERROR_LEVEL, message, args, **kwargs)
# Attach helper to the Logger class so callers can do: logging.getLogger().custom_error(...)
logging.Logger.custom_error = custom_error
def _ensure_dir_for_file(path: str):
"""Ensure the parent directory for `path` exists."""
Path(path).resolve().parent.mkdir(parents=True, exist_ok=True)
def _create_timed_handler(path: str, level=None, when='midnight', interval=1, backupCount=7, fmt=None):
"""
Create and configure a TimedRotatingFileHandler.
Uses the handler's built-in rotation logic which is more robust and easier
to maintain than a custom doRollover implementation.
"""
_ensure_dir_for_file(path)
handler = TimedRotatingFileHandler(path, when=when, interval=interval, backupCount=backupCount, encoding='utf-8')
# Use a readable suffix for rotated files (handler will append this after the filename)
handler.suffix = "%Y%m%d_%H%M%S"
if fmt:
handler.setFormatter(fmt)
if level is not None:
handler.setLevel(level)
return handler
def setup_logging():
"""
Configure logging for the application and return (logger, info_handler, error_handler, warning_handler).
This version uses standard TimedRotatingFileHandler to keep the logic simple and
avoid fragile file-renaming on Windows.
"""
# Select a format depending on environment for easier debugging in dev
if os.getenv('ACH_ENV') == 'development':
log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s - %(pathname)s:%(lineno)d')
elif os.getenv('ACH_ENV') == 'production':
log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
else:
log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s')
error_log_path = os.getenv('ERROR_LOG_FILE_PATH', "./logs/ACH_media_import_errors.log")
warning_log_path = os.getenv('WARNING_LOG_FILE_PATH', "./logs/ACH_media_import_warnings.log")
info_log_path = os.getenv('INFO_LOG_FILE_PATH', "./logs/ACH_media_import_info.log")
# Create three handlers: info (all), warning (warning+), error (error+)
info_handler = _create_timed_handler(info_log_path, level=logging.INFO, fmt=log_formatter, backupCount=int(os.getenv('LOG_BACKUP_COUNT', '7')))
warning_handler = _create_timed_handler(warning_log_path, level=logging.WARNING, fmt=log_formatter, backupCount=int(os.getenv('LOG_BACKUP_COUNT', '7')))
error_handler = _create_timed_handler(error_log_path, level=logging.ERROR, fmt=log_formatter, backupCount=int(os.getenv('LOG_BACKUP_COUNT', '7')))
console_handler = logging.StreamHandler()
console_handler.setFormatter(log_formatter)
# Configure root logger explicitly
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
# Clear existing handlers to avoid duplicate logs when unit tests or reloads occur
root_logger.handlers = []
root_logger.addHandler(info_handler)
root_logger.addHandler(warning_handler)
root_logger.addHandler(error_handler)
root_logger.addHandler(console_handler)
# Return the root logger and handlers (so callers can do manual rollovers if they truly need to)
return root_logger, info_handler, error_handler, warning_handler
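setup_logging leans on the stock TimedRotatingFileHandler rather than a custom rollover. A minimal, safe-to-run sketch of the same handler setup writing into a temporary directory (paths and logger name are made up for the demo):

```python
import logging
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler

# Minimal sketch of the timed-handler configuration used in setup_logging,
# pointed at a temp dir so it can run anywhere without touching ./logs.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'demo.log')
handler = TimedRotatingFileHandler(path, when='midnight', interval=1,
                                   backupCount=7, encoding='utf-8')
handler.suffix = "%Y%m%d_%H%M%S"  # readable suffix on rotated files
handler.setLevel(logging.INFO)
logger = logging.getLogger('demo_rotation')
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info('hello')
handler.close()
exists = os.path.exists(path)
```

Rotated files will be named like `demo.log.20251117_000000`; `backupCount=7` keeps a week of history when rotating nightly.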

503
main.py Normal file
View File

@ -0,0 +1,503 @@
# v20251103 - Main script to import media files from S3 to the database
import logging
import time
from datetime import datetime
import pytz
import os
import re
from logging_config import setup_logging, CUSTOM_ERROR_LEVEL
from email_utils import handle_error, send_email_with_attachment
from s3_utils import create_s3_client, list_s3_bucket, parse_s3_files
from error_handler import handle_general_error, handle_file_not_found_error, handle_value_error
from file_utils import is_file_empty
from db_utils import count_files, get_distinct_filenames_from_db
from dotenv import load_dotenv
import config
import psycopg2
load_dotenv()
def analyze_pattern_match(text, description):
"""Analyze which part of the 12-char pattern is not matching.
The code currently truncates base/folder names to the first 12 characters and
uses the pattern r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$' which is 12 characters long.
This function therefore validates a 12-character string and avoids
indexing beyond its length.
"""
if not text:
return [f"{description}: Empty or None text"]
issues = []
expected_length = 12 # Pattern: [VA][OC]-[3chars]-[5digits]
# Check length
if len(text) != expected_length:
issues.append(f"Length mismatch: expected {expected_length}, got {len(text)}")
return issues
# Step 1: Check 1st character - V or A
if text[0] not in ['V', 'A']:
issues.append(f"Position 1: Expected [V,A], got '{text[0]}'")
# Step 2: Check 2nd character - O or C
if text[1] not in ['O', 'C']:
issues.append(f"Position 2: Expected [O,C], got '{text[1]}'")
# Step 3: Check 3rd character - dash
if text[2] != '-':
issues.append(f"Position 3: Expected '-', got '{text[2]}'")
# Step 4: Check positions 4,5,6 - [A-Z0-9]
for i in range(3, 6):
if not re.match(r'^[A-Z0-9]$', text[i]):
issues.append(f"Position {i+1}: Expected [A-Z0-9], got '{text[i]}'")
# Step 5: Check 7th character - dash
if text[6] != '-':
issues.append(f"Position 7: Expected '-', got '{text[6]}'")
# Step 6: Check positions 8-12 - digits
for i in range(7, 12):
if not text[i].isdigit():
issues.append(f"Position {i+1}: Expected digit, got '{text[i]}'")
return issues
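The position-by-position diagnostic above mirrors the single inventory-code regex used elsewhere in the importer. Checking candidates against the full pattern directly:

```python
import re

# The 12-character inventory-code pattern that analyze_pattern_match
# diagnoses position by position: [VA][OC]-[A-Z0-9]{3}-[0-9]{5}.
CODE_RE = re.compile(r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$')

ok = bool(CODE_RE.match('VO-ABC-12345'))   # valid video-original code
bad = bool(CODE_RE.match('XO-ABC-12345'))  # first character is not V or A
```

The regex answers only pass/fail; analyze_pattern_match exists to turn a failure into a human-readable list of which positions are wrong.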
# MAIN PROCESS
def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
# import global variables
#from config import load_config, aws_config, db_config, ach_config, bucket_name
#global aws_config, db_config, ach_config, bucket_name
#config import load_config , aws_config, db_config, ach_config, bucket_name
#load_config()
logging.info(f"bucket_name: {bucket_name}")
# Ensure timing variables are always defined so later error-email logic
# won't fail if an exception is raised before end_time/elapsed_time is set.
start_time = time.time()
# IN HUMAN READABLE FORMAT
logging.info(f"Process started at {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time))}")
end_time = start_time
elapsed_time = 0.0
try:
logging.info("Starting the main process...")
# Helper to make spaces visible in filenames for logging (replace ' ' with open-box char)
def _visible_spaces(name: str) -> str:
try:
return name.replace(' ', '\u2423')
except Exception:
return name
# Create the S3 client
s3_client = create_s3_client(aws_config)
# List S3 bucket contents
list_s3_files = list_s3_bucket(s3_client, bucket_name)
# Define valid extensions and excluded folders
valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
# excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'FILE/'}
excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/'}
# included_folders = {'FILE/'} # uncomment this to NOT use excluded folders
# included_folders = {'TEST-FOLDER-DEV/'} # uncomment this to NOT use excluded folders
# Extract and filter file names
# s3_file_names: include only files that match valid extensions and
# (if configured) whose key starts with one of the included_folders.
# We still skip any explicitly excluded_folders. Guard against the
# case where `included_folders` isn't defined to avoid NameError.
try:
use_included = bool(included_folders)
except NameError:
use_included = False
if use_included:
s3_file_names = [
content['Key'] for content in list_s3_files
if any(content['Key'].endswith(ext) for ext in valid_extensions)
and any(content['Key'].startswith(folder) for folder in included_folders)
]
logging.info(f"Using included_folders filter: {included_folders}")
else:
s3_file_names = [
content['Key'] for content in list_s3_files
if any(content['Key'].endswith(ext) for ext in valid_extensions)
and not any(content['Key'].startswith(folder) for folder in excluded_folders)
]
logging.info("Using excluded_folders filter")
# check inventory code syntax
# first check s3_file_names: the file base name (and optionally the folder
# name) must match the inventory-code pattern below
pattern = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$'
contents = []
for s3file in s3_file_names:
# s3_file_names contains the object keys (strings), not dicts.
base_name = os.path.basename(s3file)
# keep only first 12 chars
base_name = base_name[:12]
logging.info(f"Base name: {base_name}")
folder_name = os.path.dirname(s3file)
# keep only first 12 chars of folder name as well
# folder_name = folder_name[:12]
# logging.info(f"Folder name: {folder_name}")
if re.match(pattern, base_name): # and re.match(pattern, folder_name):
logging.info(f"File {base_name} matches pattern.")
contents.append(s3file)
else:
# Check base name
if not re.match(pattern, base_name):
base_issues = analyze_pattern_match(base_name, "Base name")
logging.warning(f"Base name '{base_name}' does not match pattern. Issues: {base_issues}")
logging.warning(f"File {base_name} in folder {folder_name} does not match pattern.")
# filter_s3_files_not_in_db
# --- Get all DB filenames in one call ---
db_file_names = get_distinct_filenames_from_db()
# --- Keep only those not in DB ---
# Additionally, if the DB already contains a sidecar record for the
# same basename (for extensions .md5, .json, .pdf), skip the S3 file
# since the asset is already represented in the DB via those sidecars.
sidecar_exts = ('.md5', '.json', '.pdf')
db_sidecar_basenames = set()
for dbf in db_file_names:
for ext in sidecar_exts:
if dbf.endswith(ext):
db_sidecar_basenames.add(dbf[:-len(ext)])
break
file_names = []
for f in contents:
# exact key present in DB -> skip
if f in db_file_names:
continue
# strip extension to get basename and skip if DB has sidecar for it
base = os.path.splitext(f)[0]
if base in db_sidecar_basenames:
# logging.info("Skipping %s because DB already contains sidecar for basename %s", _visible_spaces(f), _visible_spaces(base))
continue
file_names.append(f)
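The sidecar-skip rule above can be shown on its own; the keys and DB contents below are hypothetical:

```python
import os

# If the DB already holds a sidecar (.md5/.json/.pdf) for a basename, the S3
# file with that basename is considered already represented and is skipped.
db_file_names = {'FILE/VO-ABC-12345.json'}
sidecar_exts = ('.md5', '.json', '.pdf')
db_sidecar_basenames = {
    f[:-len(ext)] for f in db_file_names for ext in sidecar_exts if f.endswith(ext)
}
contents = ['FILE/VO-ABC-12345.mp4', 'FILE/VO-XYZ-00001.mp4']
file_names = [
    f for f in contents
    if f not in db_file_names
    and os.path.splitext(f)[0] not in db_sidecar_basenames
]
assert file_names == ['FILE/VO-XYZ-00001.mp4']
```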
# Print the total number of files
total_files_s3 = len(contents)
logging.info(f"Total number of the valid (mp3,mp4,md5,json,pdf) files in the S3 bucket before DB filter: {total_files_s3}")
total_files = len(file_names)
logging.info(f"Total number of the valid (mp3,mp4,md5,json,pdf) files after DB filter: {total_files}")
# Count files with .mp4 and .mp3 extensions
mp4_count = sum(1 for file in s3_file_names if file.endswith('.mp4'))
mp3_count = sum(1 for file in s3_file_names if file.endswith('.mp3'))
md5_count = sum(1 for file in s3_file_names if file.endswith('.md5'))
pdf_count = sum(1 for file in s3_file_names if file.endswith('.pdf'))
json_count = sum(1 for file in s3_file_names if file.endswith('.json'))
mov_count = sum(1 for file in s3_file_names if file.endswith('.mov'))
# jpg_count = sum(1 for file in file_names if file.endswith('.jpg'))
# file directory
avi_count = sum(1 for file in s3_file_names if file.endswith('.avi'))
m4v_count = sum(1 for file in s3_file_names if file.endswith('.m4v'))
# Log the counts
# Get the logger instance
logger = logging.getLogger()
# Use the logger instance to log custom info
logging.info("Number of .mp4 files on S3 bucket (%s): %s", bucket_name, mp4_count)
logging.info("Number of .mp3 files on S3 bucket (%s): %s", bucket_name, mp3_count)
logging.info("Number of .md5 files on S3 bucket (%s): %s", bucket_name, md5_count)
logging.info("Number of .pdf files on S3 bucket (%s): %s", bucket_name, pdf_count)
logging.info("Number of .json files on S3 bucket (%s): %s", bucket_name, json_count)
logging.info("Number of .mov files on S3 bucket (%s): %s", bucket_name, mov_count)
# logging.info(f"Number of .jpg files: {jpg_count}")
# If ACH_SAFE_RUN is 'true' (the default) we enforce strict mp4/pdf parity
# and abort on mismatch; set it to 'false' to skip these checks during
# testing or manual reconciliation.
if os.getenv('ACH_SAFE_RUN', 'true') == 'true':
if mp4_count != pdf_count:
logging.error("Number of .mp4 files is not equal to number of .pdf files")
# MOD 20251103
# add a check to find the missing pdf or mp4 files and report them
# use file_names to find missing files
# store tuples (source_file, expected_counterpart) for clearer logging
missing_pdfs = [] # list of (mp4_file, expected_pdf)
missing_mp4s = [] # list of (pdf_file, expected_mp4)
for file in file_names:
if file.endswith('.mp4'):
# remove extension
base_name = file[:-4] # keeps any path prefix
# if the mp4 is an H264 variant (e.g. name_H264.mp4) remove the suffix
if base_name.endswith('_H264'):
base_name = base_name[:-5]
expected_pdf = base_name + '.pdf'
if expected_pdf not in file_names:
missing_pdfs.append((file, expected_pdf))
elif file.endswith('.pdf'):
# Normalize base name and accept either the regular mp4 or the _H264 variant.
base_name = file[:-4]
expected_mp4 = base_name + '.mp4'
h264_variant = base_name + '_H264.mp4'
# If neither the regular mp4 nor the H264 variant exists, report missing.
if expected_mp4 not in file_names and h264_variant not in file_names:
missing_mp4s.append((file, expected_mp4))
# report missing files
if missing_pdfs:
logging.error("Missing .pdf files (mp4 -> expected pdf):")
for mp4_file, expected_pdf in missing_pdfs:
logging.error("%s -> %s", _visible_spaces(mp4_file), _visible_spaces(expected_pdf))
if missing_mp4s:
logging.error("Missing .mp4 files (pdf -> expected mp4):")
for pdf_file, expected_mp4 in missing_mp4s:
logging.error("%s -> %s", _visible_spaces(pdf_file), _visible_spaces(expected_mp4))
logging.error("Abort Import Process due to missing files")
raise ValueError("Inconsistent file counts mp4 vs pdf")
if mp3_count + mp4_count != json_count:
logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .json files")
logging.error("Abort Import Process due to missing files")
# TODO: find which files don't match
raise ValueError("Inconsistent file counts mp3+mp4 vs json")
if mp3_count + mp4_count != md5_count:
logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .md5 files")
logging.error("Abort Import Process due to missing files")
# TODO: find which files don't match
raise ValueError("Inconsistent file counts mp3+mp4 vs md5")
# Try to parse S3 files
try:
# If DRY RUN is set to True, the files will not be uploaded to the database
if os.getenv('ACH_DRY_RUN', 'true') == 'false':
uploaded_files_count, warning_files_count, error_files_count = parse_s3_files(s3_client, file_names, ach_variables, excluded_folders)
else:
logging.warning("DRY RUN is set to TRUE - No files will be added to the database")
# set the tuples to zero
uploaded_files_count, warning_files_count, error_files_count = (0, 0, 0)
logging.info("Total number of files (mp3+mp4) with warnings: %s. (Probably already existing in the DB)", warning_files_count)
logging.info("Total number of files with errors: %s", error_files_count)
logging.info("Total number of files uploaded: %s", uploaded_files_count)
logging.info("All files parsed")
except Exception as e:
logging.error(f"An error occurred while parsing S3 files: {e}")
handle_general_error(e)
# Check results
# connect to database
conn = psycopg2.connect(**db_config)
cur = conn.cursor()
# function count_files that are wav and mov in db
# Map file extensions (include leading dot) to mime types
EXTENSION_MIME_MAP = {
'.avi': 'video/x-msvideo',
'.mov': 'video/mov',
'.wav': 'audio/wav',
'.mp4': 'video/mp4',
'.m4v': 'video/mp4',
'.mp3': 'audio/mp3',
'.mxf': 'application/mxf',
'.mpg': 'video/mpeg',
}
# populate mime_type list with all relevant MediaInfo/MIME values
mime_type = [
'video/x-msvideo', # .avi
'video/mov', # .mov
'audio/wav', # .wav
'video/mp4', # .mp4, .m4v
'audio/mp3', # .mp3
'application/mxf', # .mxf
'video/mpeg', # .mpg
]
logging.info(f"Mime types for counting files: {mime_type}")
all_files_on_db = count_files(cur, mime_type,'*', False)
mov_files_on_db = count_files(cur,['video/mov'],'.mov', False )
mxf_files_on_db = count_files(cur,['application/mxf'],'.mxf', False )
mpg_files_on_db = count_files(cur,['video/mpeg'],'.mpg', False )
avi_files_on_db = count_files(cur,['video/x-msvideo'],'.avi', False )
m4v_files_on_db = count_files(cur,['video/mp4'],'.m4v', False )
mp4_files_on_db = count_files(cur,['video/mp4'],'.mp4', False )
wav_files_on_db = count_files(cur,['audio/wav'],'.wav', False )
mp3_files_on_db = count_files(cur,['audio/mp3'],'.mp3', False )
# mov + m4v + avi + mxf + mpg
logging.info(f"Number of all video files in the database: {all_files_on_db}")
logging.info(f"Number of .mov files in the database: {mov_files_on_db} and S3: {mov_count} ")
logging.info(f"Number of .mp4 files in the database: {mp4_files_on_db} and S3: {mp4_count}")
# compare the mp4 name and s3 name and report the missing files in the 2 lists a print the list
missing_mp4s = [f for f in file_names if f.endswith('.mp4') and f not in db_file_names]
# if missing_mp4s empty do not return a warning
if missing_mp4s:
logging.warning(f"Missing .mp4 files in DB compared to S3: {missing_mp4s}")
logging.info(f"Number of .wav files in the database: {wav_files_on_db} ")
logging.info(f"Number of .mp3 files in the database: {mp3_files_on_db} and S3: {mp3_count}")
missing_mp3s = [f for f in file_names if f.endswith('.mp3') and f not in db_file_names]
# if missing_mp3s empty do not return a warning
if missing_mp3s:
logging.warning(f"Missing .mp3 files in DB compared to S3: {missing_mp3s}")
logging.info(f"Number of .avi files in the database: {avi_files_on_db} ")
logging.info(f"Number of .m4v files in the database: {m4v_files_on_db} ")
logging.info(f"Number of .mxf files in the database: {mxf_files_on_db} ")
logging.info(f"Number of .mpg files in the database: {mpg_files_on_db} ")
logging.info(f"Total file in s3 before import {total_files}")
# time elapsed
end_time = time.time() # Record end time
elapsed_time = end_time - start_time
logging.info(f"Processing completed. Time taken: {elapsed_time:.2f} seconds")
except FileNotFoundError as e:
handle_file_not_found_error(e)
except ValueError as e:
handle_value_error(e)
except Exception as e:
handle_general_error(e)
# Send Email with logs if success or failure
# Define the CET timezone
cet = pytz.timezone('CET')
# Helper to rename a log file by appending a timestamp and return the new path.
def _rename_log_if_nonempty(path):
try:
if not path or not os.path.exists(path):
return None
# If file is empty, don't attach/rename it
if os.path.getsize(path) == 0:
return None
dir_name = os.path.dirname(path)
base_name = os.path.splitext(os.path.basename(path))[0]
timestamp = datetime.now(cet).strftime("%Y%m%d_%H%M%S")
new_log_path = os.path.join(dir_name, f"{base_name}_{timestamp}.log")
# Attempt to move/replace atomically where possible
try:
os.replace(path, new_log_path)
except Exception:
# Fallback to rename (may raise on Windows if target exists)
os.rename(path, new_log_path)
return new_log_path
except Exception as e:
logging.error("Failed to rename log %s: %s", path, e)
return None
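The timestamp-rename step performed by `_rename_log_if_nonempty` can be sketched without the empty-file guard, using a throwaway temp directory:

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'ACH_media_import_errors.log')
    with open(path, 'w') as f:
        f.write('ERROR: something went wrong\n')
    base = os.path.splitext(os.path.basename(path))[0]
    stamp = time.strftime('%Y%m%d_%H%M%S')
    new_path = os.path.join(d, f"{base}_{stamp}.log")
    os.replace(path, new_path)  # atomic where the platform allows it
    assert not os.path.exists(path)
    assert os.path.exists(new_path)
```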
# Flush and close log handlers before renaming/moving the log files below.
# Note: logging calls made after logging.shutdown() may not reach the file
# handlers, so log the summary message first.
logging.info("Preparing summary email")
logging.shutdown()
error_log = './logs/ACH_media_import_errors.log'
warning_log = './logs/ACH_media_import_warning.log'
# Determine presence of errors/warnings
has_errors = False
has_warnings = False
try:
if os.path.exists(error_log) and os.path.getsize(error_log) > 0:
with open(error_log, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
if 'ERROR' in content or len(content.strip()) > 0:
has_errors = True
if os.path.exists(warning_log) and os.path.getsize(warning_log) > 0:
with open(warning_log, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
if 'WARNING' in content or len(content.strip()) > 0:
has_warnings = True
except Exception as e:
logging.error("Error while reading log files: %s", e)
# from env - split safely and strip whitespace
def _split_env_list(name):
raw = os.getenv(name, '')
return [s.strip() for s in raw.split(',') if s.strip()]
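`_split_env_list` tolerates extra whitespace and empty entries; shown here on a hypothetical raw env value:

```python
raw = ' a@example.org , b@example.org ,, '
# Same split-and-strip logic as _split_env_list above
recipients = [s.strip() for s in raw.split(',') if s.strip()]
assert recipients == ['a@example.org', 'b@example.org']
```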
EMAIL_RECIPIENTS = _split_env_list('EMAIL_RECIPIENTS')
ERROR_EMAIL_RECIPIENTS = _split_env_list('ERROR_EMAIL_RECIPIENTS') or EMAIL_RECIPIENTS
SUCCESS_EMAIL_RECIPIENTS = _split_env_list('SUCCESS_EMAIL_RECIPIENTS') or EMAIL_RECIPIENTS
# Choose subject and attachment based on severity
if has_errors:
subject = "ARKIVO Import of Video/Audio Ran with Errors"
attachment_to_send = _rename_log_if_nonempty(error_log) or error_log
body = "Please find the attached error log file. Job started at %s and ended at %s, taking %.2f seconds." % (
datetime.fromtimestamp(start_time).strftime('%Y-%m-%d %H:%M:%S'),
datetime.fromtimestamp(end_time).strftime('%Y-%m-%d %H:%M:%S'),
elapsed_time
)
email_recipients=ERROR_EMAIL_RECIPIENTS
elif has_warnings:
subject = "ARKIVO Import of Video/Audio Completed with Warnings"
# Attach the warnings log for investigation
attachment_to_send = _rename_log_if_nonempty(warning_log) or warning_log
body = "The import completed with warnings. Please find the attached warning log. Job started at %s and ended at %s, taking %.2f seconds." % (
datetime.fromtimestamp(start_time).strftime('%Y-%m-%d %H:%M:%S'),
datetime.fromtimestamp(end_time).strftime('%Y-%m-%d %H:%M:%S'),
elapsed_time
)
email_recipients=ERROR_EMAIL_RECIPIENTS
else:
subject = "ARKIVO Video/Audio Import Completed Successfully"
# No attachment for clean success
attachment_to_send = None
body = "The import of media (video/audio) completed successfully without any errors or warnings. Job started at %s and ended at %s, taking %.2f seconds." % (
datetime.fromtimestamp(start_time).strftime('%Y-%m-%d %H:%M:%S'),
datetime.fromtimestamp(end_time).strftime('%Y-%m-%d %H:%M:%S'),
elapsed_time
)
email_recipients=SUCCESS_EMAIL_RECIPIENTS
logging.info("Sending summary email: %s (attach: %s)", subject, bool(attachment_to_send))
# Send email
try:
send_email_with_attachment(
subject=subject,
body=body,
attachment_path=attachment_to_send,
email_recipients=email_recipients
)
except Exception as e:
logging.error("Failed to send summary email: %s", e)
return
if __name__ == "__main__":
try:
# Setup logging using standard TimedRotatingFileHandler handlers.
# No manual doRollover calls — rely on the handler's built-in rotation.
logger, rotating_handler, error_handler, warning_handler = setup_logging()
# Load configuration settings
aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()
logging.info("Config loaded, logging setup done")
# Run the main process
main_process(aws_config, db_config, ach_config, bucket_name, ach_variables)
logging.info("Main process completed at: %s", datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
except Exception as e:
logging.error(f"An error occurred: {e}")

query-sql.md (new file, 82 lines)
### Find records where the original filename indicates an H264 variant
-- Purpose: list distinct base records that have an "original" filename matching the
-- pattern FILE%_H264% (i.e. files stored under the FILE... folder or beginning with
-- "FILE" and containing the "_H264" marker). This helps locate master records that
-- also have an H264-derived file present.
-- Columns returned:
-- base_id : the parent/base record id (h_base_record_id)
-- file_type : the logical file type extracted from the JSON `file_type` column
-- original_file_name: the stored original filename (may include folder/prefix)
-- digital_file_name : the current digital filename in the database
SELECT DISTINCT
h_base_record_id AS base_id,
file_type ->> 'type' AS file_type,
original_file_name,
digital_file_name
FROM file
WHERE file_type ->> 'type' IS NOT NULL
AND original_file_name LIKE 'FILE%_H264%';
### Audio files (mp3) that are not in the FILE/ folder
-- Purpose: find distinct base records for streaming audio (.mp3) where the original
-- filename is not located in the FILE/... area. Useful to separate ingest/original
-- conservative copies (often under FILE/) from streaming or derivative objects.
SELECT DISTINCT
h_base_record_id AS base_id,
file_type ->> 'type' AS file_type
FROM file
WHERE file_type ->> 'type' IS NOT NULL
AND original_file_name NOT LIKE 'FILE%'
AND digital_file_name LIKE '%mp3';
### Video files (mp4) that are not in the FILE/ folder
-- Purpose: same as the mp3 query but for mp4 streaming/derivative files. This helps
-- identify which base records currently have mp4 derivatives recorded outside the
-- FILE/... (master) namespace.
SELECT DISTINCT
h_base_record_id AS base_id,
file_type ->> 'type' AS file_type
FROM file
WHERE file_type ->> 'type' IS NOT NULL
AND original_file_name NOT LIKE 'FILE%'
AND digital_file_name LIKE '%mp4';
### Records with non-image digital files
-- Purpose: list base records that have digital files which are not JPEG images. The
-- `NOT LIKE '%jpg'` filter excludes typical image derivatives; this is useful for
-- auditing non-image assets attached to records.
SELECT DISTINCT
h_base_record_id AS base_id,
file_type ->> 'type' AS file_type
FROM file
WHERE file_type ->> 'type' IS NOT NULL
AND original_file_name NOT LIKE 'FILE%'
AND digital_file_name NOT LIKE '%jpg';
### Count of unique base records per file_type
-- Purpose: aggregate the number of distinct base records (h_base_record_id) associated
-- with each `file_type` value. This gives an overview of how many unique objects have
-- files recorded for each logical file type.
SELECT
file_type ->> 'type' AS file_type,
COUNT(DISTINCT h_base_record_id) AS file_type_unique_record_count
FROM file
WHERE file_type ->> 'type' IS NOT NULL
GROUP BY file_type ->> 'type';
### Duplicate of the previous aggregate (kept for convenience)
-- Note: the query below is identical to the one above and will produce the same
-- counts; it may be intentional for running in a separate context or as a copy-and-paste
-- placeholder for further edits.
SELECT
file_type ->> 'type' AS file_type,
COUNT(DISTINCT h_base_record_id) AS file_type_unique_record_count
FROM file
WHERE file_type ->> 'type' IS NOT NULL
GROUP BY file_type ->> 'type';
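The aggregate above counts distinct base records per file type; the same grouping can be sketched in Python on hypothetical rows:

```python
from collections import defaultdict

# Hypothetical (h_base_record_id, file_type) rows
rows = [(1, 'video'), (1, 'video'), (2, 'video'), (3, 'audio')]
by_type = defaultdict(set)
for base_id, ftype in rows:
    by_type[ftype].add(base_id)  # sets deduplicate -> COUNT(DISTINCT ...)
counts = {t: len(ids) for t, ids in by_type.items()}
assert counts == {'video': 2, 'audio': 1}
```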

requirements.txt (new file, 8 lines)
boto3 >= 1.35.39
botocore >= 1.35.39
psycopg2-binary>=2.9.9
aiohttp >= 3.10.10
# asyncio is part of the Python 3 standard library; the outdated PyPI "asyncio" backport should not be installed
python-dotenv >= 1.0.1
email-validator >= 2.2.0
pytz >= 2024.2

s3_utils.py (new file, 326 lines)
import boto3 # for S3
from botocore.exceptions import NoCredentialsError, PartialCredentialsError, ClientError # for exceptions
import logging # for logging
import json # for json.loads
import os # for os.path
import psycopg2 # for PostgreSQL
# Import custom modules
from file_utils import retrieve_file_contents, check_related_files, extract_and_validate_file_info # for file operations
from email_utils import handle_error # for error handling (deprecated?)
from db_utils import get_db_connection, check_inventory_in_db, check_objkey_in_file_db, add_file_record_and_relationship, retrieve_digital_file_names # for database operations
import config
# Function to check the existence of related files and validate in PostgreSQL
def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=None):
"""
Parses the S3 files and performs various operations on them.
Args:
s3 (S3): The S3 client object for accessing S3 services.
s3_files (list): The list of S3 object keys to be processed.
ach_variables (dict): Mutable working variables shared across the import.
excluded_folders (list): Key prefixes to skip (None avoids a mutable default).
Returns:
tuple: (uploaded_files_count, warning_files_count, error_files_count)
Raises:
FileNotFoundError: If a required file is not found in S3.
ValueError: If a file has zero size or if the file type is unsupported.
Exception: If any other unexpected exception occurs.
"""
if excluded_folders is None:
excluded_folders = []
# Load the configuration from the .env file
_ , db_config, _, bucket_name, _ = config.load_config()
# log ach_variables
logging.info(f"ach_variables: {ach_variables}")
logging.info(f"Starting to parse S3 files from bucket {bucket_name}...")
try:
# Ensure db_config is not None
if db_config is None:
raise ValueError("Database configuration is not loaded")
# return # Exit the function if db_config is None
conn = psycopg2.connect(**db_config)
cur = conn.cursor()
# Filter files with the desired prefix
# excluded_prefix = ['TEST-FOLDER-DEV/', 'DOCUMENTAZIONE_FOTOGRAFICA/', 'BTC/', 'VHS/', 'UMT/', 'OV2/', 'OA4/']
excluded_prefix = excluded_folders
# Exclude files that start with any prefix in the excluded_prefix array
filtered_files = [file for file in s3_files if not any(file.startswith(prefix) for prefix in excluded_prefix)]
# Filter files with the desired prefix
# DEBUG : filtered_files = [file for file in s3_files if file.startswith('TestFolderDev/')]
# Length of filtered files
logging.info(f"Array Length of filtered files: {len(filtered_files)}")
# Counters
error_files_count = 0
warning_files_count = 0
uploaded_files_count = 0
#for file in s3_files:
for file in filtered_files:
if file.endswith(('.mp4', '.mp3')): # Check for both .mp4 and .mp3
logging.info("Processing file: %s in the bucket: %s", file, bucket_name)
# check if file exists in db
result = check_objkey_in_file_db( cur, file)
# Check the result and proceed accordingly
if result:
# logging.warning(f"File {file} already exists in the database.")
warning_files_count +=1
continue
ach_variables['file_fullpath'] = file # is the Object key
ach_variables['inventory_code'] = os.path.splitext(os.path.basename(file))[0][:12]
logging.info(f"ach_variables['inventory_code'] {ach_variables['inventory_code']}: {file}")
# Extract the file extension
ach_variables['objectKeys']['media'] = file
ach_variables['objectKeys']['pdf'] = f"{os.path.splitext(file)[0]}.pdf"
ach_variables['objectKeys']['pdf'] = ach_variables['objectKeys']['pdf'].replace('_H264', '')
if file.endswith('.mp4'):
ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.mov" # remove _H264 is done later
elif file.endswith('.mp3'):
ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.wav"
else:
logging.error(f"Unsupported file type: {file}")
error_files_count +=1
continue
# Extract the file extension
file_extension = os.path.splitext(file)[1]
ach_variables['extension'] = file_extension # Store the file extension in ach_variables
logging.info(f"the file File extension: {file_extension}")
# Extract the file name with directory part
file_name_with_path = os.path.splitext(file)[0] # Remove the extension but keep path
logging.info(f"File name with path: {file_name_with_path}")
# Extract the base name from the file name
base_name = os.path.basename(file_name_with_path) # Extract the base name with path removed
logging.info(f"Base name: {base_name}")
# Apply _H264 removal only for .mp4 files
if file.endswith('.mp4'):
logging.info(f"File is an mp4 file: {file}. remove _H264")
base_name = base_name.replace('_H264', '')
file_name_with_path = file_name_with_path.replace('_H264', '')
logging.info(f"Modified base name for mp4: {base_name}")
logging.info(f"Modified file name with path for mp4: {file_name_with_path}")
try:
# Retrieve and log the file size
file_size = get_file_size(s3, bucket_name, file)
# get_file_size catches S3 errors internally and returns None on failure
if file_size is not None:
ach_variables['media_disk_size'] = file_size
logging.info(f"The media file disk size is: {ach_variables['media_disk_size']}")
else:
logging.warning("Could not retrieve file size for %s.", file)
warning_files_count +=1
continue # Skip to the next file in the loop
logging.info("Start Validating files for %s...", base_name)
# Check if related file exist and retreive .pdf file size
try:
# Check if the required files exist in S3
ach_variables['pdf_disk_size'] = check_related_files(s3, file_name_with_path, file, bucket_name)
logging.info(f"PDF disk size: {ach_variables['pdf_disk_size']}")
except FileNotFoundError as e:
# Handle case where the file is not found
logging.error(f"File not found error: {e}")
error_files_count +=1
continue # Move on to the next file in the loop
except ValueError as e:
# Handle value errors
logging.error(f"Value error: {e} probabli filesize zero")
error_files_count +=1
continue # Move on to the next file in the loop
except PermissionError as e:
# Handle permission errors
logging.error(f"Permission error: {e}")
error_files_count +=1
continue # Move on to the next file in the loop
except Exception as e:
# Handle any other exceptions
logging.error(f"An error occurred: {e}")
# Retrieve the file contents for related files: .md5, .json
try:
# Check if the file exists in S3 and retrieve file contents
logging.info(f"Retrieving file contents for {file_name_with_path}...")
file_contents = retrieve_file_contents(s3, f"{file_name_with_path}")
except Exception as e:
# Log the error
logging.error(f"Error retrieving file contents for {file_name_with_path}: {e}")
file_contents = None # Set file_contents to None or handle it as needed
error_files_count +=1
continue # Move on to the next file in the loop
# if contents dont exists
if file_contents is None:
logging.error(f"Error retrieving file contents for {file}.")
error_files_count +=1
continue # Move on to the next file in the loop
# Ensure file_contents is a dictionary
if isinstance(file_contents, str):
file_contents = json.loads(file_contents)
# Extract and validate file information
ach_variables['custom_data_in'], ach_variables['disk_size'], ach_variables['conservative_copy_extension'] = extract_and_validate_file_info(file_contents, file, ach_variables)
logging.info(f"Custom data extracted: {ach_variables['custom_data_in']}")
logging.info(f"Disk size extracted: {ach_variables['disk_size']}")
logging.info(f"Conservative copy extension extracted: {ach_variables['conservative_copy_extension']}")
logging.info(f"File {file} file validation completed")
except Exception as e:
logging.error(f"Error processing file {file}: {e}")
error_files_count +=1
continue # Move on to the next file in the loop
# no need truncate base name on this point
# base_name = base_name[:12] # Truncate the base name to 12 characters
# ////////////////////////////////////////////////////////////////////////////////
# Check if the base name exists in the database
logging.info(f"Checking database for {base_name}...")
try:
# Call the function to check the inventory code in the database and get the result
result, truncated_base_name = check_inventory_in_db(s3, cur, base_name)
logging.info(f"base name {base_name}, truncated_base_name: {truncated_base_name}")
# Check the result and proceed accordingly
if result:
logging.info(f"Inventory code {base_name} found in the database.")
# Call the function to retrieve digital file names
if retrieve_digital_file_names(s3, cur, base_name, ach_variables['objectKeys']['media']) == True:
# Call the function to add a file record and its relationship to the support record
# logging.info(f"ach_variables: {ach_variables}")
add_file_record_and_relationship(s3, cur, base_name, ach_variables)
else:
logging.warning(f"File record already exists for {base_name}.")
warning_files_count +=1
continue
else:
logging.error(f"Inventory code {base_name} not found in the database.")
error_files_count +=1
continue
except ValueError as e:
logging.error(f"An error occurred: {e}")
error_files_count +=1
continue
# Commit the changes to the database
logging.info(f"commit to databse {base_name}...")
# Commit to the database (conn, cur) only if everything is okay; otherwise, perform a rollback.
conn.commit()
uploaded_files_count +=1
cur.close()
conn.close()
except ValueError as e:
# Handle specific validation errors
logging.error(f"Validation error: {e}")
#handle_error(e) # Pass the ValueError to handle_error
raise e # Raise the exception to the calling function
except Exception as e:
# Handle any other unexpected errors
logging.error(f"Unexpected error: {e}")
#handle_error(e) # Pass unexpected errors to handle_error
raise e # Raise the exception to the calling function
# return the file saved
return uploaded_files_count, warning_files_count, error_files_count
# Function to create an S3 client
def create_s3_client(aws_config):
logging.info(f'Creating S3 client with endpoint: {aws_config["endpoint_url"]}')
try:
s3 = boto3.client(
's3',
endpoint_url=aws_config['endpoint_url'],
aws_access_key_id=aws_config['aws_access_key_id'],
aws_secret_access_key=aws_config['aws_secret_access_key'],
region_name=aws_config['region_name'],
config=boto3.session.Config(
signature_version='s3v4',
s3={'addressing_style': 'path'}
)
)
logging.info('S3 client created successfully')
return s3
except (NoCredentialsError, PartialCredentialsError) as e:
logging.error(f'Error creating S3 client: {e}')
raise e
# Function to list the contents of an S3 bucket
def list_s3_bucket(s3_client, bucket_name):
try:
paginator = s3_client.get_paginator('list_objects_v2')
bucket_contents = []
for page in paginator.paginate(Bucket=bucket_name):
if 'Contents' in page:
bucket_contents.extend(page['Contents'])
logging.info(f"Retrieved {len(bucket_contents)} items from the bucket.")
return bucket_contents
except ClientError as e:
logging.error(f'Error listing bucket contents: {e}')
raise e
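Flattening the paginated `list_objects_v2` responses works as sketched below, with hypothetical page dicts standing in for the paginator output:

```python
# Empty pages from S3 carry no 'Contents' key, so it must be checked.
pages = [{'Contents': [{'Key': 'a.mp4'}]}, {}, {'Contents': [{'Key': 'b.mp3'}]}]
bucket_contents = []
for page in pages:
    if 'Contents' in page:
        bucket_contents.extend(page['Contents'])
assert [c['Key'] for c in bucket_contents] == ['a.mp4', 'b.mp3']
```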
# Function to get file size from S3
def get_file_size(s3_client, bucket_name, file_key):
try:
response = s3_client.head_object(Bucket=bucket_name, Key=file_key)
return response['ContentLength']
except ClientError as e:
logging.error(f"Failed to retrieve file size for {file_key}: {e}")
return None # or an appropriate fallback value
except Exception as e:
logging.error(f"An unexpected error occurred: {e}")
return None
# Function to check if a file exists in S3
def check_file_exists_in_s3(s3, file_name, bucket_name):
"""
Checks if a file exists in an S3 bucket.
Parameters:
- s3 (boto3.client): The S3 client object.
- file_name (str): The object key to check.
- bucket_name (str): The bucket to look in.
Returns:
- bool: True if the file exists, False otherwise.
Raises:
- ClientError: If there is an error other than a 404 while checking the file.
"""
try:
s3.head_object(Bucket=bucket_name, Key=file_name)
return True
except ClientError as e:
if e.response['Error']['Code'] == '404':
return False
else:
logging.error(f'Error checking file {file_name}: {e}')
raise e
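The function above treats an HTTP 404 from `head_object` as "file absent" and re-raises anything else. The branching can be exercised without boto3 using a stand-in exception that mimics the `.response` layout of botocore's `ClientError` (`FakeClientError`, `FakeS3`, and `exists` are illustrative names):

```python
class FakeClientError(Exception):
    """Stand-in for botocore's ClientError with the same .response layout."""
    def __init__(self, code):
        self.response = {'Error': {'Code': code}}

class FakeS3:
    def __init__(self, keys):
        self._keys = set(keys)
    def head_object(self, Bucket, Key):
        if Key not in self._keys:
            raise FakeClientError('404')
        return {}

def exists(s3, bucket, key):
    # Same branching as check_file_exists_in_s3 above.
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except FakeClientError as e:
        if e.response['Error']['Code'] == '404':
            return False
        raise

s3 = FakeS3({'present.mov'})
assert exists(s3, 'bucket', 'present.mov') is True
assert exists(s3, 'bucket', 'missing.mov') is False
```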
# Function to upload a file to S3
def upload_file_to_s3(s3_client, file_path, bucket_name, object_name=None):
    if object_name is None:
        # Default: the full local path becomes the S3 key; pass object_name
        # explicitly (e.g. os.path.basename(file_path)) for a flat key.
        object_name = file_path
try:
s3_client.upload_file(file_path, bucket_name, object_name)
logging.info(f"File {file_path} uploaded to {bucket_name}/{object_name}")
except ClientError as e:
logging.error(f'Error uploading file {file_path} to bucket {bucket_name}: {e}')
raise e
# Function to download a file from S3
def download_file_from_s3(s3_client, bucket_name, object_name, file_path):
try:
s3_client.download_file(bucket_name, object_name, file_path)
logging.info(f"File {object_name} downloaded from {bucket_name} to {file_path}")
except ClientError as e:
logging.error(f'Error downloading file {object_name} from bucket {bucket_name}: {e}')
raise e

99
utils.py Normal file

@ -0,0 +1,99 @@
# Description: Utility functions for the media validation service
# Art.c.hive 2024/09/30
# Standard libs
import logging
import os
# import logging_config
# Usage: result, message = check_video_info(video_json_content)
def check_video_info(media_info):
logging.info("Checking video info...")
logging.info(f"Media info: {media_info}")
try:
#('mediainfo', {}).get('media', {}).get('track', [])
# Check if the file name ends with .mov
file_name = media_info.get('media', {}).get('@ref', '')
logging.info(f"File name in JSON: {file_name}")
# Determine the parent directory (one level above the basename).
# Example: for 'SOME/FOLDER/filename.mov' -> parent_dir == 'FOLDER'
parent_dir = os.path.basename(os.path.dirname(file_name))
logging.info(f"Parent directory: {parent_dir}")
# If the parent directory is 'FILE' accept multiple container types
if parent_dir.lower() == 'file':
# Accept .mov, .avi, .m4v, .mp4, .mxf, .mpg (case-insensitive)
if not any(file_name.lower().endswith(ext) for ext in ('.mov', '.avi', '.m4v', '.mp4', '.mxf', '.mpg', '.mpeg')):
return False, "The file is not a .mov, .avi, .m4v, .mp4, .mxf, .mpg or .mpeg file."
# Map file extensions to lists of acceptable general formats (video)
general_formats = {
'.avi': ['AVI'], # General/Format for AVI files
'.mov': ['QuickTime', 'MOV', 'MPEG-4'], # MediaInfo may report QuickTime or MOV for .mov
                '.mp4': ['MPEG-4', 'MP4', 'QuickTime'],  # MPEG-4 container; MediaInfo may report QuickTime for some .mp4 files (e.g. VO-MP4-16028_H264.mp4)
'.m4v': ['MPEG-4', 'MP4'], # MPEG-4 container (Apple variant)
'.mxf': ['MXF'], # Material eXchange Format
'.mpg': ['MPEG','MPEG-PS'], # MPEG program/transport streams
'.mpeg': ['MPEG','MPEG-PS'],
}
            # Check that the extension corresponds to one of the allowed General formats in track 0 of the corresponding JSON file
file_ext = os.path.splitext(file_name)[1].lower()
logging.info(f"File extension: {file_ext}")
expected_formats = general_formats.get(file_ext)
logging.info(f"Expected formats for extension {file_ext}: {expected_formats}")
if not expected_formats:
return False, f"Unsupported file extension: {file_ext}"
tracks = media_info.get('media', {}).get('track', [])
if len(tracks) > 0:
track_0 = tracks[0] # Assuming track 0 is the first element (index 0)
logging.info(f"Track 0: {track_0}")
actual_format = track_0.get('Format', '')
if track_0.get('@type', '') == 'General' and actual_format in expected_formats:
logging.info(f"File extension {file_ext} matches one of the expected formats {expected_formats} (actual: {actual_format}).")
else:
return False, f"Track 0 format '{actual_format}' does not match any expected formats {expected_formats} for extension {file_ext}."
else:
# Outside FILE/ directory require .mov specifically
if not file_name.lower().endswith('.mov'):
return False, "The file is not a .mov file."
# Check if track 1's format is ProRes
tracks = media_info.get('media', {}).get('track', [])
if len(tracks) > 1:
track_1 = tracks[1] # Assuming track 1 is the second element (index 1)
logging.info(f"Track 1: {track_1}")
                if track_1.get('@type', '') == 'Video' and track_1.get('Format', '') == 'ProRes' and track_1.get('Format_Profile', '') == '4444':
                    return True, "The file is a .mov file with ProRes 4444 video in track 1."
                else:
                    return False, f"Track 1 is not ProRes 4444 video. Format: {track_1.get('Format', '')}, Format_Profile: {track_1.get('Format_Profile', '')}"
else:
return False, "No track 1 found."
return True, "The file passed the video format checks."
except Exception as e:
return False, f"Error processing the content: {e}"
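The branching above can be traced with a hand-built dict in the mediainfo JSON shape the function consumes (`@ref`, `track`, `@type`, `Format`, `Format_Profile` follow the fields read by the code; the path and values in this sample are invented). Outside a `FILE/` parent directory, the strict branch requires a `.mov` container with ProRes 4444 in track 1:

```python
import os

# Invented sample in the mediainfo JSON shape used by check_video_info.
prores_master = {
    'media': {
        '@ref': 'SOME/MASTER/clip.mov',
        'track': [
            {'@type': 'General', 'Format': 'QuickTime'},
            {'@type': 'Video', 'Format': 'ProRes', 'Format_Profile': '4444'},
        ],
    }
}

file_name = prores_master['media']['@ref']
parent_dir = os.path.basename(os.path.dirname(file_name))
assert parent_dir == 'MASTER'          # not 'FILE' -> strict .mov/ProRes branch
assert file_name.lower().endswith('.mov')
track_1 = prores_master['media']['track'][1]
assert (track_1['@type'], track_1['Format'], track_1['Format_Profile']) == ('Video', 'ProRes', '4444')
```

A file under a `FILE/` parent directory would instead be checked against the `general_formats` extension-to-format mapping.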
# result, message = check_audio_info(json_content)
def check_audio_info(media_info):
try:
# Check if the file name ends with .wav
file_name = media_info.get('media', {}).get('@ref', '')
        if not file_name.lower().endswith('.wav'):
            logging.info(f"File name in JSON: {file_name}")
            return False, "The file is not a .wav file."
# Check if track 1's format is Wave
tracks = media_info.get('media', {}).get('track', [])
# Ensure there are at least two track entries before accessing index 1
if len(tracks) > 1:
track_1 = tracks[1] # Assuming track 1 is the second element (index 1)
            if track_1.get('@type', '') == 'Audio' and track_1.get('Format', '') == 'PCM' and track_1.get('SamplingRate', '') == '96000' and track_1.get('BitDepth', '') == '24':
                return True, "The file is a .wav file with 24-bit/96 kHz PCM audio in track 1."
            else:
                return False, f"Track 1 is not 24-bit/96 kHz PCM. Format: {track_1.get('Format', '')}, SamplingRate: {track_1.get('SamplingRate', '')}, BitDepth: {track_1.get('BitDepth', '')}"
return False, "No track 1 found."
except Exception as e:
return False, f"Error processing the content: {e}"
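A passing input for `check_audio_info` can likewise be sketched with an invented sample dict. Note that mediainfo's JSON reports `SamplingRate` and `BitDepth` as strings, which is why the function compares against `'96000'` and `'24'` rather than numbers:

```python
# Invented sample in the mediainfo JSON shape used by check_audio_info.
wav_sample = {
    'media': {
        '@ref': 'SOME/AUDIO/take1.wav',
        'track': [
            {'@type': 'General', 'Format': 'Wave'},
            {'@type': 'Audio', 'Format': 'PCM',
             'SamplingRate': '96000', 'BitDepth': '24'},  # strings, as mediainfo emits
        ],
    }
}

t1 = wav_sample['media']['track'][1]
ok = (wav_sample['media']['@ref'].lower().endswith('.wav')
      and t1.get('@type') == 'Audio'
      and t1.get('Format') == 'PCM'
      and t1.get('SamplingRate') == '96000'
      and t1.get('BitDepth') == '24')
assert ok  # same conditions the function checks on track 1
```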