Initial release of V02
Add SQL queries for file record analysis and S3 utility functions

- Introduced SQL queries to identify records with specific file types, including H264 variants, non-FILE audio and video files, and non-image digital files.
- Added aggregate queries to count unique base records per file type.
- Implemented S3 utility functions for file operations, including uploading, downloading, and checking file existence.
- Enhanced error handling and logging throughout the S3 file processing workflow.
- Updated requirements.txt with necessary dependencies for S3 and database interactions.
- Created utility functions for media validation, focusing on video and audio file checks.
This commit is contained in:
commit 74afbad9a8
@@ -0,0 +1,9 @@
.venv/
.env
.vscode/
__pycache__/
*.pyc
*.pyo
logs/
*.logs
*.log
@@ -0,0 +1,40 @@
# Use the official Python 3.11 image from Docker Hub
FROM python:3.11-slim

# Set the working directory inside the container
WORKDIR /app

# Copy the requirements file into the container
COPY ./requirements.txt .

# Install the required Python packages
RUN pip install --no-cache-dir -r requirements.txt

# Copy the .env file into the /app directory
# COPY ./.env /app/.env

# Copy the rest of the application code into the container
COPY . .

RUN chmod +x /app/cron_launch.sh

# Install cron
RUN apt-get update && apt-get install -y cron

# Add the cron job
# every 10 min
# RUN echo "*/10 * * * * /usr/local/bin/python /app/main.py >> /var/log/cron.log 2>&1" > /etc/cron.d/your_cron_job
# 1 AM
RUN echo "0 1 * * * /bin/bash /app/cron_launch.sh >> /var/log/cron.log 2>&1" > /etc/cron.d/your_cron_job

# Give execution rights on the cron job
RUN chmod 0644 /etc/cron.d/your_cron_job

# Apply the cron job
RUN crontab /etc/cron.d/your_cron_job

# Create the log file to be able to run tail
RUN touch /var/log/cron.log

# Run the command on container startup
CMD cron && tail -f /var/log/cron.log
@@ -0,0 +1,138 @@
# Project Setup

## Setting up a Virtual Environment

1. **Create a virtual environment:**

### For Linux/macOS:
```bash
python3 -m venv .venv
```

### For Windows:
```bash
python -m venv .venv
```
## ACH-server-import-media

This repository contains a script to import media files from an S3-compatible bucket into a database. It supports both local execution (virtual environment) and Docker-based deployment via `docker-compose`.

Contents
- `main.py` - main import script
- `docker-compose.yml` - docker-compose service for running the importer in a container
- `requirements.txt` - Python dependencies
- `config.py`, `.env` - configuration and environment variables

Prerequisites
- Docker & Docker Compose (or Docker Desktop)
- Python 3.8+
- Git (optional)

Quick local setup (virtual environment)

Linux / macOS
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Windows (PowerShell)
```powershell
python -m venv .venv
. .venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Running locally
1. Ensure your configuration is available (see `config.py` or provide a `.env` file with the environment variables used by the project).
2. Run the script (from the project root):

Linux / macOS
```bash
python main.py
```

Windows (PowerShell)
```powershell
& .venv\Scripts\python.exe main.py
```

Docker Compose

This project includes a `docker-compose.yml` with a service named `app` (container name `ACH_server_media_importer`). The compose file reads environment variables from `.env` and mounts a `logs` named volume.

Build and run (detached)
```powershell
# Docker Compose v2 syntax (recommended)
# From the repository root
docker compose up -d --build

# OR if your environment uses the v1 binary
# docker-compose up -d --build
```

Show logs
```powershell
# Follow logs for the 'app' service
docker compose logs -f app

# Or use the container name
docker logs -f ACH_server_media_importer
```

Stop / start / down
```powershell
# Stop containers
docker compose stop

# Start again
docker compose start

# Take down containers and network
docker compose down
```

Rebuild when already running

There are two safe, common ways to rebuild a service when the containers are already running:

1) Rebuild in place and recreate changed containers (recommended for most changes):

```powershell
# Rebuild images and recreate services in the background
docker compose up -d --build
```

This tells Compose to rebuild the image(s) and recreate containers for services whose image or configuration changed.

2) Full clean rebuild (use when you need to remove volumes or ensure a clean state):

```powershell
# Stop and remove containers, networks, and optionally volumes & images, then rebuild
docker compose down --volumes --rmi local
docker compose up -d --build
```

Notes
- `docker compose up -d --build` will recreate containers for services that need updating; it does not destroy named volumes unless you pass `--volumes` to `down`.
- If you need to execute a shell inside the running container:

```powershell
# run a shell inside the 'app' service
docker compose exec app /bin/sh
# or (if bash is available)
docker compose exec app /bin/bash
```

Environment and configuration
- Provide sensitive values via a `.env` file (the `docker-compose.yml` already references `.env`).
- Typical variables: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `BUCKET_NAME`, `DB_HOST`, `DB_NAME`, `DB_USER`, `SMTP_SERVER`, etc.
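
A minimal `.env` sketch covering the variables read by `config.py` (all values below are placeholders for illustration; note that `ACH_STORAGE_LOCATION` and `ACH_FILE_TYPE` must be valid JSON, since `config.py` parses them with `json.loads`):

```
# S3-compatible storage
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION=eu-west-1
AWS_ENDPOINT_URL=https://s3.example.com
BUCKET_NAME=artchive-dev

# PostgreSQL
DB_HOST=localhost
DB_NAME=ach
DB_USER=ach_user
DB_PASSWORD=change-me
DB_PORT=5432

# ACH metadata
ACH_EDITOR_ID=1
ACH_APPROVER_ID=1
ACH_NOTES=Imported by ACH-server-import-media
ACH_STORAGE_LOCATION={}
ACH_FILE_TYPE={}
```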

Troubleshooting
- If Compose fails to pick up code changes, ensure your local Dockerfile `COPY` commands include the source files and that `docker compose up -d --build` is run from the repository root.
- Use `docker compose logs -f app` to inspect runtime errors.
@@ -0,0 +1 @@
docker-compose down && docker-compose build && docker-compose up -d
@@ -0,0 +1,70 @@
import os
from dotenv import load_dotenv
# import logging
import json


def load_config():
    """
    Loads configuration from environment variables.
    """

    # Load environment variables from .env file
    load_dotenv()

    # Define configuration dictionaries
    aws_config = {
        'aws_access_key_id': os.getenv('AWS_ACCESS_KEY_ID', ''),
        'aws_secret_access_key': os.getenv('AWS_SECRET_ACCESS_KEY', ''),
        'region_name': os.getenv('AWS_REGION', ''),
        'endpoint_url': os.getenv('AWS_ENDPOINT_URL', ''),
    }

    db_config = {
        'host': os.getenv('DB_HOST', ''),
        'database': os.getenv('DB_NAME', ''),
        'user': os.getenv('DB_USER', ''),
        'password': os.getenv('DB_PASSWORD', ''),
        'port': os.getenv('DB_PORT', ''),
    }

    ach_config = {
        'ach_editor_id': int(os.getenv('ACH_EDITOR_ID', '0')),
        'ach_approver_id': int(os.getenv('ACH_APPROVER_ID', '0')),
        'ach_notes': os.getenv('ACH_NOTES', ''),
        'ach_storage_location': json.loads(os.getenv('ACH_STORAGE_LOCATION', '{}')),
        'ach_file_type': json.loads(os.getenv('ACH_FILE_TYPE', '{}')),  # unused
    }

    # Define ach_variables dictionary - consider moving these values to a separate configuration file (e.g., ach_variables.json)
    ach_variables = {
        'custom_data_in': {},
        'disk_size': 0,
        'media_disk_size': 0,
        'pdf_disk_size': 0,
        'extension': '.',
        'conservative_copy_extension': '.',
        'file_fullpath': '',
        'objectKeys': {
            'media': None,
            'pdf': None,
            'conservative_copy': None,
        },
        'inventory_code': '',
    }

    bucket_name = os.getenv('BUCKET_NAME', 'artchive-dev')

    # Log configuration loading status
    ''' logging.info(f'AWS config loaded: {aws_config}')
    logging.info(f'DB config loaded: {db_config}')
    logging.info(f'ACH config loaded: {ach_config}')
    logging.info(f'ACH variables loaded: {ach_variables}')
    logging.info(f'Bucket name loaded: {bucket_name}')'''

    return aws_config, db_config, ach_config, bucket_name, ach_variables


# Consider using a class for a more structured approach (optional)
@@ -0,0 +1,219 @@
# v20251103 - Main script to import media files from S3 to the database
import logging
import time
from datetime import datetime
import pytz
import os
from logging_config import setup_logging, CUSTOM_ERROR_LEVEL
from email_utils import handle_error, send_email_with_attachment
from s3_utils import create_s3_client, list_s3_bucket, parse_s3_files
from error_handler import handle_general_error, handle_file_not_found_error, handle_value_error
from file_utils import is_file_empty
from db_utils import count_files, get_distinct_filenames_from_db
from dotenv import load_dotenv
import config
import psycopg2

load_dotenv()


# MAIN PROCESS
def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
    # from config import load_config, aws_config, db_config, ach_config, bucket_name
    # global aws_config, db_config, ach_config, bucket_name
    # load_config()

    logging.info(f"bucket_name: {bucket_name}")

    # Ensure timing variables are always defined so later error-email logic
    # won't fail if an exception is raised before end_time/elapsed_time is set.
    start_time = time.time()
    end_time = start_time
    elapsed_time = 0.0

    try:
        logging.info("Starting the main process...")

        # Create the S3 client
        s3_client = create_s3_client(aws_config)

        # List S3 bucket contents
        contents = list_s3_bucket(s3_client, bucket_name)

        # Define valid extensions and excluded folders
        valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'FILE/'}

        # Extract and filter file names
        s3_file_names = [
            content['Key'] for content in contents
            if any(content['Key'].endswith(ext) for ext in valid_extensions) and
            not any(content['Key'].startswith(folder) for folder in excluded_folders)
        ]

        s3_only_mp4_file_names = [
            content['Key'] for content in contents
            if content['Key'].endswith('.mp4') and
            not any(content['Key'].startswith(folder) for folder in excluded_folders)
        ]

        total_file_s3mp4 = len(s3_only_mp4_file_names)
        logging.info(f"Total number of distinct .mp4 files in the S3 bucket before import: {total_file_s3mp4}")

        # Filter out S3 files already present in the DB
        # --- Get all DB filenames in one call ---
        db_file_names = get_distinct_filenames_from_db()

        # --- Keep only those not in the DB ---
        file_names = [f for f in s3_file_names if f not in db_file_names]

        # Log the totals
        total_file_db = len(db_file_names)
        logging.info(f"Total number of distinct files in the database before import: {total_file_db}")
        total_files_s3 = len(s3_file_names)
        logging.info(f"Total number of valid (mp3, mp4, md5, json, pdf) files in the S3 bucket before the DB filter: {total_files_s3}")
        total_files = len(file_names)
        logging.info(f"Total number of valid (mp3, mp4, md5, json, pdf) files after the DB filter: {total_files}")

        # Count files with each extension
        mp4_count = sum(1 for file in s3_file_names if file.endswith('.mp4'))
        mp3_count = sum(1 for file in s3_file_names if file.endswith('.mp3'))
        md5_count = sum(1 for file in s3_file_names if file.endswith('.md5'))
        pdf_count = sum(1 for file in s3_file_names if file.endswith('.pdf'))
        json_count = sum(1 for file in s3_file_names if file.endswith('.json'))
        mov_count = sum(1 for file in s3_file_names if file.endswith('.mov'))
        # jpg_count = sum(1 for file in file_names if file.endswith('.jpg'))
        avi_count = sum(1 for file in s3_file_names if file.endswith('.avi'))
        m4v_count = sum(1 for file in s3_file_names if file.endswith('.m4v'))

        # Log the counts
        logger = logging.getLogger()
        logging.warning("Number of .mp4 files on S3 bucket (%s): %s", bucket_name, mp4_count)
        logging.warning("Number of .mp3 files on S3 bucket (%s): %s", bucket_name, mp3_count)
        logging.warning("Number of .md5 files on S3 bucket (%s): %s", bucket_name, md5_count)
        logging.warning("Number of .pdf files on S3 bucket (%s): %s", bucket_name, pdf_count)
        logging.warning("Number of .json files on S3 bucket (%s): %s", bucket_name, json_count)
        logging.warning("Number of .mov files on S3 bucket (%s): %s", bucket_name, mov_count)
        if mp4_count != pdf_count:
            logging.error("Number of .mp4 files is not equal to number of .pdf files")
            logging.error("Abort Import Process due to missing files")
            # return
        if mp3_count + mp4_count != json_count:
            logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .json files")
            logging.error("Abort Import Process due to missing files")
            # return
        if mp3_count + mp4_count != md5_count:
            logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .md5 files")
            logging.error("Abort Import Process due to missing files")
            # return

        # Try to parse S3 files
        try:
            # if DRY RUN is set to True, the files will not be uploaded to the database
            logging.warning("DRY RUN is set to TRUE - No files will be added to the database")
            # set the counters to zero
            uploaded_files_count, warning_files_count, error_files_count = (0, 0, 0)

            logging.warning("Total number of files (mp3+mp4) with warnings: %s. (Probably already existing in the DB)", warning_files_count)
            logging.warning("Total number of files with errors: %s", error_files_count)
            logging.warning("Total number of files uploaded: %s", uploaded_files_count)
            logging.warning("All files parsed")
        except Exception as e:
            logging.error(f"An error occurred while parsing S3 files: {e}")
            handle_general_error(e)

        # Check results
        # Connect to the database
        conn = psycopg2.connect(**db_config)
        cur = conn.cursor()

        # Map file extensions (include leading dot) to mime types
        EXTENSION_MIME_MAP = {
            '.avi': 'video/x-msvideo',
            '.mov': 'video/mov',
            '.wav': 'audio/wav',
            '.mp4': 'video/mp4',
            '.m4v': 'video/mp4',
            '.mp3': 'audio/mp3',
            '.mxf': 'application/mxf',
            '.mpg': 'video/mpeg',
        }

        # populate mime_type list with all relevant MediaInfo/MIME values
        mime_type = [
            'video/x-msvideo',  # .avi
            'video/mov',        # .mov
            'audio/wav',        # .wav
            'video/mp4',        # .mp4, .m4v
            'audio/mp3',        # .mp3
            'application/mxf',  # .mxf
            'video/mpeg',       # .mpg
        ]

        logging.info(f"Mime types for counting files: {mime_type}")

        all_files_on_db = count_files(cur, mime_type, '*', False)
        mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
        mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
        mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)
        avi_files_on_db = count_files(cur, ['video/x-msvideo'], '.avi', False)
        m4v_files_on_db = count_files(cur, ['video/mp4'], '.m4v', False)
        mp4_files_on_db = count_files(cur, ['video/mp4'], '.mp4', False)
        wav_files_on_db = count_files(cur, ['audio/wav'], '.wav', False)
        mp3_files_on_db = count_files(cur, ['audio/mp3'], '.mp3', False)

        # mov + m4v + avi + mxf + mpg
        logging.warning(f"Number of all video files in the database: {all_files_on_db}")
        logging.warning(f"Number of .mov files in the database: {mov_files_on_db} and S3: {mov_count}")
        logging.warning(f"Number of .mp4 files in the database: {mp4_files_on_db} and S3: {mp4_count}")

        # Compare the S3 and DB lists and report the .mp4 files missing from the DB
        missing_mp4s = [f for f in file_names if f.endswith('.mp4') and f not in db_file_names]
        logging.warning(f"Missing .mp4 files in DB compared to S3: {missing_mp4s}")

        logging.warning(f"Number of .wav files in the database: {wav_files_on_db}")
        logging.warning(f"Number of .mp3 files in the database: {mp3_files_on_db} and S3: {mp3_count}")
        logging.warning(f"Number of .avi files in the database: {avi_files_on_db}")
        logging.warning(f"Number of .m4v files in the database: {m4v_files_on_db}")
        logging.warning(f"Number of .mxf files in the database: {mxf_files_on_db}")
        logging.warning(f"Number of .mpg files in the database: {mpg_files_on_db}")

        logging.warning(f"Total files to import (valid S3 files not yet in the DB): {total_files}")

        # time elapsed
        end_time = time.time()  # Record end time
        elapsed_time = end_time - start_time
        logging.warning(f"Processing completed. Time taken: {elapsed_time:.2f} seconds")

    # Specific handlers must come before the generic Exception handler,
    # otherwise they are unreachable.
    except FileNotFoundError as e:
        handle_file_not_found_error(e)
    except ValueError as e:
        handle_value_error(e)
    except Exception as e:
        handle_general_error(e)


if __name__ == "__main__":
    try:
        # Setup logging using standard TimedRotatingFileHandler handlers.
        # Rely on the handler's built-in rotation; don't call doRollover manually.
        logger, rotating_handler, error_handler, warning_handler = setup_logging()

        # Load configuration settings
        aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()

        logging.info("Config loaded, and logging setup")

        # Run the main process
        main_process(aws_config, db_config, ach_config, bucket_name, ach_variables)

    except Exception as e:
        logging.error(f"An error occurred: {e}")
@@ -0,0 +1,12 @@
#!/bin/bash

# Set the working directory
cd /app

# Source the environment variables
set -a
[ -f /app/.env ] && source /app/.env
set +a

# Run the Python script
/usr/local/bin/python /app/main.py >> /var/log/cron.log 2>&1
@ -0,0 +1,618 @@
|
|||
""" db utils.py
|
||||
Description:
|
||||
This module provides utility functions for interacting with a PostgreSQL database using psycopg2.
|
||||
It includes functions for checking inventory codes, adding file records and relationships,
|
||||
retrieving support IDs, and executing queries.
|
||||
Functions:
|
||||
check_inventory_in_db(s3_client, cur, base_name):
|
||||
Checks if the inventory code exists in the database and validates its format.
|
||||
check_objkey_in_file_db(cur, base_name):
|
||||
Checks if the object key exists in the database.
|
||||
add_file_record_and_relationship(s3_client, cur, base_name, ach_variables):
|
||||
Adds a file record and its relationships to the database and uploads a log file to S3.
|
||||
add_file_record(cur, editor_id, approver_id, disk_size, file_availability_dict, notes, base_name, extension, storage_location, file_type, custom_data_in=None):
|
||||
Adds a file record to the database.
|
||||
add_file_support_relationship(cur, editor_id, approver_id, file_id, support_id, status):
|
||||
Adds a relationship between a file and a support record in the database.
|
||||
add_file_relationship(cur, file_a_id, file_b_id, file_relation_dict, custom_data, status, editor_id, approver_id):
|
||||
Adds a relationship between two files in the database.
|
||||
retrieve_support_id(cur, inventory_code):
|
||||
Retrieves the support ID for a given inventory code from the database.
|
||||
retrieve_digital_file_names(s3_client, cur, base_name, digital_file_name_in):
|
||||
Retrieves digital file names from the database that match the given base_name.
|
||||
get_db_connection(db_config):
|
||||
Establishes a connection to the PostgreSQL database using the provided configuration.
|
||||
execute_query(conn, query, params=None):
|
||||
Executes a SQL query on the database.
|
||||
"""
|
||||
import psycopg2
|
||||
from psycopg2 import sql
|
||||
import logging
|
||||
from datetime import datetime
|
||||
import re
|
||||
from email_utils import handle_error
|
||||
import json
|
||||
import os
|
||||
import config
|
||||
|
||||
# Map file extensions (include leading dot) to mime types
|
||||
EXTENSION_MIME_MAP = {
|
||||
'.avi': 'video/x-msvideo',
|
||||
'.mov': 'video/mov',
|
||||
'.wav': 'audio/wav',
|
||||
'.mp4': 'video/mp4',
|
||||
'.m4v': 'video/mp4',
|
||||
'.mp3': 'audio/mp3',
|
||||
'.mxf': 'application/mxf',
|
||||
'.mpg': 'video/mpeg',
|
||||
}
|
||||
|
||||
def get_mime_for_extension(extension: str) -> str:
|
||||
"""Return the mime type for an extension. Accepts with or without leading dot.
|
||||
|
||||
Falls back to 'application/octet-stream' when unknown.
|
||||
"""
|
||||
if not extension:
|
||||
return 'application/octet-stream'
|
||||
if not extension.startswith('.'):
|
||||
extension = f'.{extension}'
|
||||
return EXTENSION_MIME_MAP.get(extension.lower(), 'application/octet-stream')
|
||||
|
||||
def get_distinct_filenames_from_db():
|
||||
"""Retrieve distinct digital file names from the Postgres DB.
|
||||
|
||||
This helper loads DB configuration via `config.load_config()` and then
|
||||
opens a connection with `get_db_connection(db_config)`.
|
||||
"""
|
||||
# load db_config from project config helper (aws_config, db_config, ach_config, bucket_name, ach_variables)
|
||||
try:
|
||||
_, db_config, _, _, _ = config.load_config()
|
||||
except Exception:
|
||||
# If config.load_config isn't available or fails, re-raise with a clearer message
|
||||
raise RuntimeError("Unable to load DB configuration via config.load_config()")
|
||||
|
||||
conn = get_db_connection(db_config)
|
||||
try:
|
||||
with conn.cursor() as cur:
|
||||
# cur.execute("SELECT DISTINCT digital_file_name FROM file;")
|
||||
cur.execute(
|
||||
"""SELECT DISTINCT digital_file_name
|
||||
FROM file
|
||||
WHERE digital_file_name ~ '\\.(mp3|mp4|md5|json|pdf)$';"""
|
||||
)
|
||||
rows = cur.fetchall()
|
||||
# Flatten list of tuples -> simple set of names
|
||||
return {row[0] for row in rows if row[0] is not None}
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
# Function to check if the inventory code exists in the database
|
||||
def check_inventory_in_db(s3_client, cur, base_name):
|
||||
logging.debug("Executing check_inventory_in_db")
|
||||
# Load the configuration from the .env file
|
||||
# aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()
|
||||
|
||||
# Define the pattern for the inventory code
|
||||
media_tipology_A = ['MCC', 'OA4', 'DAT']
|
||||
# TODO add other tipologies: AVI, M4V, MOV, MP4, MXF, MPG (done 04112025)
|
||||
media_tipology_V = [
|
||||
'OV1', 'OV2', 'UMT', 'VHS', 'HI8', 'VD8', 'BTC', 'DBT', 'IMX', 'DVD',
|
||||
'CDR', 'MDV', 'DVC', 'HDC', 'BRD', 'CDV',
|
||||
'AVI', 'M4V', 'MOV', 'MP4', 'MXF', 'MPG' # add for "file" folders 04112025
|
||||
]
|
||||
|
||||
# list of known mime types (derived from EXTENSION_MIME_MAP)
|
||||
mime_type = list({v for v in EXTENSION_MIME_MAP.values()})
|
||||
|
||||
try:
|
||||
logging.info(f"SUPPORT TYPOLOGY : {base_name[3:6]}")
|
||||
if base_name[3:6] == 'OA4':
|
||||
pattern = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}_\d{2}$' # include the _\d{2} for OA4
|
||||
truncated_base_name = base_name[:15]
|
||||
logging.info(f"type is OA4: {truncated_base_name}")
|
||||
elif base_name[3:6] == 'MCC':
|
||||
pattern = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}_[AB]$' # include the _[AB] for MCC
|
||||
truncated_base_name = base_name[:14]
|
||||
logging.info(f"type is MCC: {truncated_base_name}")
|
||||
else:
|
||||
# Check the base_name format with regex pattern first
|
||||
pattern = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$'
|
||||
truncated_base_name = base_name[:12]
|
||||
logging.info(f"type is default: {truncated_base_name}")
|
||||
|
||||
logging.info(f"Checking inventory code {truncated_base_name} with pattern {pattern}...")
|
||||
# Validate the string
|
||||
try:
|
||||
if not re.match(pattern, truncated_base_name):
|
||||
error_message = f"Invalid format for base_name {truncated_base_name}"
|
||||
logging.error(error_message)
|
||||
raise ValueError(error_message) # Create and raise the exception
|
||||
else :
|
||||
# Extract the first character and the 3 central characters
|
||||
first_char = truncated_base_name[0]
|
||||
central_chars = truncated_base_name[3:6]
|
||||
|
||||
# Check the corresponding list based on the first character
|
||||
if first_char == 'A': # Check the corresponding list based on the first character
|
||||
if central_chars not in media_tipology_A:
|
||||
logging.error(f"Invalid media tipology for base_name {truncated_base_name}")
|
||||
return False, None
|
||||
elif first_char == 'V': # Check the corresponding list based on the first character
|
||||
if central_chars not in media_tipology_V:
|
||||
logging.error(f"Invalid media tipology for base_name {truncated_base_name}")
|
||||
return False, None
|
||||
else: # Invalid first character
|
||||
logging.error(f"Invalid first character for base_name {truncated_base_name}")
|
||||
return False, None
|
||||
|
||||
logging.info(f"Valid format for base_name {truncated_base_name}")
|
||||
except ValueError as e:
|
||||
# Handle the specific ValueError exception
|
||||
logging.error(f"Caught a ValueError: {e}")
|
||||
# Optionally, take other actions or clean up
|
||||
return False, None
|
||||
except Exception as e:
|
||||
# Handle any other exceptions
|
||||
logging.error(f"Caught an unexpected exception: {e}")
|
||||
# Optionally, take other actions or clean up
|
||||
return False, None
|
||||
|
||||
# First query: Check if the truncated base_name matches an inventory code in the support table
|
||||
check_query = sql.SQL("""
|
||||
SELECT 1
|
||||
FROM support
|
||||
WHERE inventory_code LIKE %s
|
||||
LIMIT 1;
|
||||
""")
|
||||
cur.execute(check_query, (f"{truncated_base_name[:12]}%",))
|
||||
result = cur.fetchone()
|
||||
|
||||
if result:
|
||||
logging.info(f"Inventory code {truncated_base_name[:12]} found in the database.")
|
||||
# Call the function to retrieve digital file names, assuming this function is implemented
|
||||
return True, truncated_base_name
|
||||
else:
|
||||
logging.info(f"Inventory code {truncated_base_name} not found in the database.")
|
||||
handle_error(f"Inventory code {truncated_base_name} not found in the database.")
|
||||
#raise ValueError(f"Inventory code {truncated_base_name} not found in the database.")
|
||||
return False, None
|
||||
|
||||
except Exception as e:
|
||||
logging.error(f'Error checking inventory code {base_name}:', {e})
|
||||
raise e
|
||||
|
||||
# Function to check if the object key exists in the database
|
||||
def check_objkey_in_file_db(cur, base_name):
|
||||
"""
|
||||
Checks if the base_name matches digital_file_name in the file table.
|
||||
|
||||
Args:
|
||||
cur (cursor): The database cursor.
|
||||
base_name (str): The base name to check in the database.
|
||||
|
||||
Returns:
|
||||
tuple: A tuple containing a boolean indicating if the base_name was found and the base_name itself or None.
|
||||
"""
|
||||
logging.debug("Executing check_objkey_in_file_db")
|
||||
|
||||
try:
|
||||
# First query: Check if the base_name matches digital_file_name in the file table
|
||||
check_query = sql.SQL("""
|
||||
SELECT 1
|
||||
FROM file
|
||||
WHERE digital_file_name LIKE %s
|
||||
LIMIT 1;
|
||||
""")
|
||||
cur.execute(check_query, (f"{base_name}%",))
|
||||
result = cur.fetchone()
|
||||
|
||||
if result:
|
||||
logging.info(f"Inventory code {base_name} found in the database.")
|
||||
# Call the function to retrieve digital file names, assuming this function is implemented
|
||||
return True
|
||||
else:
|
||||
logging.info(f"Inventory code {base_name} not found in the database.")
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
logging.error(f"Error checking inventory code {base_name}: {e}")
|
||||
raise e
|
||||
|
||||
# Function to add a file record and its relationship to the support record
def add_file_record_and_relationship(s3_client, cur, base_name, ach_variables):
    """
    Adds a file record and its relationships to the database and uploads a log file to S3.

    This function performs the following steps:
    1. Loads configuration from the .env file.
    2. Retrieves the support ID for the given base name.
    3. Adds a new file record for the conservative copy.
    4. Adds a relationship between the new file and the support ID.
    5. If the file extension is .mp4 or .mp3, adds a new MP4/MP3 file record and its relationships.
    6. If the file extension is .mp4 and a PDF exists, adds a PDF file record and its relationship to the master file.
    7. Uploads a log file to S3 if all operations are successful.

    Args:
        s3_client (boto3.client): The S3 client used to upload the log file.
        cur (psycopg2.cursor): The database cursor used to execute SQL queries.
        base_name (str): The base name of the file.
        ach_variables (dict): A dictionary containing various variables and configurations needed for the operation.

    Returns:
        bool: True if the operation is successful, False otherwise.

    Raises:
        Exception: If any error occurs during the operation, it is logged and re-raised.
    """
    # Load the configuration from the .env file
    aws_config, db_config, ach_config, bucket_name, _ = config.load_config()

    editor_id = ach_config['ach_editor_id']
    approver_id = ach_config['ach_approver_id']
    # Append the current date and time to notes in format: yyyy mm dd HH MM SS
    now_dt = datetime.now()
    date_part = now_dt.strftime('%Y %m %d')
    time_part = now_dt.strftime('%H %M %S')
    notes = f"{ach_config.get('ach_notes', '')} {date_part} {time_part}"

    ach_variables['file_copia_conservativa'] = ach_variables['custom_data_in'].get('mediainfo', {}).get("media", {}).get("@ref", "")
    logging.info(f"ach_variables['file_copia_conservativa']: {ach_variables['file_copia_conservativa']}")

    logging.debug("Executing add_file_record_and_relationship")

    try:
        # Retrieve the support ID for the given base name
        support_id = retrieve_support_id(cur, ach_variables['inventory_code'])
        if support_id:
            logging.info(f"Found support_id {support_id} for base_name {base_name}.")

            # The conservative-copy key is the media key with the _H264 suffix removed
            ach_variables['objectKeys']['conservative_copy'] = ach_variables['file_copia_conservativa'].replace('_H264', '')

            # Add a new file record and get the new file ID
            file_availability_dict = 7  # placeholder
            # Add a new file record for the "copia conservativa"
            ach_variables['custom_data_in']['media_usage'] = 'master'  # can be "copia conservativa"
            # Determine the master MIME type from the file extension
            master_mime_type = get_mime_for_extension(ach_variables.get('extension'))

            new_file_id = add_file_record(
                cur,
                editor_id,
                approver_id,
                ach_variables['disk_size'],
                file_availability_dict,
                notes,
                ach_variables['objectKeys']['conservative_copy'],
                ach_variables['conservative_copy_extension'],
                ach_config['ach_storage_location'],
                master_mime_type,
                ach_variables['custom_data_in']
            )

            if new_file_id:
                logging.info(f"Added file record for {base_name} with file_id {new_file_id}.")
                # Add a relationship between the new file and the support ID
                status = '{"saved": true, "status": "approved"}'  # Define the status JSON
                add_file_support_relationship(cur, editor_id, approver_id, new_file_id, support_id, status)

                # If the file extension is .mp4 or .mp3, add a new MP4/MP3 file record
                mime_type = get_mime_for_extension(ach_variables.get('extension'))
                if ach_variables['extension'] == '.mp4' or ach_variables['extension'] == '.mp3':
                    file_availability_dict = 8  # Hot Storage
                    mp4_file_id = add_file_record(
                        cur,
                        editor_id,
                        approver_id,
                        ach_variables['media_disk_size'],
                        file_availability_dict,
                        notes,
                        ach_variables['objectKeys']['media'],
                        ach_variables['extension'],  # deprecated
                        {"storage_type": "s3", "storage_location_id": 5},
                        mime_type,
                        {"media_usage": "streaming"}
                    )
                    if mp4_file_id:
                        logging.info(f"Added MP4/MP3 file record for {base_name} with file_id {mp4_file_id}.")
                        # Add a relationship between the MP4 file and the support ID
                        status = '{"saved": true, "status": "approved"}'  # Define the status JSON
                        add_file_support_relationship(cur, editor_id, approver_id, mp4_file_id, support_id, status)

                        # Add a relationship between the streaming file (mp4_file_id) and the master file (new_file_id)
                        status = '{"saved": true, "status": "approved"}'  # Define the status JSON
                        file_relation_dict = 10  # Relationship dictionary: "È re-encoding di master"
                        add_file_relationship(cur, new_file_id, mp4_file_id, file_relation_dict, '{}', status, editor_id, approver_id)

                        # The .mp4 should also have the QC report in PDF format: add the PDF as documentation of the master file
                        if ach_variables['extension'] == '.mp4' and ach_variables['pdf_disk_size'] > 0:
                            file_availability_dict = 8  # Hot Storage
                            pdf_file_id = add_file_record(
                                cur,
                                editor_id,
                                approver_id,
                                ach_variables['pdf_disk_size'],
                                file_availability_dict,
                                notes,
                                ach_variables['objectKeys']['pdf'],
                                '.pdf',
                                {"storage_type": "s3", "storage_location_id": 5},
                                "application/pdf",
                                {"media_usage": "documentation"}
                            )
                            if pdf_file_id:
                                logging.info(f"Added PDF file record for {base_name} with file_id {pdf_file_id}.")
                            # If both MP4 and PDF file IDs exist, add a relationship between them
                            if ach_variables['extension'] == '.mp4' and pdf_file_id:
                                file_relation_dict = 11  # Relationship dictionary: "È documentazione di master"
                                custom_data = '{}'  # Any additional custom data, if needed
                                status = '{"saved": true, "status": "approved"}'  # Define the status
                                add_file_relationship(cur, new_file_id, pdf_file_id, file_relation_dict, custom_data, status,
                                                      editor_id, approver_id)

                # If everything is successful, upload the log file to S3
                # log_file_path = ach_variables['file_fullpath'] + base_name + '.log'
                # logging.info(f"Uploading log file {log_file_path} to S3...")
                # log_data = "import successful"
                # log_to_s3(s3_client, bucket_name, log_file_path, log_data)
                # logging.info(f"Log file {log_file_path} uploaded to S3.")
                return True
        else:
            logging.error(f"No support_id found for base_name {base_name}.")
            return False

    except Exception as e:
        logging.error(f'Error adding file record and relationship: {e}')
        raise e
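The notes string built above concatenates a configured prefix with a space-separated timestamp, and the conservative-copy key is derived by stripping the `_H264` suffix from the streaming key. A minimal standalone sketch of both transformations; the helper names `build_notes` and `strip_h264` are illustrative and not part of the codebase:

```python
from datetime import datetime

def build_notes(prefix: str, now: datetime) -> str:
    # Same format as above: "<prefix> yyyy mm dd HH MM SS"
    date_part = now.strftime('%Y %m %d')
    time_part = now.strftime('%H %M %S')
    return f"{prefix} {date_part} {time_part}"

def strip_h264(object_key: str) -> str:
    # The streaming rendition carries an _H264 suffix; the
    # conservative copy is the same key without it.
    return object_key.replace('_H264', '')
```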

# Function to add a file record
def add_file_record(cur, editor_id, approver_id, disk_size, file_availability_dict, notes, base_name, extension, storage_location, file_type, custom_data_in=None):
    try:
        cur.execute("""
            SELECT public.add_file02(
                %s, -- editor_id_in
                %s, -- approver_id_in
                %s, -- disk_size_in
                %s, -- file_availability_dict_in
                %s, -- notes_in
                %s, -- digital_file_name_in
                %s, -- original_file_name_in
                %s, -- status_in
                %s, -- storage_location_in
                %s, -- file_type_in
                %s  -- custom_data_in
            )
        """, (
            editor_id,
            approver_id,
            disk_size,
            file_availability_dict,
            notes,
            f"{base_name}",  # digital_file_name_in
            f"{base_name}",  # original_file_name_in
            '{"saved": true, "status": "approved"}',  # status_in
            json.dumps(storage_location),  # storage_location_in
            json.dumps({"type": file_type}),  # file_type_in
            json.dumps(custom_data_in)  # custom_data_in
        ))

        # Fetch the result returned by the function
        file_result = cur.fetchone()
        return file_result[0] if file_result else None

    except Exception as e:
        logging.error('Error adding file record: %s', e)
        raise e

# Function to add a file support relationship
def add_file_support_relationship(cur, editor_id, approver_id, file_id, support_id, status):
    try:
        # Call the stored procedure using a SQL CALL statement with named parameters
        cur.execute("""
            CALL public.add_rel_file_support(
                editor_id_in := %s,
                approver_id_in := %s,
                file_id_in := %s,
                support_id_in := %s,
                status_in := %s
            )
        """, (
            editor_id,    # editor_id_in
            approver_id,  # approver_id_in
            file_id,      # file_id_in
            support_id,   # support_id_in
            status        # status_in
        ))

        # The procedure does not return any result, so just log a confirmation message
        logging.info(f"Added file support relationship for file_id {file_id}.")

    except Exception as e:
        logging.error('Error adding file support relationship: %s', e)
        raise e

# Function to add a file-to-file relationship
def add_file_relationship(cur, file_a_id, file_b_id, file_relation_dict, custom_data, status, editor_id, approver_id):
    try:
        # Call the stored procedure using a SQL CALL statement with named parameters
        cur.execute("""
            CALL public.add_rel_file(
                editor_id_in := %s,
                approver_id_in := %s,
                file_a_id_in := %s,
                file_b_id_in := %s,
                file_relation_dict_in := %s,
                status_in := %s,
                custom_data_in := %s
            )
        """, (
            editor_id,           # editor_id_in
            approver_id,         # approver_id_in
            file_a_id,           # file_a_id_in
            file_b_id,           # file_b_id_in
            file_relation_dict,  # file_relation_dict_in
            status,              # status_in
            custom_data          # custom_data_in
        ))

        # The procedure does not return a result, so just log a confirmation message
        logging.info(f"Added file relationship between file_id {file_a_id} and file_id {file_b_id}.")
    except Exception as e:
        logging.error(f'Error adding file relationship: {e}', exc_info=True)
        raise e

# Function to retrieve the support ID for a given inventory code
def retrieve_support_id(cur, inventory_code):
    try:
        cur.execute("""
            SELECT MAX(s.id) AS id
            FROM support s
            WHERE s.inventory_code LIKE %s::text
              AND (s.support_type ->> 'type' LIKE 'video' OR s.support_type ->> 'type' LIKE 'audio')
            GROUP BY s.h_base_record_id
        """, (f"{inventory_code}%",))

        support_result = cur.fetchone()
        logging.info(f"support_result: {support_result[0] if support_result and support_result[0] else None}")
        return support_result[0] if support_result and support_result[0] else None

    except Exception as e:
        logging.error('Error retrieving support_id: %s', e)
        raise e

# Function to retrieve digital_file_name entries from the database
def retrieve_digital_file_names(s3_client, cur, base_name, digital_file_name_in):
    try:
        logging.info(f"Retrieving digital file names for inventory code {base_name} and digital_file_name {digital_file_name_in}")
        # Define the query to retrieve digital_file_name
        query = sql.SQL("""
            WITH support_ids AS (
                SELECT s.id
                FROM support s
                WHERE s.inventory_code LIKE %s
            ), rel_fs AS (
                SELECT rfs.file_id
                FROM rel_file_support rfs
                WHERE rfs.support_id IN (SELECT id FROM support_ids)
            )
            SELECT f.id AS file_id, f.digital_file_name
            FROM file f
            WHERE f.id IN (SELECT file_id FROM rel_fs)
               OR f.digital_file_name ILIKE %s
        """)

        # Execute the query
        cur.execute(query, (f"{base_name[:12]}", f"{digital_file_name_in}.%"))
        results = cur.fetchall()

        # Log the results
        for result in results:
            logging.info(f"File ID: {result[0]}, Digital File Name: {result[1]}")

        # Filter results to match the base name (without extension)
        matching_files = []
        for row in results:
            digital_file_name = row[1]  # Second column: digital_file_name
            # Extract the base name without extension
            base_name_from_file = os.path.splitext(os.path.basename(digital_file_name))[0]
            logging.info(f"base_name_from_file: {base_name_from_file}")
            # Compare with the provided digital_file_name_in
            if base_name_from_file == os.path.splitext(os.path.basename(digital_file_name_in))[0]:
                matching_files.append(digital_file_name)

        logging.info(f"Matching digital file names: {matching_files}")
        if matching_files:
            logging.info(f"Found the following matching digital file names: {matching_files}; do not add")
            return False
        else:
            logging.info(f"No matching digital file names found for inventory code {base_name}; try to add a new record")
            return True

    except Exception as e:
        logging.error('Error retrieving digital file names for inventory code %s: %s', base_name, e)
        raise e
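The duplicate check above hinges on comparing names with both directories and extensions removed. A small standalone sketch of that comparison; the helper name `same_base_name` is illustrative:

```python
import os

def same_base_name(stored_name: str, candidate: str) -> bool:
    # Compare file names ignoring directories and extensions,
    # as retrieve_digital_file_names does above.
    a = os.path.splitext(os.path.basename(stored_name))[0]
    b = os.path.splitext(os.path.basename(candidate))[0]
    return a == b
```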
# Function to get a database connection
def get_db_connection(db_config):
    try:
        conn = psycopg2.connect(**db_config)
        return conn
    except psycopg2.Error as e:
        logging.error(f"Error connecting to the database: {e}")
        raise e

# Function to execute a query
def execute_query(conn, query, params=None):
    try:
        with conn.cursor() as cur:
            cur.execute(query, params)
            conn.commit()
    except psycopg2.Error as e:
        logging.error(f"Error executing query: {e}")
        raise e

# Function to count files in the database
def count_files(cur, mime_type, extension='*', file_dir=True):
    """
    Count files in the database, applying the given filters.
    """
    try:
        # Base query components
        query = (
            "SELECT COUNT(DISTINCT h_base_record_id) "
            "FROM file "
            "WHERE file_type ->> 'type' IS NOT NULL"
        )

        args = ()

        # Handle a mime_type list: expand into ARRAY[...] with individual placeholders
        if mime_type:
            if isinstance(mime_type, (list, tuple)):
                # Create placeholders for each mime type and append to args
                placeholders = ','.join(['%s'] * len(mime_type))
                query += f" AND file_type ->> 'type' = ANY(ARRAY[{placeholders}])"
                args += tuple(mime_type)
            else:
                # Single value
                query += " AND file_type ->> 'type' = %s"
                args += (mime_type,)

        # Add the extension condition if not a wildcard
        if extension != '*' and extension is not None:
            query += " AND digital_file_name ILIKE %s"
            args = args + (f'%{extension}',)
        # If extension is '*', no additional condition is needed (matches any extension)

        # Add the file directory condition based on the file_dir parameter
        if file_dir:
            # Only files in a directory (original_file_name starts with 'FILE')
            query += " AND original_file_name ILIKE %s"
            args = args + ('FILE%',)
        else:
            # Exclude files in a directory (original_file_name does not start with 'FILE')
            query += " AND original_file_name NOT ILIKE %s"
            args = args + ('FILE%',)

        try:
            logging.debug("Executing count_files SQL: %s -- args: %s", query, args)
            cur.execute(query, args)
            result = cur.fetchone()
            # fetchone() returns a sequence or None; protect against unexpected empty sequences
            if not result:
                return 0
            try:
                return result[0]
            except (IndexError, TypeError):
                logging.exception("Unexpected result shape from count query: %s", result)
                raise
        except Exception:
            logging.exception('Error executing count_files with query: %s', query)
            raise
    except Exception as e:
        logging.error('Error counting files: %s', e)
        raise e
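count_files builds its WHERE clause incrementally, so the placeholder tuple has to stay in step with the SQL string. A database-free sketch of the same assembly logic, useful for inspecting the generated SQL and parameters; the name `build_count_query` is illustrative:

```python
def build_count_query(mime_type, extension='*', file_dir=True):
    # Pure-Python sketch of the WHERE-clause assembly in count_files above.
    query = ("SELECT COUNT(DISTINCT h_base_record_id) FROM file "
             "WHERE file_type ->> 'type' IS NOT NULL")
    args = ()
    if mime_type:
        if isinstance(mime_type, (list, tuple)):
            # One %s per mime type inside ANY(ARRAY[...])
            placeholders = ','.join(['%s'] * len(mime_type))
            query += f" AND file_type ->> 'type' = ANY(ARRAY[{placeholders}])"
            args += tuple(mime_type)
        else:
            query += " AND file_type ->> 'type' = %s"
            args += (mime_type,)
    if extension != '*' and extension is not None:
        query += " AND digital_file_name ILIKE %s"
        args += (f'%{extension}',)
    if file_dir:
        query += " AND original_file_name ILIKE %s"
    else:
        query += " AND original_file_name NOT ILIKE %s"
    args += ('FILE%',)
    return query, args
```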
@ -0,0 +1,26 @@
services:
  app:
    build: .
    container_name: ACH_server_media_importer
    volumes:
      - logs:/app/logs # Map the logs volume
    env_file:
      - .env
    environment:
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      - AWS_REGION=${AWS_REGION}
      - AWS_ENDPOINT_URL=${AWS_ENDPOINT_URL}
      - BUCKET_NAME=${BUCKET_NAME}
      - DB_HOST=${DB_HOST}
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - SMTP_SERVER=${SMTP_SERVER}
      - SMTP_PORT=${SMTP_PORT}
      - SMTP_USER=${SMTP_USER}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - SENDER_EMAIL=${SENDER_EMAIL}
      - EMAIL_RECIPIENTS=${EMAIL_RECIPIENTS}
    restart: unless-stopped
volumes:
  logs: # Define the named volume

@ -0,0 +1,106 @@
import os
import smtplib
import traceback
import logging
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email import encoders
from email.utils import formataddr
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Email configuration
SMTP_SERVER = os.getenv('SMTP_SERVER')
SMTP_PORT = os.getenv('SMTP_PORT')
SMTP_USER = os.getenv('SMTP_USER')
SMTP_PASSWORD = os.getenv('SMTP_PASSWORD')
SENDER_EMAIL = os.getenv('SENDER_EMAIL')

# Split env recipient lists safely and strip whitespace; default to an empty list
def _split_env_list(varname):
    raw = os.getenv(varname, '')
    return [s.strip() for s in raw.split(',') if s.strip()]

EMAIL_RECIPIENTS = _split_env_list('EMAIL_RECIPIENTS')
ERROR_EMAIL_RECIPIENTS = _split_env_list('ERROR_EMAIL_RECIPIENTS')
SUCCESS_EMAIL_RECIPIENTS = _split_env_list('SUCCESS_EMAIL_RECIPIENTS')


# Send an email with an optional attachment
def send_email_with_attachment(subject, body, attachment_path=None, email_recipients=None):
    sender_name = "Art.c.hive Support for ARKIVO"
    try:
        # Create a multipart message
        msg = MIMEMultipart()
        msg['From'] = formataddr((sender_name, SENDER_EMAIL))
        # If email_recipients is not defined, use EMAIL_RECIPIENTS
        if email_recipients:
            msg['To'] = ', '.join(email_recipients)
        else:
            msg['To'] = ', '.join(EMAIL_RECIPIENTS)

        msg['Subject'] = subject

        # Attach the body to the msg instance
        msg.attach(MIMEText(body, 'plain'))

        # Attach the file if provided
        if attachment_path:
            if os.path.exists(attachment_path):
                with open(attachment_path, "rb") as attachment:
                    part = MIMEBase('application', 'octet-stream')
                    part.set_payload(attachment.read())
                    encoders.encode_base64(part)
                    part.add_header('Content-Disposition', f'attachment; filename={os.path.basename(attachment_path)}')
                    msg.attach(part)
            else:
                logging.warning(f"Attachment path {attachment_path} does not exist. Skipping attachment.")

        # Create an SMTP session for sending the mail
        recipients_list = email_recipients if email_recipients else EMAIL_RECIPIENTS
        if isinstance(recipients_list, str):
            recipients_list = [recipients_list]
        with smtplib.SMTP(SMTP_SERVER, SMTP_PORT) as server:
            server.starttls()  # Enable security
            server.login(SMTP_USER, SMTP_PASSWORD)  # Log in with user and password
            text = msg.as_string()
            server.sendmail(SENDER_EMAIL, recipients_list, text)

        logging.info("Email sent successfully.")
    except Exception as e:
        logging.error(f"Failed to send email: {e}")

# Create and send an error email
def send_error_email(subject, body, recipients):
    try:
        sender_email = os.getenv('SENDER_EMAIL')
        smtp_server = os.getenv('SMTP_SERVER')
        smtp_port = int(os.getenv('SMTP_PORT'))
        smtp_user = os.getenv('SMTP_USER')
        smtp_password = os.getenv('SMTP_PASSWORD')

        msg = MIMEMultipart()
        msg['From'] = sender_email
        msg['To'] = ", ".join(recipients)
        msg['Subject'] = subject
        msg.attach(MIMEText(body, 'plain'))

        with smtplib.SMTP(smtp_server, smtp_port) as server:
            server.starttls()
            server.login(smtp_user, smtp_password)
            server.sendmail(sender_email, recipients, msg.as_string())

        logging.info("Error email sent successfully")
    except Exception as e:
        logging.error(f"Failed to send error email: {e}")

# Handle an error by emailing the traceback, then re-raising
def handle_error(e):
    error_trace = traceback.format_exc()
    subject = "Error Notification"
    body = f"An error occurred:\n\n{error_trace}"
    recipients = os.getenv('EMAIL_RECIPIENTS', '').split(',')
    send_error_email(subject, body, recipients)
    raise e

@ -0,0 +1,22 @@
# error_handler.py

import logging

def handle_general_error(e):
    logging.error(f'An error occurred during the process: {e}')
    # Add any additional error handling logic here

def handle_file_not_found_error(e):
    logging.error(f"File not found error: {e}")
    # Add any additional error handling logic here

def handle_value_error(e):
    logging.error(f"Value error: {e}")
    # Add any additional error handling logic here

def handle_error(error_message):
    logging.error(f"Error: {error_message}")

class ClientError(Exception):
    """Custom exception class for client errors."""
    pass

@ -0,0 +1,349 @@
import os
import logging
from logging.handlers import RotatingFileHandler
import json

from utils import check_video_info, check_audio_info

from error_handler import handle_error
from botocore.exceptions import ClientError

# from config import load_config, aws_config, bucket_name
import config

def retrieve_file_contents(s3, base_name):
    file_contents = {}

    # Retrieve the configuration values
    # aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()
    _, _, _, bucket_name, _ = config.load_config()

    try:
        # Define the file extensions as pairs
        file_extensions = [['json', 'json'], ['md5', 'md5']]

        for ext_pair in file_extensions:
            file_name = f"{base_name}.{ext_pair[0]}"

            try:
                response = s3.get_object(Bucket=bucket_name, Key=file_name)
                file_contents[ext_pair[1]] = response['Body'].read().decode('utf-8')
                logging.info(f"Retrieved {ext_pair[1]} file content for base_name {base_name}.")
            except ClientError as e:
                # S3 returns a NoSuchKey error code when the key is missing.
                code = e.response.get('Error', {}).get('Code', '')
                if code in ('NoSuchKey', '404', 'NotFound'):
                    logging.warning(f"{file_name} not found in S3 (code={code}).")
                    # Treat missing sidecars as non-fatal; continue
                    continue
                else:
                    logging.error(f"Error retrieving {file_name}: {e}", exc_info=True)
                    # Re-raise other ClientError types
                    raise
    except Exception as e:
        logging.error(f'Error retrieving file contents for {base_name}: {e}', exc_info=True)
        # Return an empty JSON structure instead of raising, to avoid tracebacks in callers
        return json.dumps({})

    # Clean and format file_contents as proper JSON
    try:
        cleaned_contents = {}

        # Clean the contents
        for key, value in file_contents.items():
            if isinstance(value, str):
                # Remove trailing newlines or any other unwanted characters
                cleaned_value = value.strip()

                # Attempt to parse JSON
                try:
                    cleaned_contents[key] = json.loads(cleaned_value)
                except json.JSONDecodeError:
                    cleaned_contents[key] = cleaned_value
            else:
                cleaned_contents[key] = value

        # Return the cleaned and formatted JSON
        return json.dumps(cleaned_contents, indent=4)
    except (TypeError, ValueError) as e:
        logging.error(f'Error formatting file contents as JSON: {e}', exc_info=True)
        raise e
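The cleaning loop above parses each sidecar as JSON when possible and otherwise keeps the stripped string (md5 files, for instance, are plain hex digests). A standalone sketch of that parse-or-keep behavior; the name `clean_contents` is illustrative:

```python
import json

def clean_contents(file_contents: dict) -> dict:
    # Mirrors the loop above: strings are stripped and parsed as JSON
    # when possible, otherwise kept verbatim.
    cleaned = {}
    for key, value in file_contents.items():
        if isinstance(value, str):
            v = value.strip()
            try:
                cleaned[key] = json.loads(v)
            except json.JSONDecodeError:
                cleaned[key] = v
        else:
            cleaned[key] = value
    return cleaned
```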

def check_related_files(s3, file_name_with_path, file, bucket_name):
    """
    Check for related files in S3 based on the given file type.

    Parameters:
    - s3: The S3 client object.
    - file_name_with_path: The name of the file with its path.
    - file: The file name.
    - bucket_name: The name of the S3 bucket.

    Returns:
    - The size in bytes of the related PDF file, or 0 if no PDF is required.

    Raises:
    - FileNotFoundError: If a required file is not found in S3.
    - ValueError: If a file has zero size.
    - Exception: If an unexpected exception occurs.
    """
    from s3_utils import check_file_exists_in_s3, get_file_size  # avoid circular import

    import config

    # Load the configuration from the .env file
    # aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()
    _, _, _, bucket_name, _ = config.load_config()

    ach_pdf_disk_size = 0

    # Set the required extensions based on the file type
    if file.endswith('.mp4'):
        required_extensions = ['json', 'md5', 'pdf']
    elif file.endswith('.mp3'):
        required_extensions = ['json', 'md5']
    else:
        required_extensions = []

    logging.info(f"Required extensions: {required_extensions}")
    for ext in required_extensions:
        related_file = f"{file_name_with_path}.{ext}"
        logging.info(f"Checking for related file: {related_file}")

        try:
            if not check_file_exists_in_s3(s3, related_file, bucket_name):
                error_message = f"Required file {related_file} not found in S3."
                logging.error(error_message)
                raise FileNotFoundError(error_message)
            else:
                logging.info(f"Found related file: {related_file}")

        except FileNotFoundError as e:
            logging.error(f"Caught a FileNotFoundError: {e}")

        except Exception as e:
            logging.error(f"Caught an unexpected exception: {e}")

        # Check the size of the related file
        try:
            if ext in ['json', 'md5', 'pdf']:
                file_size = get_file_size(s3, bucket_name, related_file)
                if file_size == 0:
                    error_message = f"File {related_file} has zero size."
                    logging.error(error_message)
                    raise ValueError(error_message)
                else:
                    logging.info(f"File {related_file} size: {file_size}")
        except ValueError as e:
            logging.error(f"Caught a ValueError, file size is zero: {e}")
            raise ValueError(f"File {related_file} has zero size.")
        except Exception as e:
            logging.error(f"Caught an unexpected exception: {e}")

        # If the required file is a .pdf, get its size and update ach_pdf_disk_size
        if ext == 'pdf':
            pdf_file = f"{file_name_with_path}.pdf"
            if check_file_exists_in_s3(s3, pdf_file, bucket_name):
                pdf_file_size = get_file_size(s3, bucket_name, pdf_file)
                ach_pdf_disk_size = pdf_file_size
                logging.info(f"PDF disk size: {ach_pdf_disk_size}")
            else:
                logging.error(f"PDF file {pdf_file} not found.")
                raise FileNotFoundError(f"PDF file {pdf_file} not found.")

    return ach_pdf_disk_size
|
||||
|
||||
def extract_and_validate_file_info(file_contents, file, ach_variables):
|
||||
|
||||
# Load the configuration from the .env file
|
||||
#aws_config, db_config, ach_config, bucket_name, _ = config.load_config()
|
||||
|
||||
# Extract relevant information from nested JSON
|
||||
ach_custom_data_in = file_contents
|
||||
|
||||
# Check if json contain mediainfo metadata or ffprobe metadata or both
|
||||
logging.info(f"Extracted JSON contents: {ach_custom_data_in['json']}")
|
||||
# Check for keys at the first level
|
||||
if 'mediainfo' in ach_custom_data_in['json'] and 'ffprobe' in ach_custom_data_in['json']:
|
||||
ach_variables['custom_data_in'] = {
|
||||
"mediainfo": ach_custom_data_in['json'].get('mediainfo', {}),
|
||||
"ffprobe": ach_custom_data_in['json'].get('ffprobe', {}),
|
||||
"filename": ach_custom_data_in['json'].get('filename', ''),
|
||||
"md5": ach_custom_data_in.get('md5', '')
|
||||
}
|
||||
logging.info("mediainfo and ffprobe metadata found in JSON file.")
|
||||
# Check for keys at the second level if it is not already ordered
|
||||
elif 'creatingLibrary' in ach_custom_data_in['json'] and ach_custom_data_in['json'].get('creatingLibrary','').get('name','') == 'MediaInfoLib':
|
||||
ach_variables['custom_data_in'] = {
|
||||
"mediainfo": ach_custom_data_in['json'],
|
||||
"md5": ach_custom_data_in.get('md5', '')
|
||||
}
|
||||
logging.info("mediainfo metadata found in JSON file.")
|
||||
elif 'streams' in ach_custom_data_in['json']:
|
||||
ach_variables['custom_data_in'] = {
|
||||
"ffprobe": ach_custom_data_in['json'],
|
||||
"md5": ach_custom_data_in.get('md5', '')
|
||||
}
|
||||
logging.info("ffprobe metadata found in JSON file.")
|
||||
else:
|
||||
ach_variables['custom_data_in'] = {
|
||||
"md5": ach_custom_data_in.get('md5', '')
|
||||
}
|
||||
logging.error(f"No recognized data found in JSON file.{ach_custom_data_in} - {file_contents}")
|
||||
# trhow an error
|
||||
raise ValueError("No recognized data found in JSON file.")
|
||||
|
||||
logging.info(f"Extracted JSON contents: {ach_variables['custom_data_in']}")
|
||||
# Extract FileExtension and FileSize if "@type" is "General"
|
||||
ach_disk_size = None
|
||||
tracks = ach_variables['custom_data_in'].get('mediainfo', {}).get('media', {}).get('track', [])
|
||||
|
||||
for track in tracks:
|
||||
# Check if @type is "General"
|
||||
if track.get('@type') == 'General':
|
||||
|
||||
# Retrieve the disk size from the General track
|
||||
ach_disk_size = track.get('FileSize', None)
|
||||
logging.info(f"Disk size from JSON media.track.General: {ach_disk_size}")
|
||||
|
||||
# Retrieve the file extension from the General track
|
||||
ach_conservative_copy_extension = '.' + track.get('FileExtension', None)
|
||||
logging.info(f"FileExtension JSON media.track.General: {ach_conservative_copy_extension}")
|
||||
|
||||
# Exit loop after finding the General track
|
||||
break # Exit the loop after finding the General track
|
||||
|
||||
# Convert ach_disk_size to an integer if found
|
||||
if ach_disk_size is not None:
|
||||
ach_disk_size = int(ach_disk_size)
|
||||
|
||||
# MEDIAINFO
|
||||
    if "mediainfo" in ach_variables['custom_data_in'] and "media" in ach_variables['custom_data_in'].get("mediainfo"):
        # Extract the media_ref field from the JSON file contents
        media_ref = ach_variables['custom_data_in'].get('mediainfo', {}).get("media", {}).get("@ref", "")

        # Replace backslashes with forward slashes in the path
        media_ref = media_ref.replace("\\", "/")
        logging.info(f"Media ref mediainfo: {media_ref}")
        # Split the path on '/' and keep the last two parts (parent folder + file name)
        file_name = media_ref.split('/')[-2] + '/' + media_ref.split('/')[-1]
        logging.info(f"Media file name (copia conservativa): {file_name}")

        # Update the @ref field with the new file name
        ach_variables['custom_data_in']["mediainfo"]["media"]["@ref"] = file_name
        logging.info(f"Updated the truncated file_name at mediainfo.media.@ref {ach_variables['custom_data_in']['mediainfo']['media']['@ref']}")
    else:
        logging.warning("mediainfo.media.@ref not found in JSON file.")

    # FFPROBE
    if "ffprobe" in ach_variables['custom_data_in'] and "format" in ach_variables['custom_data_in'].get("ffprobe"):
        # Extract the filename field from the JSON file contents
        media_ref = ach_variables['custom_data_in'].get('ffprobe', {}).get("format", {}).get("filename", "")

        # Replace backslashes with forward slashes in the path
        media_ref = media_ref.replace("\\", "/")
        logging.info(f"Media ref ffprobe: {media_ref}")
        # Split the path on '/' and keep the last two parts (parent folder + file name)
        file_name = media_ref.split('/')[-2] + '/' + media_ref.split('/')[-1]
        logging.info(f"Media file name (copia conservativa): {file_name}")
        # Update the filename field with the new file name
        ach_variables['custom_data_in']["ffprobe"]["format"]["filename"] = file_name
        logging.info(f"Updated the truncated file_name at ffprobe.format.filename {ach_variables['custom_data_in']['ffprobe']['format']['filename']}")
    else:
        logging.warning("ffprobe.format.filename not found in JSON file.")

    logging.info(f"JSON contents: {ach_variables['custom_data_in']}")

    # Check if custom_data_in is a string
    if isinstance(ach_variables['custom_data_in'], str):
        # Parse the JSON string into a dictionary
        ach_custom_data_in = json.loads(ach_variables['custom_data_in'])
    else:
        # Assume custom_data_in is already a dictionary
        ach_custom_data_in = ach_variables['custom_data_in']

    # Check that the basename matches the names stored in the JSON file
    json_ref_mediainfo_path = ach_custom_data_in.get('mediainfo', {}).get("media", {}).get("@ref", "")
    json_ref_ffprobe_path = ach_custom_data_in.get('ffprobe', {}).get("format", {}).get("filename", "")
    logging.info(f"JSON file names: mediainfo: '{json_ref_mediainfo_path}', ffprobe: '{json_ref_ffprobe_path}', ach_file_fullpath: '{ach_variables['file_fullpath']}'")

    # Extract base names
    basename_fullpath = os.path.splitext(os.path.basename(ach_variables['file_fullpath']))[0]
    basename_fullpath = basename_fullpath.replace('_H264', '')
    basename_mediainfo = os.path.splitext(os.path.basename(json_ref_mediainfo_path))[0]
    basename_ffprobe = os.path.splitext(os.path.basename(json_ref_ffprobe_path))[0]

    # Check if the basename matches the mediainfo path
    if basename_fullpath != basename_mediainfo:
        logging.warning(f"ach_file_fullpath '{basename_fullpath}' does not match JSON mediainfo file name '{basename_mediainfo}'.")
    else:
        logging.info(f"ach_file_fullpath '{basename_fullpath}' matches JSON mediainfo file name '{basename_mediainfo}'.")

    # Check if the basename matches the ffprobe path
    if basename_fullpath != basename_ffprobe:
        logging.warning(f"ach_file_fullpath '{basename_fullpath}' does not match JSON ffprobe file name '{basename_ffprobe}'.")
    else:
        logging.info(f"ach_file_fullpath '{basename_fullpath}' matches JSON ffprobe file name '{basename_ffprobe}'.")

    if basename_fullpath != basename_mediainfo and basename_fullpath != basename_ffprobe:
        logging.error(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")
        raise ValueError(f"ach_file_fullpath '{basename_fullpath}' does not match either JSON file name '{basename_mediainfo}' or '{basename_ffprobe}'.")

    # Check if the file is a video or audio file
    try:
        if file.endswith('.mp4'):
            result, message = check_video_info(ach_custom_data_in.get('mediainfo', {}))
            logging.info(f"Validation result for {file}: {message}")
        elif file.endswith('.mp3'):
            result, message = check_audio_info(ach_custom_data_in.get('mediainfo', {}))
            logging.info(f"Validation result for {file}: {message}")
        else:
            # Handle cases where the file type is not supported
            raise ValueError(f"Unsupported file type: {file}")

        # Handle the error if validation fails
        if not result:
            error_message = f"Validation failed for {file}: {message}"
            logging.error(error_message)
            # handle_error(ValueError(error_message))  # Create and handle the exception
    except ValueError as e:
        # Handle specific ValueError exceptions
        logging.error(f"Caught a ValueError: {e}")
        # handle_error(e)  # Pass the ValueError to handle_error
    except Exception as e:
        # Handle any other unexpected exceptions
        logging.error(f"Caught an unexpected exception: {e}")
        # handle_error(e)  # Pass unexpected exceptions to handle_error

    # Return the updated ach_custom_data_in dictionary
    ach_custom_data_in.pop('filename', None)  # Remove 'filename' key if it exists
    # logging.info(f"ach_custom_data_in: {json.dumps(ach_custom_data_in, indent=4)}")
    return ach_custom_data_in, ach_disk_size, ach_conservative_copy_extension


def is_file_empty(file_path):
    return os.path.exists(file_path) and os.path.getsize(file_path) == 0


# unused function
def read_file(file_path):
    try:
        with open(file_path, 'r') as file:
            return file.read()
    except FileNotFoundError as e:
        logging.error(f"File not found: {e}")
        raise e
    except IOError as e:
        logging.error(f"IO error: {e}")
        raise e


def write_file(file_path, content):
    try:
        with open(file_path, 'w') as file:
            file.write(content)
    except IOError as e:
        logging.error(f"IO error: {e}")
        raise e
@ -0,0 +1,82 @@
import logging
from logging.handlers import TimedRotatingFileHandler
import os
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

# Custom log level (optional). Keep for backward compatibility.
CUSTOM_ERROR_LEVEL = 35
logging.addLevelName(CUSTOM_ERROR_LEVEL, "CUSTOM_ERROR")


def custom_error(self, message, *args, **kwargs):
    """Helper to log at the custom error level."""
    if self.isEnabledFor(CUSTOM_ERROR_LEVEL):
        self._log(CUSTOM_ERROR_LEVEL, message, args, **kwargs)

# Attach helper to the Logger class so callers can do: logging.getLogger().custom_error(...)
logging.Logger.custom_error = custom_error


def _ensure_dir_for_file(path: str):
    """Ensure the parent directory for `path` exists."""
    Path(path).resolve().parent.mkdir(parents=True, exist_ok=True)


def _create_timed_handler(path: str, level=None, when='midnight', interval=1, backupCount=7, fmt=None):
    """
    Create and configure a TimedRotatingFileHandler.

    Uses the handler's built-in rotation logic, which is more robust and easier
    to maintain than a custom doRollover implementation.
    """
    _ensure_dir_for_file(path)
    handler = TimedRotatingFileHandler(path, when=when, interval=interval, backupCount=backupCount, encoding='utf-8')
    # Use a readable suffix for rotated files (the handler appends this after the filename)
    handler.suffix = "%Y%m%d_%H%M%S"
    if fmt:
        handler.setFormatter(fmt)
    if level is not None:
        handler.setLevel(level)
    return handler


def setup_logging():
    """
    Configure logging for the application and return (logger, info_handler, error_handler, warning_handler).

    This version uses the standard TimedRotatingFileHandler to keep the logic
    simple and avoid fragile file renaming on Windows.
    """
    # Select a format depending on environment for easier debugging in dev
    if os.getenv('ACH_ENV') == 'development':
        log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s - %(pathname)s:%(lineno)d')
    elif os.getenv('ACH_ENV') == 'production':
        log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    else:
        log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

    error_log_path = os.getenv('ERROR_LOG_FILE_PATH', "./logs/ACH_media_import_errors.log")
    warning_log_path = os.getenv('WARNING_LOG_FILE_PATH', "./logs/ACH_media_import_warnings.log")
    info_log_path = os.getenv('INFO_LOG_FILE_PATH', "./logs/ACH_media_import_info.log")

    # Create three handlers: info (all), warning (warning+), error (error+)
    info_handler = _create_timed_handler(info_log_path, level=logging.INFO, fmt=log_formatter, backupCount=int(os.getenv('LOG_BACKUP_COUNT', '7')))
    warning_handler = _create_timed_handler(warning_log_path, level=logging.WARNING, fmt=log_formatter, backupCount=int(os.getenv('LOG_BACKUP_COUNT', '7')))
    error_handler = _create_timed_handler(error_log_path, level=logging.ERROR, fmt=log_formatter, backupCount=int(os.getenv('LOG_BACKUP_COUNT', '7')))

    console_handler = logging.StreamHandler()
    console_handler.setFormatter(log_formatter)

    # Configure the root logger explicitly
    root_logger = logging.getLogger()
    root_logger.setLevel(logging.INFO)
    # Clear existing handlers to avoid duplicate logs when unit tests or reloads occur
    root_logger.handlers = []
    root_logger.addHandler(info_handler)
    root_logger.addHandler(warning_handler)
    root_logger.addHandler(error_handler)
    root_logger.addHandler(console_handler)

    # Return the root logger and handlers (so callers can do manual rollovers if they truly need to)
    return root_logger, info_handler, error_handler, warning_handler
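The handler wiring in `setup_logging()` can be sketched in isolation. A minimal, hedged example of one `TimedRotatingFileHandler` writing to a temporary path (the path, logger name, and format here are illustrative, not the application's env-driven defaults):

```python
import logging
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler

# One timed handler on a throwaway path, mirroring _create_timed_handler() above.
tmpdir = tempfile.mkdtemp()
log_path = os.path.join(tmpdir, 'demo.log')

handler = TimedRotatingFileHandler(log_path, when='midnight', interval=1, backupCount=7, encoding='utf-8')
handler.suffix = "%Y%m%d_%H%M%S"  # readable suffix appended to rotated files
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))

demo_logger = logging.getLogger('ach.demo')  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.info("hello from the demo handler")
handler.close()  # flush to disk

with open(log_path, encoding='utf-8') as f:
    log_text = f.read()
print("hello from the demo handler" in log_text)  # True
```

Rotation itself only happens when the `when`/`interval` boundary passes, so this sketch just verifies that the handler writes formatted records to the target file.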
@ -0,0 +1,503 @@
# v20251103 - Main script to import media files from S3 to the database
import logging
import time
from datetime import datetime
import pytz
import os
import re
from logging_config import setup_logging, CUSTOM_ERROR_LEVEL
from email_utils import handle_error, send_email_with_attachment
from s3_utils import create_s3_client, list_s3_bucket, parse_s3_files
from error_handler import handle_general_error, handle_file_not_found_error, handle_value_error
from file_utils import is_file_empty
from db_utils import count_files, get_distinct_filenames_from_db
from dotenv import load_dotenv
import config
import psycopg2

load_dotenv()

def analyze_pattern_match(text, description):
    r"""Analyze which part of the 12-char pattern is not matching.

    The code currently truncates base/folder names to the first 12 characters and
    uses the pattern r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$', which matches a 12-character
    string. This function therefore validates a 12-character string and avoids
    indexing beyond its length.
    """
    if not text:
        return [f"{description}: Empty or None text"]

    issues = []
    expected_length = 12  # Pattern: [VA][OC]-[3 chars]-[5 digits]

    # Check length
    if len(text) != expected_length:
        issues.append(f"Length mismatch: expected {expected_length}, got {len(text)}")
        return issues

    # Step 1: Check 1st character - V or A
    if text[0] not in ['V', 'A']:
        issues.append(f"Position 1: Expected [V,A], got '{text[0]}'")

    # Step 2: Check 2nd character - O or C
    if text[1] not in ['O', 'C']:
        issues.append(f"Position 2: Expected [O,C], got '{text[1]}'")

    # Step 3: Check 3rd character - dash
    if text[2] != '-':
        issues.append(f"Position 3: Expected '-', got '{text[2]}'")

    # Step 4: Check positions 4-6 - [A-Z0-9]
    for i in range(3, 6):
        if not re.match(r'^[A-Z0-9]$', text[i]):
            issues.append(f"Position {i+1}: Expected [A-Z0-9], got '{text[i]}'")

    # Step 5: Check 7th character - dash
    if text[6] != '-':
        issues.append(f"Position 7: Expected '-', got '{text[6]}'")

    # Step 6: Check positions 8-12 - digits
    for i in range(7, 12):
        if not text[i].isdigit():
            issues.append(f"Position {i+1}: Expected digit, got '{text[i]}'")

    return issues

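The inventory-code pattern that `analyze_pattern_match` explains position by position can also be checked in one shot with the regex itself. A minimal sketch using hypothetical sample codes (not real records):

```python
import re

# The 12-character inventory-code pattern used for base-name checks above.
PATTERN = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$'

def matches_inventory_pattern(code: str) -> bool:
    """Return True when `code` matches the [VA][OC]-XXX-NNNNN shape."""
    return re.match(PATTERN, code) is not None

print(matches_inventory_pattern("VO-ABC-12345"))  # True  (video, valid)
print(matches_inventory_pattern("AC-1X9-00042"))  # True  (audio, valid)
print(matches_inventory_pattern("VX-ABC-12345"))  # False (2nd char must be O or C)
```

`analyze_pattern_match` exists because a plain `re.match` only says yes/no; the per-position walk above reports *which* character broke the pattern, which is what the warning logs need.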
# MAIN PROCESS
def main_process(aws_config, db_config, ach_config, bucket_name, ach_variables):
    # import global variables
    # from config import load_config, aws_config, db_config, ach_config, bucket_name
    # global aws_config, db_config, ach_config, bucket_name
    # config import load_config, aws_config, db_config, ach_config, bucket_name
    # load_config()

    logging.info(f"bucket_name: {bucket_name}")

    # Ensure timing variables are always defined so later error-email logic
    # won't fail if an exception is raised before end_time/elapsed_time is set.
    start_time = time.time()
    # In human-readable format
    logging.info(f"Process started at {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time))}")
    end_time = start_time
    elapsed_time = 0.0

    try:
        logging.info("Starting the main process...")

        # Helper to make spaces visible in filenames for logging (replace ' ' with the open-box char)
        def _visible_spaces(name: str) -> str:
            try:
                return name.replace(' ', '\u2423')
            except Exception:
                return name

        # Create the S3 client
        s3_client = create_s3_client(aws_config)

        # List S3 bucket contents
        list_s3_files = list_s3_bucket(s3_client, bucket_name)

        # Define valid extensions and excluded folders
        valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
        # excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/', 'FILE/'}
        excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/'}
        # included_folders = {'FILE/'}  # uncomment this to NOT use excluded folders
        # included_folders = {'TEST-FOLDER-DEV/'}  # uncomment this to NOT use excluded folders

        # Extract and filter file names

        # s3_file_names: include only files that match valid extensions and
        # (if configured) whose key starts with one of the included_folders.
        # We still skip any explicitly excluded_folders. Guard against the
        # case where `included_folders` isn't defined to avoid NameError.
        try:
            use_included = bool(included_folders)
        except NameError:
            use_included = False

        if use_included:
            s3_file_names = [
                content['Key'] for content in list_s3_files
                if any(content['Key'].endswith(ext) for ext in valid_extensions)
                and any(content['Key'].startswith(folder) for folder in included_folders)
            ]
            logging.info(f"Using included_folders filter: {included_folders}")
        else:
            s3_file_names = [
                content['Key'] for content in list_s3_files
                if any(content['Key'].endswith(ext) for ext in valid_extensions)
                and not any(content['Key'].startswith(folder) for folder in excluded_folders)
            ]
            logging.info("Using excluded_folders filter")

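The extension/folder filtering above can be exercised against a small in-memory key list instead of a live S3 listing. A hedged sketch with hypothetical keys:

```python
# Mirrors the excluded_folders branch of the key filter above.
valid_extensions = {'.mp3', '.mp4', '.md5', '.json', '.pdf'}
excluded_folders = {'DOCUMENTAZIONE_FOTOGRAFICA/', 'TEST-FOLDER-DEV/'}

keys = [
    'FILE/VO-ABC-12345.mp4',               # valid extension, allowed folder -> kept
    'TEST-FOLDER-DEV/VO-ABC-12345.mp4',    # excluded folder -> dropped
    'FILE/VO-ABC-12345.txt',               # extension not in the allow-list -> dropped
]

selected = [
    k for k in keys
    if any(k.endswith(ext) for ext in valid_extensions)
    and not any(k.startswith(folder) for folder in excluded_folders)
]
print(selected)  # ['FILE/VO-ABC-12345.mp4']
```

Note that `startswith(folder)` only excludes top-level prefixes, which matches how the excluded folder names are written (with a trailing `/`).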
        # Check inventory code syntax
        # First check s3_file_names: the file base name (and optionally the folder
        # name) should match the pattern below.
        pattern = r'^[VA][OC]-[A-Z0-9]{3}-\d{5}$'
        contents = []

        for s3file in s3_file_names:
            # s3_file_names contains the object keys (strings), not dicts.
            base_name = os.path.basename(s3file)
            # keep only the first 12 chars
            base_name = base_name[:12]
            logging.info(f"Base name: {base_name}")
            folder_name = os.path.dirname(s3file)
            # keep only the first 12 chars of the folder name as well
            # folder_name = folder_name[:12]
            # logging.info(f"Folder name: {folder_name}")

            if re.match(pattern, base_name):  # and re.match(pattern, folder_name):
                logging.info(f"File {base_name} matches pattern.")
                contents.append(s3file)
            else:
                # Check base name
                if not re.match(pattern, base_name):
                    base_issues = analyze_pattern_match(base_name, "Base name")
                    logging.warning(f"Base name '{base_name}' does not match pattern. Issues: {base_issues}")

                logging.warning(f"File {base_name} in folder {folder_name} does not match pattern.")

        # filter_s3_files_not_in_db
        # --- Get all DB filenames in one call ---
        db_file_names = get_distinct_filenames_from_db()

        # --- Keep only those not in DB ---
        # Additionally, if the DB already contains a sidecar record for the
        # same basename (for extensions .md5, .json, .pdf), skip the S3 file,
        # since the asset is already represented in the DB via those sidecars.
        sidecar_exts = ('.md5', '.json', '.pdf')
        db_sidecar_basenames = set()
        for dbf in db_file_names:
            for ext in sidecar_exts:
                if dbf.endswith(ext):
                    db_sidecar_basenames.add(dbf[:-len(ext)])
                    break

        file_names = []
        for f in contents:
            # exact key present in DB -> skip
            if f in db_file_names:
                continue
            # strip the extension to get the basename and skip if the DB has a sidecar for it
            base = os.path.splitext(f)[0]
            if base in db_sidecar_basenames:
                # logging.info("Skipping %s because DB already contains sidecar for basename %s", _visible_spaces(f), _visible_spaces(base))
                continue
            file_names.append(f)

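The sidecar-dedup step above is easy to misread, so here is a hedged, self-contained sketch using hypothetical keys: an S3 object is skipped when its exact key is already in the DB, or when the DB holds a sidecar (`.md5`/`.json`/`.pdf`) for the same basename.

```python
import os

sidecar_exts = ('.md5', '.json', '.pdf')
# Hypothetical DB contents: one sidecar record and one exact media key.
db_file_names = {'FILE/VO-ABC-00001.json', 'FILE/VO-ABC-00002.mp4'}

# Build the set of basenames that already have a sidecar in the DB.
db_sidecar_basenames = set()
for dbf in db_file_names:
    for ext in sidecar_exts:
        if dbf.endswith(ext):
            db_sidecar_basenames.add(dbf[:-len(ext)])
            break

contents = ['FILE/VO-ABC-00001.mp4', 'FILE/VO-ABC-00002.mp4', 'FILE/VO-ABC-00003.mp4']
file_names = []
for f in contents:
    if f in db_file_names:                               # exact key already in DB
        continue
    if os.path.splitext(f)[0] in db_sidecar_basenames:   # sidecar already in DB
        continue
    file_names.append(f)
print(file_names)  # ['FILE/VO-ABC-00003.mp4']
```

`VO-ABC-00001.mp4` is dropped via its `.json` sidecar, `VO-ABC-00002.mp4` via the exact-key match, and only the genuinely new object survives.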
        # Print the total number of files
        total_files_s3 = len(contents)
        logging.info(f"Total number of valid (mp3, mp4, md5, json, pdf) files in the S3 bucket before the DB filter: {total_files_s3}")
        total_files = len(file_names)
        logging.info(f"Total number of valid (mp3, mp4, md5, json, pdf) files after the DB filter: {total_files}")

        # Count files by extension
        mp4_count = sum(1 for file in s3_file_names if file.endswith('.mp4'))
        mp3_count = sum(1 for file in s3_file_names if file.endswith('.mp3'))
        md5_count = sum(1 for file in s3_file_names if file.endswith('.md5'))
        pdf_count = sum(1 for file in s3_file_names if file.endswith('.pdf'))
        json_count = sum(1 for file in s3_file_names if file.endswith('.json'))
        mov_count = sum(1 for file in s3_file_names if file.endswith('.mov'))
        # jpg_count = sum(1 for file in file_names if file.endswith('.jpg'))

        # FILE directory
        avi_count = sum(1 for file in s3_file_names if file.endswith('.avi'))
        m4v_count = sum(1 for file in s3_file_names if file.endswith('.m4v'))

        # Log the counts
        # Get the logger instance
        logger = logging.getLogger()

        # Use the logger instance to log custom info
        logging.info("Number of .mp4 files on S3 bucket (%s): %s", bucket_name, mp4_count)
        logging.info("Number of .mp3 files on S3 bucket (%s): %s", bucket_name, mp3_count)
        logging.info("Number of .md5 files on S3 bucket (%s): %s", bucket_name, md5_count)
        logging.info("Number of .pdf files on S3 bucket (%s): %s", bucket_name, pdf_count)
        logging.info("Number of .json files on S3 bucket (%s): %s", bucket_name, json_count)
        logging.info("Number of .mov files on S3 bucket (%s): %s", bucket_name, mov_count)
        # logging.info(f"Number of .jpg files: {jpg_count}")

        # If ACH_SAFE_RUN is 'true' (the default) we enforce strict mp4/pdf
        # parity and abort when mismatched; set it to 'false' to skip this
        # abort during testing or manual reconciliation.
        if os.getenv('ACH_SAFE_RUN', 'true') == 'true':
            if mp4_count != pdf_count:
                logging.error("Number of .mp4 files is not equal to number of .pdf files")
                # MOD 20251103
                # Add a check to find the missing pdf or mp4 files and report them.
                # Use file_names to find missing files; store tuples
                # (source_file, expected_counterpart) for clearer logging.
                missing_pdfs = []  # list of (mp4_file, expected_pdf)
                missing_mp4s = []  # list of (pdf_file, expected_mp4)
                for file in file_names:
                    if file.endswith('.mp4'):
                        # remove the extension
                        base_name = file[:-4]  # keeps any path prefix
                        # if the mp4 is an H264 variant (e.g. name_H264.mp4), remove the suffix
                        if base_name.endswith('_H264'):
                            base_name = base_name[:-5]
                        expected_pdf = base_name + '.pdf'
                        if expected_pdf not in file_names:
                            missing_pdfs.append((file, expected_pdf))
                    elif file.endswith('.pdf'):
                        # Normalize the base name and accept either the regular mp4 or the _H264 variant.
                        base_name = file[:-4]
                        expected_mp4 = base_name + '.mp4'
                        h264_variant = base_name + '_H264.mp4'
                        # If neither the regular mp4 nor the H264 variant exists, report it as missing.
                        if expected_mp4 not in file_names and h264_variant not in file_names:
                            missing_mp4s.append((file, expected_mp4))
                # report missing files
                if missing_pdfs:
                    logging.error("Missing .pdf files (mp4 -> expected pdf):")
                    for mp4_file, expected_pdf in missing_pdfs:
                        logging.error("%s -> %s", _visible_spaces(mp4_file), _visible_spaces(expected_pdf))

                if missing_mp4s:
                    logging.error("Missing .mp4 files (pdf -> expected mp4):")
                    for pdf_file, expected_mp4 in missing_mp4s:
                        logging.error("%s -> %s", _visible_spaces(pdf_file), _visible_spaces(expected_mp4))

                logging.error("Aborting the import process due to missing files")
                raise ValueError("Inconsistent file counts mp4 vs pdf")

            if mp3_count + mp4_count != json_count:
                logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .json files")
                logging.error("Aborting the import process due to missing files")
                # TODO: find which files don't match
                raise ValueError("Inconsistent file counts mp3+mp4 vs json")

            if mp3_count + mp4_count != md5_count:
                logging.error("Number of .mp3 files + number of .mp4 files is not equal to number of .md5 files")
                logging.error("Aborting the import process due to missing files")
                # TODO: find which files don't match
                raise ValueError("Inconsistent file counts mp3+mp4 vs md5")

        # Try to parse the S3 files
        try:
            # If DRY RUN is set to true, the files will not be uploaded to the database
            if os.getenv('ACH_DRY_RUN', 'true') == 'false':
                uploaded_files_count, warning_files_count, error_files_count = parse_s3_files(s3_client, file_names, ach_variables, excluded_folders)
            else:
                logging.warning("DRY RUN is set to TRUE - No files will be added to the database")
                # set the counts to zero
                uploaded_files_count, warning_files_count, error_files_count = (0, 0, 0)
            logging.info("Total number of files (mp3+mp4) with warnings: %s (probably already existing in the DB)", warning_files_count)
            logging.info("Total number of files with errors: %s", error_files_count)
            logging.info("Total number of files uploaded: %s", uploaded_files_count)
            logging.info("All files parsed")
        except Exception as e:
            logging.error(f"An error occurred while parsing S3 files: {e}")
            handle_general_error(e)

        # Check results
        # connect to the database
        conn = psycopg2.connect(**db_config)
        cur = conn.cursor()

        # Map file extensions (with leading dot) to MIME types
        EXTENSION_MIME_MAP = {
            '.avi': 'video/x-msvideo',
            '.mov': 'video/mov',
            '.wav': 'audio/wav',
            '.mp4': 'video/mp4',
            '.m4v': 'video/mp4',
            '.mp3': 'audio/mp3',
            '.mxf': 'application/mxf',
            '.mpg': 'video/mpeg',
        }

        # populate the mime_type list with all relevant MediaInfo/MIME values
        mime_type = [
            'video/x-msvideo',  # .avi
            'video/mov',        # .mov
            'audio/wav',        # .wav
            'video/mp4',        # .mp4, .m4v
            'audio/mp3',        # .mp3
            'application/mxf',  # .mxf
            'video/mpeg',       # .mpg
        ]

        logging.info(f"Mime types for counting files: {mime_type}")

        all_files_on_db = count_files(cur, mime_type, '*', False)
        mov_files_on_db = count_files(cur, ['video/mov'], '.mov', False)
        mxf_files_on_db = count_files(cur, ['application/mxf'], '.mxf', False)
        mpg_files_on_db = count_files(cur, ['video/mpeg'], '.mpg', False)
        avi_files_on_db = count_files(cur, ['video/x-msvideo'], '.avi', False)
        m4v_files_on_db = count_files(cur, ['video/mp4'], '.m4v', False)
        mp4_files_on_db = count_files(cur, ['video/mp4'], '.mp4', False)
        wav_files_on_db = count_files(cur, ['audio/wav'], '.wav', False)
        mp3_files_on_db = count_files(cur, ['audio/mp3'], '.mp3', False)

        # mov + m4v + avi + mxf + mpg
        logging.info(f"Number of all video files in the database: {all_files_on_db}")
        logging.info(f"Number of .mov files in the database: {mov_files_on_db} and S3: {mov_count}")
        logging.info(f"Number of .mp4 files in the database: {mp4_files_on_db} and S3: {mp4_count}")

        # Compare the mp4 names on S3 with the DB names and report files missing from the DB
        missing_mp4s = [f for f in file_names if f.endswith('.mp4') and f not in db_file_names]
        # if missing_mp4s is empty, do not log a warning
        if missing_mp4s:
            logging.warning(f"Missing .mp4 files in DB compared to S3: {missing_mp4s}")

        logging.info(f"Number of .wav files in the database: {wav_files_on_db}")
        logging.info(f"Number of .mp3 files in the database: {mp3_files_on_db} and S3: {mp3_count}")

        missing_mp3s = [f for f in file_names if f.endswith('.mp3') and f not in db_file_names]
        # if missing_mp3s is empty, do not log a warning
        if missing_mp3s:
            logging.warning(f"Missing .mp3 files in DB compared to S3: {missing_mp3s}")

        logging.info(f"Number of .avi files in the database: {avi_files_on_db}")
        logging.info(f"Number of .m4v files in the database: {m4v_files_on_db}")
        logging.info(f"Number of .mxf files in the database: {mxf_files_on_db}")
        logging.info(f"Number of .mpg files in the database: {mpg_files_on_db}")

        logging.info(f"Total files in S3 before import: {total_files}")

        # time elapsed
        end_time = time.time()  # Record end time
        elapsed_time = end_time - start_time
        logging.info(f"Processing completed. Time taken: {elapsed_time:.2f} seconds")

    except FileNotFoundError as e:
        handle_file_not_found_error(e)
    except ValueError as e:
        handle_value_error(e)
    except Exception as e:
        handle_general_error(e)

    # Send an email with logs on success or failure
    # Define the CET timezone
    cet = pytz.timezone('CET')

    # Helper to rename a log file by appending a timestamp and return the new path.
    def _rename_log_if_nonempty(path):
        try:
            if not path or not os.path.exists(path):
                return None
            # If the file is empty, don't attach/rename it
            if os.path.getsize(path) == 0:
                return None
            dir_name = os.path.dirname(path)
            base_name = os.path.splitext(os.path.basename(path))[0]
            timestamp = datetime.now(cet).strftime("%Y%m%d_%H%M%S")
            new_log_path = os.path.join(dir_name, f"{base_name}_{timestamp}.log")
            # Attempt to move/replace atomically where possible
            try:
                os.replace(path, new_log_path)
            except Exception:
                # Fallback to rename (may raise on Windows if the target exists)
                os.rename(path, new_log_path)
            return new_log_path
        except Exception as e:
            logging.error("Failed to rename log %s: %s", path, e)
            return None

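The timestamp-rename step in `_rename_log_if_nonempty` can be exercised on a throwaway temp file. A hedged sketch (the path is hypothetical; the real helper also uses the CET timezone and logs failures instead of raising):

```python
import os
import tempfile
from datetime import datetime

def rename_with_timestamp(path):
    """Simplified version of the helper above: skip missing/empty files,
    otherwise move the log to <basename>_<timestamp>.log and return the new path."""
    if not path or not os.path.exists(path) or os.path.getsize(path) == 0:
        return None  # nothing worth attaching
    dir_name = os.path.dirname(path)
    base_name = os.path.splitext(os.path.basename(path))[0]
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    new_log_path = os.path.join(dir_name, f"{base_name}_{timestamp}.log")
    os.replace(path, new_log_path)  # atomic where the OS supports it
    return new_log_path

with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, 'errors.log')
    with open(log, 'w') as f:
        f.write('one error line\n')
    renamed = rename_with_timestamp(log)
    moved_ok = renamed is not None and os.path.exists(renamed) and not os.path.exists(log)
print(moved_ok)  # True
```

Renaming before attaching means each run's log is frozen under a unique name, so a later run cannot append to the file that was just emailed.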
    logging.info("Preparing summary email")

    # close logging to flush handlers before moving files
    logging.shutdown()

    error_log = './logs/ACH_media_import_errors.log'
    warning_log = './logs/ACH_media_import_warnings.log'

    # Determine presence of errors/warnings
    has_errors = False
    has_warnings = False
    try:
        if os.path.exists(error_log) and os.path.getsize(error_log) > 0:
            with open(error_log, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()
            if 'ERROR' in content or len(content.strip()) > 0:
                has_errors = True
        if os.path.exists(warning_log) and os.path.getsize(warning_log) > 0:
            with open(warning_log, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()
            if 'WARNING' in content or len(content.strip()) > 0:
                has_warnings = True
    except Exception as e:
        logging.error("Error while reading log files: %s", e)

    # from env - split safely and strip whitespace
    def _split_env_list(name):
        raw = os.getenv(name, '')
        return [s.strip() for s in raw.split(',') if s.strip()]

    EMAIL_RECIPIENTS = _split_env_list('EMAIL_RECIPIENTS')
    ERROR_EMAIL_RECIPIENTS = _split_env_list('ERROR_EMAIL_RECIPIENTS') or EMAIL_RECIPIENTS
    SUCCESS_EMAIL_RECIPIENTS = _split_env_list('SUCCESS_EMAIL_RECIPIENTS') or EMAIL_RECIPIENTS

    # Choose the subject and attachment based on severity
    if has_errors:
        subject = "ARKIVO Import of Video/Audio Ran with Errors"
        attachment_to_send = _rename_log_if_nonempty(error_log) or error_log
        body = "Please find the attached error log file. Job started at %s and ended at %s, taking %.2f seconds." % (
            datetime.fromtimestamp(start_time).strftime('%Y-%m-%d %H:%M:%S'),
            datetime.fromtimestamp(end_time).strftime('%Y-%m-%d %H:%M:%S'),
            elapsed_time
        )
        email_recipients = ERROR_EMAIL_RECIPIENTS
    elif has_warnings:
        subject = "ARKIVO Import of Video/Audio Completed with Warnings"
        # Attach the warnings log for investigation
        attachment_to_send = _rename_log_if_nonempty(warning_log) or warning_log
        body = "The import completed with warnings. Please find the attached warning log. Job started at %s and ended at %s, taking %.2f seconds." % (
            datetime.fromtimestamp(start_time).strftime('%Y-%m-%d %H:%M:%S'),
            datetime.fromtimestamp(end_time).strftime('%Y-%m-%d %H:%M:%S'),
            elapsed_time
        )
        email_recipients = ERROR_EMAIL_RECIPIENTS
    else:
        subject = "ARKIVO Video/Audio Import Completed Successfully"
        # No attachment for a clean success
        attachment_to_send = None
        body = "The import of media (video/audio) completed successfully without any errors or warnings. Job started at %s and ended at %s, taking %.2f seconds." % (
            datetime.fromtimestamp(start_time).strftime('%Y-%m-%d %H:%M:%S'),
            datetime.fromtimestamp(end_time).strftime('%Y-%m-%d %H:%M:%S'),
            elapsed_time
        )
        email_recipients = SUCCESS_EMAIL_RECIPIENTS

    logging.info("Sending summary email: %s (attach: %s)", subject, bool(attachment_to_send))

    # Send the email
    try:
        send_email_with_attachment(
            subject=subject,
            body=body,
            attachment_path=attachment_to_send,
            email_recipients=email_recipients
        )
    except Exception as e:
        logging.error("Failed to send summary email: %s", e)

    return


if __name__ == "__main__":
    try:
        # Setup logging using standard TimedRotatingFileHandler handlers.
        # No manual doRollover calls; rely on the handler's built-in rotation.
        logger, rotating_handler, error_handler, warning_handler = setup_logging()

        # Load configuration settings
        aws_config, db_config, ach_config, bucket_name, ach_variables = config.load_config()

        logging.info("Config loaded, logging setup done")

        # Run the main process
        main_process(aws_config, db_config, ach_config, bucket_name, ach_variables)

        logging.info("Main process completed at: %s", datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

    except Exception as e:
        logging.error(f"An error occurred: {e}")
@ -0,0 +1,82 @@
### Find records where the original filename indicates an H264 variant

-- Purpose: list distinct base records that have an "original" filename matching the
-- pattern FILE%_H264% (i.e. files stored under the FILE... folder, or beginning with
-- "FILE" and containing the "_H264" marker). This helps locate master records that
-- also have an H264-derived file present.
-- Columns returned:
--   base_id           : the parent/base record id (h_base_record_id)
--   file_type         : the logical file type extracted from the JSON `file_type` column
--   original_file_name: the stored original filename (may include a folder/prefix)
--   digital_file_name : the current digital filename in the database
SELECT DISTINCT
    h_base_record_id AS base_id,
    file_type ->> 'type' AS file_type,
    original_file_name,
    digital_file_name
FROM file
WHERE file_type ->> 'type' IS NOT NULL
  AND original_file_name LIKE 'FILE%_H264%';

### Audio files (mp3) that are not in the FILE/ folder

-- Purpose: find distinct base records for streaming audio (.mp3) where the original
-- filename is not located in the FILE/... area. Useful for separating ingest/original
-- conservative copies (often under FILE/) from streaming or derivative objects.
SELECT DISTINCT
    h_base_record_id AS base_id,
    file_type ->> 'type' AS file_type
FROM file
WHERE file_type ->> 'type' IS NOT NULL
  AND original_file_name NOT LIKE 'FILE%'
  AND digital_file_name LIKE '%mp3';

### Video files (mp4) that are not in the FILE/ folder

-- Purpose: the same as the mp3 query, but for mp4 streaming/derivative files. This helps
-- identify which base records currently have mp4 derivatives recorded outside the
-- FILE/... (master) namespace.
SELECT DISTINCT
    h_base_record_id AS base_id,
    file_type ->> 'type' AS file_type
FROM file
WHERE file_type ->> 'type' IS NOT NULL
  AND original_file_name NOT LIKE 'FILE%'
  AND digital_file_name LIKE '%mp4';

### Records with non-image digital files

-- Purpose: list base records that have digital files which are not JPEG images. The
-- `NOT LIKE '%jpg'` filter excludes typical image derivatives; this is useful for
-- auditing non-image assets attached to records.
SELECT DISTINCT
    h_base_record_id AS base_id,
    file_type ->> 'type' AS file_type
FROM file
WHERE file_type ->> 'type' IS NOT NULL
  AND original_file_name NOT LIKE 'FILE%'
  AND digital_file_name NOT LIKE '%jpg';

### Count of unique base records per file_type

-- Purpose: aggregate the number of distinct base records (h_base_record_id) associated
-- with each `file_type` value. This gives an overview of how many unique objects have
-- files recorded for each logical file type.
SELECT
    file_type ->> 'type' AS file_type,
    COUNT(DISTINCT h_base_record_id) AS file_type_unique_record_count
FROM file
WHERE file_type ->> 'type' IS NOT NULL
GROUP BY file_type ->> 'type';

### Duplicate of the previous aggregate (kept for convenience)

-- Note: the query below is identical to the one above and will produce the same
-- counts; it may be intentional, for running in a separate context or as a
-- copy-and-paste placeholder for further edits.
SELECT
    file_type ->> 'type' AS file_type,
    COUNT(DISTINCT h_base_record_id) AS file_type_unique_record_count
FROM file
WHERE file_type ->> 'type' IS NOT NULL
GROUP BY file_type ->> 'type';
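The aggregate above can be exercised offline. A minimal sketch using Python's stdlib `sqlite3`, where `json_extract(file_type, '$.type')` stands in for the Postgres `file_type ->> 'type'` operator (availability of the JSON1 functions is assumed; the table rows are invented for the demo):

```python
import sqlite3
import json

# In-memory stand-in for the Postgres `file` table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE file (h_base_record_id TEXT, file_type TEXT)")
rows = [
    ("REC-001", json.dumps({"type": "video"})),
    ("REC-001", json.dumps({"type": "video"})),  # same base record, second file
    ("REC-002", json.dumps({"type": "audio"})),
]
cur.executemany("INSERT INTO file VALUES (?, ?)", rows)

# Same shape as the Postgres aggregate above, with json_extract()
# playing the role of `file_type ->> 'type'`.
cur.execute("""
    SELECT json_extract(file_type, '$.type') AS file_type,
           COUNT(DISTINCT h_base_record_id) AS file_type_unique_record_count
    FROM file
    WHERE json_extract(file_type, '$.type') IS NOT NULL
    GROUP BY json_extract(file_type, '$.type')
""")
result = dict(cur.fetchall())
print(result)
```

Note how the two "video" rows for REC-001 collapse to a single counted record thanks to `COUNT(DISTINCT h_base_record_id)`.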
@ -0,0 +1,8 @@
boto3>=1.35.39
botocore>=1.35.39
psycopg2-binary>=2.9.9
aiohttp>=3.10.10
# NOTE: asyncio is part of the standard library since Python 3.4; the PyPI
# "asyncio" package is an old backport, so this pin is likely unnecessary.
asyncio>=3.4.3
python-dotenv>=1.0.1
email-validator>=2.2.0
pytz>=2024.2
@ -0,0 +1,326 @@
import boto3  # for S3
from botocore.exceptions import NoCredentialsError, PartialCredentialsError, ClientError  # for exceptions
import logging  # for logging
import json  # for json.loads
import os  # for os.path
import psycopg2  # for PostgreSQL

# Import custom modules
from file_utils import retrieve_file_contents, check_related_files, extract_and_validate_file_info  # for file operations
from email_utils import handle_error  # for error handling (deprecated?)
from db_utils import get_db_connection, check_inventory_in_db, check_objkey_in_file_db, add_file_record_and_relationship, retrieve_digital_file_names  # for database operations

import config


# Function to check the existence of related files and validate them in PostgreSQL
def parse_s3_files(s3, s3_files, ach_variables, excluded_folders=None):
    """
    Parses the S3 files and performs various operations on them.

    Args:
        s3 (S3): The S3 client used to access S3 services.
        s3_files (list): The list of S3 object keys to be processed.
        ach_variables (dict): Shared per-file state for the import workflow.
        excluded_folders (list): Key prefixes to skip (defaults to none).

    Returns:
        tuple: (uploaded_files_count, warning_files_count, error_files_count)

    Raises:
        FileNotFoundError: If a required file is not found in S3.
        ValueError: If a file has zero size or if the file type is unsupported.
        Exception: If any other unexpected exception occurs.
    """
    # Avoid a mutable default argument
    if excluded_folders is None:
        excluded_folders = []

    # Load the configuration from the .env file
    _, db_config, _, bucket_name, _ = config.load_config()
    logging.info(f"ach_variables: {ach_variables}")

    try:
        logging.info(f"Starting to parse S3 files from bucket {bucket_name}...")
        # Ensure db_config is not None
        if db_config is None:
            raise ValueError("Database configuration is not loaded")

        conn = psycopg2.connect(**db_config)
        cur = conn.cursor()

        # Exclude files that start with any prefix in excluded_folders,
        # e.g. ['TEST-FOLDER-DEV/', 'DOCUMENTAZIONE_FOTOGRAFICA/', 'BTC/', 'VHS/', 'UMT/', 'OV2/', 'OA4/']
        excluded_prefix = excluded_folders
        filtered_files = [file for file in s3_files if not any(file.startswith(prefix) for prefix in excluded_prefix)]
        # DEBUG: filtered_files = [file for file in s3_files if file.startswith('TestFolderDev/')]

        logging.info(f"Number of filtered files: {len(filtered_files)}")

        # Counters
        error_files_count = 0
        warning_files_count = 0
        uploaded_files_count = 0

        for file in filtered_files:
            if file.endswith(('.mp4', '.mp3')):  # Check for both .mp4 and .mp3
                logging.info("Processing file: %s in the bucket: %s", file, bucket_name)
                # Check whether the file already exists in the database
                result = check_objkey_in_file_db(cur, file)
                if result:
                    # File already exists in the database; count it as a warning and skip it
                    warning_files_count += 1
                    continue

                ach_variables['file_fullpath'] = file  # the object key
                ach_variables['inventory_code'] = os.path.splitext(os.path.basename(file))[0][:12]
                logging.info(f"ach_variables['inventory_code'] {ach_variables['inventory_code']}: {file}")
                # Derive the related object keys
                ach_variables['objectKeys']['media'] = file
                ach_variables['objectKeys']['pdf'] = f"{os.path.splitext(file)[0]}.pdf"
                ach_variables['objectKeys']['pdf'] = ach_variables['objectKeys']['pdf'].replace('_H264', '')
                if file.endswith('.mp4'):
                    ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.mov"  # _H264 removal is done later
                elif file.endswith('.mp3'):
                    ach_variables['objectKeys']['conservative_copy'] = f"{os.path.splitext(file)[0]}.wav"
                else:
                    logging.error(f"Unsupported file type: {file}")
                    error_files_count += 1
                    continue

                # Extract the file extension
                file_extension = os.path.splitext(file)[1]
                ach_variables['extension'] = file_extension  # Store the file extension in ach_variables
                logging.info(f"File extension: {file_extension}")

                # Extract the file name, keeping the directory part
                file_name_with_path = os.path.splitext(file)[0]  # Remove the extension but keep the path
                logging.info(f"File name with path: {file_name_with_path}")

                # Extract the base name from the file name
                base_name = os.path.basename(file_name_with_path)  # Base name with the path removed
                logging.info(f"Base name: {base_name}")

                # Apply _H264 removal only for .mp4 files
                if file.endswith('.mp4'):
                    logging.info(f"File is an mp4 file: {file}. Removing _H264")
                    base_name = base_name.replace('_H264', '')
                    file_name_with_path = file_name_with_path.replace('_H264', '')
                    logging.info(f"Modified base name for mp4: {base_name}")
                    logging.info(f"Modified file name with path for mp4: {file_name_with_path}")

                try:
                    # Retrieve and log the file size
                    # (get_file_size could instead raise an error that we catch here)
                    file_size = get_file_size(s3, bucket_name, file)
                    if file_size is not None:
                        ach_variables['media_disk_size'] = file_size
                        logging.info(f"The media file disk size is: {ach_variables['media_disk_size']}")
                    else:
                        logging.warning("Could not retrieve file size for %s.", file)
                        warning_files_count += 1
                        continue  # Skip to the next file in the loop

                    logging.info("Start validating files for %s...", base_name)
                    # Check whether the related files exist and retrieve the .pdf file size
                    try:
                        # Check if the required files exist in S3
                        ach_variables['pdf_disk_size'] = check_related_files(s3, file_name_with_path, file, bucket_name)
                        logging.info(f"PDF disk size: {ach_variables['pdf_disk_size']}")
                    except FileNotFoundError as e:
                        # Handle the case where the file is not found
                        logging.error(f"File not found error: {e}")
                        error_files_count += 1
                        continue  # Move on to the next file in the loop
                    except ValueError as e:
                        # Handle value errors
                        logging.error(f"Value error: {e} (probably a zero file size)")
                        error_files_count += 1
                        continue  # Move on to the next file in the loop
                    except PermissionError as e:
                        # Handle permission errors
                        logging.error(f"Permission error: {e}")
                        error_files_count += 1
                        continue  # Move on to the next file in the loop
                    except Exception as e:
                        # Handle any other exceptions
                        logging.error(f"An error occurred: {e}")

                    # Retrieve the contents of the related files: .md5, .json
                    try:
                        # Check if the file exists in S3 and retrieve the file contents
                        logging.info(f"Retrieving file contents for {file_name_with_path}...")
                        file_contents = retrieve_file_contents(s3, f"{file_name_with_path}")
                    except Exception as e:
                        logging.error(f"Error retrieving file contents for {file_name_with_path}: {e}")
                        file_contents = None
                        error_files_count += 1
                        continue  # Move on to the next file in the loop

                    # If the contents do not exist
                    if file_contents is None:
                        logging.error(f"Error retrieving file contents for {file}.")
                        error_files_count += 1
                        continue  # Move on to the next file in the loop

                    # Ensure file_contents is a dictionary
                    if isinstance(file_contents, str):
                        file_contents = json.loads(file_contents)

                    # Extract and validate file information
                    ach_variables['custom_data_in'], ach_variables['disk_size'], ach_variables['conservative_copy_extension'] = extract_and_validate_file_info(file_contents, file, ach_variables)
                    logging.info(f"Custom data extracted: {ach_variables['custom_data_in']}")
                    logging.info(f"Disk size extracted: {ach_variables['disk_size']}")
                    logging.info(f"Conservative copy extension extracted: {ach_variables['conservative_copy_extension']}")
                    logging.info(f"File {file} validation completed")
                except Exception as e:
                    logging.error(f"Error processing file {file}: {e}")
                    error_files_count += 1
                    continue  # Move on to the next file in the loop

                # No need to truncate the base name at this point
                # base_name = base_name[:12]

                # Check whether the base name exists in the database
                logging.info(f"Checking database for {base_name}...")

                try:
                    # Check the inventory code in the database and get the result
                    result, truncated_base_name = check_inventory_in_db(s3, cur, base_name)
                    logging.info(f"base name {base_name}, truncated_base_name: {truncated_base_name}")
                    # Check the result and proceed accordingly
                    if result:
                        logging.info(f"Inventory code {base_name} found in the database.")
                        # Retrieve the digital file names
                        if retrieve_digital_file_names(s3, cur, base_name, ach_variables['objectKeys']['media']) == True:
                            # Add a file record and its relationship to the support record
                            add_file_record_and_relationship(s3, cur, base_name, ach_variables)
                        else:
                            logging.warning(f"File record already exists for {base_name}.")
                            warning_files_count += 1
                            continue
                    else:
                        logging.error(f"Inventory code {base_name} not found in the database.")
                        error_files_count += 1
                        continue
                except ValueError as e:
                    logging.error(f"An error occurred: {e}")
                    error_files_count += 1
                    continue

                # Commit the changes to the database only if everything is okay;
                # the error paths above skip the commit by continuing the loop.
                logging.info(f"Committing {base_name} to the database...")
                conn.commit()
                uploaded_files_count += 1

        cur.close()
        conn.close()
    except ValueError as e:
        # Handle specific validation errors
        logging.error(f"Validation error: {e}")
        # handle_error(e)
        raise e  # Re-raise the exception to the calling function
    except Exception as e:
        # Handle any other unexpected errors
        logging.error(f"Unexpected error: {e}")
        # handle_error(e)
        raise e  # Re-raise the exception to the calling function

    # Return the counters
    return uploaded_files_count, warning_files_count, error_files_count


# Function to create an S3 client
def create_s3_client(aws_config):
    logging.info(f'Creating S3 client with endpoint: {aws_config["endpoint_url"]}')
    try:
        s3 = boto3.client(
            's3',
            endpoint_url=aws_config['endpoint_url'],
            aws_access_key_id=aws_config['aws_access_key_id'],
            aws_secret_access_key=aws_config['aws_secret_access_key'],
            region_name=aws_config['region_name'],
            config=boto3.session.Config(
                signature_version='s3v4',
                s3={'addressing_style': 'path'}
            )
        )
        logging.info('S3 client created successfully')
        return s3
    except (NoCredentialsError, PartialCredentialsError) as e:
        logging.error(f'Error creating S3 client: {e}')
        raise e


# Function to list the contents of an S3 bucket
def list_s3_bucket(s3_client, bucket_name):
    try:
        paginator = s3_client.get_paginator('list_objects_v2')
        bucket_contents = []

        for page in paginator.paginate(Bucket=bucket_name):
            if 'Contents' in page:
                bucket_contents.extend(page['Contents'])

        logging.info(f"Retrieved {len(bucket_contents)} items from the bucket.")
        return bucket_contents
    except ClientError as e:
        logging.error(f'Error listing bucket contents: {e}')
        raise e


# Function to get a file's size from S3
def get_file_size(s3_client, bucket_name, file_key):
    try:
        response = s3_client.head_object(Bucket=bucket_name, Key=file_key)
        return response['ContentLength']
    except ClientError as e:
        logging.error(f"Failed to retrieve file size for {file_key}: {e}")
        return None  # or an appropriate fallback value
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return None


# Function to check if a file exists in S3
def check_file_exists_in_s3(s3, file_name, bucket_name):
    """
    Checks if a file exists in an S3 bucket.

    Parameters:
    - s3 (boto3.client): The S3 client object.
    - file_name (str): The key of the file to check.
    - bucket_name (str): The bucket to check in.

    Returns:
    - bool: True if the file exists, False otherwise.

    Raises:
    - ClientError: If there is an error checking the file.
    """
    try:
        s3.head_object(Bucket=bucket_name, Key=file_name)
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            return False
        else:
            logging.error(f'Error checking file {file_name}: {e}')
            raise e


# Function to upload a file to S3
def upload_file_to_s3(s3_client, file_path, bucket_name, object_name=None):
    if object_name is None:
        object_name = file_path
    try:
        s3_client.upload_file(file_path, bucket_name, object_name)
        logging.info(f"File {file_path} uploaded to {bucket_name}/{object_name}")
    except ClientError as e:
        logging.error(f'Error uploading file {file_path} to bucket {bucket_name}: {e}')
        raise e


# Function to download a file from S3
def download_file_from_s3(s3_client, bucket_name, object_name, file_path):
    try:
        s3_client.download_file(bucket_name, object_name, file_path)
        logging.info(f"File {object_name} downloaded from {bucket_name} to {file_path}")
    except ClientError as e:
        logging.error(f'Error downloading file {object_name} from bucket {bucket_name}: {e}')
        raise e
@ -0,0 +1,99 @@
# Description: Utility functions for the media validation service
# Art.c.hive 2024/09/30

# Standard libs
import logging
import os
# import logging_config


# result, message = check_video_info(video_json_content)
def check_video_info(media_info):
    logging.info("Checking video info...")
    logging.info(f"Media info: {media_info}")
    try:
        # MediaInfo JSON layout: media -> track[] (i.e. mediainfo -> media -> track)
        # Check the file name referenced by the JSON
        file_name = media_info.get('media', {}).get('@ref', '')
        logging.info(f"File name in JSON: {file_name}")
        # Determine the parent directory (one level above the basename).
        # Example: for 'SOME/FOLDER/filename.mov' -> parent_dir == 'FOLDER'
        parent_dir = os.path.basename(os.path.dirname(file_name))
        logging.info(f"Parent directory: {parent_dir}")

        # If the parent directory is 'FILE', accept multiple container types
        if parent_dir.lower() == 'file':
            # Accept .mov, .avi, .m4v, .mp4, .mxf, .mpg, .mpeg (case-insensitive)
            if not any(file_name.lower().endswith(ext) for ext in ('.mov', '.avi', '.m4v', '.mp4', '.mxf', '.mpg', '.mpeg')):
                return False, "The file is not a .mov, .avi, .m4v, .mp4, .mxf, .mpg or .mpeg file."

            # Map file extensions to lists of acceptable general (container) formats
            general_formats = {
                '.avi': ['AVI'],                         # General/Format for AVI files
                '.mov': ['QuickTime', 'MOV', 'MPEG-4'],  # MediaInfo may report QuickTime or MOV for .mov
                '.mp4': ['MPEG-4', 'MP4', 'QuickTime'],  # MPEG-4 container (QuickTime is sometimes reported, e.g. VO-MP4-16028_H264.mp4)
                '.m4v': ['MPEG-4', 'MP4'],               # MPEG-4 container (Apple variant)
                '.mxf': ['MXF'],                         # Material eXchange Format
                '.mpg': ['MPEG', 'MPEG-PS'],             # MPEG program/transport streams
                '.mpeg': ['MPEG', 'MPEG-PS'],
            }

            # Check that the extension corresponds to one of the allowed formats
            # in track 0 of the corresponding JSON file
            file_ext = os.path.splitext(file_name)[1].lower()
            logging.info(f"File extension: {file_ext}")
            expected_formats = general_formats.get(file_ext)
            logging.info(f"Expected formats for extension {file_ext}: {expected_formats}")
            if not expected_formats:
                return False, f"Unsupported file extension: {file_ext}"
            tracks = media_info.get('media', {}).get('track', [])
            if len(tracks) > 0:
                track_0 = tracks[0]  # Track 0 is the 'General' (container) track
                logging.info(f"Track 0: {track_0}")
                actual_format = track_0.get('Format', '')
                if track_0.get('@type', '') == 'General' and actual_format in expected_formats:
                    logging.info(f"File extension {file_ext} matches one of the expected formats {expected_formats} (actual: {actual_format}).")
                else:
                    return False, f"Track 0 format '{actual_format}' does not match any expected formats {expected_formats} for extension {file_ext}."
        else:
            # Outside the FILE/ directory, require .mov specifically
            if not file_name.lower().endswith('.mov'):
                return False, "The file is not a .mov file."
            # Check whether track 1's format is ProRes 4444
            tracks = media_info.get('media', {}).get('track', [])
            if len(tracks) > 1:
                track_1 = tracks[1]  # Track 1 is the second element (index 1)
                logging.info(f"Track 1: {track_1}")
                if track_1.get('@type', '') == 'Video' and track_1.get('Format', '') == 'ProRes' and track_1.get('Format_Profile', '') == '4444':
                    return True, "The file is a .mov file with ProRes 4444 format in track 1."
                else:
                    return False, "Track 1 format is not ProRes 4444."
            else:
                return False, "No track 1 found."

        return True, "The file passed the video format checks."
    except Exception as e:
        return False, f"Error processing the content: {e}"


# result, message = check_audio_info(json_content)
def check_audio_info(media_info):
    try:
        # Check whether the file name ends with .wav
        file_name = media_info.get('media', {}).get('@ref', '')
        if not file_name.endswith('.wav'):
            logging.info(f"File name in JSON: {file_name}")
            return False, "The file is not a .wav file."

        # Check whether track 1 is 96 kHz / 24-bit PCM audio
        tracks = media_info.get('media', {}).get('track', [])
        # Ensure there are at least two track entries before accessing index 1
        if len(tracks) > 1:
            track_1 = tracks[1]  # Track 1 is the second element (index 1)
            if track_1.get('@type', '') == 'Audio' and track_1.get('Format', '') == 'PCM' and track_1.get('SamplingRate', '') == '96000' and track_1.get('BitDepth', '') == '24':
                return True, "The file is a .wav file with PCM audio in track 1."
            else:
                return False, f"Track 1 is not 96 kHz / 24-bit PCM. Format: {track_1.get('Format', '')}, SamplingRate: {track_1.get('SamplingRate', '')}, BitDepth: {track_1.get('BitDepth', '')}"

        return False, "No track 1 found."

    except Exception as e:
        return False, f"Error processing the content: {e}"
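The routing logic at the top of `check_video_info` — where the parent directory of the `@ref` path decides which rule set applies — can be sketched in isolation (the paths and the trimmed format map below are invented for illustration):

```python
import os

# Stripped-down version of the extension/format routing in check_video_info.
general_formats = {
    '.avi': ['AVI'],
    '.mov': ['QuickTime', 'MOV', 'MPEG-4'],
    '.mp4': ['MPEG-4', 'MP4', 'QuickTime'],
}

def expected_formats_for(ref):
    # Parent directory one level above the basename, as in check_video_info.
    parent_dir = os.path.basename(os.path.dirname(ref))
    ext = os.path.splitext(ref)[1].lower()
    if parent_dir.lower() == 'file':
        # Inside FILE/: several containers are allowed, looked up by extension.
        return general_formats.get(ext)
    # Outside FILE/: only ProRes .mov masters are accepted.
    return ['ProRes .mov'] if ext == '.mov' else None

inside = expected_formats_for('ARCH/FILE/clip.mp4')
outside = expected_formats_for('ARCH/STREAM/clip.mp4')
print(inside, outside)
```

An .mp4 under `FILE/` maps to the MPEG-4 container variants, while the same extension outside `FILE/` is rejected outright.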