first commit

This commit is contained in:
commit 42446d2873

@@ -0,0 +1,216 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[codz]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py.cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
# Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
# poetry.lock
# poetry.toml

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
# pdm.lock
# pdm.toml
.pdm-python
.pdm-build/

# pixi
# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
# pixi.lock
# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
# in the .venv directory. It is recommended not to include this directory in version control.
.pixi

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# Redis
*.rdb
*.aof
*.pid

# RabbitMQ
mnesia/
rabbitmq/
rabbitmq-data/

# ActiveMQ
activemq-data/

# SageMath parsed files
*.sage.py

# Environments
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
# .idea/

# Abstra
# Abstra is an AI-powered process automation framework.
# Ignore directories containing user credentials, local state, and settings.
# Learn more at https://abstra.io/docs
.abstra/

# Visual Studio Code
# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
# and can be added to the global gitignore or merged into this file. However, if you prefer,
# you could uncomment the following to ignore the entire vscode folder
# .vscode/

# Ruff stuff:
.ruff_cache/

# PyPI configuration file
.pypirc

# Marimo
marimo/_static/
marimo/_lsp/
__marimo__/

# Streamlit
.streamlit/secrets.toml

@@ -0,0 +1,237 @@
#!/usr/bin/env python3
"""
Clean duplicates script.

This script parses a duplicates list (output by a specialized detection tool) in
`resultsduplicate.txt` and moves duplicate files that live under a specific
directory (VAR_DIRECTORY) to the trash (recycle bin) instead of deleting them
permanently.

Behavior:
- The script parses blocks like:
    - 2 equal files of size 5256842
      "I:\\01_AI\\01_IMAGES\\00_Input\\old\\file.png"
      "I:\\01_AI\\01_IMAGES\\55_Img2Img\\other\\file.png"
- If any files in the block live under `VAR_DIRECTORY`, they are candidates for
  removal and are moved to the trash (never removed permanently).
- If all files in a block are inside `VAR_DIRECTORY`, the script skips that
  block to avoid losing every copy.

The script supports a DRY_RUN mode (no changes are made, only logged). It uses
send2trash if available and otherwise falls back to moving files to a local
`.recycle_bin` directory in the project.

This script takes no command-line arguments; edit the variables near the top
of the file to configure its behavior.
"""

from __future__ import annotations

import logging
import os
import shutil
import sys
import uuid
from typing import List

try:
    from send2trash import send2trash
    SEND2TRASH_AVAILABLE = True
except Exception:
    SEND2TRASH_AVAILABLE = False


# Configuration variables (the script takes no CLI args - edit here):
DUPLICATES_FILE = "resultsduplicate.txt"
VAR_DIRECTORY = r"I:\01_AI\01_IMAGES\00_Input"
# When DRY_RUN is True, the script logs actions but does not move files.
DRY_RUN = True
# Default logging level (DEBUG/INFO/WARNING/ERROR)
LOG_LEVEL = logging.INFO

LOCAL_TRASH_DIR = ".recycle_bin"


def setup_logger(level: int = logging.INFO) -> logging.Logger:
    logger = logging.getLogger("cleanDupli")
    logger.setLevel(level)
    handler = logging.StreamHandler(sys.stdout)
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    handler.setFormatter(formatter)
    if not logger.handlers:
        logger.addHandler(handler)
    return logger


logger = setup_logger(LOG_LEVEL)


def normalize_path(p: str) -> str:
    # Remove surrounding quotes/spaces, then normalize case and path separators.
    p = p.strip().strip('"').strip("'")
    p = os.path.normpath(p)
    return os.path.normcase(p)


def is_under(path: str, directory: str) -> bool:
    # Compare against the directory plus a separator so that sibling
    # directories (e.g. "00_Input2" next to "00_Input") do not match.
    return path == directory or path.startswith(directory + os.sep)


def parse_duplicates_file(path: str) -> List[List[str]]:
    if not os.path.exists(path):
        logger.error("Duplicates file '%s' does not exist.", path)
        raise FileNotFoundError(path)

    blocks: List[List[str]] = []
    current_block: List[str] = []
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for raw in fh:
            stripped = raw.strip()
            if not stripped:
                continue
            # A block header starts with '- ' (like '- 2 equal files ...').
            if stripped.startswith("-") and "equal file" in stripped:
                # Push any finished block, then start a new one.
                if current_block:
                    blocks.append(current_block)
                current_block = []
                continue
            # Path lines are indented and quoted; collect them into the block.
            if stripped.startswith('"') or stripped.startswith("'"):
                p = stripped.strip('"').strip("'")
                # Paths in the report may use forward or back slashes.
                p = p.replace("/", os.sep)
                p = p.replace("\\\\", "\\")
                current_block.append(p)
            # Anything that is neither a header nor a path is ignored.
    if current_block:
        blocks.append(current_block)
    return blocks


def ensure_local_trash() -> str:
    os.makedirs(LOCAL_TRASH_DIR, exist_ok=True)
    return os.path.abspath(LOCAL_TRASH_DIR)


def move_to_trash(fp: str) -> None:
    if DRY_RUN:
        logger.info("[DRY RUN] Would move to trash: %s", fp)
        return

    if SEND2TRASH_AVAILABLE:
        try:
            send2trash(fp)
            logger.info("Moved to system trash: %s", fp)
            return
        except Exception as ex:
            logger.warning("send2trash failed for %s: %s", fp, ex)
            # Fall through to the local trash fallback.

    # Fallback: move the file to the local recycle bin directory.
    try:
        trash_dir = ensure_local_trash()
        base = os.path.basename(fp)
        dst = os.path.join(trash_dir, base)
        # Avoid accidentally overwriting an earlier trashed file.
        if os.path.exists(dst):
            # Append a unique suffix.
            dst = os.path.join(trash_dir, f"{uuid.uuid4().hex[:8]}-{base}")
        shutil.move(fp, dst)
        logger.info("Moved to local trash (%s): %s", dst, fp)
    except Exception as ex:
        logger.error("Failed to move to local trash %s: %s", fp, ex)
        raise


def process_blocks(blocks: List[List[str]], var_dir: str) -> dict:
    results = {
        "total_blocks": len(blocks),
        "candidate_files": 0,
        "skipped_blocks": 0,
        "moved_files": 0,
        "would_move_files": 0,
        "errors": 0,
    }
    var_dir_norm = normalize_path(var_dir)

    processed_count = 0
    log_every_n = 200  # periodic progress logging for large lists
    for block in blocks:
        # Skip single-entry blocks (nothing is duplicated).
        if len(block) <= 1:
            logger.debug("Skipping single-entry block: %s", block)
            results["skipped_blocks"] += 1
            continue

        normalized_paths = [normalize_path(p) for p in block]
        in_var = [p for p in normalized_paths if is_under(p, var_dir_norm)]
        out_var = [p for p in normalized_paths if not is_under(p, var_dir_norm)]

        logger.debug("Block has %d entries (%d in var_dir, %d out): %s",
                     len(block), len(in_var), len(out_var), block)

        # If no copy lives outside var_dir, refuse to delete everything.
        if not out_var and in_var:
            logger.warning("Skipping block because all copies are in VAR_DIRECTORY (not deleting all): %s", block)
            results["skipped_blocks"] += 1
            continue

        # Otherwise, any file inside VAR_DIRECTORY is safe to remove.
        for p in in_var:
            results["candidate_files"] += 1
            try:
                if not os.path.exists(p):
                    logger.warning("File not found: %s (skipping)", p)
                    results["errors"] += 1
                    continue
                move_to_trash(p)
                # Update counters.
                if DRY_RUN:
                    # The file would be moved in a real run.
                    results["would_move_files"] += 1
                else:
                    results["moved_files"] += 1
                processed_count += 1
                # Periodic progress logging.
                if processed_count % log_every_n == 0:
                    if DRY_RUN:
                        logger.info("Dry-run progress: %d files would be moved so far...", processed_count)
                    else:
                        logger.info("Progress: %d files moved so far...", processed_count)
            except Exception as ex:
                logger.error("Error moving to trash: %s -> %s", p, ex)
                results["errors"] += 1

    return results


def main() -> int:
    logger.setLevel(LOG_LEVEL)
    logger.info("Starting clean duplicates script")
    logger.debug("Configuration: DUPLICATES_FILE=%s VAR_DIRECTORY=%s DRY_RUN=%s LOG_LEVEL=%s",
                 DUPLICATES_FILE, VAR_DIRECTORY, DRY_RUN, LOG_LEVEL)

    if not os.path.exists(DUPLICATES_FILE):
        logger.error("File '%s' not found in working dir %s", DUPLICATES_FILE, os.getcwd())
        return 2

    try:
        blocks = parse_duplicates_file(DUPLICATES_FILE)
    except Exception as ex:
        logger.error("Failed to parse duplicates file: %s", ex)
        return 3

    logger.info("Parsed %d duplicate groups", len(blocks))
    results = process_blocks(blocks, VAR_DIRECTORY)

    logger.info("Done. Results: total_blocks=%d candidate_files=%d moved_files=%d would_move_files=%d skipped_blocks=%d errors=%d",
                results["total_blocks"], results["candidate_files"], results["moved_files"],
                results["would_move_files"], results["skipped_blocks"], results["errors"])
    # If the local trash fallback was used, point at its location.
    if not SEND2TRASH_AVAILABLE and not DRY_RUN:
        logger.warning("send2trash not available; files were moved to %s", os.path.abspath(LOCAL_TRASH_DIR))

    return 0


if __name__ == "__main__":
    sys.exit(main())

@@ -0,0 +1,77 @@
Clean duplicate files from the `resultsduplicate.txt` report.

Use a duplicate file finder to generate the report file.

This repository contains a small script, `main.py`, that parses a report file
(`resultsduplicate.txt`) produced by a specialized duplicate detection tool.
It moves duplicate files located in the configured `VAR_DIRECTORY` to the
system trash (or a local recycle-bin fallback) instead of deleting them
permanently.

File format (example block in `resultsduplicate.txt`):

- 2 equal files of size 5256842
  "I:\01_AI\01_IMAGES\00_Input\old\2k-ComfyUI-faceD_00050_.png"
  "I:\01_AI\01_IMAGES\55_Img2Img\20240206-A-selected\2k-ComfyUI-faceD_00050_.png"
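The block format above can be parsed with a simple two-state loop, mirroring the parser in `main.py`: a header line starting with `- … equal files` opens a new group, and each quoted line below it adds a path to the current group. A minimal self-contained sketch (the sample paths are made up for illustration):

```python
import io

def parse_blocks(fh):
    """Group quoted paths under '- N equal files ...' header lines."""
    blocks, current = [], []
    for raw in fh:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-") and "equal file" in line:
            # A header closes the previous group and opens a new one.
            if current:
                blocks.append(current)
            current = []
        elif line.startswith(('"', "'")):
            # A quoted line is a path belonging to the current group.
            current.append(line.strip('"').strip("'"))
    if current:
        blocks.append(current)
    return blocks

sample = """- 2 equal files of size 5256842
  "I:/old/a.png"
  "I:/new/a.png"
- 2 equal files of size 123
  "I:/old/b.png"
  "I:/new/b.png"
"""
groups = parse_blocks(io.StringIO(sample))
print(groups)  # two groups of two paths each
```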
Notes and behavior:

- The first line of a block indicates the number of identical files (2, 3, ...)
- The report contains groups of quoted paths listed under each header
- If any path in a group points into `VAR_DIRECTORY` (usually the `old` folder),
  those files are candidates for removal
- The script moves any file(s) inside `VAR_DIRECTORY` to the trash and leaves
  the other copies intact
- As a safety measure, if *all* copies in a group are inside `VAR_DIRECTORY`,
  the script *skips* that block (to avoid deleting the only remaining copies)
  and logs a warning

Configuration (in `main.py`):

- `DUPLICATES_FILE` - path to the duplicates report (default: `resultsduplicate.txt`)
- `VAR_DIRECTORY` - the directory whose files should be removed (e.g., the `old` folder)
- `DRY_RUN` - if `True`, the script only logs actions and does not move files
- `LOG_LEVEL` - logging level (`logging.INFO`, `logging.DEBUG`, etc.)

Installation (recommended):

```pwsh
python -m pip install --user send2trash
```

`send2trash` is recommended for moving files to the system recycle bin safely.
If `send2trash` is not installed, the script falls back to moving files to a
local `.recycle_bin` directory in the project.
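The local fallback behaves roughly like the sketch below: the file is moved into a trash directory, and a short random prefix is added when a same-named file was already trashed, so nothing is overwritten. The function name and the demo paths here are illustrative, not part of the script's public interface:

```python
import os
import shutil
import tempfile
import uuid

def move_to_local_trash(fp: str, trash_dir: str) -> str:
    """Move fp into trash_dir, avoiding overwrites with a unique prefix."""
    os.makedirs(trash_dir, exist_ok=True)
    base = os.path.basename(fp)
    dst = os.path.join(trash_dir, base)
    if os.path.exists(dst):
        # A same-named file is already trashed; prepend a short unique id.
        dst = os.path.join(trash_dir, f"{uuid.uuid4().hex[:8]}-{base}")
    shutil.move(fp, dst)
    return dst

# Demo in a throwaway directory (paths are illustrative):
with tempfile.TemporaryDirectory() as tmp:
    trash = os.path.join(tmp, ".recycle_bin")
    src = os.path.join(tmp, "dup.png")
    open(src, "w").close()
    moved = move_to_local_trash(src, trash)
    print(os.path.exists(src), os.path.exists(moved))  # False True
```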
Usage:

```pwsh
# dry-run first (only log what would be removed)
python main.py

# to actually remove files, edit main.py and set `DRY_RUN = False`
python main.py
```

Logging:

- The script logs events at levels DEBUG/INFO/WARNING/ERROR
- Configure `LOG_LEVEL` near the top of `main.py` to change log verbosity

Counters & progress:

- The script logs a running progress message every 200 processed candidate
  files (change `log_every_n` in `main.py`)
- The output summary includes:
  - `total_blocks` - number of blocks parsed
  - `candidate_files` - number of eligible files in `VAR_DIRECTORY`
  - `would_move_files` - number of files that *would* be moved when `DRY_RUN=True`
  - `moved_files` - number of files actually moved when `DRY_RUN=False`
  - `skipped_blocks` - blocks skipped for safety (no copies outside `VAR_DIRECTORY`)
  - `errors` - operation errors during the run

Safety tips:

- Run with `DRY_RUN=True` and inspect the logs before making changes
- Make sure `DUPLICATES_FILE` points to a fresh report and that backups exist
  for important data

If you want the script to behave differently (e.g., delete only single files
or keep one copy per project), more advanced options can be implemented.

@@ -0,0 +1 @@
send2trash>=1.8.0
||||
File diff suppressed because it is too large
Load Diff
File diff suppressed because it is too large
Load Diff
Loading…
Reference in New Issue