first commit

This commit is contained in:
commit 42446d2873

@@ -0,0 +1,216 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[codz]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py.cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
# Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
# poetry.lock
# poetry.toml

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
# pdm.lock
# pdm.toml
.pdm-python
.pdm-build/

# pixi
# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
# pixi.lock
# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
# in the .venv directory. It is recommended not to include this directory in version control.
.pixi

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# Redis
*.rdb
*.aof
*.pid

# RabbitMQ
mnesia/
rabbitmq/
rabbitmq-data/

# ActiveMQ
activemq-data/

# SageMath parsed files
*.sage.py

# Environments
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
# .idea/

# Abstra
# Abstra is an AI-powered process automation framework.
# Ignore directories containing user credentials, local state, and settings.
# Learn more at https://abstra.io/docs
.abstra/

# Visual Studio Code
# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
# and can be added to the global gitignore or merged into this file. However, if you prefer,
# you could uncomment the following to ignore the entire vscode folder
# .vscode/

# Ruff stuff:
.ruff_cache/

# PyPI configuration file
.pypirc

# Marimo
marimo/_static/
marimo/_lsp/
__marimo__/

# Streamlit
.streamlit/secrets.toml

@@ -0,0 +1,237 @@
#!/usr/bin/env python3
"""
Clean duplicates script.

This script parses a duplicates list (output by a specialized detection tool) in
`resultsduplicate.txt` and moves duplicate files that live under a specific
directory (VAR_DIRECTORY) to the trash (recycle bin) instead of deleting them
permanently.

Behavior:
- The script parses blocks like:
    - 2 equal files of size 5256842
      "I:\\01_AI\\01_IMAGES\\00_Input\\old\\file.png"
      "I:\\01_AI\\01_IMAGES\\55_Img2Img\\other\\file.png"
- If any files in the block live under `VAR_DIRECTORY`, they are candidates for
  removal and are moved to the trash (never removed permanently).
- If all files in a block are inside `VAR_DIRECTORY`, the script skips that
  block to avoid losing every copy.

The script supports a DRY_RUN mode (no changes are made, only logged). It uses
send2trash if available and otherwise falls back to moving files to a local
`.recycle_bin` directory in the project.

This script takes no command-line arguments; edit the variables near the top
of the file to configure its behavior.
"""

from __future__ import annotations

import logging
import os
import shutil
import sys
import uuid
from typing import List

try:
    from send2trash import send2trash
    SEND2TRASH_AVAILABLE = True
except Exception:
    SEND2TRASH_AVAILABLE = False


# Configuration variables (the script takes no CLI args - edit here):
DUPLICATES_FILE = "resultsduplicate.txt"
VAR_DIRECTORY = r"I:\01_AI\01_IMAGES\00_Input"
# When DRY_RUN is True, the script logs actions but does not move files.
DRY_RUN = True
# Default logging level (DEBUG/INFO/WARNING/ERROR)
LOG_LEVEL = logging.INFO

LOCAL_TRASH_DIR = ".recycle_bin"


def setup_logger(level: int = logging.INFO) -> logging.Logger:
    logger = logging.getLogger("cleanDupli")
    logger.setLevel(level)
    handler = logging.StreamHandler(sys.stdout)
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    handler.setFormatter(formatter)
    if not logger.handlers:
        logger.addHandler(handler)
    return logger


logger = setup_logger(LOG_LEVEL)


def normalize_path(p: str) -> str:
    # Remove surrounding quotes/spaces, then normalize case and path separators.
    p = p.strip().strip('"').strip("'")
    p = os.path.normpath(p)
    return os.path.normcase(p)


def is_under(path: str, directory: str) -> bool:
    # Compare against the directory plus a separator so that sibling
    # directories (e.g. "00_Input2" next to "00_Input") do not match.
    return path == directory or path.startswith(directory + os.sep)


def parse_duplicates_file(path: str) -> List[List[str]]:
    if not os.path.exists(path):
        logger.error("Duplicates file '%s' does not exist.", path)
        raise FileNotFoundError(path)

    blocks: List[List[str]] = []
    current_block: List[str] = []
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for raw in fh:
            stripped = raw.strip()
            if not stripped:
                continue
            # A block header starts with '- ' (like '- 2 equal files ...').
            if stripped.startswith("-") and "equal file" in stripped:
                # Push any finished block, then start a new one.
                if current_block:
                    blocks.append(current_block)
                current_block = []
                continue
            # Path lines are indented and quoted; collect them into the block.
            if stripped.startswith('"') or stripped.startswith("'"):
                p = stripped.strip('"').strip("'")
                # Paths in the report may use forward or back slashes.
                p = p.replace("/", os.sep)
                p = p.replace("\\\\", "\\")
                current_block.append(p)
            # Anything that is neither a header nor a path is ignored.
    if current_block:
        blocks.append(current_block)
    return blocks


def ensure_local_trash() -> str:
    os.makedirs(LOCAL_TRASH_DIR, exist_ok=True)
    return os.path.abspath(LOCAL_TRASH_DIR)


def move_to_trash(fp: str) -> None:
    if DRY_RUN:
        logger.info("[DRY RUN] Would move to trash: %s", fp)
        return

    if SEND2TRASH_AVAILABLE:
        try:
            send2trash(fp)
            logger.info("Moved to system trash: %s", fp)
            return
        except Exception as ex:
            logger.warning("send2trash failed for %s: %s", fp, ex)
            # Fall through to the local trash fallback.

    # Fallback: move the file to the local recycle bin directory.
    try:
        trash_dir = ensure_local_trash()
        base = os.path.basename(fp)
        dst = os.path.join(trash_dir, base)
        # Avoid accidentally overwriting an earlier trashed file.
        if os.path.exists(dst):
            # Append a unique suffix.
            dst = os.path.join(trash_dir, f"{uuid.uuid4().hex[:8]}-{base}")
        shutil.move(fp, dst)
        logger.info("Moved to local trash (%s): %s", dst, fp)
    except Exception as ex:
        logger.error("Failed to move to local trash %s: %s", fp, ex)
        raise


def process_blocks(blocks: List[List[str]], var_dir: str) -> dict:
    results = {
        "total_blocks": len(blocks),
        "candidate_files": 0,
        "skipped_blocks": 0,
        "moved_files": 0,
        "would_move_files": 0,
        "errors": 0,
    }
    var_dir_norm = normalize_path(var_dir)

    processed_count = 0
    log_every_n = 200  # periodic progress logging for large lists
    for block in blocks:
        # Skip single-entry blocks (nothing is duplicated).
        if len(block) <= 1:
            logger.debug("Skipping single-entry block: %s", block)
            results["skipped_blocks"] += 1
            continue

        normalized_paths = [normalize_path(p) for p in block]
        in_var = [p for p in normalized_paths if is_under(p, var_dir_norm)]
        out_var = [p for p in normalized_paths if not is_under(p, var_dir_norm)]

        logger.debug("Block has %d entries (%d in var_dir, %d out): %s",
                     len(block), len(in_var), len(out_var), block)

        # If no copy lives outside var_dir, refuse to delete everything.
        if not out_var and in_var:
            logger.warning("Skipping block because all copies are in VAR_DIRECTORY (not deleting all): %s", block)
            results["skipped_blocks"] += 1
            continue

        # Otherwise, any file inside VAR_DIRECTORY is safe to remove.
        for p in in_var:
            results["candidate_files"] += 1
            try:
                if not os.path.exists(p):
                    logger.warning("File not found: %s (skipping)", p)
                    results["errors"] += 1
                    continue
                move_to_trash(p)
                # Update counters.
                if DRY_RUN:
                    # The file would be moved in a real run.
                    results["would_move_files"] += 1
                else:
                    results["moved_files"] += 1
                processed_count += 1
                # Periodic progress logging.
                if processed_count % log_every_n == 0:
                    if DRY_RUN:
                        logger.info("Dry-run progress: %d files would be moved so far...", processed_count)
                    else:
                        logger.info("Progress: %d files moved so far...", processed_count)
            except Exception as ex:
                logger.error("Error moving to trash: %s -> %s", p, ex)
                results["errors"] += 1

    return results


def main() -> int:
    logger.setLevel(LOG_LEVEL)
    logger.info("Starting clean duplicates script")
    logger.debug("Configuration: DUPLICATES_FILE=%s VAR_DIRECTORY=%s DRY_RUN=%s LOG_LEVEL=%s",
                 DUPLICATES_FILE, VAR_DIRECTORY, DRY_RUN, LOG_LEVEL)

    if not os.path.exists(DUPLICATES_FILE):
        logger.error("File '%s' not found in working dir %s", DUPLICATES_FILE, os.getcwd())
        return 2

    try:
        blocks = parse_duplicates_file(DUPLICATES_FILE)
    except Exception as ex:
        logger.error("Failed to parse duplicates file: %s", ex)
        return 3

    logger.info("Parsed %d duplicate groups", len(blocks))
    results = process_blocks(blocks, VAR_DIRECTORY)

    logger.info("Done. Results: total_blocks=%d candidate_files=%d moved_files=%d would_move_files=%d skipped_blocks=%d errors=%d",
                results["total_blocks"], results["candidate_files"], results["moved_files"],
                results["would_move_files"], results["skipped_blocks"], results["errors"])
    # If the local trash fallback was used, point at its location.
    if not SEND2TRASH_AVAILABLE and not DRY_RUN:
        logger.warning("send2trash not available; files were moved to %s", os.path.abspath(LOCAL_TRASH_DIR))

    return 0


if __name__ == "__main__":
    sys.exit(main())

@@ -0,0 +1,77 @@
Clean duplicate files from the `resultsduplicate.txt` report.

Use a duplicate file finder to generate the report file.

This repository contains a small script, `main.py`, that parses a report file
(`resultsduplicate.txt`) produced by a specialized duplicate detection tool.
It moves duplicate files located in the configured `VAR_DIRECTORY` to the
system trash (or a local recycle-bin fallback) instead of deleting them
permanently.

File format (example block in `resultsduplicate.txt`):

- 2 equal files of size 5256842
  "I:\01_AI\01_IMAGES\00_Input\old\2k-ComfyUI-faceD_00050_.png"
  "I:\01_AI\01_IMAGES\55_Img2Img\20240206-A-selected\2k-ComfyUI-faceD_00050_.png"
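The block format above can be parsed with a simple two-state loop, mirroring the parser in `main.py`: a header line starting with `- … equal files` opens a new group, and each quoted line below it adds a path to the current group. A minimal self-contained sketch (the sample paths are made up for illustration):

```python
import io

def parse_blocks(fh):
    """Group quoted paths under '- N equal files ...' header lines."""
    blocks, current = [], []
    for raw in fh:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-") and "equal file" in line:
            # A header closes the previous group and opens a new one.
            if current:
                blocks.append(current)
            current = []
        elif line.startswith(('"', "'")):
            # A quoted line is a path belonging to the current group.
            current.append(line.strip('"').strip("'"))
    if current:
        blocks.append(current)
    return blocks

sample = """- 2 equal files of size 5256842
  "I:/old/a.png"
  "I:/new/a.png"
- 2 equal files of size 123
  "I:/old/b.png"
  "I:/new/b.png"
"""
groups = parse_blocks(io.StringIO(sample))
print(groups)  # two groups of two paths each
```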
Notes and behavior:

- The first line of a block indicates the number of identical files (2, 3, ...)
- The report contains groups of quoted paths listed under each header
- If any path in a group points into `VAR_DIRECTORY` (usually the `old` folder),
  those files are candidates for removal
- The script moves any file(s) inside `VAR_DIRECTORY` to the trash and leaves
  the other copies intact
- As a safety measure, if *all* copies in a group are inside `VAR_DIRECTORY`,
  the script *skips* that block (to avoid deleting the only remaining copies)
  and logs a warning

Configuration (in `main.py`):

- `DUPLICATES_FILE` - path to the duplicates report (default: `resultsduplicate.txt`)
- `VAR_DIRECTORY` - the directory whose files should be removed (e.g., the `old` folder)
- `DRY_RUN` - if `True`, the script only logs actions and does not move files
- `LOG_LEVEL` - logging level (`logging.INFO`, `logging.DEBUG`, etc.)

Installation (recommended):

```pwsh
python -m pip install --user send2trash
```

`send2trash` is recommended for moving files to the system recycle bin safely.
If `send2trash` is not installed, the script falls back to moving files to a
local `.recycle_bin` directory in the project.
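The local fallback behaves roughly like the sketch below: the file is moved into a trash directory, and a short random prefix is added when a same-named file was already trashed, so nothing is overwritten. The function name and the demo paths here are illustrative, not part of the script's public interface:

```python
import os
import shutil
import tempfile
import uuid

def move_to_local_trash(fp: str, trash_dir: str) -> str:
    """Move fp into trash_dir, avoiding overwrites with a unique prefix."""
    os.makedirs(trash_dir, exist_ok=True)
    base = os.path.basename(fp)
    dst = os.path.join(trash_dir, base)
    if os.path.exists(dst):
        # A same-named file is already trashed; prepend a short unique id.
        dst = os.path.join(trash_dir, f"{uuid.uuid4().hex[:8]}-{base}")
    shutil.move(fp, dst)
    return dst

# Demo in a throwaway directory (paths are illustrative):
with tempfile.TemporaryDirectory() as tmp:
    trash = os.path.join(tmp, ".recycle_bin")
    src = os.path.join(tmp, "dup.png")
    open(src, "w").close()
    moved = move_to_local_trash(src, trash)
    print(os.path.exists(src), os.path.exists(moved))  # False True
```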
Usage:

```pwsh
# dry-run first (only log what would be removed)
python main.py

# to actually remove files, edit main.py and set `DRY_RUN = False`
python main.py
```

Logging:

- The script logs events at levels DEBUG/INFO/WARNING/ERROR
- Configure `LOG_LEVEL` near the top of `main.py` to change log verbosity

Counters & progress:

- The script logs a running progress message every 200 processed candidate
  files (change `log_every_n` in `main.py`)
- The output summary includes:
  - `total_blocks` - number of blocks parsed
  - `candidate_files` - number of eligible files in `VAR_DIRECTORY`
  - `would_move_files` - number of files that *would* be moved when `DRY_RUN=True`
  - `moved_files` - number of files actually moved when `DRY_RUN=False`
  - `skipped_blocks` - blocks skipped for safety (no copies outside `VAR_DIRECTORY`)
  - `errors` - operation errors during the run

Safety tips:

- Run with `DRY_RUN=True` and inspect the logs before making changes
- Make sure `DUPLICATES_FILE` points to a fresh report and that backups exist
  for important data

If you want the script to behave differently (e.g., delete only single files
or keep one copy per project), more advanced options can be implemented.

@@ -0,0 +1 @@
send2trash>=1.8.0
||||
File diff suppressed because it is too large
Load Diff
File diff suppressed because it is too large
Load Diff
Loading…
Reference in New Issue