fastdup

<br /> <div align="left"> <a href="https://www.visual-layer.com" target="_blank" rel="noopener noreferrer" name="top"> <picture> <source media="(prefers-color-scheme: dark)" srcset="./gallery/logo_dark_mode.png" width=600> <source media="(prefers-color-scheme: light)" srcset="./gallery/logo.png" width=600> <img alt="fastdup logo." src="https://yellow-cdn.veclightyear.com/35dd4d3f/2c18a8a5-cbeb-4205-813c-b5ea72752b3d.png"> </picture> </a> <br> <br> </div>

<p align="left"> An unsupervised and free tool for image and video dataset analysis, founded by the authors of <a href="https://github.com/dmlc/xgboost">XGBoost</a>, <a href="https://github.com/apache/tvm">Apache TVM</a>, and <a href="https://github.com/apple/turicreate">Turi Create</a>: <a href="https://www.linkedin.com/in/dr-danny-bickson-835b32">Danny Bickson</a>, <a href="https://www.linkedin.com/in/carlos-guestrin-5352a869">Carlos Guestrin</a>, and <a href="https://www.linkedin.com/in/amiralush">Amir Alush</a>.</p> <hr> <a href="https://visual-layer.readme.io/" target="_blank" rel="noopener noreferrer">Documentation</a> · <a href="#features--advantages" target="_blank" rel="noopener noreferrer">Features</a> · <a href="https://github.com/visual-layer/fastdup/issues/new/choose" target="_blank" rel="noopener noreferrer">Report a Bug</a> · <a href="https://medium.com/visual-layer" target="_blank" rel="noopener noreferrer">Blog</a> · <a href="#getting-started" target="_blank" rel="noopener noreferrer">Getting Started</a> · <a href="#visual-layer-cloud" target="_blank" rel="noopener noreferrer">Visual Layer Cloud</a> <hr> </p>

Getting Started

Install fastdup from PyPI with pip:

pip install fastdup

See the documentation for more installation options.

Initialize and run fastdup:

import fastdup

fd = fastdup.create(input_dir="IMAGE_FOLDER/")
fd.run()
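
Before running, it can help to confirm that the input folder actually contains images. The following is a minimal sketch, not part of fastdup: `list_images` and `run_fastdup` are hypothetical helpers, and the extension set is an illustrative assumption rather than fastdup's own list of supported formats. Only `fastdup.create(input_dir=...)` and `fd.run()` come from the steps above.

```python
from pathlib import Path

# Illustrative set of extensions; an assumption, not fastdup's official list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff", ".webp"}


def list_images(folder):
    """Return sorted image paths found recursively under `folder` (hypothetical helper)."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"{folder!r} is not a directory")
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def run_fastdup(folder):
    """Run fastdup only when the folder contains at least one image."""
    images = list_images(folder)
    if not images:
        raise ValueError(f"no images found under {folder!r}")
    # Imported lazily so the folder check works even without fastdup installed.
    import fastdup

    fd = fastdup.create(input_dir=str(folder))
    fd.run()
    return fd
```

Calling `run_fastdup("IMAGE_FOLDER/")` then returns the same `fd` object used in the rest of this section.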

Explore the results in an interactive web UI:

fd.explore()

Alternatively, visualize the results in static galleries:

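fd.vis.duplicates_gallery()    # gallery of duplicate images
fd.vis.outliers_gallery()      # gallery of outliers
fd.vis.component_gallery()     # gallery of connected components
fd.vis.stats_gallery()         # gallery of image statistics (e.g. blur, brightness)
fd.vis.similarity_gallery()    # gallery of similar images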
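
To generate all five galleries in one pass, the calls above can be wrapped in a loop. This is a sketch under assumptions: `build_galleries` is a hypothetical helper, and it relies only on the five `fd.vis` method names listed above. It keeps going when one gallery fails, for example when a dataset has no results for that gallery.

```python
# Method names taken from the gallery calls listed above.
GALLERY_METHODS = (
    "duplicates_gallery",
    "outliers_gallery",
    "component_gallery",
    "stats_gallery",
    "similarity_gallery",
)


def build_galleries(fd, methods=GALLERY_METHODS):
    """Call each gallery method on `fd.vis` (hypothetical helper).

    Returns a dict mapping each method name to None on success,
    or to the exception that the call raised.
    """
    outcomes = {}
    for name in methods:
        try:
            getattr(fd.vis, name)()
            outcomes[name] = None
        except Exception as exc:  # continue so one failure does not block the rest
            outcomes[name] = exc
    return outcomes
```

After `fd.run()`, `build_galleries(fd)` returns a dictionary that can be inspected to see which galleries were produced.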

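Features & Advantages

fastdup handles labeled and unlabeled datasets in image or video format and provides a range of features:

What sets fastdup apart from similar tools: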

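Quality: High-quality analysis that identifies duplicates/near-duplicates, outliers, mislabeled images, broken images, and low-quality images.
Scale: Highly scalable. Handles 400 million images on a single CPU machine and scales to billions of images.
Speed: An optimized C++ engine runs efficiently even on low-resource CPU machines.
Privacy: Runs locally or on your own cloud infrastructure. Your data stays in place.
Ease of use: Works on the major operating systems (macOS, Linux, and Windows) and handles labeled or unlabeled datasets in image or video format.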
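
To make the distinction between duplicates and near-duplicates concrete, the sketch below finds only byte-identical files by hashing them with SHA-256. It is a standard-library illustration, not fastdup's method: `sha256_of` and `exact_duplicate_groups` are hypothetical helpers, and a resized or re-encoded copy of an image would not be caught this way, which is where near-duplicate analysis comes in.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicate_groups(folder):
    """Group files under `folder` that have identical bytes (hypothetical helper).

    Returns a list of groups; each group is a list of two or more paths.
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```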

Learn from Examples

Learn the basics of fastdup through interactive examples. View the notebooks on GitHub or nbviewer. Even better, run them for free on Google Colab or Kaggle.

<table> <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/quickstart"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/a40cb9fd-7a4a-4e76-9dd2-d7462c3b1836.jpg" width="200"> </a> </td> <td rowspan="4"> <b>⚡ Quickstart:</b> Learn how to install fastdup, load a dataset, and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, and dark/bright/blurry images, and how to view clusters of visually similar images. If you're new, start here! <br> <br> <b>📌 Dataset:</b> <a href="https://www.robots.ox.ac.uk/~vgg/data/pets/">Oxford-IIIT Pet</a>. </td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quickstart.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/quickstart.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quickstart.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/quickstart.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr>  <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/finding-removing-duplicates"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/51662696-92d2-4b65-890f-aaf6bbb08964.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🧹 Finding and Removing Duplicates:</b> Learn how to analyze an image dataset for duplicates and near-duplicates. <br> <br> <b>📌 Dataset:</b> <a href="https://www.robots.ox.ac.uk/~vgg/data/pets/">Oxford-IIIT Pet</a>. 
</td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr>  <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/finding-removing-mislabels"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4f44a1fa-f4d5-4a89-a7a5-0e1e799a794d.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🖼 Finding and Removing Mislabels:</b> Learn how to analyze an image dataset for potential mislabels and export the list of mislabeled images for further inspection. <br> <br> <b>📌 Dataset:</b> <a href="https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/">Food-101</a>. 
</td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr>  <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/image-search"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/96a64d51-fc60-47ff-9282-258137e35ac6.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🎁 Image Similarity Search:</b> Search for images in a large-scale image dataset. <br> <br> <b>📌 Dataset:</b> <a href="https://www.kaggle.com/competitions/shopee-product-matching/data">Shopee Product Matching</a>. 
</td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/image-search.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/image-search.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/image-search.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/image-search.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr>  <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/hugging-face-datasets"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/982a846f-ecaa-45c4-b294-b323a9010a62.jpg" width="200" /> </a> </td> <td rowspan="4"><b>🤗 Hugging Face Datasets:</b> Load and analyze datasets from <a href="https://huggingface.co/datasets">Hugging Face Datasets</a>. A good fit if your dataset is already hosted on the Hugging Face Hub. </td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30" /> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25" /> </a> </td> </tr> <tr> <td align="center"> <a 
href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20" /> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25" /> </a> </td> </tr>  <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/embeddings-timm"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/708d4af6-3501-4cc3-a5c3-af693eae1267.jpg" width="200"> </a> </td> <td rowspan="4"> <b> 🧠 TIMM Embeddings:</b> Compute dataset embeddings with <a href="https://github.com/huggingface/pytorch-image-models">TIMM (PyTorch Image Models)</a> and run fastdup over them to surface dataset issues. Runs on CPU and GPU. </td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr>  <tr> <td 
rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/getting-started"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/145c3591-f59b-4ab0-8548-04a718cf2e01.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🦖 ONNX Embeddings:</b> Bring your own ONNX model. In this example, we use the <a href="https://github.com/facebookresearch/dinov2">DINOv2</a> model to extract image feature vectors. Runs on CPU. </td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/embeddings-onnx-dinov2.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-onnx-dinov2.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/embeddings-onnx-dinov2.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-onnx-dinov2.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr>  </table> See more [examples](EXAMPLES.md).

Join the Community

Get help from the fastdup team or community members through the following channels:

Community-contributed blog posts about fastdup:

<table> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c6620346-36ed-4285-aa77-33219f026d3e.jpg" width="200"></td> <td> <a href="https://medium.com/@atahanbulus.w/deploying-aws-lambda-functions-with-docker-container-by-using-custom-base-image-2d110d307f9b">Deploying AWS Lambda Functions with Docker Container by Using Custom Base Image</a><br> 🖋️ <a href="https://medium.com/@atahanbulus.w">atahan bulus</a>    •    🗓 Sep 16, 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/63a624a7-521d-45bb-bb76-b00342d79d72.jpg" width="200"></td> <td> <a href="https://medium.com/@daniel-klitzke/cleaning-image-classification-datasets-with-fastdup-and-renumics-spotlight-e68deb4730a3">Cleaning Image Classification Datasets with fastdup and Renumics Spotlight</a><br> 🖋️ <a href="https://medium.com/@daniel-klitzke">Daniel Klitzke</a>    •    🗓 Sep 4, 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/bda2d6cd-3e13-4f8d-9766-e53e1122d503.jpg" width="200"></td> <td> <a href="https://blog.roboflow.com/how-to-reduce-dataset-size-computer-vision/">Roboflow: How to Reduce Dataset Size Without Losing Accuracy</a><br> 🖋️ <a href="https://blog.roboflow.com/author/arty/">Arty Ariuntuya</a>    •    🗓 Aug 9, 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d286a92-542a-4d65-87bc-f768782c2d55.jpg" width="200"></td> <td> <a href="https://alexlanseedoo.medium.com/the-weighty-significance-of-data-cleanliness-eb03dce1d0f8">The weighty significance of data cleanliness cannot be overstated, or as I like to call it, “cleanliness is next to model-ness”</a><br> 🖋️ <a href="https://alexlanseedoo.medium.com/">Alexander Lan</a>    •    🗓 Mar 9, 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/9b2ff42b-659e-42bd-aa41-873d837cf2a4.gif" width="200"></td> <td> <a href="https://dicksonneoh.com/blog/clean_up_your_digital_life/">Clean Up Your Digital Life: How I Found 1929 Fully Identical Images, Dark, Bright and Blurry Shots in Minutes, for Free.</a><br> 🖋️ <a href="https://medium.com/@dickson.neoh">Dickson Neoh</a>    •    🗓 Feb 23, 2023 </td> </tr> <tr> <td><img 
src="https://yellow-cdn.veclightyear.com/35dd4d3f/ac8c77e3-f9f3-4000-b484-17308cd0730c.gif" width="200"></td> <td> <a href="https://dicksonneoh.com/portfolio/fastdup_manage_clean_curate/">fastdup: A Powerful Tool to Manage, Clean, and Curate Visual Data at Scale on Your CPU, for Free.</a><br> 🖋️ <a href="https://medium.com/@dickson.neoh">Dickson Neoh</a>    •    🗓 Jan 3, 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/eac11c11-4f61-475f-ab94-2091fe7e62ec.jpg" width="200"></td> <td> <a href="https://towardsdatascience.com/master-data-integrity-to-clean-your-computer-vision-datasets-df432cf9e596">Master Data Integrity to Clean Your Computer Vision Datasets.</a><br> 🖋️ <a href="https://pauliusztin.medium.com/">Paul Iusztin</a>    •    🗓 Dec 19, 2022 </td> </tr> </table>

What our users say: