Install fastdup from PyPI with pip:

```bash
pip install fastdup
```

For more installation options, see the installation documentation.
Initialize and run fastdup:

```python
import fastdup

fd = fastdup.create(input_dir="IMAGE_FOLDER/")  # folder containing your images
fd.run()
```
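Before a long run it can help to confirm that the input folder actually contains images. The sketch below is one way to do that, not part of fastdup itself: `list_images`, `run_fastdup`, and the `IMAGE_SUFFIXES` set are hypothetical names chosen for illustration, and the suffix list is an assumption rather than the set of formats fastdup supports. Only `fastdup.create(input_dir=...)` and `fd.run()` come from the usage shown above.

```python
from pathlib import Path

# Assumption: a common set of image suffixes, not fastdup's official list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff", ".webp"}


def list_images(folder):
    """Return a sorted list of image files found under `folder` (recursively)."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"{str(folder)!r} is not a directory")
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def run_fastdup(folder):
    """Hypothetical wrapper: check the folder, then run fastdup on it."""
    if not list_images(folder):
        raise ValueError(f"no image files found under {str(folder)!r}")
    # Imported lazily so list_images() can be used without fastdup installed.
    import fastdup

    fd = fastdup.create(input_dir=str(folder))
    fd.run()
    return fd
```

Calling `run_fastdup("IMAGE_FOLDER/")` then behaves like the snippet above, but fails early with a clear message if the folder is missing or empty.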
Explore the results in an interactive web UI:

```python
fd.explore()
```
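Alternatively, visualize the results in a static gallery:

```python
fd.vis.duplicates_gallery()    # gallery of duplicate images
fd.vis.outliers_gallery()      # gallery of outliers
fd.vis.component_gallery()     # gallery of connected components
fd.vis.stats_gallery()         # gallery of image statistics (e.g. blur, brightness)
fd.vis.similarity_gallery()    # gallery of similar images
```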
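If you want every gallery in one pass, a small loop can call each method and record which ones failed instead of stopping at the first error. This is a minimal sketch under assumptions: `build_galleries` and `GALLERIES` are hypothetical names, and the only thing assumed about `fd` is that `fd.vis` exposes the five gallery methods listed above, called without arguments.

```python
# Gallery method names taken from the list above.
GALLERIES = (
    "duplicates_gallery",
    "outliers_gallery",
    "component_gallery",
    "stats_gallery",
    "similarity_gallery",
)


def build_galleries(fd, names=GALLERIES):
    """Call each named gallery method on fd.vis.

    Returns a dict mapping each name to None on success,
    or to the exception that was raised.
    """
    results = {}
    for name in names:
        try:
            getattr(fd.vis, name)()  # look up and call, e.g. fd.vis.stats_gallery()
            results[name] = None
        except Exception as exc:  # keep going so one failure does not hide the rest
            results[name] = exc
    return results
```

After `fd.run()`, `build_galleries(fd)` returns a per-gallery status you can print or log.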
fastdup handles labeled and unlabeled datasets in image or video format and provides a range of features:
<div align="center" style="display:flex;flex-direction:column;"> <a href="https://www.visual-layer.com" target="_blank" rel="noopener noreferrer"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/01b4ad7d-e11c-4b71-baf1-51f94dd6f051.png" alt="fastdup" width="1000"> </a> </div>

What sets fastdup apart from other similar tools:
Learn the basics of fastdup through interactive examples. View the notebooks on GitHub or nbviewer. Even better, run them for free on Google Colab or Kaggle.
<table> <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/quickstart"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/a40cb9fd-7a4a-4e76-9dd2-d7462c3b1836.jpg" width="200"> </a> </td> <td rowspan="4"> <b>⚡ Quickstart:</b> Learn how to install fastdup, load a dataset, and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, and dark/bright/blurry images, and how to view clusters of visually similar images. If you're new, start here! <br> <br> <b>📌 Dataset:</b> <a href="https://www.robots.ox.ac.uk/~vgg/data/pets/">Oxford-IIIT Pet</a>. </td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quickstart.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/quickstart.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quickstart.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/quickstart.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr> <!-- ------------------------------------------------------------------- --> <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/finding-removing-duplicates"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/51662696-92d2-4b65-890f-aaf6bbb08964.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🧹 Finding and Removing Duplicates:</b> Learn how to analyze an image dataset for duplicates and near-duplicates. <br> <br> <b>📌 Dataset:</b> <a href="https://www.robots.ox.ac.uk/~vgg/data/pets/">Oxford-IIIT Pet</a>. 
</td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr> <!-- ------------------------------------------------------------------- --> <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/finding-removing-mislabels"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4f44a1fa-f4d5-4a89-a7a5-0e1e799a794d.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🖼 Finding and Removing Mislabels:</b> Learn how to analyze an image dataset for potential mislabels and export the list of mislabeled images for further inspection. <br> <br> <b>📌 Dataset:</b> <a href="https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/">Food-101</a>. 
</td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr> <!-- ------------------------------------------------------------------- --> <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/image-search"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/96a64d51-fc60-47ff-9282-258137e35ac6.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🎁 Image Similarity Search:</b> Search for images in a large-scale image dataset. <br> <br> <b>📌 Dataset:</b> <a href="https://www.kaggle.com/competitions/shopee-product-matching/data">Shopee Product Matching</a>. 
</td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/image-search.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/image-search.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/image-search.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/image-search.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr> <!-- ------------------------------------------------------------------- --> <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/hugging-face-datasets"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/982a846f-ecaa-45c4-b294-b323a9010a62.jpg" width="200" /> </a> </td> <td rowspan="4"><b>🤗 Hugging Face Datasets:</b> Load and analyze datasets from <a href="https://huggingface.co/datasets">Hugging Face Datasets</a>. Ideal if you already host your dataset on the Hugging Face Hub. </td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30" /> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb"> <img 
src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25" /> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20" /> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25" /> </a> </td> </tr> <!-- ------------------------------------------------------------------- --> <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/embeddings-timm"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/708d4af6-3501-4cc3-a5c3-af693eae1267.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🧠 TIMM Embeddings:</b> Compute dataset embeddings using <a href="https://github.com/huggingface/pytorch-image-models">TIMM (PyTorch Image Models)</a> and run fastdup over them to surface dataset issues. Runs on CPU and GPU. </td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a 
href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr> <!-- ------------------------------------------------------------------- --> <tr> <td rowspan="4" width="160"> <a href="https://visual-layer.readme.io/docs/getting-started"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/145c3591-f59b-4ab0-8548-04a718cf2e01.jpg" width="200"> </a> </td> <td rowspan="4"> <b>🦖 ONNX Embeddings:</b> Bring your own ONNX model. In this example, we use the <a href="https://github.com/facebookresearch/dinov2">DINOv2</a> model to extract feature vectors from images. Runs on CPU. </td> <td align="center" width="80"> <a href="https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/embeddings-onnx-dinov2.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/394a513c-9932-4611-9167-727b052ce788.png" height="30"> </a> </td> </tr> <tr> <td align="center"> <a href="https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-onnx-dinov2.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d483812-0dcb-40ab-8069-2abe698e0985.png" height="25"> </a> </td> </tr> <tr> <td align="center"> <a href="https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/embeddings-onnx-dinov2.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c333c690-d0db-471d-ab07-949a6bba1b4c.png" height="20"> </a> </td> </tr> <tr> <td align="center"> <a href="https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-onnx-dinov2.ipynb"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/b2f574bf-e0c7-4e81-a13e-b9340ebdbef6.png" height="25"> </a> </td> </tr> <!-- ------------------------------------------------------------------- --> </table>

See more [examples](EXAMPLES.md).

Get help from the fastdup team or community members through the following channels:
<a href="https://discord.gg/tkYHJCA7mb" target="_blank" rel="noopener noreferrer"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/f8fc2323-c7fc-4f89-95f6-dfb9e3910555.png" alt="Logo"> </a> <a href="https://visual-layer.readme.io/discuss" target="_blank" rel="noopener noreferrer"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/eb498b98-fb68-44f0-8ea0-49bfdbd9b080.png" alt="Logo"> </a> <a href="https://github.com/visual-layer/fastdup/issues/new/choose" target="_blank" rel="noopener noreferrer"> <img src="https://yellow-cdn.veclightyear.com/35dd4d3f/ddd1ddc2-a50c-4cce-a132-ffc084595895.png" alt="GitHub Issues"> </a>

Community-contributed blog posts about fastdup:
<table> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/c6620346-36ed-4285-aa77-33219f026d3e.jpg" width="200"></td> <td> <a href="https://medium.com/@atahanbulus.w/deploying-aws-lambda-functions-with-docker-container-by-using-custom-base-image-2d110d307f9b">Deploying AWS Lambda Functions with Docker Container by Using Custom Base Image</a><br> 🖋️ <a href="https://medium.com/@atahanbulus.w">atahan bulus</a> • 🗓 16 Sep 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/63a624a7-521d-45bb-bb76-b00342d79d72.jpg" width="200"></td> <td> <a href="https://medium.com/@daniel-klitzke/cleaning-image-classification-datasets-with-fastdup-and-renumics-spotlight-e68deb4730a3">Cleaning Image Classification Datasets with fastdup and Renumics Spotlight</a><br> 🖋️ <a href="https://medium.com/@daniel-klitzke">Daniel Klitzke</a> • 🗓 4 Sep 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/bda2d6cd-3e13-4f8d-9766-e53e1122d503.jpg" width="200"></td> <td> <a href="https://blog.roboflow.com/how-to-reduce-dataset-size-computer-vision/">Roboflow: How to Reduce Dataset Size Without Losing Accuracy</a><br> 🖋️ <a href="https://blog.roboflow.com/author/arty/">Arty Ariuntuya</a> • 🗓 9 Aug 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/4d286a92-542a-4d65-87bc-f768782c2d55.jpg" width="200"></td> <td> <a href="https://alexlanseedoo.medium.com/the-weighty-significance-of-data-cleanliness-eb03dce1d0f8">The Weighty Significance of Data Cleanliness, or as I Like to Call It, "Cleanliness Is Next to Model-liness"</a><br> 🖋️ <a href="https://alexlanseedoo.medium.com/">Alexander Lan</a> • 🗓 9 Mar 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/9b2ff42b-659e-42bd-aa41-873d837cf2a4.gif" width="200"></td> <td> <a href="https://dicksonneoh.com/blog/clean_up_your_digital_life/">Clean Up Your Digital Life: How I Found 1929 Fully Identical Images, Dark, Bright and Blurry Shots in Minutes, for Free.</a><br> 🖋️ <a href="https://medium.com/@dickson.neoh">Dickson Neoh</a> • 🗓 23 Feb 2023 </td> </tr> <tr> <td><img 
src="https://yellow-cdn.veclightyear.com/35dd4d3f/ac8c77e3-f9f3-4000-b484-17308cd0730c.gif" width="200"></td> <td> <a href="https://dicksonneoh.com/portfolio/fastdup_manage_clean_curate/">fastdup: A Powerful Tool to Manage, Clean and Curate Visual Data at Scale on Your CPU - for Free.</a><br> 🖋️ <a href="https://medium.com/@dickson.neoh">Dickson Neoh</a> • 🗓 3 Jan 2023 </td> </tr> <tr> <td><img src="https://yellow-cdn.veclightyear.com/35dd4d3f/eac11c11-4f61-475f-ab94-2091fe7e62ec.jpg" width="200"></td> <td> <a href="https://towardsdatascience.com/master-data-integrity-to-clean-your-computer-vision-datasets-df432cf9e596">Master Data Integrity to Clean Your Computer Vision Datasets.</a><br> 🖋️ <a href="https://pauliusztin.medium.com/">Paul Iusztin</a> • 🗓 19 Dec 2022 </td> </tr> </table>

User feedback:
Visual Layer offers commercial services for managing, cleaning, and curating visual data at scale.
Sign up for free.
https://github.com/visual-layer/fastdup/assets/6821286/57f13d77-0ac4-4c74-8031-07fae87c5b00
Not convinced yet? Interact with the Visual Layer Cloud public dataset with no sign-up required.
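We have added experimental crash reporting, collected using Sentry.
We do not collect user-specific information such as folder names, user names, image names, or image content. We collect only data related to fastdup's internal operation and performance statistics, such as the total number of images, average runtime per image, total free memory, total free disk space, and number of cores.
This helps us identify and resolve stability issues, which improves the overall reliability of fastdup. The data collection code is available here. On macOS we use Google Crashpad to report crashes.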
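Users can opt out of the experimental crash reporting system in either of the following ways:

- Define an environment variable named `SENTRY_OPT_OUT`
- Call `run()` with `turi_param='run_sentry=0'`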
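The two opt-out routes above can be combined in a small helper. This is a sketch under stated assumptions: `opt_out_of_crash_reports` is a hypothetical name, the value `"1"` is a guess (the text only says the variable must be defined), and it is assumed that the variable should be set before fastdup runs.

```python
import os


def opt_out_of_crash_reports(env=os.environ):
    """Set SENTRY_OPT_OUT and return keyword arguments for run().

    Assumption: any non-empty value works; the README only says the
    variable has to be defined.
    """
    env["SENTRY_OPT_OUT"] = "1"
    # The second route from the list above, passed to run() as a keyword argument.
    return {"turi_param": "run_sentry=0"}
```

A typical call would be `fd.run(**opt_out_of_crash_reports())`. Either route is described as sufficient on its own, so using both is only a precaution.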
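fastdup is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.
For further information or inquiries about the license, contact info@visual-layer.com or see the LICENSE file.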
<div align="right"><a href="#top">🔝 Back to Top</a></div>