spotlight

<p align="center"><a href="https://spotlight.renumics.com"><img src="https://yellow-cdn.veclightyear.com/ab5030c0/5d7d5e45-7c6d-4080-af1c-e552cd29c6f9.svg" alt="Gray shape shifter" height="60"/></a></p> <h1 align="center">Renumics Spotlight</h1> <p align="center">Interactively explore unstructured datasets from your dataframe.</p> <p align="center"> <a href="https://github.com/Renumics/spotlight/blob/main/LICENSE"><img src="https://img.shields.io/github/license/renumics/spotlight" height="20"/></a> <a href="https://pypi.org/project/renumics-spotlight/"><img src="https://img.shields.io/pypi/pyversions/renumics-spotlight" height="20"/></a> <a href="https://pypi.org/project/renumics-spotlight/"><img src="https://img.shields.io/pypi/wheel/renumics-spotlight" height="20"/></a> </p> <h3 align="center"> <a href="https://spotlight.renumics.com"><b>Documentation</b></a> • <a href="https://renumics.com/docs/data-centric-ai/playbook"><b>Playbook</b></a> • <a href="https://renumics.com/blog/"><b>Blog</b></a> • <a href="https://renumics.com/api/spotlight/"><b>API Reference</b></a> </h3> <p align="center"><a href="https://spotlight.renumics.com"><img src="https://yellow-cdn.veclightyear.com/ab5030c0/4ea28475-a984-40d1-aa59-c15de685d6e9.gif" width="100%"/></a></p>

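Spotlight helps you understand unstructured datasets quickly. You can create interactive visualizations in little time and use data enrichments (such as embeddings, predictions, and uncertainties) to identify critical clusters in your data.

Spotlight supports most unstructured data types, including images, audio, text, video, time series, and geometric data. You can start from a dataframe you already have.

Starting Spotlight takes only a few lines of code: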

from renumics import spotlight

spotlight.show(df, dtype={"image": spotlight.Image, "embedding": spotlight.Embedding})

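🚀 Start with a use case

Machine learning and engineering teams use Spotlight to understand and communicate complex problems in unstructured data. Below are some examples based on public datasets, with code snippets (👨‍💻), interactive demos (🕹️), and blog posts (📝):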

<table> <thead> <tr> <th>Modality</th> <th>Task</th> <th>Description</th> <th>Links</th> </tr> </thead> <tbody> <tr> <td rowspan="3">🖼️ Image</td> <td rowspan="3">[Classification]</td> <td>Find issues in any image classification dataset</td> <td><a href="https://www.renumics.com/next/docs/use-cases/image-classification">👨‍💻</a> <a href="https://medium.com/@daniel-klitzke/finding-problematic-data-slices-in-unstructured-data-aeec0a3b9a2a">📝</a> <a href="https://huggingface.co/spaces/renumics/sliceguard-unstructured-data">🕹️</a></td> </tr> <tr> <td>Find data issues in the CIFAR-100 image dataset</td> <td><a href="https://huggingface.co/spaces/renumics/navigate-data-issues">🕹️</a></td> </tr> <tr> <td>Fine-tune an image classification model using Bing image search</td> <td><a href="https://renumics.com/next/docs/use-cases/image-fine-tuning">👨‍💻</a><a href="https://medium.com/@markus.stoll/image-classification-in-2023-8ab7dc552115">📝</a></td> </tr> <tr> <td rowspan="3">🔊 Audio</td> <td rowspan="3">[Classification]</td> <td>Find issues in any audio classification dataset</td> <td><a href="https://www.renumics.com/next/docs/use-cases/audio-classification">👨‍💻</a> <a href="https://medium.com/@daniel-klitzke/finding-problematic-data-slices-in-unstructured-data-aeec0a3b9a2a">📝</a><a href="https://huggingface.co/spaces/renumics/whisper-commonvoice-speaker-issues">🕹️</a></td> </tr> <tr> <td>Debug a pre-trained gender detection model on the emodb dataset</td> <td><a href="https://medium.com/p/dbfd923a5a79#432e-3559ae606f80">📝</a> <a href="https://huggingface.co/spaces/renumics/emodb-model-debugging">🕹️</a></td> </tr> <tr> <td>Compare gender detection models on the emodb dataset</td> <td><a href="https://medium.com/p/dbfd923a5a79#432e-3559ae606f80">📝</a> <a href="https://huggingface.co/spaces/renumics/emodb-model-comparison">🕹️</a></td> </tr> <tr> <td rowspan="1">📝 Text</td> <td rowspan="1">[Classification]</td> <td>Find issues in any text classification dataset</td> <td><a href="https://www.renumics.com/next/docs/use-cases/text-classification">👨‍💻</a> <a href="https://medium.com/@daniel-klitzke/finding-problematic-data-slices-in-unstructured-data-aeec0a3b9a2a">📝</a></td> </tr> <tr> <td rowspan="2">📈🖼️ Mixed</td> <td rowspan="2">[Exploratory data analysis]</td> <td>Explore the results of the 2023 Montreal F1 Grand Prix</td> <td><a href="https://huggingface.co/spaces/renumics/f1_montreal_gp">🕹️</a></td> </tr> <tr> <td>Explore a crash simulation dataset</td> <td><a href="https://huggingface.co/spaces/renumics/crash-simulation-demo">🕹️</a></td> </tr> </tbody> </table>

⏱️ Quickstart

Get started by installing Spotlight and loading your first dataset.

What you'll need

Python version 3.8-3.11
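If you want to confirm the interpreter before installing, the version requirement above can be checked with a few lines of Python. This is a minimal sketch: the helper name `is_supported` is hypothetical and not part of Spotlight, and the bounds are simply the 3.8-3.11 range stated above.

```python
import sys

# Supported range taken from the requirement above: Python 3.8 through 3.11.
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 11)


def is_supported(version_info=None):
    """Return True if the (major, minor) pair lies within the supported range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION
```

Calling `is_supported()` with no arguments checks the interpreter that runs the script. Passing a tuple such as `(3, 12)` checks a specific version.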

Install Spotlight via pip

pip install renumics-spotlight

We recommend installing Spotlight, together with everything else you need to work on your data, in a separate virtual environment.
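One way to follow this recommendation from Python itself is the standard-library `venv` module. The sketch below is an illustration, not an official setup script: the function names and the `.venv-spotlight` directory used in the usage note are hypothetical, and the install step needs network access.

```python
import subprocess
import sys
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return the path to its Python."""
    env_dir = Path(env_dir)
    venv.EnvBuilder(with_pip=with_pip).create(env_dir)
    # The interpreter lives in Scripts/ on Windows and in bin/ elsewhere.
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_spotlight(env_python):
    """Run the pip command from this README inside the environment."""
    subprocess.run(
        [str(env_python), "-m", "pip", "install", "renumics-spotlight"],
        check=True,
    )
```

For example, `install_spotlight(create_env(".venv-spotlight"))` creates the environment and installs Spotlight into it. You still need to activate the environment, or call its interpreter directly, to use it.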

Load a dataset and start exploring

import pandas as pd
from renumics import spotlight

df = pd.read_csv("https://renumics.com/data/mnist/mnist-tiny.csv")
spotlight.show(df, dtype={"image": spotlight.Image})

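pd.read_csv loads the sample CSV file as a pandas DataFrame.

spotlight.show opens Spotlight in the browser with the pandas DataFrame ready for you to explore. The dtype argument specifies custom column types for the browser viewer.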
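To try the same flow on your own images, you can first write a CSV with an `image` column and then load it exactly as in the snippet above. The following is a sketch under assumptions. The helper names are hypothetical, and taking the label from the parent folder name is an illustrative convention, not something Spotlight requires. It also assumes that a column of file paths can be viewed with `spotlight.Image`.

```python
import csv
from pathlib import Path


def write_image_table(image_dir, csv_path, extensions=(".png", ".jpg", ".jpeg")):
    """Write one CSV row per image file and return the number of rows.

    Columns: 'image' (file path) and 'label' (parent folder name, a
    hypothetical convention chosen for this example).
    """
    image_dir = Path(image_dir)
    rows = [
        {"image": str(path), "label": path.parent.name}
        for path in sorted(image_dir.rglob("*"))
        if path.suffix.lower() in extensions
    ]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def explore(csv_path):
    """Load the CSV and open it in Spotlight, mirroring the snippet above."""
    # Imported lazily so write_image_table works without these packages.
    import pandas as pd
    from renumics import spotlight

    df = pd.read_csv(csv_path)
    spotlight.show(df, dtype={"image": spotlight.Image})
```

Keeping the table-building step separate from the viewer call makes it easy to inspect or edit the CSV before opening it.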

Load a Hugging Face audio dataset with embeddings and a pre-defined layout

import datasets
from renumics import spotlight

ds = datasets.load_dataset('renumics/emodb-enriched', split='all')
layout = spotlight.layouts.debug_classification(
    label='gender',
    prediction='m1_gender_prediction',
    embedding='m1_embedding',
    features=['age', 'emotion'],
)
spotlight.show(ds, layout=layout)

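Here, the data types are discovered automatically from the dataset, and we use a pre-defined layout for model debugging. Custom layouts can be built programmatically or via the UI.

The datasets[audio] package can be installed via pip.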
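If you want to reuse the debugging layout for more than one model, the column names can be derived from a prefix. This is only a sketch. The snippet above confirms the `m1_gender_prediction` and `m1_embedding` columns, but treating other models as following the same `<prefix>_...` naming is an assumption, and both function names are hypothetical.

```python
def debug_layout_kwargs(model_prefix="m1"):
    """Build the keyword arguments passed to debug_classification above.

    Only the 'm1' column names are confirmed by this README; other
    prefixes assume the dataset follows the same naming scheme.
    """
    return {
        "label": "gender",
        "prediction": f"{model_prefix}_gender_prediction",
        "embedding": f"{model_prefix}_embedding",
        "features": ["age", "emotion"],
    }


def show_model(model_prefix="m1"):
    """Open the emodb dataset with the debugging layout for one model."""
    # Imported lazily so debug_layout_kwargs works without these packages.
    import datasets
    from renumics import spotlight

    ds = datasets.load_dataset("renumics/emodb-enriched", split="all")
    layout = spotlight.layouts.debug_classification(
        **debug_layout_kwargs(model_prefix)
    )
    spotlight.show(ds, layout=layout)
```

With the default prefix, `show_model()` is equivalent to the snippet above.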

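Usage Tracking

We have added crash reporting and performance collection. We do not collect user data other than an anonymized machine ID obtained by py-machineid, and we only log our own actions. We do not collect folder names, dataset names, or row data of any kind, only aggregate performance statistics such as the total time of a table load and crash data. Collecting Spotlight crash data helps us improve stability. To opt out of crash report collection, define an environment variable called SPOTLIGHT_OPT_OUT and set it to true, for example: export SPOTLIGHT_OPT_OUT=true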
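When Spotlight is started from a script or notebook rather than a shell, the same variable can be set from Python. This is a sketch under an assumption: this README does not say when Spotlight reads SPOTLIGHT_OPT_OUT, so the variable is set before the import as the conservative order. The helper name is hypothetical.

```python
import os


def opt_out_of_tracking():
    """Set SPOTLIGHT_OPT_OUT=true for this process and its child processes."""
    os.environ["SPOTLIGHT_OPT_OUT"] = "true"


# Assumption: set the variable before importing Spotlight, in case it is
# read at import time.
opt_out_of_tracking()
# from renumics import spotlight
```

Setting the variable in the shell, as shown above, remains the more direct option, since it applies to every process started from that shell.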