Evaluation-of-ChatGPT-on-Information-Extraction

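Evaluation of ChatGPT on Information Extraction

An evaluation of ChatGPT on information extraction (IE) tasks, including named entity recognition (NER), relation extraction (RE), event extraction (EE), and aspect-based sentiment analysis (ABSA).

Paper: Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors

Abstract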

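ChatGPT has sparked a wave of research in the field of large language models. In this paper, we assess ChatGPT's capabilities from four perspectives: performance, evaluation criteria, robustness, and error types. Specifically, we first evaluate ChatGPT's performance on 14 IE sub-tasks across 17 datasets under zero-shot, few-shot, and chain-of-thought settings, and find a huge performance gap between ChatGPT and state-of-the-art results. Next, we rethink this gap and propose a soft-matching evaluation strategy that reflects ChatGPT's performance more accurately. Then, we analyze the robustness of ChatGPT on the 14 IE sub-tasks and find that: 1) ChatGPT rarely outputs invalid responses; 2) irrelevant context and long-tail target types greatly affect ChatGPT's performance; 3) ChatGPT does not understand subject-object relationships in the RE task well. Finally, we analyze ChatGPT's errors and find that "unannotated spans" is the dominant error type. This raises concerns about the quality of annotated data, and also indicates the possibility of using ChatGPT to annotate data. The data and code are released on GitHub.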

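Datasets, processed data, and output result files

Except for the original ACE04, ACE05, and TACRED datasets (withheld for copyright reasons), all datasets, processed data, and output result files are available on Google Drive.

Download all files, unzip them, and place them in the corresponding directories.

Test with the API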

bash ./scripts/absa/eval.sh
bash ./scripts/ner/eval.sh
bash ./scripts/re/eval_rc.sh
bash ./scripts/re/eval_triplet.sh
bash ./scripts/ee/eval_trigger.sh
bash ./scripts/ee/eval_argument.sh
bash ./scripts/ee/eval_joint.sh

Before testing, you need to modify the --api_key and --result_file arguments in all *.sh scripts.

Get evaluation metrics

bash ./scripts/absa/report.sh
bash ./scripts/ner/report.sh
bash ./scripts/re/report_rc.sh
bash ./scripts/re/report_triplet.sh
bash ./scripts/ee/report_trigger.sh
bash ./scripts/ee/report_argument.sh
bash ./scripts/ee/report_joint.sh

By default, the metrics are computed from our output result files on Google Drive.
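The abstract mentions a soft-matching evaluation strategy in addition to exact matching. The following Python sketch shows one way such a metric could be computed. It is not the implementation used by the report scripts: the similarity function (`difflib.SequenceMatcher`), the threshold of 0.5, the greedy one-to-one matching, and the names `similarity` and `soft_match_f1` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def soft_match_f1(predictions, golds, threshold=0.5):
    """Compute precision, recall, and F1 with soft span matching.

    `predictions` and `golds` are lists of (span_text, type) tuples.
    A prediction counts as correct if its type equals the gold type and
    the span similarity reaches `threshold`. Each gold item can be
    matched at most once (greedy, best-scoring gold first).
    """
    unmatched = list(golds)
    correct = 0
    for pred_span, pred_type in predictions:
        best_idx, best_score = None, 0.0
        for idx, (gold_span, gold_type) in enumerate(unmatched):
            if gold_type != pred_type:
                continue
            score = similarity(pred_span, gold_span)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            unmatched.pop(best_idx)
            correct += 1
    precision = correct / len(predictions) if predictions else 0.0
    recall = correct / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With gold spans `("New York City", "LOC")` and `("Barack Obama", "PER")`, the predictions `("New York", "LOC")` and `("Obama", "PER")` would both be rejected by exact matching but accepted here, which is the kind of boundary disagreement a soft-matching strategy is meant to tolerate. Setting `threshold=1.0` requires the spans to be identical up to case.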

Main results

Prompt examples

Zero-shot, few-shot ICL, few-shot COT

Future work

We will add results and analysis for GPT-4.

Citation

@article{han2023-chatgpt-IE-evaluation,
  author       = {Ridong Han and
                  Tao Peng and
                  Chaohao Yang and
                  Benyou Wang and
                  Lu Liu and
                  Xiang Wan},
  title        = {Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors},
  journal      = {CoRR},
  volume       = {abs/2305.14450},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2305.14450},
  url          = {https://doi.org/10.48550/arXiv.2305.14450},
  doi          = {10.48550/ARXIV.2305.14450},
}