<p align="left"> <b> English | <a href="https://github.com/zjunlp/IEPile/blob/main/README_CN.md">Chinese</a> </b> </p>

# IEPile: A Large-Scale Information Extraction Corpus

This is the official repository for IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Datasets | Paper | Usage | Limitations | Statement & License | Citation

Please note that IEPile may undergo updates (we will announce each release); we recommend using the most recent version.

## News

  • [2024/05] The paper IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus was accepted by the ACL 2024 main conference.
  • [2024/04] We release a new bilingual (Chinese and English) schema-based information extraction model called OneKE based on Chinese-Alpaca-2-13B.
  • [2024/02] We release a large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction dataset named IEPile, along with two models trained with IEPile, baichuan2-13b-iepile-lora and llama2-13b-iepile-lora.
  • [2023/10] We released a new bilingual (Chinese and English) theme-based Information Extraction (IE) instruction dataset named InstructIE, along with its paper.
  • [2023/08] We introduced a dedicated 13B model for Information Extraction (IE), named knowlm-13b-ie.
  • [2023/05] We initiated an instruction-based Information Extraction project.

## 1. Introduction

IEPile dataset download links: Google Drive | Hugging Face | WiseModel | ModelScope

Please be aware that the datasets at the links above already exclude all content related to the ACE2005 dataset. If you require the unfiltered, complete dataset and have obtained the necessary permissions, please contact us at guihonghao@zju.edu.cn or zhangningyu@zju.edu.cn, and we will provide the complete dataset resources.

Model download links for LLaMA2-IEPile | Baichuan2-IEPile | OneKE: zjunlp/llama2-13b-iepile-lora | zjunlp/baichuan2-13b-iepile-lora | zjunlp/OneKE

*Figure: statistics of the 33 IE datasets integrated into IEPile.*

We have collected and cleaned existing Information Extraction (IE) datasets, integrating a total of 26 English IE datasets and 7 Chinese IE datasets. As shown in the Figure, these datasets cover multiple domains including general, medical, financial, and others.

In this study, we adopted the proposed "schema-based batched instruction generation strategy" to create a large-scale, high-quality, bilingual (Chinese and English) IE instruction tuning dataset named IEPile, containing approximately 0.32B tokens.

Based on IEPile, we fine-tuned the Baichuan2-13B-Chat and LLaMA2-13B-Chat models using the LoRA technique. Experiments demonstrate that the fine-tuned Baichuan2-IEPile and LLaMA2-IEPile models perform remarkably well on fully supervised training sets and achieve improvements in zero-shot information extraction tasks.

*Figures: zero-shot extraction results on English (zero_en) and Chinese (zero_zh) benchmarks.*

<details> <summary><b>Supervision Results</b></summary>

*Figures: fully supervised results on NER (supervision_ner), RE (supervision_re), and EE (supervision_ee).*

</details>

## 2. Data

### 2.1 Construction of IEPile

We concentrate on instruction-based IE, so the construction of the schemas within the instructions is crucial, because they reflect the specific extraction requirements and are dynamically variable. Previous approaches to building instructions from existing IE datasets often employ a rather coarse schema-processing strategy, using all schemas in a label set to build each instruction. This raises two potential issues:

  1. Inconsistency in the number of schema queries per instruction between training and evaluation. For example, if the model is trained on instructions with about 20 schema queries but tested with 10 or 30, its performance will decrease, even if the training and evaluation schemas are similar in content.
  2. Inadequate differentiation among schemas in the instructions. For example, semantically similar schemas such as "layoffs", "depart", and "dismissals" may present co-occurrence ambiguities that confuse the LLMs. Such schemas should co-occur more frequently within the instruction.

Therefore, we introduce the following solutions: 1) Hard Negative Schema and 2) Batched Instruction Generation.

*Figure: overview of the IEPile construction process.*

<details> <summary><b>Hard Negative Schema</b></summary>

Assume that dataset $\mathcal{D}$ has a full label set $L$. For a given text $S$, the schemas present in its annotation constitute the positive schema set $Pos_L$, while the rest form the negative schema set $Neg_L$. In our analysis, we find that the primary cause of model misjudgment is the semantic ambiguity of schemas. In traditional approaches, $Neg_L$ is simply defined as $L - Pos_L$; however, this overlooks a critical point: negative schemas that are semantically close to positive schemas deserve special attention. Inspired by contrastive learning, we construct a hard negative schema dictionary $\mathcal{K}$, where each key is a schema and the associated value is a collection of schemas semantically similar to it. Based on this, we define the hard negative schema set as $Hard_L = \mathcal{K}[Pos_L]$ and the remaining negative schema set as $Other_L = L - Pos_L - Hard_L$. The final $Neg_L$ consists of $Hard_L$ and a small subset of $Other_L$. This strategy not only makes semantically similar schemas appear together more frequently within instructions but also reduces the number of training instances without sacrificing model performance.
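
A minimal sketch of this selection step, assuming a toy hard-negative dictionary and illustrative function names (not the repository's actual implementation):

```python
import random

# Hypothetical hard-negative dictionary K (illustrative entries only):
# each key maps to schemas that are semantically close to it.
HARD_SCHEMA_DICT = {
    "layoffs": ["depart", "dismissals"],
    "depart": ["layoffs", "dismissals"],
    "dismissals": ["layoffs", "depart"],
}

def build_negative_schemas(full_label_set, positive_schemas, other_ratio=0.2, seed=0):
    """Construct Neg_L = Hard_L + a small random subset of Other_L."""
    rng = random.Random(seed)
    pos = set(positive_schemas)
    # Hard_L = K[Pos_L]: negatives semantically close to the positives.
    hard = {s for p in pos for s in HARD_SCHEMA_DICT.get(p, []) if s not in pos}
    # Other_L = L - Pos_L - Hard_L.
    other = [s for s in full_label_set if s not in pos and s not in hard]
    sampled = rng.sample(other, k=int(len(other) * other_ratio))
    return sorted(hard) + sampled
```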

</details> <details> <summary><b>Batched Instruction Generation</b></summary>

Subsequently, we obtain the final schema set $L' = Pos_L + Neg_L$. We employ a batched instruction generation method, limiting the number of schemas queried in each instruction to $split\_num$, which ranges from 4 to 6. $L'$ is thus divided into $\lceil |L'| / split\_num \rceil$ batches, each querying $split\_num$ schemas. Consequently, even if the number of schemas queried during evaluation differs from that during training, the batched mechanism lets us split the queries into batches of $split\_num$ schemas, thereby mitigating the decline in generalization performance.
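
The batching itself might be sketched as follows; the task description text and function name are illustrative assumptions, not the repository's code:

```python
import json
import random

def batched_instructions(text, positive_schemas, negative_schemas,
                         split_num=4, seed=0):
    """Split L' = Pos_L + Neg_L into batches of `split_num` schemas and
    yield one instruction JSON string per batch."""
    rng = random.Random(seed)
    schemas = list(positive_schemas) + list(negative_schemas)
    rng.shuffle(schemas)  # mix positives and negatives across batches
    for i in range(0, len(schemas), split_num):
        yield json.dumps({
            "instruction": "You are an expert in named entity recognition. ...",
            "schema": schemas[i:i + split_num],
            "input": text,
        }, ensure_ascii=False)
```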

</details>

### 2.2 Data Format of IEPile

Each instance in IEPile contains four fields: task, source, instruction, and output.

Below is a data example:

{ "task": "NER", "source": "CoNLL2003", "instruction": "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}", "output": "{\"person\": [\"Robert Allenby\", \"Allenby\", \"Miguel Angel Martin\"], \"organization\": [], \"else\": [], \"location\": [\"Australia\", \"Spain\"]}" }

This instance belongs to the NER task and comes from the CoNLL2003 dataset. The schema list to be extracted is ["person", "organization", "else", "location"], and the text to extract from is "284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )". The output is {"person": ["Robert Allenby", "Allenby", "Miguel Angel Martin"], "organization": [], "else": [], "location": ["Australia", "Spain"]}.

Note that the order of schemas in the output is consistent with the order in the instruction.
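
Note also that both `instruction` and `output` are JSON-encoded strings, so each needs a second `json.loads` after the record itself is parsed. A minimal sketch, assuming `train.json` stores one JSON record per line:

```python
import json

with open("IEPile/train.json", encoding="utf-8") as f:
    record = json.loads(f.readline())  # assumption: one JSON object per line

inner = json.loads(record["instruction"])  # nested JSON string
schemas = inner["schema"]                  # e.g. ["person", "organization", "else", "location"]
text = inner["input"]                      # the text to extract from
gold = json.loads(record["output"])        # dict mapping each schema to extracted spans
```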

<details> <summary><b>More Tasks Instance</b></summary>
{ "task": "EE", "source": "PHEE", "instruction": "{\"instruction\": \"You are an expert in event extraction. Please extract events from the input that conform to the schema definition. Return an empty list for events that do not exist, and return NAN for arguments that do not exist. If an argument has multiple values, please return a list. Respond in the format of a JSON string.\", \"schema\": [{\"event_type\": \"potential therapeutic event\", \"trigger\": true, \"arguments\": [\"Treatment.Time_elapsed\", \"Treatment.Route\", \"Treatment.Freq\", \"Treatment\", \"Subject.Race\", \"Treatment.Disorder\", \"Effect\", \"Subject.Age\", \"Combination.Drug\", \"Treatment.Duration\", \"Subject.Population\", \"Subject.Disorder\", \"Treatment.Dosage\", \"Treatment.Drug\"]}, {\"event_type\": \"adverse event\", \"trigger\": true, \"arguments\": [\"Subject.Population\", \"Subject.Age\", \"Effect\", \"Treatment.Drug\", \"Treatment.Dosage\", \"Treatment.Freq\", \"Subject.Gender\", \"Treatment.Disorder\", \"Subject\", \"Treatment\", \"Treatment.Time_elapsed\", \"Treatment.Duration\", \"Subject.Disorder\", \"Subject.Race\", \"Combination.Drug\"]}], \"input\": \"Our findings reveal that even in patients without a history of seizures, pregabalin can cause a cortical negative myoclonus.\"}", "output": "{\"potential therapeutic event\": [], \"adverse event\": [{\"trigger\": \"cause \", \"arguments\": {\"Subject.Population\": \"NAN\", \"Subject.Age\": \"NAN\", \"Effect\": \"cortical negative myoclonus\", \"Treatment.Drug\": \"pregabalin\", \"Treatment.Dosage\": \"NAN\", \"Treatment.Freq\": \"NAN\", \"Subject.Gender\": \"NAN\", \"Treatment.Disorder\": \"NAN\", \"Subject\": \"patients without a history of seizures\", \"Treatment\": \"pregabalin\", \"Treatment.Time_elapsed\": \"NAN\", \"Treatment.Duration\": \"NAN\", \"Subject.Disorder\": \"NAN\", \"Subject.Race\": \"NAN\", \"Combination.Drug\": \"NAN\"}}]}" } { "task": "RE", "source": "NYT11", "instruction": "{\"instruction\": \"You are an expert in relationship extraction. Please extract relationship triples that match the schema definition from the input. Return an empty list for relationships that do not exist. Please respond in the format of a JSON string.\", \"schema\": [\"neighborhood of\", \"nationality\", \"children\", \"place of death\"], \"input\": \" In the way New Jersey students know that Thomas Edison 's laboratory is in West Orange , the people of Colma know that Wyatt Earp 's ashes are buried at Hills of Eternity , a Jewish cemetery he was n't ; his wife was , and that Joe DiMaggio is at Holy Cross Cemetery , where visitors often lean bats against his gravestone . \"}", "output": "{\"neighborhood of\": [], \"nationality\": [], \"children\": [], \"place of death\": [{\"subject\": \"Thomas Edison\", \"object\": \"West Orange\"}]}" }
</details>

Below are the explanations for each field:

| Field | Description |
| --- | --- |
| task | The task to which the instance belongs, one of the five types (NER, RE, EE, EET, EEA). |
| source | The dataset to which the instance belongs. |
| instruction | The instruction for input into the model, processed into a JSON string via json.dumps, containing three parts: "instruction", "schema", and "input". |
| output | The output, formatted as a JSON string of a dictionary whose keys are the schemas and whose values are the extracted content. |

The instruction of IEPile adopts a JSON-string format, essentially a dictionary-style string composed of three main components: (1) 'instruction': the task description, outlining the task to be performed (one of NER, RE, EE, EET, EEA); (2) 'schema': a list of schemas to be extracted (entity types, relation types, or event types); (3) 'input': the text from which information is to be extracted.

The file `instruction.py` provides the instructions for the various tasks.

## 3. Using IEPile to Train Models

### 3.1 Environment

Before you begin, make sure to create an appropriate virtual environment following the instructions below:

```bash
conda create -n IEPile python=3.9   # Create a virtual environment
conda activate IEPile               # Activate the environment
pip install -r requirements.txt    # Install dependencies
```

### 3.2 Download Data and Models

IEPile dataset download links: Google Drive | Hugging Face

```
IEPile
├── train.json    # Training set
└── dev.json      # Validation set
```

The code supports fine-tuning of several base models, including LLaMA2-13B-Chat and Baichuan2-13B-Chat.
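
As a rough illustration of the LoRA fine-tuning described above, here is a sketch using the Hugging Face `peft` library; the hyperparameters and target modules are assumptions rather than the repository's official training configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-13b-chat-hf"  # or "baichuan-inc/Baichuan2-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative LoRA hyperparameters; the paper's values may differ.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```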
