The SQuAD Dataset: Structure and Code

Last updated: July 10, 2024, 4:38 PM

Understanding the structure of the SQuAD dataset in one article

SQuAD 1.1:

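SQuAD, short for Stanford Question Answering Dataset, is a reading comprehension dataset. It consists of questions posed by crowdworkers on Wikipedia articles, and the answer to every question is a span of text from the corresponding passage.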

SQuAD 1.1 paper: https://arxiv.org/pdf/1606.05250

SQuAD 1.1 on Hugging Face: https://huggingface.co/datasets/rajpurkar/squad

SQuAD 2.0:

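SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions, written adversarially by crowdworkers so that they look similar to answerable ones.
To do well on SQuAD 2.0, a system must not only answer questions when possible, but also determine when the passage does not support an answer and abstain from answering.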

SQuAD 2.0 paper: https://arxiv.org/abs/1806.03822
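
As a rough sketch of what "unanswerable" looks like in the data: in the Hugging Face `squad_v2` release, an unanswerable question keeps the same fields as an answerable one, but both lists inside `answers` are empty. The two records below are made up for illustration (they are not real dataset rows), and `is_answerable` is a hypothetical helper name.

```python
def is_answerable(example):
    """Return True if the example carries at least one reference answer."""
    return len(example["answers"]["text"]) > 0


# Made-up records in the SQuAD 2.0 layout (illustration only).
answerable = {
    "id": "demo-1",
    "title": "Eiffel_Tower",
    "context": "The Eiffel Tower is in Paris.",
    "question": "Where is the Eiffel Tower?",
    "answers": {"text": ["Paris"], "answer_start": [23]},
}
unanswerable = {
    "id": "demo-2",
    "title": "Eiffel_Tower",
    "context": "The Eiffel Tower is in Paris.",
    "question": "Who designed the tower's lighting?",
    "answers": {"text": [], "answer_start": []},  # empty lists: no answer
}

for ex in (answerable, unanswerable):
    label = "answerable" if is_answerable(ex) else "unanswerable"
    print(ex["id"], "->", label)
```

A model evaluated on SQuAD 2.0 is expected to return an empty answer for records of the second kind.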

Code

Below we examine the structure of SQuAD 1.1 through code.

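The SQuAD dataset is split into a training set and a validation set. The training set contains 87,599 examples and the validation set contains 10,570.
Each example has five fields: ['id', 'title', 'context', 'question', 'answers']. id is the example's identifier, title is the title, context is a passage of text, and question is a question about that passage. answers is a dictionary with two keys, 'text' and 'answer_start': 'text' holds the answer strings, and 'answer_start' holds the character position in context where each answer begins.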
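
The field layout described above can be written down as a small type sketch. This is only an illustration: `Answers`, `SquadExample`, and `check_example` are hypothetical names, not part of the `datasets` library, and the sample record is made up.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    text: List[str]          # answer strings
    answer_start: List[int]  # character offset of each answer in `context`


class SquadExample(TypedDict):
    id: str
    title: str
    context: str
    question: str
    answers: Answers


def check_example(example) -> bool:
    """Check the five expected fields and that the two answer lists line up."""
    expected = {"id", "title", "context", "question", "answers"}
    if set(example) != expected:
        return False
    answers = example["answers"]
    return len(answers["text"]) == len(answers["answer_start"])


# Made-up record, used only to exercise the check.
sample: SquadExample = {
    "id": "demo-1",
    "title": "Eiffel_Tower",
    "context": "The Eiffel Tower is in Paris.",
    "question": "Where is the Eiffel Tower?",
    "answers": {"text": ["Paris"], "answer_start": [23]},
}
print(check_example(sample))
```

Note that 'text' and 'answer_start' are parallel lists, so the i-th offset belongs to the i-th answer string.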

Concretely, let's look at the code below.
First, download the data and inspect the structure of datasets:

from datasets import load_dataset
datasets = load_dataset('squad')
print(datasets)
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'title', 'context', 'question', 'answers'],
#         num_rows: 87599
#     })
#     validation: Dataset({
#         features: ['id', 'title', 'context', 'question', 'answers'],
#         num_rows: 10570
#     })
# })

Print the structure of the first example in the training set:

print(datasets['train'][0])

# {'id': '5733be284776f41900661182',
# 'title': 'University_of_Notre_Dame',
# 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
# 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
# 'answers':
# {'text': ['Saint Bernadette Soubirous'],
# 'answer_start': [515]}
# }
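
To make the meaning of 'answer_start' concrete, here is a small sketch that slices the context at that offset and compares the result with the answer text. The record is copied by hand from the first training example printed above, so the snippet runs without downloading anything; the variable names are just for illustration.

```python
# Context of the first training example, copied from the printed record.
context = (
    "Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    'Immediately in front of the Main Building and facing it, is a copper statue '
    'of Christ with arms upraised with the legend "Venite Ad Me Omnes". '
    "Next to the Main Building is the Basilica of the Sacred Heart. "
    "Immediately behind the basilica is the Grotto, a Marian place of prayer "
    "and reflection. It is a replica of the grotto at Lourdes, France where the "
    "Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. "
    "At the end of the main drive (and in a direct line that connects through "
    "3 statues and the Gold Dome), is a simple, modern stone statue of Mary."
)
answers = {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]}

# Each offset is a character index into `context`, so slicing from it
# for len(text) characters should reproduce the answer string.
for text, start in zip(answers["text"], answers["answer_start"]):
    span = context[start:start + len(text)]
    print(repr(span), span == text)
```

The offsets count characters, not tokens, so a tokenizer-based model needs to map them to token positions before training.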

The answers field may contain more than one answer. Print the validation examples that have multiple answers:

for example in datasets['validation']:
    # Keep only the examples that carry more than one reference answer
    if len(example['answers']['text']) > 1:
        print(example['answers'])
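
A multi-answer record can list the same string more than once, so it can help to collapse duplicates before inspecting them. The sketch below does that on a made-up record (the strings and offsets are illustrative, not taken from the dataset); `unique_answers` is a hypothetical helper name.

```python
def unique_answers(example):
    """Return distinct (text, start) pairs in first-seen order."""
    seen = set()
    result = []
    answers = example["answers"]
    for pair in zip(answers["text"], answers["answer_start"]):
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Made-up record in the shape of a validation example (illustration only).
example = {
    "answers": {
        "text": ["Denver Broncos", "Denver Broncos", "Broncos"],
        "answer_start": [10, 10, 17],
    }
}
print(unique_answers(example))
```

The same function can be applied to each record yielded by the loop above.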

Fine-tuning GPT-2 for question answering on SQuAD

Not yet finished.

References:

BERT in Practice 4: Question Answering, Extractive QA (original title: Bert实战4——问答任务-抽取式问答)

LLM Fine-Tuning Workshop: Improve Question-Answering Skills


https://kangkang37.github.io/2024/07/10/dataset-squad/
Author: kangkang
Published: July 10, 2024