A Detailed Introduction to Google's SentencePiece

This article was last updated on July 31, 2024 at 1:32 a.m.

Google's SentencePiece Tokenizer

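What is SentencePiece

SentencePiece is an open-source subword toolkit from Google. It is an unsupervised text tokenizer and detokenizer, intended mainly for neural-network-based text generation systems in which the vocabulary size is fixed before the model is trained. SentencePiece implements subword units (for example byte-pair encoding (BPE) and the unigram language model). It lets us build a purely end-to-end system that does not depend on language-specific pre-processing or post-processing.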

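Advantages of SentencePiece:

Purely data-driven. SentencePiece trains the tokenizer and detokenizer directly from sentences; pre-tokenization is not required.
Language-independent. SentencePiece treats a sentence simply as a sequence of Unicode characters, with no language-dependent logic, so different languages do not need different handling.
Multiple subword algorithms. Both BPE and the unigram language model are supported.
Fast and lightweight.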

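Features of SentencePiece

SentencePiece is a tokenizer and detokenizer that works at subword granularity. It supports the BPE and unigram methods.

For an introduction to the BPE and unigram methods, see the papers in the resource links below, or the post "Tokenizer的原理II: BPE和Unigram方法" (How Tokenizers Work II: the BPE and Unigram Methods).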

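The number of tokens is predetermined
Most unsupervised word segmentation algorithms assume an infinite vocabulary, whereas SentencePiece fixes the final vocabulary size before training the segmentation model, for example 8k, 16k, or 32k.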
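
To make the idea of a predetermined vocabulary size concrete, here is a minimal sketch of a BPE-style trainer that keeps merging the most frequent adjacent pair until the vocabulary reaches a target size. It is a toy written for this article, not SentencePiece's trainer: the function name `train_toy_bpe`, the word list, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

def train_toy_bpe(words, vocab_size):
    """Learn BPE merges until the vocabulary holds vocab_size symbols (toy version)."""
    # Each word is a tuple of symbols; training starts from single characters.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for word in corpus for ch in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Most frequent pair; ties are broken lexicographically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
        vocab.add(merged)
        merges.append(best)
    return vocab, merges

toy_words = ["low", "low", "lower", "newest", "newest", "widest"]
toy_vocab, toy_merges = train_toy_bpe(toy_words, vocab_size=14)
print(len(toy_vocab), toy_merges)
```

The stopping condition is the point: the target size is chosen up front, and training simply runs until the vocabulary reaches it.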

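Trains directly from raw sentences
Earlier subword implementations assume that the input sentences are pre-tokenized, which keeps the subsequent training efficient. But having to run a language-dependent tokenizer first makes preprocessing complicated. SentencePiece is fast enough to train directly from raw sentences. This is especially useful for Chinese and Japanese, where there are no explicit spaces between words.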

Whitespace is treated as a basic symbol
The first step of NLP (Natural Language Processing) is text tokenization. For example, a standard English tokenizer would split the text 'hello world.' into three tokens:

`[hello] [world] [.]`

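Notice that the conversion between the raw input and the tokenized sequence is not reversible: we can go from the raw input to the token sequence, but not back. For example, whether there is a space between 'world' and '.' is dropped from the tokenized sequence, so the tokens cannot tell us whether a space preceded the period. That is:

`Tokenizer('world.') == Tokenizer('world .')`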
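
The information loss described above can be reproduced with a naive word-and-punctuation tokenizer. This is only a sketch: `naive_tokenize` is a hypothetical stand-in for a conventional tokenizer, not part of any library.

```python
import re

def naive_tokenize(text):
    """Split text into word and punctuation tokens, discarding whitespace."""
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("hello world."))  # ['hello', 'world', '.']
# Both inputs produce the same tokens, so the space cannot be recovered.
print(naive_tokenize("world.") == naive_tokenize("world ."))  # True
```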

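SentencePiece first converts the input text into a sequence of Unicode characters, so whitespace is handled as a normal symbol too. To treat whitespace explicitly as a basic token, SentencePiece first escapes it with the meta symbol '▁' (U+2581), as follows:

`hello▁world.`

It then segments this text into small pieces:

`[hello] [▁wor] [ld] [.]`

Because whitespace is preserved in the segmented text, we can detokenize it without any ambiguity:

`detokenized = ''.join(pieces).replace('▁', ' ')`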

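This property (keeping whitespace information in the segmented sequence) makes detokenization independent of the language. For example, English uses spaces and Chinese does not, so if whitespace information were discarded, the two languages would need different detokenization rules. Because SentencePiece keeps the whitespace information, it can apply one uniform detokenization method to every language.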
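
The round trip can be sketched in a few lines of plain Python. The helper names below are made up for this example, and the piece boundaries are chosen by hand rather than by a trained model; the sketch also skips the normalization that the real library performs.

```python
WS = "\u2581"  # the meta symbol '▁' that stands in for a space

def escape_whitespace(text):
    """Replace every space with the meta symbol."""
    return text.replace(" ", WS)

def detokenize(pieces):
    """Concatenate pieces and turn the meta symbol back into a space."""
    return "".join(pieces).replace(WS, " ")

escaped = escape_whitespace("hello world.")
# Hand-picked split that mirrors the pieces shown above.
pieces = [escaped[:5], escaped[5:9], escaped[9:11], escaped[11:]]
print(pieces)              # ['hello', '▁wor', 'ld', '.']
print(detokenize(pieces))  # hello world.

# The same function works for text without spaces.
print(detokenize(["你好", "世界"]))  # 你好世界
```

Since the space survives inside the pieces, `'world.'` and `'world .'` no longer collapse to the same sequence.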

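Subword regularization and BPE-dropout
Subword regularization and BPE-dropout are simple regularization methods that virtually augment the training data with on-the-fly subword sampling, which helps improve both the accuracy and the robustness of NMT (Neural Machine Translation) models.

To enable subword regularization, you need to integrate the SentencePiece library into the NMT system so that one segmentation is sampled for each parameter update, which differs from standard offline data preparation. In the example below, 'New York' can be segmented differently on each call.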

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
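
To give a feel for what sampling a segmentation means, the sketch below enumerates every segmentation of a short string under a tiny unigram vocabulary and draws one with probability proportional to its score raised to the power `alpha`. Everything here is an assumption for illustration: the log-probabilities are invented, the input is already escaped with '▁', and the real library uses far more efficient algorithms than brute-force enumeration.

```python
import math
import random

# Hypothetical piece log-probabilities, invented for this example.
TOY_LOGPROBS = {
    "▁": -3.0, "N": -5.0, "e": -4.0, "w": -5.0, "New": -3.0,
    "▁New": -2.0, "▁York": -2.0, "▁Y": -4.5, "o": -4.0, "r": -4.0, "k": -5.0,
}

def all_segmentations(text, vocab):
    """Enumerate every way to split text into pieces found in vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results

def sample_segmentation(text, vocab, alpha, rng):
    """Draw one segmentation with probability proportional to P(seg) ** alpha."""
    candidates = all_segmentations(text, vocab)
    weights = [math.exp(alpha * sum(vocab[p] for p in seg)) for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
text = "▁New▁York"
samples = [sample_segmentation(text, TOY_LOGPROBS, alpha=0.1, rng=rng) for _ in range(5)]
for seg in samples:
    print(seg)
```

A small `alpha` flattens the distribution, so unlikely segmentations are drawn more often; a large `alpha` concentrates the samples on the best segmentation.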

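Code and analysis

The discussion so far is rather abstract, which makes it hard to see how SentencePiece actually works in practice. Walking through code makes it easier to follow.

The SentencePiece source code is on GitHub, in the google/sentencepiece repository.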

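Installing SentencePiece

SentencePiece provides a Python package and a C++ package. Only installation and usage from Python are covered below.

Installing the Python package is simple:

`pip install sentencepiece`

Then download a text file to train the tokenizer model on. The examples below assume it is saved as ./mydataset/botchan.txt.

`# !wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt`

Basic operations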

import sentencepiece as spm
dataset_path = './mydataset/botchan.txt'
spm.SentencePieceTrainer.train(input=dataset_path, model_prefix='m', vocab_size=2000)

# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))

# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))

# returns vocab size
print(sp.get_piece_size())

# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))

# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
for id in range(3):
  print(sp.id_to_piece(id), sp.is_control(id))
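
The id layout used above (ids 0, 1, 2 reserved for `<unk>`, `<s>`, `</s>`, and unknown pieces falling back to id 0) can be mimicked with a small lookup table. `ToyVocab` is a hypothetical class written for this article and only mirrors the behaviour shown in the comments above; it is not how the library stores its vocabulary.

```python
class ToyVocab:
    """Piece <-> id table with <unk>, <s>, </s> reserved at ids 0, 1, 2."""

    def __init__(self, pieces):
        self._pieces = ["<unk>", "<s>", "</s>"] + list(pieces)
        self._ids = {piece: i for i, piece in enumerate(self._pieces)}

    def piece_to_id(self, piece):
        # Pieces that are not in the table map to the <unk> id.
        return self._ids.get(piece, 0)

    def id_to_piece(self, idx):
        return self._pieces[idx]

    def is_control(self, idx):
        # In this toy, only <s> and </s> count as control symbols.
        return idx in (1, 2)

    def __len__(self):
        return len(self._pieces)

vocab = ToyVocab(["▁This", "▁is", "▁a", "▁t", "est"])
print(vocab.piece_to_id("▁This"))                # 3
print(vocab.piece_to_id("__MUST_BE_UNKNOWN__"))  # 0
for idx in range(3):
    print(vocab.id_to_piece(idx), vocab.is_control(idx))
```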

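Two kinds of special symbols

user defined symbols: special symbols defined by the user. They are allowed to appear in the input sentence.
control symbols: only ids are reserved for control symbol tokens. These symbols are not expected to appear in the input text.

You can define your own special symbols, either in the user_defined_symbols parameter or in control_symbols.

In the example below I define two user_defined_symbols. The input text contains both symbols; after segmentation
each symbol is treated as a single piece and then converted to an id. Each special symbol maps to one id, here 3 and 4 respectively.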

# Example of user defined symbols
dataset_path = './mydataset/botchan.txt'
spm.SentencePieceTrainer.train(input=dataset_path, model_prefix='m_user', user_defined_symbols='<sep>,<cls>',  vocab_size=2000)

sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')

# ids are reserved in both modes (user defined symbols and control symbols).
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print(sp_user.id_to_piece([3]))  # ['<sep>']
print(sp_user.id_to_piece([4]))  # ['<cls>']
print('3=', sp_user.decode_ids([3]))  # 3= <sep>
print('4=', sp_user.decode_ids([4]))  # 4= <cls>
print('3=', sp_user.decode_pieces(['<sep>']))  # 3= <sep>
print('4=', sp_user.decode_pieces(['<cls>']))  # 4= <cls>

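What changes if I define them as control_symbols instead? See the example below.
The input text contains the special symbols <sep> and <cls>, but the tokenizer treats them as ordinary text and splits each of them into several tokens.
Ids 3 and 4 still correspond to the two special symbols, as id_to_piece and piece_to_id show.
Decoding them, however, yields no text: the result is empty.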

spm.SentencePieceTrainer.train('--input=./mydataset/botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')

sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')

# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>'))  # 3
print(sp_ctrl.piece_to_id('<cls>'))  # 4
print(sp_ctrl.id_to_piece([3]))  # ['<sep>']
print(sp_ctrl.id_to_piece([4]))  # ['<cls>']
print('3=', sp_ctrl.decode_ids([3]))  # decoded to empty
print('4=', sp_ctrl.decode_ids([4]))  # decoded to empty
print('3=', sp_ctrl.decode_pieces(['<sep>']))  # decoded to empty
print('4=', sp_ctrl.decode_pieces(['<cls>']))  # decoded to empty
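
The difference between the two kinds of symbols can be summarized with a toy model of the behaviour described above. Both helpers are hypothetical and written only for this article: one isolates user defined symbols before segmentation, the other drops control symbols during decoding. They imitate the observable behaviour, not the library's implementation.

```python
import re

WS = "\u2581"  # the meta symbol '▁'

def split_on_symbols(text, user_symbols):
    """Split text so that each user defined symbol becomes its own chunk."""
    pattern = "(" + "|".join(re.escape(s) for s in user_symbols) + ")"
    return [chunk for chunk in re.split(pattern, text) if chunk]

def toy_decode(pieces, control_symbols):
    """Join pieces into text, skipping control symbols and restoring spaces."""
    kept = [p for p in pieces if p not in control_symbols]
    return "".join(kept).replace(WS, " ")

symbols = ["<sep>", "<cls>"]
chunks = split_on_symbols("this is a test<sep> hello world<cls>", symbols)
print(chunks)  # ['this is a test', '<sep>', ' hello world', '<cls>']

# Treated as a user defined symbol, '<sep>' survives decoding.
print(repr(toy_decode(["<sep>"], control_symbols=set())))       # '<sep>'
# Treated as a control symbol, it decodes to an empty string.
print(repr(toy_decode(["<sep>"], control_symbols={"<sep>"})))   # ''
```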

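Two model types: unigram and BPE

There are two model types: unigram (the default) and BPE (--model_type=bpe).
The two types show no significant difference in output quality, but the unigram model can perform sampling and n-best segmentation.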

spm.SentencePieceTrainer.train('--input=./mydataset/botchan.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')

sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')

print('*** BPE ***')
print(sp_bpe.encode_as_pieces('thisisatesthelloworld'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5))  # returns an empty list.

spm.SentencePieceTrainer.train('--input=./mydataset/botchan.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')

print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('thisisatesthelloworld'))
print(sp_unigram.nbest_encode_as_pieces('thisisatesthelloworld', 5))
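
For intuition about how a unigram model picks a segmentation for input without spaces, here is a small Viterbi search over a hand-made vocabulary. It is a sketch under stated assumptions: the log-probabilities are invented, `viterbi_segment` is a name chosen for this example, and the real library's implementation differs in many details.

```python
import math

# Hypothetical piece log-probabilities, invented for this example.
TOY_UNIGRAM = {
    "this": -2.0, "is": -2.0, "a": -2.5, "test": -2.5, "th": -3.0, "at": -3.0,
    "es": -3.5, "t": -4.0, "h": -4.5, "i": -4.0, "s": -4.0, "e": -4.0,
}

def viterbi_segment(text, logprobs):
    """Return the highest-scoring segmentation of text and its score."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score of any split of text[:end]
    best_start = [0] * (n + 1)          # start index of the last piece in that split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprobs and best_score[start] > -math.inf:
                score = best_score[start] + logprobs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]

best_pieces, best_total = viterbi_segment("thisisatest", TOY_UNIGRAM)
print(best_pieces, best_total)  # ['this', 'is', 'a', 'test'] -9.0
```

Because every segmentation has a score under the unigram model, ranking the top few (n-best) or sampling among them follows naturally, which is why those features are tied to the unigram model type.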

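Two model types: character and word

SentencePiece also offers character and word segmentation, selected with model_type=char and model_type=word. Both are less sophisticated than the BPE and unigram subword methods above.
In word segmentation, the input text is split into individual words, and the space is represented as a '▁' attached to the front of each word.
In char segmentation, the input text is split into individual characters, and the space is treated as the character '▁'.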

spm.SentencePieceTrainer.train('--input=./mydataset/botchan.txt --model_prefix=m_char --model_type=char --vocab_size=2000')

sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')

print(sp_char.encode_as_pieces('this is a test.'))
print(sp_char.encode_as_ids('this is a test.'))

spm.SentencePieceTrainer.train('--input=./mydataset/botchan.txt --model_prefix=m_word --model_type=word --vocab_size=2000')

sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')

print(sp_word.encode_as_pieces('this is a test.'))  # '.' will not be one token.
print(sp_word.encode_as_ids('this is a test.'))
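
The shape of the pieces these two modes produce can be imitated without a trained model. The two helpers below are hypothetical and only mirror the description above; adding a leading '▁' in the character case is an assumption that follows the leading '▁' visible in the sampling output earlier in this article.

```python
WS = "\u2581"  # the meta symbol '▁'

def char_pieces(text):
    """Character-level pieces: every character is a piece, spaces become '▁'."""
    return list(WS + text.replace(" ", WS))

def word_pieces(text):
    """Word-level pieces: split on whitespace and prefix each word with '▁'."""
    return [WS + word for word in text.split()]

print(char_pieces("this is a test."))
print(word_pieces("this is a test."))  # ['▁this', '▁is', '▁a', '▁test.']
```

Note how the final '.' stays attached to 'test' in the word-level output, which matches the comment in the example above.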