Deep Learning -- seq2seq RNN English-to-French Translation -- 86


Contents
  • 1. Architecture
  • 2. Code walkthrough

1. Architecture

The model is the classic seq2seq setup: a GRU encoder reads the English sentence and compresses it into its final hidden state, which then initializes a GRU decoder that generates the French sentence word by word.

My own sketch: (architecture diagram not reproduced here)

2. Code walkthrough

Imports

import nltk
import numpy as np
import re
import shutil
import tensorflow as tf
import os
import unicodedata
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

Dataset preprocessing

First, a housekeeping helper that wipes and recreates the checkpoint directory:

def clean_up_logs(data_dir):
    checkpoint_dir = os.path.join(data_dir, "checkpoints")
    if os.path.exists(checkpoint_dir):
        shutil.rmtree(checkpoint_dir, ignore_errors=True)
    os.mkdir(checkpoint_dir)
    return checkpoint_dir

The next function fully preprocesses an input sentence (it is applied to both the English and the French text): it strips accent marks, adds spaces around punctuation, removes any character that is neither a letter nor punctuation, collapses extra whitespace, and lowercases the result.

def preprocess_sentence(sent):
    # strip accents: decompose characters, then drop the combining marks (category "Mn")
    sent = "".join([c for c in unicodedata.normalize("NFD", sent)
                    if unicodedata.category(c) != "Mn"])
    sent = re.sub(r"([!.?])", r" \1", sent)      # put a space before sentence punctuation
    sent = re.sub(r"[^a-zA-Z!.?]+", r" ", sent)  # keep only letters and ! . ?
    sent = re.sub(r"\s+", " ", sent)             # collapse repeated whitespace
    sent = sent.lower()
    return sent
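For example (a quick check of my own, not from the original post):

print(preprocess_sentence("Comment ça va?"))  # -> "comment ca va ?"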

Note the decoder's two sequences here:
The decoder input fr_sent_in needs BOS (begin of sentence) prepended to every sentence.
The label y, fr_sent_out, used for evaluation and the loss computation, needs EOS (end of sentence) appended to every sentence.

def download_and_read():
    en_sents, fr_sents_in, fr_sents_out = [], [], []
    local_file = os.path.join("datasets", "fra.txt")
    with open(local_file, "r", encoding="utf-8") as fin:
        for i, line in enumerate(fin):
            en_sent, fr_sent, *_ = line.strip().split("\t")
            en_sent = [w for w in preprocess_sentence(en_sent).split()]  # clean and tokenize the English side too
            fr_sent = preprocess_sentence(fr_sent)
            fr_sent_in = [w for w in ("BOS " + fr_sent).split()]   # decoder input (French): prepend the BOS marker
            fr_sent_out = [w for w in (fr_sent + " EOS").split()]  # decoder label (French): append the EOS marker
            en_sents.append(en_sent)
            fr_sents_in.append(fr_sent_in)
            fr_sents_out.append(fr_sent_out)
            if i >= NUM_SENT_PAIRS - 1:
                break
    return en_sents, fr_sents_in, fr_sents_out
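For a pair like "Go away!" / "Va-t'en !", the three lists come out roughly as follows (my own illustration of the format, not output from the post):

en_sent     = ["go", "away", "!"]
fr_sent_in  = ["BOS", "va", "t", "en", "!"]
fr_sent_out = ["va", "t", "en", "!", "EOS"]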

The encoder
The encoder's call method takes x and an initial state, and outputs encoder_out and encoder_state.
The RNN itself is the built-in Keras GRU, configured with return_state=True so the final hidden state is returned alongside the output.

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, num_timesteps, encoder_dim, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.encoder_dim = encoder_dim
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_dim, input_length=num_timesteps)
        self.rnn = tf.keras.layers.GRU(
            encoder_dim, return_sequences=False, return_state=True)

    def call(self, x, state):
        x = self.embedding(x)
        x, state = self.rnn(x, initial_state=state)
        return x, state

    def init_state(self, batch_size):
        return tf.zeros((batch_size, self.encoder_dim))
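To see what return_state=True buys us, here is a minimal standalone sketch with toy sizes (my own example, not from the original post):

import tensorflow as tf

gru = tf.keras.layers.GRU(4, return_sequences=False, return_state=True)
x = tf.random.normal((2, 7, 3))  # (batch, timesteps, features)
out, state = gru(x)
print(out.shape)    # (2, 4): only the last-step output, since return_sequences=False
print(state.shape)  # (2, 4): the final hidden state, which for a GRU equals out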

The decoder
Its call method likewise takes x and a state, and returns decoder_out and decoder_state.
The RNN is again the built-in GRU, here configured with return_state=True and return_sequences=True.
The RNN's per-timestep output x is then fed into a fully connected layer, which outputs which word comes next.

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, num_timesteps, decoder_dim, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.decoder_dim = decoder_dim
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_dim, input_length=num_timesteps)
        self.rnn = tf.keras.layers.GRU(
            decoder_dim, return_state=True, return_sequences=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, x, state):
        x = self.embedding(x)
        x, state = self.rnn(x, state)
        x = self.dense(x)
        return x, state
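And the companion sketch for return_sequences=True, which is what lets the Dense layer emit logits at every timestep (again toy sizes of my choosing):

import tensorflow as tf

gru = tf.keras.layers.GRU(4, return_sequences=True, return_state=True)
x = tf.random.normal((2, 7, 3))
seq, state = gru(x)
print(seq.shape)                             # (2, 7, 4): one output per timestep
print(tf.keras.layers.Dense(10)(seq).shape)  # (2, 7, 10): per-timestep logits over a 10-word vocab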

The loss is sparse categorical cross-entropy (SparseCategoricalCrossentropy); only once the loss is computed can we differentiate, obtain gradients, and backpropagate.

def loss_func(ytrue, ypred):
    scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    # tf.math.equal(ytrue, 0): test each element of ytrue against 0 (0 is the padding index).
    # tf.math.logical_not: invert it, giving a boolean tensor where True marks real tokens
    # and False marks padding.
    mask = tf.math.logical_not(tf.math.equal(ytrue, 0))
    # tf.cast: convert the boolean mask to integers (True -> 1, False -> 0)
    mask = tf.cast(mask, dtype=tf.int64)
    # Pass the mask as sample weights so the padded positions contribute nothing to the loss.
    loss = scce(ytrue, ypred, sample_weight=mask)
    return loss
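A quick check of the masking on a toy batch (indices and sizes invented for illustration; reuses loss_func and the tf import from above):

ytrue = tf.constant([[4, 7, 2, 0, 0]], dtype=tf.int64)  # the last two positions are padding
ypred = tf.random.normal((1, 5, 10))                    # logits over a 10-word vocabulary
mask = tf.cast(tf.math.logical_not(tf.math.equal(ytrue, 0)), tf.int64)
print(mask)                     # [[1 1 1 0 0]]: padded steps get zero weight
print(loss_func(ytrue, ypred))  # scalar loss; the padded steps contribute nothing to the sum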

Defining the training step
Note that this is teacher forcing: during training the decoder consumes the ground-truth previous words (decoder_in) rather than its own predictions.

@tf.function
def train_step(encoder_in, decoder_in, decoder_out, encoder_state):
    with tf.GradientTape() as tape:
        encoder_out, encoder_state = encoder(encoder_in, encoder_state)
        decoder_state = encoder_state  # the encoder's final state initializes the decoder
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
        loss = loss_func(decoder_out, decoder_pred)
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)  # TF2 automatic differentiation
    optimizer.apply_gradients(zip(gradients, variables))
    return loss

Inference
At inference time we decode greedily: starting from BOS, take the argmax word at each step and feed it back in as the next input, until EOS is produced.

def predict(encoder, decoder, batch_size, sents_en, data_en, sents_fr_out, word2idx_fr, idx2word_fr):
    # pick one sentence at random
    random_id = np.random.choice(len(sents_en))
    print("Input    : ", " ".join(sents_en[random_id]))
    print("Output   : ", " ".join(sents_fr_out[random_id]))

    encoder_in = tf.expand_dims(data_en[random_id], axis=0)
    decoder_out = tf.expand_dims(sents_fr_out[random_id], axis=0)

    encoder_state = encoder.init_state(1)
    encoder_out, encoder_state = encoder(encoder_in, encoder_state)
    decoder_state = encoder_state

    decoder_in = tf.expand_dims(tf.constant([word2idx_fr["BOS"]]), axis=0)
    pred_sent_fr = []
    while True:
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
        decoder_pred = tf.argmax(decoder_pred, axis=-1)
        pred_word = idx2word_fr[decoder_pred.numpy()[0][0]]
        pred_sent_fr.append(pred_word)
        if pred_word == "EOS":
            break
        decoder_in = decoder_pred  # feed the predicted word back in as the next input

    print("predict: ", " ".join(pred_sent_fr))

Computing the BLEU score

def evaluate_bleu_score(encoder, decoder, test_dataset, word2idx_fr, idx2word_fr):
    bleu_scores = []
    smooth_fn = SmoothingFunction()
    for encoder_in, decoder_in, decoder_out in test_dataset:
        encoder_state = encoder.init_state(batch_size)  # uses the global batch_size
        encoder_out, encoder_state = encoder(encoder_in, encoder_state)
        decoder_state = encoder_state
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
        # compute argmax
        decoder_out = decoder_out.numpy()
        decoder_pred = tf.argmax(decoder_pred, axis=-1).numpy()
        # decoder_out holds y_true
        for i in range(decoder_out.shape[0]):  # one y_true sentence at a time
            ref_sent = [idx2word_fr[j] for j in decoder_out[i].tolist() if j > 0]
            hyp_sent = [idx2word_fr[j] for j in decoder_pred[i].tolist() if j > 0]
            # remove EOS
            ref_sent = ref_sent[0:-1]
            hyp_sent = hyp_sent[0:-1]
            bleu_score = sentence_bleu([ref_sent], hyp_sent,
                                       smoothing_function=smooth_fn.method1)
            bleu_scores.append(bleu_score)
    return np.mean(np.array(bleu_scores))  # average over the test set
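To make the sentence_bleu call concrete, a toy example on hand-made token lists (nothing model-specific; my own illustration):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction()
ref = ["je", "suis", "la", "."]
hyp = ["je", "suis", "la", "."]
print(sentence_bleu([ref], hyp, smoothing_function=smooth.method1))  # 1.0 for an exact match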

Some global variables

NUM_SENT_PAIRS = 30000
EMBEDDING_DIM = 256
ENCODER_DIM, DECODER_DIM = 1024, 1024
BATCH_SIZE = 64
NUM_EPOCHS = 30
NUM_EPOCHS = 5  # overrides the value above for a quicker run

tf.random.set_seed(30)
data_dir = "datasets"
checkpoint_dir = clean_up_logs(data_dir)

# dataset preparation: download_and_read() reads the local datasets/fra.txt,
# extracted beforehand from the zip below
download_url = "http://www.manythings.org/anki/fra-eng.zip"
sents_en, sents_fr_in, sents_fr_out = download_and_read()

Tokenizing the samples

tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters="", lower=False)
tokenizer_en.fit_on_texts(sents_en)
data_en = tokenizer_en.texts_to_sequences(sents_en)
data_en = tf.keras.preprocessing.sequence.pad_sequences(data_en, padding="post")

tokenizer_fr = tf.keras.preprocessing.text.Tokenizer(filters="", lower=False)
tokenizer_fr.fit_on_texts(sents_fr_in)
tokenizer_fr.fit_on_texts(sents_fr_out)
data_fr_in = tokenizer_fr.texts_to_sequences(sents_fr_in)
data_fr_in = tf.keras.preprocessing.sequence.pad_sequences(data_fr_in, padding="post")
data_fr_out = tokenizer_fr.texts_to_sequences(sents_fr_out)
data_fr_out = tf.keras.preprocessing.sequence.pad_sequences(data_fr_out, padding="post")

vocab_size_en = len(tokenizer_en.word_index)
vocab_size_fr = len(tokenizer_fr.word_index)
word2idx_en = tokenizer_en.word_index
idx2word_en = {v: k for k, v in word2idx_en.items()}
word2idx_fr = tokenizer_fr.word_index
idx2word_fr = {v: k for k, v in word2idx_fr.items()}
print(f"Vocab size (en): {vocab_size_en}")
print(f"Vocab size (fr): {vocab_size_fr}")

maxlen_en = data_en.shape[1]
maxlen_fr = data_fr_out.shape[1]
print(f"seq len (en): {maxlen_en}")
print(f"seq len (fr): {maxlen_fr}")

Splitting the dataset

batch_size = BATCH_SIZE
dataset = tf.data.Dataset.from_tensor_slices((data_en, data_fr_in, data_fr_out))
dataset = dataset.shuffle(10000)
test_size = NUM_SENT_PAIRS // 4
test_dataset = dataset.take(test_size).batch(batch_size, drop_remainder=True)
train_dataset = dataset.skip(test_size).batch(batch_size, drop_remainder=True)
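One caveat worth flagging (my note, not from the original post): shuffle() defaults to reshuffle_each_iteration=True, so the take()/skip() split above is re-drawn every time the datasets are iterated, and test sentences can leak into training across epochs. A sketch of a frozen split (the seed value 42 is arbitrary):

dataset = tf.data.Dataset.from_tensor_slices((data_en, data_fr_in, data_fr_out))
dataset = dataset.shuffle(10000, seed=42, reshuffle_each_iteration=False)  # fix the shuffle order
test_dataset = dataset.take(test_size).batch(batch_size, drop_remainder=True)
train_dataset = dataset.skip(test_size).batch(batch_size, drop_remainder=True)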

Checking the encoder/decoder input and output shapes

# check encoder/decoder dimensions
embedding_dim = EMBEDDING_DIM
encoder_dim, decoder_dim = ENCODER_DIM, DECODER_DIM

encoder = Encoder(vocab_size_en + 1, embedding_dim, maxlen_en, encoder_dim)  # +1 for the padding index 0
decoder = Decoder(vocab_size_fr + 1, embedding_dim, maxlen_fr, decoder_dim)

optimizer = tf.keras.optimizers.Adam()
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)

for encoder_in, decoder_in, decoder_out in train_dataset:
    encoder_state = encoder.init_state(batch_size)
    encoder_out, encoder_state = encoder(encoder_in, encoder_state)
    decoder_state = encoder_state
    decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
    break
print("encoder input         :", encoder_in.shape)
print("encoder output        :", encoder_out.shape, "state:    ", encoder_state.shape)
print("decoder output (logits)       :", decoder_pred.shape, "state:    ", decoder_state.shape)
print("decoder output (labels)       :", decoder_out.shape)

Training

# training step
num_epochs = NUM_EPOCHS
for e in range(num_epochs):
    encoder_state = encoder.init_state(batch_size)
    for batch, data in enumerate(train_dataset):
        encoder_in, decoder_in, decoder_out = data
        # decoder_out is the label value
        # decoder_in is fed to the decoder, which returns decoder_pred and its state
        # print(encoder_in.shape, decoder_in.shape, decoder_out.shape)
        loss = train_step(encoder_in, decoder_in, decoder_out, encoder_state)

    print("Epoch: {}, Loss: {:.4f}".format(e + 1, loss.numpy()))
    if e % 10 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    predict(encoder, decoder, batch_size, sents_en, data_en, sents_fr_out, word2idx_fr, idx2word_fr)
    eval_score = evaluate_bleu_score(encoder, decoder, test_dataset, word2idx_fr, idx2word_fr)
    print("Eval Score (BLEU): {:.3e}".format(eval_score))

checkpoint.save(file_prefix=checkpoint_prefix)
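To reuse the trained weights later without retraining, the saved checkpoint can be restored through the standard tf.train API (a sketch; it assumes the Encoder, Decoder, and optimizer objects were constructed the same way as above):

checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))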
