Paper: LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (arXiv:2404.05961)
This model applies LLM2Vec to Swallow. Only the PEFT adapter is distributed. LLM2Vec fine-tuning has two stages, MNTP (masked next token prediction) and SimCSE (contrastive learning), and this repository contains the adapter obtained by applying SimCSE after MNTP. For the MNTP adapter, please refer to this link.
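
For readers unfamiliar with the second stage, the following is a minimal sketch of a SimCSE-style contrastive (InfoNCE) loss in plain Python. It is not the training code used for this adapter: the function names `cosine` and `simcse_loss` are hypothetical, the temperature of 0.05 is an assumed default, and the embeddings are assumed to be non-zero vectors. Each sentence is encoded twice, and the loss rewards the pair from the same sentence for being more similar than pairs formed with other sentences in the batch.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(
    z1: List[Sequence[float]],
    z2: List[Sequence[float]],
    temperature: float = 0.05,  # assumed value, for illustration only
) -> float:
    """Mean InfoNCE loss; z1[i] and z2[i] are two encodings of sentence i."""
    losses = []
    for i, anchor in enumerate(z1):
        # Similarity of the anchor to every candidate in the batch.
        logits = [cosine(anchor, cand) / temperature for cand in z2]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching candidate.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    z = [[1.0, 0.0], [0.0, 1.0]]
    print(simcse_loss(z, z))        # pairs aligned: loss close to zero
    print(simcse_loss(z, z[::-1]))  # pairs swapped: loss is large
```

The loss is near zero when each sentence's two encodings match and grows when the matching encoding is less similar than the other candidates in the batch.
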
| | Classification | Clustering | PairClassification | Reranking | BitextMining | Retrieval | STS | Average |
|---|---|---|---|---|---|---|---|---|
| Llama2-Llm2vec-eng | 0.527 | 0.258 | 0.501 | 0.217 | 0.275 | 0.296 | 0.765 | 0.408 |
| Llama2-Llm2vec-jpn | 0.570 | 0.365 | 0.510 | 0.349 | 0.470 | 0.417 | 0.795 | 0.498 |
| Swallow-Llm2vec-jpn (This repo) | 0.621 | 0.391 | 0.510 | 0.475 | 0.475 | 0.491 | 0.832 | 0.523 |

| | Classification | Clustering | Pair_Classification | Reranking | Retrieval | STS | Average |
|---|---|---|---|---|---|---|---|
| Llama2-Llm2vec-eng | 0.709 | 0.386 | 0.780 | 0.588 | 0.329 | 0.723 | 0.586 |
| Llama2-Llm2vec-jpn | 0.722 | 0.428 | 0.785 | 0.594 | 0.371 | 0.717 | 0.603 |
| Swallow-Llm2vec-jpn (This repo) | 0.695 | 0.385 | 0.751 | 0.576 | 0.318 | 0.710 | 0.572 |

    import argparse
    import random
    import re
    from pathlib import Path

    from datasets import load_dataset
    from tqdm import tqdm


    def main(args):
        random.seed(args.seed)
        # Load the Japanese Wikipedia dump and sample N random articles.
        wiki_ds = load_dataset("wikimedia/wikipedia", "20231101.ja")
        sampled_index = random.sample(range(len(wiki_ds["train"])), args.N)
        sample_wiki = wiki_ds["train"][sampled_index]

        output_texts = []
        for title, text in tqdm(
            zip(sample_wiki["title"], sample_wiki["text"]), total=args.N
        ):
            output_texts.append(title)
            # Split the article body on newlines and the Japanese full stop.
            sentences = re.split("[\n。]", text)
            for sentence in sentences:
                # Keep only sentences longer than the minimum length.
                if len(sentence) > args.min_sentence_len:
                    output_texts.append(sentence.strip() + "。")

        # Write one title or sentence per line; UTF-8 is set explicitly for Japanese text.
        with args.output_path.open(mode="w", encoding="utf-8") as f:
            for line in output_texts:
                f.write(line)
                f.write("\n")


    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--N", default=200000, type=int)
        parser.add_argument("--seed", default=42, type=int)
        parser.add_argument("-o", "--output_path", type=Path, required=True)
        parser.add_argument("--min_sentence_len", default=50, type=int)
        args = parser.parse_args()
        main(args)

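
The sentence-splitting step of the script above can be tried in isolation, without downloading Wikipedia. The sketch below copies that logic into a helper named `split_sentences` (a hypothetical name, not part of the original script) and runs it on a made-up string.

```python
import re
from typing import List


def split_sentences(text: str, min_sentence_len: int = 50) -> List[str]:
    """Split text on newlines and "。", keeping only the longer pieces."""
    sentences = re.split("[\n。]", text)
    # As in the script, the length check runs before strip(), and the
    # full stop removed by the split is appended again.
    return [s.strip() + "。" for s in sentences if len(s) > min_sentence_len]


if __name__ == "__main__":
    # Made-up input: one short sentence, one long sentence, one short fragment.
    sample = "短い文。" + "あ" * 60 + "。\n" + "い" * 10
    for sentence in split_sentences(sample):
        print(len(sentence), sentence[:10])
```

With the default threshold of 50 characters, short sentences and headings inside the article body are dropped. Article titles are not affected, because the script appends them before splitting.
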
Base model: tokyotech-llm/Swallow-7b-hf