FLiP: FactLoLM (English, SONAR)

Factorized Log-linear Models trained to recover lexical content (keywords) from SONAR sentence embeddings, as described in:

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz. FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings. Speech@FIT, Brno University of Technology. arXiv:2604.18109

Code: BUTSpeechFIT/FLiP · Data: BUT-FIT/FLiP-data


Available variants

Trained on Mozilla Common Voice v15 (English), vocabulary of 100 000 unigrams.

| Path | Rank | Params (M) | Model size |
|---|---|---|---|
| mcv15/rank-512/ | 512 | ~153 | 207 MB |
| mcv15/rank-1024/ | 1024 | ~305 | 414 MB |

Each variant directory contains:

  • model.safetensors: E1 (vocab × rank), E2 (rank × 1024), b (vocab × 1)
  • vocab.json: {"word": index, ...} (100 000 entries)
  • config.json: training hyperparameters

Model architecture

FactLoLM scores every vocabulary word $w$ for an input sentence embedding $\mathbf{s} \in \mathbb{R}^{1024}$:

$$\eta_w = (\mathbf{E}_1 \mathbf{E}_2\, \mathbf{s})_w + b_w$$

where $\mathbf{E}_1 \in \mathbb{R}^{V \times r}$ and $\mathbf{E}_2 \in \mathbb{R}^{r \times 1024}$ form a rank-$r$ factorization of the full projection matrix. Keywords are the top-$n$ words by logit score.
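As a sanity check on the shapes, the factorized scoring can be sketched with random tensors (dimensions taken from the rank-512 variant above; the tensor names mirror the keys in model.safetensors):

```python
import torch

V, r, d = 100_000, 512, 1024  # vocab size, rank, SONAR embedding dim

E1 = torch.randn(V, r)   # vocab x rank
E2 = torch.randn(r, d)   # rank x embedding dim
b = torch.randn(V, 1)    # per-word bias

s = torch.randn(d)       # one SONAR sentence embedding

# Factorized scoring: project s down to rank r first, then up to the
# vocabulary, so the full V x 1024 matrix is never materialized.
eta = E1 @ (E2 @ s) + b.squeeze(-1)   # (V,) logit scores

# Keywords are the indices of the top-n logits.
top_ids = eta.topk(10).indices
```

Grouping the matmul as `E1 @ (E2 @ s)` keeps the intermediate at size $r$, which is the point of the factorization.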


Usage

import json
import torch
from safetensors.torch import load_file

# Load
tensors = load_file("mcv15/rank-512/model.safetensors")  # after hf download
E1, E2, b = tensors["E1"], tensors["E2"], tensors["b"]   # CPU tensors

with open("mcv15/rank-512/vocab.json") as f:
    vocab = json.load(f)                       # {word: index}
int2word = {v: k for k, v in vocab.items()}

def extract_keywords(embedding: torch.Tensor, topn: int = 10):
    """embedding: (1024,) SONAR sentence embedding (speech or text)."""
    # Project down to rank r first so the full V x 1024 matrix is never formed.
    scores = E1 @ (E2 @ embedding) + b.squeeze(-1)
    top_ids = scores.topk(topn).indices.tolist()
    return [int2word[i] for i in top_ids]
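For scoring many sentences at once, the same matmul batches naturally. A minimal sketch (`extract_keywords_batch` is a hypothetical helper, not part of the released code; shown here with toy random weights):

```python
import torch


def extract_keywords_batch(E1, E2, b, int2word, embeddings, topn=10):
    """embeddings: (B, 1024) batch of SONAR sentence embeddings."""
    # (B, 1024) @ (1024, r) @ (r, V) -> (B, V); keeping the low-rank
    # intermediate avoids materializing the V x 1024 matrix.
    scores = (embeddings @ E2.T) @ E1.T + b.squeeze(-1)
    top_ids = scores.topk(topn, dim=-1).indices
    return [[int2word[i] for i in row.tolist()] for row in top_ids]


# Toy check with random weights and a 3-sentence batch
V, r, d = 50, 8, 1024
E1, E2, b = torch.randn(V, r), torch.randn(r, d), torch.randn(V, 1)
int2word = {i: f"word{i}" for i in range(V)}
kws = extract_keywords_batch(E1, E2, b, int2word, torch.randn(3, d), topn=5)
```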

Or use the full FLiP pipeline (recommended):

python scripts/evaluate.py \
    --sdict mcv15/rank-512/model.safetensors \
    --vocab mcv15/rank-512/vocab.json \
    --data_yaml configs/datasets.yaml \
    --dataset mcv_15_en_test \
    --topn 10 --metrics all

Results

Keyword extraction on MCV15 EN (SONAR embeddings):

| Model | Text acc. | Speech acc. |
|---|---|---|
| LiP (full-rank baseline) | 59.45 | 57.27 |
| FLiP rank-512 | 76.77 | 73.62 |
| FLiP rank-1024 | 77.29 | 74.09 |

Span-aware accuracy vs SpLiCE (10 k concept vocabulary):

| Method | Text | Speech |
|---|---|---|
| SpLiCE | 29.58 | 28.21 |
| FLiP | 61.45 | 58.83 |

Citation

@misc{kesiraju2026flip,
  title         = {{FLiP}: Towards understanding and interpreting multimodal multilingual sentence embeddings},
  author        = {Kesiraju, Santosh and Yusuf, Bolaji and Sedl{\'{a}}{\v{c}}ek, {\v{S}}imon and Plchot, Old{\v{r}}ich and Schwarz, Petr},
  year          = {2026},
  eprint        = {2604.18109},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.18109},
}

License

MIT β€” see LICENSE.
