GLiNER-bi-base-v2.0 → LiteRT

LiteRT (.tflite) exports of knowledgator/gliner-bi-base-v2.0 for zero-shot NER on Android (Pixel-class hardware) via the LiteRT runtime. Matches upstream numerics at float32 precision (max diff 2.8e-05 vs PyTorch reference, over diverse test inputs).

Architecture

GLiNER-bi-v2 is a bi-encoder span-classification NER model:

  • Text encoder: jhu-clsp/ettin-encoder-150m (ModernBERT). Output hidden dim 768 (projected to 512).
  • Labels encoder: BAAI/bge-small-en-v1.5 (BERT). Output hidden dim 384 (projected to 512). Mean-pooled to produce one embedding per entity-type string.
  • Span head: enumerates all (start, width) spans up to width 12, projects via 2-layer MLPs, scores against each label embedding via dot product.
  • Output: logits[batch, word_idx, width_idx, label_idx] β€” apply sigmoid
    • threshold to decide if span is an entity of that type.

Note on the BiLSTM (if you're wondering why there's no LSTM here)

The upstream checkpoint contains trained BiLSTM weights, but BiEncoderSpanModel.forward in gliner/modeling/base.py never calls self.rnn(...). Verified four ways:

  1. Code grep: self.rnn(...) appears in BaseUniEncoderModel but not BaseBiEncoderModel.
  2. Forward hook: zero calls to inner.rnn.lstm during gm.predict_entities(...).
  3. Numerics: our no-LSTM wrapper gives diff = 0.000000 vs upstream inner(...).logits.
  4. Weight ablation: zeroing or randomizing LSTM weights produces bit-identical predictions (same entities, same scores to 4 decimals).

So the deployed model's accuracy is achieved without the LSTM. We don't ship it.
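The forward-hook check (item 2 above) is easy to reproduce. A self-contained toy sketch of the technique, using a stand-in module rather than the actual GLiNER classes (the class and variable names here are illustrative only):

```python
import torch
import torch.nn as nn

# Toy stand-in: a module that owns an LSTM but never calls it in forward(),
# mirroring the upstream BiEncoderSpanModel situation.
class DeadLSTMModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(8, 8, batch_first=True)  # weights exist but are dead
        self.proj = nn.Linear(8, 8)

    def forward(self, x):
        return self.proj(x)  # forward path skips self.lstm entirely

model = DeadLSTMModel()
calls = {"n": 0}
hook = model.lstm.register_forward_hook(lambda m, i, o: calls.update(n=calls["n"] + 1))
model(torch.randn(1, 4, 8))
hook.remove()
print(calls["n"])  # 0 -> the LSTM is never invoked
```

Against the real checkpoint, the same hook on `inner.rnn.lstm` during `gm.predict_entities(...)` fires zero times.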

Files

Text encoder (per-query hot path)

| file | size | purpose |
|---|---|---|
| text_encoder_seq256_fp32.tflite | 656.6 MB | baseline precision, seq_len 256 |
| text_encoder_seq256_int8.tflite | 182.6 MB | int8 weight-only, seq_len 256 |
| text_encoder_seq512_fp32.tflite | 677.2 MB | seq_len 512 |
| text_encoder_seq512_int8.tflite | 203.1 MB | seq_len 512 |
| text_encoder_seq1024_fp32.tflite | 718.4 MB | seq_len 1024 |
| text_encoder_seq1024_int8.tflite | 244.3 MB | seq_len 1024 |
| text_encoder_seq2048_fp32.tflite | 800.7 MB | seq_len 2048 |
| text_encoder_seq2048_int8.tflite | 326.6 MB | seq_len 2048 |
| text_encoder_seq4096_fp32.tflite | 965.3 MB | seq_len 4096 |
| text_encoder_seq4096_int8.tflite | 491.3 MB | seq_len 4096 |
| text_encoder_seq8192_fp32.tflite | 1043.0 MB | seq_len 8192 (ModernBERT max position 7999) |
| text_encoder_seq8192_int8.tflite | 568.9 MB | seq_len 8192 |

The seq_len is baked into each graph — pick the smallest variant ≥ your tokenized input length to minimize compute.
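That selection is a simple lookup; a minimal sketch, with the variant list taken from the table above:

```python
# Baked-in seq_len variants shipped in this repo (see the files table).
SEQ_VARIANTS = [256, 512, 1024, 2048, 4096, 8192]

def pick_variant(num_tokens: int) -> int:
    """Return the smallest baked-in seq_len that fits the tokenized input."""
    for s in SEQ_VARIANTS:
        if num_tokens <= s:
            return s
    raise ValueError(f"input of {num_tokens} tokens exceeds the largest variant (8192)")

print(pick_variant(300))   # 512
print(pick_variant(4096))  # 4096
```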

Labels encoder (one-shot at bridge startup)

| file | size | purpose |
|---|---|---|
| labels_encoder_fp32.tflite | 133.4 MB | encodes up to 25 entity-type strings in parallel |

Tokenizers

| dir | for |
|---|---|
| text_tokenizer/ | ModernBERT tokenizer — tokenize input text |
| labels_tokenizer/ | BERT tokenizer — tokenize entity-type strings |

Config

| file | purpose |
|---|---|
| gliner_config.json | reference config (max_types=25, max_width=12, span_mode=markerV0) |

Input / Output contract

Labels encoder

Inputs (fixed shape, [25, 32]):

  • input_ids int64 [N=25, L=32] β€” tokenized entity-type strings, padded to 32 tokens; pad rows with empty strings
  • attention_mask int64 [25, 32]

Output:

  • labels_embeds float32 [25, 384 (projected to 512)] β€” one embedding per label row

Run once per distinct label set. Cache the output tensor on the bridge. Unused rows (beyond your actual N) are ignored downstream.

Text encoder

Inputs (fixed shape at compile time, chosen by seq_len variant):

| name | shape | dtype | description |
|---|---|---|---|
| input_ids | [1, S] | int64 | tokenized text (ModernBERT tokenizer) |
| attention_mask | [1, S] | int64 | 1 for real tokens, 0 for padding |
| first_subword_positions | [1, S] | int64 | for each word slot, the sequence position of its first subword token; pad with 0 for unused slots |
| word_valid_mask | [1, S] | float32 | 1.0 for word slots [0..W_real), 0.0 for pad slots [W_real..S) |
| labels_embeds | [1, 25, 512] | float32 | precomputed by the labels encoder (add a leading batch dim to its [25, 512] output) |
| span_idx | [1, S*12, 2] | int64 | (start, end) word-index pairs for each candidate span |

Output:

  • logits float32 [1, S, 12, 25] β€” score for each (word_idx, width_idx, label_idx) triple. Apply sigmoid and threshold (typically 0.3–0.5) to decide entity presence. Only spans where span_idx[_, :, 0] < W_real AND span_idx[_, :, 1] < W_real are meaningful; the rest are computed but ignored downstream.

Numerics

Validated with ai_edge_litert.Interpreter (LiteRT desktop build, XNNPACK CPU):

| quant | max diff vs upstream PyTorch (valid spans, logit scale) | decision-flip rate @ threshold=0.3 |
|---|---|---|
| fp32 | 2.8e-05 (float32 accumulation noise) | 0 / 366 spans |
| int8 | 5.6e-01 (pre-sigmoid logits) | 0 / 366 spans (no score drift crosses the threshold) |

Test inputs were mixed English sentences (Elon Musk, Barack Obama, Satoshi Nakamoto, etc.) with 3-6 entity types each.

Usage sketch (Python reference)

import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

# --- one-time setup ---
text_tok = AutoTokenizer.from_pretrained("ckg/gliner-bi-base-v20-litert", subfolder="text_tokenizer")
labels_tok = AutoTokenizer.from_pretrained("ckg/gliner-bi-base-v20-litert", subfolder="labels_tokenizer")

labels_interp = Interpreter(model_path="labels_encoder_fp32.tflite")
labels_interp.allocate_tensors()

text_interp = Interpreter(model_path="text_encoder_seq512_fp32.tflite")
text_interp.allocate_tensors()

# --- precompute label embeddings ONCE ---
labels = ["person", "organization", "location"]
LABELS_PAD = labels + [""] * (25 - len(labels))
enc = labels_tok(LABELS_PAD, padding="max_length", max_length=32,
                  truncation=True, return_tensors="np")
# NOTE: assumes input order (input_ids, attention_mask); in production, match
# get_input_details() entries by their "name" field instead of position.
labels_in0 = labels_interp.get_input_details()[0]["index"]
labels_in1 = labels_interp.get_input_details()[1]["index"]
labels_out = labels_interp.get_output_details()[0]["index"]
labels_interp.set_tensor(labels_in0, enc["input_ids"].astype(np.int64))
labels_interp.set_tensor(labels_in1, enc["attention_mask"].astype(np.int64))
labels_interp.invoke()
labels_embeds = labels_interp.get_tensor(labels_out)[None, ...]  # [1, 25, D]

# --- per-query inference ---
# ... tokenize text + precompute first_subword_positions + word_valid_mask + span_idx
# ... text_interp.set_tensor(...) for each of the 6 inputs
# ... invoke, read logits, sigmoid + threshold, decode spans to (start_char, end_char, label, score)
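The final decode step (sigmoid + threshold over valid spans) can be sketched independently of the runtime. `decode_spans` is an illustrative helper, `logits` below is a dummy array with the contract's shape, the 0.3 threshold follows the numerics table, and mapping word indices back to character offsets is left out:

```python
import numpy as np

def decode_spans(logits, labels, w_real, threshold=0.3):
    """logits: [1, S, 12, 25] raw scores -> list of (start_word, end_word, label, score)."""
    probs = 1.0 / (1.0 + np.exp(-logits[0]))       # sigmoid over all candidate spans
    hits = []
    for start, width, li in zip(*np.where(probs >= threshold)):
        end = start + width                         # inclusive end-word index
        if end < w_real and li < len(labels):       # drop pad spans / pad label rows
            hits.append((int(start), int(end), labels[li],
                         float(probs[start, width, li])))
    return sorted(hits, key=lambda h: -h[3])

# Toy check with one strong span: words 0-1 scored as "person".
logits = np.full((1, 512, 12, 25), -10.0, dtype=np.float32)
logits[0, 0, 1, 0] = 5.0
print(decode_spans(logits, ["person", "organization", "location"], w_real=7))
```

Overlapping hits can then be resolved greedily by score, which is the usual GLiNER post-processing step.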

Provenance

Converted with litert-torch 0.8.0 from the upstream knowledgator/gliner-bi-base-v2.0 checkpoint. No retraining, no weight modification — only a graph restructuring:

  1. Skip the dead BiLSTM (confirmed unused by the upstream forward path)
  2. Replace upstream's dynamic-shape torch.where-based word/prompt extraction with static gather from precomputed positions (first_subword_positions)
  3. Replace 4D einsum scoring with 2D bmm + view

Total params shipped: 194M (upstream) → ~187M in the tflites (labels encoder ~33M, text encoder ~150M, span head ~4M, prompt MLP ~1M). The ~1M unused BiLSTM weights are excluded.

License

Apache-2.0, propagated from knowledgator/gliner-bi-base-v2.0.
