Computational Linguistics

The Science of Detection: Entropy & Burstiness

How we distinguish between the statistical probability of machine generation and the chaotic creativity of human thought.

The Detection Pipeline

Normalization

Stripping zero-width characters and normalizing homoglyphs.

Tokenization

Breaking text into semantic units for vector embedding.

Entropy Scoring

Measuring predictability (Perplexity) and sentence-length variation (Burstiness).

Verification

Cross-referencing against known LLM fingerprints.

Linguistic Entropy

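Perplexity, a measure closely related to entropy, captures how predictable a text is to a language model: the lower the perplexity, the less the model is surprised by each next word. AI models are designed to minimize surprise, so they tend to choose highly probable next words based on vast training data.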

Humans, conversely, are less predictable. We use creative metaphors, slang, and non-linear logic that a language model finds statistically surprising. High perplexity therefore usually indicates human authorship.
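As a rough illustration of the idea above, perplexity can be computed as the exponential of the average negative log-probability per token. The sketch below assumes the per-token probabilities have already been obtained from some scoring model; the function name `perplexity` and the sample probabilities are hypothetical and do not describe CrossPlag's engine.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Hypothetical probabilities a scoring model might assign to each token.
predictable = [0.9, 0.8, 0.85, 0.9]   # the model is rarely surprised
surprising = [0.2, 0.05, 0.1, 0.3]    # the model is often surprised

print(perplexity(predictable))  # lower value: more machine-like under this heuristic
print(perplexity(surprising))   # higher value: more human-like under this heuristic
```

A text whose tokens each receive probability 0.5 has a perplexity of exactly 2, which is one way to read the number: the model is, on average, as uncertain as if it were choosing between two equally likely words.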

[Interactive figure: Perplexity visualizer, a scale from Low Randomness (AI) to High Randomness (Human). The sample sentence “The quick brown fox jumps over the lazy dog.” is scored “Likely AI Generated”.]

[Figure: Sentence Structure Analysis, comparing AI (Monotonic Rhythm) with Human (High Burstiness).]

Structural Rhythm (Burstiness)

While Perplexity measures word choice, Burstiness measures variation in sentence structure. AI models tend to generate sentences of similar length and standard structure, giving the text a uniform (monotonic) rhythm.

Humans write in “bursts”: a short sentence followed by a long, complex one. This structural variation creates a rhythm that current LLMs struggle to replicate authentically without explicit prompting.
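One simple way to put a number on this rhythm is the coefficient of variation of sentence length (standard deviation divided by mean). That choice of metric is an assumption made for this sketch, as are the naive punctuation-based sentence splitter and the function names; a production system would likely use a proper sentence segmenter.

```python
import re
import statistics

def sentence_lengths(text):
    """Split on ., !, ? and return the word count of each non-empty sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Coefficient of variation of sentence length (population std dev / mean)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.pstdev(lengths) / statistics.mean(lengths)

uniform = ("The cat sat on the mat. The dog lay on the rug. "
           "The bird sat on the branch.")
varied = ("I stopped. The rain, which had been falling since dawn without "
          "any sign of letting up, finally broke. Silence.")

print(sentence_lengths(uniform), burstiness(uniform))
print(sentence_lengths(varied), burstiness(varied))
```

The uniform sample has three six-word sentences and scores zero, while the varied sample mixes one- and two-word sentences with a sixteen-word one and scores much higher.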

The Signature Engine

Beyond general entropy, CrossPlag maintains a database of Model Fingerprints. Each AI model family (GPT, Claude, Gemini) has characteristic biases: preferred transition words, specific sentence starters, and refusal patterns.

Our engine scans for these micro-signatures not only to detect AI, but also to identify which model was likely used.
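The general idea of fingerprint matching can be sketched as a lookup of marker phrases per model. Everything in the example below is hypothetical: the labels `model_a` and `model_b`, the phrase lists, and the scoring rule are invented for illustration and are not drawn from CrossPlag's database or from any measured behavior of real models.

```python
# Hypothetical fingerprints: marker phrases per model label.
# Labels and phrases are invented for illustration only.
FINGERPRINTS = {
    "model_a": ["it is important to note", "in conclusion", "delve into"],
    "model_b": ["certainly", "here is a summary", "as an ai"],
}

def fingerprint_scores(text, fingerprints=FINGERPRINTS):
    """Return, per label, the fraction of its marker phrases found in the text."""
    lowered = text.lower()
    scores = {}
    for label, phrases in fingerprints.items():
        hits = sum(1 for phrase in phrases if phrase in lowered)
        scores[label] = hits / len(phrases)
    return scores

def best_match(text, fingerprints=FINGERPRINTS):
    """Return (label, score) for the top-scoring fingerprint, or None if no hits."""
    scores = fingerprint_scores(text, fingerprints)
    label = max(scores, key=scores.get)
    return (label, scores[label]) if scores[label] > 0 else None

sample = "It is important to note that we delve into details."
print(best_match(sample))
print(best_match("Hello world."))
```

Simple substring matching like this is easy to fool and would produce false positives on ordinary human prose, so it is best read as a picture of the concept rather than a workable detector.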


Adversarial Robustness

Students and bad actors often attempt to bypass detection using “adversarial attacks.” Our engine includes a dedicated pre-processing layer to neutralize these evasion techniques before analysis begins.

Homoglyph Attacks

Replacing Latin characters (e.g., ‘a’, ‘o’) with visually identical Cyrillic counterparts. Our system maps these look-alike characters to a canonical form.

Zero-Width Spaces

Injecting invisible characters to break word tokenization. We strip all non-printing characters during the normalization phase.

Prompt Injection

Detecting artifacts of prompts designed to confuse detectors (e.g., “Write with high burstiness”).
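The first two defenses above can be sketched as a small normalization pass. This is a minimal example under stated assumptions, not our production layer: the `ZERO_WIDTH` set and the five-entry `HOMOGLYPHS` table are illustrative subsets, and the function name `normalize` is hypothetical. Only `unicodedata.normalize` is a standard-library call.

```python
import unicodedata

# Zero-width and invisible formatting characters often used for evasion.
# Illustrative subset, not an exhaustive list.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# Tiny illustrative homoglyph table: Cyrillic letter -> Latin look-alike.
# A real table would be far larger; this subset is an assumption.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
}

def normalize(text):
    """Apply NFKC, drop zero-width characters, and map known homoglyphs to Latin."""
    text = unicodedata.normalize("NFKC", text)
    cleaned = []
    for ch in text:
        if ch in ZERO_WIDTH:
            continue  # invisible character: remove it entirely
        cleaned.append(HOMOGLYPHS.get(ch, ch))
    return "".join(cleaned)

print(normalize("d\u200bog"))   # zero-width space removed
print(normalize("c\u0430t"))    # Cyrillic a replaced with Latin a
```

Mapping every Cyrillic look-alike to Latin would corrupt genuinely Russian or Ukrainian text, so a practical version would probably apply the table only to words that mix scripts.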

Content Taxonomy

The four distinct classifications used in our Institutional Reports.

Pure Human

High perplexity and high burstiness. Contains natural inconsistencies and creative logic jumps.

AI-Edited

Human logic with AI-smoothed syntax. Often seen when tools like Grammarly are overused.

Paraphrased

AI text passed through re-phrasers (e.g., Quillbot). Identifiable by “unnatural smoothness” or synonym stuffing.

AI-Generated

Low perplexity, monotonic rhythm. Closely matches the statistical patterns of major LLMs.
