May 9, 2026

Cheaper LLMs won't reach your clients this year

SubQ promises LLMs that are 1,000x cheaper thanks to a subquadratic architecture. But between the private beta and researcher skepticism, your clients won't see anything concrete for a long time.

Vincent

AI expert, AI-First

SubQ is billed as the first fully subquadratic LLM: 12M tokens of context and a 1,000x lower attention cost, according to Subquadratic. In this article: the RULER (95%), MRCR and SWE-Bench benchmarks decoded, and why researchers are demanding independent proof.

SubQ is the first LLM billed as fully subquadratic. Launched in May 2026 by the Miami startup Subquadratic, it claims a 12-million-token context window at 1,000 times lower attention cost than standard transformers, thanks to an architecture called Subquadratic Sparse Attention (SSA). These figures come from the company's own benchmarks and have not yet been independently reproduced.

A startup from Miami appears out of nowhere, raises 29 million dollars and announces it has solved the problem that has been dragging down AI economics since 2017. SubQ promises costs divided by 1,000 on long contexts, a 12-million-token window and an architecture that the major labs supposedly never managed to make work. If it's true, it's the breakthrough of the decade. If it's not, it's well-packaged vaporware. And either way, it won't change anything for your AI projects this year.

  • ⚠️ Unverified promise: no technical report published, closed weights, private beta only.
  • 📉 Unfavorable track record: Mamba, RWKV, DeepSeek Sparse Attention. So far, no subquadratic attempt has matched dense transformers at frontier scale.
  • 💡 Wrong bottleneck: for an SME, model cost matters less than integration cost.
  • 🎯 Immediate action: existing models, properly integrated, already deliver measurable value.

SubQ: the startup promising to cut costs by 1,000x

SubQ is the model from the Miami startup Subquadratic, co-founded by Justin Dangel (CEO) and Alexander Whedon (CTO, former Head of Generative AI at Meta), which raised 29 million dollars in seed funding in May 2026. It claims to be the first LLM built on an entirely subquadratic architecture, with attention costs divided by 1,000 on long context windows. These claims have not been independently verified to date.

On May 5, 2026, Subquadratic emerged from stealth mode. The company, co-founded by Justin Dangel (CEO) and Alexander Whedon (CTO, former Head of Generative AI at Meta), announced SubQ 1M-Preview: the first LLM built on a fully subquadratic attention architecture.

The pitch fits in one sentence: where standard transformers compare every token to every other token (quadratic cost), SubQ selects only the relevant relationships. Announced result: a cost that grows linearly instead of quadratically.

How does the SSA architecture work?

Standard attention in a transformer is dense. Every token looks at every other token. Double the input, and computation quadruples. That's the quadratic wall.

SubQ replaces this with what they call Subquadratic Sparse Attention (SSA). For each token, the model dynamically selects a small subset of relevant positions, then computes exact attention only on those. This is not fixed sparse attention like Longformer, nor a state-space approach like Mamba. SSA keeps the attention mechanism but makes it selective.
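To make the idea of "select a few positions, then compute exact attention on those" concrete, here is a minimal sketch in plain Python. It is not Subquadratic's implementation: the technical report is unpublished, so the selector below (a brute-force top-k over all keys) and the names `sparse_attention` and `k` are my own assumptions. Note that this naive selector still scans every key, so the sketch only illustrates the exact-attention-on-a-subset step, not a genuinely subquadratic pipeline.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sparse_attention(queries, keys, values, k):
    """For each query, keep the k best-scoring keys and run exact softmax attention on them only."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Hypothetical selector: score every key, keep the top k.
        # A real subquadratic model would need a cheaper way to pick candidates.
        scores = [dot(q, key) * scale for key in keys]
        selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Exact attention, restricted to the selected positions.
        weights = softmax([scores[i] for i in selected])
        outputs.append([sum(w * values[i][d] for w, i in zip(weights, selected))
                        for d in range(dim)])
    return outputs

# Toy usage: three positions, each query attends to only two of them.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(sparse_attention([[2.0, 1.0]], keys, values, k=2))
```

With `k` equal to the sequence length, this reduces to ordinary dense attention, which is the sense in which the attention mechanism is kept but made selective.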

In terms of algorithmic complexity, SSA moves from O(n²), where every token compares against all others, to O(n·k), where k is the average number of tokens selected per position. As long as k stays small relative to n, cost grows roughly linearly with context length. According to The New Stack, this architecture runs 52 times faster than FlashAttention at 1 million tokens.
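A quick way to get a feel for the difference between O(n²) and O(n·k) is to count the token pairs each approach has to score. The sketch below does only that. The selection budget `K = 4096` is a hypothetical value chosen for illustration, since Subquadratic has not published its own, and the count ignores the cost of the selection step itself, so it says nothing about wall-clock speed.

```python
def dense_pairs(n):
    """Token pairs scored by dense attention: every token against every token."""
    return n * n

def sparse_pairs(n, k):
    """Token pairs scored when each token attends to k selected positions (capped at n)."""
    return n * min(k, n)

K = 4096  # hypothetical selection budget, not a published SubQ parameter

for n in (128_000, 1_000_000, 12_000_000):
    ratio = dense_pairs(n) / sparse_pairs(n, K)
    print(f"n={n:>10,}  dense={dense_pairs(n):.2e}  sparse={sparse_pairs(n, K):.2e}  ratio={ratio:,.0f}x")
```

In this simplified model the advantage is simply n / k: doubling the context quadruples the dense count but only doubles the sparse one, which is why the claimed gains grow with context length.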

According to VentureBeat, at 12 million tokens, this architecture would reduce attention compute by nearly 1,000x compared to current frontier models. According to SiliconANGLE, the RULER 128K benchmark would show 95% accuracy for 8 dollars, compared to 94.8% and roughly 2,600 dollars for Claude Opus 4.6.

Numbers that would make any CTO salivate.
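The figures quoted above can be sanity-checked with two lines of arithmetic. The sketch below uses only the numbers reported in this article. The second half assumes the simplified O(n·k) model, in which the reduction factor is n / k. That is my assumption for illustration, not a published property of SSA.

```python
# RULER 128K prices as quoted from SiliconANGLE.
subq_cost_usd = 8
opus_cost_usd = 2_600  # "roughly 2,600 dollars"
price_ratio = opus_cost_usd / subq_cost_usd
print(f"RULER 128K price ratio: {price_ratio:.0f}x")  # 325x

# Under a simplified O(n*k) model, the reduction factor is n / k,
# so a 1,000x reduction at 12M tokens implies k = n / 1,000.
n_tokens = 12_000_000
claimed_reduction = 1_000
implied_k = n_tokens / claimed_reduction
print(f"Implied selection budget: {implied_k:,.0f} positions per token")  # 12,000
```

The 325x price ratio at 128K is in the same range as the roughly 300x reduction cited for that context length in the FAQ at the end of this article. None of this validates the claims; it only shows that the announced numbers are at least consistent with one another.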

The fundraise confirms that serious people believe in it: 29 million dollars in seed funding, a valuation reported at 500 million dollars by The New Stack, and investors that include the co-founder of Tinder (Justin Mateen), a former SoftBank Vision Fund partner (Javier Villamizar), as well as early investors in Anthropic, OpenAI, Stripe and Brex.

What do the benchmarks show?

| Benchmark | Claude Opus * | GPT-5.5 | SubQ 1M-Preview | What it measures | Trend |
|---|---|---|---|---|---|
| SWE-Bench Verified | 87.6% (4.7) | n/r | 81.8% | Real-world software engineering | ↓ behind |
| RULER 128K | 94.8% (4.6) | n/r | 95.0% | Long-context accuracy | ↑ +0.2 pts |
| MRCR v2 (1M, 8 needles) | 32.2% (4.7) | 74.0% | 65.9% (deployed) | Long coreference resolution | → middle of the pack |

SOURCE: subq.ai benchmarks + VentureBeat · Updated 05/2026. * Subquadratic used Claude Opus 4.6 for RULER and Claude Opus 4.7 for SWE-Bench / MRCR. The SubQ MRCR column shows the deployed model's score (65.9%); the research configuration claims 83%.

The numbers are interesting on long context, but SubQ trails on SWE-Bench Verified (81.8% versus 87.6% for Claude Opus 4.7). A cheaper model that codes worse isn't necessarily a good deal for an autonomous AI agent that needs to produce reliable code.

Why researchers remain skeptical

The problem isn't that the claims are impossible. It's that they're unverifiable.

What evidence is still missing?

According to FelloAI, the full technical report has not been published. The model weights remain closed. All products (API, SubQ Code, SubQ Search) are in private beta. And the benchmarks, though presented as third-party validated, have not been independently reproduced by the community.

This is not a minor detail. The history of subquadratic architectures is a graveyard of promises.

Mamba proposed a state-space approach that was supposed to replace attention. RWKV tried to reconcile RNNs and transformers. DeepSeek introduced its own sparse attention. Every time, the benchmarks on paper were promising and the production results were disappointing. None of these architectures managed to rival dense transformers at frontier scale.

A second red flag concerns the MRCR benchmarks themselves. According to DataCamp, SubQ's research configuration reaches 83% on MRCR v2, but the deployed API model only achieves 65.9%, a 17-point gap between lab and production. This kind of gap between internal benchmarks and real-world performance is precisely what the community is waiting to see explained publicly.

The Magic.dev precedent is also instructive. According to The New Stack and VentureBeat, that startup announced a 100-million-token context in August 2024, with a similar 1,000x efficiency advantage, and raised roughly 500 million dollars. As of early 2026, there was still no public evidence that its LTM-2-mini model was used in production outside the company. Grand contextual efficiency announcements already have a track record.

SubQ argues that SSA is fundamentally different because it preserves exact attention on selected tokens, rather than replacing it with an alternative mechanism. That's an interesting technical argument. But until the community can reproduce the results, skepticism remains the rational position.

As VentureBeat puts it, researcher reactions range "from genuine curiosity to open accusations of vaporware." Not exactly a consensus.

The real problem: your clients aren't waiting for a cheaper model

Even if SubQ delivered on every promise tomorrow morning, model cost would rarely be the top expense in an enterprise AI project. What actually holds back deployments is integration with existing tools, not the token bill.

Let's assume for a moment that SubQ delivers on all its promises. A 12-million-token context, linear costs, frontier quality. What does that concretely change for a 50-person SME looking to automate customer service or streamline prospecting?

Not much this year.

Why isn't model cost your bottleneck?

I see it every week while working with SMEs on their AI projects: token cost is almost never the blocker. What's expensive is integration. Connecting an LLM to the CRM, to emails, to the knowledge base, training the teams, handling errors, iterating on prompts. The real cost of LLMs isn't on the API bill.

According to McKinsey, companies that capture value from AI are the ones investing in integration with existing workflows, not the ones chasing the cheapest model. The pattern is always the same: an impressive demo, then months of integration before the first euro of ROI.

Why does integration matter more than architecture?

A model that's 1,000x cheaper doesn't fix the fact that your ERP exports to CSV, that your sales team doesn't use the CRM properly, or that nobody on the team knows how to write a structured prompt. In my experience with SMEs, these problems absorb the vast majority of an AI project's budget, rarely less than 70 to 80%.
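The same reasoning can be put into numbers. The sketch below takes the 70 to 80% integration share mentioned above and makes a deliberately generous assumption: that everything else in the budget is model spend, and that a 1,000x price cut applies to all of it. The function name and the breakdown are illustrative, not a cost model from a real project.

```python
def total_savings(model_share, price_reduction):
    """Fraction of the whole project budget saved when only the model share gets cheaper."""
    return model_share * (1 - 1 / price_reduction)

for integration_share in (0.70, 0.80):
    # Generous assumption: everything that is not integration is model spend.
    model_share = 1 - integration_share
    saved = total_savings(model_share, price_reduction=1_000)
    print(f"integration share {integration_share:.0%}: total budget shrinks by {saved:.1%}")
```

Even in this best case, a model that is 1,000x cheaper trims the total budget by 20 to 30% at most. The integration share does not move at all.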

The companies I work with that get concrete results aren't the ones waiting for the next architectural breakthrough. They're the ones that integrate AI into their departments with the models available today, starting with a specific and measurable use case.

"The real value isn't in the model, it's in the integration with your business processes. SubQ or not, that equation doesn't change."

Vincent, May 2026

What you should do instead of waiting

Don't postpone your AI projects waiting for SubQ. Existing models already deliver measurable value, and SubQ won't be available for enterprise production until late 2026 at the earliest, probably not before 2027.

The natural reflex when an announcement like SubQ drops is to think: "let's wait, prices will come down." That's exactly the wrong calculation.

Should you delay your AI projects waiting for SubQ?

No. For three reasons.

First, SubQ is in private beta with no announced general availability date. Even if the model works, you won't be able to use it in production for months, probably not before 2027 for reliable enterprise use.

Second, the costs of existing models are already dropping. OpenAI offers free fine-tuning, Anthropic has significantly reduced its model pricing over the past year, and open-source models like Llama allow local inference for certain use cases. You don't need an architectural breakthrough to get reasonable costs.

Third, every month of waiting is a month without the operational gains AI can already deliver. A well-configured AI agent on your sales pipeline generates value from the first week. A model that's 1,000x cheaper but doesn't exist yet generates none.

What signals should you watch to know if SubQ is serious?

Three indicators to look for:

  • The publication of the full technical report. Without it, any discussion of the architecture remains speculative.
  • Independent reproduction of the benchmarks by at least two recognized research teams.
  • The opening of a public API with verifiable pricing, not a private, invite-only beta.

Until all three conditions are met, SubQ remains a promise, not a tool. And promises don't reduce your operating costs.

The right strategy hasn't changed: identify the task that costs you the most in time and money, plug an existing model into it, measure ROI in six weeks, iterate. It's less spectacular than a 29-million-dollar funding announcement, but it's what works. Companies that put AI at the heart of their operations today, with today's tools, will have a structural advantage over those waiting for the perfect model. At GoLive Software, we support exactly this kind of transition: pragmatic, measurable, without waiting for the next revolution.
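For the "measure ROI in six weeks" step, a back-of-the-envelope calculation is usually enough to start. Here is a minimal sketch. The function name, the cost breakdown and every figure in the example call are hypothetical placeholders to replace with your own measurements, not data from a real project.

```python
def pilot_roi(hours_saved_per_week, hourly_cost, weeks, integration_cost, api_cost_per_week):
    """Return (gains - costs) / costs over the pilot period."""
    gains = hours_saved_per_week * hourly_cost * weeks
    costs = integration_cost + api_cost_per_week * weeks
    return (gains - costs) / costs

# All figures below are hypothetical placeholders, not data from the article.
roi = pilot_roi(hours_saved_per_week=30, hourly_cost=40, weeks=6,
                integration_cost=5_000, api_cost_per_week=50)
print(f"Six-week ROI: {roi:.0%}")
```

In this made-up example the API bill is a small line next to the integration cost, which is the point of the whole article: the variable to work on is hours saved, not price per token.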

Frequently asked questions

Is SubQ really 1,000 times cheaper than Claude or GPT?

That's what Subquadratic claims for very long contexts (12 million tokens). At 128K tokens, the announced reduction would be closer to 300x according to SiliconANGLE. These numbers have not been independently reproduced, and the model is not publicly accessible. Until the technical report is published, these claims remain unverifiable.

Can SubQ be used in production today?

No. All three products (API, SubQ Code, SubQ Search) are in private beta, with access on request only. No general availability date has been communicated. For enterprise use requiring reliability and support, you'll likely need to wait until at least late 2026, possibly 2027.

Why have subquadratic architectures always failed?

Previous attempts (Mamba, RWKV, DeepSeek Sparse Attention) replaced attention with alternative mechanisms or used fixed sparsity patterns. They performed well on benchmarks but lost quality at frontier scale. SubQ claims SSA is different because it preserves exact attention, but this claim remains to be validated.

Should an SME wait for LLM costs to drop before launching an AI project?

No. Token cost is rarely the main expense in an SME's AI project. Integration with existing tools, team training, and iterating on use cases absorb most of the budget. Waiting for a cheaper model delays operational gains that are already achievable with current models.

Could SubQ replace RAG and context pipelines?

That's the stated ambition: with 12 million tokens, there would be no need to chunk, index and retrieve documents, because everything fits in context. In theory, this would drastically simplify architectures. In practice, nobody has yet been able to verify that quality holds up on real-world use cases at this scale.
