Pixel Smile: the first zero-shot facial expression editing model that fixes semantic entanglement between emotions.
- 🔑 Four open source releases to test: Pixel Smile, Group Editing, Astralab, Cohere Transcribe.
- 🎯 Pixel Smile solves semantic entanglement across 12 expressions with a continuous intensity slider, zero-shot.
- 💡 Group Editing edits a batch of related images consistently in a single pass, accepted at CVPR 2026.
- 🚀 Cohere Transcribe transcribes 16 minutes of audio in 9 seconds, roughly 100x faster than real time.
- ⚠️ Astralab improves AI video quality at no extra memory or inference time cost.
Group Editing: edit a batch of images in a single pass, accepted at CVPR 2026, available as open source LoRA.
Astralab: an RL framework that improves AI video quality at no extra memory cost, compatible with existing models.
Cohere Transcribe: 2B parameters, 4 GB, 16 minutes of audio transcribed in 9 seconds, Apache 2.0.
All 4 releases are open source or available as public weights on Hugging Face.
Pixel Smile: finally, precise control over facial emotions
There is a problem that has been haunting AI facial editing for years: when you ask a model to make someone look scared, it also sneaks in a bit of surprise. The two expressions share too many common facial markers. This is called semantic entanglement.
Pixel Smile was built specifically to solve this. The model covers 12 expressions: the 6 basics (happiness, sadness, anger, fear, surprise, disgust) plus 6 extended ones (confused, contempt, confident, shy, sleepy, anxious). Each one comes with a continuous intensity slider. You do not toggle an expression, you dial it in.
What is really impressive is the blending. The team tested all 15 possible pairwise combinations of the 6 basic emotions. Nine of them produce coherent compound expressions the model never saw during training. Anger + sadness yields a haunted look. Happiness + disgust is exactly the face someone makes when they bite into something bad.
These blending results are entirely zero-shot. The model was not trained on compound expressions; it learned the underlying emotional topology.
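The figure of 15 is just the number of unordered pairs you can draw from six emotions. A minimal sketch of that count, assuming the blends are pairwise; the name `BASIC_EMOTIONS` is ours for illustration and does not come from the Pixel Smile code:

```python
from itertools import combinations

# The six basic emotions listed above; the variable name is illustrative.
BASIC_EMOTIONS = ["happiness", "sadness", "anger", "fear", "surprise", "disgust"]

# Every unordered pair of distinct emotions: C(6, 2) = 15.
pairs = list(combinations(BASIC_EMOTIONS, 2))

# Share of pairs reported as coherent compound expressions (9 of 15).
coherent_share = 9 / len(pairs)

print(len(pairs))                      # number of pairwise blends
print(("sadness", "anger") in pairs)   # the anger + sadness blend is one of them
print(f"{coherent_share:.0%}")         # share of coherent blends
```

So roughly 60% of the possible pairs worked without any compound-expression training data.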
Technically, Pixel Smile is a LoRA adapter on Qwen-Image-Edit-2511, a multimodal diffusion transformer. The adapter is 850 MB in safetensors format. Your VRAM requirements depend on the base model, not the LoRA.
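To see why a LoRA stays small next to its base model, here is a back-of-the-envelope sketch. The layer width (3072), the rank (64), and the 2 bytes per parameter are illustrative assumptions, not published Pixel Smile settings; only the 850 MB file size comes from the release.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix the adapter modifies."""
    return d_in * d_out


# Hypothetical layer shape and rank, chosen only for illustration.
D_IN, D_OUT, RANK = 3072, 3072, 64

ratio = lora_params(D_IN, D_OUT, RANK) / full_params(D_IN, D_OUT)
print(f"adapter is {ratio:.1%} of the layer it adapts")

# Rough parameter count implied by an 850 MB file, assuming
# 2 bytes per parameter (16-bit weights) and 1 MB = 10**6 bytes.
approx_adapter_params = 850 * 10**6 // 2
print(f"~{approx_adapter_params / 1e6:.0f}M adapter parameters")
```

Under these assumptions the adapter adds only a few percent on top of each layer it touches, which is why the base model dominates memory use.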
| Model | Precision (6 emotions) | Structural confusion rate |
|---|---|---|
| Pixel Smile | 0.8627 | 0.0550 (lowest) |
| Nano Banana Pro | 0.8431 | 0.1754 |
| GPT Image 1.5 | 0.8039 | 0.1107 |
| Other models | variable | > 0.2000 |
Code on GitHub, weights on Hugging Face right now.
Group Editing: edit a batch of images with a single prompt
You have 4 photos of the same dog, each from a different angle. You want to restyle all of them with a single prompt. Result: 4 images edited together, consistently, without touching poses or backgrounds.
That is exactly what Group Editing does. The paper was just accepted at CVPR 2026, the top venue in computer vision.
The demonstrated use cases cover a lot of ground: restyling 4 photos of the same object, consistently colorizing 4 black-and-white images, and converting 4 line drawings into realistic renders. There is also character swapping: you provide a reference image of a character and two scenes with different characters, and the system replaces the characters in both scenes simultaneously. Other examples include a global color change across 4 images of the same car and style transfer on 4 elephants in a single pass.
Like Pixel Smile, it runs as a LoRA, compatible with your existing generation pipeline. Code and weights available on GitHub.
Astralab: improving AI video without touching memory
AI-generated video has a common problem: it often looks flat, movements feel off, and there is this diffuse sense that something is not quite right. Astralab is a reinforcement learning framework you attach to an existing distilled video model to fix exactly that.
What sets Astralab apart from other RL approaches for video: it increases neither the memory footprint nor the inference time. That is the piece every previous attempt was missing.
The technical trick is called trajectory-free forward process RL. Classic RL methods need to unroll the entire reverse diffusion process to compute gradients, which blows up memory. Astralab sidesteps this by directly comparing positive and negative final outputs. Zero trajectory storage.
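As a rough illustration of "comparing positive and negative final outputs", the sketch below partitions finished clips by their reward score. Splitting at the mean reward is our assumption for the example, and `split_by_reward` is a hypothetical helper, not a function from the Astralab code; the point is only that nothing from the intermediate denoising steps needs to be kept.

```python
from statistics import mean


def split_by_reward(outputs, rewards):
    """Partition finished outputs into positives and negatives.

    Only final outputs and their scalar rewards are needed; no intermediate
    denoising states are stored. The mean-reward threshold is an assumption.
    """
    if len(outputs) != len(rewards):
        raise ValueError("outputs and rewards must have the same length")
    baseline = mean(rewards)
    positives = [o for o, r in zip(outputs, rewards) if r > baseline]
    negatives = [o for o, r in zip(outputs, rewards) if r <= baseline]
    return positives, negatives


# Four finished clips with made-up reward scores.
pos, neg = split_by_reward(
    ["clip_a", "clip_b", "clip_c", "clip_d"],
    [0.9, 0.2, 0.7, 0.4],
)
print(pos)  # clips scoring above the group mean
print(neg)  # clips scoring at or below the group mean
```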
For long video, a rolling KV cache processes clip windows one at a time, so memory usage stays constant regardless of length. To prevent the model from gaming the reward function, a multi-reward objective simultaneously covers visual quality, motion dynamics, and text alignment.
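The constant-memory behavior of a rolling cache can be sketched with a fixed-size queue. This is a toy model, not Astralab's implementation: the strings stand in for per-clip key/value tensors, and `peak_cache_size` is a hypothetical helper written for this example.

```python
from collections import deque


def peak_cache_size(num_clips: int, window: int) -> int:
    """Process clips one at a time and report the largest cache size seen."""
    # deque(maxlen=...) silently evicts the oldest entry once it is full.
    cache = deque(maxlen=window)
    peak = 0
    for clip_index in range(num_clips):
        cache.append(f"kv_for_clip_{clip_index}")  # placeholder for real tensors
        peak = max(peak, len(cache))
    return peak


# Peak cache size depends on the window, not on the video length.
print(peak_cache_size(num_clips=10, window=4))
print(peak_cache_size(num_clips=1000, window=4))
```

Both calls report the same peak, which is the property the paragraph above describes: a longer video means more iterations, not more memory.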
Against CausVid and Self Forcing, the results are clear: on every prompt tested, Astralab comes out a step ahead. If you already use tools like Claude Code in your video content workflow, this is the kind of toolkit worth keeping on hand.
Available now, tested on Craya 14B (40GB+ VRAM) and Causal Forcing 1.3B for lighter setups.
Cohere Transcribe: 16 minutes of audio in 9 seconds
Cohere just released a speech-to-text transcription model. 2 billion parameters, about 4 GB. 14 languages: English, French, German, Italian, Chinese, Japanese, Arabic, Vietnamese, and 6 others. Apache 2.0 license.
I tested it myself: a 16-minute audio file, or 960 seconds. Result in 9 seconds, which works out to roughly 100x faster than real time. The transcription itself was clean, with no hallucinations and no garbled words.
9 seconds for 16 minutes of audio. If you have ever waited for Whisper to grind through a long file locally, you know exactly what that changes.
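The speed figure is simple arithmetic on the two numbers from that test. A small sketch, with function and variable names of our own choosing:

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than real time the transcription ran."""
    return audio_seconds / processing_seconds


audio_seconds = 16 * 60      # 16 minutes of audio
processing_seconds = 9       # measured wall-clock time from the test above

speed = speed_factor(audio_seconds, processing_seconds)

# The inverse ratio is what ASR papers usually call the real-time factor (RTF).
rtf = processing_seconds / audio_seconds

print(f"{speed:.0f}x faster than real time")
print(f"RTF = {rtf:.4f}")
```

The exact ratio is a little above 100, so "about 100x" is a fair rounding. Keep in mind this is one file on one machine, not a benchmark.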
| Criterion | Cohere Transcribe | Whisper large-v3 (OpenAI) |
|---|---|---|
| Model size | 4 GB | ~10 GB |
| Languages | 14 | 99+ |
| License | Apache 2.0 (commercial) | MIT |
| Speed (x real-time) | ~100x | ~30-50x |
| WER (AMI benchmark) | Top of leaderboard | 2nd place |
| Win rate vs ElevenLabs | 51% | n/a |
This is a gated model on Hugging Face, so before you run anything, head to the model's Hugging Face repo, click 'Agree and access repository', generate a read token, and paste it into the provided Colab notebook. Then choose Run all, and in about 2 minutes you have a Gradio interface with file upload or direct microphone recording.
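If you would rather pull the weights outside Colab, the same token works from a script. This is a minimal sketch, not the notebook's code: it assumes the `huggingface_hub` package is installed and that your read token is stored in the `HF_TOKEN` environment variable. Both helper functions are hypothetical names, and the repo ID is left as a parameter because it should be copied from the model page.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face token from the environment, failing loudly if absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a Hugging Face read token first.")
    return token


def download_gated_repo(repo_id: str, token: str) -> str:
    """Download a gated repo and return the local folder path."""
    # Imported lazily so the token helper works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, token=token)
```

Call `download_gated_repo` with the repo ID shown on the model page. Access must already be granted on that page, otherwise the download is refused even with a valid token.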
What this actually changes
These 4 releases share one thing: they are open source or available as public weights. No waitlist, no closed API. You download, you test, you integrate.
Pixel Smile and Group Editing transform batch visual asset management. Astralab makes your video generations more professional without changing your stack. Cohere Transcribe replaces Whisper if speed is your bottleneck.
For teams building automated content pipelines, these are exactly the kind of building blocks we use in AI operating systems for clients: specialized, lightweight, composable tools.
