Week of March 23, 2026
Papers
- MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
  Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames...
- From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
  Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask ...
- LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
  Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained ...
- Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation
  Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions...
- CoVR-R: Reason-Aware Composed Video Retrieval
  Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification...
Blog Posts
- Build a Domain-Specific Embedding Model in Under a Day
- What's New in Mellea 0.4.0 + Granite Libraries Release
- How we monitor internal coding agents for misalignment
  How OpenAI uses chain-of-thought monitoring to study misalignment in internal coding agents—analyzing real-world deployments to detect risks and st...
Week of March 16, 2026
Papers
- PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
  Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. B...
- Representation Learning for Spatiotemporal Physical Systems
  Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accura...
- Visual-ERM: Reward Modeling for Visual Equivalence
  Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured repres...
- Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
  Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D fra...
- Neuron-Aware Data Selection In Instruction Tuning For Large Language Models
  Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent stu...
Blog Posts
- Beyond Semantic Similarity: Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline
- Rakuten fixes issues twice as fast with Codex
  Rakuten uses Codex, the coding agent from OpenAI, to ship software faster and safer, reducing MTTR 50%, automating CI/CD reviews, and delivering fu...
- Designing AI agents to resist prompt injection
  How ChatGPT defends against prompt injection and social engineering by constraining risky actions and protecting sensitive data in agent workflows.
Week of March 9, 2026
Papers
- Multimodal Large Language Models as Image Classifiers
  The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground-truth quality. Studies comp...
- Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
  While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive archite...
- BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations
  The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic unde...
- Fly360: Omnidirectional Obstacle Avoidance within Drone View
  Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spati...
- SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation
  Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remai...
Blog Posts
- How Descript enables multilingual video dubbing at scale
  Descript uses OpenAI models to scale multilingual video dubbing, optimizing translations for both meaning and timing so dubbed speech sounds natura...
- Codex Security: now in research preview
  Codex Security is an AI application security agent that analyzes project context to detect, validate, and patch complex vulnerabilities with higher...
- How Balyasny Asset Management built an AI research engine for investing
  See how Balyasny built an AI research system with GPT-5.4, rigorous model evaluation, and agent workflows to transform investment analysis at scale.
Week of March 2, 2026
Papers
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
  Shows that adaptively allocating test-time compute can outperform 14x larger models.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  Demonstrates how RL-based training produces strong reasoning without supervised fine-tuning.
- Utonia: Toward One Encoder for All Point Clouds
  We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we pre...
- MIBURI: Towards Expressive Interactive Gesture Synthesis
  Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large...
- CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
  Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we...
- How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
  Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks ar...
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
  Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing appr...
Blog Posts
- A Visual Guide to Quantization
  Intuitive visual walkthrough of LLM quantization techniques from FP16 to GPTQ and GGUF.
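  The post walks through quantization visually; as a minimal illustrative sketch (not the post's code), absmax (symmetric) int8 quantization rescales a tensor so its largest magnitude maps to 127, then rounds:

  ```python
  def quantize_absmax(weights):
      """Symmetric (absmax) int8 quantization: scale so that the
      largest-magnitude value maps to 127, then round to integers."""
      scale = max(abs(w) for w in weights) / 127.0
      codes = [round(w / scale) for w in weights]
      return codes, scale

  def dequantize(codes, scale):
      """Recover approximate float values from the int8 codes."""
      return [c * scale for c in codes]

  w = [0.5, -1.2, 0.03, 2.4]
  codes, scale = quantize_absmax(w)
  w_hat = dequantize(codes, w := None or scale)  # round-trip reconstruction
  ```

  The round-trip error of each value is bounded by half a quantization step (scale / 2); GPTQ and GGUF formats build on the same idea with per-group scales and error-compensating rounding.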
- Inference at the Edge: Practical Lessons from Deploying LLMs on Consumer GPUs
  Real-world tips for efficient local LLM inference with vLLM on consumer hardware.
- Understanding AI and learning outcomes
  OpenAI introduces the Learning Outcomes Measurement Suite to assess AI's impact on student learning across diverse educational environments over time.
- PRX Part 3 — Training a Text-to-Image Model in 24h!
- GPT-5.3 Instant System Card