Week of March 23, 2026
Papers
- MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
  Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames...
- From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
  Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask ...
- LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
  Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained ...
- Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation
  Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions...
- CoVR-R: Reason-Aware Composed Video Retrieval
  Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification...
Blog Posts
- Build a Domain-Specific Embedding Model in Under a Day
- What's New in Mellea 0.4.0 + Granite Libraries Release
- How we monitor internal coding agents for misalignment
  How OpenAI uses chain-of-thought monitoring to study misalignment in internal coding agents—analyzing real-world deployments to detect risks and st...
Week of March 16, 2026
Papers
- PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
  Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. B...
- Representation Learning for Spatiotemporal Physical Systems
  Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accura...
- Visual-ERM: Reward Modeling for Visual Equivalence
  Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured repres...
- Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
  Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D fra...
- Neuron-Aware Data Selection In Instruction Tuning For Large Language Models
  Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent stu...
Blog Posts
- Beyond Semantic Similarity: Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline
- Rakuten fixes issues twice as fast with Codex
  Rakuten uses Codex, the coding agent from OpenAI, to ship software faster and safer, reducing MTTR 50%, automating CI/CD reviews, and delivering fu...
- Designing AI agents to resist prompt injection
  How ChatGPT defends against prompt injection and social engineering by constraining risky actions and protecting sensitive data in agent workflows.
Week of March 9, 2026
Papers
- Multimodal Large Language Models as Image Classifiers
  The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground-truth quality. Studies comp...
- Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
  While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive archite...
- BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations
  The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic unde...
- Fly360: Omnidirectional Obstacle Avoidance within Drone View
  Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spati...
- SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation
  Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remai...
Blog Posts
- How Descript enables multilingual video dubbing at scale
  Descript uses OpenAI models to scale multilingual video dubbing, optimizing translations for both meaning and timing so dubbed speech sounds natura...
- Codex Security: now in research preview
  Codex Security is an AI application security agent that analyzes project context to detect, validate, and patch complex vulnerabilities with higher...
- How Balyasny Asset Management built an AI research engine for investing
  See how Balyasny built an AI research system with GPT-5.4, rigorous model evaluation, and agent workflows to transform investment analysis at scale.
Week of March 2, 2026
Papers
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
  Shows that adaptively allocating test-time compute can outperform 14x larger models.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  Demonstrates how RL-based training produces strong reasoning without supervised fine-tuning.
- Utonia: Toward One Encoder for All Point Clouds
  We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we pre...
- MIBURI: Towards Expressive Interactive Gesture Synthesis
  Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large...
- CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
  Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we...
- How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
  Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks ar...
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
  Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing appr...
Blog Posts
- A Visual Guide to Quantization
  Intuitive visual walkthrough of LLM quantization techniques from FP16 to GPTQ and GGUF.
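  The post walks through quantization visually; as a minimal illustrative sketch (not the post's code), absmax (symmetric) int8 quantization rescales a tensor so its largest magnitude maps to 127, then rounds:

  ```python
  def quantize_absmax(weights):
      """Symmetric (absmax) int8 quantization: scale so that the
      largest-magnitude value maps to 127, then round to integers."""
      scale = max(abs(w) for w in weights) / 127.0
      codes = [round(w / scale) for w in weights]
      return codes, scale

  def dequantize(codes, scale):
      """Recover approximate float values from the int8 codes."""
      return [c * scale for c in codes]

  w = [0.5, -1.2, 0.03, 2.4]
  codes, scale = quantize_absmax(w)
  w_hat = dequantize(codes, w := None or scale)  # round-trip reconstruction
  ```

  The round-trip error of each value is bounded by half a quantization step (scale / 2); GPTQ and GGUF formats build on the same idea with per-group scales and error-compensating rounding.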
- Inference at the Edge: Practical Lessons from Deploying LLMs on Consumer GPUs
  Real-world tips for efficient local LLM inference with vLLM on consumer hardware.
- Understanding AI and learning outcomes
  OpenAI introduces the Learning Outcomes Measurement Suite to assess AI's impact on student learning across diverse educational environments over time.
- PRX Part 3 — Training a Text-to-Image Model in 24h!
- GPT-5.3 Instant System Card