๐Ÿ‘จ๐Ÿปโ€๐Ÿ’ป About Me

Hi, I am Xiaokun Feng (丰效坤)! I'm a Ph.D. student at the Institute of Automation, Chinese Academy of Sciences (CASIA), supervised by Prof. Kaiqi Huang (IAPR Fellow). I am also a member of the Visual Intelligence Interest Group (VIIG).

Currently, my research focuses on video multimodal tracking and video generation tasks. If you are intrigued by my work or wish to collaborate, feel free to reach out to me.

🔥 News

  • 2025.09: ๐Ÿ“One co-first author paper (CoS) has been accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS, CCF-A Conference, Poster).
  • 2025.08: ๐Ÿ“ฃOur new benchmark (NarrLV) is now available! It is a novel benchmark to evaluate long video generation models from the perspective of narrative expressiveness.
  • 2025.06: ๐Ÿ“Two papers (ATCTrack, VMBench) have been accepted by International Conference on Computer Vision (ICCV, CCF-A conference). ATCTrack was recognized as a Highlight paper.
  • 2025.05: ๐Ÿ“One paper (CSTrack) has been accepted by International Conference on Machine Learning (ICML, CCF-A conference).
  • 2025.01: ๐Ÿ“One paper (CTVLT) has been accepted by IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP, CCF-B conference).
  • 2024.09: ๐Ÿ“Two papers (MemVLT and CPDTrack) have been accepted by Conference on Neural Information Processing Systems (NeurIPS, CCF-A Conference).
  • 2024.05: ๐Ÿ“One paper (LKRobust) has been accepted by International Conference on Machine Learning (ICML, CCF-A conference).
  • 2024.04: ๐Ÿ“ฃ We will present our work (Global Instance Tracking) at TPAMI2023 during the VALSE2024 poster session (May 2024, Chongqing, China) and extend a warm invitation to colleagues interested in visual object/language tracking, evaluation methodologies, and human-computer interaction to engage in discussions with us (see our Poster for more information).
  • 2024.04: ๐Ÿ“ One paper has been accepted by the 3rd CVPR Workshop on Vision Datasets Understanding and DataCV Challenge as Oral Presentation (CVPRW, Workshop in CCF-A Conference, Oral)!
  • 2023.09: ๐Ÿ“ One paper has been accepted by the 37th Conference on Neural Information Processing Systems (NeurIPS, CCF-A Conference, Poster)!
  • 2023.08 : ๐Ÿ“One paper (HIST) has been accepted by Chinese Conference on Pattern Recognition and Computer Vision (PRCV, CCF-C conference).
  • 2022.04: ๐Ÿ† Obtain Beijing Outstanding Graduates (ๅŒ—ไบฌๅธ‚ไผ˜็ง€ๆฏ•ไธš็”Ÿ) !
  • 2021.12: ๐Ÿ† Obtain China National Scholarship (ๅ›ฝๅฎถๅฅ–ๅญฆ้‡‘) (the highest honor for undergraduates in China, awarded to top 1% students of BIT)!
  • 2020.12: ๐Ÿ† Obtain China National Scholarship (ๅ›ฝๅฎถๅฅ–ๅญฆ้‡‘) (the highest honor for undergraduates in China, awarded to top 1% students of BIT)!

🔬 Research Interests

Video multimodal tracking

  • Investigating multimodal tracking to address challenges in integrating visual information with auxiliary modalities (e.g., language, heat maps, infrared images, and depth maps), thereby enhancing tracking accuracy.
  • Leveraging Large Language Models (LLMs) in conjunction with visual-language tracking to explore human–computer interaction patterns, contributing to the development of more intuitive and user-friendly interaction systems.
  • Following the paradigm of unified foundation models by jointly utilizing datasets from multiple modalities to train a single model capable of handling all multimodal tracking tasks.

Video generation

  • Focusing on long video generation, aiming to improve existing evaluation benchmarks and advance model design in this field.
  • Exploring the applications of foundation video generation models in various downstream tasks.

📖 Education


2018.09 - 2022.06, Undergraduate study, ranked 5/381 (top 1.3%)
School of Information and Electronics
Beijing Institute of Technology, Beijing

💻 Research Experiences

๐Ÿ“ Publications

โ˜‘๏ธ Ongoing Research

Under Review

NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation

Xiaokun Feng, Haiming Yu, Meiqi Wu, Shiyu Hu, et al.

Under Review

📌 Long Video Generation 📌 Perception-Aligned Evaluation 📌 Narrative Content Assessment
📃 Paper

Under Review

$S^2$-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models

Chubin Chen, Jiashu Zhu, Xiaokun Feng, et al.

Under Review

📌 Diffusion Models 📌 Guidance Optimization 📌 Stochastic Sub-Network Techniques
📃 Paper

Under Review

Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, et al.

Under Review

📌 Visual Effects Generation 📌 Multi-Effect Spatial Control 📌 Prompt-Guided Generation
📃 Paper

✅ Accepted Papers

NeurIPS 2025

Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Reward
Honghao Chen, Xingzhou Lou, Xiaokun Feng*, et al.
NeurIPS 2025 (CCF-A Conference)
📌 Multimodal Large Language Model 📌 Reinforcement Learning for Reasoning 📌 Process Reward Model
📃 Paper

ICCV 2025

ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
Xiaokun Feng*, Shiyu Hu*, Xuchen Li, Dailing Zhang, et al.
ICCV 2025 (CCF-A Conference, Highlight)
📌 Visual Language Tracking 📌 Vision-Language Alignment 📌 Adaptive Prompts
📃 Paper

ICCV 2025

VMBench: A Benchmark for Perception-Aligned Video Motion Generation
Xinran Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng*, et al.
ICCV 2025 (CCF-A Conference)
📌 Video Generation 📌 Perception-Aligned Evaluation 📌 Motion Quality Assessment
📃 Paper

ICML 2025

CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features
Xiaokun Feng*, Dailing Zhang, Shiyu Hu*, et al.
ICML 2025 (CCF-A Conference)
📌 RGB-X Tracking 📌 Foundation Model Design 📌 Compact Spatiotemporal Modeling
📃 Paper

ICASSP 2025

Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
Xiaokun Feng*, Dailing Zhang, Shiyu Hu*, et al.
ICASSP 2025 (CCF-B Conference)
📌 Visual Language Tracking 📌 Foundation Model 📌 Multimodal Cue Utilization
📃 Paper

NeurIPS 2024

MemVLT: Vision-language tracking with adaptive memory-based prompts
Xiaokun Feng*, Xuchen Li, Shiyu Hu*, et al.
NeurIPS 2024 (CCF-A Conference)
📌 Visual Language Tracking 📌 Memory-Based Prompt Adaptation 📌 Complementary Learning Systems Theory
📃 Paper

NeurIPS 2024

Beyond accuracy: Tracking more like human via visual search
Dailing Zhang, Shiyu Hu*, Xiaokun Feng*, et al.
NeurIPS 2024 (CCF-A Conference)
📌 Human-like Visual Tracking 📌 Central-Peripheral Dichotomy Theory 📌 Spatio-Temporal Discontinuity Challenge
📃 Paper

ICML 2024

Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness
Honghao Chen, Yurong Zhang, Xiaokun Feng*, et al.
ICML 2024 (CCF-A Conference)
📌 Vision Transformers (ViTs) 📌 Large Kernel Convolutional Networks 📌 Robustness
📃 Paper

CVPRW 2024

DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM

Xuchen Li, Xiaokun Feng*, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, Kaiqi Huang

CVPRW 2024 Oral (the 3rd CVPR Workshop on Vision Datasets Understanding and DataCV Challenge; Workshop in CCF-A Conference)
📌 Visual Language Tracking 📌 Large Language Model 📌 Evaluation Technique
📃 Paper

NeurIPS 2023

A Multi-modal Global Instance Tracking Benchmark (MGIT): Better Locating Target in Complex Spatio-temporal and Causal Relationship

Shiyu Hu, Dailing Zhang, Meiqi Wu, Xiaokun Feng*, et al.

NeurIPS 2023 (CCF-A Conference)
📌 Visual Language Tracking 📌 Long Video Understanding and Reasoning 📌 Hierarchical Semantic Information Annotation
📃 Paper 📃 Slides

🎖 Honors and Awards

  • Beijing Outstanding Graduate (北京市优秀毕业生), at BIT, by the Beijing Municipal Education Commission, 2022
  • China National Scholarship (国家奖学金), at BIT, by the Ministry of Education of China, 2021
  • China National Scholarship (国家奖学金), at BIT, by the Ministry of Education of China, 2020
  • China National Encouragement Scholarship, at BIT, by the Ministry of Education of China, 2019

๐Ÿค Collaborators

I am honored to collaborate with these outstanding researchers. We hold close discussions across fields such as computer vision, AI4Science, and human–computer interaction. If you are also interested in these areas, please feel free to contact me.

Homepage visitors have been recorded since April 2024. Thank you for your attention.