Multimodal-AND-Large-Language-Models

An Overview of Frontier Research on Multimodal and Large Language Models

This project collects recent research progress in multimodal and large language models, covering core techniques such as structured knowledge extraction, event extraction, scene graph generation, and attribute recognition. It also examines open problems for vision-language models around reasoning, compositionality, and open-vocabulary recognition, and gathers a large number of surveys and position papers to give researchers a broad overview of the field and pointers to future directions.


Multimodal & Large Language Models

Note: This paper list only records papers I read from the daily arXiv listings for personal needs. I only subscribe to and cover the following subjects: Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), and Machine Learning (cs.LG). If you find I missed some important and exciting work, it would be super helpful to let me know. Thanks!

Table of Contents

  • Survey
  • Position Paper
  • Structure
  • Event Extraction
  • Situation Recognition
  • Scene Graph
  • Attribute
  • Compositionality

Survey

  • Multimodal Learning with Transformers: A Survey; Peng Xu, Xiatian Zhu, David A. Clifton
  • Multimodal Machine Learning: A Survey and Taxonomy; Tadas Baltrusaitis, Chaitanya Ahuja, Louis-Philippe Morency; Introduces five challenges for multimodal learning: representation, translation, alignment, fusion, and co-learning.
  • Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions; Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
  • Multimodal research in vision and language: A review of current and emerging trends; Shagun Uppal et al
  • Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods; Aditya Mogadala et al
  • Challenges and Prospects in Vision and Language Research; Kushal Kafle et al
  • A Survey of Current Datasets for Vision and Language Research; Francis Ferraro et al
  • VLP: A Survey on Vision-Language Pre-training; Feilong Chen et al
  • A Survey on Multimodal Disinformation Detection; Firoj Alam et al
  • Vision-Language Pre-training: Basics, Recent Advances, and Future Trends; Zhe Gan et al
  • Deep Multimodal Representation Learning: A Survey; Wenzhong Guo et al
  • The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges; Maria Lymperaiou et al
  • Augmented Language Models: a Survey; Grégoire Mialon et al
  • Multimodal Deep Learning; Matthias Aßenmacher et al
  • Sparks of Artificial General Intelligence: Early experiments with GPT-4; Sebastien Bubeck et al
  • Retrieving Multimodal Information for Augmented Generation: A Survey; Ruochen Zhao et al
  • Is Prompt All You Need? No. A Comprehensive and Broader View of Instruction Learning; Renze Lou et al
  • A Survey of Large Language Models; Wayne Xin Zhao et al
  • Tool Learning with Foundation Models; Yujia Qin et al
  • A Cookbook of Self-Supervised Learning; Randall Balestriero et al
  • Foundation Models for Decision Making: Problems, Methods, and Opportunities; Sherry Yang et al
  • Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation; Patrick Fernandes et al
  • Reasoning with Language Model Prompting: A Survey; Shuofei Qiao et al
  • Towards Reasoning in Large Language Models: A Survey; Jie Huang et al
  • Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models; Chen Ling et al
  • Unifying Large Language Models and Knowledge Graphs: A Roadmap; Shirui Pan et al
  • Interactive Natural Language Processing; Zekun Wang et al
  • A Survey on Multimodal Large Language Models; Shukang Yin et al
  • Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment; Yang Liu et al
  • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback; Stephen Casper et al
  • Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies; Liangming Pan et al
  • Challenges and Applications of Large Language Models; Jean Kaddour et al
  • Aligning Large Language Models with Human: A Survey; Yufei Wang et al
  • Instruction Tuning for Large Language Models: A Survey; Shengyu Zhang et al
  • From Instructions to Intrinsic Human Values: A Survey of Alignment Goals for Big Models; Jing Yao et al
  • A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation; Xiaowei Huang et al
  • Explainability for Large Language Models: A Survey; Haiyan Zhao et al
  • Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models; Yue Zhang et al
  • Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity; Cunxiang Wang et al
  • ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?; Hailin Chen et al
  • Vision-Language Instruction Tuning: A Review and Analysis; Chen Li et al
  • The Mystery and Fascination of LLMs: A Comprehensive Survey on the Interpretation and Analysis of Emergent Abilities; Yuxiang Zhou et al
  • Efficient Large Language Models: A Survey; Zhongwei Wan et al
  • The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision); Zhengyuan Yang et al
  • Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents; Zhuosheng Zhang et al
  • Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis; Yafei Hu et al
  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants; Chunyuan Li et al
  • A Survey on Large Language Model based Autonomous Agents; Lei Wang et al
  • Video Understanding with Large Language Models: A Survey; Yunlong Tang et al
  • A Survey of Preference-Based Reinforcement Learning Methods; Christian Wirth et al
  • AI Alignment: A Comprehensive Survey; Jiaming Ji et al
  • A Survey of Reinforcement Learning from Human Feedback; Timo Kaufmann et al
  • TrustLLM: Trustworthiness in Large Language Models; Lichao Sun et al
  • Agent AI: Surveying the Horizons of Multimodal Interaction; Zane Durante et al
  • Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: A Short Survey; Cedric Colas et al
  • Safety of Multimodal Large Language Models on Images and Text; Xin Liu et al
  • MM-LLMs: Recent Advances in MultiModal Large Language Models; Duzhen Zhang et al
  • Rethinking Interpretability in the Era of Large Language Models; Chandan Singh et al
  • Large Multimodal Agents: A Survey; Junlin Xie et al
  • A Survey on Data Selection for Language Models; Alon Albalak et al
  • What Are Tools Anyway? A Survey from the Language Model Perspective; Zora Zhiruo Wang et al
  • Best Practices and Lessons Learned on Synthetic Data for Language Models; Ruibo Liu et al
  • A Survey on the Memory Mechanism of Large Language Model based Agents; Zeyu Zhang et al
  • A Survey on Self-Evolution of Large Language Models; Zhengwei Tao et al
  • When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models; Xianzheng Ma et al
  • An Introduction to Vision-Language Modeling; Florian Bordes et al
  • Towards Scalable Automated Alignment of LLMs: A Survey; Boxi Cao et al
  • A Survey on Mixture of Experts; Weilin Cai et al
  • The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective; Zhen Qin et al
  • Retrieval-Augmented Generation for Large Language Models: A Survey; Yunfan Gao et al

Position Paper

  • Eight Things to Know about Large Language Models; Samuel R. Bowman et al
  • A PhD Student’s Perspective on Research in NLP in the Era of Very Large Language Models; Oana Ignat et al
  • Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models; Yuxi Ma et al
  • Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models; Lingxi Xie et al
  • A Path Towards Autonomous Machine Intelligence; Yann LeCun et al
  • GPT-4 Can’t Reason; Konstantine Arkoudas et al
  • Cognitive Architectures for Language Agents; Theodore Sumers et al
  • Large Search Model: Redefining Search Stack in the Era of LLMs; Liang Wang et al
  • ProAgent: From Robotic Process Automation to Agentic Process Automation; Yining Ye et al
  • Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning; Zhiting Hu et al
  • A Roadmap to Pluralistic Alignment; Taylor Sorensen et al
  • Towards Unified Alignment Between Agents, Humans, and Environment; Zonghan Yang et al
  • Video as the New Language for Real-World Decision Making; Sherry Yang et al
  • A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI; Seliem El-Sayed et al

Structure

  • Finding Structural Knowledge in Multimodal-BERT; Victor Milewski et al
  • Going Beyond Nouns With Vision & Language Models Using Synthetic Data; Paola Cascante-Bonilla et al
  • Measuring Progress in Fine-grained Vision-and-Language Understanding; Emanuele Bugliarello et al
  • PV2TEA: Patching Visual Modality to Textual-Established Information Extraction; Hejie Cui et al

Event Extraction

  • Cross-media Structured Common Space for Multimedia Event Extraction; Manling Li et al; Focus on image-text event extraction. A new benchmark and baseline are proposed.
  • Visual Semantic Role Labeling for Video Understanding; Arka Sadhu et al; A new benchmark is proposed.
  • GAIA: A Fine-grained Multimedia Knowledge Extraction System; Manling Li et al; Demo paper. Extract knowledge (relation, event) from multimedia data.
  • MMEKG: Multi-modal Event Knowledge Graph towards Universal Representation across Modalities; Yubo Ma et al

Situation Recognition

  • Situation Recognition: Visual Semantic Role Labeling for Image Understanding; Mark Yatskar et al; Focus on image understanding. Given images, do the semantic role labeling task. No text available. A new benchmark and baseline are proposed.
  • Commonly Uncommon: Semantic Sparsity in Situation Recognition; Mark Yatskar et al; Address the long-tail problem.
  • Grounded Situation Recognition; Sarah Pratt et al
  • Rethinking the Two-Stage Framework for Grounded Situation Recognition; Meng Wei et al
  • Collaborative Transformers for Grounded Situation Recognition; Junhyeong Cho et al
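
The task catalogued above predicts an activity verb together with a set of semantic roles for an image. As a rough illustration only (the frame layout and the role/noun names below are invented for the example, not taken from any benchmark), a predicted situation can be sketched as:

```python
from dataclasses import dataclass, field

# Rough sketch of an imSitu-style situation frame: an activity verb plus
# a mapping from semantic roles to noun fillers. Role and noun names are
# illustrative, not drawn from the dataset.
@dataclass
class SituationFrame:
    verb: str                                  # depicted activity, e.g. "riding"
    roles: dict = field(default_factory=dict)  # role name -> noun filler ("" = unfilled)

    def realized(self):
        """Return only the roles that actually have a filler."""
        return {role: noun for role, noun in self.roles.items() if noun}

frame = SituationFrame(
    verb="riding",
    roles={"agent": "man", "vehicle": "bicycle", "place": ""},
)
print(frame.realized())  # {'agent': 'man', 'vehicle': 'bicycle'}
```

The grounded variants of the task above additionally attach a bounding box to each filled role.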

Scene Graph

  • Action Genome: Actions as Composition of Spatio-temporal Scene Graphs; Jingwei Ji et al; Spatio-temporal scene graphs (video).
  • Unbiased Scene Graph Generation from Biased Training; Kaihua Tang et al
  • Visual Distant Supervision for Scene Graph Generation; Yuan Yao et al
  • Learning to Generate Scene Graph from Natural Language Supervision; Yiwu Zhong et al
  • Weakly Supervised Visual Semantic Parsing; Alireza Zareian, Svebor Karaman, Shih-Fu Chang
  • Scene Graph Prediction with Limited Labels; Vincent S. Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Re, Li Fei-Fei
  • Neural Motifs: Scene Graph Parsing with Global Context; Rowan Zellers et al
  • Fine-Grained Scene Graph Generation with Data Transfer; Ao Zhang et al
  • Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning; Tao He et al
  • Compositional Prompt Tuning with Motion Cues for Open-Vocabulary Video Relation Detection; Kaifeng Gao et al; Video.
  • LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation; Xiaoguang Chang et al
  • Transformer-based Image Generation from Scene Graphs; Renato Sortino et al
  • The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation; Lin Li et al
  • Knowledge-augmented Few-shot Visual Relation Detection; Tianyu Yu et al
  • Prototype-based Embedding Network for Scene Graph Generation; Chaofan Zheng et al
  • Unified Visual Relationship Detection with Vision and Language Models; Long Zhao et al
  • Structure-CLIP: Enhance Multi-modal Language Representations with Structure Knowledge; Yufeng Huang et al
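
A scene graph, as used throughout this section, encodes an image as subject-predicate-object triples over detected objects. A minimal sketch (the triples are invented for illustration), including the predicate-frequency tally whose long-tail skew the debiasing papers above target:

```python
from collections import Counter

# Hedged sketch: a scene graph as (subject, predicate, object) triples.
# The triples below are illustrative, not drawn from any dataset.
triples = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
    ("hat", "on", "man"),
]

def neighbors(graph, node):
    """Predicate/object pairs for which `node` appears as the subject."""
    return [(p, o) for s, p, o in graph if s == node]

# Frequency of each predicate: in real data a few predicates like "on"
# dominate, which is the bias that unbiased SGG methods correct for.
predicate_freq = Counter(p for _, p, _ in triples)

print(neighbors(triples, "man"))      # [('riding', 'horse'), ('wearing', 'hat')]
print(predicate_freq.most_common(1))  # [('on', 2)]
```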

Attribute

  • COCO Attributes: Attributes for People, Animals, and Objects; Genevieve Patterson et al
  • Human Attribute Recognition by Deep Hierarchical Contexts; Yining Li et al; Attribute prediction in specific domains.
  • Emotion Recognition in Context; Ronak Kosti et al; Attribute prediction in specific domains.
  • The iMaterialist Fashion Attribute Dataset; Sheng Guo et al; Attribute prediction in specific domains.
  • Learning to Predict Visual Attributes in the Wild; Khoi Pham et al
  • Open-vocabulary Attribute Detection; María A. Bravo et al
  • OvarNet: Towards Open-vocabulary Object Attribute Recognition; Keyan Chen et al
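
Attribute recognition as framed above is typically multi-label: each object carries a set of attributes. A toy sketch of per-object F1 scoring under that framing (objects, labels, and predictions are invented for the example):

```python
# Hedged sketch: multi-label attribute prediction scored per object.
# Gold labels and predictions below are illustrative only.
gold = {"car": {"red", "shiny"}, "cat": {"furry", "small"}}
pred = {"car": {"red", "old"}, "cat": {"furry", "small"}}

def per_object_f1(gold, pred):
    """F1 between gold and predicted attribute sets, per object."""
    scores = {}
    for obj, g in gold.items():
        p = pred.get(obj, set())
        tp = len(g & p)                      # correctly predicted attributes
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[obj] = round(f1, 2)
    return scores

print(per_object_f1(gold, pred))  # {'car': 0.5, 'cat': 1.0}
```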

Compositionality

  • CREPE: Can Vision-Language Foundation Models Reason Compositionally?; Zixian Ma et al
  • Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality; Tristan Thrush et al
  • When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do About It?; Mert Yuksekgonul et al
  • GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering; Drew A. Hudson et al
  • COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images; Ben Bogin et al
  • Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension; Zhenfang Chen et al
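
Several works above (e.g. the bags-of-words study) probe compositionality by pairing each caption with a word-shuffled hard negative; a vision-language model that scores the two alike is ignoring word order. A minimal sketch of constructing such a negative (the scoring model itself is not shown):

```python
import random

def shuffled_negative(caption, seed=0):
    """Return a hard negative caption: the same words in a different order.

    Assumes the caption contains at least two distinct words; otherwise
    the caption is returned unchanged, since no reordering can differ.
    """
    words = caption.split()
    if len(set(words)) < 2:
        return caption
    rng = random.Random(seed)
    shuffled = words[:]
    while shuffled == words:   # retry until the order actually changes
        rng.shuffle(shuffled)
    return " ".join(shuffled)

pos = "the horse is eating the grass"
neg = shuffled_negative(pos)
assert sorted(neg.split()) == sorted(pos.split())  # same bag of words
assert neg != pos                                  # different composition
```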
