Multimodal-AND-Large-Language-Models

An Overview of Frontier Research on Multimodal Models and Large Language Models

This project collects recent research progress in multimodal and large language models, covering core techniques such as structured knowledge extraction, event extraction, scene graph generation, and attribute recognition. It also examines frontier problems for vision-language models around reasoning, compositionality, and open-vocabulary settings, and gathers a large number of surveys and position papers to give researchers a comprehensive overview of the field and pointers to future directions.


Multimodal & Large Language Models

Note: This paper list only records papers I read from the daily arXiv listings for personal use. I subscribe to and cover only the following subjects: Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), and Machine Learning (cs.LG). If you find I have missed important and exciting work, it would be super helpful to let me know. Thanks!

Table of Contents

Survey

  • Multimodal Learning with Transformers: A Survey; Peng Xu, Xiatian Zhu, David A. Clifton
  • Multimodal Machine Learning: A Survey and Taxonomy; Tadas Baltrusaitis, Chaitanya Ahuja, Louis-Philippe Morency; Introduces five challenges for multimodal learning: representation, translation, alignment, fusion, and co-learning.
  • Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions; Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
  • Multimodal research in vision and language: A review of current and emerging trends; Shagun Uppal et al;
  • Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods; Aditya Mogadala et al
  • Challenges and Prospects in Vision and Language Research; Kushal Kafle et al
  • A Survey of Current Datasets for Vision and Language Research; Francis Ferraro et al
  • VLP: A Survey on Vision-Language Pre-training; Feilong Chen et al
  • A Survey on Multimodal Disinformation Detection; Firoj Alam et al
  • Vision-Language Pre-training: Basics, Recent Advances, and Future Trends; Zhe Gan et al
  • Deep Multimodal Representation Learning: A Survey; Wenzhong Guo et al
  • The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges; Maria Lymperaiou et al
  • Augmented Language Models: a Survey; Grégoire Mialon et al
  • Multimodal Deep Learning; Matthias Aßenmacher et al
  • Sparks of Artificial General Intelligence: Early experiments with GPT-4; Sebastien Bubeck et al
  • Retrieving Multimodal Information for Augmented Generation: A Survey; Ruochen Zhao et al
  • Is Prompt All You Need? No. A Comprehensive and Broader View of Instruction Learning; Renze Lou et al
  • A Survey of Large Language Models; Wayne Xin Zhao et al
  • Tool Learning with Foundation Models; Yujia Qin et al
  • A Cookbook of Self-Supervised Learning; Randall Balestriero et al
  • Foundation Models for Decision Making: Problems, Methods, and Opportunities; Sherry Yang et al
  • Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation; Patrick Fernandes et al
  • Reasoning with Language Model Prompting: A Survey; Shuofei Qiao et al
  • Towards Reasoning in Large Language Models: A Survey; Jie Huang et al
  • Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models; Chen Ling et al
  • Unifying Large Language Models and Knowledge Graphs: A Roadmap; Shirui Pan et al
  • Interactive Natural Language Processing; Zekun Wang et al
  • A Survey on Multimodal Large Language Models; Shukang Yin et al
  • Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment; Yang Liu et al
  • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback; Stephen Casper et al
  • Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies; Liangming Pan et al
  • Challenges and Applications of Large Language Models; Jean Kaddour et al
  • Aligning Large Language Models with Human: A Survey; Yufei Wang et al
  • Instruction Tuning for Large Language Models: A Survey; Shengyu Zhang et al
  • From Instructions to Intrinsic Human Values —— A Survey of Alignment Goals for Big Models; Jing Yao et al
  • A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation; Xiaowei Huang et al
  • Explainability for Large Language Models: A Survey; Haiyan Zhao et al
  • Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models; Yue Zhang et al
  • Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity; Cunxiang Wang et al
  • ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?; Hailin Chen et al
  • Vision-Language Instruction Tuning: A Review and Analysis; Chen Li et al
  • The Mystery and Fascination of LLMs: A Comprehensive Survey on the Interpretation and Analysis of Emergent Abilities; Yuxiang Zhou et al
  • Efficient Large Language Models: A Survey; Zhongwei Wan et al
  • The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision); Zhengyuan Yang et al
  • Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents; Zhuosheng Zhang et al
  • Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis; Yafei Hu et al
  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants; Chunyuan Li et al
  • A Survey on Large Language Model based Autonomous Agents; Lei Wang et al
  • Video Understanding with Large Language Models: A Survey; Yunlong Tang et al
  • A Survey of Preference-Based Reinforcement Learning Methods; Christian Wirth et al
  • AI Alignment: A Comprehensive Survey; Jiaming Ji et al
  • A Survey of Reinforcement Learning from Human Feedback; Timo Kaufmann et al
  • TrustLLM: Trustworthiness in Large Language Models; Lichao Sun et al
  • Agent AI: Surveying the Horizons of Multimodal Interaction; Zane Durante et al
  • Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: A Short Survey; Cedric Colas et al
  • Safety of Multimodal Large Language Models on Images and Text; Xin Liu et al
  • MM-LLMs: Recent Advances in MultiModal Large Language Models; Duzhen Zhang et al
  • Rethinking Interpretability in the Era of Large Language Models; Chandan Singh et al
  • Large Multimodal Agents: A Survey; Junlin Xie et al
  • A Survey on Data Selection for Language Models; Alon Albalak et al
  • What Are Tools Anyway? A Survey from the Language Model Perspective; Zora Zhiruo Wang et al
  • Best Practices and Lessons Learned on Synthetic Data for Language Models; Ruibo Liu et al
  • A Survey on the Memory Mechanism of Large Language Model based Agents; Zeyu Zhang et al
  • A Survey on Self-Evolution of Large Language Models; Zhengwei Tao et al
  • When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models; Xianzheng Ma et al
  • An Introduction to Vision-Language Modeling; Florian Bordes et al
  • Towards Scalable Automated Alignment of LLMs: A Survey; Boxi Cao et al
  • A Survey on Mixture of Experts; Weilin Cai et al
  • The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective; Zhen Qin et al
  • Retrieval-Augmented Generation for Large Language Models: A Survey; Yunfan Gao et al

Position Paper

  • Eight Things to Know about Large Language Models; Samuel R. Bowman et al
  • A PhD Student’s Perspective on Research in NLP in the Era of Very Large Language Models; Oana Ignat et al
  • Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models; Yuxi Ma et al
  • Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models; Lingxi Xie et al
  • A Path Towards Autonomous Machine Intelligence; Yann LeCun et al
  • GPT-4 Can’t Reason; Konstantine Arkoudas et al
  • Cognitive Architectures for Language Agents; Theodore Sumers et al
  • Large Search Model: Redefining Search Stack in the Era of LLMs; Liang Wang et al
  • ProAgent: From Robotic Process Automation to Agentic Process Automation; Yining Ye et al
  • Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning; Zhiting Hu et al
  • A Roadmap to Pluralistic Alignment; Taylor Sorensen et al
  • Towards Unified Alignment Between Agents, Humans, and Environment; Zonghan Yang et al
  • Video as the New Language for Real-World Decision Making; Sherry Yang et al
  • A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI; Seliem El-Sayed et al

Structure

  • Finding Structural Knowledge in Multimodal-BERT; Victor Milewski et al
  • Going Beyond Nouns With Vision & Language Models Using Synthetic Data; Paola Cascante-Bonilla et al
  • Measuring Progress in Fine-grained Vision-and-Language Understanding; Emanuele Bugliarello et al
  • PV2TEA: Patching Visual Modality to Textual-Established Information Extraction; Hejie Cui et al

Event Extraction

  • Cross-media Structured Common Space for Multimedia Event Extraction; Manling Li et al; Focus on image-text event extraction. A new benchmark and baseline are proposed.
  • Visual Semantic Role Labeling for Video Understanding; Arka Sadhu et al; A new benchmark is proposed.
  • GAIA: A Fine-grained Multimedia Knowledge Extraction System; Manling Li et al; Demo paper. Extract knowledge (relation, event) from multimedia data.
  • MMEKG: Multi-modal Event Knowledge Graph towards Universal Representation across Modalities; Yubo Ma et al
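Multimedia event extraction, as in the entries above, produces typed event records whose arguments may be grounded in the text, the image, or both. A minimal sketch of such a record (the `Conflict.Attack` type and `Attacker`/`Target` roles follow ACE-style conventions common in this line of work; the classes and values are purely illustrative, not any listed system's actual API):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Argument:
    role: str                                   # e.g. "Attacker", "Target"
    text_span: Optional[str] = None             # mention in the article, if any
    image_box: Optional[Tuple[int, int, int, int]] = None  # region, if visually grounded

@dataclass
class MultimediaEvent:
    event_type: str
    trigger: str                                # word that evokes the event
    arguments: List[Argument] = field(default_factory=list)

# A toy event grounded in both modalities: the Target has a text mention
# and an image region, the Attacker only a text mention.
event = MultimediaEvent(
    event_type="Conflict.Attack",
    trigger="bombed",
    arguments=[
        Argument("Attacker", text_span="rebel forces"),
        Argument("Target", text_span="the bridge", image_box=(40, 60, 200, 180)),
    ],
)
print([a.role for a in event.arguments])  # ['Attacker', 'Target']
```

The cross-modal benchmarks above essentially score how well a system recovers records of this shape from paired articles and images.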

Situation Recognition

  • Situation Recognition: Visual Semantic Role Labeling for Image Understanding; Mark Yatskar et al; Focus on image understanding. Given images, do the semantic role labeling task. No text available. A new benchmark and baseline are proposed.
  • Commonly Uncommon: Semantic Sparsity in Situation Recognition; Mark Yatskar et al; Address the long-tail problem.
  • Grounded Situation Recognition; Sarah Pratt et al
  • Rethinking the Two-Stage Framework for Grounded Situation Recognition; Meng Wei et al
  • Collaborative Transformers for Grounded Situation Recognition; Junhyeong Cho et al
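Situation recognition pairs a verb with a semantic role frame; the grounded variant in the Pratt et al. line additionally localizes each role filler with a bounding box when possible. A toy sketch of the output structure (verb, roles, fillers, and boxes are invented for illustration):

```python
# A grounded situation: verb plus role -> (noun filler, optional box).
situation = {
    "verb": "riding",
    "frames": {
        "agent": ("man", (30, 10, 120, 200)),
        "vehicle": ("bicycle", (20, 90, 160, 230)),
        "place": ("street", None),   # some roles stay ungrounded
    },
}

def grounded_roles(sit):
    """Return the roles whose fillers are localized in the image."""
    return [role for role, (_, box) in sit["frames"].items() if box is not None]

print(grounded_roles(situation))  # ['agent', 'vehicle']
```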

Scene Graph

  • Action Genome: Actions as Composition of Spatio-temporal Scene Graphs; Jingwei Ji et al; Spatio-temporal scene graphs (video).
  • Unbiased Scene Graph Generation from Biased Training; Kaihua Tang et al
  • Visual Distant Supervision for Scene Graph Generation; Yuan Yao et al
  • Learning to Generate Scene Graph from Natural Language Supervision; Yiwu Zhong et al
  • Weakly Supervised Visual Semantic Parsing; Alireza Zareian, Svebor Karaman, Shih-Fu Chang
  • Scene Graph Prediction with Limited Labels; Vincent S. Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Re, Li Fei-Fei
  • Neural Motifs: Scene Graph Parsing with Global Context; Rowan Zellers et al
  • Fine-Grained Scene Graph Generation with Data Transfer; Ao Zhang et al
  • Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning; Tao He et al
  • Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection; Kaifeng Gao et al; Video.
  • LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation; Xiaoguang Chang et al
  • Transformer-based Image Generation from Scene Graphs; Renato Sortino et al
  • The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation; Lin Li et al
  • Knowledge-augmented Few-shot Visual Relation Detection; Tianyu Yu et al
  • Prototype-based Embedding Network for Scene Graph Generation; Chaofan Zheng et al
  • Unified Visual Relationship Detection with Vision and Language Models; Long Zhao et al
  • Structure-CLIP: Enhance Multi-modal Language Representations with Structure Knowledge; Yufeng Huang et al
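The shared output format of the papers in this section is a scene graph: a set of labeled object nodes plus (subject, predicate, object) triples over them. A minimal illustrative structure, not tied to any listed model's code:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneGraph:
    objects: List[str]                      # node labels, indexed by position
    relations: List[Tuple[int, str, int]]   # (subject_idx, predicate, object_idx)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Resolve index-based relations into readable label triples."""
        return [(self.objects[s], p, self.objects[o]) for s, p, o in self.relations]

g = SceneGraph(
    objects=["man", "horse", "hat"],
    relations=[(0, "riding", 1), (0, "wearing", 2)],
)
print(g.triples())  # [('man', 'riding', 'horse'), ('man', 'wearing', 'hat')]
```

Most of the generation work above concerns predicting the `relations` list well, e.g. under long-tailed predicate distributions or open vocabularies.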

Attribute

  • COCO Attributes: Attributes for People, Animals, and Objects; Genevieve Patterson et al
  • Human Attribute Recognition by Deep Hierarchical Contexts; Yining Li et al; Attribute prediction in specific domains.
  • Emotion Recognition in Context; Ronak Kosti et al; Attribute prediction in specific domains.
  • The iMaterialist Fashion Attribute Dataset; Sheng Guo et al; Attribute prediction in specific domains.
  • Learning to Predict Visual Attributes in the Wild; Khoi Pham et al
  • Open-vocabulary Attribute Detection; María A. Bravo et al
  • OvarNet: Towards Open-vocabulary Object Attribute Recognition; Keyan Chen et al
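Open-vocabulary attribute recognition, as in the last two entries, is typically cast as similarity scoring between a region embedding and text embeddings of attribute names, so unseen attributes can be queried at test time. A hand-made sketch of that scoring step (the 3-d vectors below are invented stand-ins; in an OvarNet-style system both sides would come from an aligned vision-language encoder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings: one image region vs. a text-defined attribute vocabulary.
region = [0.9, 0.1, 0.2]
attribute_vocab = {
    "striped": [0.8, 0.2, 0.1],
    "metallic": [0.1, 0.9, 0.3],
    "wooden": [0.2, 0.1, 0.9],
}

scores = {name: cosine(region, vec) for name, vec in attribute_vocab.items()}
best = max(scores, key=scores.get)
print(best)  # 'striped'
```

Because the vocabulary is just a dict of text-derived vectors, adding a new attribute requires no retraining, which is the point of the open-vocabulary formulation.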

Compositionality

  • CREPE: Can Vision-Language Foundation Models Reason Compositionally?; Zixian Ma et al
  • Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality; Tristan Thrush et al
  • When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do About It?; Mert Yuksekgonul et al
  • GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering; Drew A. Hudson et al
  • COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images; Ben Bogin et al
  • Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension; Zhenfang Chen et al
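Benchmarks like Winoground and the bags-of-words study above exploit the fact that a scorer which ignores word order cannot tell swapped relations apart. A deliberately broken toy scorer makes the failure mode concrete (this is a caricature for illustration, not how any listed model actually computes image-text similarity):

```python
from collections import Counter

def bag_of_words_similarity(caption_a: str, caption_b: str) -> float:
    """Word-order-blind overlap between two captions: the failure mode
    compositionality benchmarks are designed to expose."""
    a, b = Counter(caption_a.split()), Counter(caption_b.split())
    overlap = sum((a & b).values())          # multiset intersection size
    return overlap / max(sum(a.values()), sum(b.values()))

# Swapping subject and object leaves the bag of words unchanged,
# so the two captions are indistinguishable to this scorer.
s = bag_of_words_similarity(
    "the horse is eating the grass",
    "the grass is eating the horse",
)
print(s)  # 1.0
```

A model whose text representation collapses to something like this will score both captions equally against either image, which is exactly the behavior these benchmarks measure.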
