DOI: 10.1145/3640543.3645164
Research Article · Open Access

ExpressEdit: Video Editing with Natural Language and Sketching

Published: 05 April 2024

Abstract

Informational videos serve as a crucial source of conceptual and procedural knowledge for novices and experts alike. When producing informational videos, editors overlay text and images or trim footage to enhance video quality and make the result more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors, who often struggle to express and implement their editing ideas. To address this challenge, we first explored how two natural modalities of human expression, natural language (NL) and sketching, can support video editors in expressing editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed patterns in how NL and sketching are used to describe edit intents. Based on these findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by an LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command, as well as spatial references from sketches. The system then implements the interpreted edits, which the user can iterate on. An observational study (N=10) showed that ExpressEdit enhanced novice video editors' ability to express and implement their edit ideas. By generating edits from users' multimodal commands and supporting iteration on those commands, the system allowed participants to perform edits more efficiently and to generate more ideas. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.


Cited By

• (2024) Enhancing How People Learn Procedural Tasks Through How-to Videos. In Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1–5. https://doi.org/10.1145/3672539.3686711. Online publication date: 13 Oct 2024.


Published In

IUI '24: Proceedings of the 29th International Conference on Intelligent User Interfaces
March 2024, 955 pages
ISBN: 9798400705083
DOI: 10.1145/3640543
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. human-AI interaction
      2. multimodal input
      3. video editing

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

• Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-01347, Video Interaction Technologies Using Object-Oriented Video Modeling).

Conference

IUI '24

Acceptance Rates

Overall Acceptance Rate: 746 of 2,811 submissions, 27%

