DOI: 10.1145/3640543.3645164
Research Article · Open Access

ExpressEdit: Video Editing with Natural Language and Sketching

Published: 05 April 2024

Abstract

Informational videos serve as a crucial source of conceptual and procedural knowledge for novices and experts alike. When producing informational videos, editors overlay text and images or trim footage to enhance video quality and make the result more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors, who often struggle to express and implement their editing ideas. To address this challenge, we first explored how two natural modalities of human expression, natural language (NL) and sketching, can support video editors in expressing editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed patterns in how NL and sketching are used to describe edit intents. Based on these findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by an LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command, as well as spatial references from sketches. The system then implements the interpreted edits, which the user can iterate on. An observational study (N=10) showed that ExpressEdit enhanced novice video editors' ability to express and implement their edit ideas. By generating edits from users' multimodal commands and supporting iteration on those commands, the system allowed participants to perform edits more efficiently and to generate more ideas. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.


Cited By

• (2024) Enhancing How People Learn Procedural Tasks Through How-to Videos. In Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1–5. https://doi.org/10.1145/3672539.3686711. Online publication date: 13 Oct 2024.


Published In

IUI '24: Proceedings of the 29th International Conference on Intelligent User Interfaces
March 2024, 955 pages
ISBN: 9798400705083
DOI: 10.1145/3640543
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. human-AI interaction
      2. multimodal input
      3. video editing

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

• Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-01347, Video Interaction Technologies Using Object-Oriented Video Modeling).

Conference

IUI '24

Acceptance Rates

Overall Acceptance Rate: 746 of 2,811 submissions, 27%

