[Issue]: The Text2Video, Image2Video, and Stable Video Diffusion scripts are not working in DirectML #3342

rodrigoandrigo opened this issue Jul 16, 2024 · 0 comments

Issue Description

I tried the three video-generation scripts with DirectML, but none of them worked:

Text-to-Video (in the Text tab)
Models: Potat v1, ZeroScope v2 Dark, ModelScope 1.7b

Image-to-Video (in the Image tab)
Model: VGen

Stable Video Diffusion (in the Image tab)
Model: SVD XT 1.1

Version Platform Description

2024-07-16 12:38:07,164 | sd | INFO | launch | Starting SD.Next
2024-07-16 12:38:07,169 | sd | INFO | installer | Logger: file="C:\StabilityMatrix\Data\Packages\SD.Next\sdnext.log" level=INFO size=899852 mode=append
2024-07-16 12:38:07,171 | sd | INFO | installer | Python version=3.10.11 platform=Windows bin="C:\StabilityMatrix\Data\Packages\SD.Next\venv\Scripts\python.exe" venv="C:\StabilityMatrix\Data\Packages\SD.Next\venv"
2024-07-16 12:38:07,474 | sd | INFO | installer | Version: app=sd.next updated=2024-07-10 hash=2ec6e9ee branch=master url=https://github.com/vladmandic/automatic/tree/master ui=main
2024-07-16 12:38:08,050 | sd | INFO | launch | Platform: arch=AMD64 cpu=AMD64 Family 25 Model 80 Stepping 0, AuthenticAMD system=Windows release=Windows-10-10.0.22631-SP0 python=3.10.11
2024-07-16 12:38:08,053 | sd | DEBUG | installer | Torch allocator: "garbage_collection_threshold:0.80,max_split_size_mb:512"
2024-07-16 12:38:08,054 | sd | DEBUG | installer | Torch overrides: cuda=False rocm=False ipex=False diml=True openvino=False
2024-07-16 12:38:08,054 | sd | DEBUG | installer | Torch allowed: cuda=False rocm=False ipex=False diml=True openvino=False
2024-07-16 12:38:08,054 | sd | INFO | installer | Using DirectML Backend
2024-07-16 09:35:37,397 | sd | DEBUG | launch | Starting module: <module 'webui' from 'C:\StabilityMatrix\Data\Packages\SD.Next\webui.py'>
2024-07-16 09:35:37,397 | sd | INFO | launch | Command line args: ['--medvram', '--autolaunch', '--use-directml'] medvram=True autolaunch=True use_directml=True
2024-07-16 09:35:37,399 | sd | DEBUG | launch | Env flags: []
2024-07-16 09:37:38,790 | sd | INFO | loader | Load packages: {'torch': '2.3.1+cpu', 'diffusers': '0.29.1', 'gradio': '3.43.2'}
2024-07-16 09:37:42,767 | sd | DEBUG | shared | Read: file="config.json" json=35 bytes=1548 time=0.000
2024-07-16 09:37:42,821 | sd | INFO | shared | Engine: backend=Backend.DIFFUSERS compute=directml device=privateuseone:0 attention="Dynamic Attention BMM" mode=no_grad
2024-07-16 09:37:42,979 | sd | INFO | shared | Device: device=AMD Radeon RX 6600M n=1 directml=0.2.2.dev240614
2024-07-16 09:37:42,987 | sd | DEBUG | shared | Read: file="html\reference.json" json=45 bytes=25986 time=0.006
2024-07-16 09:38:04,704 | sd | DEBUG | init | ONNX: version=1.18.1 provider=DmlExecutionProvider, available=['AzureExecutionProvider', 'CPUExecutionProvider']

Relevant log output

Text-to-Video
Model: Potat v1
12:47:35-275745 ERROR    Arguments: args=('task(c5jnmnvhq3xjo9w)', 'woman,    
                         sitting on couch, female curvy, detailed face,        
                         perfect face, correct eyes, hairstyles, detailed     
                         muzzle, detailed mouth, five fingers, proper hands,   
                         proper shading, proper lighting, detailed character,  
                         high quality,', 'worst quality, bad quality, (text),  
                         ((signature, watermark)), extra limb, deformed hands, 
                         deformed feet, multiple tails, deformed, disfigured,  
                         poorly drawn face, mutated, extra limb, ugly, face out
                         of frame, oversaturated, sketch, comic, no pupils,    
                         simple background, ((blurry)), mutation, intersex, bad
                         anatomy, disfigured,', [], 20, 0, 26, True, False,    
                         False, False, 1, 1, 6, 6, 0.7, 0, 0.5, 1, 1, -1.0,    
                         -1.0, 0, 0, 0, 512, 512, False, 0.3, 2, 'None', False,
                         20, 0, 0, 10, 0, '', '', 0, 0, 0, 0, False, 4, 0.95,  
                         False, 0.6, 1, '#000000', 0, [], 11, 1, 'None',       
                         'None', 'None', 'None', 0.5, 0.5, 0.5, 0.5, None,     
                         None, None, None, 0, 0, 0, 0, 1, 1, 1, 1, None, None, 
                         None, None, False, '', 'None', 16, 'None', 1, True,   
                         'None', 2, True, 1, 0, True, 'none', 3, 4, 0.25, 0.25,
                         3, 1, 1, 0.8, 8, 64, True, True, 0.5, 600.0, 1.0, 1,  
                         1, 0.5, 0.5, 'OpenGVLab/InternVL-14B-224px', False,   
                         False, 'positive', 'comma', 0, False, False, '',      
                         'None', '', 1, '', 'None', 1, True, 10, 'Potat v1',   
                         True, 24, 'GIF', 2, True, 1, 0, 0, '', [], 0, '', [], 
                         0, '', [], False, True, False, False, False, False, 0,
                         'None', [], 'FaceID Base', True, True, 1, 1, 1, 0.5,  
                         False, 'person', 1, 0.5, True) kwargs={}              
12:47:35-284260 ERROR    gradio call: AttributeError                           
┌───────────────────── Traceback (most recent call last) ─────────────────────┐
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\call_queue.py:31 in f      │
│                                                                             │
│   30 │   │   │   try:                                                       │
│ > 31 │   │   │   │   res = func(*args, **kwargs)                            │
│   32 │   │   │   │   progress.record_results(id_task, res)                  │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\txt2img.py:89 in txt2img   │
│                                                                             │
│   88 │   p.script_args = args                                               │
│ > 89 │   processed = scripts.scripts_txt2img.run(p, *args)                  │
│   90 │   if processed is None:                                              │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\scripts.py:483 in run      │
│                                                                             │
│   482 │   │   parsed = p.per_script_args.get(script.title(), args[script.ar │
│ > 483 │   │   processed = script.run(p, *parsed)                            │
│   484 │   │   s.record(script.title())                                      │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\scripts\text2video.py:88 in run    │
│                                                                             │
│    87 │   │   │   shared.opts.sd_model_checkpoint = checkpoint              │
│ >  88 │   │   │   sd_models.reload_model_weights(op='model')                │
│    89                                                                       │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\sd_models.py:1572 in reloa │
│                                                                             │
│   1571 │   from modules import lowvram, sd_hijack                           │
│ > 1572 │   checkpoint_info = info or select_checkpoint(op=op) # are we sele │
│   1573 │   next_checkpoint_info = info or select_checkpoint(op='dict' if lo │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\sd_models.py:248 in select │
│                                                                             │
│    247 │   │   return None                                                  │
│ >  248 │   checkpoint_info = get_closet_checkpoint_match(model_checkpoint)  │
│    249 │   if checkpoint_info is not None:                                  │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\sd_models.py:197 in get_cl │
│                                                                             │
│    196 def get_closet_checkpoint_match(search_string):                      │
│ >  197 │   if search_string.startswith('huggingface/'):                     │
│    198 │   │   model_name = search_string.replace('huggingface/', '')       │
└─────────────────────────────────────────────────────────────────────────────┘
AttributeError: 'CheckpointInfo' object has no attribute 'startswith'

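For what it's worth, this first failure looks like a type confusion rather than a DirectML problem: the traceback shows `get_closet_checkpoint_match` (the function name as it appears in the traceback) receiving a `CheckpointInfo` object where it expects a plain string, so `startswith` fails. A minimal defensive sketch, not the project's actual fix; the `title` attribute is an assumption based on similar webui code:

```python
# Hedged sketch, not the project's actual fix: coerce whatever object is
# stored in opts.sd_model_checkpoint to a plain string before string matching.
def get_closet_checkpoint_match(search_string):  # name as in the traceback
    if not isinstance(search_string, str):
        # CheckpointInfo is assumed to expose an identifier via a title-like
        # attribute or str(); the attribute name here is a guess
        search_string = getattr(search_string, 'title', None) or str(search_string)
    if search_string.startswith('huggingface/'):
        model_name = search_string.replace('huggingface/', '')
    ...  # rest of the original lookup logic unchanged
```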

Text-to-Video
Model: ZeroScope v2 Dark
12:50:00-451738 ERROR    Arguments: args=('task(yfgrwdtd3i1wg4r)', 'woman,        
                         sitting on couch, female curvy, detailed face,        
                         perfect face, correct eyes, hairstyles, detailed     
                         muzzle, detailed mouth, five fingers, proper hands,   
                         proper shading, proper lighting, detailed character,  
                         high quality,', 'worst quality, bad quality, (text),  
                         ((signature, watermark)), extra limb, deformed hands, 
                         deformed feet, multiple tails, deformed, disfigured,  
                         poorly drawn face, mutated, extra limb, ugly, face out
                         of frame, oversaturated, sketch, comic, no pupils,    
                         simple background, ((blurry)), mutation, intersex, bad
                         anatomy, disfigured,', [], 20, 7, 26, True, False,    
                         False, False, 1, 1, 6, 6, 0.7, 0, 0.5, 1, 1, -1.0,    
                         -1.0, 0, 0, 0, 512, 512, False, 0.3, 2, 'None', False,
                         20, 0, 0, 10, 0, '', '', 0, 0, 0, 0, False, 4, 0.95,  
                         False, 0.6, 1, '#000000', 0, [], 11, 1, 'None',       
                         'None', 'None', 'None', 0.5, 0.5, 0.5, 0.5, None,     
                         None, None, None, 0, 0, 0, 0, 1, 1, 1, 1, None, None, 
                         None, None, False, '', 'None', 16, 'None', 1, True,   
                         'None', 2, True, 1, 0, True, 'none', 3, 4, 0.25, 0.25,
                         3, 1, 1, 0.8, 8, 64, True, True, 0.5, 600.0, 1.0, 1,  
                         1, 0.5, 0.5, 'OpenGVLab/InternVL-14B-224px', False,   
                         False, 'positive', 'comma', 0, False, False, '',      
                         'None', '', 1, '', 'None', 1, True, 10, 'ZeroScope v2 
                         Dark', True, 24, 'GIF', 2, True, 1, 0, 0, '', [], 0,  
                         '', [], 0, '', [], False, True, False, False, False,  
                         False, 0, 'None', [], 'FaceID Base', True, True, 1, 1,
                         1, 0.5, False, 'person', 1, 0.5, True) kwargs={}      
12:50:00-459258 ERROR    gradio call: TypeError                                
┌───────────────────── Traceback (most recent call last) ─────────────────────┐
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\call_queue.py:31 in f      │
│                                                                             │
│   30 │   │   │   try:                                                       │
│ > 31 │   │   │   │   res = func(*args, **kwargs)                            │
│   32 │   │   │   │   progress.record_results(id_task, res)                  │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\txt2img.py:89 in txt2img   │
│                                                                             │
│   88 │   p.script_args = args                                               │
│ > 89 │   processed = scripts.scripts_txt2img.run(p, *args)                  │
│   90 │   if processed is None:                                              │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\scripts.py:483 in run      │
│                                                                             │
│   482 │   │   parsed = p.per_script_args.get(script.title(), args[script.ar │
│ > 483 │   │   processed = script.run(p, *parsed)                            │
│   484 │   │   s.record(script.title())                                      │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\scripts\text2video.py:75 in run    │
│                                                                             │
│    74 │   │                                                                 │
│ >  75 │   │   if model['path'] in shared.opts.sd_model_checkpoint:          │
│    76 │   │   │   shared.log.debug(f'Text2Video cached: model={shared.opts. │
└─────────────────────────────────────────────────────────────────────────────┘
TypeError: argument of type 'CheckpointInfo' is not iterable

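The ZeroScope failure is the same type confusion surfacing one call earlier: `model['path'] in shared.opts.sd_model_checkpoint` applies the `in` operator to a `CheckpointInfo`, which supports neither `__contains__` nor iteration. A hedged sketch of a guard at the call site (names taken from the traceback; assumes `str()` yields a matchable identifier):

```python
# Hedged sketch, not the project's actual fix
checkpoint = shared.opts.sd_model_checkpoint
if not isinstance(checkpoint, str):
    checkpoint = str(checkpoint)  # assumption: str() gives a usable name
if model['path'] in checkpoint:
    shared.log.debug(f'Text2Video cached: model={checkpoint}')
```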

Text-to-Video
Model: ModelScope 1.7b
13:02:06-745445 ERROR    Processing: args={'prompt': ['woman, sitting on      
                         couch, female curvy, detailed eyes, perfect eyes,     
                         detailed face, perfect face, perfectly rendered face, 
                         correct eyes, hairstyles, detailed muzzle, detailed   
                         mouth, five fingers, proper hands, proper shading,    
                         proper lighting, detailed character, high quality,'], 
                         'negative_prompt': ['worst quality, bad quality,      
                         (text), ((signature, watermark)), extra limb, deformed
                         hands, deformed feet, multiple tails, deformed,       
                         disfigured, poorly drawn face, mutated, extra limb,   
                         ugly, face out of frame, oversaturated, sketch, comic,
                         no pupils, simple background, ((blurry)), mutation,   
                         intersex, bad anatomy, disfigured,'],                 
                         'guidance_scale': 6, 'generator': [<torch._C.Generator
                         object at 0x0000017C89FBA530>], 'callback_steps': 1,  
                         'callback': <function diffusers_callback_legacy at    
                         0x0000017C8BF3ECB0>, 'num_inference_steps': 20, 'eta':
                         1.0, 'output_type': 'latent', 'width': 320, 'height': 
                         320, 'num_frames': 16} input must be 4-dimensional    
13:02:06-750699 ERROR    Processing: RuntimeError                              
┌───────────────────── Traceback (most recent call last) ─────────────────────┐
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\processing_diffusers.py:12 │
│                                                                             │
│   121 │   │   else:                                                         │
│ > 122 │   │   │   output = shared.sd_model(**base_args)                     │
│   123 │   │   if isinstance(output, dict):                                  │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\torch\utils │
│                                                                             │
│   114 │   │   with ctx_factory():                                           │
│ > 115 │   │   │   return func(*args, **kwargs)                              │
│   116                                                                       │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\diffusers\p │
│                                                                             │
│   596 │   │   │   │   # predict the noise residual                          │
│ > 597 │   │   │   │   noise_pred = self.unet(                               │
│   598 │   │   │   │   │   latent_model_input,                               │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\torch\nn\mo │
│                                                                             │
│   1531 │   │   else:                                                        │
│ > 1532 │   │   │   return self._call_impl(*args, **kwargs)                  │
│   1533                                                                      │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\torch\nn\mo │
│                                                                             │
│   1540 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hook │
│ > 1541 │   │   │   return forward_call(*args, **kwargs)                     │
│   1542                                                                      │
│                                                                             │
│                          ... 12 frames hidden ...                           │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\torch\nn\mo │
│                                                                             │
│   1540 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hook │
│ > 1541 │   │   │   return forward_call(*args, **kwargs)                     │
│   1542                                                                      │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\torch\nn\mo │
│                                                                             │
│    609 │   def forward(self, input: Tensor) -> Tensor:                      │
│ >  610 │   │   return self._conv_forward(input, self.weight, self.bias)     │
│    611                                                                      │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\torch\nn\mo │
│                                                                             │
│    604 │   │   │   )                                                        │
│ >  605 │   │   return F.conv3d(                                             │
│    606 │   │   │   input, weight, bias, self.stride, self.padding, self.dil │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\dml\amp\autocast_mode.py:4 │
│                                                                             │
│   42 │   │   op = getattr(resolved_obj, func_path[-1])                      │
│ > 43 │   │   setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: f │
│   44                                                                        │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\dml\amp\autocast_mode.py:1 │
│                                                                             │
│   14 │   if not torch.dml.is_autocast_enabled:                              │
│ > 15 │   │   return op(*args, **kwargs)                                     │
│   16 │   args = list(map(cast, args))                                       │
└─────────────────────────────────────────────────────────────────────────────┘
RuntimeError: input must be 4-dimensional

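The ModelScope failure is a different class of problem: the pipeline's 3D UNet feeds 5-D tensors to `F.conv3d`, and the DirectML backend rejects conv inputs above four dimensions ("input must be 4-dimensional"). A hedged workaround sketch that routes `conv3d` through the CPU on DirectML (`privateuseone`) devices; untested against SD.Next, and it trades speed for compatibility:

```python
import torch
import torch.nn.functional as F

# Hedged workaround sketch: torch-directml appears to reject >4-D conv
# inputs, so fall back to the CPU when the input lives on a DirectML device.
_orig_conv3d = F.conv3d

def conv3d_with_cpu_fallback(input, weight, bias=None, *args, **kwargs):
    if input.device.type == 'privateuseone':  # DirectML device type
        out = _orig_conv3d(input.cpu(), weight.cpu(),
                           bias.cpu() if bias is not None else None,
                           *args, **kwargs)
        return out.to(input.device)
    return _orig_conv3d(input, weight, bias, *args, **kwargs)

# nn.Conv3d looks up F.conv3d at call time, so this patch takes effect
F.conv3d = conv3d_with_cpu_fallback
```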

Image-to-Video
Model: VGen
13:08:33-673173 WARNING  Pipeline class change failed:                         
                         type=DiffusersTaskType.IMAGE_2_IMAGE                  
                         pipeline=I2VGenXLPipeline AutoPipeline can't find a   
                         pipeline linked to I2VGenXLPipeline for None          
13:08:34-378645 INFO     Base: class=I2VGenXLPipeline                          
13:08:47-883849 ERROR    Processing: args={'prompt': ['woman, sitting on      
                         couch, female curvy, detailed eyes, perfect eyes,     
                         detailed face, perfect face, perfectly rendered face, 
                         correct eyes, hairstyles, detailed muzzle, detailed   
                         mouth, five fingers, proper hands, proper shading,    
                         proper lighting, detailed character, high quality,'], 
                         'negative_prompt': ['worst quality, bad quality,      
                         (text), ((signature, watermark)), extra limb, deformed
                         hands, deformed feet, multiple tails, deformed,       
                         disfigured, poorly drawn face, mutated, extra limb,   
                         ugly, face out of frame, oversaturated, sketch, comic,
                         no pupils, simple background, ((blurry)), mutation,   
                         intersex, bad anatomy, disfigured,'],                 
                         'guidance_scale': 6, 'generator': [<torch._C.Generator
                         object at 0x0000026E161C7150>], 'num_inference_steps':
                         20, 'eta': 1.0, 'output_type': 'pil', 'width': 512,   
                         'height': 512, 'image': <PIL.Image.Image image        
                         mode=RGB size=512x512 at 0x26E118AE500>, 'num_frames':
                         16, 'target_fps': 8, 'decode_chunk_size': 8} the      
                         dimesion of at::Tensor must be 4 or lower, but got 5  
13:08:47-888378 ERROR    Processing: RuntimeError                              
┌───────────────────── Traceback (most recent call last) ─────────────────────┐
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\processing_diffusers.py:12 │
│                                                                             │
│   121 │   │   else:                                                         │
│ > 122 │   │   │   output = shared.sd_model(**base_args)                     │
│   123 │   │   if isinstance(output, dict):                                  │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\torch\utils │
│                                                                             │
│   114 │   │   with ctx_factory():                                           │
│ > 115 │   │   │   return func(*args, **kwargs)                              │
│   116                                                                       │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\diffusers\p │
│                                                                             │
│   639 │   │   image = self.video_processor.preprocess(resized_image).to(dev │
│ > 640 │   │   image_latents = self.prepare_image_latents(                   │
│   641 │   │   │   image,                                                    │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\diffusers\p │
│                                                                             │
│   465 │   │   # duplicate image_latents for each generation per prompt, usi │
│ > 466 │   │   image_latents = image_latents.repeat(num_videos_per_prompt, 1 │
│   467                                                                       │
└─────────────────────────────────────────────────────────────────────────────┘
RuntimeError: the dimesion of at::Tensor must be 4 or lower, but got 5

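The VGen failure looks like the same dimensionality ceiling hitting a plain tensor op instead of a convolution: `image_latents.repeat(...)` runs on a 5-D video latent, and torch-directml reports "the dimesion of at::Tensor must be 4 or lower, but got 5". A minimal reproduction sketch, assuming torch-directml is installed and that 5-D allocation itself succeeds:

```python
import torch
import torch_directml  # assumed available; provides the DirectML device

dml = torch_directml.device()
latents = torch.randn(1, 4, 16, 64, 64, device=dml)  # 5-D video latent
latents.repeat(2, 1, 1, 1, 1)  # expected to raise the same RuntimeError
```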

Stable Video Diffusion
Model: SVD XT 1.1
13:12:03-607975 ERROR    Processing: args={'generator': <torch._C.Generator    
                         object at 0x000001F873C34810>, 'callback_on_step_end':
                         <function diffusers_callback at 0x000001F84F665D80>,  
                         'callback_on_step_end_tensor_inputs': ['latents'],    
                         'num_inference_steps': 20, 'output_type': 'pil',      
                         'image': <PIL.Image.Image image mode=RGB size=1024x576
                         at 0x1F8531FC610>, 'width': 1024, 'height': 576,      
                         'num_frames': 14, 'decode_chunk_size': 6,             
                         'motion_bucket_id': 128, 'noise_aug_strength': 0.1,   
                         'min_guidance_scale': 1, 'max_guidance_scale': 3} the 
                         dimesion of at::Tensor must be 4 or lower, but got 5  
13:12:03-611978 ERROR    Processing: RuntimeError                              
┌───────────────────── Traceback (most recent call last) ─────────────────────┐
│ C:\StabilityMatrix\Data\Packages\SD.Next\modules\processing_diffusers.py:12 │
│                                                                             │
│   121 │   │   else:                                                         │
│ > 122 │   │   │   output = shared.sd_model(**base_args)                     │
│   123 │   │   if isinstance(output, dict):                                  │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\torch\utils │
│                                                                             │
│   114 │   │   with ctx_factory():                                           │
│ > 115 │   │   │   return func(*args, **kwargs)                              │
│   116                                                                       │
│                                                                             │
│ C:\StabilityMatrix\Data\Packages\SD.Next\venv\lib\site-packages\diffusers\p │
│                                                                             │
│   523 │   │   # image_latents [batch, channels, height, width] ->[batch, nu │
│ > 524 │   │   image_latents = image_latents.unsqueeze(1).repeat(1, num_fram │
│   525                                                                       │
└─────────────────────────────────────────────────────────────────────────────┘
RuntimeError: the dimesion of at::Tensor must be 4 or lower, but got 5
13:12:03-690490 WARNING  Pipeline class change failed:                         
                         type=DiffusersTaskType.TEXT_2_IMAGE                   
                         pipeline=StableVideoDiffusionPipeline AutoPipeline    
                         can't find a pipeline linked to                       
                         StableVideoDiffusionPipeline for None
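The SVD XT 1.1 failure appears to reduce to the same limitation as VGen: `image_latents.unsqueeze(1).repeat(1, num_frames, ...)` operates on a 5-D tensor, so the reproduction sketched above for VGen should cover this case too, and any fix or workaround would likely apply to both image-to-video paths.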

Backend

Diffusers

UI

Standard

Branch

Master

Model

StableDiffusion 1.5

Acknowledgements

  • I have read the above and searched for existing issues
  • I confirm that this is classified correctly and it's not an extension issue