Azure AI Video Indexer insights

Warning

Over the past year, Azure AI Video Indexer (VI) announced the removal of its dependency on Azure Media Services (AMS) due to the AMS retirement. Feature adjustments and changes were announced, and a migration guide was provided.

The deadline to complete migration was June 30, 2024. VI has extended the update/migrate deadline so you can update your VI account and opt in to the AMS VI asset migration through July 15, 2024. To use the AMS VI asset migration, you must also extend your AMS account through July. Navigate to your AMS account in the Azure portal and select Click here to extend.

However, if you haven't updated your VI account by June 30, you won't be able to index new videos or play any videos that haven't been migrated. If you update your account after June 30, you can resume indexing immediately, but you won't be able to play videos indexed before the account update until they're migrated through the AMS VI migration.

When a video is indexed, Azure AI Video Indexer analyzes the video and audio content by running more than 30 AI models, generating JSON that contains the video insights, including transcripts, optical character recognition (OCR) elements, faces, topics, and emotions. Each insight type includes instances of time ranges that show when the insight appears in the video.
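To make that JSON shape concrete, here's a minimal sketch that walks an index result and prints each instance's time range. The file name insights.json and the exact field layout (a videos list whose insights object maps insight types to lists of items, each carrying an instances array of start/end timestamps) are assumptions drawn from the description above; check them against your own response.

```python
import json

# Load an insights JSON response previously downloaded from Video Indexer.
# "insights.json" is a placeholder file name used for this sketch.
with open("insights.json", encoding="utf-8") as f:
    index = json.load(f)

# Assumed layout: the result holds a "videos" list, and each video's
# "insights" object maps insight types (transcript, ocr, faces, topics,
# emotions, ...) to lists of items, each with an "instances" time-range list.
insights = index["videos"][0]["insights"]

for insight_type, items in insights.items():
    if not isinstance(items, list):
        continue  # skip scalar fields such as language or duration
    for item in items:
        if not isinstance(item, dict):
            continue
        label = item.get("text") or item.get("name") or item.get("type", "")
        for instance in item.get("instances", []):
            print(f"{insight_type}: {label!r} from {instance['start']} to {instance['end']}")
```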

Use the links in the insights table to learn how to get each insight's JSON response in the web portal and by using the API.
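As a rough sketch of the API side, the following shows retrieving the full insights JSON with the Get Video Index REST call. The location, account ID, video ID, and access token values are placeholders to replace with your own; how you obtain the access token (for example, through Azure Resource Manager) is out of scope for this sketch.

```python
import requests

# Placeholder values; substitute your own account details.
LOCATION = "trial"                    # Azure region, or "trial" for trial accounts
ACCOUNT_ID = "<your-account-id>"      # GUID of your Video Indexer account
VIDEO_ID = "<your-video-id>"          # ID returned when the video was uploaded
ACCESS_TOKEN = "<your-access-token>"  # account-level access token

# Get Video Index returns the insights JSON described above.
url = (
    f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}"
    f"/Videos/{VIDEO_ID}/Index"
)
response = requests.get(url, params={"accessToken": ACCESS_TOKEN})
response.raise_for_status()

index = response.json()
print(index["videos"][0]["insights"].keys())  # insight types present in this video
```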

Insights

| Insight | Description |
|---|---|
| Audio effects detection | Audio effects detection detects acoustic events and classifies them into categories such as laughter, crowd reactions, and alarms or sirens. |
| Face detection | Face detection detects faces in a media file, and then aggregates instances of similar faces into groups. Face detection insights are generated as a categorized list in a JSON file that includes a thumbnail and either a name or an ID for each face. In the web portal, selecting a face's thumbnail displays information like the name of the person (if they were recognized), the percentage of the video in which the person appears, and the person's biography, if they're a celebrity. You can also scroll between instances in the video where the person appears. |
| Keywords extraction | Keywords extraction detects insights on the different keywords discussed in media files. It extracts insights in both single-language and multi-language media files. |
| Labels identification | Labels identification is an Azure AI Video Indexer AI feature that identifies visual objects, like sunglasses, or actions, like swimming, appearing in the video footage of a media file. There are many labels identification categories, and once extracted, labels identification instances are displayed in the Insights tab and can be translated into over 50 languages. Selecting a label opens the instance in the media file; select Play Previous or Play Next to see more instances. |
| Media transcription, translation, and language identification | Transcription, translation, and language identification detects, transcribes, and translates the speech in media files into over 50 languages. |
| Named entities | Named entities extraction uses Natural Language Processing (NLP) to extract insights on the locations, people, and brands appearing in audio and images in media files. The named entities extraction insight uses transcription and optical character recognition (OCR). |
| Object detection | Azure AI Video Indexer detects objects in videos, such as cars, handbags, backpacks, and laptops. |
| OCR | OCR extracts text from images like pictures, street signs, and products in media files to create insights. |
| Post-production: clapper board detection | Clapper board detection detects clapper boards used during filming and provides the information detected on the board as metadata, for example, production, roll, scene, and take. Clapper board detection is part of the post-production insights that you can select in the web portal advanced settings when you upload and index the file. |
| Post-production: digital patterns | Digital patterns detection detects color bars used during filming. Digital patterns is part of the post-production insights that you can select in the web portal advanced settings when you upload and index the file. |
| Text-based emotion detection | Emotions detection detects emotions in a video's transcript lines. Each sentence is labeled as Anger, Fear, Joy, or Sad, or as None if no emotion is detected. |
| Topics inference | Topics inference creates inferred insights derived from the transcribed audio, OCR content in visual text, and celebrities recognized in the video using the Video Indexer facial recognition model. |
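To tie the table back to the JSON, here's a minimal sketch, under the same assumed layout as the earlier examples, that pulls only the text-based emotion detection instances out of an index result. The emotions key and its type and instances fields reflect the typical shape of this insight but should be treated as assumptions; verify them against your own response.

```python
import json

# Reuse an index result loaded from a file or returned by the Get Video Index
# call shown earlier; "insights.json" is a placeholder file name.
with open("insights.json", encoding="utf-8") as f:
    insights = json.load(f)["videos"][0]["insights"]

# Text-based emotion detection: each item carries an emotion type
# (Anger, Fear, Joy, or Sad) and the time ranges where it was detected.
for emotion in insights.get("emotions", []):
    for instance in emotion.get("instances", []):
        print(f"{emotion['type']}: {instance['start']} -> {instance['end']}")
```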