
Commit

Merge pull request #12331 from MicrosoftDocs/learn-build-service-prodbot/docutune-autopr-20240715-050657-5972465-ignore-build

[DocuTune-Remediation] - Scheduled execution to fix known issues in Azure Architecture Center articles (part 2)
v-dirichards authored Jul 15, 2024
2 parents 247674b + 15feb14 commit 7767b45
Showing 7 changed files with 70 additions and 73 deletions.
@@ -1,5 +1,5 @@

-This article describes an architecture for processing files that contain multiple documents of various types. It uses the Durable Functions extension of Azure Functions to implement the pipelines that process the files.
+This article describes an architecture for processing documents of various types. It uses the Durable Functions extension of Azure Functions to implement the pipelines that process the documents.

## Architecture

@@ -42,11 +42,11 @@ This article describes an architecture for processing files that contain multipl

### Scenario details

-This article describes an architecture that uses Durable Functions to implement automated pipelines for processing files that contain multiple documents of various types. The pipelines identify the documents in a document file, classify them by type, and store information that can be used in subsequent processing.
+The pipelines identify the documents in a document file, classify them by type, and store information that can be used in subsequent processing.
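As a rough illustration of the pipeline steps named in the new text (identify the documents, classify them by type, store information for later processing), here is a minimal stdlib-only Python sketch. The document types and the keyword-based classifier are invented for the example; the actual architecture runs such steps as Durable Functions activities:

```python
# Sketch of the pipeline: identify documents in a file, classify each
# by type, and record information for subsequent processing.
# The classification rule below is a placeholder, not a real classifier.

def identify_documents(file_pages: list) -> list:
    """Split a multi-document file into per-document page groups
    (naively here: one document per page)."""
    return [[page] for page in file_pages]

def classify(document: list) -> str:
    """Placeholder classifier that inspects the first page's text."""
    first = document[0].lower()
    if "invoice" in first:
        return "invoice"
    if "contract" in first:
        return "contract"
    return "unknown"

def process_file(file_pages: list) -> list:
    """Run the pipeline and return records for downstream processing."""
    records = []
    for doc in identify_documents(file_pages):
        records.append({"type": classify(doc), "pages": len(doc)})
    return records

print(process_file(["INVOICE #42 ...", "CONTRACT terms ..."]))
```

In the real architecture each step would be an activity function invoked by a durable orchestrator rather than a plain function call.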

-Many companies need to manage and process documents that have been scanned in bulk and that can contain several different document types. Typically, the files are PDFs or multi-page TIFF images. These files usually originate from outside the organization, and the receiving company doesn't control the content.
+Many companies need to manage and process documents that have been scanned in bulk and that can contain several different document types. Typically, the documents are PDFs or multi-page TIFF images. These documents might originate from outside the organization, and the receiving company doesn't control the format.

-Given these constraints, organizations have been forced to build their own document parsing solutions that can include custom technology and manual processes. A solution can include human intervention for splitting out individual document types into their own files and adding classifications qualifiers for each document.
+Given these constraints, organizations have been forced to build their own document parsing solutions that can include custom technology and manual processes. A solution can include human intervention for splitting out individual document types and adding classification qualifiers for each document.

Many of these custom solutions are based on the state machine workflow pattern and use database systems for persisting workflow state, with polling services that check for the states that they're responsible for processing. Maintaining and enhancing such solutions can be difficult and time consuming.
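The polling-based state machine pattern described above can be sketched as follows. The state names are hypothetical, and a real solution would persist workflow state in a database rather than an in-memory dict; each "polling service" scans for items in the one state it owns and advances them:

```python
# Minimal sketch of the state machine workflow pattern with polling
# services (hypothetical states; a database would hold `store` in practice).

WORKFLOW = ["received", "split", "classified", "stored", "done"]

def advance(state: str) -> str:
    """Return the state that follows `state` in the workflow."""
    return WORKFLOW[WORKFLOW.index(state) + 1]

def poll_once(store: dict, owned_state: str) -> int:
    """One polling pass: move every item in `owned_state` forward."""
    moved = 0
    for item_id, state in list(store.items()):
        if state == owned_state:
            store[item_id] = advance(state)
            moved += 1
    return moved

# Simulated workflow store with two in-flight documents.
store = {"doc-1": "received", "doc-2": "split"}
poll_once(store, "received")   # the "received" poller runs
poll_once(store, "split")      # the "split" poller runs
print(store)
```

The maintenance cost the article mentions comes from exactly this shape: every new state needs its own poller, schedule, and failure handling, which Durable Functions replaces with a single orchestrator.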

@@ -84,7 +84,7 @@ For reliability information about solution components, see the following resour

Cost optimization is about reducing unnecessary expenses and improving operational efficiencies. For more information, see [Overview of the cost optimization pillar](/azure/architecture/framework/cost/overview).

-The most significant costs for this architecture will potentially come from the storage of image files in the storage account, Azure AI services image processing, and index capacity requirements in the Azure AI Search service.
+The most significant costs for this architecture will potentially come from the storage of images in the storage account, Azure AI services image processing, and index capacity requirements in the Azure AI Search service.

Costs can be optimized by [right sizing](/azure/architecture/framework/services/storage/storage-accounts/cost-optimization) the storage account by using reserved capacity and lifecycle policies, proper [Azure AI Search planning](/azure/search/search-sku-manage-costs) for regional deployments and operational scale up scheduling, and using [commitment tier pricing](/azure/cognitive-services/commitment-tier) that's available for the Computer Vision – OCR service to manage [predictable costs](/azure/cognitive-services/plan-manage-costs).

12 changes: 6 additions & 6 deletions docs/ai-ml/architecture/search-blob-metadata-content.md
@@ -1,8 +1,8 @@
-This article demonstrates how to create a search service that enables users to search for documents based on document content in addition to any metadata that's associated with the files.
+This article demonstrates how to create a search service that enables users to search for documents based on document content in addition to any metadata that's associated with the document.

You can implement this service by using [multiple indexers](/azure/search/search-indexer-overview#indexer-scenarios-and-use-cases) in [Azure AI Search](/azure/search/search-what-is-azure-search).

-This article uses an example workload to demonstrate how to create a single [search index](/azure/search/search-what-is-an-index) that's based on files in [Azure Blob Storage](/azure/storage/blobs/storage-blobs-overview). The file metadata is stored in [Azure Table Storage](/azure/storage/tables/table-storage-overview).
+This article uses an example workload to demonstrate how to create a single [search index](/azure/search/search-what-is-an-index) that's based on documents in [Azure Blob Storage](/azure/storage/blobs/storage-blobs-overview). The file metadata is stored in [Azure Table Storage](/azure/storage/tables/table-storage-overview).

## Architecture

@@ -12,15 +12,15 @@ This article uses an example workload to demonstrate how to create a single [sea

### Dataflow

-1. Files are stored in Blob Storage, possibly together with a limited amount of metadata (for example, the document's author).
+1. Documents are stored in Blob Storage, possibly together with a limited amount of metadata (for example, the document's author).
2. Additional metadata is stored in Table Storage, which can store significantly more information for each document.
3. An indexer reads the contents of each file, together with any blob metadata, and stores the data in the search index.
4. Another indexer reads the additional metadata from the table and stores it in the same search index.
5. A search query is sent to the search service. The query returns matching documents, based on both document content and document metadata.
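The effect of steps 3 and 4 in the dataflow, two indexers writing into the same search index, can be mimicked with a simple keyed merge. The field names below are illustrative, not the actual index schema:

```python
# Sketch of two indexers targeting one search index: entries are keyed
# by document id, and each indexer merges its own fields into the entry.

def run_indexer(index: dict, records: list) -> None:
    """Merge each record's fields into the index entry with the same key."""
    for record in records:
        key = record["id"]
        index.setdefault(key, {}).update(record)

search_index = {}
# Indexer 1: blob content plus the limited blob metadata (e.g. author).
run_indexer(search_index, [
    {"id": "doc1", "content": "Annual report...", "author": "Ana"},
])
# Indexer 2: richer metadata from Table Storage for the same document.
run_indexer(search_index, [
    {"id": "doc1", "doc_type": "report", "business_impact": "high"},
])
print(search_index["doc1"])
```

Because both indexers use the same document key, each entry ends up holding content and metadata together, which is what lets one query match on either.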

### Components

-- [Blob Storage](https://azure.microsoft.com/products/storage/blobs/) provides cost-effective cloud storage for file data, including data in formats like PDF, HTML, and CSV, and in Microsoft 365 files.
+- [Blob Storage](https://azure.microsoft.com/products/storage/blobs/) provides cost-effective cloud storage for file data, including data in formats like PDF, HTML, and CSV, and in Microsoft 365 documents.
- [Table Storage](https://azure.microsoft.com/products/storage/tables/) provides storage for nonrelational structured data. In this scenario, it's used to store the metadata for each document.
- [Azure AI Search](https://azure.microsoft.com/products/search/) is a fully managed search service that provides infrastructure, APIs, and tools for building a rich search experience.

@@ -40,11 +40,11 @@ This solution enables users to search for documents based on both file content a

[Azure AI Search](/azure/search/search-what-is-azure-search) is a fully managed search service that can create [search indexes](/azure/search/search-what-is-an-index) that contain the information you want to allow users to search for.

-Because the files that are searched in this scenario are binary documents, you can store them in [Blob Storage](/azure/storage/blobs/storage-blobs-overview). If you do, you can use the built-in [Blob Storage indexer](/azure/search/search-howto-indexing-azure-blob-storage) in Azure AI Search to automatically extract text from the files and add their content to the search index.
+Because the files that are searched in this scenario are binary documents, you can store them in [Blob Storage](/azure/storage/blobs/storage-blobs-overview). If you do, you can use the built-in [Blob Storage indexer](/azure/search/search-howto-indexing-azure-blob-storage) in Azure AI Search to automatically extract text from the documents and add their content to the search index.

### Searching file metadata

-If you want to include additional information about the files, you can directly associate [metadata](/azure/storage/blobs/storage-blob-properties-metadata) with the blobs, without using a separate store. The built-in [Blob Storage search indexer can even read this metadata](/azure/search/search-howto-indexing-azure-blob-storage#indexing-blob-metadata) and place it in the search index. This enables users to search for metadata along with the file content. However, the [amount of metadata is limited to 8 KB per blob](/rest/api/storageservices/Setting-and-Retrieving-Properties-and-Metadata-for-Blob-Resources#Subheading1), so the amount of information that you can place on each blob is fairly small. You might choose to store only the most critical information directly on the blobs. In this scenario, only the document's *author* is stored on the blob.
+If you want to include additional information about the document, you can directly associate [metadata](/azure/storage/blobs/storage-blob-properties-metadata) with the blobs, without using a separate store. The built-in [Blob Storage search indexer can even read this metadata](/azure/search/search-howto-indexing-azure-blob-storage#indexing-blob-metadata) and place it in the search index. This enables users to search for metadata along with the file content. However, the [amount of metadata is limited to 8 KB per blob](/rest/api/storageservices/Setting-and-Retrieving-Properties-and-Metadata-for-Blob-Resources#Subheading1), so the amount of information that you can place on each blob is fairly small. You might choose to store only the most critical information directly on the blobs. In this scenario, only the document's *author* is stored on the blob.
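Because blob metadata is capped at 8 KB per blob (names and values counted together, per the linked REST reference), a small guard like the following can help decide which fields fit on the blob and which belong in Table Storage. This helper is hypothetical and approximates the accounting by UTF-8 byte length:

```python
# Sketch: check proposed blob metadata against the 8 KB total limit.
# The size is approximated as UTF-8 bytes of all names and values combined.

METADATA_LIMIT_BYTES = 8 * 1024

def metadata_size(metadata: dict) -> int:
    """Approximate total size of metadata names and values in bytes."""
    return sum(len(k.encode("utf-8")) + len(v.encode("utf-8"))
               for k, v in metadata.items())

def fits_in_blob_metadata(metadata: dict) -> bool:
    """True if the metadata is small enough to store directly on the blob."""
    return metadata_size(metadata) <= METADATA_LIMIT_BYTES

print(fits_in_blob_metadata({"author": "Ana"}))      # small field: fits
print(fits_in_blob_metadata({"notes": "x" * 9000}))  # exceeds the cap
```

Fields that fail this check are candidates for the Table Storage side of the solution described next.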

To overcome this storage limitation, you can place additional metadata in another [data source that has a supported indexer](/azure/search/search-indexer-overview#supported-data-sources), like [Table Storage](/azure/storage/tables/table-storage-overview). You can add the document type, business impact, and other metadata values as separate columns in the table. If you configure the built-in [Table Storage indexer](/azure/search/search-howto-indexing-azure-tables) to target the same search index as the blob indexer, the blob and table storage metadata is combined for each document in the search index.

