Solved: Fabric Pipeline. Data flow. Removing items from a Filter with GetMetaData and Lookup

DebbieE · ‎07-05-2024

OK, so I have a Bronze data lake and a Silver data lake.

In Silver I have a Parquet file of processed file names, e.g.

Proja.csv

Projb.csv

Projc.csv

Projd.csv

And in the dataflow I have a Get Metadata activity connected to the child items in my Bronze data lake. So it's finding these files:

lookup.csv

Proja.csv

Projb.csv

Projc.csv

Projd.csv

Proje.csv (which is the new file)

I then have a filter to remove the lookup.csv file:

@and(equals(item().type,'File'),startswith(item().name,'Proj'))

And now I want to get a list of everything in Get Metadata that doesn't exist in the lookup,

which would leave me with Proje.csv.

The hope is that I can use this to run a notebook so it only uses these files. (Not sure how to do that yet, but I'm concentrating on the first bit.)

I thought I could add another lookup and connect it to the Filter (source files) and the Lookup (processed files), but I'm really stuck at this point.

Should I be using a Lookup, and what code should I use to

get all items from the Filter that are not in the Lookup?
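
The comparison asked for here is a set difference: the filtered Bronze file names minus the processed names from Silver. Below is a minimal Python sketch of that logic as it might look inside a notebook. The two lists are hard-coded from the example in this post, and the function name `unprocessed_files` is hypothetical. Reading the real Parquet file and the folder listing is left out.

```python
def unprocessed_files(bronze_names, processed_names, prefix="Proj"):
    """Return Bronze file names that start with `prefix` and are not yet processed."""
    processed = set(processed_names)  # a set gives fast membership checks
    return [
        name
        for name in bronze_names
        if name.startswith(prefix) and name not in processed
    ]


# Example values taken from the post (hard-coded for illustration only).
bronze = ["lookup.csv", "Proja.csv", "Projb.csv", "Projc.csv", "Projd.csv", "Proje.csv"]
processed = ["Proja.csv", "Projb.csv", "Projc.csv", "Projd.csv"]

new_files = unprocessed_files(bronze, processed)
print(new_files)  # ['Proje.csv']
```

The prefix check plays the same role as the `startswith(item().name,'Proj')` filter, so lookup.csv is dropped before the comparison.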

frithjof_v · ‎07-25-2024

Perhaps you can do something similar to the below:

I have a Bronze Lakehouse and a Silver Lakehouse.

The files in my Bronze Lakehouse are as follows:

The files in my Silver Lakehouse are as follows:

I made a pipeline like this:

The Get Metadata activities get the Child items metadata from the File folder in Bronze lakehouse and Silver lakehouse, respectively.

The Filter activity removes the lookup.csv file from the output of the metadata activity from Bronze lakehouse:

Items: @activity('Get Metadata Bronze').output.childItems

Condition: @not(equals(item().name, 'lookup.csv'))

The Items property of the ForEach activity is the output from the Filter activity:

Items: @activity('Filter Away Lookup file').output.Value

The If Condition inside the ForEach activity:

Expression: @contains(activity('Get Metadata Silver').output.childItems, item())

The Copy activity runs if the If Condition is False:

After I run the pipeline, the Proje.csv file has been copied to Silver:

I don't know whether Fabric Data Pipelines have limits (output size, number of items in a collection, number of items in a ForEach activity, etc.) that need to be taken into account. If the number of files in either folder grows beyond such limits, the pipeline might fail or give unexpected results.
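
To make the flow of the pipeline above easier to follow, here is a rough Python emulation of the Filter, ForEach and If Condition steps. It is only a sketch: the two lists are hypothetical stand-ins for the `childItems` output of the Get Metadata activities, and the variable names are made up for illustration.

```python
# Hypothetical stand-ins for the two Get Metadata outputs (childItems arrays).
bronze_child_items = [
    {"name": "lookup.csv", "type": "File"},
    {"name": "Proja.csv", "type": "File"},
    {"name": "Projb.csv", "type": "File"},
    {"name": "Projc.csv", "type": "File"},
    {"name": "Projd.csv", "type": "File"},
    {"name": "Proje.csv", "type": "File"},
]
silver_child_items = [
    {"name": "Proja.csv", "type": "File"},
    {"name": "Projb.csv", "type": "File"},
    {"name": "Projc.csv", "type": "File"},
    {"name": "Projd.csv", "type": "File"},
]

# Filter activity: @not(equals(item().name, 'lookup.csv'))
filtered = [item for item in bronze_child_items if item["name"] != "lookup.csv"]

# ForEach + If Condition: @contains(<Silver childItems>, item())
# In this emulation the whole {name, type} object is compared, not just the name.
to_copy = []
for item in filtered:
    already_in_silver = item in silver_child_items
    if not already_in_silver:  # the False branch is where the Copy activity runs
        to_copy.append(item["name"])

print(to_copy)  # ['Proje.csv']
```

Because the comparison here is on the whole object, an item with the same name but a different type would count as missing from Silver.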

frithjof_v · ‎07-25-2024

If there is a more efficient way to compare the two collections of child items from Get Metadata Silver and Get Metadata Bronze, and return the items that exist only in Get Metadata Bronze, I would like to know.

(I am wondering whether there is some kind of anti-join functionality or similar. Perhaps some way to subtract one array from another, keeping only the items that are in the first array and not in the second?)

In my solution, I am using the ForEach activity with an If Condition inside to achieve a similar effect.

frithjof_v · ‎07-25-2024

If you want to use the lookup.csv file to look up which files don't need to be processed again (instead of using the file names in the Silver lakehouse directory for this purpose):

In my case, the lookup.csv file has the following content:

The 'ForEach LookupFileRow' activity:

Items: @activity('Get Lookup File Content').output.value

The 'Append varLookupFileNames' activity inside the 'ForEach LookupFileRow' activity:

The 'IF Condition' inside the 'ForEach' activity:

Expression: @contains(variables('varLookupFileNames'), item().name)

Otherwise it is similar to the previous example pipeline.

I don't know whether Fabric Data Pipelines have size limits (output size, number of items in a collection, number of items in a ForEach activity, result size in a Lookup activity, etc.) that need to be taken into account. If the number of files in either folder grows beyond such limits, the pipeline might fail or give unexpected results.
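
The lookup-file variant can be sketched in Python as well. The content of lookup.csv is not shown in this thread, so the single column name `FileName` below is an assumption, as are the sample rows and the variable names.

```python
import csv
import io

# Hypothetical lookup.csv content; the column name "FileName" is an assumption.
lookup_csv = "FileName\nProja.csv\nProjb.csv\nProjc.csv\nProjd.csv\n"

# Lookup activity: each row becomes an object such as {"FileName": "Proja.csv"}.
lookup_rows = list(csv.DictReader(io.StringIO(lookup_csv)))

# ForEach + Append Variable: build an array of plain file names.
var_lookup_file_names = []
for row in lookup_rows:
    var_lookup_file_names.append(row["FileName"])

# Bronze file names after lookup.csv has been filtered away.
bronze_names = ["Proja.csv", "Projb.csv", "Projc.csv", "Projd.csv", "Proje.csv"]

# If Condition: @contains(variables('varLookupFileNames'), item().name)
# Names found in the lookup are skipped; the rest still need processing.
to_process = [name for name in bronze_names if name not in var_lookup_file_names]

print(to_process)  # ['Proje.csv']
```

Here the comparison is on the file name only, which is why the lookup rows are first flattened into an array of strings.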

frithjof_v · ‎07-25-2024

For example, the Lookup activity has some limitations:

Lookup activity - Microsoft Fabric | Microsoft Learn

v-shex-msft · ‎07-07-2024

Hi @DebbieE,

I think you need a reference list or query result to compare the current items against; otherwise you can't determine which items don't exist and use that to filter.

Regards,

Xiaoxin Sheng

DebbieE · ‎07-10-2024

I would need some specific information to work with here on how I would go about that. This is all in a Fabric pipeline.

v-shex-msft · ‎07-24-2024

Hi @DebbieE,

Here is the documentation link about using a dataflow in a data pipeline. You can use the M query editor to work with the query table records:

Use a dataflow in a pipeline - Microsoft Fabric | Microsoft Learn

Regards,

Xiaoxin Sheng