Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started

Reply
DebbieE
Community Champion
Community Champion

Fabric Pipeline. Data flow. Removing items from a Filter with GetMetaData and Lookup

OK so I have a Bronze Datalake and a Silver Data Lake

In Silver I have a Parquet File of processed file names e.g.

Proja.csv

Projb.csv

Projc.csv

Projd.csv

 

And in the dataflow I have a Get MetaData activity conntected to the childitems in my Bronze datalake. So its finding files

lookup.csv

Proja.csv

Projb.csv

Projc.csv

Projd.csv

Proje.csv (Which is the new file)

 

I then have a filter to remove the Lookup.csv file 

@and(equals(item().type,'File'),startswith(item().name,'Proj'))
 
And now I want to get a list of everything in GetMetadata that doesnt exist in the lookup
Which would leave me with Proje.csv 
The hope is that I can use this to run a notebook so it only uses these files (Not sure how to do that yet but Im concentrating on the first bit)
 
I thought I could add another lookup. Connect it to the Filter (Source Files) and the Lookup (Processed Files) But Im really stuck at this point.
 
Should I be using a Lookup and what code should I use to
 
Get All Items from Filter where Not in Lookup?

 

1 ACCEPTED SOLUTION
frithjof_v
Resident Rockstar
Resident Rockstar

Perhaps you can do similar like below:

 

I have a Bronze Lakehouse and a Silver Lakehouse.

 

The files in my Bronze Lakehouse are as follows:

frithjof_v_0-1721915655333.png

 

 

The files in my Silver Lakehouse are as follows:

frithjof_v_1-1721915907699.png

 

I made a pipeline like this:

frithjof_v_2-1721915943800.png

 

 

The Get Metadata activities get the Child items metadata from the File folder in Bronze lakehouse and Silver lakehouse, respectively.

frithjof_v_3-1721916050920.png

 

The Filter activity removes the lookup.csv file from the output of the metadata activity from Bronze lakehouse:

frithjof_v_4-1721916166056.png

Items: @activity('Get Metadata Bronze').output.childItems

Condition: @not(equals(item().name, 'lookup.csv'))

 

 

The Items in the ForEach activity is the output from the Filter activity:

frithjof_v_5-1721916215603.png

Items: @activity('Filter Away Lookup file').output.Value

 

 

The If Condition inside the ForEach activity:

frithjof_v_6-1721916273266.png

Expression: @contains(activity('Get Metadata Silver').output.childItems, item())

 

 

The Copy activity if the If Condition is False:

frithjof_v_7-1721916333614.png

frithjof_v_8-1721916358371.png

After I run the pipeline, the Proje.csv file has been copied to Silver:

frithjof_v_9-1721916448145.png

 

 

 

I don't know if Fabric Data Pipeline has any limits (like output size, number of items in collection, number of items in foreach activity, etc.) which needs to be taken into consideration or it can result in pipeline failure or unexpected results if the number of files in any of the folders grow above the limits.

View solution in original post

7 REPLIES 7
frithjof_v
Resident Rockstar
Resident Rockstar

Perhaps you can do similar like below:

 

I have a Bronze Lakehouse and a Silver Lakehouse.

 

The files in my Bronze Lakehouse are as follows:

frithjof_v_0-1721915655333.png

 

 

The files in my Silver Lakehouse are as follows:

frithjof_v_1-1721915907699.png

 

I made a pipeline like this:

frithjof_v_2-1721915943800.png

 

 

The Get Metadata activities get the Child items metadata from the File folder in Bronze lakehouse and Silver lakehouse, respectively.

frithjof_v_3-1721916050920.png

 

The Filter activity removes the lookup.csv file from the output of the metadata activity from Bronze lakehouse:

frithjof_v_4-1721916166056.png

Items: @activity('Get Metadata Bronze').output.childItems

Condition: @not(equals(item().name, 'lookup.csv'))

 

 

The Items in the ForEach activity is the output from the Filter activity:

frithjof_v_5-1721916215603.png

Items: @activity('Filter Away Lookup file').output.Value

 

 

The If Condition inside the ForEach activity:

frithjof_v_6-1721916273266.png

Expression: @contains(activity('Get Metadata Silver').output.childItems, item())

 

 

The Copy activity if the If Condition is False:

frithjof_v_7-1721916333614.png

frithjof_v_8-1721916358371.png

After I run the pipeline, the Proje.csv file has been copied to Silver:

frithjof_v_9-1721916448145.png

 

 

 

I don't know if Fabric Data Pipeline has any limits (like output size, number of items in collection, number of items in foreach activity, etc.) which needs to be taken into consideration or it can result in pipeline failure or unexpected results if the number of files in any of the folders grow above the limits.

If there is a more efficient way to compare the two collections of child items from Get Metadata Silver and Get Metadata Bronze and return the items which only exist in the Get Metadata Bronze, then I would like to know.

 

(I am thinking if there exists some kind of anti join functionality, or similar?
Perhaps some way to do one array minus another array, which keeps only the items which are only in the first array?)

 

 

In my solution, I am using the ForEach activity with an IF condition inside to achieve a similar effect.

If you want to use the lookup.csv file to lookup which files don't need to be processed again (instead of using the file names in the Silver lakehouse directory for this purpose):

 

frithjof_v_1-1721920576261.png

 

In my case, the lookup.csv file has the following content:

frithjof_v_2-1721920667584.png

 

The 'ForEach LookupFileRow' activity:

Items: @activity('Get Lookup File Content').output.value

 

The 'Append varLookupFileNames' activity inside the 'ForEach LookupFileRow' activity:

frithjof_v_3-1721920804950.png

 

The 'IF Condition' inside the 'ForEach' activity:

Expression: @contains(variables('varLookupFileNames'), item().name)

 

 

Otherwise similar like the previous example pipeline.

 

 

I don't know if Fabric Data Pipeline has any size limits (like output size, number of items in collection, number of items in foreach activity, result size in lookup activity, etc.) which needs to be taken into consideration or it can result in pipeline failure or unexpected results if the number of files in any of the folders grow above the limits.

For example, the Lookup activity has some limitations:

Lookup activity - Microsoft Fabric | Microsoft Learn

v-shex-msft
Community Support
Community Support

HI @DebbieE,

I think you need a template list or query result that used to compare with current items, or you can't define which not exist and use to filter.

Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accept as solution to help other members find it more quickly.

I would need some specific information to work with here for how I would go about that. This is all in a fabric pipeline

Hi @DebbieE,

Here is the document link about use dataflow in data pipeline, you can use M query editor to operation with query table records:

Use a dataflow in a pipeline - Microsoft Fabric | Microsoft Learn

Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accept as solution to help other members find it more quickly.

Helpful resources

Announcements
Sept Fabric Carousel

Fabric Monthly Update - September 2024

Check out the September 2024 Fabric update to learn about new features.

September Hackathon Carousel

Microsoft Fabric & AI Learning Hackathon

Learn from experts, get hands-on experience, and win awesome prizes.

Sept NL Carousel

Fabric Community Update - September 2024

Find out what's new and trending in the Fabric Community.