Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV-264: Added DuplicateHeaderMode for flexibility with header strictness. #114

Merged
merged 2 commits into from
Feb 19, 2022

Conversation

SethFalco
Copy link
Contributor

Instead of only having a boolean true/false for how duplicate header values are handled, this uses an enum instead.

Previously the only possibilities were:

  • true: To allow duplicates.
  • false: To disallow duplicates, except empty cells which were allowed to be duplicates.

This pull request makes an enum with three options:

  • ALLOW_ALL: To always allow duplicates. (Same as true previously.)
  • ALLOW_EMPTY: To allow duplicates only if they're empty cells. (Same as false previously.)
  • DISALLOW: To disallow all duplicates.

This provides a little more flexibility in the strictness for the parser, and makes what the options do clearer.

Jira Issue: https://issues.apache.org/jira/browse/CSV-264

@aherbert
Copy link
Contributor

aherbert commented Oct 2, 2020

Thanks for the PR. You have removed some public API methods. This is not allowed and would fail binary compatibility tests. I have enabled these on the CI environments. If you rebase on master the errors should appear in the checks.

You can run:

mvn

on the project to see the full checks on the code.

The mvn clirr:check goal will fail and indicate the public methods that have been removed. These should be supported in addition to the new methods with the new duplicate header names enum.

@coveralls
Copy link

coveralls commented Oct 2, 2020

Coverage Status

Coverage increased (+0.02%) to 98.327% when pulling 1be64a6 on SethFalco:CSV-264 into 46bae4b on apache:master.

@SethFalco
Copy link
Contributor Author

Sorry about that, I rebased master and applied the changes requested.
I believe this should be good now, though I'll take a peek at the unsuccessful checks now.

I usually use git pull and git merge to obtain changes, I'm a bit new to using rebase however.
I'm not 100% sure if I was supposed to use git push --force-with-lease..., but the usual git push... was throwing warnings, please let me know if that's a problem.

@garydgregory
Copy link
Member

You can force push to your branch all you want; -) we usually do not rewrite history for long lived branches like master and release.

@SethFalco
Copy link
Contributor Author

Not sure if you guys are interested in merging this PR, but figured I'd fix the merge conflicts that appeared from recent commits.

@garydgregory
Copy link
Member

@SethiPandi
Please rebase on master.
TY!
Gary

@SethFalco SethFalco force-pushed the CSV-264 branch 2 times, most recently from 3268d4e to a22ace9 Compare December 13, 2020 15:23
@SethFalco
Copy link
Contributor Author

Just a heads up that this is rebased with master now!

@garydgregory
Copy link
Member

garydgregory commented Jul 5, 2021

@SethFalco
Please accept my apologies for the delay in further reviewing this PR.

I feel there is something important missing here:

  • What happens/should happen for each enum value when you call CSVRecord.get(String).
  • What happens/should happen for each enum value when you call CSVRecord.toMap().

For example, if duplicate headers are allowed, does the first column win? Should it? Does the last column win? Certainly, it should not be random or undocumented.

Also, would you mind rebasing on master please? I've recently deprecated the with methods in favor of a Builder. New settings should be added through the Builder, not with methods.

@SethFalco
Copy link
Contributor Author

SethFalco commented Jul 9, 2021

It's worth noting this change doesn't impact how data is resolved internally or how headers are referenced, but just how if duplicate empty headers are allowed. I can peek into the current behavior anyway and try to document it if it's not already.

Meanwhile, I've rebased with master.
I also squashed my commits to make rebasing a bit easier.

I also updated an exception to no longer recommend using one of the deprecated methods.

Edit: Sorry, forgot to add @ Deprecated to the getter.

@SethFalco SethFalco force-pushed the CSV-264 branch 2 times, most recently from d74d1cc to 1be64a6 Compare July 9, 2021 01:17
@SethFalco
Copy link
Contributor Author

I didn't see it documented, figured the easiest way to find out would be to just test it.
It currently uses the last column with the given name.

What happens/should happen for each enum value when you call CSVRecord.get(String).

A A
1 2
3 4
List<CSVRecord> records = parser.getRecords();

for (CSVRecord record : records)
    System.out.println(record.get("A"));

// 2
// 4

What happens/should happen for each enum value when you call CSVRecord.toMap().

A A B B
1 2 5 6
3 4 7 8
List<CSVRecord> records = parser.getRecords();

for (CSVRecord record : records) {
    Map<String, String> map = record.toMap();
    System.out.println(map);
}

// {A=2, B=6}
// {A=4, B=8}

I think we could add a docstring on CSVRecord#get and CSVRecord#toMap to clarify this.
We could also add something on CSVFormat#setDuplicateHeaderMode?

It might be a good idea to add test cases to cover this scenario, if it'll become documented behavior.

Mind if I do this in a separate PR?
I'd say it counts as a separate change from this one, so it'll keep things tidier.

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SethFalco
Thank you for the update. Please see my comments scattered throughout.
Gary

@SethFalco SethFalco force-pushed the CSV-264 branch 2 times, most recently from b6b801b to adf02b2 Compare July 9, 2021 14:27
@SethFalco
Copy link
Contributor Author

Sorry about the imports, I'm guessing that due to a setting in IntelliJ setting back when I was using that.
I've rebased with master and done the requested changes. (I hope!)

@garydgregory
Copy link
Member

@SethFalco
TY, the PR looks good to me. I'd like feedback from the community as well.

@SethFalco
Copy link
Contributor Author

Just rebased this to resolve merge conflicts.

I'd like feedback from the community as well.

Seems no one wants to comment on this. ^-^'

Any chance this could be merged soon, or still waiting on feedback? (Note, I'm in no rush.)

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like a good fix for the issue described in JIRA. Only adds one more setting, giving more flexibility for users.

+1 and thanks for the contribution and patience @SethFalco !

-Bruno

@garydgregory garydgregory merged commit 4fdfc59 into apache:master Feb 19, 2022
asfgit pushed a commit that referenced this pull request Feb 19, 2022
@SethFalco SethFalco deleted the CSV-264 branch February 26, 2023 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants