BUG: parser_source as a filename with multibyte characters in Windows(non utf-8 filesystem) #6807
Thanks. Can you write a test that fails without this change and passes with it?
@hshimizu77 tests?
I have added a test, but I'm not sure whether it's in the right place.
@@ -535,7 +535,10 @@ cdef class TextReader:

        if isinstance(source, basestring):
            if not isinstance(source, bytes):
                source = source.encode('utf-8')
                if sys.getfilesystemencoding() is None:
you can just do:
source = source.encode(sys.getfilesystemencoding() or 'utf-8')
that's the correct place
@TomAugspurger can you test this on OSX? @hshimizu77 ok...let me test on Windows; I think we may put this in w/o the test (but don't change the test ATM), as I don't think pandas will install on some systems because of the encoding of the filename (in theory it should always work, but I don't want to bork older systems).
The test errors
Seems to work from the interpreter though:
In [1]: from test_parsers import *
In [2]: path = tm.get_data_path(u'日本語ファイル名テスト_read_csv_in_win_filesystem.csv')
In [3]: pd.read_csv(path)
Out[3]:
     A    B    C
0  100  200  300
1  aaa  bbb  ccc

[2 rows x 3 columns]
I think you want |
for this
use |
merged via 79df67a thanks!
fopen() on Windows doesn't accept a UTF-8 encoded filename containing multibyte characters, so the filename needs to be converted to the filesystem encoding.
Use 'utf-8' as the default in case sys.getfilesystemencoding() returns None.
sys.getfilesystemencoding() returns 'mbcs' on Windows, and 'utf-8' or the user's setting on other systems.
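
For illustration, the fallback described above could be sketched as a standalone helper. This is not the code in TextReader itself: the function name `encode_path_for_fopen` and its `default` parameter are hypothetical, and only the `sys.getfilesystemencoding() or 'utf-8'` expression comes from the review suggestion.

```python
import sys


def encode_path_for_fopen(source, default="utf-8"):
    """Return `source` as bytes suitable for a C-level fopen() call.

    Hypothetical helper, not part of pandas. Text paths are encoded with
    the filesystem encoding; `default` is used only when
    sys.getfilesystemencoding() returns None.
    """
    if isinstance(source, bytes):
        # Already encoded by the caller: pass through unchanged.
        return source
    encoding = sys.getfilesystemencoding() or default
    return source.encode(encoding)


# ASCII-only names encode to the same bytes under any ASCII-compatible
# filesystem encoding, so this call is safe to run on any platform.
encoded = encode_path_for_fopen(u"data.csv")
```

A name with multibyte characters would go through the same call; whether it encodes successfully depends on the filesystem encoding of the machine, which is why the sketch does not hard-code an expected result for that case.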