Error when writing non-ascii allowed characters to Stata dta #7286

Closed
bquistorff opened this issue May 30, 2014 · 4 comments
Labels
Bug, IO Stata (read_stata, to_stata), Unicode (Unicode strings)

@bquistorff
Contributor

When trying to write a Stata dataset with strings containing upper latin-1 characters (which are allowed by the Stata format), I get an encoding error.

import pandas.io.stata as sta
sr = sta.StataReader('pandas/pandas/io/tests/data/stata1_encoding.dta')
df = sr.data()
sw = sta.StataWriter('stata1_encoding_dup.dta', df)
sw.write_file()

I get the following output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\pandas\io\stata.py", line 1242, in write_file
    self._write_data_nodates()
  File "C:\Python27\lib\site-packages\pandas\io\stata.py", line 1326, in _write_data_nodates
    self._write(var)
  File "C:\Python27\lib\site-packages\pandas\io\stata.py", line 1104, in _write
    self._file.write(to_write)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)
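
For context, this is the error raised when a text string containing a non-ASCII character is encoded as ASCII. Below is a minimal sketch that reproduces the same failure without pandas. The sample string `s` and the helper `try_encode` are assumptions made for illustration; `s` is chosen so that the offending character sits at position 1, as in the traceback.

```python
# Hypothetical sample value; the real data comes from stata1_encoding.dta.
s = u"k\xfcche"  # u'\xfc' is the character 'ü', located at index 1


def try_encode(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


ascii_result = try_encode(s, "ascii")      # fails: 0xFC is outside ASCII
latin1_result = try_encode(s, "latin-1")   # succeeds: one byte per character

print(ascii_result)
print(latin1_result)
```

The ASCII attempt fails on the same character, while the latin-1 attempt succeeds. This is consistent with the writer falling back to a default ASCII encoding instead of the intended one.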

Machine info:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: AMD64 Family 16 Model 6 Stepping 3, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.0
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.3
bottleneck: 0.8.0
tables: None
numexpr: None
matplotlib: None
openpyxl: 1.8.6
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented May 30, 2014

cc @bashtage

Hmm, it seems the data is not being encoded to the passed encoding, which is odd (and this is not tested).

the reading however works

FYI, you can/should use:

df = pd.read_stata(.......,encoding='latin-1')
df.to_stata(......,encoding='latin-1')
(though this last one seems to be a bug)

@jreback jreback added this to the 0.14.1 milestone May 30, 2014
@jreback jreback added Stata and removed Data IO labels Jun 5, 2014
@bashtage
Contributor

A simple fix - the encoding was never used when writing a file.
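
As a rough illustration of what using the encoding on write means, the sketch below encodes each string explicitly before writing it to a binary file, then pads it to a fixed width with null bytes. The helper `write_fixed_str`, the width of 8, and the sample string are all hypothetical. This is not the actual patch and is not taken from pandas' `StataWriter`.

```python
import io


def write_fixed_str(fileobj, text, width, encoding="latin-1"):
    """Encode text explicitly and write it as a null-padded fixed-width field."""
    raw = text.encode(encoding)  # explicit encode, no implicit ASCII fallback
    if len(raw) > width:
        raise ValueError("encoded string is longer than the field width")
    fileobj.write(raw.ljust(width, b"\x00"))


buf = io.BytesIO()
write_fixed_str(buf, u"k\xfcche", width=8)  # hypothetical sample value
data = buf.getvalue()
print(repr(data))
```

Encoding before the write means the file object only ever receives bytes, so no implicit conversion can take place.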

@jreback
Contributor

jreback commented Jun 13, 2014

I assume that Stata itself can read the encoded file?

IIRC they always encode in cp1252 or something like that.

@bashtage
Contributor

The original file opens fine in Stata, so latin-1 seems to be OK. Unicode isn't supported, though.
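
On the cp1252 versus latin-1 question: in Python's codecs the two encodings agree for characters in the range 0xA0 to 0xFF (such as 'ü') and differ only in 0x80 to 0x9F, where cp1252 defines printable characters such as '€'. The sketch below checks this using only built-in codecs. The sample characters are illustrative and are not taken from the test file.

```python
u_umlaut = u"\xfc"   # 'ü', representable in both latin-1 and cp1252
euro = u"\u20ac"     # '€', representable in cp1252 but not in latin-1

# Both codecs map 'ü' to the same single byte.
same_byte = u_umlaut.encode("latin-1") == u_umlaut.encode("cp1252")

# cp1252 has a slot for the euro sign in the 0x80-0x9F range.
euro_cp1252 = euro.encode("cp1252")

# latin-1 cannot represent the euro sign at all.
try:
    euro.encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(same_byte, repr(euro_cp1252), euro_fits_latin1)
```

For upper latin-1 characters like the one in this report, either encoding yields the same bytes. Characters outside the chosen single-byte encoding cannot be written at all.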

