read_csv with iterator=True and engine = "python" -> error in if self.usecols #12546

Closed
DSLituiev opened this issue Mar 6, 2016 · 3 comments
Labels
Bug IO CSV read_csv, to_csv

@DSLituiev

Code Sample, a copy-pastable example if possible

The following example raises an error when run with engine = "python":

ite = pd.read_csv(infile, sep='\t', index_col=False,
                  # dtype=pd.np.float32, na_filter=False, low_memory=False,
                  usecols=pd.np.arange(nindexcolumns, ncol),
                  iterator=True, chunksize=chunksize, engine="python")
print(list(ite)[0])

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-e39439e21b50> in <module>()
     17                     iterator = True,   chunksize = chunksize  , engine = "python" )
     18 
---> 19 print(list(ite)[0])

/home/.usr/py3/lib/pandas-0.18.0rc1+80.g820e110.dirty-py3.3-linux-x86_64.egg/pandas/io/parsers.py in __next__(self)
    739 
    740     def __next__(self):
--> 741         return self.get_chunk()
    742 
    743     def _make_engine(self, engine='c'):

/home/.usr/py3/lib/pandas-0.18.0rc1+80.g820e110.dirty-py3.3-linux-x86_64.egg/pandas/io/parsers.py in get_chunk(self, size)
    780         if size is None:
    781             size = self.chunksize
--> 782         return self.read(nrows=size)
    783 
    784 

/home/.usr/py3/lib/pandas-0.18.0rc1+80.g820e110.dirty-py3.3-linux-x86_64.egg/pandas/io/parsers.py in read(self, nrows)
    759                 raise ValueError('skip_footer not supported for iteration')
    760 
--> 761         ret = self._engine.read(nrows)
    762 
    763         if self.options.get('as_recarray'):

/home/.usr/py3/lib/pandas-0.18.0rc1+80.g820e110.dirty-py3.3-linux-x86_64.egg/pandas/io/parsers.py in read(self, rows)
   1617             content = content[1:]
   1618 
-> 1619         alldata = self._rows_to_cols(content)
   1620         data = self._exclude_implicit_index(alldata)
   1621 

/home/.usr/py3/lib/pandas-0.18.0rc1+80.g820e110.dirty-py3.3-linux-x86_64.egg/pandas/io/parsers.py in _rows_to_cols(self, content)
   1997             raise ValueError(msg)
   1998 
-> 1999         if self.usecols:
   2000             if self._implicit_index:
   2001                 zipped_content = [

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.3.2.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-573.12.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0rc1+80.g820e110.dirty
nose: 1.3.0
pip: 8.0.2
setuptools: 0.9.8
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.12.1
statsmodels: None
xarray: None
IPython: 4.1.0-dev
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.7.dev0
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None

@jreback
Contributor

jreback commented Mar 6, 2016

Can you show this using supplied data and StringIO, i.e. a copy-pastable example?

@jreback jreback added Bug IO CSV read_csv, to_csv and removed Bug labels Mar 6, 2016
@ivannz
Contributor

ivannz commented Jul 30, 2016

The cause of the problem is, as far as I know, that usecols is expected to be a Python list, and the semantics of if self.usecols: in that case is to check for emptiness. In the supplied scant example the user passes a numpy.array instead, which has different (ambiguous) semantics when used as the condition of an if statement, and it definitely does not check for emptiness. So, passing anything but a list is incorrect usage according to the current code. However, the documentation for read_csv states

usecols : array-like, default None

which actually makes this use case plausible and correct according to the docs.
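To make the difference concrete, here is a small sketch of how a list and a multi-element NumPy array behave in a boolean context. The helper name is_truthy is made up for this illustration and is not part of pandas or NumPy.

```python
import numpy as np


def is_truthy(value):
    """Illustrative helper: return bool(value), or the error text if it raises."""
    try:
        return bool(value)
    except ValueError as exc:
        return str(exc)


list_cols = [1, 2, 3]
array_cols = np.arange(1, 4)

# For a list, truthiness is simply "is it non-empty?".
list_result = is_truthy(list_cols)
empty_result = is_truthy([])

# For an array with more than one element, bool() raises ValueError,
# which is what `if self.usecols:` runs into.
array_result = is_truthy(array_cols)

print(list_result, empty_result)
print(array_result)
```

So the same if statement is an emptiness check for a list but an error for the array passed in the report.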

I think we should either change the documentation to say list and add a strict type check, or inject something along the lines of

usecols = np.asarray(usecols, dtype=int).ravel().tolist()

into _parser_f of read_csv.
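A rough sketch of that normalization idea follows. The function name normalize_usecols is hypothetical, and this is not necessarily the change that ended up in pandas. One deliberate difference from the one-liner above: dtype=int is left out here (an assumption on my part), because forcing integers would raise for column names given as strings.

```python
import numpy as np


def normalize_usecols(usecols):
    """Hypothetical helper: convert any array-like usecols to a plain list.

    None is passed through unchanged so that "no column selection"
    stays distinguishable from an empty selection.
    """
    if usecols is None:
        return None
    # No dtype is forced, so both integer positions and string names survive.
    return np.asarray(usecols).ravel().tolist()


int_cols = normalize_usecols(np.arange(2, 5))
name_cols = normalize_usecols(np.array(["a", "b"]))
no_cols = normalize_usecols(None)

print(int_cols, name_cols, no_cols)
```

After such a conversion, a check like if usecols: is again a plain emptiness test on a list.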

@gfyoung
Member

gfyoung commented Aug 20, 2016

Here's how we can reproduce this:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> import numpy as np
>>> usecols = np.array(['a', 'b'])
>>> data = 'a,b,c\n1,2,3'
>>> read_csv(StringIO(data), usecols=usecols, engine='python')
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Note that the C engine is perfectly happy.
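On an affected version, a possible workaround (a sketch, not an official recommendation) is to convert the array to a plain list before calling read_csv. The data below is made up, extending the example above by two rows so that chunked reading, as in the original report, has something to iterate over. StringIO is imported from io here rather than pandas.compat.

```python
from io import StringIO

import numpy as np
import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6\n7,8,9"
usecols = np.array(["a", "b"])

# Passing a plain list avoids the array being evaluated in an if statement.
reader = pd.read_csv(
    StringIO(data),
    usecols=usecols.tolist(),
    engine="python",
    chunksize=2,
)

# Materialize the chunks; each one is a DataFrame with only columns a and b.
chunks = list(reader)
first = chunks[0]
print(first)
```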

@jreback jreback added this to the Next Major Release milestone Aug 20, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Aug 20, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0, Next Major Release Aug 21, 2016